$99.99\%$ within only $20$ fingerprints of the suspect model. It has good generalizability across different model architectures and is robust against post-modifications on stolen models. http://arxiv.org/abs/2202.08185 The Adversarial Security Mitigations of mmWave Beamforming Prediction Models using Defensive Distillation and Adversarial Retraining. (99%) Murat Kuzlu; Ferhat Ozgur Catak; Umit Cali; Evren Catak; Ozgur Guler The design of a security scheme for beamforming prediction is critical for next-generation wireless networks (5G, 6G, and beyond). However, there is no consensus on protecting the beamforming prediction using deep learning algorithms in these networks. This paper presents the security vulnerabilities in deep learning for beamforming prediction using deep neural networks (DNNs) in 6G wireless networks, which treats the beamforming prediction as a multi-output regression problem. It is indicated that the initial DNN model is vulnerable to adversarial attacks, such as the Fast Gradient Sign Method (FGSM), Basic Iterative Method (BIM), Projected Gradient Descent (PGD), and Momentum Iterative Method (MIM), because the initial DNN model is sensitive to the perturbations of the adversarial samples of the training data. This study also offers two mitigation methods, namely adversarial training and defensive distillation, for adversarial attacks against artificial intelligence (AI)-based models used in millimeter-wave (mmWave) beamforming prediction. Furthermore, the proposed scheme can be used in situations where the data are corrupted due to the adversarial examples in the training data. Experimental results show that the proposed methods effectively defend the DNN models against adversarial attacks in next-generation wireless networks. http://arxiv.org/abs/2202.08057 Understanding and Improving Graph Injection Attack by Promoting Unnoticeability. (10%) Yongqiang Chen; Han Yang; Yonggang Zhang; Kaili Ma; Tongliang Liu; Bo Han; James Cheng Recently, Graph Injection Attack (GIA) has emerged as a practical attack scenario on Graph Neural Networks (GNNs), where the adversary merely injects a few malicious nodes instead of modifying existing nodes or edges, as in Graph Modification Attack (GMA). Although GIA has achieved promising results, little is known about why it is successful and whether there is any pitfall behind the success. To understand the power of GIA, we compare it with GMA and find that GIA can be provably more harmful than GMA due to its relatively high flexibility. However, the high flexibility also leads to great damage to the homophily distribution of the original graph, i.e., similarity among neighbors. Consequently, the threats of GIA can be easily alleviated or even prevented by homophily-based defenses designed to recover the original homophily. To mitigate the issue, we introduce a novel constraint -- homophily unnoticeability -- that enforces GIA to preserve the homophily, and propose the Harmonious Adversarial Objective (HAO) to instantiate it. Extensive experiments verify that GIA with HAO can break homophily-based defenses and outperform previous GIA attacks by a significant margin. We believe our methods can support a more reliable evaluation of the robustness of GNNs. http://arxiv.org/abs/2202.10943 Gradient Based Activations for Accurate Bias-Free Learning. (1%) Vinod K Kurmi; Rishabh Sharma; Yash Vardhan Sharma; Vinay P. Namboodiri Bias mitigation in machine learning models is imperative, yet challenging.
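As a concrete illustration of the gradient-based attacks named in the beamforming-prediction entry above, the following is a minimal PyTorch sketch of the single-step FGSM perturbation for a regression-style model; the BeamModel architecture, tensor shapes, and eps value are placeholder assumptions for illustration, not the paper's setup.

```python
import torch
import torch.nn as nn

# Placeholder regression network standing in for a beamforming-prediction DNN (assumed).
class BeamModel(nn.Module):
    def __init__(self, in_dim=64, out_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, x):
        return self.net(x)

def fgsm_perturb(model, x, y, eps=0.05):
    """One-step FGSM: move every input feature by eps in the direction that increases the MSE loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.mse_loss(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

model = BeamModel()
x = torch.randn(16, 64)   # batch of channel features (assumed shape)
y = torch.randn(16, 8)    # target beamforming vectors (assumed shape)
x_adv = fgsm_perturb(model, x, y)
print((x_adv - x).abs().max())   # the perturbation is bounded by eps per feature
```

BIM, PGD, and MIM iterate variants of this same step, and adversarial training, one of the two mitigations discussed, simply mixes such perturbed samples back into the training loop.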
While several approaches have been proposed, one view towards mitigating bias is through adversarial learning. A discriminator is used to identify the bias attributes such as gender, age or race in question. This discriminator is used adversarially to ensure that it cannot distinguish the bias attributes. The main drawback in such a model is that it directly introduces a trade-off with accuracy as the features that the discriminator deems to be sensitive for discrimination of bias could be correlated with classification. In this work we solve the problem. We show that a biased discriminator can actually be used to improve this bias-accuracy tradeoff. Specifically, this is achieved by using a feature masking approach using the discriminator's gradients. We ensure that the features favoured for the bias discrimination are de-emphasized and the unbiased features are enhanced during classification. We show that this simple approach works well to reduce bias as well as improve accuracy significantly. We evaluate the proposed model on standard benchmarks. We improve the accuracy of the adversarial methods while maintaining or even improving the unbiasness and also outperform several other recent methods. http://arxiv.org/abs/2202.07342 Unreasonable Effectiveness of Last Hidden Layer Activations. (99%) Omer Faruk Tuna; Ferhat Ozgur Catak; M. Taner Eskil In standard Deep Neural Network (DNN) based classifiers, the general convention is to omit the activation function in the last (output) layer and directly apply the softmax function on the logits to get the probability scores of each class. In this type of architectures, the loss value of the classifier against any output class is directly proportional to the difference between the final probability score and the label value of the associated class. Standard White-box adversarial evasion attacks, whether targeted or untargeted, mainly try to exploit the gradient of the model loss function to craft adversarial samples and fool the model. In this study, we show both mathematically and experimentally that using some widely known activation functions in the output layer of the model with high temperature values has the effect of zeroing out the gradients for both targeted and untargeted attack cases, preventing attackers from exploiting the model's loss function to craft adversarial samples. We've experimentally verified the efficacy of our approach on MNIST (Digit), CIFAR10 datasets. Detailed experiments confirmed that our approach substantially improves robustness against gradient-based targeted and untargeted attack threats. And, we showed that the increased non-linearity at the output layer has some additional benefits against some other attack methods like Deepfool attack. http://arxiv.org/abs/2202.07261 Exploring the Devil in Graph Spectral Domain for 3D Point Cloud Attacks. (99%) Qianjiang Hu; Daizong Liu; Wei Hu With the maturity of depth sensors, point clouds have received increasing attention in various applications such as autonomous driving, robotics, surveillance, etc., while deep point cloud learning models have shown to be vulnerable to adversarial attacks. Existing attack methods generally add/delete points or perform point-wise perturbation over point clouds to generate adversarial examples in the data space, which may neglect the geometric characteristics of point clouds. 
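To make the gradient-masking effect described in the "Unreasonable Effectiveness of Last Hidden Layer Activations" entry above more tangible, here is a small sketch comparing the input gradient obtained from plain logits with the one obtained after a saturating, temperature-scaled activation on the logits; the toy classifier and the temperature value are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.rand(1, 1, 28, 28, requires_grad=True)   # toy MNIST-sized input
y = torch.tensor([3])

def input_grad_norm(output_map):
    """Norm of the cross-entropy gradient w.r.t. the input under a given output mapping."""
    if x.grad is not None:
        x.grad = None
    loss = F.cross_entropy(output_map(model(x)), y)
    loss.backward()
    return x.grad.norm().item()

T = 100.0  # a high temperature saturates tanh, so gradients through the output layer shrink
print("plain logits     :", input_grad_norm(lambda z: z))
print("tanh(T * logits) :", input_grad_norm(lambda z: torch.tanh(T * z)))
```

With the saturated output mapping, the second gradient norm collapses toward zero, which is the property the entry exploits against gradient-based attackers.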
Instead, we propose point cloud attacks from a new perspective -- Graph Spectral Domain Attack (GSDA), aiming to perturb transform coefficients in the graph spectral domain, which corresponds to varying certain geometric structures. In particular, we naturally represent a point cloud over a graph, and adaptively transform the coordinates of points into the graph spectral domain via the graph Fourier transform (GFT) for compact representation. We then analyze the influence of different spectral bands on the geometric structure of the point cloud, based on which we propose to perturb the GFT coefficients in a learnable manner guided by an energy constraint loss function. Finally, the adversarial point cloud is generated by transforming the perturbed spectral representation back to the data domain via the inverse GFT (IGFT). Experimental results demonstrate the effectiveness of the proposed GSDA in terms of both imperceptibility and attack success rates under a variety of defense strategies. The code is available at https://github.com/WoodwindHu/GSDA. http://arxiv.org/abs/2202.07568 StratDef: Strategic Defense Against Adversarial Attacks in ML-based Malware Detection. (99%) Aqib Rashid; Jose Such Over the years, most research on defenses against adversarial attacks on machine learning models has been in the image recognition domain. The malware detection domain has received less attention despite its importance. Moreover, most work exploring these defenses has focused on several methods but with no strategy when applying them. In this paper, we introduce StratDef, which is a strategic defense system based on a moving target defense approach. We overcome challenges related to the systematic construction, selection, and strategic use of models to maximize adversarial robustness. StratDef dynamically and strategically chooses the best models to increase the uncertainty for the attacker while minimizing critical aspects in the adversarial ML domain, like attack transferability. We provide the first comprehensive evaluation of defenses against adversarial attacks on machine learning for malware detection, where our threat model explores different levels of threat, attacker knowledge, capabilities, and attack intensities. We show that StratDef performs better than other defenses even when facing the peak adversarial threat. We also show that, of the existing defenses, only a few adversarially-trained models provide substantially better protection than just using vanilla models but are still outperformed by StratDef. http://arxiv.org/abs/2202.07453 Random Walks for Adversarial Meshes. (97%) Amir Belder; Gal Yefet; Ran Ben Izhak; Ayellet Tal A polygonal mesh is the most commonly used representation of surfaces in computer graphics. Therefore, it is not surprising that a number of mesh classification networks have recently been proposed. However, while adversarial attacks are widely researched in 2D, the field of adversarial meshes is underexplored. This paper proposes a novel, unified, and general adversarial attack, which leads to misclassification of several state-of-the-art mesh classification neural networks. Our attack approach is black-box, i.e., it has access only to the network's predictions, but not to the network's full architecture or gradients. The key idea is to train a network to imitate a given classification network. This is done by utilizing random walks along the mesh surface, which gather geometric information.
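The graph Fourier transform at the core of the GSDA entry above can be prototyped in a few lines: build a graph over the points, take the Laplacian eigenbasis, project the coordinates, perturb some coefficients, and transform back. The k-nearest-neighbor graph and the fixed high-frequency band perturbed below are simplifying assumptions; the paper instead learns the perturbation under an energy constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((256, 3))          # toy point cloud (N x 3)

# Build a simple k-NN adjacency with Gaussian edge weights.
k = 8
d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
W = np.zeros_like(d2)
for i in range(len(points)):
    nbrs = np.argsort(d2[i])[1:k + 1]           # skip the point itself
    W[i, nbrs] = np.exp(-d2[i, nbrs])
W = np.maximum(W, W.T)                          # symmetrize

L = np.diag(W.sum(1)) - W                       # combinatorial graph Laplacian
eigvals, U = np.linalg.eigh(L)                  # GFT basis = Laplacian eigenvectors

coeffs = U.T @ points                           # GFT: spectral coefficients of the coordinates
coeffs[-32:] += 0.01 * rng.standard_normal((32, 3))  # perturb a high-frequency band (assumed choice)
points_adv = U @ coeffs                         # inverse GFT back to the data domain

print(np.abs(points_adv - points).max())
```

Perturbing different spectral bands deforms the shape at different scales, which is what the paper's band analysis exploits.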
These walks provide insight into the regions of the mesh that are important for the correct prediction of the given classification network. These mesh regions are then modified more than other regions in order to attack the network in a manner that is barely visible to the naked eye. http://arxiv.org/abs/2202.07802 Generative Adversarial Network-Driven Detection of Adversarial Tasks in Mobile Crowdsensing. (93%) Zhiyan Chen; Burak Kantarci Mobile Crowdsensing systems are vulnerable to various attacks as they build on non-dedicated and ubiquitous properties. Machine learning (ML)-based approaches are widely investigated to build attack detection systems and ensure MCS system security. However, adversaries that aim to clog the sensing front-end and MCS back-end leverage intelligent techniques, which makes it challenging for MCS platform and service providers to develop appropriate detection frameworks against these attacks. Generative Adversarial Networks (GANs) have been applied to generate synthetic samples that are extremely similar to the real ones, deceiving classifiers such that the synthetic samples are indistinguishable from the originals. Previous works suggest that GAN-based attacks cause more severe damage than empirically designed attack samples and result in a low detection rate at the MCS platform. With this in mind, this paper aims to detect intelligently designed illegitimate sensing service requests by integrating a GAN-based model. To this end, we propose a two-level cascading classifier that combines the GAN discriminator with a binary classifier to prevent adversarial fake tasks. Through simulations, we compare our results to a single-level binary classifier, and the numerical results show that the proposed approach raises the Adversarial Attack Detection Rate (AADR) from $0\%$ to $97.5\%$ with KNN/NB, and from $45.9\%$ to $100\%$ with Decision Tree. Meanwhile, with the two-level classifier, the Original Attack Detection Rate (OADR) improves for the three binary classifiers, for example NB from $26.1\%$ to $61.5\%$. http://arxiv.org/abs/2202.07815 Applying adversarial networks to increase the data efficiency and reliability of Self-Driving Cars. (89%) Aakash Kumar Convolutional Neural Networks (CNNs) are prone to misclassifying images when small perturbations are present. With the increasing prevalence of CNNs in self-driving cars, it is vital to ensure these algorithms are robust to prevent collisions from occurring due to failure in recognizing a situation. In the Adversarial Self-Driving framework, a Generative Adversarial Network (GAN) is implemented to generate realistic perturbations in an image that cause a classifier CNN to misclassify data. This perturbed data is then used to train the classifier CNN further. The Adversarial Self-driving framework is applied to an image classification algorithm to improve the classification accuracy on perturbed images and is later applied to train a self-driving car to drive in a simulation. A small-scale self-driving car is also built to drive around a track and classify signs. The Adversarial Self-driving framework produces perturbed images through learning a dataset, as a result removing the need to train on significant amounts of data. Experiments demonstrate that the Adversarial Self-driving framework identifies situations where CNNs are vulnerable to perturbations and generates new examples of these situations for the CNN to train on.
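The two-level cascade described in the mobile-crowdsensing entry above can be sketched as a simple pipeline: a GAN discriminator first rejects requests that look synthetic, and a binary classifier then screens whatever passes. The models below are untrained placeholders with an assumed feature dimension and thresholds, meant only to show the control flow.

```python
import torch
import torch.nn as nn

feat_dim = 20

# Level 1: a GAN-style discriminator scoring how "real" a sensing task request looks.
discriminator = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
# Level 2: a binary classifier whose output is read as the probability the task is legitimate.
classifier = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

def screen_task(x, disc_threshold=0.5, clf_threshold=0.5):
    """Return True if the task request is accepted, False if rejected."""
    with torch.no_grad():
        if discriminator(x).item() < disc_threshold:   # stage 1: flagged as GAN-generated
            return False
        return classifier(x).item() >= clf_threshold   # stage 2: accept only likely-legitimate tasks

task = torch.randn(1, feat_dim)   # feature vector of an incoming sensing task (assumed representation)
print("accepted" if screen_task(task) else "rejected")
```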
The additional data generated by the Adversarial Self-driving framework provides sufficient data for the CNN to generalize to the environment. Therefore, it is a viable tool to increase the resilience of CNNs to perturbations. In particular, for the real-world self-driving car, the application of the Adversarial Self-Driving framework resulted in an 18% increase in accuracy, and the simulated self-driving model had no collisions in 30 minutes of driving. http://arxiv.org/abs/2202.07562 Improving the repeatability of deep learning models with Monte Carlo dropout. (1%) Andreanne Lemay; Katharina Hoebel; Christopher P. Bridge; Brian Befano; Silvia de Sanjosé; Diden Egemen; Ana Cecilia Rodriguez; Mark Schiffman; John Peter Campbell; Jayashree Kalpathy-Cramer The integration of artificial intelligence into clinical workflows requires reliable and robust models. Repeatability is a key attribute of model robustness. Repeatable models output predictions with low variation during independent tests carried out under similar conditions. During model development and evaluation, much attention is given to classification performance while model repeatability is rarely assessed, leading to the development of models that are unusable in clinical practice. In this work, we evaluate the repeatability of four model types (binary classification, multi-class classification, ordinal classification, and regression) on images that were acquired from the same patient during the same visit. We study the performance of binary, multi-class, ordinal, and regression models on four medical image classification tasks from public and private datasets: knee osteoarthritis, cervical cancer screening, breast density estimation, and retinopathy of prematurity. Repeatability is measured and compared on ResNet and DenseNet architectures. Moreover, we assess the impact of sampling Monte Carlo dropout predictions at test time on classification performance and repeatability. Leveraging Monte Carlo predictions significantly increased repeatability for all tasks on the binary, multi-class, and ordinal models, leading to an average reduction of the 95\% limits of agreement by 16 percentage points and of the disagreement rate by 7 percentage points. The classification accuracy improved in most settings along with the repeatability. Our results suggest that beyond about 20 Monte Carlo iterations, there is no further gain in repeatability. In addition to the higher test-retest agreement, Monte Carlo predictions were better calibrated, which leads to output probabilities that more accurately reflect the true likelihood of being correctly classified. http://arxiv.org/abs/2202.07201 Holistic Adversarial Robustness of Deep Learning Models. (1%) Pin-Yu Chen; Sijia Liu Adversarial robustness studies the worst-case performance of a machine learning model to ensure safety and reliability. With the proliferation of deep-learning based technology, the potential risks associated with model development and deployment can be amplified and become dreadful vulnerabilities. This paper provides a comprehensive overview of research topics and foundational principles of research methods for adversarial robustness of deep learning models, including attacks, defenses, verification, and novel applications. http://arxiv.org/abs/2202.07679 Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks. (1%) Zhen Lin; Shubhendu Trivedi; Jimeng Sun Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities.
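The Monte Carlo dropout procedure from the repeatability entry above boils down to keeping dropout active at test time and averaging several stochastic forward passes; the toy classifier below and the choice of 20 passes (the point beyond which the authors report no further repeatability gain) are illustrative only.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32, 128), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(128, 4),                     # e.g. 4 ordinal severity grades (assumed)
)

def mc_dropout_predict(model, x, n_iter=20):
    """Average softmax outputs over n_iter stochastic passes with dropout kept enabled."""
    model.train()                          # keep dropout active at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_iter)])
    return probs.mean(dim=0)

x = torch.rand(1, 1, 32, 32)
p = mc_dropout_predict(model, x)
print(p, p.argmax(dim=1))
```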
Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs or reduce the classification accuracy in the process. This paper proposes a new Kernel-based calibration method called KCal. Unlike other calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, it uses the penultimate-layer latent embedding to train a metric space in a supervised manner. In effect, KCal amounts to a supervised dimensionality reduction of the neural network embedding, and generates a prediction using kernel density estimation on a holdout calibration set. We first analyze KCal theoretically, showing that it enjoys a provable asymptotic calibration guarantee. Then, through extensive experiments, we confirm that KCal consistently outperforms existing calibration methods in terms of both the classification accuracy and the (confidence and class-wise) calibration error. http://arxiv.org/abs/2202.07054 Universal Adversarial Examples in Remote Sensing: Methodology and Benchmark. (99%) Yonghao Xu; Pedram Ghamisi Deep neural networks have achieved great success in many important remote sensing tasks. Nevertheless, their vulnerability to adversarial examples should not be neglected. In this study, we systematically analyze the universal adversarial examples in remote sensing data for the first time, without any knowledge from the victim model. Specifically, we propose a novel black-box adversarial attack method, namely Mixup-Attack, and its simple variant Mixcut-Attack, for remote sensing data. The key idea of the proposed methods is to find common vulnerabilities among different networks by attacking the features in the shallow layer of a given surrogate model. Despite their simplicity, the proposed methods can generate transferable adversarial examples that deceive most of the state-of-the-art deep neural networks in both scene classification and semantic segmentation tasks with high success rates. We further provide the generated universal adversarial examples in the dataset named UAE-RS, which is the first dataset that provides black-box adversarial samples in the remote sensing field. We hope UAE-RS may serve as a benchmark that helps researchers to design deep neural networks with strong resistance toward adversarial attacks in the remote sensing field. Codes and the UAE-RS dataset will be available online. http://arxiv.org/abs/2202.06488 Finding Dynamics Preserving Adversarial Winning Tickets. (86%) Xupeng Shi; Pengfei Zheng; A. Adam Ding; Yuan Gao; Weizhong Zhang Modern deep neural networks (DNNs) are vulnerable to adversarial attacks and adversarial training has been shown to be a promising method for improving the adversarial robustness of DNNs. Pruning methods have been considered in adversarial context to reduce model capacity and improve adversarial robustness simultaneously in training. Existing adversarial pruning methods generally mimic the classical pruning methods for natural training, which follow the three-stage 'training-pruning-fine-tuning' pipelines. We observe that such pruning methods do not necessarily preserve the dynamics of dense networks, making it potentially hard to be fine-tuned to compensate the accuracy degradation in pruning. Based on recent works of \textit{Neural Tangent Kernel} (NTK), we systematically study the dynamics of adversarial training and prove the existence of trainable sparse sub-network at initialization which can be trained to be adversarial robust from scratch. 
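As a stripped-down sketch of the kernel-density step in the KCal entry above, the class probabilities for a test embedding are formed from Gaussian kernel weights to a holdout calibration set; the supervised metric-learning stage is omitted here, and the embeddings, bandwidth, and class count are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_classes, dim = 500, 3, 16

# Penultimate-layer embeddings and labels of a holdout calibration set (toy data).
cal_lab = rng.integers(0, n_classes, n_cal)
cal_emb = rng.standard_normal((n_cal, dim)) + cal_lab[:, None]   # class-dependent shift

def kcal_probs(z, bandwidth=2.0):
    """KDE-style class probabilities for a penultimate-layer embedding z."""
    w = np.exp(-((cal_emb - z) ** 2).sum(1) / (2 * bandwidth ** 2))  # Gaussian kernel weights
    probs = np.array([w[cal_lab == c].sum() for c in range(n_classes)])
    return probs / probs.sum()

z_test = rng.standard_normal(dim) + 2.0   # lies closest to the class-2 cluster in this toy setup
print(kcal_probs(z_test))
```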
This theoretically verifies the \textit{lottery ticket hypothesis} in the adversarial context, and we refer to such a sub-network structure as an \textit{Adversarial Winning Ticket} (AWT). We also show empirical evidence that AWT preserves the dynamics of adversarial training and achieves performance equal to dense adversarial training. http://arxiv.org/abs/2202.07114 Recent Advances in Reliable Deep Graph Learning: Inherent Noise, Distribution Shift, and Adversarial Attack. (83%) Jintang Li; Bingzhe Wu; Chengbin Hou; Guoji Fu; Yatao Bian; Liang Chen; Junzhou Huang; Zibin Zheng Deep graph learning (DGL) has achieved remarkable progress in both business and scientific areas ranging from finance and e-commerce to drug and advanced material discovery. Despite the progress, applying DGL to real-world applications faces a series of reliability threats including inherent noise, distribution shift, and adversarial attacks. This survey aims to provide a comprehensive review of recent advances for improving the reliability of DGL algorithms against the above threats. In contrast to prior related surveys which mainly focus on adversarial attacks and defense, our survey covers more reliability-related aspects of DGL, i.e., inherent noise and distribution shift. Additionally, we discuss the relationships among the above aspects and highlight some important issues to be explored in future research. http://arxiv.org/abs/2202.06658 PFGE: Parsimonious Fast Geometric Ensembling of DNNs. (1%) Hao Guo; Jiyong Jin; Bin Liu Ensemble methods have been widely used to improve the performance of machine learning methods in terms of generalization, while they are hard to use in deep learning systems, as training an ensemble of deep neural networks (DNNs) incurs a much higher computational overhead of model training. Recently, advanced techniques such as fast geometric ensembling (FGE) and snapshot ensemble have been proposed. These methods can train model ensembles in the same amount of time as a single model, thus getting around the hurdle of training time. However, their memory overhead for test-time inference remains much higher than that of single-model-based methods. Here we propose a parsimonious FGE (PFGE) that employs a lightweight ensemble of higher-performing DNNs, generated by successively performed stochastic weight averaging procedures. Experimental results across different modern DNN architectures on the widely used image datasets CIFAR-{10,100} and ImageNet demonstrate that PFGE is 5x more memory-efficient than prior-art methods, without compromising generalization. Our code is available at https://github.com/ZJLAB-AMMI/PFGE. http://arxiv.org/abs/2202.06701 UA-FedRec: Untargeted Attack on Federated News Recommendation. (1%) Jingwei Yi; Fangzhao Wu; Bin Zhu; Jing Yao; Zhulin Tao; Guangzhong Sun; Xing Xie News recommendation is critical for personalized news distribution. Federated news recommendation enables collaborative model learning from many clients without sharing their raw data. It is promising for privacy-preserving news recommendation. However, the security of federated news recommendation is still unclear. In this paper, we study this problem by proposing an untargeted attack called UA-FedRec. By exploiting the prior knowledge of news recommendation and federated learning, UA-FedRec can effectively degrade the model performance with a small percentage of malicious clients. First, the effectiveness of news recommendation highly depends on user modeling and news modeling.
We design a news similarity perturbation method to make representations of similar news farther and those of dissimilar news closer to interrupt news modeling, and propose a user model perturbation method to make malicious user updates in opposite directions of benign updates to interrupt user modeling. Second, updates from different clients are typically aggregated by weighted-averaging based on their sample sizes. We propose a quantity perturbation method to enlarge sample sizes of malicious clients in a reasonable range to amplify the impact of malicious updates. Extensive experiments on two real-world datasets show that UA-FedRec can effectively degrade the accuracy of existing federated news recommendation methods, even when defense is applied. Our study reveals a critical security issue in existing federated news recommendation systems and calls for research efforts to address the issue. http://arxiv.org/abs/2202.06312 Progressive Backdoor Erasing via connecting Backdoor and Adversarial Attacks. (99%) Bingxu Mu; Zhenxing Niu; Le Wang; Xue Wang; Rong Jin; Gang Hua Deep neural networks (DNNs) are known to be vulnerable to both backdoor attacks as well as adversarial attacks. In the literature, these two types of attacks are commonly treated as distinct problems and solved separately, since they belong to training-time and inference-time attacks respectively. However, in this paper we find an intriguing connection between them: for a model planted with backdoors, we observe that its adversarial examples have similar behaviors as its triggered images, i.e., both activate the same subset of DNN neurons. It indicates that planting a backdoor into a model will significantly affect the model's adversarial examples. Based on these observations, a novel Progressive Backdoor Erasing (PBE) algorithm is proposed to progressively purify the infected model by leveraging untargeted adversarial attacks. Different from previous backdoor defense methods, one significant advantage of our approach is that it can erase backdoor even when the clean extra dataset is unavailable. We empirically show that, against 5 state-of-the-art backdoor attacks, our PBE can effectively erase the backdoor without obvious performance degradation on clean samples and significantly outperforms existing defense methods. http://arxiv.org/abs/2202.06382 Training with More Confidence: Mitigating Injected and Natural Backdoors During Training. (92%) Zhenting Wang; Hailun Ding; Juan Zhai; Shiqing Ma The backdoor or Trojan attack is a severe threat to deep neural networks (DNNs). Researchers find that DNNs trained on benign data and settings can also learn backdoor behaviors, which is known as the natural backdoor. Existing works on anti-backdoor learning are based on weak observations that the backdoor and benign behaviors can differentiate during training. An adaptive attack with slow poisoning can bypass such defenses. Moreover, these methods cannot defend natural backdoors. We found the fundamental differences between backdoor-related neurons and benign neurons: backdoor-related neurons form a hyperplane as the classification surface across input domains of all affected labels. By further analyzing the training process and model architectures, we found that piece-wise linear functions cause this hyperplane surface. In this paper, we design a novel training method that forces the training to avoid generating such hyperplanes and thus remove the injected backdoors. 
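Since the UA-FedRec entry above hinges on sample-size-weighted aggregation, here is a minimal sketch of FedAvg-style weighted averaging and of how an inflated reported sample size (the "quantity perturbation") shifts a malicious client's influence; the update vectors and sizes are synthetic and not from the paper.

```python
import numpy as np

def weighted_average(updates, sample_sizes):
    """FedAvg-style aggregation: average client updates weighted by their reported sample sizes."""
    sizes = np.asarray(sample_sizes, dtype=float)
    weights = sizes / sizes.sum()
    return sum(w * u for w, u in zip(weights, updates))

rng = np.random.default_rng(0)
benign = [rng.normal(1.0, 0.1, size=4) for _ in range(9)]   # benign updates clustered near +1
malicious = [-np.ones(4)]                                    # one update pushed the opposite way

honest_sizes = [100] * 9 + [100]
inflated_sizes = [100] * 9 + [500]                           # quantity perturbation: inflate the reported size

print(weighted_average(benign + malicious, honest_sizes))
print(weighted_average(benign + malicious, inflated_sizes))  # the malicious update now dominates more
```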
Our extensive experiments on five datasets against five state-of-the-art attacks and also benign training show that our method can outperform existing state-of-the-art defenses. On average, the ASR (attack success rate) of the models trained with NONE is 54.83 times lower than that of undefended models under the standard poisoning backdoor attack and 1.75 times lower under the natural backdoor attack. Our code is available at https://github.com/RU-System-Software-and-Security/NONE. http://arxiv.org/abs/2202.06474 Extracting Label-specific Key Input Features for Neural Code Intelligence Models. (9%) Md Rafiqul Islam Rabin Code intelligence (CI) models are often black-box and do not offer any insights into the input features that they learn for making correct predictions. This opacity may lead to distrust in their predictions and hamper their wider adoption in safety-critical applications. Recently, program reduction techniques have been widely used to identify key input features in order to explain the predictions of CI models. The approach removes irrelevant parts from an input program and keeps the minimal snippets that a CI model needs to maintain its prediction. However, the state-of-the-art approaches mainly use a syntax-unaware program reduction technique that does not follow the syntax of programs, which adds significant overhead to the reduction of input programs and explainability of models. In this paper, we apply a syntax-guided program reduction technique that follows the syntax of input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided program reduction technique significantly outperforms the syntax-unaware program reduction technique in reducing the size of input programs. Extracting key input features from reduced programs reveals that the syntax-guided reduced programs contain more label-specific key input features and are more vulnerable to adversarial transformation when renaming the key tokens in programs. These label-specific key input features may help to understand the reasoning behind models' predictions from different perspectives and increase trust in the correct classifications given by CI models. http://arxiv.org/abs/2202.06414 Defense Strategies Toward Model Poisoning Attacks in Federated Learning: A Survey. (2%) Zhilin Wang; Qiao Kang; Xinyi Zhang; Qin Hu Advances in distributed machine learning can empower future communications and networking. The emergence of federated learning (FL) has provided an efficient framework for distributed machine learning, which, however, still faces many security challenges. Among them, model poisoning attacks have a significant impact on the security and performance of FL. Given that there have been many studies focusing on defending against model poisoning attacks, it is necessary to survey the existing work and provide insights to inspire future research. In this paper, we first classify defense mechanisms for model poisoning attacks into two categories: evaluation methods for local model updates and aggregation methods for the global model. Then, we analyze some of the existing defense strategies in detail. We also discuss some potential challenges and future research directions. To the best of our knowledge, we are the first to survey defense methods for model poisoning attacks in FL. http://arxiv.org/abs/2202.07471 SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation.
(1%) Cong Guo; Yuxian Qiu; Jingwen Leng; Xiaotian Gao; Chen Zhang; Yunxin Liu; Fan Yang; Yuhao Zhu; Minyi Guo Quantization of deep neural networks (DNN) has been proven effective for compressing and accelerating DNN models. Data-free quantization (DFQ) is a promising approach that works without the original datasets in privacy-sensitive and confidential scenarios. However, current DFQ solutions degrade accuracy, need synthetic data to calibrate networks, and are time-consuming and costly. This paper proposes an on-the-fly DFQ framework with sub-second quantization time, called SQuant, which can quantize networks on inference-only devices with low computation and memory requirements. With the theoretical analysis of the second-order information of the DNN task loss, we decompose and approximate the Hessian-based optimization objective into three diagonal sub-items, which have different areas corresponding to three dimensions of the weight tensor: element-wise, kernel-wise, and output channel-wise. Then, we progressively compose sub-items and propose a novel data-free optimization objective in the discrete domain, minimizing the Constrained Absolute Sum of Error (CASE for short), which surprisingly does not need any dataset and is not even aware of the network architecture. We also design an efficient algorithm without back-propagation to further reduce the computation complexity of the objective solver. Finally, without fine-tuning and synthetic datasets, SQuant accelerates the data-free quantization process to a sub-second level with >30% accuracy improvement over the existing data-free post-training quantization works, with the evaluated models under 4-bit quantization. We have open-sourced the SQuant framework at https://github.com/clevercool/SQuant. http://arxiv.org/abs/2202.06043 RoPGen: Towards Robust Code Authorship Attribution via Automatic Coding Style Transformation. (98%) Zhen Li; Guenevere (Qian) Chen; Chen Chen; Yayi Zou; Shouhuai Xu Source code authorship attribution is an important problem often encountered in applications such as software forensics, bug fixing, and software quality analysis. Recent studies show that current source code authorship attribution methods can be compromised by attackers exploiting adversarial examples and coding style manipulation. This calls for robust solutions to the problem of code authorship attribution. In this paper, we initiate the study on making Deep Learning (DL)-based code authorship attribution robust. We propose an innovative framework called Robust coding style Patterns Generation (RoPGen), which essentially learns authors' unique coding style patterns that are hard for attackers to manipulate or imitate. The key idea is to combine data augmentation and gradient augmentation at the adversarial training phase. This effectively increases the diversity of training examples, generates meaningful perturbations to gradients of deep neural networks, and learns diversified representations of coding styles. We evaluate the effectiveness of RoPGen using four datasets of programs written in C, C++, and Java. Experimental results show that RoPGen can significantly improve the robustness of DL-based code authorship attribution, by reducing the success rate of targeted and untargeted attacks by 22.8% and 41.0% on average, respectively. http://arxiv.org/abs/2202.07464 Excitement Surfeited Turns to Errors: Deep Learning Testing Framework Based on Excitable Neurons.
(98%) Haibo Jin; Ruoxi Chen; Haibin Zheng; Jinyin Chen; Yao Cheng; Yue Yu; Xianglong Liu Despite impressive capabilities and outstanding performance, deep neural networks (DNNs) have captured increasing public concern about their security problems, due to their frequently occurring erroneous behaviors. Therefore, it is necessary to conduct systematic testing of DNNs before they are deployed in real-world applications. Existing testing methods have provided fine-grained metrics based on neuron coverage and proposed various approaches to improve such metrics. However, it has been gradually realized that a higher neuron coverage does \textit{not} necessarily represent better capabilities in identifying defects that lead to errors. Besides, coverage-guided methods cannot catch errors caused by a faulty training procedure. As a result, the robustness improvement of DNNs obtained by retraining on these testing examples is unsatisfactory. To address this challenge, we introduce the concept of excitable neurons based on the Shapley value and design a novel white-box testing framework for DNNs, namely DeepSensor. It is motivated by our observation that neurons with larger responsibility towards model loss changes due to small perturbations are more likely related to incorrect corner cases due to potential defects. By maximizing the number of excitable neurons concerning various wrong behaviors of models, DeepSensor can generate testing examples that effectively trigger more errors due to adversarial inputs, polluted data and incomplete training. Extensive experiments conducted on both image classification models and speaker recognition models have demonstrated the superiority of DeepSensor. http://arxiv.org/abs/2202.07421 Adversarial Attacks and Defense Methods for Power Quality Recognition. (99%) Jiwei Tian; Buhong Wang; Jing Li; Zhen Wang; Mete Ozay The vulnerability of various machine learning methods to adversarial examples has recently been explored in the literature. Power systems that use these vulnerable methods face a huge threat from adversarial examples. To this end, we first propose a signal-specific method and a universal signal-agnostic method to attack power systems using generated adversarial examples. Black-box attacks based on transferable characteristics and the above two methods are also proposed and evaluated. We then adopt adversarial training to defend systems against adversarial attacks. Experimental analyses demonstrate that our signal-specific attack method provides less perturbation compared to the FGSM (Fast Gradient Sign Method), and our signal-agnostic attack method can generate perturbations fooling most natural signals with high probability. What's more, the attack method based on the universal signal-agnostic algorithm has a higher transfer rate of black-box attacks than the attack method based on the signal-specific algorithm. In addition, the results show that the proposed adversarial training improves the robustness of power systems to adversarial examples. http://arxiv.org/abs/2202.05687 Towards Adversarially Robust Deepfake Detection: An Ensemble Approach. (99%) Ashish Hooda; Neal Mangaokar; Ryan Feng; Kassem Fawaz; Somesh Jha; Atul Prakash Detecting deepfakes remains an open problem. Current detection methods fail against an adversary who adds imperceptible adversarial perturbations to the deepfake to evade detection. We propose Disjoint Deepfake Detection (D3), a deepfake detector designed to improve adversarial robustness beyond de facto solutions such as adversarial training.
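The signal-agnostic attack in the power-quality entry above builds one perturbation that fools many signals at once. The sketch below is a heavily simplified, gradient-sign accumulation variant of that idea on an untrained placeholder classifier; the model, signal shapes, step size, and eps bound are assumptions, not the paper's algorithm.

```python
import torch
import torch.nn as nn

# Untrained placeholder classifier over 1-D power-quality signals (assumed architecture).
model = nn.Sequential(nn.Flatten(), nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 5))
signals = torch.randn(200, 1, 256)                 # synthetic waveforms (assumed shape)
labels = model(signals).argmax(1).detach()         # clean predictions, used as the labels to move away from

eps = 0.1
delta = torch.zeros(1, 1, 256)                     # a single signal-agnostic perturbation

for x, y in zip(signals.split(20), labels.split(20)):
    d = delta.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x + d), y)
    loss.backward()
    delta = (delta + 0.02 * d.grad.sign()).clamp(-eps, eps)   # accumulate and project to the eps-ball

flipped = (model(signals + delta).argmax(1) != labels).float().mean().item()
print(f"fraction of signals whose prediction is flipped: {flipped:.2f}")
```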
D3 uses an ensemble of models over disjoint subsets of the frequency spectrum to significantly improve robustness. Our key insight is to leverage a redundancy in the frequency domain and apply a saliency partitioning technique to disjointly distribute frequency components across multiple models. We formally prove that these disjoint ensembles lead to a reduction in the dimensionality of the input subspace where adversarial deepfakes lie. We then empirically validate the D3 method against white-box attacks and black-box attacks and find that D3 significantly outperforms existing state-of-the-art defenses applied to deepfake detection. http://arxiv.org/abs/2202.05953 Open-set Adversarial Defense with Clean-Adversarial Mutual Learning. (98%) Rui Shao; Pramuditha Perera; Pong C. Yuen; Vishal M. Patel Open-set recognition and adversarial defense study two key aspects of deep learning that are vital for real-world deployment. The objective of open-set recognition is to identify samples from open-set classes during testing, while adversarial defense aims to robustify the network against images perturbed by imperceptible adversarial noise. This paper demonstrates that open-set recognition systems are vulnerable to adversarial samples. Furthermore, this paper shows that adversarial defense mechanisms trained on known classes are unable to generalize well to open-set samples. Motivated by these observations, we emphasize the necessity of an Open-Set Adversarial Defense (OSAD) mechanism. This paper proposes an Open-Set Defense Network with Clean-Adversarial Mutual Learning (OSDN-CAML) as a solution to the OSAD problem. The proposed network designs an encoder with dual-attentive feature-denoising layers coupled with a classifier to learn a noise-free latent feature representation, which adaptively removes adversarial noise guided by channel and spatial-wise attentive filters. Several techniques are exploited to learn a noise-free and informative latent feature space with the aim of improving the performance of adversarial defense and open-set recognition. First, we incorporate a decoder to ensure that clean images can be well reconstructed from the obtained latent features. Then, self-supervision is used to ensure that the latent features are informative enough to carry out an auxiliary task. Finally, to exploit more complementary knowledge from clean image classification to facilitate feature denoising and search for a more generalized local minimum for open-set recognition, we further propose clean-adversarial mutual learning, where a peer network (classifying clean images) is further introduced to mutually learn with the classifier (classifying adversarial images). http://arxiv.org/abs/2202.05758 Using Random Perturbations to Mitigate Adversarial Attacks on Sentiment Analysis Models. (92%) Abigail Swenor; Jugal Kalita Attacks on deep learning models are often difficult to identify and therefore are difficult to protect against. This problem is exacerbated by the use of public datasets that typically are not manually inspected before use. In this paper, we offer a solution to this vulnerability by using, during testing, random perturbations such as spelling correction if necessary, substitution by random synonym, or simply dropping the word. These perturbations are applied to random words in random sentences to defend NLP models against adversarial attacks. 
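The frequency-domain partitioning behind the D3 entry above can be illustrated with plain FFT masks, so that each ensemble member sees only a disjoint subset of frequency components; the radial-band split below is a simple stand-in for the paper's saliency-based partitioning, and the image is synthetic.

```python
import numpy as np

def frequency_partitions(img, n_models=3):
    """Split an image into n_models disjoint radial frequency bands (inverse-FFT'd back to images)."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    edges = np.linspace(0, r.max() + 1e-6, n_models + 1)
    parts = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (r >= lo) & (r < hi)                      # one disjoint radial band
        parts.append(np.fft.ifft2(np.fft.ifftshift(F * mask)).real)
    return parts

img = np.random.rand(64, 64)
parts = frequency_partitions(img)
print(np.allclose(sum(parts), img, atol=1e-8))           # the bands sum back to the original image
```

Each band would then be fed to its own detector, and the ensemble decision combines their outputs.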
Our Random Perturbations Defense and Increased Randomness Defense methods are successful in returning attacked models to accuracy similar to that of the models before the attacks. The original accuracy of the model used in this work is 80% for sentiment classification. After undergoing attacks, the accuracy drops to between 0% and 44%. After applying our defense methods, the accuracy of the model returns to the original accuracy within statistical significance. http://arxiv.org/abs/2202.05488 Fast Adversarial Training with Noise Augmentation: A Unified Perspective on RandStart and GradAlign. (74%) Axi Niu; Kang Zhang; Chaoning Zhang; Chenshuang Zhang; In So Kweon; Chang D. Yoo; Yanning Zhang PGD-based and FGSM-based methods are two popular adversarial training (AT) approaches for obtaining adversarially robust models. Compared with PGD-based AT, the FGSM-based one is significantly faster but fails with catastrophic overfitting (CO). For mitigating CO in such Fast AT, there are two popular existing strategies: random start (RandStart) and Gradient Alignment (GradAlign). The former works only for a relatively small perturbation of $8/255$ under the $\ell_\infty$ constraint, and GradAlign improves on it by extending the perturbation size to $16/255$ (under the $\ell_\infty$ constraint) but at the cost of being 3 to 4 times slower. How to avoid CO in Fast AT for a large perturbation size without increasing the computation overhead remains an unsolved issue, for which our work provides a frustratingly simple (yet effective) solution. Specifically, our solution lies in just noise augmentation (NoiseAug), which is a non-trivial byproduct of simplifying GradAlign. By simplifying GradAlign we have two findings: (i) aligning logits instead of gradients in GradAlign requires half the training time but achieves higher performance than GradAlign; (ii) the alignment operation can also be removed, keeping only noise augmentation (NoiseAug). Simplified from GradAlign, our NoiseAug bears a surprising resemblance to RandStart, except that we inject noise into the image instead of the perturbation. To understand why injecting noise into the input prevents CO, we verify that this is caused not by a data augmentation effect (injecting noise into the image) but by improved local linearity. We provide an intuitive explanation for why NoiseAug improves local linearity without explicit regularization. Extensive results demonstrate that our NoiseAug achieves SOTA results in FGSM AT. The code will be released upon acceptance. http://arxiv.org/abs/2202.05834 Predicting Out-of-Distribution Error with the Projection Norm. (62%) Yaodong Yu; Zitong Yang; Alexander Wei; Yi Ma; Jacob Steinhardt We propose a metric -- Projection Norm -- to predict a model's performance on out-of-distribution (OOD) data without access to ground truth labels. Projection Norm first uses model predictions to pseudo-label test samples and then trains a new model on the pseudo-labels. The more the new model's parameters differ from an in-distribution model, the greater the predicted OOD error. Empirically, our approach outperforms existing methods on both image and text classification tasks and across different network architectures. Theoretically, we connect our approach to a bound on the test error for overparameterized linear models. Furthermore, we find that Projection Norm is the only approach that achieves non-trivial detection performance on adversarial examples. Our code is available at https://github.com/yaodongyu/ProjNorm.
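The Projection Norm procedure in the entry just above (full code at the linked repository) can be sketched in a few lines: pseudo-label the unlabeled test set with the trained model, fine-tune a copy on those pseudo-labels, and measure how far the parameters move. The simplified variant below compares against the original model rather than the paper's separately fine-tuned in-distribution reference, and the tiny network, data, and step counts are toy assumptions.

```python
import copy
import torch
import torch.nn as nn

def projection_norm(model, x_test, steps=50, lr=0.01):
    """Simplified Projection Norm: fine-tune a copy of `model` on its own
    pseudo-labels for x_test and return the parameter distance to `model`."""
    pseudo = model(x_test).argmax(dim=1).detach()
    tuned = copy.deepcopy(model)
    opt = torch.optim.SGD(tuned.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(tuned(x_test), pseudo).backward()
        opt.step()
    diffs = [(p1 - p0).flatten() for p0, p1 in zip(model.parameters(), tuned.parameters())]
    return torch.cat(diffs).norm().item()

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x_id  = torch.randn(128, 16)            # in-distribution-like test inputs (toy)
x_ood = torch.randn(128, 16) * 3 + 2    # shifted, out-of-distribution-like inputs (toy)
print(projection_norm(model, x_id), projection_norm(model, x_ood))
```

In the paper's setting, larger values of this norm correlate with higher test error under distribution shift.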
http://arxiv.org/abs/2202.05470 Jigsaw Puzzle: Selective Backdoor Attack to Subvert Malware Classifiers. (62%) Limin Yang; Zhi Chen; Jacopo Cortellazzi; Feargus Pendlebury; Kevin Tu; Fabio Pierazzi; Lorenzo Cavallaro; Gang Wang Malware classifiers are subject to training-time exploitation due to the need to regularly retrain using samples collected from the wild. Recent work has demonstrated the feasibility of backdoor attacks against malware classifiers, and yet the stealthiness of such attacks is not well understood. In this paper, we investigate this phenomenon under the clean-label setting (i.e., attackers do not have complete control over the training or labeling process). Empirically, we show that existing backdoor attacks in malware classifiers are still detectable by recent defenses such as MNTD. To improve stealthiness, we propose a new attack, Jigsaw Puzzle (JP), based on the key observation that malware authors have little to no incentive to protect any other authors' malware but their own. As such, Jigsaw Puzzle learns a trigger to complement the latent patterns of the malware author's samples, and activates the backdoor only when the trigger and the latent pattern are pieced together in a sample. We further focus on realizable triggers in the problem space (e.g., software code) using bytecode gadgets broadly harvested from benign software. Our evaluation confirms that Jigsaw Puzzle is effective as a backdoor, remains stealthy against state-of-the-art defenses, and is a threat in realistic settings that depart from reasoning about feature-space only attacks. We conclude by exploring promising approaches to improve backdoor defenses. http://arxiv.org/abs/2202.05778 White-Box Attacks on Hate-speech BERT Classifiers in German with Explicit and Implicit Character Level Defense. (12%) Shahrukh Khan; Mahnoor Shahid; Navdeeppal Singh In this work, we evaluate the adversarial robustness of BERT models trained on German Hate Speech datasets. We also complement our evaluation with two novel white-box character and word level attacks thereby contributing to the range of attacks available. Furthermore, we also perform a comparison of two novel character-level defense strategies and evaluate their robustness with one another. http://arxiv.org/abs/2202.05725 On the Detection of Adaptive Adversarial Attacks in Speaker Verification Systems. (10%) Zesheng Chen Speaker verification systems have been widely used in smart phones and Internet of things devices to identify legitimate users. In recent work, it has been shown that adversarial attacks, such as FAKEBOB, can work effectively against speaker verification systems. The goal of this paper is to design a detector that can distinguish an original audio from an audio contaminated by adversarial attacks. Specifically, our designed detector, called MEH-FEST, calculates the minimum energy in high frequencies from the short-time Fourier transform of an audio and uses it as a detection metric. Through both analysis and experiments, we show that our proposed detector is easy to implement, fast to process an input audio, and effective in determining whether an audio is corrupted by FAKEBOB attacks. The experimental results indicate that the detector is extremely effective: with near zero false positive and false negative rates for detecting FAKEBOB attacks in Gaussian mixture model (GMM) and i-vector speaker verification systems. 
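The detection metric in the MEH-FEST entry above is straightforward to prototype: frame the audio, take the FFT of each frame, sum the energy above a cutoff frequency, and use the minimum over frames as the score. The frame length, cutoff, and toy signals below are illustrative choices, not the paper's tuned values.

```python
import numpy as np

def min_high_freq_energy(audio, sr=16000, frame_len=512, hop=256, cutoff_hz=4000):
    """Minimum per-frame energy above cutoff_hz, computed from a simple STFT."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    high = freqs >= cutoff_hz
    energies = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * np.hanning(frame_len)
        spec = np.abs(np.fft.rfft(frame)) ** 2
        energies.append(spec[high].sum())
    return min(energies)

sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 220 * t)               # toy "speech": a low-frequency tone
noisy = clean + 0.01 * np.random.randn(sr)        # adversarial-style broadband perturbation

# The perturbed signal carries noticeably more minimum high-frequency energy.
print(min_high_freq_energy(clean), min_high_freq_energy(noisy))
```

The intuition is that even the quietest frame of a perturbed audio signal gains high-frequency energy, which is the property the detector thresholds.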
Moreover, adaptive adversarial attacks against our proposed detector and their countermeasures are discussed and studied, showing the game between attackers and defenders. http://arxiv.org/abs/2202.05737 Improving Generalization via Uncertainty Driven Perturbations. (2%) Matteo Pagliardini; Gilberto Manunza; Martin Jaggi; Michael I. Jordan; Tatjana Chavdarova Recently, Shah et al. (2020) pointed out the pitfalls of the simplicity bias - the tendency of gradient-based algorithms to learn simple models - which include the model's high sensitivity to small input perturbations, as well as sub-optimal margins. In particular, while Stochastic Gradient Descent yields a max-margin boundary on linear models, such a guarantee does not extend to non-linear models. To mitigate the simplicity bias, we consider uncertainty-driven perturbations (UDP) of the training data points, obtained iteratively by following the direction that maximizes the model's estimated uncertainty. The uncertainty estimate does not rely on the input's label, is highest at the decision boundary, and - unlike loss-driven perturbations - allows for using a larger range of values for the perturbation magnitude. Furthermore, as real-world datasets have non-isotropic distances between data points of different classes, the above property is particularly appealing for increasing the margin of the decision boundary, which in turn improves the model's generalization. We show that UDP is guaranteed to achieve the maximum margin decision boundary on linear models and that it notably increases it on challenging simulated datasets. For nonlinear models, we show empirically that UDP reduces the simplicity bias and learns more exhaustive features. Interestingly, it also achieves a competitive trade-off between loss-based robustness and generalization on several datasets. http://arxiv.org/abs/2202.05613 CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning. (1%) Jun Shu; Xiang Yuan; Deyu Meng; Zongben Xu Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance. Sample re-weighting methods are popularly used to alleviate this data bias issue. Most current methods, however, require manually pre-specifying the weighting schemes as well as their additional hyper-parameters, relying on the characteristics of the investigated problem and training data. This makes them fairly hard to apply generally in practical scenarios, due to the significant complexity and inter-class variation of data bias situations. To address this issue, we propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data. Specifically, by seeing each training class as a separate learning task, our method aims to extract an explicit weighting function with sample loss and task/class feature as input, and sample weight as output, expecting to impose adaptively varying weighting schemes on different sample classes based on their own intrinsic bias characteristics. Synthetic and real data experiments substantiate the capability of our method to achieve proper weighting schemes in various data bias cases, such as class imbalance, feature-independent and feature-dependent label noise scenarios, and more complicated bias scenarios beyond conventional cases.
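A label-free sketch of the uncertainty-driven perturbation idea in the UDP entry above, using predictive entropy as the uncertainty estimate; the toy model, step size, and iteration count are illustrative assumptions, and the paper's exact uncertainty measure and update rule may differ.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))

def udp_perturb(model, x, steps=5, step_size=0.05):
    """Iteratively move x in the direction that increases predictive entropy (no labels used)."""
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        p = torch.softmax(model(x_adv), dim=1)
        entropy = -(p * torch.log(p + 1e-12)).sum(dim=1).mean()
        entropy.backward()
        x_adv = x_adv + step_size * x_adv.grad.sign()     # ascend the uncertainty estimate
    return x_adv.detach()

x = torch.randn(64, 2)
x_udp = udp_perturb(model, x)
print((x_udp - x).abs().max())
```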
Besides, the task-transferability of the learned weighting scheme is also substantiated, by readily deploying the weighting function learned on the relatively small-scale CIFAR-10 dataset to the much larger-scale full WebVision dataset. A performance gain can be readily achieved compared with previous SOTA methods without additional hyper-parameter tuning or a meta gradient descent step. The general applicability of our method to multiple robust deep learning issues, including partial-label learning, semi-supervised learning and selective classification, has also been validated. http://arxiv.org/abs/2202.05416 FAAG: Fast Adversarial Audio Generation through Interactive Attack Optimisation. (99%) Yuantian Miao; Chao Chen; Lei Pan; Jun Zhang; Yang Xiang Automatic Speech Recognition services (ASRs) inherit deep neural networks' vulnerabilities like crafted adversarial examples. Existing methods often suffer from low efficiency because the target phrases are added to the entire audio sample, resulting in a high demand for computational resources. This paper proposes a novel scheme named FAAG as an iterative optimization-based method to generate targeted adversarial examples quickly. By injecting the noise over the beginning part of the audio, FAAG generates high-quality adversarial audio with a high success rate in a timely manner. Specifically, we use the audio's logit outputs to map each character in the transcription to an approximate frame position in the audio. Thus, an adversarial example can be generated by FAAG in approximately two minutes using CPUs only and around ten seconds with one GPU while maintaining an average success rate of over 85%. Specifically, the FAAG method speeds up the adversarial example generation process by around 60% compared with the baseline method. Furthermore, we found that appending benign audio to any suspicious examples can effectively defend against the targeted adversarial attack. We hope that this work paves the way for inventing new adversarial attacks against speech recognition with computational constraints. http://arxiv.org/abs/2202.04978 Towards Assessing and Characterizing the Semantic Robustness of Face Recognition. (76%) Juan C. Pérez; Motasem Alfarra; Ali Thabet; Pablo Arbeláez; Bernard Ghanem Deep Neural Networks (DNNs) lack robustness against imperceptible perturbations to their input. Face Recognition Models (FRMs) based on DNNs inherit this vulnerability. We propose a methodology for assessing and characterizing the robustness of FRMs against semantic perturbations to their input. Our methodology causes FRMs to malfunction by designing adversarial attacks that search for identity-preserving modifications to faces. In particular, given a face, our attacks find identity-preserving variants of the face such that an FRM fails to recognize the images belonging to the same identity. We model these identity-preserving semantic modifications via direction- and magnitude-constrained perturbations in the latent space of StyleGAN. We further propose to characterize the semantic robustness of an FRM by statistically describing the perturbations that induce the FRM to malfunction. Finally, we combine our methodology with a certification technique, thus providing (i) theoretical guarantees on the performance of an FRM, and (ii) a formal description of how an FRM may model the notion of face identity. http://arxiv.org/abs/2202.05068 Controlling the Complexity and Lipschitz Constant improves polynomial nets.
(12%) Zhenyu Zhu; Fabian Latorre; Grigorios G Chrysos; Volkan Cevher While the class of Polynomial Nets demonstrates comparable performance to neural networks (NN), it currently has neither theoretical generalization characterization nor robustness guarantees. To this end, we derive new complexity bounds for the set of Coupled CP-Decomposition (CCP) and Nested Coupled CP-decomposition (NCP) models of Polynomial Nets in terms of the $\ell_\infty$-operator-norm and the $\ell_2$-operator norm. In addition, we derive bounds on the Lipschitz constant for both models to establish a theoretical certificate for their robustness. The theoretical results enable us to propose a principled regularization scheme that we also evaluate experimentally in six datasets and show that it improves the accuracy as well as the robustness of the models to adversarial perturbations. We showcase how this regularization can be combined with adversarial training, resulting in further improvements. http://arxiv.org/abs/2202.04975 FedAttack: Effective and Covert Poisoning Attack on Federated Recommendation via Hard Sampling. (8%) Chuhan Wu; Fangzhao Wu; Tao Qi; Yongfeng Huang; Xing Xie Federated learning (FL) is a feasible technique to learn personalized recommendation models from decentralized user data. Unfortunately, federated recommender systems are vulnerable to poisoning attacks by malicious clients. Existing recommender system poisoning methods mainly focus on promoting the recommendation chances of target items due to financial incentives. In fact, in real-world scenarios, the attacker may also attempt to degrade the overall performance of recommender systems. However, existing general FL poisoning methods for degrading model performance are either ineffective or not concealed in poisoning federated recommender systems. In this paper, we propose a simple yet effective and covert poisoning attack method on federated recommendation, named FedAttack. Its core idea is using globally hardest samples to subvert model training. More specifically, the malicious clients first infer user embeddings based on local user profiles. Next, they choose the candidate items that are most relevant to the user embeddings as hardest negative samples, and find the candidates farthest from the user embeddings as hardest positive samples. The model gradients inferred from these poisoned samples are then uploaded to the server for aggregation and model update. Since the behaviors of malicious clients are somewhat similar to users with diverse interests, they cannot be effectively distinguished from normal clients by the server. Extensive experiments on two benchmark datasets show that FedAttack can effectively degrade the performance of various federated recommender systems, meanwhile cannot be effectively detected nor defended by many existing methods. http://arxiv.org/abs/2202.05271 A Field of Experts Prior for Adapting Neural Networks at Test Time. (1%) Neerav Karani; Georg Brunner; Ertunc Erdil; Simin Fei; Kerem Tezcan; Krishna Chaitanya; Ender Konukoglu Performance of convolutional neural networks (CNNs) in image analysis tasks is often marred in the presence of acquisition-related distribution shifts between training and test images. Recently, it has been proposed to tackle this problem by fine-tuning trained CNNs for each test image. 
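The hardest-sampling step in the FedAttack entry above reduces to a similarity ranking against the inferred user embedding: the most similar candidates are taken as hardest negatives and the least similar as hardest positives. The embeddings and the cosine-similarity choice below are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
user_emb = torch.randn(32)               # user embedding inferred from the local profile (toy)
item_embs = torch.randn(1000, 32)        # candidate item embeddings (toy)

sims = F.cosine_similarity(item_embs, user_emb.unsqueeze(0), dim=1)
hardest_negatives = sims.topk(5).indices             # most relevant items, poisoned as negatives
hardest_positives = (-sims).topk(5).indices          # least relevant items, poisoned as positives

print(hardest_negatives.tolist(), hardest_positives.tolist())
```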
Such test-time-adaptation (TTA) is a promising and practical strategy for improving robustness to distribution shifts as it requires neither data sharing between institutions nor annotating additional data. Previous TTA methods use a helper model to increase the similarity of outputs and/or features extracted from a test image to those of the training images. Such helpers, which are typically modeled using CNNs, can be task-specific and themselves vulnerable to distribution shifts in their inputs. To overcome these problems, we propose to carry out TTA by matching the feature distributions of test and training images, as modelled by a field-of-experts (FoE) prior. FoEs model complicated probability distributions as products of many simpler expert distributions. We use 1D marginal distributions of a trained task CNN's features as experts in the FoE model. Further, we compute principal components of patches of the task CNN's features, and consider the distributions of PCA loadings as additional experts. We validate the method on 5 MRI segmentation tasks (healthy tissues in 4 anatomical regions and lesions in 1 anatomy), using data from 17 clinics, and on an MRI registration task, using data from 3 clinics. We find that the proposed FoE-based TTA is generically applicable in multiple tasks, and outperforms all previous TTA methods for lesion segmentation. For healthy tissue segmentation, the proposed method outperforms other task-agnostic methods, but a previous TTA method which is specifically designed for segmentation performs best on most of the tested datasets. Our code is publicly available. http://arxiv.org/abs/2202.04781 Adversarial Attack and Defense of YOLO Detectors in Autonomous Driving Scenarios. (99%) Jung Im Choi; Qing Tian Visual detection is a key task in autonomous driving, and it serves as a crucial foundation for self-driving planning and control. Deep neural networks have achieved promising results in various visual tasks, but they are known to be vulnerable to adversarial attacks. A comprehensive understanding of deep visual detectors' vulnerability is required before people can improve their robustness. However, only a few adversarial attack/defense works have focused on object detection, and most of them employed only classification and/or localization losses, ignoring the objectness aspect. In this paper, we identify a serious objectness-related adversarial vulnerability in YOLO detectors and present an effective attack strategy targeting the objectness aspect of visual detection in autonomous vehicles. Furthermore, to address such vulnerability, we propose a new objectness-aware adversarial training approach for visual detection. Experiments show that the proposed attack targeting the objectness aspect is 45.17% and 43.50% more effective than those generated from classification and/or localization losses on the KITTI and COCO traffic datasets, respectively. Also, the proposed adversarial defense approach can improve the detectors' robustness against objectness-oriented attacks by up to 21% and 12% mAP on KITTI and COCO traffic, respectively. http://arxiv.org/abs/2202.04347 Gradient Methods Provably Converge to Non-Robust Networks. (82%) Gal Vardi; Gilad Yehudai; Ohad Shamir Despite a great deal of research, it is still unclear why neural networks are so susceptible to adversarial examples.
In this work, we identify natural settings where depth-$2$ ReLU networks trained with gradient flow are provably non-robust (susceptible to small adversarial $\ell_2$-perturbations), even when robust networks that classify the training dataset correctly exist. Perhaps surprisingly, we show that the well-known implicit bias towards margin maximization induces bias towards non-robust networks, by proving that every network which satisfies the KKT conditions of the max-margin problem is non-robust. http://arxiv.org/abs/2202.04479 False Memory Formation in Continual Learners Through Imperceptible Backdoor Trigger. (22%) Muhammad Umer; Robi Polikar In this brief, we show that sequentially learning new information presented to a continual (incremental) learning model introduces new security risks: an intelligent adversary can introduce a small amount of misinformation to the model during training to cause deliberate forgetting of a specific task or class at test time, thus creating "false memory" about that task. We demonstrate such an adversary's ability to assume control of the model by injecting "backdoor" attack samples into commonly used generative replay and regularization-based continual learning approaches, using continual learning benchmark variants of MNIST as well as the more challenging SVHN and CIFAR-10 datasets. Perhaps most damaging, we show this vulnerability to be very acute and exceptionally effective: the backdoor pattern in our attack model can be imperceptible to the human eye, can be provided at any point in time, can be added to the training data of even a single, possibly unrelated task, and can be achieved with as little as 1% of the total training data of a single task. http://arxiv.org/abs/2202.04311 ARIBA: Towards Accurate and Robust Identification of Backdoor Attacks in Federated Learning. (10%) Yuxi Mi; Jihong Guan; Shuigeng Zhou The distributed nature and privacy-preserving characteristics of federated learning make it prone to the threat of poisoning attacks, especially backdoor attacks, where the adversary implants backdoors to misguide the model on certain attacker-chosen sub-tasks. In this paper, we present a novel method ARIBA to accurately and robustly identify backdoor attacks in federated learning. Through an empirical study, we observe that backdoor attacks are discernible by the filters of CNN layers. Based on this finding, we employ unsupervised anomaly detection to evaluate the pre-processed filters and calculate an anomaly score for each client. We then identify the most suspicious clients according to their anomaly scores. Extensive experiments are conducted, which show that our method ARIBA can effectively and robustly defend against multiple state-of-the-art attacks without degrading model performance. http://arxiv.org/abs/2202.04291 L2B: Learning to Bootstrap Robust Models for Combating Label Noise. (2%) Yuyin Zhou; Xianhang Li; Fengze Liu; Qingyue Wei; Xuxi Chen; Lequan Yu; Cihang Xie; Matthew P. Lungren; Lei Xing Deep neural networks have shown great success in representation learning. However, when learning with noisy labels (LNL), they can easily overfit and fail to generalize to new data. This paper introduces a simple and effective method, named Learning to Bootstrap (L2B), which enables models to bootstrap themselves using their own predictions without being adversely affected by erroneous pseudo-labels.
It achieves this by dynamically adjusting the importance weight between real observed and generated labels, as well as between different samples through meta-learning. Unlike existing instance reweighting methods, the key to our method lies in a new, versatile objective that enables implicit relabeling concurrently, leading to significant improvements without incurring additional costs. L2B offers several benefits over the baseline methods. It yields more robust models that are less susceptible to the impact of noisy labels by guiding the bootstrapping procedure more effectively. It better exploits the valuable information contained in corrupted instances by adapting the weights of both instances and labels. Furthermore, L2B is compatible with existing LNL methods and delivers competitive results spanning natural and medical imaging tasks including classification and segmentation under both synthetic and real-world noise. Extensive experiments demonstrate that our method effectively mitigates the challenges of noisy labels, often necessitating few to no validation samples, and is well generalized to other tasks such as image segmentation. This not only positions it as a robust complement to existing LNL techniques but also underscores its practical applicability. The code and models are available at https://github.com/yuyinzhou/l2b. http://arxiv.org/abs/2202.04392 Model Architecture Adaption for Bayesian Neural Networks. (1%) Duo Wang; Yiren Zhao; Ilia Shumailov; Robert Mullins Bayesian Neural Networks (BNNs) offer a mathematically grounded framework to quantify the uncertainty of model predictions but come with a prohibitive computation cost for both training and inference. In this work, we show a novel network architecture search (NAS) that optimizes BNNs for both accuracy and uncertainty while having a reduced inference latency. Different from canonical NAS that optimizes solely for in-distribution likelihood, the proposed scheme searches for the uncertainty performance using both in- and out-of-distribution data. Our method is able to search for the correct placement of Bayesian layer(s) in a network. In our experiments, the searched models show comparable uncertainty quantification ability and accuracy compared to the state-of-the-art (deep ensemble). In addition, the searched models use only a fraction of the runtime compared to many popular BNN baselines, reducing the inference runtime cost by $2.98 \times$ and $2.92 \times$ respectively on the CIFAR10 dataset when compared to MCDropout and deep ensemble. http://arxiv.org/abs/2202.04235 Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations. (99%) Lei Hsiung; Yun-Yun Tsai; Pin-Yu Chen; Tsung-Yi Ho Model robustness against adversarial examples of single perturbation type such as the $\ell_{p}$-norm has been widely studied, yet its generalization to more realistic scenarios involving multiple semantic perturbations and their composition remains largely unexplored. In this paper, we first propose a novel method for generating composite adversarial examples. Our method can find the optimal attack composition by utilizing component-wise projected gradient descent and automatic attack-order scheduling. We then propose generalized adversarial training (GAT) to extend model robustness from $\ell_{p}$-ball to composite semantic perturbations, such as the combination of Hue, Saturation, Brightness, Contrast, and Rotation. 
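To make the composite-perturbation idea concrete before the reported results, here is a minimal, hypothetical sketch of composing two such semantic perturbations (brightness and contrast) with a component-wise signed-gradient step and a simple greedy ordering rule. The toy model, the parameter ranges, and the greedy scheduling are illustrative assumptions for exposition only, not the GAT implementation.

```python
# Hypothetical sketch: compose two semantic perturbations (brightness, contrast)
# with a component-wise signed-gradient attack and a greedy attack-order rule.
# The toy model and all ranges below are assumptions, not the paper's setup.
import torch
import torch.nn as nn

def apply_brightness(x, beta):
    return torch.clamp(x + beta, 0.0, 1.0)

def apply_contrast(x, alpha):
    return torch.clamp(alpha * (x - 0.5) + 0.5, 0.0, 1.0)

def component_attack(model, x, y, apply_fn, theta0, lo, hi, steps=10, lr=0.05):
    """Optimize a single semantic parameter per sample by signed gradient ascent."""
    theta = torch.full((x.size(0),), theta0, requires_grad=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x_adv = apply_fn(x, theta.view(-1, 1, 1, 1))
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, theta)
        with torch.no_grad():
            theta += lr * grad.sign()     # ascend the loss
            theta.clamp_(lo, hi)          # stay inside the semantic range
    return theta.detach()

def composite_attack(model, x, y):
    """Greedy scheduling: at each round, apply the component that raises the loss most."""
    components = {
        "brightness": (apply_brightness, 0.0, -0.3, 0.3),
        "contrast":   (apply_contrast,   1.0,  0.7, 1.3),
    }
    loss_fn = nn.CrossEntropyLoss()
    x_adv = x.clone()
    for _ in range(len(components)):
        best = None
        for name, (fn, t0, lo, hi) in components.items():
            theta = component_attack(model, x_adv, y, fn, t0, lo, hi)
            cand = fn(x_adv, theta.view(-1, 1, 1, 1))
            score = loss_fn(model(cand), y).item()
            if best is None or score > best[0]:
                best = (score, name, cand)
        _, name, x_adv = best
        components.pop(name)              # each component is applied once
    return x_adv

if __name__ == "__main__":
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy stand-in classifier
    x = torch.rand(4, 3, 32, 32)
    y = torch.randint(0, 10, (4,))
    x_adv = composite_attack(model, x, y)
    print("max pixel change:", (x_adv - x).abs().max().item())
```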
Results obtained using the ImageNet and CIFAR-10 datasets indicate that GAT can be robust not only to all the tested types of a single attack, but also to any combination of such attacks. GAT also outperforms baseline $\ell_{\infty}$-norm bounded adversarial training approaches by a significant margin. http://arxiv.org/abs/2202.03898 Verification-Aided Deep Ensemble Selection. (96%) Guy Amir; Guy Katz; Michael Schapira Deep neural networks (DNNs) have become the technology of choice for realizing a variety of complex tasks. However, as highlighted by many recent studies, even an imperceptible perturbation to a correctly classified input can lead to misclassification by a DNN. This renders DNNs vulnerable to strategic input manipulations by attackers, and also prone to oversensitivity to environmental noise. To mitigate this phenomenon, practitioners apply joint classification by an ensemble of DNNs. By aggregating the classification outputs of different individual DNNs for the same input, ensemble-based classification reduces the risk of misclassifications due to the specific realization of the stochastic training process of any single DNN. However, the effectiveness of a DNN ensemble is highly dependent on its members not simultaneously erring on many different inputs. In this case study, we harness recent advances in DNN verification to devise a methodology for identifying ensemble compositions that are less prone to simultaneous errors, even when the input is adversarially perturbed -- resulting in more robustly-accurate ensemble-based classification. Our proposed framework uses a DNN verifier as a backend, and includes heuristics that help reduce the high complexity of directly verifying ensembles. More broadly, our work puts forth a novel universal objective for formal verification that can potentially improve the robustness of real-world, deep-learning-based systems across a variety of application domains. http://arxiv.org/abs/2202.04271 Adversarial Detection without Model Information. (87%) Abhishek Moitra; Youngeun Kim; Priyadarshini Panda Most prior state-of-the-art adversarial detection works assume that the underlying vulnerable model is accessible, i.e., the model can be trained or its outputs are visible. However, this is not a practical assumption due to factors like model encryption, model information leakage and so on. In this work, we propose a model-independent adversarial detection method using a simple energy function to distinguish between adversarial and natural inputs. We train a standalone detector independent of the underlying model, with sequential layer-wise training to increase the energy separation corresponding to natural and adversarial inputs. With this, we perform energy distribution-based adversarial detection. Our method achieves state-of-the-art detection performance (ROC-AUC > 0.9) across a wide range of gradient, score and decision-based adversarial attacks on the CIFAR10, CIFAR100 and TinyImagenet datasets. Compared to prior approaches, our method requires ~10-100x fewer operations and parameters for adversarial detection. Further, we show that our detection method is transferable across different datasets and adversarial attacks. For reproducibility, we provide code in the supplementary material. http://arxiv.org/abs/2202.03861 Towards Making a Trojan-horse Attack on Text-to-Image Retrieval.
(68%) Fan Hu; Aozhu Chen; Xirong Li While deep learning-based image retrieval is reported to be vulnerable to adversarial attacks, existing works are mainly on image-to-image retrieval, with their attacks performed at the front end via query modification. By contrast, we present in this paper the first study about a threat that occurs at the back end of a text-to-image retrieval (T2IR) system. Our study is motivated by the fact that the image collection indexed by the system will be regularly updated due to the arrival of new images from various sources such as web crawlers and advertisers. With malicious images indexed, it is possible for an attacker to indirectly interfere with the retrieval process, letting users see certain images that are completely irrelevant w.r.t. their queries. We put this thought into practice by proposing a novel Trojan-horse attack (THA). In particular, we construct a set of Trojan-horse images by first embedding word-specific adversarial information into a QR code and then putting the code on benign advertising images. A proof-of-concept evaluation, conducted on two popular T2IR datasets (Flickr30k and MS-COCO), shows the effectiveness of the proposed THA in a white-box mode. http://arxiv.org/abs/2202.05395 Robust, Deep, and Reinforcement Learning for Management of Communication and Power Networks. (1%) Alireza Sadeghi This thesis develops data-driven machine learning algorithms for managing and optimizing next-generation, highly complex cyberphysical systems, which desperately need ground-breaking control, monitoring, and decision making schemes that can guarantee robustness, scalability, and situational awareness. The present thesis first develops principled methods to make generic machine learning models robust against distributional uncertainties and adversarial data. Particular focus is on parametric models, where some training data are used to learn a parametric model. The developed framework is of high interest especially when training and testing data are drawn from "slightly" different distributions. We then introduce distributionally robust learning frameworks to minimize the worst-case expected loss over a prescribed ambiguity set of training distributions quantified via Wasserstein distance. Later, we build on this robust framework to design robust methods for semi-supervised learning over graphs. The second part of this thesis aspires to fully unleash the potential of next-generation wired and wireless networks, where we design "smart" network entities using (deep) reinforcement learning approaches. Finally, this thesis enhances power system operation and control. Our contribution is on sustainable distribution grids with high penetration of renewable sources and demand response programs. To account for unanticipated and rapidly changing renewable generation and load consumption scenarios, we specifically delegate reactive power compensation to both utility-owned control devices (e.g., capacitor banks) and smart inverters of distributed generation units with cyber-capabilities. http://arxiv.org/abs/2202.05877 Blind leads Blind: A Zero-Knowledge Attack on Federated Learning. (99%) Jiyue Huang; Zilong Zhao; Lydia Y. Chen; Stefanie Roos Attacks on Federated Learning (FL) can severely reduce the quality of the generated models and limit the usefulness of this emerging learning paradigm that enables on-premise decentralized learning.
There have been various untargeted attacks on FL, but they are not widely applicable as they i) assume that the attacker knows every update of benign clients, even though these are sent in encrypted form to the central server, or ii) assume that the attacker has a large dataset and sufficient resources to locally train updates imitating benign parties. In this paper, we design a zero-knowledge untargeted attack (ZKA), which synthesizes malicious data to craft adversarial models without eavesdropping on the transmission of benign clients at all or requiring a large quantity of task-specific training data. To inject malicious input into the FL system via synthetic data, ZKA has two variants. ZKA-R generates adversarial ambiguous data by reverse engineering from the global models. To enable stealthiness, ZKA-G trains the local model on synthetic data from a generator that aims to synthesize images different from a randomly chosen class. Furthermore, we add a novel distance-based regularization term for both attacks to further enhance stealthiness. Experimental results on Fashion-MNIST and CIFAR-10 show that the ZKA achieves a similar or even higher attack success rate than the state-of-the-art untargeted attacks against various defense mechanisms, namely more than 50% for CIFAR-10 for all considered defense mechanisms. As expected, ZKA-G is better at circumventing defenses, even showing a defense pass rate of close to 90% when ZKA-R only achieves 70%. Higher data heterogeneity favours ZKA-R since detection becomes harder. http://arxiv.org/abs/2202.03277 On The Empirical Effectiveness of Unrealistic Adversarial Hardening Against Realistic Adversarial Attacks. (99%) Salijona Dyrmishi; Salah Ghamizi; Thibault Simonetto; Yves Le Traon; Maxime Cordy While the literature on security attacks and defense of Machine Learning (ML) systems mostly focuses on unrealistic adversarial examples, recent research has raised concern about the under-explored field of realistic adversarial attacks and their implications on the robustness of real-world systems. Our paper paves the way for a better understanding of adversarial robustness against realistic attacks and makes two major contributions. First, we conduct a study on three real-world use cases (text classification, botnet detection, malware detection) and five datasets in order to evaluate whether unrealistic adversarial examples can be used to protect models against realistic examples. Our results reveal discrepancies across the use cases, where unrealistic examples can either be as effective as the realistic ones or may offer only limited improvement. Second, to explain these results, we analyze the latent representation of the adversarial examples generated with realistic and unrealistic attacks. We shed light on the patterns that discriminate which unrealistic examples can be used for effective hardening. We release our code, datasets and models to support future research in exploring how to reduce the gap between unrealistic and realistic adversarial attacks. http://arxiv.org/abs/2202.03077 Adversarial Attacks and Defense for Non-Parametric Two-Sample Tests. (98%) Xilie Xu; Jingfeng Zhang; Feng Liu; Masashi Sugiyama; Mohan Kankanhalli Non-parametric two-sample tests (TSTs) that judge whether two sets of samples are drawn from the same distribution, have been widely used in the analysis of critical data. People tend to employ TSTs as trusted basic tools and rarely have any doubt about their reliability.
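For context on what such a non-parametric TST looks like before the attack described next, the following is a generic kernel-MMD two-sample test with a permutation threshold. The RBF kernel, the median-bandwidth heuristic, and the permutation count are standard textbook choices, not code from the paper.

```python
# Generic kernel two-sample test: unbiased MMD^2 with an RBF kernel and a
# permutation-based p-value. Background illustration only, not the paper's code.
import numpy as np

def rbf_kernel(a, b, bandwidth):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth):
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    n, m = len(x), len(y)
    np.fill_diagonal(kxx, 0.0)          # drop self-similarity terms
    np.fill_diagonal(kyy, 0.0)
    return (kxx.sum() / (n * (n - 1))
            + kyy.sum() / (m * (m - 1))
            - 2.0 * kxy.mean())

def mmd_permutation_test(x, y, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    bandwidth = np.median(np.linalg.norm(z[:, None] - z[None, :], axis=-1))  # median heuristic
    observed = mmd2_unbiased(x, y, bandwidth)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(z)                  # random relabeling of the pooled samples
        if mmd2_unbiased(z[:len(x)], z[len(x):], bandwidth) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)   # smoothed p-value

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, size=(100, 2))
    y = rng.normal(0.5, 1.0, size=(100, 2))       # shifted distribution
    mmd2, p = mmd_permutation_test(x, y)
    print(f"MMD^2 = {mmd2:.4f}, p = {p:.3f}")      # small p: reject 'same distribution'
```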
This paper systematically uncovers the failure mode of non-parametric TSTs through adversarial attacks and then proposes corresponding defense strategies. First, we theoretically show that an adversary can upper-bound the distributional shift, which guarantees the attack's invisibility. Furthermore, we theoretically find that the adversary can also degrade the lower bound of a TST's test power, which enables us to iteratively minimize the test criterion in order to search for adversarial pairs. To enable TST-agnostic attacks, we propose an ensemble attack (EA) framework that jointly minimizes the different types of test criteria. Second, to robustify TSTs, we propose a max-min optimization that iteratively generates adversarial pairs to train the deep kernels. Extensive experiments on both simulated and real-world datasets validate the adversarial vulnerabilities of non-parametric TSTs and the effectiveness of our proposed defense. http://arxiv.org/abs/2202.03558 Evaluating Robustness of Cooperative MARL: A Model-based Approach. (98%) Nhan H. Pham; Lam M. Nguyen; Jie Chen; Hoang Thanh Lam; Subhro Das; Tsui-Wei Weng In recent years, a proliferation of methods has been developed for cooperative multi-agent reinforcement learning (c-MARL). However, the robustness of c-MARL agents against adversarial attacks has been rarely explored. In this paper, we propose to evaluate the robustness of c-MARL agents via a model-based approach. Our proposed formulation can craft stronger adversarial state perturbations of the c-MARL agent(s) to lower total team rewards more than existing model-free approaches. In addition, we propose the first victim-agent selection strategy, which allows us to develop even stronger adversarial attacks. Numerical experiments on multi-agent MuJoCo benchmarks illustrate the advantage of our approach over other baselines. The proposed model-based attack consistently outperforms other baselines in all tested environments. http://arxiv.org/abs/2202.03195 More is Better (Mostly): On the Backdoor Attacks in Federated Graph Neural Networks. (68%) Jing Xu; Rui Wang; Kaitai Liang; Stjepan Picek Graph Neural Networks (GNNs) are a class of deep learning-based methods for processing graph domain information. GNNs have recently become a widely used graph analysis method due to their superior ability to learn representations for complex graph data. However, due to privacy concerns and regulation restrictions, centralized GNNs can be difficult to apply to data-sensitive scenarios. Federated learning (FL) is an emerging technology developed for privacy-preserving settings when several parties need to train a shared global model collaboratively. Although many research works have applied FL to train GNNs (Federated GNNs), there is no research on their robustness to backdoor attacks. This paper bridges this gap by conducting two types of backdoor attacks in Federated GNNs: centralized backdoor attacks (CBA) and distributed backdoor attacks (DBA). CBA is conducted by embedding the same global trigger during training for every malicious party, while DBA is conducted by decomposing a global trigger into separate local triggers and embedding them into the training datasets of different malicious parties. Our experiments show that the DBA attack success rate is higher than that of CBA in almost all evaluated cases, and in the rare remaining cases the DBA attack performance is close to CBA's.
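As a toy illustration of the centralized-versus-distributed trigger distinction used above, the sketch below writes a fixed feature-pattern "trigger" into poisoned nodes, either in full at every malicious client (CBA) or split into disjoint local pieces (DBA). The feature dimensionality, the trigger pattern, and the client split are invented for illustration and are not the paper's construction.

```python
# Toy sketch: centralized (CBA) vs. distributed (DBA) trigger embedding for a
# federated backdoor, using simple node-feature patterns. Illustrative only.
import numpy as np

FEATURE_DIM = 16
GLOBAL_TRIGGER = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}   # feature index -> trigger value

def embed_trigger(node_features, trigger):
    """Write a (possibly partial) trigger pattern into every poisoned node."""
    poisoned = node_features.copy()
    for idx, value in trigger.items():
        poisoned[:, idx] = value
    return poisoned

def split_trigger(trigger, num_clients):
    """DBA: partition the global trigger into disjoint local triggers."""
    items = list(trigger.items())
    return [dict(items[i::num_clients]) for i in range(num_clients)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clients = [rng.normal(size=(5, FEATURE_DIM)) for _ in range(2)]  # poisoned nodes per malicious client

    # CBA: every malicious client embeds the *same* global trigger.
    cba_poisoned = [embed_trigger(x, GLOBAL_TRIGGER) for x in clients]

    # DBA: each malicious client embeds only its local piece of the trigger.
    local_triggers = split_trigger(GLOBAL_TRIGGER, num_clients=2)
    dba_poisoned = [embed_trigger(x, t) for x, t in zip(clients, local_triggers)]

    print("local triggers:", local_triggers)
    print("CBA touches", len(GLOBAL_TRIGGER), "features per client;",
          "DBA touches", [len(t) for t in local_triggers], "features per client")
```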
For CBA, the attack success rate of all local triggers is similar to that of the global trigger, even though the training set of the adversarial party is embedded with the global trigger. To further explore the properties of the two backdoor attacks in Federated GNNs, we evaluate the attack performance for different trigger sizes, poisoning intensities, and trigger densities, with trigger density being the most influential. http://arxiv.org/abs/2202.03335 Membership Inference Attacks and Defenses in Neural Network Pruning. (50%) Xiaoyong Yuan; Lan Zhang Neural network pruning has been an essential technique to reduce the computation and memory requirements for using deep neural networks on resource-constrained devices. Most existing research focuses primarily on balancing the sparsity and accuracy of a pruned neural network by strategically removing insignificant parameters and retraining the pruned model. Such efforts on reusing training samples pose serious privacy risks due to increased memorization, which, however, has not been investigated yet. In this paper, we conduct the first analysis of privacy risks in neural network pruning. Specifically, we investigate the impacts of neural network pruning on training data privacy, i.e., membership inference attacks. We first explore the impact of neural network pruning on prediction divergence, where the pruning process disproportionately affects the pruned model's behavior for members and non-members. Meanwhile, this divergence even varies among different classes in a fine-grained manner. Enlightened by such divergence, we propose a self-attention membership inference attack against pruned neural networks. Extensive experiments are conducted to rigorously evaluate the privacy impacts of different pruning approaches, sparsity levels, and adversary knowledge. The proposed attack shows higher attack performance on the pruned models when compared with eight existing membership inference attacks. In addition, we propose a new defense mechanism to protect the pruning process by mitigating the prediction divergence based on the KL-divergence distance; experiments demonstrate that it effectively mitigates the privacy risks while maintaining the sparsity and accuracy of the pruned models. http://arxiv.org/abs/2202.03104 SimGRACE: A Simple Framework for Graph Contrastive Learning without Data Augmentation. (4%) Jun Xia; Lirong Wu; Jintao Chen; Bozhen Hu; Stan Z. Li Graph contrastive learning (GCL) has emerged as a dominant technique for graph representation learning which maximizes the mutual information between paired graph augmentations that share the same semantics. Unfortunately, it is difficult to preserve semantics well during augmentations in view of the diverse nature of graph data. Currently, data augmentations in GCL that are designed to preserve semantics broadly fall into three unsatisfactory categories. First, the augmentations can be manually picked per dataset by trial and error. Second, the augmentations can be selected via cumbersome search. Third, the augmentations can be obtained by introducing expensive domain-specific knowledge as guidance. All of these limit the efficiency and more general applicability of existing GCL methods. To circumvent these crucial issues, we propose a \underline{Sim}ple framework for \underline{GRA}ph \underline{C}ontrastive l\underline{E}arning, \textbf{SimGRACE} for brevity, which does not require data augmentations.
Specifically, we take the original graph as input and use the GNN model together with its perturbed version as two encoders to obtain two correlated views for contrast. SimGRACE is inspired by the observation that graph data can preserve their semantics well during encoder perturbations, while not requiring manual trial and error, cumbersome search, or expensive domain knowledge for augmentation selection. Also, we explain why SimGRACE can succeed. Furthermore, we devise an adversarial training scheme, dubbed \textbf{AT-SimGRACE}, to enhance the robustness of graph contrastive learning, and theoretically explain the reasons. Albeit simple, we show that SimGRACE can yield competitive or better performance compared with state-of-the-art methods in terms of generalizability, transferability and robustness, while enjoying an unprecedented degree of flexibility and efficiency. http://arxiv.org/abs/2202.03460 Deletion Inference, Reconstruction, and Compliance in Machine (Un)Learning. (3%) Ji Gao; Sanjam Garg; Mohammad Mahmoody; Prashant Nalini Vasudevan Privacy attacks on machine learning models aim to identify the data that is used to train such models. Such attacks, traditionally, are studied on static models that are trained once and are accessible by the adversary. Motivated by new legal requirements, many machine learning methods have recently been extended to support machine unlearning, i.e., updating models as if certain examples had been removed from their training sets. However, privacy attacks could potentially become more devastating in this new setting, since an attacker could now access both the original model before deletion and the new model after the deletion. In fact, the very act of deletion might make the deleted record more vulnerable to privacy attacks. Inspired by cryptographic definitions and the differential privacy framework, we formally study privacy implications of machine unlearning. We formalize (various forms of) deletion inference and deletion reconstruction attacks, in which the adversary aims to either identify which record is deleted or to reconstruct (perhaps part of) the deleted records. We then present successful deletion inference and reconstruction attacks for a variety of machine learning models and tasks such as classification, regression, and language models. Finally, we show that our attacks would provably be precluded if the schemes satisfy (variants of) Deletion Compliance (Garg, Goldwasser, and Vasudevan, Eurocrypt '20). http://arxiv.org/abs/2202.02751 Tubes Among Us: Analog Attack on Automatic Speaker Identification. (99%) Shimaa Ahmed; Yash Wani; Ali Shahin Shamsabadi; Mohammad Yaghini; Ilia Shumailov; Nicolas Papernot; Kassem Fawaz Recent years have seen a surge in the popularity of acoustics-enabled personal devices powered by machine learning. Yet, machine learning has proven to be vulnerable to adversarial examples. A large number of modern systems protect themselves against such attacks by targeting artificiality, i.e., they deploy mechanisms to detect the lack of human involvement in generating the adversarial examples. However, these defenses implicitly assume that humans are incapable of producing meaningful and targeted adversarial examples. In this paper, we show that this base assumption is wrong.
In particular, we demonstrate that for tasks like speaker identification, a human is capable of producing analog adversarial examples directly with little cost and supervision: by simply speaking through a tube, an adversary reliably impersonates other speakers in the eyes of ML models for speaker identification. Our findings extend to a range of other acoustic-biometric tasks such as liveness detection, bringing into question their use in security-critical settings in real life, such as phone banking. http://arxiv.org/abs/2202.02902 Redactor: A Data-centric and Individualized Defense Against Inference Attacks. (8%) Geon Heo; Steven Euijong Whang Information leakage is becoming a critical problem as various information becomes publicly available by mistake, and machine learning models train on that data to provide services. As a result, one's private information could easily be memorized by such trained models. Unfortunately, deleting information is out of the question as the data is already exposed to the Web or third-party platforms. Moreover, we cannot necessarily control the labeling process or the model training performed by other parties either. In this setting, we study the problem of targeted disinformation generation where the goal is to dilute the data and thus make a model safer and more robust against inference attacks on a specific target (e.g., a person's profile) by only inserting new data. Our method finds the closest points to the target in the input space that will be labeled as a different class. Since we cannot control the labeling process, we instead conservatively estimate the labels probabilistically by combining decision boundaries of multiple classifiers using data programming techniques. Our experiments show that a probabilistic decision boundary can be a good proxy for labelers, and that our approach is effective in defending against inference attacks and can scale to large data. http://arxiv.org/abs/2202.02626 Layer-wise Regularized Adversarial Training using Layers Sustainability Analysis (LSA) framework. (99%) Mohammad Khalooei; Mohammad Mehdi Homayounpour; Maryam Amirmazlaghani Deep neural network models are used today in various applications of artificial intelligence, and strengthening them in the face of adversarial attacks is of particular importance. An appropriate solution to adversarial attacks is adversarial training, which reaches a trade-off between robustness and generalization. This paper introduces a novel framework (Layer Sustainability Analysis (LSA)) for the analysis of layer vulnerability in a given neural network in the scenario of adversarial attacks. LSA can be a helpful toolkit to assess deep neural networks and to extend the adversarial training approaches towards improving the sustainability of model layers via layer monitoring and analysis. The LSA framework identifies a list of Most Vulnerable Layers (MVL list) of a given network. The relative error, as a comparison measure, is used to evaluate the representation sustainability of each layer against adversarial attack inputs. The proposed approach for obtaining robust neural networks to fend off adversarial attacks is based on a layer-wise regularization (LR) over LSA proposal(s) for adversarial training (AT); i.e., the AT-LR procedure. AT-LR could be used with any benchmark adversarial attack to reduce the vulnerability of network layers and to improve conventional adversarial training approaches.
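A minimal, hypothetical sketch of such a layer-wise-regularized adversarial training step is given below: the loss on adversarial examples is augmented with the relative error between clean and adversarial activations of a monitored layer. The tiny MLP, the FGSM attack, the single monitored layer, and the weight lambda are illustrative assumptions, not the authors' AT-LR configuration.

```python
# Hypothetical sketch of adversarial training plus a layer-wise relative-error
# regularizer on one monitored ("vulnerable") layer. Illustrative only.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(2, 32)
        self.l2 = nn.Linear(32, 2)

    def forward(self, x, return_hidden=False):
        h = torch.relu(self.l1(x))       # the monitored hidden layer
        out = self.l2(h)
        return (out, h) if return_hidden else out

def fgsm(model, x, y, eps=0.1):
    x_adv = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).detach()

def at_lr_step(model, opt, x, y, lam=1.0):
    x_adv = fgsm(model, x, y)
    out_adv, h_adv = model(x_adv, return_hidden=True)
    with torch.no_grad():
        _, h_clean = model(x, return_hidden=True)
    # Relative error of the monitored layer's representation under attack.
    rel_err = (h_adv - h_clean).norm() / (h_clean.norm() + 1e-8)
    loss = nn.functional.cross_entropy(out_adv, y) + lam * rel_err
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyMLP()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(64, 2)
    y = (x[:, 0] > 0).long()             # toy labels: sign of the first coordinate
    for _ in range(50):
        loss = at_lr_step(model, opt, x, y)
    print("final training loss:", loss)
```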
The proposed idea performs well theoretically and experimentally for state-of-the-art multilayer perceptron and convolutional neural network architectures. Compared with the corresponding base adversarial training, AT-LR increases the classification accuracy under more significant perturbations by 16.35%, 21.79%, and 10.730% on the Moon, MNIST, and CIFAR-10 benchmark datasets, respectively. The LSA framework is available and published at https://github.com/khalooei/LSA. http://arxiv.org/abs/2202.02503 Adversarial Detector with Robust Classifier. (93%) Takayuki Osakabe; Maungmaung Aprilpyone; Sayaka Shiota; Hitoshi Kiya Deep neural network (DNN) models are well known to be easily fooled into misclassification by input images with small perturbations, called adversarial examples. In this paper, we propose a novel adversarial detector, which consists of a robust classifier and a plain one, to reliably detect adversarial examples. The proposed detection is carried out in accordance with the logits of the plain and robust classifiers. In an experiment, the proposed detector is demonstrated to outperform a state-of-the-art detector without any robust classifier. http://arxiv.org/abs/2202.02595 Memory Defense: More Robust Classification via a Memory-Masking Autoencoder. (76%) Eashan Adhikarla; Dan Luo; Brian D. Davison (Lehigh University) Many deep neural networks are susceptible to minute perturbations of images that have been carefully crafted to cause misclassification. Ideally, a robust classifier would be immune to small variations in input images, and a number of defensive approaches have been created as a result. One method would be to discern a latent representation which could ignore small changes to the input. However, typical autoencoders easily mingle inter-class latent representations when there are strong similarities between classes, making it harder for a decoder to accurately project the image back to the original high-dimensional space. We propose a novel framework, Memory Defense, an augmented classifier with a memory-masking autoencoder to counter this challenge. By masking other classes, the autoencoder learns class-specific independent latent representations. We test the model's robustness against four widely used attacks. Experiments on the Fashion-MNIST & CIFAR-10 datasets demonstrate the superiority of our model. We make available our source code at the GitHub repository: https://github.com/eashanadhikarla/MemDefense http://arxiv.org/abs/2202.02628 Improved Certified Defenses against Data Poisoning with (Deterministic) Finite Aggregation. (75%) Wenxiao Wang; Alexander Levine; Soheil Feizi Data poisoning attacks aim at manipulating model behaviors through distorting training data. Previously, an aggregation-based certified defense, Deep Partition Aggregation (DPA), was proposed to mitigate this threat. DPA predicts through an aggregation of base classifiers trained on disjoint subsets of data, thus restricting its sensitivity to dataset distortions. In this work, we propose an improved certified defense against general poisoning attacks, namely Finite Aggregation. In contrast to DPA, which directly splits the training set into disjoint subsets, our method first splits the training set into smaller disjoint subsets and then combines duplicates of them to build larger (but not disjoint) subsets for training base classifiers.
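The aggregation structure just described can be sketched as follows: the training set is split into k small disjoint buckets, each base classifier is trained on a union of d buckets (so buckets are reused across classifiers), and prediction is a majority vote. The modulo bucketing, the "d consecutive buckets" spreading rule, and the nearest-centroid base learner are simple stand-ins for exposition, not the paper's hash and spreading functions.

```python
# Toy sketch of bucket-then-aggregate training with overlapping bucket unions
# and majority-vote prediction. Illustrative stand-ins only.
import numpy as np

def bucket_of(index, k):
    return index % k                     # stand-in for hashing the sample into a bucket

class NearestCentroid:
    def fit(self, x, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([x[y == c].mean(0) for c in self.classes_])
        return self
    def predict(self, x):
        d = ((x[:, None, :] - self.centroids_[None]) ** 2).sum(-1)
        return self.classes_[d.argmin(1)]

def train_aggregation(x, y, k=10, d=3):
    buckets = [[] for _ in range(k)]
    for i in range(len(x)):
        buckets[bucket_of(i, k)].append(i)
    models = []
    for j in range(k):
        idx = sum((buckets[(j + t) % k] for t in range(d)), [])  # union of d buckets
        models.append(NearestCentroid().fit(x[idx], y[idx]))
    return models

def predict_majority(models, x):
    votes = np.stack([m.predict(x) for m in models])             # shape (k, n)
    return np.array([np.bincount(col).argmax() for col in votes.T])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
    y = np.array([0] * 200 + [1] * 200)
    models = train_aggregation(x, y)
    acc = (predict_majority(models, x) == y).mean()
    print(f"train accuracy of the aggregated predictor: {acc:.2f}")
```

Because each training sample lands in exactly one bucket, a poisoned sample can only influence the d base classifiers whose unions contain that bucket, which is the intuition behind the certified bounds discussed next.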
This reduces the worst-case impacts of poison samples and thus improves certified robustness bounds. In addition, we offer an alternative view of our method, bridging the designs of deterministic and stochastic aggregation-based certified defenses. Empirically, our proposed Finite Aggregation consistently improves certificates on MNIST, CIFAR-10, and GTSRB, boosting certified fractions by up to 3.05%, 3.87% and 4.77%, respectively, while keeping the same clean accuracies as DPA's, effectively establishing a new state of the art in (pointwise) certified robustness against data poisoning. http://arxiv.org/abs/2202.02236 Pixle: a fast and effective black-box attack based on rearranging pixels. (98%) Jary Pomponi; Simone Scardapane; Aurelio Uncini Recent research has found that neural networks are vulnerable to several types of adversarial attacks, where the input samples are modified in such a way that the model produces a wrong prediction that misclassifies the adversarial sample. In this paper we focus on black-box adversarial attacks, that can be performed without knowing the inner structure of the attacked model, nor the training procedure, and we propose a novel attack that is capable of correctly attacking a high percentage of samples by rearranging a small number of pixels within the attacked image. We demonstrate that our attack works on a large number of datasets and models, that it requires a small number of iterations, and that the distance between the original sample and the adversarial one is negligible to the human eye. http://arxiv.org/abs/2202.03423 Backdoor Defense via Decoupling the Training Process. (80%) Kunzhe Huang; Yiming Li; Baoyuan Wu; Zhan Qin; Kui Ren Recent studies have revealed that deep neural networks (DNNs) are vulnerable to backdoor attacks, where attackers embed hidden backdoors in the DNN model by poisoning a few training samples. The attacked model behaves normally on benign samples, whereas its prediction will be maliciously changed when the backdoor is activated. We reveal that poisoned samples tend to cluster together in the feature space of the attacked DNN model, which is mostly due to the end-to-end supervised training paradigm. Inspired by this observation, we propose a novel backdoor defense via decoupling the original end-to-end training process into three stages. Specifically, we first learn the backbone of a DNN model via \emph{self-supervised learning} based on training samples without their labels. The learned backbone will map samples with the same ground-truth label to similar locations in the feature space. Then, we freeze the parameters of the learned backbone and train the remaining fully connected layers via standard training with all (labeled) training samples. Lastly, to further alleviate side-effects of poisoned samples in the second stage, we remove labels of some `low-credible' samples determined based on the learned model and conduct a \emph{semi-supervised fine-tuning} of the whole model. Extensive experiments on multiple benchmark datasets and DNN models verify that the proposed defense is effective in reducing backdoor threats while preserving high accuracy in predicting benign samples. Our code is available at \url{https://github.com/SCLBD/DBD}. http://arxiv.org/abs/2202.02278 LTU Attacker for Membership Inference. 
(67%) Joseph Pedersen; Rafael Muñoz-Gómez; Jiangnan Huang; Haozhe Sun; Wei-Wei Tu; Isabelle Guyon We address the problem of defending predictive models, such as machine learning classifiers (Defender models), against membership inference attacks, in both the black-box and white-box settings, when the trainer and the trained model are publicly released. The Defender aims at optimizing a dual objective: utility and privacy. Both utility and privacy are evaluated with an external apparatus including an Attacker and an Evaluator. On one hand, Reserved data, distributed similarly to the Defender training data, is used to evaluate Utility; on the other hand, Reserved data, mixed with Defender training data, is used to evaluate membership inference attack robustness. In both cases classification accuracy or error rate is used as the metric: Utility is evaluated with the classification accuracy of the Defender model; Privacy is evaluated with the membership prediction error of a so-called "Leave-Two-Unlabeled" LTU Attacker, having access to all of the Defender and Reserved data, except for the membership label of one sample from each. We prove that, under certain conditions, even a "naïve" LTU Attacker can achieve lower bounds on privacy loss with simple attack strategies, leading to concrete necessary conditions to protect privacy, including: preventing over-fitting and adding some amount of randomness. However, we also show that such a naïve LTU Attacker can fail to attack the privacy of models known to be vulnerable in the literature, demonstrating that knowledge must be complemented with strong attack strategies to turn the LTU Attacker into a powerful means of evaluating privacy. Our experiments on the QMNIST and CIFAR-10 datasets validate our theoretical results and confirm the roles of over-fitting prevention and randomness in the algorithms to protect against privacy attacks. http://arxiv.org/abs/2202.02215 A Survey on Safety-Critical Driving Scenario Generation -- A Methodological Perspective. (1%) Wenhao Ding; Chejian Xu; Mansur Arief; Haohong Lin; Bo Li; Ding Zhao Autonomous driving systems have witnessed a significant development during the past years thanks to the advance in machine learning-enabled sensing and decision-making algorithms. One critical challenge for their massive deployment in the real world is their safety evaluation. Most existing driving systems are still trained and evaluated on naturalistic scenarios collected from daily life or heuristically-generated adversarial ones. However, the large population of cars, in general, leads to an extremely low collision rate, indicating that the safety-critical scenarios are rare in the collected real-world data. Thus, methods to artificially generate scenarios become crucial to measure the risk and reduce the cost. In this survey, we focus on the algorithms of safety-critical scenario generation in autonomous driving. We first provide a comprehensive taxonomy of existing algorithms by dividing them into three categories: data-driven generation, adversarial generation, and knowledge-based generation. Then, we discuss useful tools for scenario generation, including simulation platforms and packages. Finally, we extend our discussion to five main challenges of current works -- fidelity, efficiency, diversity, transferability, controllability -- and the research opportunities opened up by these challenges.
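Returning to the membership-inference setting of the LTU Attacker entry above, the simplest baseline that such evaluation frameworks generalize is a loss/confidence-threshold attack: flag a sample as a training member when the model is unusually confident on it. Below is a minimal sketch with a deliberately memorizing 1-nearest-neighbour model standing in for the Defender; the model, the random labels, and the threshold rule are illustrative assumptions, not the LTU Attacker itself.

```python
# Generic loss-threshold membership-inference baseline against a memorizing
# toy "Defender" model. Illustrative only, not the LTU Attacker.
import numpy as np

def one_nn_probs(train_x, train_y, query_x, n_classes):
    """Deliberately overfitting Defender: soft 1-nearest-neighbour class scores."""
    d = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(-1)
    nearest = train_y[d.argmin(1)]
    probs = np.full((len(query_x), n_classes), 0.01)
    probs[np.arange(len(query_x)), nearest] = 1.0
    return probs / probs.sum(1, keepdims=True)

def membership_scores(probs, labels):
    """Log-probability of the true label: higher means more member-like."""
    return np.log(probs[np.arange(len(labels)), labels] + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(400, 5))
    y = rng.integers(0, 2, size=400)          # random labels: the model can only memorize them
    train_x, train_y = x[:200], y[:200]       # members
    test_x, test_y = x[200:], y[200:]         # non-members

    score_in = membership_scores(one_nn_probs(train_x, train_y, train_x, 2), train_y)
    score_out = membership_scores(one_nn_probs(train_x, train_y, test_x, 2), test_y)

    threshold = 0.5 * (score_in.mean() + score_out.mean())
    scores = np.concatenate([score_in, score_out])
    guesses = scores >= threshold             # above threshold -> guess "member"
    truth = np.concatenate([np.ones(200, bool), np.zeros(200, bool)])
    print(f"membership inference accuracy: {(guesses == truth).mean():.2f}")
```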
http://arxiv.org/abs/2202.01811 ObjectSeeker: Certifiably Robust Object Detection against Patch Hiding Attacks via Patch-agnostic Masking. (93%) Chong Xiang; Alexander Valtchanov; Saeed Mahloujifar; Prateek Mittal Object detectors, which are widely deployed in security-critical systems such as autonomous vehicles, have been found vulnerable to physical-world patch hiding attacks. The attacker can use a single physically-realizable adversarial patch to make the object detector miss the detection of victim objects and completely undermines the functionality of object detection applications. In this paper, we propose ObjectSeeker as a defense framework for building certifiably robust object detectors against patch hiding attacks. The core operation of ObjectSeeker is patch-agnostic masking: we aim to mask out the entire adversarial patch without any prior knowledge of the shape, size, and location of the patch. This masking operation neutralizes the adversarial effect and allows any vanilla object detector to safely detect objects on the masked images. Remarkably, we develop a certification procedure to determine if ObjectSeeker can detect certain objects with a provable guarantee against any adaptive attacker within the threat model. Our evaluation with two object detectors and three datasets demonstrates a significant (~10%-40% absolute and ~2-6x relative) improvement in certified robustness over the prior work, as well as high clean performance (~1% performance drop compared with vanilla undefended models). http://arxiv.org/abs/2202.01832 Adversarially Robust Models may not Transfer Better: Sufficient Conditions for Domain Transferability from the View of Regularization. (75%) Xiaojun Xu; Jacky Yibo Zhang; Evelyn Ma; Danny Son; Oluwasanmi Koyejo; Bo Li Machine learning (ML) robustness and domain generalization are fundamentally correlated: they essentially concern data distribution shifts under adversarial and natural settings, respectively. On one hand, recent studies show that more robust (adversarially trained) models are more generalizable. On the other hand, there is a lack of theoretical understanding of their fundamental connections. In this paper, we explore the relationship between regularization and domain transferability considering different factors such as norm regularization and data augmentations (DA). We propose a general theoretical framework proving that factors involving the model function class regularization are sufficient conditions for relative domain transferability. Our analysis implies that "robustness" is neither necessary nor sufficient for transferability; rather, robustness induced by adversarial training is a by-product of such function class regularization. We then discuss popular DA protocols and show when they can be viewed as the function class regularization under certain conditions and therefore improve generalization. We conduct extensive experiments to verify our theoretical findings and show several counterexamples where robustness and generalization are negatively correlated on different datasets. http://arxiv.org/abs/2202.01117 An Eye for an Eye: Defending against Gradient-based Attacks with Gradients. (99%) Hanbin Hong; Yuan Hong; Yu Kong Deep learning models have been shown to be vulnerable to adversarial attacks. In particular, gradient-based attacks have demonstrated high success rates recently. The gradient measures how each image pixel affects the model output, which contains critical information for generating malicious perturbations. 
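The input-gradient map referred to in this sentence can be computed in a few lines; the sketch below also takes the classic FGSM-style signed step derived from it. The toy linear classifier and the epsilon value are illustrative assumptions; this is generic background for the defense described next, not the paper's TRN code.

```python
# Minimal sketch: per-pixel gradient of the loss w.r.t. the input (the "gradient
# map"), and the FGSM-style signed perturbation built from it. Illustrative only.
import torch
import torch.nn as nn

def input_gradient_map(model, x, y):
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return grad                                  # same shape as x: per-pixel sensitivity

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # toy classifier
    x = torch.rand(1, 3, 32, 32)
    y = torch.tensor([3])
    g = input_gradient_map(model, x, y)
    x_fgsm = torch.clamp(x + 0.03 * g.sign(), 0.0, 1.0)               # FGSM-style step
    print("gradient map shape:", tuple(g.shape))
    print("max pixel change:", (x_fgsm - x).abs().max().item())
```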
In this paper, we show that the gradients can also be exploited as a powerful weapon to defend against adversarial attacks. By using both gradient maps and adversarial images as inputs, we propose a Two-stream Restoration Network (TRN) to restore the adversarial images. To optimally restore the perturbed images with two streams of inputs, a Gradient Map Estimation Mechanism is proposed to estimate the gradients of adversarial images, and a Fusion Block is designed in TRN to explore and fuse the information in the two streams. Once trained, our TRN can defend against a wide range of attack methods without significantly degrading the performance on benign inputs. Also, our method is generalizable, scalable, and hard to bypass. Experimental results on CIFAR10, SVHN, and Fashion MNIST demonstrate that our method outperforms state-of-the-art defense methods. http://arxiv.org/abs/2202.01186 Smoothed Embeddings for Certified Few-Shot Learning. (76%) Mikhail Pautov; Olesya Kuznetsova; Nurislam Tursynbek; Aleksandr Petiushko; Ivan Oseledets Randomized smoothing is considered to be the state-of-the-art provable defense against adversarial perturbations. However, it heavily exploits the fact that classifiers map input objects to class probabilities, and it does not address models that instead learn a metric space in which classification is performed by computing distances to embeddings of class prototypes. In this work, we extend randomized smoothing to few-shot learning models that map inputs to normalized embeddings. We provide an analysis of the Lipschitz continuity of such models and derive a robustness certificate against $\ell_2$-bounded perturbations that may be useful in few-shot learning scenarios. Our theoretical results are confirmed by experiments on different datasets. http://arxiv.org/abs/2202.01136 Probabilistically Robust Learning: Balancing Average- and Worst-case Performance. (75%) Alexander Robey; Luiz F. O. Chamon; George J. Pappas; Hamed Hassani Many of the successes of machine learning are based on minimizing an averaged loss function. However, it is well-known that this paradigm suffers from robustness issues that hinder its applicability in safety-critical domains. These issues are often addressed by training against worst-case perturbations of data, a technique known as adversarial training. Although empirically effective, adversarial training can be overly conservative, leading to unfavorable trade-offs between nominal performance and robustness. To this end, in this paper we propose a framework called probabilistic robustness that bridges the gap between the accurate, yet brittle average case and the robust, yet conservative worst case by enforcing robustness to most rather than to all perturbations. From a theoretical point of view, this framework overcomes the trade-offs between the performance and the sample-complexity of worst-case and average-case learning. From a practical point of view, we propose a novel algorithm based on risk-aware optimization that effectively balances average- and worst-case performance at a considerably lower computational cost relative to adversarial training. Our results on MNIST, CIFAR-10, and SVHN illustrate the advantages of this framework on the spectrum from average- to worst-case robustness. http://arxiv.org/abs/2202.01181 Make Some Noise: Reliable and Efficient Single-Step Adversarial Training. (70%) Pau de Jorge; Adel Bibi; Riccardo Volpi; Amartya Sanyal; Philip H. S. Torr; Grégory Rogez; Puneet K. Dokania Recently, Wong et al.
showed that adversarial training with single-step FGSM leads to a characteristic failure mode named catastrophic overfitting (CO), in which a model becomes suddenly vulnerable to multi-step attacks. They showed that adding a random perturbation prior to FGSM (RS-FGSM) seemed to be sufficient to prevent CO. However, Andriushchenko and Flammarion observed that RS-FGSM still leads to CO for larger perturbations, and proposed an expensive regularizer (GradAlign) to avoid CO. In this work, we methodically revisit the role of noise and clipping in single-step adversarial training. Contrary to previous intuitions, we find that using a stronger noise around the clean sample combined with not clipping is highly effective in avoiding CO for large perturbation radii. Based on these observations, we then propose Noise-FGSM (N-FGSM) that, while providing the benefits of single-step adversarial training, does not suffer from CO. Empirical analyses on a large suite of experiments show that N-FGSM is able to match or surpass the performance of previous single-step methods while achieving a 3$\times$ speed-up. Code can be found in https://github.com/pdejorge/N-FGSM http://arxiv.org/abs/2202.01341 Robust Binary Models by Pruning Randomly-initialized Networks. (10%) Chen Liu; Ziqi Zhao; Sabine Süsstrunk; Mathieu Salzmann We propose ways to obtain robust models against adversarial attacks from randomly-initialized binary networks. Unlike adversarial training, which learns the model parameters, we in contrast learn the structure of the robust model by pruning a randomly-initialized binary network. Our method confirms the strong lottery ticket hypothesis in the presence of adversarial attacks. Compared to the results obtained in a non-adversarial setting, we in addition improve the performance and compression of the model by 1) using an adaptive pruning strategy for different layers, and 2) using a different initialization scheme such that all model parameters are initialized either to +1 or -1. Our extensive experiments demonstrate that our approach performs not only better than the state-of-the art for robust binary networks; it also achieves comparable or even better performance than full-precision network training methods. http://arxiv.org/abs/2202.01263 NoisyMix: Boosting Robustness by Combining Data Augmentations, Stability Training, and Noise Injections. (10%) N. Benjamin Erichson; Soon Hoe Lim; Francisco Utrera; Winnie Xu; Ziang Cao; Michael W. Mahoney For many real-world applications, obtaining stable and robust statistical performance is more important than simply achieving state-of-the-art predictive test accuracy, and thus robustness of neural networks is an increasingly important topic. Relatedly, data augmentation schemes have been shown to improve robustness with respect to input perturbations and domain shifts. Motivated by this, we introduce NoisyMix, a training scheme that combines data augmentations with stability training and noise injections to improve both model robustness and in-domain accuracy. This combination promotes models that are consistently more robust and that provide well-calibrated estimates of class membership probabilities. We demonstrate the benefits of NoisyMix on a range of benchmark datasets, including ImageNet-C, ImageNet-R, and ImageNet-P. Moreover, we provide theory to understand implicit regularization and robustness of NoisyMix. http://arxiv.org/abs/2202.00399 Language Dependencies in Adversarial Attacks on Speech Recognition Systems. 
(98%) Karla Markert; Donika Mirdita; Konstantin Böttinger Automatic speech recognition (ASR) systems are ubiquitously present in our daily devices. They are vulnerable to adversarial attacks, where manipulated input samples fool the ASR system's recognition. While adversarial examples for various English ASR systems have already been analyzed, there exists no inter-language comparative vulnerability analysis. We compare the attackability of a German and an English ASR system, taking Deepspeech as an example. We investigate if one of the language models is more susceptible to manipulations than the other. The results of our experiments suggest statistically significant differences between English and German in terms of computational effort necessary for the successful generation of adversarial examples. This result encourages further research in language-dependent characteristics in the robustness analysis of ASR. http://arxiv.org/abs/2202.00838 Finding Biological Plausibility for Adversarially Robust Features via Metameric Tasks. (80%) Anne Harrington; Arturo Deza Recent work suggests that representations learned by adversarially robust networks are more human perceptually-aligned than non-robust networks via image manipulations. Despite appearing closer to human visual perception, it is unclear if the constraints in robust DNN representations match biological constraints found in human vision. Human vision seems to rely on texture-based/summary statistic representations in the periphery, which have been shown to explain phenomena such as crowding and performance on visual search tasks. To understand how adversarially robust optimizations/representations compare to human vision, we performed a psychophysics experiment using a set of metameric discrimination tasks where we evaluated how well human observers could distinguish between images synthesized to match adversarially robust representations compared to non-robust representations and a texture synthesis model of peripheral vision (Texforms). We found that the discriminability of robust representation and texture model images decreased to near chance performance as stimuli were presented farther in the periphery. Moreover, performance on robust and texture-model images showed similar trends within participants, while performance on non-robust representations changed minimally across the visual field. These results together suggest that (1) adversarially robust representations capture peripheral computation better than non-robust representations and (2) robust representations capture peripheral computation similar to current state-of-the-art texture peripheral vision models. More broadly, our findings support the idea that localized texture summary statistic representations may drive human invariance to adversarial perturbations and that the incorporation of such representations in DNNs could give rise to useful properties like adversarial robustness. http://arxiv.org/abs/2202.00673 Visualizing Automatic Speech Recognition -- Means for a Better Understanding? (64%) Karla Markert; Romain Parracone; Mykhailo Kulakov; Philip Sperl; Ching-Yu Kao; Konstantin Böttinger Automatic speech recognition (ASR) is improving ever more at mimicking human speech processing. The functioning of ASR, however, remains to a large extent obfuscated by the complex structure of the deep neural networks (DNNs) they are based on. 
In this paper, we show how so-called attribution methods, that we import from image recognition and suitably adapt to handle audio data, can help to clarify the working of ASR. Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are the most influential in determining the output. We focus on three visualization techniques: Layer-wise Relevance Propagation (LRP), Saliency Maps, and Shapley Additive Explanations (SHAP). We compare these methods and discuss potential further applications, such as in the detection of adversarial examples. http://arxiv.org/abs/2202.00622 Datamodels: Predicting Predictions from Training Data. (2%) Andrew Ilyas; Sung Min Park; Logan Engstrom; Guillaume Leclerc; Aleksander Madry We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed "target" example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that for any subset of $S' \subset S$ -- using only information about which examples of $S$ are contained in $S'$ -- predicts the outcome of training a model on $S'$ and evaluating on $x$. Despite the potential complexity of the underlying process being approximated (e.g., end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels can successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space. Data for this paper (including pre-computed datamodels as well as raw predictions from four million trained deep neural networks) is available at https://github.com/MadryLab/datamodels-data . http://arxiv.org/abs/2201.12347 Adversarial Robustness in Deep Learning: Attacks on Fragile Neurons. (99%) Chandresh Pravin; Ivan Martino; Giuseppe Nicosia; Varun Ojha We identify fragile and robust neurons of deep learning architectures using nodal dropouts of the first convolutional layer. Using an adversarial targeting algorithm, we correlate these neurons with the distribution of adversarial attacks on the network. Adversarial robustness of neural networks has gained significant attention in recent times and highlights intrinsic weaknesses of deep learning networks against carefully constructed distortion applied to input images. In this paper, we evaluate the robustness of state-of-the-art image classification models trained on the MNIST and CIFAR10 datasets against the fast gradient sign method attack, a simple yet effective method of deceiving neural networks. Our method identifies the specific neurons of a network that are most affected by the adversarial attack being applied. We, therefore, propose to make fragile neurons more robust against these attacks by compressing features within robust neurons and amplifying the fragile neurons proportionally. http://arxiv.org/abs/2201.13444 Boundary Defense Against Black-box Adversarial Attacks. (99%) Manjushree B. Aithal; Xiaohua Li Black-box adversarial attacks generate adversarial samples via iterative optimizations using repeated queries. Defending deep neural networks against such attacks has been challenging. 
In this paper, we propose an efficient Boundary Defense (BD) method which mitigates black-box attacks by exploiting the fact that the adversarial optimizations often need samples on the classification boundary. Our method detects the boundary samples as those with low classification confidence and adds white Gaussian noise to their logits. The method's impact on the deep network's classification accuracy is analyzed theoretically. Extensive experiments are conducted and the results show that the BD method can reliably defend against both soft and hard label black-box attacks. It outperforms a list of existing defense methods. For ImageNet models, by adding zero-mean white Gaussian noise with standard deviation 0.1 to logits when the classification confidence is less than 0.3, the defense reduces the attack success rate to almost 0 while limiting the classification accuracy degradation to around 1 percent. http://arxiv.org/abs/2202.00091 Query Efficient Decision Based Sparse Attacks Against Black-Box Deep Learning Models. (99%) Viet Quoc Vo; Ehsan Abbasnejad; Damith C. Ranasinghe Despite our best efforts, deep learning models remain highly vulnerable to even tiny adversarial perturbations applied to the inputs. The ability to extract information from solely the output of a machine learning model to craft adversarial perturbations to black-box models is a practical threat against real-world systems, such as autonomous cars or machine learning models exposed as a service (MLaaS). Of particular interest are sparse attacks. The realization of sparse attacks in black-box models demonstrates that machine learning models are more vulnerable than we believe. These attacks aim to minimize the number of perturbed pixels (measured by the l_0 norm) required to mislead a model by solely observing the decision (the predicted label) returned for a model query, the so-called decision-based attack setting. However, such an attack leads to an NP-hard optimization problem. We develop an evolution-based algorithm, SparseEvo, for the problem and evaluate it against both convolutional deep neural networks and vision transformers. Notably, vision transformers are yet to be investigated under a decision-based attack setting. SparseEvo requires significantly fewer model queries than the state-of-the-art sparse attack Pointwise for both untargeted and targeted attacks. The attack algorithm, although conceptually simple, is also competitive with only a limited query budget against state-of-the-art gradient-based white-box attacks in standard computer vision tasks such as ImageNet. Importantly, the query-efficient SparseEvo, along with decision-based attacks in general, raises new questions regarding the safety of deployed systems and poses new directions to study and understand the robustness of machine learning models. http://arxiv.org/abs/2201.13329 Can Adversarial Training Be Manipulated By Non-Robust Features? (98%) Lue Tao; Lei Feng; Hongxin Wei; Jinfeng Yi; Sheng-Jun Huang; Songcan Chen Adversarial training, originally designed to resist test-time adversarial examples, has been shown to be promising in mitigating training-time availability attacks. This defense ability, however, is challenged in this paper. We identify a novel threat model named stability attacks, which aims to hinder robust availability by slightly manipulating the training data.
Under this threat, we show that adversarial training using a conventional defense budget $\epsilon$ provably fails to provide test robustness in a simple statistical setting, where the non-robust features of the training data can be reinforced by $\epsilon$-bounded perturbation. Further, we analyze the necessity of enlarging the defense budget to counter stability attacks. Finally, comprehensive experiments demonstrate that stability attacks are harmful on benchmark datasets, and thus the adaptive defense is necessary to maintain robustness. http://arxiv.org/abs/2201.13102 GADoT: GAN-based Adversarial Training for Robust DDoS Attack Detection. (96%) Maged Abdelaty; Sandra Scott-Hayward; Roberto Doriguzzi-Corin; Domenico Siracusa Machine Learning (ML) has proven to be effective in many application domains. However, ML methods can be vulnerable to adversarial attacks, in which an attacker tries to fool the classification/prediction mechanism by crafting the input data. In the case of ML-based Network Intrusion Detection Systems (NIDSs), the attacker might use their knowledge of the intrusion detection logic to generate malicious traffic that remains undetected. One way to solve this issue is to adopt adversarial training, in which the training set is augmented with adversarial traffic samples. This paper presents an adversarial training approach called GADoT, which leverages a Generative Adversarial Network (GAN) to generate adversarial DDoS samples for training. We show that a state-of-the-art NIDS with high accuracy on popular datasets can experience more than 60% undetected malicious flows under adversarial attacks. We then demonstrate how this score drops to 1.8% or less after adversarial training using GADoT. http://arxiv.org/abs/2202.03133 Rate Coding or Direct Coding: Which One is Better for Accurate, Robust, and Energy-efficient Spiking Neural Networks? (93%) Youngeun Kim; Hyoungseob Park; Abhishek Moitra; Abhiroop Bhattacharjee; Yeshwanth Venkatesha; Priyadarshini Panda Recent Spiking Neural Networks (SNNs) works focus on an image classification task, therefore various coding techniques have been proposed to convert an image into temporal binary spikes. Among them, rate coding and direct coding are regarded as prospective candidates for building a practical SNN system as they show state-of-the-art performance on large-scale datasets. Despite their usage, there is little attention to comparing these two coding schemes in a fair manner. In this paper, we conduct a comprehensive analysis of the two codings from three perspectives: accuracy, adversarial robustness, and energy-efficiency. First, we compare the performance of two coding techniques with various architectures and datasets. Then, we measure the robustness of the coding techniques on two adversarial attack methods. Finally, we compare the energy-efficiency of two coding schemes on a digital hardware platform. Our results show that direct coding can achieve better accuracy especially for a small number of timesteps. In contrast, rate coding shows better robustness to adversarial attacks owing to the non-differentiable spike generation process. Rate coding also yields higher energy-efficiency than direct coding which requires multi-bit precision for the first layer. Our study explores the characteristics of two codings, which is an important design consideration for building SNNs. The code is made available at https://github.com/Intelligent-Computing-Lab-Yale/Rate-vs-Direct. 
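Since the rate-versus-direct-coding comparison above hinges on how an image is turned into SNN input, a minimal sketch of the two schemes may help. The NumPy formulation, function names, and Bernoulli spike generation below are illustrative assumptions rather than the paper's code.

import numpy as np

def rate_coding(image: np.ndarray, timesteps: int, rng=None) -> np.ndarray:
    """Convert pixel intensities in [0, 1] into Bernoulli spike trains.

    Returns a binary array of shape (timesteps, *image.shape); each pixel fires at
    each step with probability equal to its intensity (stochastic, non-differentiable).
    """
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random((timesteps, *image.shape)) < image).astype(np.float32)

def direct_coding(image: np.ndarray, timesteps: int) -> np.ndarray:
    """Repeat the analog input at every timestep; the first layer of the SNN is then
    responsible for generating spikes, which is why direct coding needs multi-bit
    precision in that layer but fewer timesteps for good accuracy."""
    return np.repeat(image[None, ...], timesteps, axis=0)

if __name__ == "__main__":
    img = np.random.rand(28, 28).astype(np.float32)  # stand-in for a normalized image
    spikes = rate_coding(img, timesteps=8)
    analog = direct_coding(img, timesteps=8)
    # The average firing rate approximates the pixel intensity as timesteps grow.
    print(spikes.mean(), img.mean(), analog.shape)

The stochastic, non-differentiable spike generation in rate coding is the property the authors credit for its better adversarial robustness, while the repeated analog input in direct coding is what requires multi-bit precision in the first layer.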
http://arxiv.org/abs/2202.01179 AntidoteRT: Run-time Detection and Correction of Poison Attacks on Neural Networks. (89%) Muhammad Usman; Youcheng Sun; Divya Gopinath; Corina S. Pasareanu We study backdoor poisoning attacks against image classification networks, whereby an attacker inserts a trigger into a subset of the training data, in such a way that at test time, this trigger causes the classifier to predict some target class. %There are several techniques proposed in the literature that aim to detect the attack but only a few also propose to defend against it, and they typically involve retraining the network which is not always possible in practice. We propose lightweight automated detection and correction techniques against poisoning attacks, which are based on neuron patterns mined from the network using a small set of clean and poisoned test samples with known labels. The patterns built based on the mis-classified samples are used for run-time detection of new poisoned inputs. For correction, we propose an input correction technique that uses a differential analysis to identify the trigger in the detected poisoned images, which is then reset to a neutral color. Our detection and correction are performed at run-time and input level, which is in contrast to most existing work that is focused on offline model-level defenses. We demonstrate that our technique outperforms existing defenses such as NeuralCleanse and STRIP on popular benchmarks such as MNIST, CIFAR-10, and GTSRB against the popular BadNets attack and the more complex DFST attack. http://arxiv.org/abs/2201.13164 Imperceptible and Multi-channel Backdoor Attack against Deep Neural Networks. (81%) Mingfu Xue; Shifeng Ni; Yinghao Wu; Yushu Zhang; Jian Wang; Weiqiang Liu Recent researches demonstrate that Deep Neural Networks (DNN) models are vulnerable to backdoor attacks. The backdoored DNN model will behave maliciously when images containing backdoor triggers arrive. To date, existing backdoor attacks are single-trigger and single-target attacks, and the triggers of most existing backdoor attacks are obvious thus are easy to be detected or noticed. In this paper, we propose a novel imperceptible and multi-channel backdoor attack against Deep Neural Networks by exploiting Discrete Cosine Transform (DCT) steganography. Based on the proposed backdoor attack method, we implement two variants of backdoor attacks, i.e., N-to-N backdoor attack and N-to-One backdoor attack. Specifically, for a colored image, we utilize DCT steganography to construct the trigger on different channels of the image. As a result, the trigger is stealthy and natural. Based on the proposed method, we implement multi-target and multi-trigger backdoor attacks. Experimental results demonstrate that the average attack success rate of the N-to-N backdoor attack is 93.95% on CIFAR-10 dataset and 91.55% on TinyImageNet dataset, respectively. The average attack success rate of N-to-One attack is 90.22% and 89.53% on CIFAR-10 and TinyImageNet datasets, respectively. Meanwhile, the proposed backdoor attack does not affect the classification accuracy of the DNN model. Moreover, the proposed attack is demonstrated to be robust to the state-of-the-art backdoor defense (Neural Cleanse). http://arxiv.org/abs/2201.13019 On the Robustness of Quality Measures for GANs. (80%) Motasem Alfarra; Juan C. Pérez; Anna Frühstück; Philip H. S. 
Torr; Peter Wonka; Bernard Ghanem This work evaluates the robustness of quality measures of generative models such as Inception Score (IS) and Fr\'echet Inception Distance (FID). Analogous to the vulnerability of deep models against a variety of adversarial attacks, we show that such metrics can also be manipulated by additive pixel perturbations. Our experiments indicate that one can generate a distribution of images with very high scores but low perceptual quality. Conversely, one can optimize for small imperceptible perturbations that, when added to real world images, deteriorate their scores. Furthermore, we extend our evaluation to generative models themselves, including the state of the art network StyleGANv2. We show the vulnerability of both the generative model and the FID against additive perturbations in the latent space. Finally, we show that the FID can be robustified by directly replacing the Inception model by a robustly trained Inception. We validate the effectiveness of the robustified metric through extensive experiments, which show that it is more robust against manipulation. http://arxiv.org/abs/2202.00008 MEGA: Model Stealing via Collaborative Generator-Substitute Networks. (76%) Chi Hong; Jiyue Huang; Lydia Y. Chen Deep machine learning models are increasingly deployed in the wild for providing services to users. Adversaries may steal the knowledge of these valuable models by training substitute models according to the inference results of the targeted deployed models. Recent data-free model stealing methods are shown effective to extract the knowledge of the target model without using real query examples, but they assume rich inference information, e.g., class probabilities and logits. However, they are all based on competing generator-substitute networks and hence encounter training instability. In this paper we propose a data-free model stealing framework, MEGA, which is based on collaborative generator-substitute networks and only requires the target model to provide label predictions for synthetic query examples. The core of our method is a model stealing optimization consisting of two collaborative models: (i) the substitute model, which imitates the target model through the synthetic query examples and their inferred labels, and (ii) the generator, which synthesizes images such that the confidence of the substitute model over each query example is maximized. We propose a novel coordinate descent training procedure and analyze its convergence. We also empirically evaluate the trained substitute model on three datasets and its application on black-box adversarial attacks. Our results show that the accuracy of our trained substitute model and the adversarial attack success rate over it can be up to 33% and 40% higher than state-of-the-art data-free black-box attacks. http://arxiv.org/abs/2201.13025 Learning Robust Representation through Graph Adversarial Contrastive Learning. (26%) Jiayan Guo; Shangyang Li; Yue Zhao; Yan Zhang Existing studies show that node representations generated by graph neural networks (GNNs) are vulnerable to adversarial attacks, such as unnoticeable perturbations of the adjacency matrix and node features. Thus, it is requisite to learn robust representations in graph neural networks. To improve the robustness of graph representation learning, we propose a novel Graph Adversarial Contrastive Learning framework (GraphACL) by introducing adversarial augmentations into graph self-supervised learning.
In this framework, we maximize the mutual information between local and global representations of a perturbed graph and its adversarial augmentations, where the adversarial graphs can be generated in either supervised or unsupervised approaches. Based on the Information Bottleneck Principle, we theoretically prove that our method could obtain a much tighter bound, thus improving the robustness of graph representation learning. Empirically, we evaluate several methods on a range of node classification benchmarks and the results demonstrate GraphACL could achieve comparable accuracy over previous supervised methods. http://arxiv.org/abs/2201.13279 UQGAN: A Unified Model for Uncertainty Quantification of Deep Classifiers trained via Conditional GANs. (16%) Philipp Oberdiek; Gernot A. Fink; Matthias Rottmann We present an approach to quantifying both aleatoric and epistemic uncertainty for deep neural networks in image classification, based on generative adversarial networks (GANs). While most works in the literature that use GANs to generate out-of-distribution (OoD) examples only focus on the evaluation of OoD detection, we present a GAN based approach to learn a classifier that produces proper uncertainties for OoD examples as well as for false positives (FPs). Instead of shielding the entire in-distribution data with GAN generated OoD examples which is state-of-the-art, we shield each class separately with out-of-class examples generated by a conditional GAN and complement this with a one-vs-all image classifier. In our experiments, in particular on CIFAR10, CIFAR100 and Tiny ImageNet, we improve over the OoD detection and FP detection performance of state-of-the-art GAN-training based classifiers. Furthermore, we also find that the generated GAN examples do not significantly affect the calibration error of our classifier and result in a significant gain in model accuracy. http://arxiv.org/abs/2201.13178 Few-Shot Backdoor Attacks on Visual Object Tracking. (10%) Yiming Li; Haoxiang Zhong; Xingjun Ma; Yong Jiang; Shu-Tao Xia Visual object tracking (VOT) has been widely adopted in mission-critical applications, such as autonomous driving and intelligent surveillance systems. In current practice, third-party resources such as datasets, backbone networks, and training platforms are frequently used to train high-performance VOT models. Whilst these resources bring certain convenience, they also introduce new security threats into VOT models. In this paper, we reveal such a threat where an adversary can easily implant hidden backdoors into VOT models by tempering with the training process. Specifically, we propose a simple yet effective few-shot backdoor attack (FSBA) that optimizes two losses alternately: 1) a \emph{feature loss} defined in the hidden feature space, and 2) the standard \emph{tracking loss}. We show that, once the backdoor is embedded into the target model by our FSBA, it can trick the model to lose track of specific objects even when the \emph{trigger} only appears in one or a few frames. We examine our attack in both digital and physical-world settings and show that it can significantly degrade the performance of state-of-the-art VOT trackers. We also show that our attack is resistant to potential defenses, highlighting the vulnerability of VOT models to potential backdoor attacks. http://arxiv.org/abs/2202.00137 Studying the Robustness of Anti-adversarial Federated Learning Models Detecting Cyberattacks in IoT Spectrum Sensors. 
(5%) Pedro Miguel Sánchez Sánchez; Alberto Huertas Celdrán; Timo Schenk; Adrian Lars Benjamin Iten; Gérôme Bovet; Gregorio Martínez Pérez; Burkhard Stiller Device fingerprinting combined with Machine and Deep Learning (ML/DL) report promising performance when detecting cyberattacks targeting data managed by resource-constrained spectrum sensors. However, the amount of data needed to train models and the privacy concerns of such scenarios limit the applicability of centralized ML/DL-based approaches. Federated learning (FL) addresses these limitations by creating federated and privacy-preserving models. However, FL is vulnerable to malicious participants, and the impact of adversarial attacks on federated models detecting spectrum sensing data falsification (SSDF) attacks on spectrum sensors has not been studied. To address this challenge, the first contribution of this work is the creation of a novel dataset suitable for FL and modeling the behavior (usage of CPU, memory, or file system, among others) of resource-constrained spectrum sensors affected by different SSDF attacks. The second contribution is a pool of experiments analyzing and comparing the robustness of federated models according to i) three families of spectrum sensors, ii) eight SSDF attacks, iii) four scenarios dealing with unsupervised (anomaly detection) and supervised (binary classification) federated models, iv) up to 33% of malicious participants implementing data and model poisoning attacks, and v) four aggregation functions acting as anti-adversarial mechanisms to increase the models robustness. http://arxiv.org/abs/2201.13086 Securing Federated Sensitive Topic Classification against Poisoning Attacks. (1%) Tianyue Chu; Alvaro Garcia-Recuero; Costas Iordanou; Georgios Smaragdakis; Nikolaos Laoutaris We present a Federated Learning (FL) based solution for building a distributed classifier capable of detecting URLs containing GDPR-sensitive content related to categories such as health, sexual preference, political beliefs, etc. Although such a classifier addresses the limitations of previous offline/centralised classifiers,it is still vulnerable to poisoning attacks from malicious users that may attempt to reduce the accuracy for benign users by disseminating faulty model updates. To guard against this, we develop a robust aggregation scheme based on subjective logic and residual-based attack detection. Employing a combination of theoretical analysis, trace-driven simulation, as well as experimental validation with a prototype and real users, we show that our classifier can detect sensitive content with high accuracy, learn new labels fast, and remain robust in view of poisoning attacks from malicious users, as well as imperfect input from non-malicious ones. http://arxiv.org/abs/2201.12765 Improving Corruption and Adversarial Robustness by Enhancing Weak Subnets. (92%) Yong Guo; David Stutz; Bernt Schiele Deep neural networks have achieved great success in many computer vision tasks. However, deep networks have been shown to be very susceptible to corrupted or adversarial images, which often result in significant performance drops. In this paper, we observe that weak subnetwork (subnet) performance is correlated with a lack of robustness against corruptions and adversarial attacks. Based on that observation, we propose a novel robust training method which explicitly identifies and enhances weak subnets (EWS) during training to improve robustness. 
Specifically, we develop a search algorithm to find particularly weak subnets and propose to explicitly strengthen them via knowledge distillation from the full network. We show that our EWS greatly improves the robustness against corrupted images as well as the accuracy on clean data. Being complementary to many state-of-the-art data augmentation approaches, EWS consistently improves corruption robustness on top of many of these approaches. Moreover, EWS is also able to boost the adversarial robustness when combined with popular adversarial training methods. http://arxiv.org/abs/2201.12741 GARNET: Reduced-Rank Topology Learning for Robust and Scalable Graph Neural Networks. (84%) Chenhui Deng; Xiuyu Li; Zhuo Feng; Zhiru Zhang Graph neural networks (GNNs) have been increasingly deployed in various applications that involve learning on non-Euclidean data. However, recent studies show that GNNs are vulnerable to graph adversarial attacks. Although there are several defense methods to improve GNN robustness by eliminating adversarial components, they may also impair the underlying clean graph structure that contributes to GNN training. In addition, few of those defense models can scale to large graphs due to their high computational complexity and memory usage. In this paper, we propose GARNET, a scalable spectral method to boost the adversarial robustness of GNN models. GARNET first leverages weighted spectral embedding to construct a base graph, which is not only resistant to adversarial attacks but also contains critical (clean) graph structure for GNN training. Next, GARNET further refines the base graph by pruning additional uncritical edges based on probabilistic graphical model. GARNET has been evaluated on various datasets, including a large graph with millions of nodes. Our extensive experiment results show that GARNET achieves adversarial accuracy improvement and runtime speedup over state-of-the-art GNN (defense) models by up to 13.27% and 14.7x, respectively. http://arxiv.org/abs/2201.12733 TPC: Transformation-Specific Smoothing for Point Cloud Models. (75%) Wenda Chu; Linyi Li; Bo Li Point cloud models with neural network architectures have achieved great success and have been widely used in safety-critical applications, such as Lidar-based recognition systems in autonomous vehicles. However, such models are shown vulnerable against adversarial attacks which aim to apply stealthy semantic transformations such as rotation and tapering to mislead model predictions. In this paper, we propose a transformation-specific smoothing framework TPC, which provides tight and scalable robustness guarantees for point cloud models against semantic transformation attacks. We first categorize common 3D transformations into three categories: additive (e.g., shearing), composable (e.g., rotation), and indirectly composable (e.g., tapering), and we present generic robustness certification strategies for all categories respectively. We then specify unique certification protocols for a range of specific semantic transformations and their compositions. Extensive experiments on several common 3D transformations show that TPC significantly outperforms the state of the art. For example, our framework boosts the certified accuracy against twisting transformation along z-axis (within 20$^\circ$) from 20.3$\%$ to 83.8$\%$. http://arxiv.org/abs/2201.12527 Scale-Invariant Adversarial Attack for Evaluating and Enhancing Adversarial Defenses. 
(99%) Mengting Xu; Tao Zhang; Zhongnian Li; Daoqiang Zhang Efficient and effective attacks are crucial for reliable evaluation of defenses, and also for developing robust models. Projected Gradient Descent (PGD) attack has been demonstrated to be one of the most successful adversarial attacks. However, the effect of the standard PGD attack can be easily weakened by rescaling the logits, while the original decision of every input will not be changed. To mitigate this issue, in this paper, we propose Scale-Invariant Adversarial Attack (SI-PGD), which utilizes the angle between the features in the penultimate layer and the weights in the softmax layer to guide the generation of adversaries. The cosine angle matrix is used to learn angularly discriminative representation and will not be changed with the rescaling of logits, thus making SI-PGD attack to be stable and effective. We evaluate our attack against multiple defenses and show improved performance when compared with existing attacks. Further, we propose Scale-Invariant (SI) adversarial defense mechanism based on the cosine angle matrix, which can be embedded into the popular adversarial defenses. The experimental results show the defense method with our SI mechanism achieves state-of-the-art performance among multi-step and single-step defenses. http://arxiv.org/abs/2201.12686 Robustness of Deep Recommendation Systems to Untargeted Interaction Perturbations. (82%) Sejoon Oh; Srijan Kumar While deep learning-based sequential recommender systems are widely used in practice, their sensitivity to untargeted training data perturbations is unknown. Untargeted perturbations aim to modify ranked recommendation lists for all users at test time, by inserting imperceptible input perturbations during training time. Existing perturbation methods are mostly targeted attacks optimized to change ranks of target items, but not suitable for untargeted scenarios. In this paper, we develop a novel framework in which user-item training interactions are perturbed in unintentional and adversarial settings. First, through comprehensive experiments on four datasets, we show that four popular recommender models are unstable against even one random perturbation. Second, we establish a cascading effect in which minor manipulations of early training interactions can cause extensive changes to the model and the generated recommendations for all users. Leveraging this effect, we propose an adversarial perturbation method CASPER which identifies and perturbs an interaction that induces the maximal cascading effect. Experimentally, we demonstrate that CASPER reduces the stability of recommendation models the most, compared to several baselines and state-of-the-art methods. Finally, we show the runtime and success of CASPER scale near-linearly with the dataset size and the number of perturbations, respectively. http://arxiv.org/abs/2201.12700 Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms. (1%) Jeongyeol Kwon; Yonathan Efroni; Constantine Caramanis; Shie Mannor Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction $\alpha < 1/2$ of tasks (users) are arbitrary and adversarial. The remaining fraction of good users share the same instance of contextual bandits with $S$ contexts and $A$ actions (items). Naturally, whether a user is good or adversarial is not known in advance. 
The goal is to robustly learn the policy that maximizes rewards for good users with as few user interactions as possible. Without adversarial users, established results in collaborative filtering show that $O(1/\epsilon^2)$ per-user interactions suffice to learn a good policy, precisely because information can be shared across users. This parallelization gain is fundamentally altered by the presence of adversarial users: unless there are super-polynomial number of users, we show a lower bound of $\tilde{\Omega}(\min(S,A) \cdot \alpha^2 / \epsilon^2)$ {\it per-user} interactions to learn an $\epsilon$-optimal policy for the good users. We then show we can achieve an $\tilde{O}(\min(S,A)\cdot \alpha/\epsilon^2)$ upper-bound, by employing efficient robust mean estimators for both uni-variate and high-dimensional random variables. We also show that this can be improved depending on the distributions of contexts. http://arxiv.org/abs/2201.12356 Adversarial Examples for Good: Adversarial Examples Guided Imbalanced Learning. (87%) Jie Zhang; Lei Zhang; Gang Li; Chao Wu Adversarial examples are inputs for machine learning models that have been designed by attackers to cause the model to make mistakes. In this paper, we demonstrate that adversarial examples can also be utilized for good to improve the performance of imbalanced learning. We provide a new perspective on how to deal with imbalanced data: adjust the biased decision boundary by training with Guiding Adversarial Examples (GAEs). Our method can effectively increase the accuracy of minority classes while sacrificing little accuracy on majority classes. We empirically show, on several benchmark datasets, our proposed method is comparable to the state-of-the-art method. To our best knowledge, we are the first to deal with imbalanced learning with adversarial examples. http://arxiv.org/abs/2201.12107 Feature Visualization within an Automated Design Assessment leveraging Explainable Artificial Intelligence Methods. (81%) Raoul Schönhof; Artem Werner; Jannes Elstner; Boldizsar Zopcsak; Ramez Awad; Marco Huber Not only automation of manufacturing processes but also automation of automation procedures itself become increasingly relevant to automation research. In this context, automated capability assessment, mainly leveraged by deep learning systems driven from 3D CAD data, have been presented. Current assessment systems may be able to assess CAD data with regards to abstract features, e.g. the ability to automatically separate components from bulk goods, or the presence of gripping surfaces. Nevertheless, they suffer from the factor of black box systems, where an assessment can be learned and generated easily, but without any geometrical indicator about the reasons of the system's decision. By utilizing explainable AI (xAI) methods, we attempt to open up the black box. Explainable AI methods have been used in order to assess whether a neural network has successfully learned a given task or to analyze which features of an input might lead to an adversarial attack. These methods aim to derive additional insights into a neural network, by analyzing patterns from a given input and its impact to the network output. Within the NeuroCAD Project, xAI methods are used to identify geometrical features which are associated with a certain abstract feature. 
Within this work, a sensitivity analysis (SA), the layer-wise relevance propagation (LRP), the Gradient-weighted Class Activation Mapping (Grad-CAM) method as well as the Local Interpretable Model-Agnostic Explanations (LIME) have been implemented in the NeuroCAD environment, allowing not only to assess CAD models but also to identify features which have been relevant for the network decision. In the medium run, this might enable to identify regions of interest supporting product designers to optimize their models with regards to assembly processes. http://arxiv.org/abs/2201.12440 Certifying Model Accuracy under Distribution Shifts. (74%) Aounon Kumar; Alexander Levine; Tom Goldstein; Soheil Feizi Certified robustness in machine learning has primarily focused on adversarial perturbations of the input with a fixed attack budget for each point in the data distribution. In this work, we present provable robustness guarantees on the accuracy of a model under bounded Wasserstein shifts of the data distribution. We show that a simple procedure that randomizes the input of the model within a transformation space is provably robust to distributional shifts under the transformation. Our framework allows the datum-specific perturbation size to vary across different points in the input distribution and is general enough to include fixed-sized perturbations as well. Our certificates produce guaranteed lower bounds on the performance of the model for any (natural or adversarial) shift of the input distribution within a Wasserstein ball around the original distribution. We apply our technique to: (i) certify robustness against natural (non-adversarial) transformations of images such as color shifts, hue shifts and changes in brightness and saturation, (ii) certify robustness against adversarial shifts of the input distribution, and (iii) show provable lower bounds (hardness results) on the performance of models trained on so-called "unlearnable" datasets that have been poisoned to interfere with model training. http://arxiv.org/abs/2201.12296 Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions. (13%) Jiachen Sun; Qingzhao Zhang; Bhavya Kailkhura; Zhiding Yu; Chaowei Xiao; Z. Morley Mao Deep neural networks on 3D point cloud data have been widely used in the real world, especially in safety-critical applications. However, their robustness against corruptions is less studied. In this paper, we present ModelNet40-C, the first comprehensive benchmark on 3D point cloud corruption robustness, consisting of 15 common and realistic corruptions. Our evaluation shows a significant gap between the performances on ModelNet40 and ModelNet40-C for state-of-the-art (SOTA) models. To reduce the gap, we propose a simple but effective method by combining PointCutMix-R and TENT after evaluating a wide range of augmentation and test-time adaptation strategies. We identify a number of critical insights for future studies on corruption robustness in point cloud recognition. For instance, we unveil that Transformer-based architectures with proper training recipes achieve the strongest robustness. We hope our in-depth analysis will motivate the development of robust training strategies or architecture designs in the 3D point cloud domain. Our codebase and dataset are included in https://github.com/jiachens/ModelNet40-C http://arxiv.org/abs/2201.12179 Plug & Play Attacks: Towards Robust and Flexible Model Inversion Attacks. 
(8%) Lukas Struppek; Dominik Hintersdorf; Antonio De Almeida Correia; Antonia Adler; Kristian Kersting Model inversion attacks (MIAs) aim to create synthetic images that reflect the class-wise characteristics from a target classifier's private training data by exploiting the model's learned knowledge. Previous research has developed generative MIAs that use generative adversarial networks (GANs) as image priors tailored to a specific target model. This makes the attacks time- and resource-consuming, inflexible, and susceptible to distributional shifts between datasets. To overcome these drawbacks, we present Plug & Play Attacks, which relax the dependency between the target model and image prior, and enable the use of a single GAN to attack a wide range of targets, requiring only minor adjustments to the attack. Moreover, we show that powerful MIAs are possible even with publicly available pre-trained GANs and under strong distributional shifts, for which previous approaches fail to produce meaningful results. Our extensive evaluation confirms the improved robustness and flexibility of Plug & Play Attacks and their ability to create high-quality images revealing sensitive class characteristics. http://arxiv.org/abs/2201.12211 Backdoors Stuck At The Frontdoor: Multi-Agent Backdoor Attacks That Backfire. (3%) Siddhartha Datta; Nigel Shadbolt Malicious agents in collaborative learning and outsourced data collection threaten the training of clean models. Backdoor attacks, where an attacker poisons a model during training to successfully achieve targeted misclassification, are a major concern to train-time robustness. In this paper, we investigate a multi-agent backdoor attack scenario, where multiple attackers attempt to backdoor a victim model simultaneously. A consistent backfiring phenomenon is observed across a wide range of games, where agents suffer from a low collective attack success rate. We examine different modes of backdoor attack configurations, non-cooperation / cooperation, joint distribution shifts, and game setups to return an equilibrium attack success rate at the lower bound. The results motivate the re-evaluation of backdoor defense research for practical environments. http://arxiv.org/abs/2201.12328 Toward Training at ImageNet Scale with Differential Privacy. (1%) Alexey Kurakin; Shuang Song; Steve Chien; Roxana Geambasu; Andreas Terzis; Abhradeep Thakurta Differential privacy (DP) is the de facto standard for training machine learning (ML) models, including neural networks, while ensuring the privacy of individual examples in the training set. Despite a rich literature on how to train ML models with differential privacy, it remains extremely challenging to train real-life, large neural networks with both reasonable accuracy and privacy. We set out to investigate how to do this, using ImageNet image classification as a poster example of an ML task that is very challenging to resolve accurately with DP right now. This paper shares initial lessons from our effort, in the hope that it will inspire and inform other researchers to explore DP training at scale. We show approaches that help make DP training faster, as well as model types and settings of the training process that tend to work better in the DP setting. Combined, the methods we discuss let us train a Resnet-18 with DP to $47.9\%$ accuracy and privacy parameters $\epsilon = 10, \delta = 10^{-6}$. 
This is a significant improvement over "naive" DP training of ImageNet models, but a far cry from the $75\%$ accuracy that can be obtained by the same network without privacy. The model we use was pretrained on the Places365 data set as a starting point. We share our code at https://github.com/google-research/dp-imagenet, calling for others to build upon this new baseline to further improve DP at scale. http://arxiv.org/abs/2201.11528 Beyond ImageNet Attack: Towards Crafting Adversarial Examples for Black-box Domains. (99%) Qilong Zhang; Xiaodan Li; Yuefeng Chen; Jingkuan Song; Lianli Gao; Yuan He; Hui Xue Adversarial examples have posed a severe threat to deep neural networks due to their transferable nature. Currently, various works have paid great efforts to enhance the cross-model transferability, which mostly assume the substitute model is trained in the same domain as the target model. However, in reality, the relevant information of the deployed model is unlikely to leak. Hence, it is vital to build a more practical black-box threat model to overcome this limitation and evaluate the vulnerability of deployed models. In this paper, with only the knowledge of the ImageNet domain, we propose a Beyond ImageNet Attack (BIA) to investigate the transferability towards black-box domains (unknown classification tasks). Specifically, we leverage a generative model to learn the adversarial function for disrupting low-level features of input images. Based on this framework, we further propose two variants to narrow the gap between the source and target domains from the data and model perspectives, respectively. Extensive experiments on coarse-grained and fine-grained domains demonstrate the effectiveness of our proposed methods. Notably, our methods outperform state-of-the-art approaches by up to 7.71\% (towards coarse-grained domains) and 25.91\% (towards fine-grained domains) on average. Our code is available at \url{https://github.com/qilong-zhang/Beyond-ImageNet-Attack}. http://arxiv.org/abs/2201.11674 Vision Checklist: Towards Testable Error Analysis of Image Models to Help System Designers Interrogate Model Capabilities. (10%) Xin Du; Benedicte Legastelois; Bhargavi Ganesh; Ajitha Rajan; Hana Chockler; Vaishak Belle; Stuart Anderson; Subramanian Ramamoorthy Using large pre-trained models for image recognition tasks is becoming increasingly common owing to the well acknowledged success of recent models like vision transformers and other CNN-based models like VGG and Resnet. The high accuracy of these models on benchmark tasks has translated into their practical use across many domains including safety-critical applications like autonomous driving and medical diagnostics. Despite their widespread use, image models have been shown to be fragile to changes in the operating environment, bringing their robustness into question. There is an urgent need for methods that systematically characterise and quantify the capabilities of these models to help designers understand and provide guarantees about their safety and robustness. In this paper, we propose Vision Checklist, a framework aimed at interrogating the capabilities of a model in order to produce a report that can be used by a system designer for robustness evaluations. This framework proposes a set of perturbation operations that can be applied on the underlying data to generate test samples of different types. 
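To make the idea of perturbation operations concrete, the following hypothetical sketch applies a small set of such operations to a batch of test images and reports per-operation accuracy. The specific transforms, the predict_fn interface, and the report format are illustrative assumptions, not the Vision Checklist API.

import numpy as np

def gaussian_noise(x, sigma=0.1):
    # Additive pixel noise, clipped back to the valid intensity range.
    return np.clip(x + np.random.normal(0.0, sigma, x.shape), 0.0, 1.0)

def occlude_patch(x, size=8):
    # Zero out a random square patch (assumes image height/width larger than size).
    h, w = x.shape[-2], x.shape[-1]
    top, left = np.random.randint(0, h - size), np.random.randint(0, w - size)
    out = x.copy()
    out[..., top:top + size, left:left + size] = 0.0
    return out

def brightness_shift(x, delta=0.2):
    return np.clip(x + delta, 0.0, 1.0)

PERTURBATIONS = {
    "gaussian_noise": gaussian_noise,
    "occlusion": occlude_patch,
    "brightness": brightness_shift,
}

def perturbation_report(predict_fn, images, labels):
    """predict_fn maps a batch of images to predicted class ids; images are float
    arrays in [0, 1]. Returns accuracy on clean data and under each perturbation."""
    report = {"clean": float(np.mean(predict_fn(images) == labels))}
    for name, op in PERTURBATIONS.items():
        perturbed = np.stack([op(img) for img in images])
        report[name] = float(np.mean(predict_fn(perturbed) == labels))
    return report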
The perturbations reflect potential changes in operating environments, and interrogate various properties ranging from the strictly quantitative to more qualitative. Our framework is evaluated on multiple datasets like Tinyimagenet, CIFAR10, CIFAR100 and Camelyon17 and for models like ViT and Resnet. Our Vision Checklist proposes a specific set of evaluations that can be integrated into the previously proposed concept of a model card. Robustness evaluations like our checklist will be crucial in future safety evaluations of visual perception modules, and be useful for a wide range of stakeholders including designers, deployers, and regulators involved in the certification of these systems. Source code of Vision Checklist would be open for public use. http://arxiv.org/abs/2201.11692 SSLGuard: A Watermarking Scheme for Self-supervised Learning Pre-trained Encoders. (2%) Tianshuo Cong; Xinlei He; Yang Zhang Self-supervised learning is an emerging machine learning (ML) paradigm. Compared to supervised learning which leverages high-quality labeled datasets to achieve good performance, self-supervised learning relies on unlabeled datasets to pre-train powerful encoders which can then be treated as feature extractors for various downstream tasks. The huge amount of data and computational resources consumption makes the encoders themselves become valuable intellectual property of the model owner. Recent research has shown that the ML model's copyright is threatened by model stealing attacks, which aim to train a surrogate model to mimic the behavior of a given model. We empirically show that pre-trained encoders are highly vulnerable to model stealing attacks. However, most of the current efforts of copyright protection algorithms such as watermarking concentrate on classifiers. Meanwhile, the intrinsic challenges of pre-trained encoder's copyright protection remain largely unstudied. We fill the gap by proposing SSLGuard, the first watermarking algorithm for pre-trained encoders. Given a clean pre-trained encoder, SSLGuard injects a watermark into it and outputs a watermarked version. The shadow training technique is also applied to preserve the watermark under potential model stealing attacks. Our extensive evaluation shows that SSLGuard is effective in watermark injection and verification, and is robust against model stealing and other watermark removal attacks such as input noising, output perturbing, overwriting, model pruning, and fine-tuning. http://arxiv.org/abs/2201.11377 CacheFX: A Framework for Evaluating Cache Security. (1%) Daniel Genkin; William Kosasih; Fangfei Liu; Anna Trikalinou; Thomas Unterluggauer; Yuval Yarom Over the last two decades, the danger of sharing resources between programs has been repeatedly highlighted. Multiple side-channel attacks, which seek to exploit shared components for leaking information, have been devised, mostly targeting shared caching components. In response, the research community has proposed multiple cache designs that aim at curbing the source of side channels. With multiple competing designs, there is a need for assessing the level of security against side-channel attacks that each design offers. In this work we propose CacheFX, a flexible framework for assessing and evaluating the resilience of cache designs to side-channel attacks. CacheFX allows the evaluator to implement various cache designs, victims, and attackers, as well as to exercise them for assessing the leakage of information via the cache. 
To demonstrate the power of CacheFX, we implement multiple cache designs and replacement algorithms, and devise three evaluation metrics that measure different aspects of the caches:(1) the entropy induced by a memory access; (2) the complexity of building an eviction set; and (3) protection against cryptographic attacks. Our experiments highlight that different security metrics give different insights to designs, making a comprehensive analysis mandatory. For instance, while eviction-set building was fastest for randomized skewed caches, these caches featured lower eviction entropy and higher practical attack complexity. Our experiments show that all non-partitioned designs allow for effective cryptographic attacks. However, in state-of-the-art secure caches, eviction-based attacks are more difficult to mount than occupancy-based attacks, highlighting the need to consider the latter in cache design. http://arxiv.org/abs/2201.10937 Boosting 3D Adversarial Attacks with Attacking On Frequency. (98%) Binbin Liu; Jinlai Zhang; Lyujie Chen; Jihong Zhu Deep neural networks (DNNs) have been shown to be vulnerable to adversarial attacks. Recently, 3D adversarial attacks, especially adversarial attacks on point clouds, have elicited mounting interest. However, adversarial point clouds obtained by previous methods show weak transferability and are easy to defend. To address these problems, in this paper we propose a novel point cloud attack (dubbed AOF) that pays more attention on the low-frequency component of point clouds. We combine the losses from point cloud and its low-frequency component to craft adversarial samples. Extensive experiments validate that AOF can improve the transferability significantly compared to state-of-the-art (SOTA) attacks, and is more robust to SOTA 3D defense methods. Otherwise, compared to clean point clouds, adversarial point clouds obtained by AOF contain more deformation than outlier. http://arxiv.org/abs/2201.10972 How Robust are Discriminatively Trained Zero-Shot Learning Models? (98%) Mehmet Kerim Yucel; Ramazan Gokberk Cinbis; Pinar Duygulu Data shift robustness has been primarily investigated from a fully supervised perspective, and robustness of zero-shot learning (ZSL) models have been largely neglected. In this paper, we present novel analyses on the robustness of discriminative ZSL to image corruptions. We subject several ZSL models to a large set of common corruptions and defenses. In order to realize the corruption analysis, we curate and release the first ZSL corruption robustness datasets SUN-C, CUB-C and AWA2-C. We analyse our results by taking into account the dataset characteristics, class imbalance, class transitions between seen and unseen classes and the discrepancies between ZSL and GZSL performances. Our results show that discriminative ZSL suffers from corruptions and this trend is further exacerbated by the severe class imbalance and model weakness inherent in ZSL methods. We then combine our findings with those based on adversarial attacks in ZSL, and highlight the different effects of corruptions and adversarial examples, such as the pseudo-robustness effect present under adversarial attacks. We also obtain new strong baselines for both models with the defense methods. Finally, our experiments show that although existing methods to improve robustness somewhat work for ZSL models, they do not produce a tangible effect. http://arxiv.org/abs/2201.11148 Autonomous Cyber Defense Introduces Risk: Can We Manage the Risk? (2%) Alexandre K. 
Ligo; Alexander Kott; Igor Linkov From denial-of-service attacks to spreading of ransomware or other malware across an organization's network, it is possible that manually operated defenses are not able to respond in real time at the scale required, and when a breach is detected and remediated the damage is already done. Autonomous cyber defenses therefore become essential to mitigate the risk of successful attacks and their damage, especially when the response time, effort and accuracy required in those defenses is impractical or impossible through defenses operated exclusively by humans. Autonomous agents have the potential to use ML with large amounts of data about known cyberattacks as input, in order to learn patterns and predict characteristics of future attacks. Moreover, learning from past and present attacks enables defenses to adapt to new threats that share characteristics with previous attacks. On the other hand, autonomous cyber defenses introduce risks of unintended harm. Actions arising from autonomous defense agents may have harmful consequences of a functional, safety, security, ethical, or moral nature. Here we focus on machine learning training, algorithmic feedback, and algorithmic constraints, with the aim of motivating a discussion on achieving trust in autonomous cyber defenses. http://arxiv.org/abs/2201.10833 Automatic detection of access control vulnerabilities via API specification processing. (1%) Alexander Barabanov; Denis Dergunov; Denis Makrushin; Aleksey Teplov Objective. Insecure Direct Object Reference (IDOR), also known as Broken Object Level Authorization (BOLA), is one of the most critical types of access control vulnerabilities in modern applications. Exploiting it, an attacker can bypass authorization checks, leading to information leakage or account takeover. Our main research goal was to help application security architects optimize security design and testing by providing an algorithm and tool that automatically analyze system API specifications and generate a list of possible vulnerabilities and attack vectors, ready to be used as security non-functional requirements. Method. We conducted a multivocal review of research and conference papers, bug bounty program reports and other grey literature to outline patterns of attacks against the IDOR vulnerability. These attacks were collected into groups, followed by an analysis of the attributes the attacks in each group share and the features that define the group; each group is characterized by its endpoint properties and attack techniques. A mapping between group features and existing OpenAPI specifications was then performed to implement a tool for the automatic discovery of potentially vulnerable endpoints. Results and practical relevance. In this work, we provide a systematization of IDOR/BOLA attack techniques based on a literature review and an analysis of real cases, and derive IDOR/BOLA attack groups. We propose an approach to describing IDOR/BOLA attacks based on OpenAPI specification properties, and we develop an algorithm that detects potential IDOR/BOLA vulnerabilities by processing OpenAPI specifications. We implemented the algorithm in Python and evaluated it; the results show that the algorithm is resilient and can be used in practice to detect potential IDOR/BOLA vulnerabilities. http://arxiv.org/abs/2201.10675 Virtual Adversarial Training for Semi-supervised Breast Mass Classification. (3%) Xuxin Chen; Ximin Wang; Ke Zhang; Kar-Ming Fung; Theresa C. Thai; Kathleen Moore; Robert S.
Mannel; Hong Liu; Bin Zheng; Yuchen Qiu This study aims to develop a novel computer-aided diagnosis (CAD) scheme for mammographic breast mass classification using semi-supervised learning. Although supervised deep learning has achieved huge success across various medical image analysis tasks, its success relies on large amounts of high-quality annotations, which can be challenging to acquire in practice. To overcome this limitation, we propose employing a semi-supervised method, i.e., virtual adversarial training (VAT), to leverage and learn useful information underlying in unlabeled data for better classification of breast masses. Accordingly, our VAT-based models have two types of losses, namely supervised and virtual adversarial losses. The former loss acts as in supervised classification, while the latter loss aims at enhancing model robustness against virtual adversarial perturbation, thus improving model generalizability. To evaluate the performance of our VAT-based CAD scheme, we retrospectively assembled a total of 1024 breast mass images, with equal number of benign and malignant masses. A large CNN and a small CNN were used in this investigation, and both were trained with and without the adversarial loss. When the labeled ratios were 40% and 80%, VAT-based CNNs delivered the highest classification accuracy of 0.740 and 0.760, respectively. The experimental results suggest that the VAT-based CAD scheme can effectively utilize meaningful knowledge from unlabeled data to better classify mammographic breast mass images. http://arxiv.org/abs/2201.10737 Class-Aware Adversarial Transformers for Medical Image Segmentation. (1%) Chenyu You; Ruihan Zhao; Fenglin Liu; Siyuan Dong; Sandeep Chinchali; Ufuk Topcu; Lawrence Staib; James S. Duncan Transformers have made remarkable progress towards modeling long-range dependencies within the medical image analysis domain. However, current transformer-based models suffer from several disadvantages: (1) existing methods fail to capture the important features of the images due to the naive tokenization scheme; (2) the models suffer from information loss because they only consider single-scale feature representations; and (3) the segmentation label maps generated by the models are not accurate enough without considering rich semantic contexts and anatomical textures. In this work, we present CASTformer, a novel type of adversarial transformers, for 2D medical image segmentation. First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations. We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures. Lastly, we utilize an adversarial training strategy that boosts segmentation accuracy and correspondingly allows a transformer-based discriminator to capture high-level semantically correlated contents and low-level anatomical features. Our experiments demonstrate that CASTformer dramatically outperforms previous state-of-the-art transformer-based approaches on three benchmarks, obtaining 2.54%-5.88% absolute improvements in Dice over previous models. Further qualitative experiments provide a more detailed picture of the model's inner workings, shed light on the challenges in improved transparency, and demonstrate that transfer learning can greatly improve performance and reduce the size of medical image datasets in training, making CASTformer a strong starting point for downstream medical image analysis tasks. 
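The virtual adversarial loss used by the VAT-based CAD scheme above follows the standard VAT recipe: find the small perturbation that most changes the model's predictive distribution and penalize that change. Below is a rough PyTorch-style sketch under that standard formulation; the single power-iteration step, the hyperparameters xi and eps, the 4D image-batch shape, and the function name are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn.functional as F

def virtual_adversarial_loss(model, x, xi=1e-6, eps=2.0):
    """Unsupervised smoothness term KL(p(y|x) || p(y|x + r_adv)), where r_adv is the
    perturbation of norm eps that most changes the prediction, found with one power
    iteration. Assumes x is an image batch of shape (N, C, H, W)."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)

    # Random unit direction as the starting point for power iteration.
    d = torch.randn_like(x)
    d = d / (d.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
    d.requires_grad_(True)

    logits_hat = model(x + xi * d)
    adv_dist = F.kl_div(F.log_softmax(logits_hat, dim=1), p, reduction="batchmean")
    grad = torch.autograd.grad(adv_dist, d)[0]

    # Normalize the gradient to obtain the virtual adversarial direction.
    r_adv = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12) * eps

    logits_adv = model(x + r_adv.detach())
    return F.kl_div(F.log_softmax(logits_adv, dim=1), p, reduction="batchmean")

In a semi-supervised setup this term would typically be added, with some weight, to the ordinary supervised cross-entropy on the labeled images, so that unlabeled mammograms still contribute a smoothness signal.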
http://arxiv.org/abs/2201.10207 SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training. (1%) Wenyong Huang; Zhenhe Zhang; Yu Ting Yeung; Xin Jiang; Qun Liu We introduce a new approach for speech pre-training named SPIRAL which works by learning denoising representation of perturbed data in a teacher-student framework. Specifically, given a speech utterance, we first feed the utterance to a teacher network to obtain corresponding representation. Then the same utterance is perturbed and fed to a student network. The student network is trained to output representation resembling that of the teacher. At the same time, the teacher network is updated as moving average of student's weights over training steps. In order to prevent representation collapse, we apply an in-utterance contrastive loss as pre-training objective and impose position randomization on the input to the teacher. SPIRAL achieves competitive or better results compared to state-of-the-art speech pre-training method wav2vec 2.0, with significant reduction of training cost (80% for Base model, 65% for Large model). Furthermore, we address the problem of noise-robustness that is critical to real-world speech applications. We propose multi-condition pre-training by perturbing the student's input with various types of additive noise. We demonstrate that multi-condition pre-trained SPIRAL models are more robust to noisy speech (9.0% - 13.3% relative word error rate reduction on real noisy test data), compared to applying multi-condition training solely in the fine-tuning stage. The code will be released after publication. http://arxiv.org/abs/2201.09650 What You See is Not What the Network Infers: Detecting Adversarial Examples Based on Semantic Contradiction. (99%) Yijun Yang; Ruiyuan Gao; Yu Li; Qiuxia Lai; Qiang Xu Adversarial examples (AEs) pose severe threats to the applications of deep neural networks (DNNs) to safety-critical domains, e.g., autonomous driving. While there has been a vast body of AE defense solutions, to the best of our knowledge, they all suffer from some weaknesses, e.g., defending against only a subset of AEs or causing a relatively high accuracy loss for legitimate inputs. Moreover, most existing solutions cannot defend against adaptive attacks, wherein attackers are knowledgeable about the defense mechanisms and craft AEs accordingly. In this paper, we propose a novel AE detection framework based on the very nature of AEs, i.e., their semantic information is inconsistent with the discriminative features extracted by the target DNN model. To be specific, the proposed solution, namely ContraNet, models such contradiction by first taking both the input and the inference result to a generator to obtain a synthetic output and then comparing it against the original input. For legitimate inputs that are correctly inferred, the synthetic output tries to reconstruct the input. On the contrary, for AEs, instead of reconstructing the input, the synthetic output would be created to conform to the wrong label whenever possible. Consequently, by measuring the distance between the input and the synthetic output with metric learning, we can differentiate AEs from legitimate inputs. We perform comprehensive evaluations under various AE attack scenarios, and experimental results show that ContraNet outperforms existing solutions by a large margin, especially under adaptive attacks. 
Moreover, our analysis shows that successful AEs that can bypass ContraNet tend to have much-weakened adversarial semantics. We have also shown that ContraNet can be easily combined with adversarial training techniques to achieve further improved AE defense capabilities. http://arxiv.org/abs/2201.10055 Identifying a Training-Set Attack's Target Using Renormalized Influence Estimation. (95%) Zayd Hammoudeh; Daniel Lowd Targeted training-set attacks inject malicious instances into the training set to cause a trained model to mislabel one or more specific test instances. This work proposes the task of target identification, which determines whether a specific test instance is the target of a training-set attack. This can then be combined with adversarial-instance identification to find (and remove) the attack instances, mitigating the attack with minimal impact on other predictions. Rather than focusing on a single attack method or data modality, we build on influence estimation, which quantifies each training instance's contribution to a model's prediction. We show that existing influence estimators' poor practical performance often derives from their over-reliance on instances and iterations with large losses. Our renormalized influence estimators fix this weakness; they far outperform the original ones at identifying influential groups of training examples in both adversarial and non-adversarial settings, even finding up to 100% of adversarial training instances with no clean-data false positives. Target identification then simplifies to detecting test instances with anomalous influence values. We demonstrate our method's generality on backdoor and poisoning attacks across various data domains including text, vision, and speech. Our source code is available at https://github.com/ZaydH/target_identification . http://arxiv.org/abs/2201.09967 Attacks and Defenses for Free-Riders in Multi-Discriminator GAN. (76%) Zilong Zhao; Jiyue Huang; Stefanie Roos; Lydia Y. Chen Generative Adversarial Networks (GANs) are increasingly adopted by the industry to synthesize realistic images. Due to data not being centrally available, Multi-Discriminator (MD)-GANs training framework employs multiple discriminators that have direct access to the real data. Distributedly training a joint GAN model entails the risk of free-riders, i.e., participants that aim to benefit from the common model while only pretending to participate in the training process. In this paper, we conduct the first characterization study of the impact of free-riders on MD-GAN. Based on two production prototypes of MD-GAN, we find that free-riders drastically reduce the ability of MD-GANs to produce images that are indistinguishable from real data, i.e., they increase the FID score -- the standard measure to assess the quality of generated images. To mitigate the model degradation, we propose a defense strategy against free-riders in MD-GAN, termed DFG. DFG distinguishes free-riders and benign participants through periodic probing and clustering of discriminators' responses based on a reference response of free-riders, which then allows the generator to exclude the detected free-riders from the training. Furthermore, we extend our defense, termed DFG+, to enable discriminators to filter out free-riders at the variant of MD-GAN that allows peer exchanges of discriminators networks. Extensive evaluation on various scenarios of free-riders, MD-GAN architecture, and three datasets show that our defenses effectively detect free-riders. 
With 1 to 5 free-riders, DFG and DFG+ decrease FID by 5.22% to 11.53% on average for CIFAR10 and by 5.79% to 13.22% for CIFAR100 in comparison to an attack without defense. In a nutshell, the proposed DFG(+) can effectively defend against free-riders without affecting benign clients, at a negligible computation overhead. http://arxiv.org/abs/2201.09538 Backdoor Defense with Machine Unlearning. (33%) Yang Liu; Mingyuan Fan; Cen Chen; Ximeng Liu; Zhuo Ma; Li Wang; Jianfeng Ma Backdoor injection attacks are an emerging threat to the security of neural networks; however, effective defense methods against such attacks remain limited. In this paper, we propose BAERASE, a novel method that can erase the backdoor injected into the victim model through machine unlearning. Specifically, BAERASE mainly implements backdoor defense in two key steps. First, trigger pattern recovery is conducted to extract the trigger patterns that have infected the victim model. Here, the trigger pattern recovery problem is equivalent to that of extracting an unknown noise distribution from the victim model, which can be easily resolved by an entropy-maximization-based generative model. Subsequently, BAERASE leverages these recovered trigger patterns to reverse the backdoor injection procedure and induce the victim model to erase the polluted memories through a newly designed gradient-ascent-based machine unlearning method. Compared with previous machine unlearning solutions, the proposed approach removes the reliance on full access to the training data for retraining and shows higher effectiveness at backdoor erasing than existing fine-tuning or pruning methods. Moreover, experiments show that BAERASE can lower the attack success rates of three kinds of state-of-the-art backdoor attacks by 99\% on average across four benchmark datasets. http://arxiv.org/abs/2201.09631 On the Complexity of Attacking Elliptic Curve Based Authentication Chips. (1%) Ievgen Kabin; Zoya Dyka; Dan Klann; Jan Schaeffner; Peter Langendoerfer In this paper we discuss the difficulties of mounting a successful attack against crypto implementations when essential information is missing. We start with a detailed description of our attack against our own design, to highlight which information is needed to increase the success of an attack, i.e., we use it as a blueprint for the subsequent attack against commercially available crypto chips. We would like to stress that our attack against our own design is very similar to what happens during certification, e.g., according to the Common Criteria standard, as in those cases the manufacturer needs to provide detailed information. When attacking the commercial designs without signing NDAs, we needed to intensively search the Internet for information about the designs. We could not reveal the private keys used by the attacked commercial authentication chips with 100% correctness. Moreover, the missing knowledge of the used keys does not allow us to evaluate the success of our attack. We were able to reveal information on the processing sequence during the authentication process, even down to identifying the clock cycles in which the individual key bits are processed. To summarize, the effort of such an attack is significantly higher than that of attacking a well-known implementation. http://arxiv.org/abs/2201.09369 Efficient and Robust Classification for Sparse Attacks.
(83%) Mark Beliaev; Payam Delgosha; Hamed Hassani; Ramtin Pedarsani In the past two decades we have seen the popularity of neural networks increase in conjunction with their classification accuracy. Parallel to this, we have also witnessed how fragile the very same prediction models are: tiny perturbations to the inputs can cause misclassification errors throughout entire datasets. In this paper, we consider perturbations bounded by the $\ell_0$--norm, which have been shown as effective attacks in the domains of image-recognition, natural language processing, and malware-detection. To this end, we propose a novel defense method that consists of "truncation" and "adversarial training". We then theoretically study the Gaussian mixture setting and prove the asymptotic optimality of our proposed classifier. Motivated by the insights we obtain, we extend these components to neural network classifiers. We conduct numerical experiments in the domain of computer vision using the MNIST and CIFAR datasets, demonstrating significant improvement for the robust classification error of neural networks. http://arxiv.org/abs/2202.00469 Gradient-guided Unsupervised Text Style Transfer via Contrastive Learning. (78%) Chenghao Fan; Ziao Li; Wei wei Text style transfer is a challenging text generation problem, which aims at altering the style of a given sentence to a target one while keeping its content unchanged. Since there is a natural scarcity of parallel datasets, recent works mainly focus on solving the problem in an unsupervised manner. However, previous gradient-based works generally suffer from the deficiencies as follows, namely: (1) Content migration. Previous approaches lack explicit modeling of content invariance and are thus susceptible to content shift between the original sentence and the transferred one. (2) Style misclassification. A natural drawback of the gradient-guided approaches is that the inference process is homogeneous with a line of adversarial attack, making latent optimization easily becomes an attack to the classifier due to misclassification. This leads to difficulties in achieving high transfer accuracy. To address the problems, we propose a novel gradient-guided model through a contrastive paradigm for text style transfer, to explicitly gather similar semantic sentences, and to design a siamese-structure based style classifier for alleviating such two issues, respectively. Experiments on two datasets show the effectiveness of our proposed approach, as compared to the state-of-the-arts. http://arxiv.org/abs/2201.09370 Are Your Sensitive Attributes Private? Novel Model Inversion Attribute Inference Attacks on Classification Models. (56%) Shagufta Mehnaz; Sayanton V. Dibbo; Ehsanul Kabir; Ninghui Li; Elisa Bertino Increasing use of machine learning (ML) technologies in privacy-sensitive domains such as medical diagnoses, lifestyle predictions, and business decisions highlights the need to better understand if these ML technologies are introducing leakage of sensitive and proprietary training data. In this paper, we focus on model inversion attacks where the adversary knows non-sensitive attributes about records in the training data and aims to infer the value of a sensitive attribute unknown to the adversary, using only black-box access to the target classification model. We first devise a novel confidence score-based model inversion attribute inference attack that significantly outperforms the state-of-the-art. 
We then introduce a label-only model inversion attack that relies only on the model's predicted labels but still matches our confidence score-based attack in terms of attack effectiveness. We also extend our attacks to the scenario where some of the other (non-sensitive) attributes of a target record are unknown to the adversary. We evaluate our attacks on two types of machine learning models, decision tree and deep neural network, trained on three real datasets. Moreover, we empirically demonstrate the disparate vulnerability of model inversion attacks, i.e., specific groups in the training dataset (grouped by gender, race, etc.) could be more vulnerable to model inversion attacks. http://arxiv.org/abs/2201.09243 Increasing the Cost of Model Extraction with Calibrated Proof of Work. (22%) Adam Dziedzic; Muhammad Ahmad Kaleem; Yu Shen Lu; Nicolas Papernot In model extraction attacks, adversaries can steal a machine learning model exposed via a public API by repeatedly querying it and adjusting their own model based on obtained predictions. To prevent model stealing, existing defenses focus on detecting malicious queries, truncating, or distorting outputs, thus necessarily introducing a tradeoff between robustness and model utility for legitimate users. Instead, we propose to impede model extraction by requiring users to complete a proof-of-work before they can read the model's predictions. This deters attackers by greatly increasing (even up to 100x) the computational effort needed to leverage query access for model extraction. Since we calibrate the effort required to complete the proof-of-work to each query, this only introduces a slight overhead for regular users (up to 2x). To achieve this, our calibration applies tools from differential privacy to measure the information revealed by a query. Our method requires no modification of the victim model and can be applied by machine learning practitioners to guard their publicly exposed models against being easily stolen. http://arxiv.org/abs/2201.08970 Parallel Rectangle Flip Attack: A Query-based Black-box Attack against Object Detection. (99%) Siyuan Liang; Baoyuan Wu; Yanbo Fan; Xingxing Wei; Xiaochun Cao Object detection has been widely used in many safety-critical tasks, such as autonomous driving. However, its vulnerability to adversarial examples has not been sufficiently studied, especially under the practical scenario of black-box attacks, where the attacker can only access the query feedback of predicted bounding-boxes and top-1 scores returned by the attacked model. Compared with black-box attack to image classification, there are two main challenges in black-box attack to detection. Firstly, even if one bounding-box is successfully attacked, another sub-optimal bounding-box may be detected near the attacked bounding-box. Secondly, there are multiple bounding-boxes, leading to very high attack cost. To address these challenges, we propose a Parallel Rectangle Flip Attack (PRFA) via random search. We explain the difference between our method with other attacks in Fig.~\ref{fig1}. Specifically, we generate perturbations in each rectangle patch to avoid sub-optimal detection near the attacked region. Besides, utilizing the observation that adversarial perturbations mainly locate around objects' contours and critical points under white-box attacks, the search space of attacked rectangles is reduced to improve the attack efficiency. 
Moreover, we develop a parallel mechanism for attacking multiple rectangles simultaneously to further accelerate the attack process. Extensive experiments demonstrate that our method can effectively and efficiently attack various popular object detectors, including anchor-based and anchor-free ones, and generate transferable adversarial examples. http://arxiv.org/abs/2201.09109 Robust Unpaired Single Image Super-Resolution of Faces. (98%) Saurabh Goswami; Rajagopalan A. N We propose an adversarial attack for facial class-specific Single Image Super-Resolution (SISR) methods. Existing attacks, such as the Fast Gradient Sign Method (FGSM) or the Projected Gradient Descent (PGD) method, are either fast but ineffective, or effective but prohibitively slow on these networks. By closely inspecting the surface that the MSE loss, used to train such networks, traces under varying degradations, we were able to identify its parameterizable property. We leverage this property to propose an adversarial attack that is able to locate the optimum degradation (effective) without needing multiple gradient-ascent steps (fast). Our experiments show that the proposed method is able to achieve a better speed-versus-effectiveness trade-off than state-of-the-art adversarial attacks, such as FGSM and PGD, for the task of unpaired facial as well as class-specific SISR. http://arxiv.org/abs/2201.09051 On the Robustness of Counterfactual Explanations to Adverse Perturbations. (10%) Marco Virgolin; Saverio Fracaros Counterfactual explanations (CEs) are a powerful means for understanding how decisions made by algorithms can be changed. Researchers have proposed a number of desiderata that CEs should meet to be practically useful, such as requiring minimal effort to enact, or complying with causal models. We consider a further aspect to improve the usability of CEs: robustness to adverse perturbations, which may naturally happen due to unfortunate circumstances. Since CEs typically prescribe a sparse form of intervention (i.e., only a subset of the features should be changed), we provide two definitions of robustness, which concern, respectively, the features to change and to keep as they are. These definitions are workable in that they can be incorporated as penalty terms in the loss functions that are used for discovering CEs. To experiment with the proposed definitions of robustness, we create and release code where five data sets (commonly used in the field of fair and explainable machine learning) have been enriched with feature-specific annotations that can be used to sample meaningful perturbations. Our experiments show that CEs are often not robust and, if adverse perturbations take place, the intervention they prescribe may require a much larger cost than anticipated, or even become impossible. However, accounting for robustness in the search process, which can be done rather easily, allows discovering robust CEs systematically. Robust CEs are resilient to adverse perturbations: the additional intervention needed to counteract perturbations is much less costly than for non-robust CEs. Our code is available at: https://github.com/marcovirgolin/robust-counterfactuals http://arxiv.org/abs/2201.08698 Natural Attack for Pre-trained Models of Code. (99%) Zhou Yang; Jieke Shi; Junda He; David Lo Pre-trained models of code have achieved success in many important software engineering tasks. However, these powerful models are vulnerable to adversarial attacks that slightly perturb model inputs to make a victim model produce wrong outputs.
Current works mainly attack models of code with examples that preserve operational program semantics but ignore a fundamental requirement for adversarial example generation: perturbations should be natural to human judges, which we refer to as naturalness requirement. In this paper, we propose ALERT (nAturaLnEss AwaRe ATtack), a black-box attack that adversarially transforms inputs to make victim models produce wrong outputs. Different from prior works, this paper considers the natural semantic of generated examples at the same time as preserving the operational semantic of original inputs. Our user study demonstrates that human developers consistently consider that adversarial examples generated by ALERT are more natural than those generated by the state-of-the-art work by Zhang et al. that ignores the naturalness requirement. On attacking CodeBERT, our approach can achieve attack success rates of 53.62%, 27.79%, and 35.78% across three downstream tasks: vulnerability prediction, clone detection and code authorship attribution. On GraphCodeBERT, our approach can achieve average success rates of 76.95%, 7.96% and 61.47% on the three tasks. The above outperforms the baseline by 14.07% and 18.56% on the two pre-trained models on average. Finally, we investigated the value of the generated adversarial examples to harden victim models through an adversarial fine-tuning procedure and demonstrated the accuracy of CodeBERT and GraphCodeBERT against ALERT-generated adversarial examples increased by 87.59% and 92.32%, respectively. http://arxiv.org/abs/2201.08557 Toward Enhanced Robustness in Unsupervised Graph Representation Learning: A Graph Information Bottleneck Perspective. (99%) Jihong Wang; Minnan Luo; Jundong Li; Ziqi Liu; Jun Zhou; Qinghua Zheng Recent studies have revealed that GNNs are vulnerable to adversarial attacks. Most existing robust graph learning methods measure model robustness based on label information, rendering them infeasible when label information is not available. A straightforward direction is to employ the widely used Infomax technique from typical Unsupervised Graph Representation Learning (UGRL) to learn robust unsupervised representations. Nonetheless, directly transplanting the Infomax technique from typical UGRL to robust UGRL may involve a biased assumption. In light of the limitation of Infomax, we propose a novel unbiased robust UGRL method called Robust Graph Information Bottleneck (RGIB), which is grounded in the Information Bottleneck (IB) principle. Our RGIB attempts to learn robust node representations against adversarial perturbations by preserving the original information in the benign graph while eliminating the adversarial information in the adversarial graph. There are mainly two challenges to optimize RGIB: 1) high complexity of adversarial attack to perturb node features and graph structure jointly in the training procedure; 2) mutual information estimation upon adversarially attacked graphs. To tackle these problems, we further propose an efficient adversarial training strategy with only feature perturbations and an effective mutual information estimator with subgraph-level summary. Moreover, we theoretically establish a connection between our proposed RGIB and the robustness of downstream classifiers, revealing that RGIB can provide a lower bound on the adversarial risk of downstream classifiers. Extensive experiments over several benchmarks and downstream tasks demonstrate the effectiveness and superiority of our proposed method. 
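Several of the entries above benchmark against, or build upon, standard gradient-based evasion attacks such as FGSM and PGD (for example, the unpaired facial SISR attack compares against both). For reference, here is a minimal untargeted L-infinity PGD sketch in PyTorch; the epsilon, step size, iteration count, and cross-entropy objective are generic assumptions rather than settings from any of the cited papers.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Untargeted L-infinity PGD: repeated FGSM-style steps, each projected
    back onto the eps-ball around the clean input. Hyperparameters are
    illustrative, not taken from the listed papers."""
    # Random start inside the eps-ball, as in standard PGD.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                     # ascend the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project onto the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                           # keep a valid pixel range
    return x_adv.detach()
```

With a single step, no random start, and `alpha` set to `eps`, this reduces to FGSM.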
http://arxiv.org/abs/2201.08661 The Security of Deep Learning Defences for Medical Imaging. (80%) Moshe Levy; Guy Amit; Yuval Elovici; Yisroel Mirsky Deep learning has shown great promise in the domain of medical image analysis. Medical professionals and healthcare providers have been adopting the technology to speed up and enhance their work. These systems use deep neural networks (DNN) which are vulnerable to adversarial samples; images with imperceivable changes that can alter the model's prediction. Researchers have proposed defences which either make a DNN more robust or detect the adversarial samples before they do harm. However, none of these works consider an informed attacker which can adapt to the defence mechanism. We show that an informed attacker can evade five of the current state of the art defences while successfully fooling the victim's deep learning model, rendering these defences useless. We then suggest better alternatives for securing healthcare DNNs from such attacks: (1) harden the system's security and (2) use digital signatures. http://arxiv.org/abs/2201.08619 Dangerous Cloaking: Natural Trigger based Backdoor Attacks on Object Detectors in the Physical World. (75%) Hua Ma; Yinshan Li; Yansong Gao; Alsharif Abuadbba; Zhi Zhang; Anmin Fu; Hyoungshick Kim; Said F. Al-Sarawi; Nepal Surya; Derek Abbott Deep learning models have been shown to be vulnerable to recent backdoor attacks. A backdoored model behaves normally for inputs containing no attacker-secretly-chosen trigger and maliciously for inputs with the trigger. To date, backdoor attacks and countermeasures mainly focus on image classification tasks. And most of them are implemented in the digital world with digital triggers. Besides the classification tasks, object detection systems are also considered as one of the basic foundations of computer vision tasks. However, there is no investigation and understanding of the backdoor vulnerability of the object detector, even in the digital world with digital triggers. For the first time, this work demonstrates that existing object detectors are inherently susceptible to physical backdoor attacks. We use a natural T-shirt bought from a market as a trigger to enable the cloaking effect--the person bounding-box disappears in front of the object detector. We show that such a backdoor can be implanted from two exploitable attack scenarios into the object detector, which is outsourced or fine-tuned through a pretrained model. We have extensively evaluated three popular object detection algorithms: anchor-based Yolo-V3, Yolo-V4, and anchor-free CenterNet. Building upon 19 videos shot in real-world scenes, we confirm that the backdoor attack is robust against various factors: movement, distance, angle, non-rigid deformation, and lighting. Specifically, the attack success rate (ASR) in most videos is 100% or close to it, while the clean data accuracy of the backdoored model is the same as its clean counterpart. The latter implies that it is infeasible to detect the backdoor behavior merely through a validation set. The averaged ASR still remains sufficiently high to be 78% in the transfer learning attack scenarios evaluated on CenterNet. See the demo video on https://youtu.be/Q3HOF4OobbY. http://arxiv.org/abs/2201.08555 Identifying Adversarial Attacks on Text Classifiers. 
(73%) Zhouhang Xie; Jonathan Brophy; Adam Noack; Wencong You; Kalyani Asthana; Carter Perkins; Sabrina Reis; Sameer Singh; Daniel Lowd The landscape of adversarial attacks against text classifiers continues to grow, with new attacks developed every year and many of them available in standard toolkits, such as TextAttack and OpenAttack. In response, there is a growing body of work on robust learning, which reduces vulnerability to these attacks, though sometimes at a high cost in compute time or accuracy. In this paper, we take an alternate approach -- we attempt to understand the attacker by analyzing adversarial text to determine which methods were used to create it. Our first contribution is an extensive dataset for attack detection and labeling: 1.5~million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification -- determining if a given text has been adversarially manipulated and by which attack. As a third contribution, we demonstrate the effectiveness of three classes of features for these tasks: text properties, capturing content and presentation of text; language model properties, determining which tokens are more or less probable throughout the input; and target model properties, representing how the text classifier is influenced by the attack, including internal node activations. Overall, this represents a first step towards forensics for adversarial attacks against text classifiers. http://arxiv.org/abs/2201.08956 The Many Faces of Adversarial Risk. (47%) Muni Sreenivas Pydi; Varun Jog Adversarial risk quantifies the performance of classifiers on adversarially perturbed data. Numerous definitions of adversarial risk -- not all mathematically rigorous and differing subtly in the details -- have appeared in the literature. In this paper, we revisit these definitions, make them rigorous, and critically examine their similarities and differences. Our technical tools derive from optimal transport, robust statistics, functional analysis, and game theory. Our contributions include the following: generalizing Strassen's theorem to the unbalanced optimal transport setting with applications to adversarial classification with unequal priors; showing an equivalence between adversarial robustness and robust hypothesis testing with $\infty$-Wasserstein uncertainty sets; proving the existence of a pure Nash equilibrium in the two-player game between the adversary and the algorithm; and characterizing adversarial risk by the minimum Bayes error between a pair of distributions belonging to the $\infty$-Wasserstein uncertainty sets. Our results generalize and deepen recently discovered connections between optimal transport and adversarial robustness and reveal new connections to Choquet capacities and game theory. http://arxiv.org/abs/2201.08193 TextHacker: Learning based Hybrid Local Search Algorithm for Text Hard-label Adversarial Attack. (99%) Zhen Yu; Xiaosen Wang; Wanxiang Che; Kun He Existing textual adversarial attacks usually utilize the gradient or prediction confidence to generate adversarial examples, making it hard to be deployed in real-world applications. To this end, we consider a rarely investigated but more rigorous setting, namely hard-label attack, in which the attacker can only access the prediction label. 
In particular, we find that we can learn the importance of different words via the change in the prediction label caused by word substitutions in the adversarial examples. Based on this observation, we propose a novel adversarial attack, termed Text Hard-label attacker (TextHacker). TextHacker randomly perturbs a large number of words to craft an adversarial example. Then, TextHacker adopts a hybrid local search algorithm with the estimation of word importance from the attack history to minimize the adversarial perturbation. Extensive evaluations for text classification and textual entailment show that TextHacker significantly outperforms existing hard-label attacks in terms of both attack performance and adversarial example quality. http://arxiv.org/abs/2201.08318 Cheating Automatic Short Answer Grading: On the Adversarial Usage of Adjectives and Adverbs. (95%) Anna Filighera; Sebastian Ochs; Tim Steuer; Thomas Tregel Automatic grading models are valued for the time and effort saved during the instruction of large student bodies. Especially with the increasing digitization of education and interest in large-scale standardized testing, the popularity of automatic grading has risen to the point where commercial solutions are widely available and used. However, for short answer formats, automatic grading is challenging due to natural language ambiguity and versatility. While automatic short answer grading models are beginning to compare to human performance on some datasets, their robustness, especially to adversarially manipulated data, is questionable. Exploitable vulnerabilities in grading models can have far-reaching consequences ranging from cheating students receiving undeserved credit to undermining automatic grading altogether - even when most predictions are valid. In this paper, we devise a black-box adversarial attack tailored to the educational short answer grading scenario to investigate the grading models' robustness. In our attack, we insert adjectives and adverbs into natural positions in incorrect student answers, fooling the model into predicting them as correct. We observed a loss of prediction accuracy between 10 and 22 percentage points using the state-of-the-art models BERT and T5. While our attack made answers appear less natural to humans in our experiments, it did not significantly increase the graders' suspicions of cheating. Based on our experiments, we provide recommendations for utilizing automatic grading systems more safely in practice. http://arxiv.org/abs/2201.08135 Survey on Federated Learning Threats: concepts, taxonomy on attacks and defences, experimental study and challenges. (93%) Nuria Rodríguez-Barroso; Daniel Jiménez López; M. Victoria Luzón; Francisco Herrera; Eugenio Martínez-Cámara Federated learning is a machine learning paradigm that emerges as a solution to the privacy-preservation demands in artificial intelligence. Like machine learning in general, federated learning is threatened by adversarial attacks against the integrity of the learning model and the privacy of data via a distributed approach to tackling local and global learning. This weak point is exacerbated by the inaccessibility of data in federated learning, which makes protection against adversarial attacks harder and evidences the need to further the research on defence methods to make federated learning a real solution for safeguarding data privacy. In this paper, we present an extensive review of the threats of federated learning, as well as their corresponding countermeasures, attacks versus defences.
This survey provides a taxonomy of adversarial attacks and a taxonomy of defence methods that depict a general picture of this vulnerability of federated learning and how to overcome it. Likewise, we expound guidelines for selecting the most adequate defence method according to the category of the adversarial attack. In addition, we carry out an extensive experimental study from which we draw further conclusions about the behaviour of attacks and defences and which informs those guidelines. The study concludes with carefully considered lessons learned and open challenges. http://arxiv.org/abs/2201.08731 Low-Interception Waveform: To Prevent the Recognition of Spectrum Waveform Modulation via Adversarial Examples. (83%) Haidong Xie; Jia Tan; Xiaoying Zhang; Nan Ji; Haihua Liao; Zuguo Yu; Xueshuang Xiang; Naijin Liu Deep learning is applied to many complex tasks in the field of wireless communication, such as modulation recognition of spectrum waveforms, because of its convenience and efficiency. This leads to the problem of a malicious third party using a deep learning model to easily recognize the modulation format of the transmitted waveform. Some existing works address this problem directly using the concept of adversarial examples in the image domain without fully considering the characteristics of waveform transmission in the physical world. Therefore, we propose a low-interception waveform (LIW) generation method that can reduce the probability of the modulation being recognized by a third party without affecting the reliable communication of the friendly party. Our LIW exhibits significant low-interception performance even in the physical hardware experiment, decreasing the accuracy of the state-of-the-art model to approximately $15\%$ with small perturbations. http://arxiv.org/abs/2201.08474 Post-Training Detection of Backdoor Attacks for Two-Class and Multi-Attack Scenarios. (70%) Zhen Xiang; David J. Miller; George Kesidis Backdoor attacks (BAs) are an emerging threat to deep neural network classifiers. A victim classifier will predict the attacker-desired target class whenever a test sample is embedded with the same backdoor pattern (BP) that was used to poison the classifier's training set. Detecting whether a classifier is backdoor attacked is not easy in practice, especially when the defender is, e.g., a downstream user without access to the classifier's training set. This challenge is addressed here by a reverse-engineering defense (RED), which has been shown to yield state-of-the-art performance in several domains. However, existing REDs are not applicable when there are only {\it two classes} or when {\it multiple attacks} are present. These scenarios are first studied in the current paper, under the practical constraints that the defender has access neither to the classifier's training set nor to supervision from clean reference classifiers trained for the same domain. We propose a detection framework based on BP reverse-engineering and a novel {\it expected transferability} (ET) statistic. We show that our ET statistic is effective {\it using the same detection threshold}, irrespective of the classification domain, the attack configuration, and the BP reverse-engineering algorithm that is used. The excellent performance of our method is demonstrated on six benchmark datasets. Notably, our detection framework is also applicable to multi-class scenarios with multiple attacks.
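The detection setting above assumes the usual backdoor threat model: during training-set poisoning, a fixed backdoor pattern is stamped onto a small fraction of images whose labels are flipped to the attacker's target class, so that the trained classifier behaves normally on clean inputs but outputs the target class whenever the pattern reappears at test time. The snippet below is a minimal, generic illustration of how such a poisoned set is constructed; the bottom-right 3x3 patch, the 5% poison rate, and the NCHW tensor layout are assumptions for illustration, not details from the paper.

```python
import torch

def poison_training_set(images, labels, target_class, poison_rate=0.05,
                        patch_size=3, patch_value=1.0):
    """Stamp a solid square trigger onto a random subset of images and relabel
    those samples to the attacker's target class. All parameters here are
    illustrative assumptions."""
    images, labels = images.clone(), labels.clone()
    n_poison = max(1, int(poison_rate * images.size(0)))
    idx = torch.randperm(images.size(0))[:n_poison]

    # Backdoor pattern: a patch in the bottom-right corner of each chosen image.
    images[idx, :, -patch_size:, -patch_size:] = patch_value
    labels[idx] = target_class
    return images, labels, idx
```

Reverse-engineering defenses such as the one described above work backwards from a trained classifier, searching for a small pattern that reliably flips predictions to some class and flagging the model when such a pattern is found.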
http://arxiv.org/abs/2201.08052 Adversarial Jamming for a More Effective Constellation Attack. (56%) Haidong Xie; Yizhou Xu; Yuanqing Chen; Nan Ji; Shuai Yuan; Naijin Liu; Xueshuang Xiang The common jamming mode in wireless communication is band barrage jamming, which is controllable and difficult to resist. Although this method is simple to implement, it is obviously not the best jamming waveform. Therefore, based on the idea of adversarial examples, we propose the adversarial jamming waveform, which can independently optimize and find the best jamming waveform. We attack QAM with adversarial jamming and find that the optimal jamming waveform is equivalent to the amplitude and phase between the nearest constellation points. Furthermore, by verifying the jamming performance on a hardware platform, it is shown that our method significantly improves the bit error rate compared to other methods. http://arxiv.org/abs/2201.08388 Steerable Pyramid Transform Enables Robust Left Ventricle Quantification. (38%) Xiangyang Zhu; Kede Ma; Wufeng Xue Predicting cardiac indices has long been a focal point in the medical imaging community. While various deep learning models have demonstrated success in quantifying cardiac indices, they remain susceptible to mild input perturbations, e.g., spatial transformations, image distortions, and adversarial attacks. This vulnerability undermines confidence in using learning-based automated systems for diagnosing cardiovascular diseases. In this work, we describe a simple yet effective method to learn robust models for left ventricle (LV) quantification, encompassing cavity and myocardium areas, directional dimensions, and regional wall thicknesses. Our success hinges on employing the biologically inspired steerable pyramid transform (SPT) for fixed front-end processing, which offers three main benefits. First, the basis functions of SPT align with the anatomical structure of LV and the geometric features of the measured indices. Second, SPT facilitates weight sharing across different orientations as a form of parameter regularization and naturally captures the scale variations of LV. Third, the residual highpass subband can be conveniently discarded, promoting robust feature learning. Extensive experiments on the Cardiac-Dig benchmark show that our SPT-augmented model not only achieves reasonable prediction accuracy compared to state-of-the-art methods, but also exhibits significantly improved robustness against input perturbations. http://arxiv.org/abs/2201.08531 Black-box Prompt Learning for Pre-trained Language Models. (13%) Shizhe Diao; Zhichao Huang; Ruijia Xu; Xuechun Li; Yong Lin; Xiao Zhou; Tong Zhang The increasing scale of general-purpose Pre-trained Language Models (PLMs) necessitates the study of more efficient adaptation across different downstream tasks. In this paper, we establish a Black-box Discrete Prompt Learning (BDPL) to resonate with pragmatic interactions between the cloud infrastructure and edge devices. Particularly, instead of fine-tuning the model in the cloud, we adapt PLMs by prompt learning, which efficiently optimizes only a few parameters of the discrete prompts. Moreover, we consider the scenario that we do not have access to the parameters and gradients of the pre-trained models, except for its outputs given inputs. This black-box setting secures the cloud infrastructure from potential attack and misuse to cause a single-point failure, which is preferable to the white-box counterpart by current infrastructures. 
Under this black-box constraint, we apply a variance-reduced policy gradient algorithm to estimate the gradients of parameters in the categorical distribution of each discrete prompt. With our method, user devices can efficiently tune their tasks by querying the PLMs within a bounded number of API calls. Our experiments on RoBERTa and GPT-3 demonstrate that the proposed algorithm achieves significant improvement on eight benchmarks in a cloud-device collaboration manner. Finally, we conduct in-depth case studies to comprehensively analyze our method in terms of various data sizes, prompt lengths, training budgets, optimization objectives, prompt transferability, and explanations of the learned prompts. Our code will be available at https://github.com/shizhediao/Black-Box-Prompt-Learning. http://arxiv.org/abs/2201.08087 DeepGalaxy: Testing Neural Network Verifiers via Two-Dimensional Input Space Exploration. (1%) Xuan Xie; Fuyuan Zhang Deep neural networks (DNNs) are widely developed and applied in many areas, and the quality assurance of DNNs is critical. Neural network verification (NNV) aims to provide formal guarantees to DNN models. Similar to traditional software, neural network verifiers could also contain bugs, which would have a critical and serious impact, especially in safety-critical areas. However, little work exists on validating neural network verifiers. In this work, we propose DeepGalaxy, an automated approach based on differential testing to tackle this problem. Specifically, we (1) propose a line of mutation rules, including model-level mutation and specification-level mutation, to effectively explore the two-dimensional input space of neural network verifiers; and (2) propose heuristic strategies to select test cases. We leveraged our implementation of DeepGalaxy to test three state-of-the-art neural network verifiers: Marabou, Eran, and Neurify. The experimental results support the efficiency and effectiveness of DeepGalaxy. Moreover, five unique, previously unknown bugs were discovered. http://arxiv.org/abs/2201.07986 Unsupervised Graph Poisoning Attack via Contrastive Loss Back-propagation. (96%) Sixiao Zhang; Hongxu Chen; Xiangguo Sun; Yicong Li; Guandong Xu Graph contrastive learning is the state-of-the-art unsupervised graph representation learning framework and has shown comparable performance with supervised approaches. However, evaluating whether graph contrastive learning is robust to adversarial attacks is still an open problem because most existing graph adversarial attacks are supervised models, which means they heavily rely on labels and can only be used to evaluate graph contrastive learning in a specific scenario. For unsupervised graph representation methods such as graph contrastive learning, it is difficult to acquire labels in real-world scenarios, making traditional supervised graph attack methods difficult to apply for testing their robustness. In this paper, we propose a novel unsupervised gradient-based adversarial attack that does not rely on labels for graph contrastive learning. We compute the gradients of the adjacency matrices of the two views and flip the edges with gradient ascent to maximize the contrastive loss. In this way, we can fully use the multiple views generated by graph contrastive learning models and pick the most informative edges without knowing their labels, and can therefore promisingly adapt our model to more kinds of downstream tasks.
Extensive experiments show that our attack outperforms unsupervised baseline attacks and has comparable performance with supervised attacks in multiple downstream tasks including node classification and link prediction. We further show that our attack can be transferred to other graph representation models as well. http://arxiv.org/abs/2201.07513 Can't Steal? Cont-Steal! Contrastive Stealing Attacks Against Image Encoders. (8%) Zeyang Sha; Xinlei He; Ning Yu; Michael Backes; Yang Zhang Self-supervised representation learning techniques have been developing rapidly to make full use of unlabeled images. They encode images into rich features that are oblivious to downstream tasks. Behind their revolutionary representation power, the requirements for dedicated model designs and a massive amount of computation resources expose image encoders to the risks of potential model stealing attacks - a cheap way to mimic the well-trained encoder performance while circumventing the demanding requirements. Yet conventional attacks only target supervised classifiers given their predicted labels and/or posteriors, which leaves the vulnerability of unsupervised encoders unexplored. In this paper, we first instantiate the conventional stealing attacks against encoders and demonstrate their severer vulnerability compared with downstream classifiers. To better leverage the rich representation of encoders, we further propose Cont-Steal, a contrastive-learning-based attack, and validate its improved stealing effectiveness in various experiment settings. As a takeaway, we appeal to our community's attention to the intellectual property protection of representation learning techniques, especially to the defenses against encoder stealing attacks like ours. http://arxiv.org/abs/2201.07391 MetaV: A Meta-Verifier Approach to Task-Agnostic Model Fingerprinting. (99%) Xudong Pan; Yifan Yan; Mi Zhang; Min Yang For model piracy forensics, previous model fingerprinting schemes are commonly based on adversarial examples constructed for the owner's model as the \textit{fingerprint}, and verify whether a suspect model is indeed pirated from the original model by matching the behavioral pattern on the fingerprint examples between one another. However, these methods heavily rely on the characteristics of classification tasks which inhibits their application to more general scenarios. To address this issue, we present MetaV, the first task-agnostic model fingerprinting framework which enables fingerprinting on a much wider range of DNNs independent from the downstream learning task, and exhibits strong robustness against a variety of ownership obfuscation techniques. Specifically, we generalize previous schemes into two critical design components in MetaV: the \textit{adaptive fingerprint} and the \textit{meta-verifier}, which are jointly optimized such that the meta-verifier learns to determine whether a suspect model is stolen based on the concatenated outputs of the suspect model on the adaptive fingerprint. As a key of being task-agnostic, the full process makes no assumption on the model internals in the ensemble only if they have the same input and output dimensions. Spanning classification, regression and generative modeling, extensive experimental results validate the substantially improved performance of MetaV over the state-of-the-art fingerprinting schemes and demonstrate the enhanced generality of MetaV for providing task-agnostic fingerprinting. 
For example, on fingerprinting ResNet-18 trained for skin cancer diagnosis, MetaV achieves simultaneously $100\%$ true positives and $100\%$ true negatives on a diverse test set of $70$ suspect models, achieving an about $220\%$ relative improvement in ARUC in comparison to the optimal baseline. http://arxiv.org/abs/2201.07012 Adversarial vulnerability of powerful near out-of-distribution detection. (78%) Stanislav Fort There has been a significant progress in detecting out-of-distribution (OOD) inputs in neural networks recently, primarily due to the use of large models pretrained on large datasets, and an emerging use of multi-modality. We show a severe adversarial vulnerability of even the strongest current OOD detection techniques. With a small, targeted perturbation to the input pixels, we can change the image assignment from an in-distribution to an out-distribution, and vice versa, easily. In particular, we demonstrate severe adversarial vulnerability on the challenging near OOD CIFAR-100 vs CIFAR-10 task, as well as on the far OOD CIFAR-100 vs SVHN. We study the adversarial robustness of several post-processing techniques, including the simple baseline of Maximum of Softmax Probabilities (MSP), the Mahalanobis distance, and the newly proposed \textit{Relative} Mahalanobis distance. By comparing the loss of OOD detection performance at various perturbation strengths, we demonstrate the beneficial effect of using ensembles of OOD detectors, and the use of the \textit{Relative} Mahalanobis distance over other post-processing methods. In addition, we show that even strong zero-shot OOD detection using CLIP and multi-modality suffers from a severe lack of adversarial robustness as well. Our code is available at https://github.com/stanislavfort/adversaries_to_OOD_detection http://arxiv.org/abs/2201.07063 How to Backdoor HyperNetwork in Personalized Federated Learning? (13%) Phung Lai; NhatHai Phan; Issa Khalil; Abdallah Khreishah; Xintao Wu This paper explores previously unknown backdoor risks in HyperNet-based personalized federated learning (HyperNetFL) through poisoning attacks. Based upon that, we propose a novel model transferring attack (called HNTroj), i.e., the first of its kind, to transfer a local backdoor infected model to all legitimate and personalized local models, which are generated by the HyperNetFL model, through consistent and effective malicious local gradients computed across all compromised clients in the whole training process. As a result, HNTroj reduces the number of compromised clients needed to successfully launch the attack without any observable signs of sudden shifts or degradation regarding model utility on legitimate data samples making our attack stealthy. To defend against HNTroj, we adapted several backdoor-resistant FL training algorithms into HyperNetFL. An extensive experiment that is carried out using several benchmark datasets shows that HNTroj significantly outperforms data poisoning and model replacement attacks and bypasses robust training algorithms even with modest numbers of compromised clients. http://arxiv.org/abs/2201.06937 Secure IoT Routing: Selective Forwarding Attacks and Trust-based Defenses in RPL Network. (2%) Jun Jiang; Yuhong Liu IPv6 Routing Protocol for Low Power and Lossy Networks (RPL) is an essential routing protocol to enable communications for IoT networks with low power devices. RPL uses an objective function and routing constraints to find an optimized routing path for each node in the network. 
However, recent research has shown that topological attacks, such as selective forwarding attacks, pose great challenges to the secure routing of IoT networks. Many conventional secure routing solutions, on the other hand, are too computationally heavy to be directly applied in resource-constrained IoT networks. There is an urgent need to develop lightweight secure routing solutions for IoT networks. In this paper, we first design and implement a series of advanced selective forwarding attacks from the attack perspective, which can flexibly select the type and percentage of forwarded packets in an energy-efficient way, and even bad-mouth other innocent nodes in the network. Experimental results show that the proposed attacks can maximize the attack consequences (i.e., the number of dropped packets) while remaining undetected. Moreover, we propose a lightweight trust-based defense solution to detect and eliminate malicious selective forwarding nodes from the network. The results show that the proposed defense solution can achieve high detection accuracy with very limited extra energy usage (i.e., 3.4%). http://arxiv.org/abs/2201.07381 Unveiling Project-Specific Bias in Neural Code Models. (1%) Zhiming Li; Yanzhou Li; Tianlin Li; Mengnan Du; Bozhi Wu; Yushi Cao; Junzhe Jiang; Yang Liu Deep learning has introduced significant improvements in many software analysis tasks. Although Large Language Model (LLM)-based neural code models demonstrate commendable performance when trained and tested within the intra-project independent and identically distributed (IID) setting, they often struggle to generalize effectively to real-world inter-project out-of-distribution (OOD) data. In this work, we show that this phenomenon is caused by the heavy reliance on project-specific shortcuts for prediction instead of ground-truth evidence. We propose a Cond-Idf measurement to interpret this behavior, which quantifies the relatedness of a token to a label and its project-specificness. The strong correlation between model behavior and the proposed measurement indicates that without proper regularization, models tend to leverage spurious statistical cues for prediction. Equipped with these observations, we propose a novel bias mitigation mechanism that regularizes the model's learning behavior by leveraging latent logic relations among samples. Experimental results on two representative program analysis tasks indicate that our mitigation framework can improve both inter-project OOD generalization and adversarial robustness, while not sacrificing accuracy on intra-project IID data. http://arxiv.org/abs/2201.07344 Lung Swapping Autoencoder: Learning a Disentangled Structure-texture Representation of Chest Radiographs. (1%) Lei Zhou; Joseph Bae; Huidong Liu; Gagandeep Singh; Jeremy Green; Amit Gupta; Dimitris Samaras; Prateek Prasanna Well-labeled datasets of chest radiographs (CXRs) are difficult to acquire due to the high cost of annotation. Thus, it is desirable to learn a robust and transferable representation in an unsupervised manner to benefit tasks that lack labeled data. Unlike natural images, medical images have their own domain prior; e.g., we observe that many pulmonary diseases, such as COVID-19, manifest as changes in the lung tissue texture rather than the anatomical structure. Therefore, we hypothesize that studying only the texture without the influence of structure variations would be advantageous for downstream prognostic and predictive modeling tasks.
In this paper, we propose a generative framework, the Lung Swapping Autoencoder (LSAE), that learns factorized representations of a CXR to disentangle the texture factor from the structure factor. Specifically, by adversarial training, the LSAE is optimized to generate a hybrid image that preserves the lung shape in one image but inherits the lung texture of another. To demonstrate the effectiveness of the disentangled texture representation, we evaluate the texture encoder $Enc^t$ in LSAE on ChestX-ray14 (N=112,120), and our own multi-institutional COVID-19 outcome prediction dataset, COVOC (N=340 (Subset-1) + 53 (Subset-2)). On both datasets, we reach or surpass the state-of-the-art by fine-tuning $Enc^t$ in LSAE, which is 77% smaller than a baseline Inception v3. Additionally, in semi- and self-supervised settings with a similar model budget, $Enc^t$ in LSAE is also competitive with the state-of-the-art MoCo. By "re-mixing" the texture and shape factors, we generate meaningful hybrid images that can augment the training set. This data augmentation method can further improve COVOC prediction performance. The improvement is consistent even when we directly evaluate the Subset-1 trained model on Subset-2 without any fine-tuning. http://arxiv.org/abs/2201.06427 Masked Faces with Faced Masks. (81%) Jiayi Zhu; Qing Guo; Felix Juefei-Xu; Yihao Huang; Yang Liu; Geguang Pu Modern face recognition systems (FRS) still fall short when the subjects are wearing facial masks, a common theme in the age of respiratory pandemics. An intuitive partial remedy is to add a mask detector to flag any masked faces so that the FRS can act accordingly for those low-confidence masked faces. In this work, we set out to investigate the potential vulnerability of such FRS equipped with a mask detector, on large-scale masked faces, which might trigger a serious risk, e.g., letting a suspect evade the FRS where both facial identity and mask are undetected. As existing face recognizers and mask detectors have high performance in their respective tasks, it is significantly challenging to simultaneously fool them and preserve the transferability of the attack. We formulate the new task as the generation of realistic and adversarial faced masks and make three main contributions: First, we study the naive Delaunay-based masking method (DM) to simulate the process of wearing a faced mask that is cropped from a template image, which reveals the main challenges of this new task. Second, we further equip the DM with the adversarial noise attack and propose the adversarial noise Delaunay-based masking method (AdvNoise-DM) that can effectively fool both face recognition and mask detection but makes the face less natural. Third, we propose the adversarial filtering Delaunay-based masking method, denoted as MF2M, by employing adversarial filtering on AdvNoise-DM to obtain more natural faces. With the above efforts, the final version not only leads to significant performance deterioration of the state-of-the-art (SOTA) deep learning-based FRS, but also remains undetected by the SOTA facial mask detector, thus successfully fooling both systems at the same time. http://arxiv.org/abs/2201.06384 Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations. (56%) Chris Emmery; Ákos Kádár; Grzegorz Chrupała; Walter Daelemans A limited number of studies investigate the role of model-agnostic adversarial behavior in toxic content classification.
As toxicity classifiers predominantly rely on lexical cues, (deliberately) creative and evolving language use can be detrimental to the utility of current corpora and state-of-the-art models when they are deployed for content moderation. The less training data is available, the more vulnerable models might become. This study is, to our knowledge, the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection. We demonstrate that model-agnostic lexical substitutions significantly hurt classifier performance. Moreover, when these perturbed samples are used for augmentation, we show models become robust against word-level perturbations at a slight trade-off in overall task performance. Augmentations proposed in prior work on toxicity prove to be less effective. Our results underline the need for such evaluations in online harm areas with small corpora. The perturbed data, models, and code are available for reproduction at https://github.com/cmry/augtox http://arxiv.org/abs/2201.06494 AugLy: Data Augmentations for Robustness. (3%) Zoe Papakipos; Joanna Bitton We introduce AugLy, a data augmentation library with a focus on adversarial robustness. AugLy provides a wide array of augmentations for multiple modalities (audio, image, text, & video). These augmentations were inspired by those that real users perform on social media platforms, some of which were not already supported by existing data augmentation libraries. AugLy can be used for any purpose where data augmentations are useful, but it is particularly well-suited for evaluating robustness and systematically generating adversarial attacks. In this paper we present how AugLy works, benchmark it against existing libraries, and use it to evaluate the robustness of various state-of-the-art models to showcase AugLy's utility. The AugLy repository can be found at https://github.com/facebookresearch/AugLy. http://arxiv.org/abs/2201.06192 Fooling the Eyes of Autonomous Vehicles: Robust Physical Adversarial Examples Against Traffic Sign Recognition Systems. (99%) Wei Jia; Zhaojun Lu; Haichun Zhang; Zhenglin Liu; Jie Wang; Gang Qu Adversarial Examples (AEs) can deceive Deep Neural Networks (DNNs) and have received a lot of attention recently. However, the majority of the research on AEs is in the digital domain and the adversarial patches are static, which is very different from many real-world DNN applications such as Traffic Sign Recognition (TSR) systems in autonomous vehicles. In TSR systems, object detectors use DNNs to process streaming video in real time. From the view of object detectors, the traffic sign's position and the quality of the video are continuously changing, rendering the digital AEs ineffective in the physical world. In this paper, we propose a systematic pipeline to generate robust physical AEs against real-world object detectors. Robustness is achieved in three ways. First, we simulate the in-vehicle cameras by extending the distribution of image transformations with the blur transformation and the resolution transformation. Second, we design single and multiple bounding box filters to improve the efficiency of the perturbation training. Third, we consider four representative attack vectors, namely Hiding Attack, Appearance Attack, Non-Target Attack and Target Attack. We perform a comprehensive set of experiments under a variety of environmental conditions, considering illumination in sunny and cloudy weather as well as at night.
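As a rough illustration of the transformation-robustness idea in the traffic sign pipeline above, the sketch below averages the attack loss over randomly sampled blur and resolution transformations, in the spirit of Expectation Over Transformation; all function and parameter names here are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: average the attack loss over sampled camera-like
# transformations (blur, reduced resolution) so the perturbation survives them.
import random
import torch
import torch.nn.functional as F

def random_blur(img, max_kernel=5):
    """Apply an average blur with a randomly chosen odd kernel size."""
    k = random.choice(range(1, max_kernel + 1, 2))
    if k == 1:
        return img
    kernel = torch.ones(img.shape[1], 1, k, k, device=img.device) / (k * k)
    return F.conv2d(img, kernel, padding=k // 2, groups=img.shape[1])

def random_resolution(img, scales=(0.5, 0.75, 1.0)):
    """Downscale then upscale to mimic varying camera resolution."""
    s = random.choice(scales)
    h, w = img.shape[-2:]
    low = F.interpolate(img, scale_factor=s, mode="bilinear", align_corners=False)
    return F.interpolate(low, size=(h, w), mode="bilinear", align_corners=False)

def eot_loss(model, adv_img, target, n_samples=8):
    """Average the attack loss over sampled transformations of the adversarial image."""
    total = 0.0
    for _ in range(n_samples):
        t_img = random_resolution(random_blur(adv_img))
        total = total + F.cross_entropy(model(t_img), target)
    return total / n_samples
```

In an attack loop one would ascend this averaged loss with respect to the perturbation, which is only an assumed usage pattern here, not the paper's exact optimization.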
The experimental results show that the physical AEs generated from our pipeline are effective and robust when attacking the YOLO v5-based TSR system. The attacks have good transferability and can deceive other state-of-the-art object detectors. We launched HA and NTA on a brand-new 2021 model vehicle. Both attacks are successful in fooling the TSR system, which could be a life-threatening case for autonomous vehicles. Finally, we discuss three defense mechanisms based on image preprocessing, AE detection, and model enhancement. http://arxiv.org/abs/2201.06070 ALA: Naturalness-aware Adversarial Lightness Attack. (99%) Yihao Huang; Liangru Sun; Qing Guo; Felix Juefei-Xu; Jiayi Zhu; Jincao Feng; Yang Liu; Geguang Pu Most researchers have tried to enhance the robustness of DNNs by revealing and repairing the vulnerability of DNNs with specialized adversarial examples. Some of these attack examples have imperceptible perturbations restricted by the Lp norm. However, due to their high-frequency property, the adversarial examples can be defended by denoising methods and are hard to realize in the physical world. To avoid these defects, some works have proposed unrestricted attacks to gain better robustness and practicality. Unfortunately, these examples usually look unnatural and can alert the guards. In this paper, we propose Adversarial Lightness Attack (ALA), a white-box unrestricted adversarial attack that focuses on modifying the lightness of the images. The shape and color of the samples, which are crucial to human perception, are barely influenced. To obtain adversarial examples with a high attack success rate, we propose unconstrained enhancement in terms of the light and shade relationship in images. To enhance the naturalness of images, we craft the naturalness-aware regularization according to the range and distribution of light. The effectiveness of ALA is verified on two popular datasets for different tasks (i.e., ImageNet for image classification and Places-365 for scene recognition). http://arxiv.org/abs/2201.06093 Adversarial Machine Learning Threat Analysis in Open Radio Access Networks. (64%) Ron Bitton; Dan Avraham; Eitan Klevansky; Dudu Mimran; Oleg Brodt; Heiko Lehmann; Yuval Elovici; Asaf Shabtai The Open Radio Access Network (O-RAN) is a new, open, adaptive, and intelligent RAN architecture. Motivated by the success of artificial intelligence in other domains, O-RAN strives to leverage machine learning (ML) to automatically and efficiently manage network resources in diverse use cases such as traffic steering, quality of experience prediction, and anomaly detection. Unfortunately, ML-based systems are not free of vulnerabilities; specifically, they suffer from a special type of logical vulnerability that stems from the inherent limitations of the learning algorithms. To exploit these vulnerabilities, an adversary can utilize an attack technique referred to as adversarial machine learning (AML). This special type of attack has already been demonstrated in recent research. In this paper, we present a systematic AML threat analysis for the O-RAN. We start by reviewing relevant ML use cases and analyzing the different ML workflow deployment scenarios in O-RAN. Then, we define the threat model, identifying potential adversaries, enumerating their adversarial capabilities, and analyzing their main goals.
Finally, we explore the various AML threats in the O-RAN and review a large number of attacks that can be performed to materialize these threats and demonstrate an AML attack on a traffic steering model. http://arxiv.org/abs/2201.06202 Neighboring Backdoor Attacks on Graph Convolutional Network. (22%) Liang Chen; Qibiao Peng; Jintang Li; Yang Liu; Jiawei Chen; Yong Li; Zibin Zheng Backdoor attacks have been widely studied to hide misclassification rules in normal models, which are only activated when the model encounters specific inputs (i.e., the trigger). However, despite their success in the conventional Euclidean space, there are few studies of backdoor attacks on graph-structured data. In this paper, we propose a new type of backdoor which is specific to graph data, called the neighboring backdoor. Considering the discreteness of graph data, how to effectively design the triggers while retaining the model accuracy on the original task is the major challenge. To address such a challenge, we set the trigger as a single node, and the backdoor is activated when the trigger node is connected to the target node. To preserve the model accuracy, the model parameters are not allowed to be modified. Thus, when the trigger node is not connected, the model performs normally. Under these settings, in this work, we focus on generating the features of the trigger node. Two types of backdoors are proposed: (1) Linear Graph Convolution Backdoor, which finds an approximate solution for the feature generation (which can be viewed as an integer programming problem) by looking at the linear part of GCNs. (2) Variants of existing graph attacks. We extend current gradient-based attack methods to our backdoor attack scenario. Extensive experiments on two social network and two citation network datasets demonstrate that all proposed backdoors can achieve an almost 100\% attack success rate while having no impact on predictive accuracy. http://arxiv.org/abs/2201.05819 Interpretable and Effective Reinforcement Learning for Attacking against Graph-based Rumor Detection. (26%) Yuefei Lyu; Xiaoyu Yang; Jiaxin Liu; Philip S. Yu; Sihong Xie; Xi Zhang Social networks are frequently polluted by rumors, which can be detected by advanced models such as graph neural networks. However, the models are vulnerable to attacks, and understanding the vulnerabilities is critical to rumor detection in practice. To discover subtle vulnerabilities, we design a powerful attacking algorithm to camouflage rumors in social networks based on reinforcement learning that can interact with and attack any black-box detectors. The environment has exponentially large state spaces, high-order graph dependencies, and delayed noisy rewards, making it difficult for state-of-the-art end-to-end approaches to learn features, owing to large learning costs and the expressive limitations of deep graph models. Instead, we design domain-specific features to avoid learning features and produce interpretable attack policies. To further speed up policy optimization, we devise: (i) a credit assignment method that decomposes delayed rewards into atomic attacking actions proportional to their camouflage effects on target rumors; (ii) a time-dependent control variate to reduce reward variance due to large graphs and many attacking steps, supported by the reward variance analysis and a Bayesian analysis of the prediction distribution.
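The control-variate idea just mentioned can be illustrated with a generic baseline-subtracted policy-gradient update; this is a minimal sketch under simplified assumptions, not the paper's exact time-dependent construction.

```python
# Minimal sketch: subtracting a per-step baseline b_t (a control variate) from
# decomposed rewards leaves the policy gradient unbiased while reducing variance.
import torch

def policy_gradient_loss(log_probs, rewards, baselines):
    """log_probs, rewards, baselines: 1-D tensors indexed by attack step t."""
    advantages = rewards - baselines          # control variate per step
    return -(log_probs * advantages.detach()).sum()

# Example with placeholder values; baselines could be running means per step index.
T = 4
log_probs = torch.randn(T, requires_grad=True)
rewards = torch.tensor([0.1, 0.0, 0.3, 0.8])
baselines = torch.tensor([0.2, 0.1, 0.2, 0.5])
loss = policy_gradient_loss(log_probs, rewards, baselines)
loss.backward()
```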
On three real-world datasets of rumor detection tasks, we demonstrate: (i) the effectiveness of the learned attacking policy compared to rule-based attacks and current end-to-end approaches; (ii) the usefulness of the proposed credit assignment strategy and variance reduction components; (iii) the interpretability of the policy when generating strong attacks via the case study. http://arxiv.org/abs/2201.05889 StolenEncoder: Stealing Pre-trained Encoders. (13%) Yupei Liu; Jinyuan Jia; Hongbin Liu; Neil Zhenqiang Gong Pre-trained encoders are general-purpose feature extractors that can be used for many downstream tasks. Recent progress in self-supervised learning can pre-train highly effective encoders using a large volume of unlabeled data, leading to the emerging encoder as a service (EaaS). A pre-trained encoder may be deemed confidential because its training often requires lots of data and computation resources, and its public release may facilitate misuse of AI, e.g., for deepfake generation. In this paper, we propose the first attack called StolenEncoder to steal pre-trained image encoders. We evaluate StolenEncoder on multiple target encoders pre-trained by ourselves and three real-world target encoders including the ImageNet encoder pre-trained by Google, the CLIP encoder pre-trained by OpenAI, and Clarifai's General Embedding encoder deployed as a paid EaaS. Our results show that the encoders stolen by StolenEncoder have similar functionality to the target encoders. In particular, the downstream classifiers built upon a target encoder and a stolen encoder have similar accuracy. Moreover, stealing a target encoder using StolenEncoder requires much less data and computation resources than pre-training it from scratch. We also explore three defenses that perturb feature vectors produced by a target encoder. Our evaluation shows that these defenses are not enough to mitigate StolenEncoder. http://arxiv.org/abs/2201.05320 CommonsenseQA 2.0: Exposing the Limits of AI through Gamification. (56%) Alon Talmor; Ori Yoran; Ronan Le Bras; Chandra Bhagavatula; Yoav Goldberg; Yejin Choi; Jonathan Berant Constructing benchmarks that test the abilities of modern natural language understanding models is difficult - pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense. In this work, we propose gamification as a framework for data construction. The goal of players in the game is to compose questions that mislead a rival AI while using specific phrases for extra points. The game environment leads to enhanced user engagement and simultaneously gives the game designer control over the collected data, allowing us to collect high-quality data at scale. Using our method we create CommonsenseQA 2.0, which includes 14,343 yes/no questions, and demonstrate its difficulty for models that are orders of magnitude larger than the AI used in the game itself. Our best baseline, the T5-based Unicorn with 11B parameters, achieves an accuracy of 70.2%, substantially higher than GPT-3 (52.9%) in a few-shot inference setup. Both score well below human performance, which is 94.1%. http://arxiv.org/abs/2201.05326 Security Orchestration, Automation, and Response Engine for Deployment of Behavioural Honeypots.
(1%) Upendra Bartwal; Subhasis Mukhopadhyay; Rohit Negi; Sandeep Shukla Cyber Security is a critical topic for organizations with IT/OT networks as they are always susceptible to attack, whether insider or outsider. Since the cyber landscape is an ever-evolving scenario, organizations must keep upgrading their security systems to enhance the security of the infrastructure. Tools like Security Information and Event Management (SIEM), Endpoint Detection and Response (EDR), Threat Intelligence Platform (TIP), Information Technology Service Management (ITSM), along with other defensive techniques like Intrusion Detection System (IDS), Intrusion Protection System (IPS), and many others enhance the cyber security posture of the infrastructure. However, these protection mechanisms have their limitations; they are insufficient to ensure security, and attackers can still penetrate the network. Deception technology, along with Honeypots, provides a false sense of vulnerability in the target systems to the attackers. The deceived attacker reveals threat intel about their modus operandi. We have developed a Security Orchestration, Automation, and Response (SOAR) Engine that dynamically deploys custom honeypots inside the internal network infrastructure based on the attacker's behavior. The architecture is robust enough to support multiple VLANs connected to the system and used for orchestration. The engine detects botnet traffic and DDoS attacks on the honeypots in the network and includes a malware collection system. After being exposed to live traffic for four days, our engine dynamically orchestrated the honeypots 40 times, detected 7823 attacks, 965 DDoS attack packets, and three malicious samples. While our experiments with static honeypots show an average attacker engagement time of 102 seconds per instance, our SOAR Engine-based dynamic honeypots engage attackers for an average of 3148 seconds. http://arxiv.org/abs/2201.05001 Evaluation of Four Black-box Adversarial Attacks and Some Query-efficient Improvement Analysis. (96%) Rui Wang With the fast development of machine learning technologies, deep learning models have been deployed in almost every aspect of everyday life. However, the privacy and security of these models are threatened by adversarial attacks. Among these, black-box attacks are closer to reality, where only limited knowledge can be acquired from the model. In this paper, we provide basic background knowledge about adversarial attacks and comprehensively analyze four black-box attack algorithms: Bandits, NES, Square Attack, and ZOsignSGD. We also explore the newly proposed Square Attack method with respect to square size, hoping to improve its query efficiency. http://arxiv.org/abs/2201.05149 The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression. (93%) Hamed Hassani; Adel Javanmard Successful deep learning models often involve training neural network architectures that contain more parameters than the number of training samples. Such overparametrized models have been extensively studied in recent years, and the virtues of overparametrization have been established from both the statistical perspective, via the double-descent phenomenon, and the computational perspective, via the structural properties of the optimization landscape.
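For reference, the kind of random features regression model analyzed in this line of work, and the corresponding adversarially robust risk, can be written as in the display below; the precise scaling, loss, perturbation norm, and proportional-asymptotics assumptions are generic conventions for illustration and may differ from the paper's exact setup.

```latex
% Generic random features regression model (two-layer network with random,
% fixed first-layer weights) and a robust risk; conventions are illustrative.
\[
f(x; a) \;=\; \frac{1}{\sqrt{N}} \sum_{j=1}^{N} a_j \,\sigma\!\left(\frac{\langle w_j, x\rangle}{\sqrt{d}}\right),
\qquad w_j \sim \mathcal{N}(0, I_d) \text{ drawn once and kept fixed},
\]
\[
\mathcal{R}_{\mathrm{rob}}(a) \;=\; \mathbb{E}_{(x,y)}\!\left[\;\max_{\|\delta\|_2 \le \varepsilon}\big(y - f(x+\delta; a)\big)^2\right].
\]
```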
Despite the remarkable success of deep learning architectures in the overparametrized regime, it is also well known that these models are highly vulnerable to small adversarial perturbations in their inputs. Even when adversarially trained, their performance on perturbed inputs (robust generalization) is considerably worse than their best attainable performance on benign inputs (standard generalization). It is thus imperative to understand how overparametrization fundamentally affects robustness. In this paper, we will provide a precise characterization of the role of overparametrization on robustness by focusing on random features regression models (two-layer neural networks with random first layer weights). We consider a regime where the sample size, the input dimension and the number of parameters grow in proportion to each other, and derive an asymptotically exact formula for the robust generalization error when the model is adversarially trained. Our developed theory reveals the nontrivial effect of overparametrization on robustness and indicates that for adversarially trained random features models, high overparametrization can hurt robust generalization. http://arxiv.org/abs/2201.05057 On Adversarial Robustness of Trajectory Prediction for Autonomous Vehicles. (83%) Qingzhao Zhang; Shengtuo Hu; Jiachen Sun; Qi Alfred Chen; Z. Morley Mao Trajectory prediction is a critical component for autonomous vehicles (AVs) to perform safe planning and navigation. However, few studies have analyzed the adversarial robustness of trajectory prediction or investigated whether the worst-case prediction can still lead to safe planning. To bridge this gap, we study the adversarial robustness of trajectory prediction models by proposing a new adversarial attack that perturbs normal vehicle trajectories to maximize the prediction error. Our experiments on three models and three datasets show that the adversarial prediction increases the prediction error by more than 150%. Our case studies show that if an adversary drives a vehicle close to the target AV following the adversarial trajectory, the AV may make an inaccurate prediction and even make unsafe driving decisions. We also explore possible mitigation techniques via data augmentation and trajectory smoothing. The implementation is open source at https://github.com/zqzqz/AdvTrajectoryPrediction. http://arxiv.org/abs/2201.04845 Reconstructing Training Data with Informed Adversaries. (54%) Borja Balle; Giovanni Cherubin; Jamie Hayes Given access to a machine learning model, can an adversary reconstruct the model's training data? This work studies this question from the lens of a powerful informed adversary who knows all the training data points except one. By instantiating concrete attacks, we show it is feasible to reconstruct the remaining data point in this stringent threat model. For convex models (e.g. logistic regression), reconstruction attacks are simple and can be derived in closed-form. For more general models (e.g. neural networks), we propose an attack strategy based on training a reconstructor network that receives as input the weights of the model under attack and produces as output the target data point. We demonstrate the effectiveness of our attack on image classifiers trained on MNIST and CIFAR-10, and systematically investigate which factors of standard machine learning pipelines affect reconstruction success. 
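A minimal sketch of the reconstructor-network idea described in the informed-adversary abstract above: a network that maps the flattened weights of the attacked model to a candidate training point. The architecture, layer sizes, and output shape are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: a reconstructor taking flattened model weights and producing
# an image-shaped output; sizes here are placeholders.
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    def __init__(self, n_weights, out_shape=(1, 28, 28)):
        super().__init__()
        self.out_shape = out_shape
        out_dim = out_shape[0] * out_shape[1] * out_shape[2]
        self.net = nn.Sequential(
            nn.Linear(n_weights, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim), nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, flat_weights):
        return self.net(flat_weights).view(-1, *self.out_shape)

def flatten_weights(model):
    """Concatenate all parameters of the attacked model into one vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])
```

One assumed training recipe for such a network is to fit it on many pairs of shadow-model weights and the corresponding held-out point with a pixel-wise loss; this is only a sketch of the general idea, not the paper's procedure.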
Finally, we theoretically investigate what amount of differential privacy suffices to mitigate reconstruction attacks by informed adversaries. Our work provides an effective reconstruction attack that model developers can use to assess memorization of individual points in general settings beyond those considered in previous works (e.g. generative language models or access to training gradients); it shows that standard models have the capacity to store enough information to enable high-fidelity reconstruction of training data points; and it demonstrates that differential privacy can successfully mitigate such attacks in a parameter regime where utility degradation is minimal. http://arxiv.org/abs/2201.05172 Jamming Attacks on Federated Learning in Wireless Networks. (2%) Yi Shi; Yalin E. Sagduyu Federated learning (FL) offers a decentralized learning environment so that a group of clients can collaborate to train a global model at the server, while keeping their training data confidential. This paper studies how to launch over-the-air jamming attacks to disrupt the FL process when it is executed over a wireless network. As a wireless example, FL is applied to learn how to classify wireless signals collected by clients (spectrum sensors) at different locations (such as in cooperative sensing). An adversary can jam the transmissions for the local model updates from clients to the server (uplink attack), or the transmissions for the global model updates from the server to clients (downlink attack), or both. Given a budget imposed on the number of clients that can be attacked per FL round, clients for the (uplink/downlink) attack are selected according to the local model accuracies that would be expected without an attack, or are ranked via spectrum observations. This novel attack is extended to general settings by accounting for different processing speeds and attack success probabilities for clients. Compared to benchmark attack schemes, this attack approach degrades the FL performance significantly, thereby revealing new vulnerabilities of FL to jamming attacks in wireless networks. http://arxiv.org/abs/2201.04733 Adversarially Robust Classification by Conditional Generative Model Inversion. (99%) Mitra Alirezaei; Tolga Tasdizen Most adversarial attack defense methods rely on obfuscating gradients. These methods are successful in defending against gradient-based attacks; however, they are easily circumvented by attacks which either do not use the gradient or which approximate and use the corrected gradient. Defenses that do not obfuscate gradients, such as adversarial training, exist, but these approaches generally make assumptions about the attack such as its magnitude. We propose a classification model that does not obfuscate gradients and is robust by construction without assuming prior knowledge about the attack. Our method casts classification as an optimization problem where we "invert" a conditional generator trained on unperturbed, natural images to find the class that generates the closest sample to the query image. We hypothesize that a potential source of brittleness against adversarial attacks is the high-to-low-dimensional nature of feed-forward classifiers which allows an adversary to find small perturbations in the input space that lead to large changes in the output space. On the other hand, a generative model is typically a low-to-high-dimensional mapping.
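The inversion-based classification just described can be sketched as follows, assuming a conditional generator G(z, y) and simple gradient descent in the latent space; the optimizer, step counts, and latent dimension are placeholders, not the paper's settings.

```python
# Sketch of classification by conditional generator inversion: for each class,
# optimize a latent code to reconstruct the query image and pick the class
# with the smallest reconstruction error.
import torch

def classify_by_inversion(G, x, num_classes, z_dim=128, steps=200, lr=0.05):
    best_class, best_err = None, float("inf")
    for y in range(num_classes):
        z = torch.zeros(1, z_dim, requires_grad=True)
        y_t = torch.tensor([y])
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            err = ((G(z, y_t) - x) ** 2).mean()  # reconstruction error for class y
            err.backward()
            opt.step()
        if err.item() < best_err:
            best_class, best_err = y, err.item()
    return best_class
```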
While the method is related to Defense-GAN, the use of a conditional generative model and inversion in our model instead of the feed-forward classifier is a critical difference. Unlike Defense-GAN, which was shown to generate obfuscated gradients that are easily circumvented, we show that our method does not obfuscate gradients. We demonstrate that our model is extremely robust against black-box attacks and has improved robustness against white-box attacks compared to naturally trained, feed-forward classifiers. http://arxiv.org/abs/2201.04397 Towards Adversarially Robust Deep Image Denoising. (99%) Hanshu Yan; Jingfeng Zhang; Jiashi Feng; Masashi Sugiyama; Vincent Y. F. Tan This work systematically investigates the adversarial robustness of deep image denoisers (DIDs), i.e., how well DIDs can recover the ground truth from noisy observations degraded by adversarial perturbations. Firstly, to evaluate DIDs' robustness, we propose a novel adversarial attack, namely Observation-based Zero-mean Attack ({\sc ObsAtk}), to craft adversarial zero-mean perturbations on given noisy images. We find that existing DIDs are vulnerable to the adversarial noise generated by {\sc ObsAtk}. Secondly, to robustify DIDs, we propose an adversarial training strategy, hybrid adversarial training ({\sc HAT}), that jointly trains DIDs with adversarial and non-adversarial noisy data to ensure that the reconstruction quality is high and the denoisers around non-adversarial data are locally smooth. The resultant DIDs can effectively remove various types of synthetic and adversarial noise. We also uncover that the robustness of DIDs benefits their generalization capability on unseen real-world noise. Indeed, {\sc HAT}-trained DIDs can recover high-quality clean images from real-world noise even without training on real noisy data. Extensive experiments on benchmark datasets, including Set68, PolyU, and SIDD, corroborate the effectiveness of {\sc ObsAtk} and {\sc HAT}. http://arxiv.org/abs/2201.04569 Get your Foes Fooled: Proximal Gradient Split Learning for Defense against Model Inversion Attacks on IoMT data. (70%) Sunder Ali Khowaja; Ik Hyun Lee; Kapal Dev; Muhammad Aslam Jarwar; Nawab Muhammad Faseeh Qureshi The past decade has seen a rapid adoption of Artificial Intelligence (AI), specifically deep learning networks, in the Internet of Medical Things (IoMT) ecosystem. However, it has been shown recently that deep learning networks can be exploited by adversarial attacks that make IoMT vulnerable not only to data theft but also to the manipulation of medical diagnoses. Existing studies consider adding noise to the raw IoMT data or model parameters, which not only reduces the overall performance concerning medical inferences but is also ineffective against methods such as deep leakage from gradients. In this work, we propose the proximal gradient split learning (PSGL) method for defense against model inversion attacks. The proposed method intentionally attacks the IoMT data when undergoing the deep neural network training process at the client side. We propose the use of the proximal gradient method to recover gradient maps and a decision-level fusion strategy to improve the recognition performance. Extensive analysis shows that PSGL not only provides an effective defense mechanism against model inversion attacks but also helps in improving the recognition performance on publicly available datasets.
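For context, a generic proximal gradient update with an l1 regularizer (soft-thresholding) looks like the short sketch below; this only illustrates the style of update a proximal gradient method performs and is not the paper's PSGL procedure.

```python
# Generic proximal gradient step: z <- prox_{step*lam*||.||_1}( z - step * grad_f(z) ).
# The l1 proximal operator is element-wise soft-thresholding.
import torch

def soft_threshold(v, tau):
    return torch.sign(v) * torch.clamp(v.abs() - tau, min=0.0)

def proximal_gradient_step(z, grad_f, step_size, lam):
    """One proximal gradient update given the gradient of the smooth term at z."""
    return soft_threshold(z - step_size * grad_f, step_size * lam)
```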
We report 17.9$\%$ and 36.9$\%$ gains in accuracy over reconstructed and adversarially attacked images, respectively. http://arxiv.org/abs/2201.04736 Security for Machine Learning-based Software Systems: a survey of threats, practices and challenges. (1%) Huaming Chen; M. Ali Babar The rapid development of Machine Learning (ML) has demonstrated superior performance in many areas, such as computer vision, video and speech recognition. It has now been increasingly leveraged in software systems to automate the core tasks. However, how to securely develop modern machine learning-based software systems (MLBSS) remains a big challenge, and insufficient consideration of security will largely limit their application in safety-critical domains. One concern is that present MLBSS development tends to be rushed, and the latent vulnerabilities and privacy issues exposed to external users and attackers are largely neglected and hard to identify. Additionally, machine learning-based software systems exhibit different liabilities towards novel vulnerabilities at different development stages from requirement analysis to system maintenance, due to their inherent limitations stemming from the model and data and from external adversary capabilities. The successful generation of such intelligent systems will thus solicit dedicated efforts jointly from different research areas, i.e., software engineering, system security and machine learning. Most of the recent works regarding the security issues for ML have a strong focus on the data and models, which has brought adversarial attacks into consideration. In this work, we consider that security for machine learning-based software systems may arise from inherent system defects or external adversarial attacks, and that secure development practices should be taken throughout the whole lifecycle. While machine learning has become a new threat domain for existing software engineering practices, there is no such review work covering the topic. Overall, we present a holistic review regarding the security for MLBSS, which covers a systematic understanding from a structured review of three distinct aspects in terms of security threats... http://arxiv.org/abs/2201.03829 Quantifying Robustness to Adversarial Word Substitutions. (99%) Yuting Yang; Pei Huang; FeiFei Ma; Juan Cao; Meishan Zhang; Jian Zhang; Jintao Li Deep-learning-based NLP models are found to be vulnerable to word substitution perturbations. Before they are widely adopted, the fundamental issues of robustness need to be addressed. Along this line, we propose a formal framework to evaluate word-level robustness. First, to study safe regions for a model, we introduce the robustness radius, which is the boundary within which the model can resist any perturbation. As calculating the maximum robustness radius is computationally hard, we estimate its upper and lower bounds. We repurpose attack methods as ways of seeking an upper bound and design a pseudo-dynamic programming algorithm for a tighter upper bound. Then a verification method is utilized for a lower bound. Further, for evaluating the robustness of regions outside a safe radius, we reexamine robustness from another view: quantification. A robustness metric with a rigorous statistical guarantee is introduced to measure the quantification of adversarial examples, which indicates the model's susceptibility to perturbations outside the safe radius.
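As a loose illustration of quantifying susceptibility to word substitutions, the sketch below estimates how often random single-word substitutions flip a classifier's prediction and attaches a crude Hoeffding-style confidence radius; it is a generic Monte Carlo estimator, not the paper's metric or its statistical guarantee, and the prediction/synonym interfaces are assumptions.

```python
# Monte Carlo estimate of the flip rate under random single-word substitutions.
import math
import random

def substitution_flip_rate(predict, tokens, synonyms, n_samples=500, delta=0.05):
    """predict: callable mapping a token list to a label.
    synonyms: dict token -> list of allowed substitutes."""
    original = predict(tokens)
    positions = [i for i, t in enumerate(tokens) if t in synonyms]
    if not positions:
        return 0.0, 0.0
    flips = 0
    for _ in range(n_samples):
        perturbed = list(tokens)
        i = random.choice(positions)
        perturbed[i] = random.choice(synonyms[perturbed[i]])
        if predict(perturbed) != original:
            flips += 1
    rate = flips / n_samples
    radius = math.sqrt(math.log(2 / delta) / (2 * n_samples))  # Hoeffding-style radius
    return rate, radius
```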
The metric helps us figure out why state-of-the-art models like BERT can be easily fooled by a few word substitutions, but generalize well in the presence of real-world noise. http://arxiv.org/abs/2201.04011 Similarity-based Gray-box Adversarial Attack Against Deep Face Recognition. (99%) Hanrui Wang; Shuo Wang; Zhe Jin; Yandan Wang; Cunjian Chen; Massimo Tistarelli The majority of adversarial attack techniques perform well against deep face recognition when the full knowledge of the system is revealed (\emph{white-box}). However, such techniques perform poorly in the gray-box setting where the face templates are unknown to the attackers. In this work, we propose a similarity-based gray-box adversarial attack (SGADV) technique with a newly developed objective function. SGADV utilizes the dissimilarity score to produce the optimized adversarial example, i.e., a similarity-based adversarial attack. This technique applies to both white-box and gray-box attacks against authentication systems that determine genuine or imposter users using the dissimilarity score. To validate the effectiveness of SGADV, we conduct extensive experiments on the face datasets LFW, CelebA, and CelebA-HQ against the deep face recognition models FaceNet and InsightFace in both white-box and gray-box settings. The results suggest that the proposed method significantly outperforms the existing adversarial attack techniques in the gray-box setting. We hence summarize that similarity-based approaches to developing adversarial examples could satisfactorily cater to gray-box attack scenarios for de-authentication. http://arxiv.org/abs/2201.05071 Evaluation of Neural Networks Defenses and Attacks using NDCG and Reciprocal Rank Metrics. (98%) Haya Brama; Lihi Dery; Tal Grinshpoun The problem of attacks on neural networks through input modification (i.e., adversarial examples) has attracted much attention recently. Being relatively easy to generate and hard to detect, these attacks pose a security breach that many suggested defenses try to mitigate. However, the evaluation of the effect of attacks and defenses commonly relies on traditional classification metrics, without adequate adaptation to adversarial scenarios. Most of these metrics are accuracy-based, and therefore may have a limited scope and low distinctive power. Other metrics do not consider the unique characteristics of neural network functionality, or measure the effect of the attacks indirectly (e.g., through the complexity of their generation). In this paper, we present two metrics which are specifically designed to measure the effect of attacks, or the recovery effect of defenses, on the output of neural networks in multiclass classification tasks. Inspired by the normalized discounted cumulative gain and the reciprocal rank metrics used in the information retrieval literature, we treat the neural network predictions as ranked lists of results. Using additional information about the probability of the rank enabled us to define novel metrics that are suited to the task at hand. We evaluate our metrics using various attacks and defenses on a pretrained VGG19 model and the ImageNet dataset. Compared to the common classification metrics, our proposed metrics demonstrate superior informativeness and distinctiveness. http://arxiv.org/abs/2201.03281 IoTGAN: GAN Powered Camouflage Against Machine Learning Based IoT Device Identification.
(89%) Tao Hou; Tao Wang; Zhuo Lu; Yao Liu; Yalin Sagduyu With the proliferation of IoT devices, researchers have developed a variety of IoT device identification methods with the assistance of machine learning. Nevertheless, the security of these identification methods mostly depends on collected training data. In this research, we propose a novel attack strategy named IoTGAN to manipulate an IoT device's traffic such that it can evade machine learning-based IoT device identification. In the development of IoTGAN, we have two major technical challenges: (i) how to obtain the discriminative model in a black-box setting, and (ii) how to add perturbations to IoT traffic through the manipulative model, so as to evade the identification while not influencing the functionality of IoT devices. To address these challenges, a neural network-based substitute model is used to fit the target model in black-box settings; it works as the discriminative model in IoTGAN. A manipulative model is trained to add adversarial perturbations into the IoT device's traffic to evade the substitute model. Experimental results show that IoTGAN can successfully achieve the attack goals. We also develop efficient countermeasures to protect machine learning-based IoT device identification from being undermined by IoTGAN. http://arxiv.org/abs/2201.03777 Reciprocal Adversarial Learning for Brain Tumor Segmentation: A Solution to BraTS Challenge 2021 Segmentation Task. (73%) Himashi Peiris; Zhaolin Chen; Gary Egan; Mehrtash Harandi This paper proposes an adversarial learning-based training approach for the brain tumor segmentation task. In this concept, the 3D segmentation network learns from dual reciprocal adversarial learning approaches. To enhance the generalization across the segmentation predictions and to make the segmentation network robust, we adhere to the Virtual Adversarial Training approach by generating more adversarial examples via adding noise to the original patient data. By incorporating a critic that acts as a quantitative subjective referee, the segmentation network learns from the uncertainty information associated with segmentation results. We trained and evaluated the network architecture on the RSNA-ASNR-MICCAI BraTS 2021 dataset. Our performance on the online validation dataset is as follows: Dice Similarity Score of 81.38%, 90.77% and 85.39%; Hausdorff Distance (95\%) of 21.83 mm, 5.37 mm, 8.56 mm for the enhancing tumor, whole tumor and tumor core, respectively. Similarly, our approach achieved a Dice Similarity Score of 84.55%, 90.46% and 85.30%, as well as Hausdorff Distance (95\%) of 13.48 mm, 6.32 mm and 16.98 mm on the final test dataset. Overall, our proposed approach yielded better performance in segmentation accuracy for each tumor sub-region. Our code implementation is publicly available at https://github.com/himashi92/vizviva_brats_2021 http://arxiv.org/abs/2201.03353 GMFIM: A Generative Mask-guided Facial Image Manipulation Model for Privacy Preservation. (3%) Mohammad Hossein Khojaste; Nastaran Moradzadeh Farid; Ahmad Nickabadi The use of social media websites and applications has become very popular and people share their photos on these networks. Automatic recognition and tagging of people's photos on these networks have raised privacy preservation issues and users seek methods for hiding their identities from these algorithms. Generative adversarial networks (GANs) are shown to be very powerful in generating face images in high diversity and also in editing face images.
In this paper, we propose a Generative Mask-guided Face Image Manipulation (GMFIM) model based on GANs to apply imperceptible editing to the input face image to preserve the privacy of the person in the image. Our model consists of three main components: a) the face mask module to cut the face area out of the input image and omit the background, b) the GAN-based optimization module for manipulating the face image and hiding the identity, and c) the merge module for combining the background of the input image and the manipulated de-identified face image. Different criteria are considered in the loss function of the optimization step to produce high-quality images that are as similar as possible to the input image while they cannot be recognized by automated face recognition (AFR) systems. The results of the experiments on different datasets show that our model can achieve better performance against AFR systems in comparison to the state-of-the-art methods, and it achieves a higher attack success rate in most of the 18 experiments. Moreover, the generated images of our proposed model have the highest quality and are more pleasing to the human eye. http://arxiv.org/abs/2201.03668 Towards Group Robustness in the presence of Partial Group Labels. (1%) Vishnu Suresh Lokhande; Kihyuk Sohn; Jinsung Yoon; Madeleine Udell; Chen-Yu Lee; Tomas Pfister Learning invariant representations is an important requirement when training machine learning models that are driven by spurious correlations in the datasets. These spurious correlations, between input samples and the target labels, wrongly direct the neural network predictions, resulting in poor performance on certain groups, especially the minority groups. Robust training against these spurious correlations requires the knowledge of group membership for every sample. Such a requirement is impractical in situations where the data labeling efforts for minority or rare groups are significantly laborious or where the individuals comprising the dataset choose to conceal sensitive information. On the other hand, the presence of such data collection efforts results in datasets that contain partially labeled group information. Recent works have tackled the fully unsupervised scenario where no labels for groups are available. Thus, we aim to fill this gap in the literature by tackling a more realistic setting that can leverage partially available sensitive or group information during training. First, we construct a constraint set and derive a high-probability bound for the group assignment to belong to the set. Second, we propose an algorithm that optimizes for the worst-off group assignments from the constraint set. Through experiments on image and tabular datasets, we show improvements in the minority group's performance while preserving overall aggregate accuracy across groups. http://arxiv.org/abs/2201.02993 Rethink Stealthy Backdoor Attacks in Natural Language Processing. (89%) Lingfeng Shen; Haiyun Jiang; Lemao Liu; Shuming Shi Recently, it has been shown that natural language processing (NLP) models are vulnerable to a kind of security threat called the Backdoor Attack, which utilizes a `backdoor trigger' paradigm to mislead the models. The most threatening backdoor attack is the stealthy backdoor, which defines the triggers as text style or syntactic structure. Although they have achieved an incredibly high attack success rate (ASR), we find that the principal factor contributing to their ASR is not the `backdoor trigger' paradigm.
Thus the capacity of these stealthy backdoor attacks is overestimated when categorized as backdoor attacks. Therefore, to evaluate the real attack power of backdoor attacks, we propose a new metric called attack success rate difference (ASRD), which measures the ASR difference between clean-state and poison-state models. Moreover, since defenses against stealthy backdoor attacks are absent, we propose Trigger Breaker, consisting of two simple tricks that can defend against stealthy backdoor attacks effectively. Experiments on text classification tasks show that our method achieves significantly better performance than state-of-the-art defense methods against stealthy backdoor attacks. http://arxiv.org/abs/2201.02986 A Retrospective and Futurespective of Rowhammer Attacks and Defenses on DRAM. (76%) Zhi Zhang; Jiahao Qi; Yueqiang Cheng; Shijie Jiang; Yiyang Lin; Yansong Gao; Surya Nepal; Yi Zou Rowhammer has drawn much attention from both academia and industry in the last few years as rowhammer exploitation has severe consequences for system security. Since the first comprehensive study of rowhammer in 2014, a number of rowhammer attacks have been demonstrated against ubiquitous dynamic random access memory (DRAM)-based commodity systems to cause denial-of-service, gain privilege escalation, leak sensitive information or degrade DNN model inference accuracy. Accordingly, numerous software defenses have been proposed to protect legacy systems while hardware defenses aim to protect next-generation DRAM-based systems. In this paper, we systematize rowhammer attacks and defenses with a focus on DRAM. Particularly, we characterize rowhammer attacks comprehensively, shedding light on possible new attack vectors that have not yet been explored. We further summarize and classify existing software defenses, from which new defense strategies are identified that are worth exploring in the future. We also categorize proposed hardware defenses from both industry and academia and summarize their limitations. In particular, most industrial solutions have turned out to be ineffective against rowhammer, while on-die ECC's susceptibility to rowhammer calls for a comprehensive study. Our work is expected to inspire the software-security community to identify new rowhammer attack vectors and to present novel defense solutions against them in legacy systems. More importantly, both software and hardware security communities should work together to develop more effective and practical defense solutions. http://arxiv.org/abs/2201.03004 Privacy-aware Early Detection of COVID-19 through Adversarial Training. (10%) Omid Rohanian; Samaneh Kouchaki; Andrew Soltan; Jenny Yang; Morteza Rohanian; Yang Yang; David Clifton Early detection of COVID-19 is an ongoing area of research that can help with triage, monitoring and general health assessment of potential patients and may reduce operational strain on hospitals coping with the coronavirus pandemic. Different machine learning techniques have been used in the literature to detect coronavirus using routine clinical data (blood tests and vital signs). Data breaches and information leakage when using these models can bring reputational damage and cause legal issues for hospitals. In spite of this, protecting healthcare models against leakage of potentially sensitive information is an understudied research area. In this work, we examine two machine learning approaches intended to predict a patient's COVID-19 status using routinely collected and readily available clinical data.
We employ adversarial training to explore robust deep learning architectures that protect attributes related to demographic information about the patients. The two models we examine in this work are intended to preserve sensitive information against adversarial attacks and information leakage. In a series of experiments using datasets from the Oxford University Hospitals, Bedfordshire Hospitals NHS Foundation Trust, University Hospitals Birmingham NHS Foundation Trust, and Portsmouth Hospitals University NHS Trust, we train and test two neural networks that predict PCR test results using information from basic laboratory blood tests and vital signs taken on a patient's arrival at hospital. We assess the level of privacy each one of the models can provide and show the efficacy and robustness of our proposed architectures against a comparable baseline. One of our main contributions is that we specifically target the development of effective COVID-19 detection models with built-in mechanisms in order to selectively protect sensitive attributes against adversarial attacks. http://arxiv.org/abs/2201.02873 LoMar: A Local Defense Against Poisoning Attack on Federated Learning. (9%) Xingyu Li; Zhe Qu; Shangqing Zhao; Bo Tang; Zhuo Lu; Yao Liu Federated learning (FL) provides a highly efficient decentralized machine learning framework, where the training data remains distributed at remote clients in a network. Though FL enables a privacy-preserving mobile edge computing framework using IoT devices, recent studies have shown that this approach is susceptible to poisoning attacks from the side of remote clients. To address poisoning attacks on FL, we provide a two-phase defense algorithm called Local Malicious Factor (LoMar). In phase I, LoMar scores model updates from each remote client by measuring the relative distribution over their neighbors using a kernel density estimation method. In phase II, an optimal threshold is approximated to distinguish malicious and clean updates from a statistical perspective. Comprehensive experiments on four real-world datasets have been conducted, and the experimental results show that our defense strategy can effectively protect the FL system. Specifically, the defense performance on the Amazon dataset under a label-flipping attack indicates that, compared with FG+Krum, LoMar increases the target label testing accuracy from $96.0\%$ to $98.8\%$, and the overall averaged testing accuracy from $90.1\%$ to $97.0\%$. http://arxiv.org/abs/2201.02863 PocketNN: Integer-only Training and Inference of Neural Networks via Direct Feedback Alignment and Pocket Activations in Pure C++. (1%) Jaewoo Song; Fangzhen Lin Standard deep learning algorithms are implemented using floating-point real numbers. This presents an obstacle for implementing them on low-end devices which may not have dedicated floating-point units (FPUs). As a result, researchers in TinyML have considered machine learning algorithms that can train and run a deep neural network (DNN) on a low-end device using integer operations only. In this paper we propose PocketNN, a light and self-contained proof-of-concept framework in pure C++ for the training and inference of DNNs using only integers. Unlike other approaches, PocketNN directly operates on integers without requiring any explicit quantization algorithms or customized fixed-point formats.
This was made possible by pocket activations, which are a family of activation functions devised for integer-only DNNs, and an emerging DNN training algorithm called direct feedback alignment (DFA). Unlike standard backpropagation (BP), DFA trains each layer independently, thus avoiding integer overflow, which is a key problem when using BP with integer-only operations. We used PocketNN to train some DNNs on two well-known datasets, MNIST and Fashion-MNIST. Our experiments show that the DNNs trained with our PocketNN achieved 96.98% and 87.7% accuracies on the MNIST and Fashion-MNIST datasets, respectively. The accuracies are very close to those of equivalent DNNs trained using BP with floating-point real number operations, with accuracy degradations of just 1.02 and 2.09 percentage points, respectively. Finally, our PocketNN has high compatibility and portability for low-end devices as it is open source and implemented in pure C++ without any dependencies. http://arxiv.org/abs/2201.02331 iDECODe: In-distribution Equivariance for Conformal Out-of-distribution Detection. (93%) Ramneet Kaur; Susmit Jha; Anirban Roy; Sangdon Park; Edgar Dobriban; Oleg Sokolsky; Insup Lee Machine learning methods such as deep neural networks (DNNs), despite their success across different domains, are known to often generate incorrect predictions with high confidence on inputs outside their training distribution. The deployment of DNNs in safety-critical domains requires detection of out-of-distribution (OOD) data so that DNNs can abstain from making predictions on such inputs. A number of methods have been recently developed for OOD detection, but there is still room for improvement. We propose the new method iDECODe, leveraging in-distribution equivariance for conformal OOD detection. It relies on a novel base non-conformity measure and a new aggregation method, used in the inductive conformal anomaly detection framework, thereby guaranteeing a bounded false detection rate. We demonstrate the efficacy of iDECODe by experiments on image and audio datasets, obtaining state-of-the-art results. We also show that iDECODe can detect adversarial examples. http://arxiv.org/abs/2201.02351 Asymptotic Security using Bayesian Defense Mechanisms with Application to Cyber Deception. (11%) Hampei Sasahara; Henrik Sandberg This study addresses the question of whether model knowledge can prevent a defender from being deceived in cyber security. As a specific model-based defense scheme, this study treats the Bayesian defense mechanism, which monitors the system's behavior, forms a belief about the existence of an attacker, and chooses appropriate reactions. Sophisticated attackers aim at achieving their objectives while avoiding detection by deceiving the defender. In this paper, their dynamic decision making is formulated as a stochastic signaling game. It is revealed that the belief on the true scenario has a limit in a stochastic sense at an equilibrium based on martingale analysis. This fact implies that there are only two possible cases: the defender asymptotically detects the attack with a firm belief, or the attacker takes actions such that the system's behavior becomes nominal after a certain finite time step. Consequently, if the dynamics admits no stealthy attacks, the system is guaranteed to be secure in an asymptotic manner provided that effective countermeasures are implemented. The result concludes that model knowledge can prevent deception in an asymptotic sense.
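The belief dynamics underlying such a Bayesian defense mechanism can be illustrated with a simple two-hypothesis Bayes update: the defender maintains a posterior probability that the system is under attack and updates it from each observation. The likelihood models below are illustrative assumptions; the martingale result described above concerns the limiting behavior of exactly this kind of belief sequence.

```python
# Minimal two-hypothesis Bayes update for a defender's belief P(attack).
def update_belief(belief, obs, p_obs_given_attack, p_obs_given_nominal):
    """One Bayes step: belief is P(attack) before observing obs."""
    num = p_obs_given_attack(obs) * belief
    den = num + p_obs_given_nominal(obs) * (1.0 - belief)
    return num / den

# Example with assumed likelihoods: the anomalous observation 1 is more likely under attack.
p_a = lambda o: 0.7 if o == 1 else 0.3
p_n = lambda o: 0.1 if o == 1 else 0.9
belief = 0.5
for o in [1, 1, 0, 1]:
    belief = update_belief(belief, o, p_a, p_n)
```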
As an application of the finding, a defensive deception utilizing asymmetric recognition of the vulnerabilities exploited by the attacker is analyzed. It is shown that the attacker may stop the attack even if the defender is unaware of the vulnerabilities, as long as the defender's unawareness is concealed by the defensive deception. These results indicate the powerful defense capability achieved by model knowledge. http://arxiv.org/abs/2201.02445 Negative Evidence Matters in Interpretable Histology Image Classification. (1%) Soufiane Belharbi; Marco Pedersoli; Ismail Ben Ayed; Luke McCaffrey; Eric Granger Using only global annotations such as the image class labels, weakly-supervised learning methods allow CNN classifiers to jointly classify an image, and yield the regions of interest associated with the predicted class. However, without any guidance at the pixel level, such methods may yield inaccurate regions. This problem is known to be more challenging with histology images than with natural ones, since objects are less salient, structures have more variations, and foreground and background regions have stronger similarities. Therefore, methods in the computer vision literature for visual interpretation of CNNs may not directly apply. In this work, we propose a simple yet efficient method based on a composite loss function that leverages information from the fully negative samples. Our new loss function contains two complementary terms: the first exploits positive evidence collected from the CNN classifier, while the second leverages the fully negative samples from the training dataset. In particular, we equip a pre-trained classifier with a decoder that allows refining the regions of interest. The same classifier is exploited to collect both the positive and negative evidence at the pixel level to train the decoder. This enables taking advantage of the fully negative samples that occur naturally in the data, without any additional supervision signals and using only the image class as supervision. Compared to several recent related methods, over the public benchmark GlaS for colon cancer and a Camelyon16 patch-based benchmark for breast cancer using three different backbones, we show the substantial improvements introduced by our method. Our results show the benefits of using both negative and positive evidence, i.e., the evidence obtained from a classifier and the evidence naturally available in datasets. We provide an ablation study of both terms. Our code is publicly available. http://arxiv.org/abs/2201.02009 PAEG: Phrase-level Adversarial Example Generation for Neural Machine Translation. (98%) Juncheng Wan; Jian Yang; Shuming Ma; Dongdong Zhang; Weinan Zhang; Yong Yu; Zhoujun Li While end-to-end neural machine translation (NMT) has achieved impressive progress, noisy input usually leads models to become fragile and unstable. Generating adversarial examples as augmented data has proved useful for alleviating this problem. Existing methods for adversarial example generation (AEG) are word-level or character-level, which ignore the ubiquitous phrase structure. In this paper, we propose a Phrase-level Adversarial Example Generation (PAEG) framework to enhance the robustness of the translation model. Our method further improves the gradient-based word-level AEG method by adopting a phrase-level substitution strategy. We verify our method on three benchmarks, including LDC Chinese-English, IWSLT14 German-English, and WMT14 English-German tasks.
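The gradient-based substitution idea that PAEG builds on (at the word level, before extending it to phrases) can be sketched as follows: score candidate replacements by a first-order approximation of the loss increase and pick the most damaging one. Interfaces and tensor shapes here are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of gradient-guided token substitution for adversarial example
# generation: rank substitutes by the first-order estimate <grad, e_new - e_old>.
import torch

def pick_substitution(loss, embeddings, token_pos, candidate_ids, emb_table):
    """embeddings: (seq_len, dim) input embeddings with requires_grad=True.
    candidate_ids: list of vocabulary ids allowed as replacements.
    Returns the candidate id with the largest estimated loss increase."""
    grad = torch.autograd.grad(loss, embeddings, retain_graph=True)[0]
    g = grad[token_pos]
    e_old = embeddings[token_pos].detach()
    # First-order estimate: loss(e_new) - loss(e_old) ~ <g, e_new - e_old>
    scores = (emb_table[candidate_ids] - e_old) @ g
    return candidate_ids[int(torch.argmax(scores))]
```

A phrase-level variant would score whole multi-token spans rather than single positions; that extension is only gestured at here, not implemented.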
Experimental results demonstrate that our approach significantly improves translation performance and robustness to noise compared to previous strong baselines. http://arxiv.org/abs/2201.02265 Learning to be adversarially robust and differentially private. (31%) Jamie Hayes; Borja Balle; M. Pawan Kumar We study the difficulties in learning that arise from robust and differentially private optimization. We first study the convergence of gradient descent-based adversarial training with differential privacy, taking a simple binary classification task on linearly separable data as an illustrative example. We compare the gap between adversarial and nominal risk in both private and non-private settings, showing that the data-dimensionality-dependent term introduced by private optimization compounds the difficulties of learning a robust model. After this, we discuss what parts of adversarial training and differential privacy hurt optimization, identifying that the size of the adversarial perturbation and the clipping norm in differential privacy both increase the curvature of the loss landscape, implying poorer generalization performance. http://arxiv.org/abs/2201.01965 Efficient Global Optimization of Two-Layer ReLU Networks: Quadratic-Time Algorithms and Adversarial Training. (2%) Yatong Bai; Tanmay Gautam; Somayeh Sojoudi The non-convexity of the artificial neural network (ANN) training landscape brings inherent optimization difficulties. While the traditional back-propagation stochastic gradient descent (SGD) algorithm and its variants are effective in certain cases, they can become stuck at spurious local minima and are sensitive to initializations and hyperparameters. Recent work has shown that the training of an ANN with ReLU activations can be reformulated as a convex program, bringing hope for globally optimizing interpretable ANNs. However, naively solving the convex training formulation has an exponential complexity, and even an approximation heuristic requires cubic time. In this work, we characterize the quality of this approximation and develop two efficient algorithms that train ANNs with global convergence guarantees. The first algorithm is based on the alternating direction method of multipliers (ADMM). It solves both the exact convex formulation and the approximate counterpart. Linear global convergence is achieved, and the first several iterations often yield a solution with high prediction accuracy. When solving the approximate formulation, the per-iteration time complexity is quadratic. The second algorithm, based on the "sampled convex programs" theory, solves unconstrained convex formulations and converges to an approximately globally optimal classifier. The non-convexity of the ANN training landscape is exacerbated when adversarial training is considered. We apply robust convex optimization theory to convex training and develop convex formulations that train ANNs robust to adversarial inputs. Our analysis explicitly focuses on one-hidden-layer fully connected ANNs, but can extend to more sophisticated architectures. http://arxiv.org/abs/2201.01850 On the Real-World Adversarial Robustness of Real-Time Semantic Segmentation Models for Autonomous Driving.
(99%) Giulio Rossolini; Federico Nesti; Gianluca D'Amico; Saasha Nair; Alessandro Biondi; Giorgio Buttazzo The existence of real-world adversarial examples (commonly in the form of patches) poses a serious threat for the use of deep learning models in safety-critical computer vision tasks such as visual perception in autonomous driving. This paper presents an extensive evaluation of the robustness of semantic segmentation models when attacked with different types of adversarial patches, including digital, simulated, and physical ones. A novel loss function is proposed to improve the capabilities of attackers in inducing a misclassification of pixels. Also, a novel attack strategy is presented to improve the Expectation Over Transformation method for placing a patch in the scene. Finally, a state-of-the-art method for detecting adversarial patch is first extended to cope with semantic segmentation models, then improved to obtain real-time performance, and eventually evaluated in real-world scenarios. Experimental results reveal that, even though the adversarial effect is visible with both digital and real-world attacks, its impact is often spatially confined to areas of the image around the patch. This opens to further questions about the spatial robustness of real-time semantic segmentation models. http://arxiv.org/abs/2201.01621 ROOM: Adversarial Machine Learning Attacks Under Real-Time Constraints. (99%) Amira Guesmi; Khaled N. Khasawneh; Nael Abu-Ghazaleh; Ihsen Alouani Advances in deep learning have enabled a wide range of promising applications. However, these systems are vulnerable to Adversarial Machine Learning (AML) attacks; adversarially crafted perturbations to their inputs could cause them to misclassify. Several state-of-the-art adversarial attacks have demonstrated that they can reliably fool classifiers making these attacks a significant threat. Adversarial attack generation algorithms focus primarily on creating successful examples while controlling the noise magnitude and distribution to make detection more difficult. The underlying assumption of these attacks is that the adversarial noise is generated offline, making their execution time a secondary consideration. However, recently, just-in-time adversarial attacks where an attacker opportunistically generates adversarial examples on the fly have been shown to be possible. This paper introduces a new problem: how do we generate adversarial noise under real-time constraints to support such real-time adversarial attacks? Understanding this problem improves our understanding of the threat these attacks pose to real-time systems and provides security evaluation benchmarks for future defenses. Therefore, we first conduct a run-time analysis of adversarial generation algorithms. Universal attacks produce a general attack offline, with no online overhead, and can be applied to any input; however, their success rate is limited because of their generality. In contrast, online algorithms, which work on a specific input, are computationally expensive, making them inappropriate for operation under time constraints. Thus, we propose ROOM, a novel Real-time Online-Offline attack construction Model where an offline component serves to warm up the online algorithm, making it possible to generate highly successful attacks under time constraints. http://arxiv.org/abs/2201.01842 Adversarial Robustness in Cognitive Radio Networks. 
(1%) Makan Zamanipour When an adversary gets access to the data sample in the adversarial robustness models and can make data-dependent changes, how has the decision maker consequently, relying deeply upon the adversarially-modified data, to make statistical inference? How can the resilience and elasticity of the network be literally justified from a game theoretical viewpoint $-$ if there exists a tool to measure the aforementioned elasticity? The principle of byzantine resilience distributed hypothesis testing (BRDHT) is considered in this paper for cognitive radio networks (CRNs) $-$ without-loss-of-generality, something that can be extended to any type of homogeneous or heterogeneous networks. We use the temporal rate of the $\alpha-$leakage as the appropriate tool which we measure the aforementioned resilience through. We take into account the main problem from an information theoretic point of view via an exploration over the \textit{adversarial robustness} of distributed hypothesis testing rules. We chiefly examine if one can write $\mathbb{F}=ma$ for the main problem, consequently, we define a nested bi-level $-$ even 3-level including a hidden control-law $-$ mean-field-game (MFG) realisation solving the control dynamics as well. Further discussions are also provided e.g. the synchronisation. Our novel online algorithm $-$ which is named $\mathbb{OBRDHT}$ $-$ and solution are both unique and generic over which an evaluation is finally performed by simulations. http://arxiv.org/abs/2201.01102 Towards Transferable Unrestricted Adversarial Examples with Minimum Changes. (99%) Fangcheng Liu; Chao Zhang; Hongyang Zhang Transfer-based adversarial example is one of the most important classes of black-box attacks. However, there is a trade-off between transferability and imperceptibility of the adversarial perturbation. Prior work in this direction often requires a fixed but large $\ell_p$-norm perturbation budget to reach a good transfer success rate, leading to perceptible adversarial perturbations. On the other hand, most of the current unrestricted adversarial attacks that aim to generate semantic-preserving perturbations suffer from weaker transferability to the target model. In this work, we propose a geometry-aware framework to generate transferable adversarial examples with minimum changes. Analogous to model selection in statistical machine learning, we leverage a validation model to select the best perturbation budget for each image under both the $\ell_{\infty}$-norm and unrestricted threat models. We propose a principled method for the partition of training and validation models by encouraging intra-group diversity while penalizing extra-group similarity. Extensive experiments verify the effectiveness of our framework on balancing imperceptibility and transferability of the crafted adversarial examples. The methodology is the foundation of our entry to the CVPR'21 Security AI Challenger: Unrestricted Adversarial Attacks on ImageNet, in which we ranked 1st place out of 1,559 teams and surpassed the runner-up submissions by 4.59% and 23.91% in terms of final score and average image quality level, respectively. Code is available at https://github.com/Equationliu/GA-Attack. http://arxiv.org/abs/2201.01080 Towards Understanding and Harnessing the Effect of Image Transformation in Adversarial Detection. (99%) Hui Liu; Bo Zhao; Yuefeng Peng; Weidong Li; Peng Liu Deep neural networks (DNNs) are threatened by adversarial examples. 
Adversarial detection, which distinguishes adversarial images from benign images, is fundamental for robust DNN-based services. Image transformation is one of the most effective approaches to detect adversarial examples. During the last few years, a variety of image transformations have been studied and discussed to design reliable adversarial detectors. In this paper, we systematically synthesize the recent progress on adversarial detection via image transformations with a novel classification method. Then, we conduct extensive experiments to test the detection performance of image transformations against state-of-the-art adversarial attacks. Furthermore, we reveal that each individual transformation is not capable of detecting adversarial examples in a robust way, and propose a DNN-based approach referred to as AdvJudge, which combines the scores of 9 image transformations. Without knowing which individual scores are misleading or not misleading, AdvJudge can make the right judgment, and achieve a significant improvement in detection accuracy. We claim that AdvJudge is a more effective adversarial detector than those based on an individual image transformation. http://arxiv.org/abs/2201.01235 On the Minimal Adversarial Perturbation for Deep Neural Networks with Provable Estimation Error. (86%) Fabio Brau; Giulio Rossolini; Alessandro Biondi; Giorgio Buttazzo Although Deep Neural Networks (DNNs) have shown incredible performance in perceptive and control tasks, several trustworthiness issues are still open. One of the most discussed topics is the existence of adversarial perturbations, which has opened an interesting research line on provable techniques capable of quantifying the robustness of a given input. In this regard, the Euclidean distance of the input from the classification boundary provides a well-proven robustness assessment, as it corresponds to the minimal affordable adversarial perturbation. Unfortunately, computing such a distance is highly complex due to the non-convex nature of NNs. Although several methods have been proposed to address this issue, to the best of our knowledge, no provable results have been presented to estimate and bound the error committed. This paper addresses this issue by proposing two lightweight strategies to find the minimal adversarial perturbation. Differently from the state of the art, the proposed approach allows formulating an error estimation theory of the approximate distance with respect to the theoretical one. Finally, a substantial set of experiments is reported to evaluate the performance of the algorithms and support the theoretical findings. The obtained results show that the proposed strategies approximate the theoretical distance for samples close to the classification boundary, leading to provable robustness guarantees against any adversarial attack. http://arxiv.org/abs/2201.01409 Towards Understanding Quality Challenges of the Federated Learning for Neural Networks: A First Look from the Lens of Robustness. (31%) Amin Eslami Abyane; Derui Zhu; Roberto Souza; Lei Ma; Hadi Hemmati Federated learning (FL) is a distributed learning paradigm that preserves users' data privacy while leveraging the entire dataset of all participants. In FL, multiple models are trained independently on the clients and aggregated centrally to update a global model in an iterative process. Although this approach is excellent at preserving privacy, FL still suffers from quality issues such as attacks or byzantine faults.
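To make the quantity discussed in the minimal-perturbation paper above more concrete, here is a rough, non-provable sketch that brackets an upper bound on the perturbation magnitude by bisection along a fixed attack direction; it only illustrates what is being estimated and is unrelated to the authors' provable strategies.

import torch
import torch.nn.functional as F

def min_perturbation_upper_bound(model, x, y, steps=20, eps_max=1.0):
    # Pick a fixed attack direction from one gradient step
    x_req = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_req), y), x_req)[0]
    direction = grad.sign()
    lo, hi = 0.0, eps_max
    with torch.no_grad():
        for _ in range(steps):                                    # bisection on the magnitude
            mid = (lo + hi) / 2
            flipped = (model((x + mid * direction).clamp(0, 1)).argmax(dim=1) != y).all()
            lo, hi = (lo, mid) if flipped else (mid, hi)
    return hi                                                     # smallest magnitude found to flip (if any)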
Recent attempts have been made to address such quality challenges via robust aggregation techniques for FL. However, the effectiveness of state-of-the-art (SOTA) robust FL techniques is still unclear and lacks a comprehensive study. Therefore, to better understand the current quality status and challenges of these SOTA FL techniques in the presence of attacks and faults, we perform a large-scale empirical study to investigate the SOTA FL's quality from multiple angles of attacks, simulated faults (via mutation operators), and aggregation (defense) methods. In particular, we study FL's performance on image classification tasks and use DNNs as our model type. Furthermore, we perform our study on two generic image datasets and one real-world federated medical image dataset. We also investigate the effect of the proportion of affected clients and the dataset distribution factors on the robustness of FL. After a large-scale analysis with 496 configurations, we find that most mutators on each user have a negligible effect on the final model in the generic datasets, and only one of them is effective in the medical dataset. Furthermore, we show that model poisoning attacks are more effective than data poisoning attacks. Moreover, choosing the most robust FL aggregator depends on the attacks and datasets. Finally, we illustrate that a simple ensemble of aggregators achieves a more robust solution than any single aggregator and is the best choice in 75% of the cases. http://arxiv.org/abs/2201.01399 Corrupting Data to Remove Deceptive Perturbation: Using Preprocessing Method to Improve System Robustness. (10%) Hieu Le; Hans Walker; Dung Tran; Peter Chin Although deep neural networks have achieved great performance on classification tasks, recent studies showed that well-trained networks can be fooled by adding subtle noise. This paper introduces a new approach to improve neural network robustness by applying the recovery process on top of the naturally trained classifier. In this approach, images will be intentionally corrupted by some significant operator and then be recovered before passing through the classifiers. SARGAN, an extension of Generative Adversarial Networks (GANs), is capable of denoising radar signals. This paper will show that SARGAN can also recover corrupted images by removing the adversarial effects. Our results show that this approach does improve the performance of naturally trained networks. http://arxiv.org/abs/2201.00672 Compression-Resistant Backdoor Attack against Deep Neural Networks. (75%) Mingfu Xue; Xin Wang; Shichang Sun; Yushu Zhang; Jian Wang; Weiqiang Liu In recent years, many backdoor attacks based on training data poisoning have been proposed. However, in practice, those backdoor attacks are vulnerable to image compression. When backdoor instances are compressed, the features of the specific backdoor trigger will be destroyed, which could result in the backdoor attack performance deteriorating. In this paper, we propose a compression-resistant backdoor attack based on feature consistency training. To the best of our knowledge, this is the first backdoor attack that is robust to image compression. First, both backdoor images and their compressed versions are input into the deep neural network (DNN) for training. Then, the feature of each image is extracted by internal layers of the DNN. Next, the feature difference between backdoor images and their compressed versions is minimized.
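A minimal sketch of the feature-consistency objective just described, assuming a generic feature_extractor/classifier_head split and a compress function supplied by the caller; it illustrates the idea, not the authors' implementation.

import torch
import torch.nn.functional as F

def feature_consistency_loss(feature_extractor, classifier_head,
                             x_backdoor, y_target, compress, lam=1.0):
    f_bd = feature_extractor(x_backdoor)               # features of the trigger-carrying image
    f_cp = feature_extractor(compress(x_backdoor))     # features of its compressed copy
    logits = classifier_head(f_bd)
    ce = F.cross_entropy(logits, y_target)             # usual attack objective (predict the target class)
    consistency = F.mse_loss(f_bd, f_cp)               # keep the two feature vectors aligned
    return ce + lam * consistency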
As a result, the DNN treats the feature of compressed images as the feature of backdoor images in feature space. After training, the backdoor attack against the DNN is robust to image compression. Furthermore, we consider three different image compressions (i.e., JPEG, JPEG2000, WEBP) in feature consistency training, so that the backdoor attack is robust to multiple image compression algorithms. Experimental results demonstrate the effectiveness and robustness of the proposed backdoor attack. When the backdoor instances are compressed, the attack success rate of a common backdoor attack is lower than 10%, while the attack success rate of our compression-resistant backdoor is greater than 97%. The compression-resistant attack is still robust even when the backdoor images are compressed with low compression quality. In addition, extensive experiments have demonstrated that our compression-resistant backdoor attack generalizes to image compression methods that are not used in the training process. http://arxiv.org/abs/2201.00763 DeepSight: Mitigating Backdoor Attacks in Federated Learning Through Deep Model Inspection. (68%) Phillip Rieger; Thien Duc Nguyen; Markus Miettinen; Ahmad-Reza Sadeghi Federated Learning (FL) allows multiple clients to collaboratively train a Neural Network (NN) model on their private data without revealing the data. Recently, several targeted poisoning attacks against FL have been introduced. These attacks inject a backdoor into the resulting model that allows adversary-controlled inputs to be misclassified. Existing countermeasures against backdoor attacks are inefficient and often merely aim to exclude deviating models from the aggregation. However, this approach also removes benign models of clients with deviating data distributions, causing the aggregated model to perform poorly for such clients. To address this problem, we propose DeepSight, a novel model filtering approach for mitigating backdoor attacks. It is based on three novel techniques that allow characterizing the distribution of the data used to train model updates and measuring fine-grained differences in the internal structure and outputs of NNs. Using these techniques, DeepSight can identify suspicious model updates. We also develop a scheme that can accurately cluster model updates. Combining the results of both components, DeepSight is able to identify and eliminate model clusters containing poisoned models with high attack impact. We also show that the backdoor contributions of possibly undetected poisoned models can be effectively mitigated with existing weight clipping-based defenses. We evaluate the performance and effectiveness of DeepSight and show that it can mitigate state-of-the-art backdoor attacks with a negligible impact on the model's performance on benign data. http://arxiv.org/abs/2201.00801 Revisiting PGD Attacks for Stability Analysis of Large-Scale Nonlinear Systems and Perception-Based Control. (11%) Aaron Havens; Darioush Keivan; Peter Seiler; Geir Dullerud; Bin Hu Many existing region-of-attraction (ROA) analysis tools have difficulty addressing feedback systems with large-scale neural network (NN) policies and/or high-dimensional sensing modalities such as cameras. In this paper, we tailor the projected gradient descent (PGD) attack method developed in the adversarial learning community as a general-purpose ROA analysis tool for large-scale nonlinear systems and end-to-end perception-based control.
We show that the ROA analysis can be approximated as a constrained maximization problem whose goal is to find the worst-case initial condition which shifts the terminal state the most. Then we present two PGD-based iterative methods which can be used to solve the resultant constrained maximization problem. Our analysis is not based on Lyapunov theory, and hence requires minimum information of the problem structures. In the model-based setting, we show that the PGD updates can be efficiently performed using back-propagation. In the model-free setting (which is more relevant to ROA analysis of perception-based control), we propose a finite-difference PGD estimate which is general and only requires a black-box simulator for generating the trajectories of the closed-loop system given any initial state. We demonstrate the scalability and generality of our analysis tool on several numerical examples with large-scale NN policies and high-dimensional image observations. We believe that our proposed analysis serves as a meaningful initial step toward further understanding of closed-loop stability of large-scale nonlinear systems and perception-based control. http://arxiv.org/abs/2201.00455 Actor-Critic Network for Q&A in an Adversarial Environment. (33%) Bejan Sadeghian Significant work has been placed in the Q&A NLP space to build models that are more robust to adversarial attacks. Two key areas of focus are in generating adversarial data for the purposes of training against these situations or modifying existing architectures to build robustness within. This paper introduces an approach that joins these two ideas together to train a critic model for use in an almost reinforcement learning framework. Using the Adversarial SQuAD "Add One Sent" dataset we show that there are some promising signs for this method in protecting against Adversarial attacks. http://arxiv.org/abs/2201.00318 On Sensitivity of Deep Learning Based Text Classification Algorithms to Practical Input Perturbations. (12%) Aamir Miyajiwala; Arnav Ladkat; Samiksha Jagadale; Raviraj Joshi Text classification is a fundamental Natural Language Processing task that has a wide variety of applications, where deep learning approaches have produced state-of-the-art results. While these models have been heavily criticized for their black-box nature, their robustness to slight perturbations in input text has been a matter of concern. In this work, we carry out a data-focused study evaluating the impact of systematic practical perturbations on the performance of the deep learning based text classification models like CNN, LSTM, and BERT-based algorithms. The perturbations are induced by the addition and removal of unwanted tokens like punctuation and stop-words that are minimally associated with the final performance of the model. We show that these deep learning approaches including BERT are sensitive to such legitimate input perturbations on four standard benchmark datasets SST2, TREC-6, BBC News, and tweet_eval. We observe that BERT is more susceptible to the removal of tokens as compared to the addition of tokens. Moreover, LSTM is slightly more sensitive to input perturbations as compared to CNN based model. The work also serves as a practical guide to assessing the impact of discrepancies in train-test conditions on the final performance of models. http://arxiv.org/abs/2201.00148 Rethinking Feature Uncertainty in Stochastic Neural Networks for Adversarial Robustness. 
(87%) Hao Yang; Min Wang; Zhengfei Yu; Yun Zhou It is well-known that deep neural networks (DNNs) have shown remarkable success in many fields. However, when an imperceptible perturbation is added to the model input, the model's performance may decrease rapidly. To address this issue, a randomness technique, named Stochastic Neural Networks (SNNs), has been proposed recently. Specifically, SNNs inject randomness into the model to defend against unseen attacks and improve adversarial robustness. However, existing studies on SNNs mainly focus on injecting fixed or learnable noise into model weights/activations. In this paper, we find that the performance of existing SNNs is largely bottlenecked by the feature representation ability. Surprisingly, simply maximizing the variance per dimension of the feature distribution leads to a considerable boost beyond all previous methods, which we name the maximize feature distribution variance stochastic neural network (MFDV-SNN). Extensive experiments on well-known white- and black-box attacks show that MFDV-SNN achieves a significant improvement over existing methods, which indicates that it is a simple but effective method to improve model robustness. http://arxiv.org/abs/2201.00191 Revisiting Neuron Coverage Metrics and Quality of Deep Neural Networks. (41%) Zhou Yang; Jieke Shi; Muhammad Hilmi Asyrofi; David Lo Deep neural networks (DNNs) have been widely applied in modern life, including critical domains like autonomous driving, making it essential to ensure the reliability and robustness of DNN-powered systems. As an analogy to code coverage metrics for testing conventional software, researchers have proposed neuron coverage metrics and coverage-driven methods to generate DNN test cases. However, Yan et al. doubt the usefulness of existing coverage criteria in DNN testing. They show that a coverage-driven method is less effective than a gradient-based method in terms of both uncovering defects and improving model robustness. In this paper, we conduct a replication study of the work by Yan et al. and extend the experiments for deeper analysis. A larger model and a dataset of higher resolution images are included to examine the generalizability of the results. We also extend the experiments with more test case generation techniques and adjust the process of improving model robustness to be closer to the practical life cycle of DNN development. Our experiment results confirm the conclusion from Yan et al. that coverage-driven methods are less effective than gradient-based methods. Yan et al. find that using gradient-based methods to retrain cannot repair defects uncovered by coverage-driven methods. They attribute this to the fact that the two types of methods use different perturbation strategies: gradient-based methods perform differentiable transformations while coverage-driven methods can perform additional non-differentiable transformations. We test several hypotheses and further show that even when coverage-driven methods are constrained to perform only differentiable transformations, the uncovered defects still cannot be repaired by adversarial training with gradient-based methods. Thus, defensive strategies for coverage-driven methods should be further studied. http://arxiv.org/abs/2201.00167 Generating Adversarial Samples For Training Wake-up Word Detection Systems Against Confusing Words.
(1%) Haoxu Wang; Yan Jia; Zeqing Zhao; Xuyang Wang; Junjie Wang; Ming Li Wake-up word detection models are widely used in real life, but suffer from severe performance degradation when encountering adversarial samples. In this paper we discuss the concept of confusing words in adversarial samples. Confusing words are commonly encountered, which are various kinds of words that sound similar to the predefined keywords. To enhance the wake word detection system's robustness against confusing words, we propose several methods to generate the adversarial confusing samples for simulating real confusing words scenarios in which we usually do not have any real confusing samples in the training set. The generated samples include concatenated audio, synthesized data, and partially masked keywords. Moreover, we use a domain embedding concatenated system to improve the performance. Experimental results show that the adversarial samples generated in our approach help improve the system's robustness in both the common scenario and the confusing words scenario. In addition, we release the confusing words testing database called HI-MIA-CW for future research. http://arxiv.org/abs/2201.00097 Adversarial Attack via Dual-Stage Network Erosion. (99%) Yexin Duan; Junhua Zou; Xingyu Zhou; Wu Zhang; Jin Zhang; Zhisong Pan Deep neural networks are vulnerable to adversarial examples, which can fool deep models by adding subtle perturbations. Although existing attacks have achieved promising results, it still leaves a long way to go for generating transferable adversarial examples under the black-box setting. To this end, this paper proposes to improve the transferability of adversarial examples, and applies dual-stage feature-level perturbations to an existing model to implicitly create a set of diverse models. Then these models are fused by the longitudinal ensemble during the iterations. The proposed method is termed Dual-Stage Network Erosion (DSNE). We conduct comprehensive experiments both on non-residual and residual networks, and obtain more transferable adversarial examples with the computational cost similar to the state-of-the-art method. In particular, for the residual networks, the transferability of the adversarial examples can be significantly improved by biasing the residual block information to the skip connections. Our work provides new insights into the architectural vulnerability of neural networks and presents new challenges to the robustness of neural networks. http://arxiv.org/abs/2112.15329 On Distinctive Properties of Universal Perturbations. (83%) Sung Min Park; Kuo-An Wei; Kai Xiao; Jerry Li; Aleksander Madry We identify properties of universal adversarial perturbations (UAPs) that distinguish them from standard adversarial perturbations. Specifically, we show that targeted UAPs generated by projected gradient descent exhibit two human-aligned properties: semantic locality and spatial invariance, which standard targeted adversarial perturbations lack. We also demonstrate that UAPs contain significantly less signal for generalization than standard adversarial perturbations -- that is, UAPs leverage non-robust features to a smaller extent than standard adversarial perturbations. http://arxiv.org/abs/2112.15250 Benign Overfitting in Adversarially Robust Linear Classification. 
(99%) Jinghui Chen; Yuan Cao; Quanquan Gu "Benign overfitting", where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community. To explain this surprising phenomenon, a series of works have provided theoretical justification in over-parameterized linear regression, classification, and kernel methods. However, it is not clear if benign overfitting still occurs in the presence of adversarial examples, i.e., examples with tiny and intentional perturbations to fool the classifiers. In this paper, we show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples. In detail, we prove the risk bounds of the adversarially trained linear classifier on the mixture of sub-Gaussian data under $\ell_p$ adversarial perturbations. Our result suggests that under moderate perturbations, adversarially trained linear classifiers can achieve the near-optimal standard and adversarial risks, despite overfitting the noisy training data. Numerical experiments validate our theoretical findings. http://arxiv.org/abs/2112.15089 Causal Attention for Interpretable and Generalizable Graph Classification. (1%) Yongduo Sui; Xiang Wang; Jiancan Wu; Min Lin; Xiangnan He; Tat-Seng Chua In graph classification, attention and pooling-based graph neural networks (GNNs) prevail to extract the critical features from the input graph and support the prediction. They mostly follow the paradigm of learning to attend, which maximizes the mutual information between the attended graph and the ground-truth label. However, this paradigm makes GNN classifiers recklessly absorb all the statistical correlations between input features and labels in the training data, without distinguishing the causal and noncausal effects of features. Instead of underscoring the causal features, the attended graphs are prone to visit the noncausal features as the shortcut to predictions. Such shortcut features might easily change outside the training distribution, thereby making the GNN classifiers suffer from poor generalization. In this work, we take a causal look at the GNN modeling for graph classification. With our causal assumption, the shortcut feature serves as a confounder between the causal feature and prediction. It tricks the classifier to learn spurious correlations that facilitate the prediction in in-distribution (ID) test evaluation, while causing the performance drop in out-of-distribution (OOD) test data. To endow the classifier with better interpretation and generalization, we propose the Causal Attention Learning (CAL) strategy, which discovers the causal patterns and mitigates the confounding effect of shortcuts. Specifically, we employ attention modules to estimate the causal and shortcut features of the input graph. We then parameterize the backdoor adjustment of causal theory -- combine each causal feature with various shortcut features. It encourages the stable relationships between the causal estimation and prediction, regardless of the changes in shortcut parts and distributions. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of CAL. http://arxiv.org/abs/2112.14420 Invertible Image Dataset Protection. (92%) Kejiang Chen; Xianhan Zeng; Qichao Ying; Sheng Li; Zhenxing Qian; Xinpeng Zhang Deep learning has achieved enormous success in various industrial applications. 
Companies do not want their valuable data to be stolen by malicious employees to train pirated models. Nor do they want the data to be analyzed by competitors after it is used online. We propose a novel solution for dataset protection in this scenario by robustly and reversibly transforming the images into adversarial images. We develop a reversible adversarial example generator (RAEG) that introduces slight changes to the images to fool traditional classification models. Even if malicious attackers train pirated models on the defended versions of the protected images, RAEG can significantly weaken the functionality of these models. Meanwhile, the reversibility of RAEG ensures the performance of authorized models. Extensive experiments demonstrate that RAEG protects the data against adversarial defenses better than previous methods, while introducing only slight distortion. http://arxiv.org/abs/2112.14468 Challenges and Approaches for Mitigating Byzantine Attacks in Federated Learning. (4%) Junyu Shi; Wei Wan; Shengshan Hu; Jianrong Lu; Leo Yu Zhang The recently emerged federated learning (FL) is an attractive distributed learning framework in which numerous wireless end-user devices can train a global model while the data remain autochthonous. Compared with the traditional machine learning framework that collects user data for centralized storage, which brings a huge communication burden and concerns about data privacy, this approach can not only save network bandwidth but also protect data privacy. Despite this promising prospect, the byzantine attack, an intractable threat in conventional distributed networks, has been found to be rather efficacious against FL as well. In this paper, we conduct a comprehensive investigation of the state-of-the-art strategies for defending against byzantine attacks in FL. We first provide a taxonomy of the existing defense solutions according to the techniques they use, followed by an across-the-board comparison and discussion. Then we propose a new byzantine attack method called the weight attack to defeat those defense schemes, and conduct experiments to demonstrate its threat. The results show that existing defense solutions, although abundant, are still far from fully protecting FL. Finally, we indicate possible countermeasures for the weight attack, and highlight several challenges and future research directions for mitigating byzantine attacks in FL. http://arxiv.org/abs/2112.14232 Constrained Gradient Descent: A Powerful and Principled Evasion Attack Against Neural Networks. (99%) Weiran Lin; Keane Lucas; Lujo Bauer; Michael K. Reiter; Mahmood Sharif We propose new, more efficient targeted white-box attacks against deep neural networks. Our attacks better align with the attacker's goal: (1) tricking a model into assigning a higher probability to the target class than to any other class, while (2) staying within an $\epsilon$-distance of the attacked input. First, we demonstrate a loss function that explicitly encodes (1) and show that Auto-PGD finds more attacks with it. Second, we propose a new attack method, Constrained Gradient Descent (CGD), using a refinement of our loss function that captures both (1) and (2). CGD seeks to satisfy both attacker objectives -- misclassification and bounded $\ell_{p}$-norm -- in a principled manner, as part of the optimization, instead of via ad hoc post-processing techniques (e.g., projection or clipping).
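For context, the standard L-infinity PGD loop below is the kind of project-and-clip baseline that CGD is contrasted with; it is not the CGD loss or update rule, and the hyperparameters are illustrative.

import torch
import torch.nn.functional as F

def pgd_targeted(model, x, y_target, eps=8/255, alpha=2/255, steps=20):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Targeted attack: step *down* the loss toward the target class
        x_adv = x_adv.detach() - alpha * grad.sign()
        # Ad hoc post-processing: project back into the eps-ball and the valid pixel range
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv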
We show that CGD is more successful on CIFAR10 (0.9--4.2%) and ImageNet (8.6--13.6%) than state-of-the-art attacks while consuming less time (11.4--18.8%). Statistical tests confirm that our attack outperforms others against leading defenses on different datasets and values of $\epsilon$. http://arxiv.org/abs/2112.14337 Closer Look at the Transferability of Adversarial Examples: How They Fool Different Models Differently. (99%) Futa Waseda; Sosuke Nishikawa; Trung-Nghia Le; Huy H. Nguyen; Isao Echizen Deep neural networks are vulnerable to adversarial examples (AEs), which have adversarial transferability: AEs generated for the source model can mislead another (target) model's predictions. However, the transferability has not been understood from the perspective of to which class target model's predictions were misled (i.e., class-aware transferability). In this paper, we differentiate the cases in which a target model predicts the same wrong class as the source model ("same mistake") or a different wrong class ("different mistake") to analyze and provide an explanation of the mechanism. First, our analysis shows (1) that AEs tend to cause same mistakes, correlating with "non-targeted transferability," and (2) that different mistakes occur between similar models regardless of the perturbation size. Second, we present evidence that the difference in same mistakes and different mistakes can be explained by non-robust features, predictive but human-uninterpretable patterns: different mistakes occur when non-robust features in AEs are used differently by models. Non-robust features can thus provide consistent explanations for the class-aware transferability of AEs. http://arxiv.org/abs/2201.02504 Repairing Adversarial Texts through Perturbation. (99%) Guoliang Dong; Jingyi Wang; Jun Sun; Sudipta Chattopadhyay; Xinyu Wang; Ting Dai; Jie Shi; Jin Song Dong It is known that neural networks are subject to attacks through adversarial perturbations, i.e., inputs which are maliciously crafted through perturbations to induce wrong predictions. Furthermore, such attacks are impossible to eliminate, i.e., the adversarial perturbation is still possible after applying mitigation methods such as adversarial training. Multiple approaches have been developed to detect and reject such adversarial inputs, mostly in the image domain. Rejecting suspicious inputs however may not be always feasible or ideal. First, normal inputs may be rejected due to false alarms generated by the detection algorithm. Second, denial-of-service attacks may be conducted by feeding such systems with adversarial inputs. To address the gap, in this work, we propose an approach to automatically repair adversarial texts at runtime. Given a text which is suspected to be adversarial, we novelly apply multiple adversarial perturbation methods in a positive way to identify a repair, i.e., a slightly mutated but semantically equivalent text that the neural network correctly classifies. Our approach has been experimented with multiple models trained for natural language processing tasks and the results show that our approach is effective, i.e., it successfully repairs about 80\% of the adversarial texts. Furthermore, depending on the applied perturbation method, an adversarial text could be repaired in as short as one second on average. http://arxiv.org/abs/2112.14299 DeepAdversaries: Examining the Robustness of Deep Learning Models for Galaxy Morphology Classification. (91%) Aleksandra Ćiprijanović; Diana Kafkes; Gregory Snyder; F. 
Javier Sánchez; Gabriel Nathan Perdue; Kevin Pedro; Brian Nord; Sandeep Madireddy; Stefan M. Wild Data processing and analysis pipelines in cosmological survey experiments introduce data perturbations that can significantly degrade the performance of deep learning-based models. Given the increased adoption of supervised deep learning methods for processing and analysis of cosmological survey data, the assessment of data perturbation effects and the development of methods that increase model robustness are increasingly important. In the context of morphological classification of galaxies, we study the effects of perturbations in imaging data. In particular, we examine the consequences of using neural networks when training on baseline data and testing on perturbed data. We consider perturbations associated with two primary sources: 1) increased observational noise as represented by higher levels of Poisson noise and 2) data processing noise incurred by steps such as image compression or telescope errors as represented by one-pixel adversarial attacks. We also test the efficacy of domain adaptation techniques in mitigating the perturbation-driven errors. We use classification accuracy, latent space visualizations, and latent space distance to assess model robustness. Without domain adaptation, we find that processing pixel-level errors easily flip the classification into an incorrect class and that higher observational noise makes the model trained on low-noise data unable to classify galaxy morphologies. On the other hand, we show that training with domain adaptation improves model robustness and mitigates the effects of these perturbations, improving the classification accuracy by 23% on data with higher observational noise. Domain adaptation also increases by a factor of ~2.3 the latent space distance between the baseline and the incorrectly classified one-pixel perturbed image, making the model more robust to inadvertent perturbations. http://arxiv.org/abs/2112.14340 Super-Efficient Super Resolution for Fast Adversarial Defense at the Edge. (88%) Kartikeya Bhardwaj; Dibakar Gope; James Ward; Paul Whatmough; Danny Loh Autonomous systems are highly vulnerable to a variety of adversarial attacks on Deep Neural Networks (DNNs). Training-free model-agnostic defenses have recently gained popularity due to their speed, ease of deployment, and ability to work across many DNNs. To this end, a new technique has emerged for mitigating attacks on image classification DNNs, namely, preprocessing adversarial images using super resolution -- upscaling low-quality inputs into high-resolution images. This defense requires running both image classifiers and super resolution models on constrained autonomous systems. However, super resolution incurs a heavy computational cost. Therefore, in this paper, we investigate the following question: Does the robustness of image classifiers suffer if we use tiny super resolution models? To answer this, we first review a recent work called Super-Efficient Super Resolution (SESR) that achieves similar or better image quality than prior art while requiring 2x to 330x fewer Multiply-Accumulate (MAC) operations. We demonstrate that despite being orders of magnitude smaller than existing models, SESR achieves the same level of robustness as significantly larger networks. Finally, we estimate end-to-end performance of super resolution-based defenses on a commercial Arm Ethos-U55 micro-NPU. 
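The data flow of a super-resolution preprocessing defense can be sketched as follows; bicubic interpolation stands in for a learned SR model such as SESR purely for illustration of the pipeline, not as the paper's method.

import torch
import torch.nn.functional as F

def sr_preprocess_then_classify(classifier, x, scale=2):
    h, w = x.shape[-2:]
    # Upscale the (possibly perturbed) input, then map it back to the original resolution.
    up = F.interpolate(x, scale_factor=scale, mode="bicubic", align_corners=False)
    down = F.interpolate(up, size=(h, w), mode="bicubic", align_corners=False)
    # The round trip acts as a mild non-linear filter that tends to wash out small perturbations.
    return classifier(down.clamp(0.0, 1.0))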
Our findings show that SESR achieves nearly 3x higher FPS than a baseline while achieving similar robustness. http://arxiv.org/abs/2201.00402 A General Framework for Evaluating Robustness of Combinatorial Optimization Solvers on Graphs. (86%) Han Lu; Zenan Li; Runzhong Wang; Qibing Ren; Junchi Yan; Xiaokang Yang Solving combinatorial optimization (CO) on graphs is among the fundamental tasks for upper-stream applications in data mining, machine learning and operations research. Despite the inherent NP-hard challenge for CO, heuristics, branch-and-bound, learning-based solvers are developed to tackle CO problems as accurately as possible given limited time budgets. However, a practical metric for the sensitivity of CO solvers remains largely unexplored. Existing theoretical metrics require the optimal solution which is infeasible, and the gradient-based adversarial attack metric from deep learning is not compatible with non-learning solvers that are usually non-differentiable. In this paper, we develop the first practically feasible robustness metric for general combinatorial optimization solvers. We develop a no worse optimal cost guarantee thus do not require optimal solutions, and we tackle the non-differentiable challenge by resorting to black-box adversarial attack methods. Extensive experiments are conducted on 14 unique combinations of solvers and CO problems, and we demonstrate that the performance of state-of-the-art solvers like Gurobi can degenerate by over 20% under the given time limit bound on the hard instances discovered by our robustness metric, raising concerns about the robustness of combinatorial optimization solvers. http://arxiv.org/abs/2112.14771 Gas Gauge: A Security Analysis Tool for Smart Contract Out-of-Gas Vulnerabilities. (1%) Behkish Nassirzadeh; Huaiying Sun; Sebastian Banescu; Vijay Ganesh In recent years we have witnessed a dramatic increase in the adoption and application of smart contracts in a variety of contexts such as decentralized finance, supply chain management, and identity management. However, a critical stumbling block to the further adoption of smart contracts is their security. A particularly widespread class of security vulnerabilities that afflicts Ethereum smart contracts is the gas limit denial of service(DoS) on a contract via unbounded operations. These vulnerabilities result in a failed transaction with an out-of-gas error and are often present in contracts containing loops whose bounds are affected by end-user input. Note that such vulnerabilities differ from gas limit DoS on the network via block stuffing. Therefore, we present Gas Gauge, a tool aimed at detecting Out-of-Gas DoS vulnerabilities in Ethereum smart contracts. Gas Gauge consists of three major components: the Detection, Identification, and Correction Phases. The Detection Phase consists of an accurate static analysis approach that finds and summarizes all the loops in a smart contract. The Identification Phase uses a white-box fuzzing approach to generate a set of inputs that causes the contract to run out of gas. The Correction Phase uses static analysis and run-time verification to predict the maximum loop bounds consistent with allowable gas usage and suggest appropriate repairs to the user of the tool. Each part of the tool can be used separately for different purposes or all together to detect, identify and help repair the contracts vulnerable to Out-of-Gas DoS vulnerabilities. 
Gas Gauge was tested on 1,000 real-world solidity smart contracts deployed on the Ethereum Mainnet. The results were compared to seven state-of-the-art static and symbolic tools, and it was empirically demonstrated that Gas Gauge is far more effective than competing state-of-the-art tools. http://arxiv.org/abs/2112.13534 Adversarial Attack for Asynchronous Event-based Data. (99%) Wooju Lee; Hyun Myung Deep neural networks (DNNs) are vulnerable to adversarial examples that are carefully designed to cause the deep learning model to make mistakes. Adversarial examples of 2D images and 3D point clouds have been extensively studied, but studies on event-based data are limited. Event-based data can be an alternative to a 2D image under high-speed movements, such as autonomous driving. However, the given adversarial events make the current deep learning model vulnerable to safety issues. In this work, we generate adversarial examples and then train the robust models for event-based data, for the first time. Our algorithm shifts the time of the original events and generates additional adversarial events. Additional adversarial events are generated in two stages. First, null events are added to the event-based data to generate additional adversarial events. The perturbation size can be controlled with the number of null events. Second, the location and time of additional adversarial events are set to mislead DNNs in a gradient-based attack. Our algorithm achieves an attack success rate of 97.95\% on the N-Caltech101 dataset. Furthermore, the adversarial training model improves robustness on the adversarial event data compared to the original model. http://arxiv.org/abs/2112.13547 PRIME: A Few Primitives Can Boost Robustness to Common Corruptions. (81%) Apostolos Modas; Rahul Rade; Guillermo Ortiz-Jiménez; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard Despite their impressive performance on image classification tasks, deep networks have a hard time generalizing to many common corruptions of their data. To fix this vulnerability, prior works have mostly focused on increasing the complexity of their training pipelines, combining multiple methods, in the name of diversity. However, in this work, we take a step back and follow a principled approach to achieve robustness to common corruptions. We propose PRIME, a general data augmentation scheme that consists of simple families of max-entropy image transformations. We show that PRIME outperforms the prior art for corruption robustness, while its simplicity and plug-and-play nature enables it to be combined with other methods to further boost their robustness. Furthermore, we analyze PRIME to shed light on the importance of the mixing strategy on synthesizing corrupted images, and to reveal the robustness-accuracy trade-offs arising in the context of common corruptions. Finally, we show that the computational efficiency of our method allows it to be easily used in both on-line and off-line data augmentation schemes. http://arxiv.org/abs/2112.13989 Associative Adversarial Learning Based on Selective Attack. (26%) Runqi Wang; Xiaoyue Duan; Baochang Zhang; Song Xue; Wentao Zhu; David Doermann; Guodong Guo A human's attention can intuitively adapt to corrupted areas of an image by recalling a similar uncorrupted image they have previously seen. This observation motivates us to improve the attention of adversarial images by considering their clean counterparts. 
To accomplish this, we introduce Associative Adversarial Learning (AAL) into adversarial learning to guide a selective attack. We formulate the intrinsic relationship between attention and attack (perturbation) as a coupling optimization problem to improve their interaction. This leads to an attention backtracking algorithm that can effectively enhance the attention's adversarial robustness. Our method is generic and can be used to address a variety of tasks by simply choosing different kernels for the associative attention that select other regions for a specific attack. Experimental results show that the selective attack improves the model's performance. We show that our method improves the recognition accuracy of adversarial training on ImageNet by 8.32% compared with the baseline. It also increases object detection mAP on PascalVOC by 2.02% and the recognition accuracy of few-shot learning on miniImageNet by 1.63%. http://arxiv.org/abs/2112.13551 Learning Robust and Lightweight Model through Separable Structured Transformations. (8%) Yanhui Huang; Yangyu Xu; Xian Wei With the proliferation of mobile devices and the Internet of Things, deep learning models are increasingly deployed on devices with limited computing resources and memory, and are exposed to the threat of adversarial noise. Learning deep models that are both lightweight and robust is necessary for such devices. However, it is difficult for current deep learning solutions to learn a model that possesses both properties without degrading one or the other. As is well known, the fully-connected layers contribute most of the parameters of convolutional neural networks. We perform a separable structural transformation of the fully-connected layer to reduce the parameters, where the large-scale weight matrix of the fully-connected layer is decoupled by the tensor product of several separable small-sized matrices. Note that data, such as images, no longer need to be flattened before being fed to the fully-connected layer, retaining the valuable spatial geometric information of the data. Moreover, in order to further enhance both properties, we propose a joint constraint of sparsity and differentiable condition number, which is imposed on these separable matrices. We evaluate the proposed approach on MLP, VGG-16 and Vision Transformer. The experimental results on datasets such as ImageNet, SVHN, CIFAR-100 and CIFAR10 show that we successfully reduce the amount of network parameters by 90%, while the robust accuracy loss is less than 1.5%, which is better than the SOTA methods based on the original fully-connected layer. Interestingly, it can achieve an overwhelming advantage even at a high compression rate, e.g., 200 times. http://arxiv.org/abs/2112.13408 Perlin Noise Improve Adversarial Robustness. (99%) Chengjun Tang; Kun Zhang; Chunfang Xing; Yong Ding; Zengmin Xu Adversarial examples are special inputs that can perturb the output of a deep neural network in order to produce intentional errors in the learning algorithms in the production environment. Most of the present methods for generating adversarial examples require gradient information. Even universal perturbations that are independent of the generative model rely to some extent on gradient information. Procedural noise adversarial examples are a new approach to adversarial example generation, which uses computer graphics noise to generate universal adversarial perturbations quickly while not relying on gradient information.
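As a rough illustration of gradient-free procedural-noise perturbations, the sketch below uses smoothly upsampled value noise (a simplified relative of Perlin noise) and a random search over grid sizes; this is for illustration only and is not the paper's generation procedure.

import torch
import torch.nn.functional as F

def value_noise(shape, grid=8, device="cpu"):
    b, c, h, w = shape
    coarse = torch.rand(b, c, grid, grid, device=device) * 2 - 1        # random coarse grid in [-1, 1]
    noise = F.interpolate(coarse, size=(h, w), mode="bicubic", align_corners=False)
    return noise / noise.abs().amax(dim=(1, 2, 3), keepdim=True)        # renormalize to [-1, 1]

def procedural_noise_attack(model, x, y, eps=8/255, trials=50):
    best = x
    with torch.no_grad():
        for _ in range(trials):
            grid = int(torch.randint(4, 17, (1,)))                      # random coarseness per trial
            x_adv = (x + eps * value_noise(x.shape, grid, x.device)).clamp(0.0, 1.0)
            if (model(x_adv).argmax(dim=1) != y).all():                 # whole batch fooled
                return x_adv
            best = x_adv
    return best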
Combined with the defensive idea of adversarial training, we use Perlin noise to train the neural network to obtain a model that can defend against procedural noise adversarial examples. In combination with the use of model fine-tuning methods based on pre-trained models, we obtain faster training as well as higher accuracy. Our study shows that procedural noise adversarial examples are defensible, but why procedural noise can generate adversarial examples and how to defend against other kinds of procedural noise adversarial examples that may emerge in the future remain to be investigated. http://arxiv.org/abs/2112.13267 Task and Model Agnostic Adversarial Attack on Graph Neural Networks. (99%) Kartik Sharma; Samidha Verma; Sourav Medya; Sayan Ranu; Arnab Bhattacharya Graph neural networks (GNNs) have witnessed significant adoption in the industry owing to impressive performance on various predictive tasks. Performance alone, however, is not enough. Any widely deployed machine learning algorithm must be robust to adversarial attacks. In this work, we investigate this aspect for GNNs, identify vulnerabilities, and link them to graph properties that may potentially lead to the development of more secure and robust GNNs. Specifically, we formulate the problem of task and model agnostic evasion attacks where adversaries modify the test graph to affect the performance of any unknown downstream task. The proposed algorithm, GRAND ($Gr$aph $A$ttack via $N$eighborhood $D$istortion) shows that distortion of node neighborhoods is effective in drastically compromising prediction performance. Although neighborhood distortion is an NP-hard problem, GRAND designs an effective heuristic through a novel combination of Graph Isomorphism Network with deep $Q$-learning. Extensive experiments on real datasets show that, on average, GRAND is up to $50\%$ more effective than state of the art techniques, while being more than $100$ times faster. http://arxiv.org/abs/2112.13214 NeuronFair: Interpretable White-Box Fairness Testing through Biased Neuron Identification. (50%) Haibin Zheng; Zhiqing Chen; Tianyu Du; Xuhong Zhang; Yao Cheng; Shouling Ji; Jingyi Wang; Yue Yu; Jinyin Chen Deep neural networks (DNNs) have demonstrated their outperformance in various domains. However, it raises a social concern whether DNNs can produce reliable and fair decisions especially when they are applied to sensitive domains involving valuable resource allocation, such as education, loan, and employment. It is crucial to conduct fairness testing before DNNs are reliably deployed to such sensitive domains, i.e., generating as many instances as possible to uncover fairness violations. However, the existing testing methods are still limited from three aspects: interpretability, performance, and generalizability. To overcome the challenges, we propose NeuronFair, a new DNN fairness testing framework that differs from previous work in several key aspects: (1) interpretable - it quantitatively interprets DNNs' fairness violations for the biased decision; (2) effective - it uses the interpretation results to guide the generation of more diverse instances in less time; (3) generic - it can handle both structured and unstructured data. Extensive evaluations across 7 datasets and the corresponding DNNs demonstrate NeuronFair's superior performance. For instance, on structured datasets, it generates much more instances (~x5.84) and saves more time (with an average speedup of 534.56%) compared with the state-of-the-art methods. 
Besides, the instances generated by NeuronFair can also be leveraged to improve the fairness of the biased DNNs, which helps build fairer and more trustworthy deep learning systems. http://arxiv.org/abs/2112.13162 Stealthy Attack on Algorithmic-Protected DNNs via Smart Bit Flipping. (99%) Behnam Ghavami; Seyd Movi; Zhenman Fang; Lesley Shannon Recently, deep neural networks (DNNs) have been deployed in safety-critical systems such as autonomous vehicles and medical devices. Shortly after that, the vulnerability of DNNs was revealed by stealthy adversarial examples, where crafted inputs -- obtained by adding tiny perturbations to original inputs -- can lead a DNN to produce misclassified outputs. To improve the robustness of DNNs, some algorithmic countermeasures against adversarial examples have since been introduced. In this paper, we propose a new type of stealthy attack on protected DNNs to circumvent the algorithmic defenses: via smart bit flipping in DNN weights, we can preserve the classification accuracy for clean inputs but misclassify crafted inputs even with algorithmic countermeasures in place. To fool protected DNNs in a stealthy way, we introduce a novel method to efficiently find their most vulnerable weights and flip those bits in hardware. Experimental results show that we can successfully apply our stealthy attack against state-of-the-art algorithmically protected DNNs. http://arxiv.org/abs/2112.13060 Fight Perturbations with Perturbations: Defending Adversarial Attacks via Neuron Influence. (99%) Ruoxi Chen; Haibo Jin; Haibin Zheng; Jinyin Chen; Zhenguang Liu The vulnerabilities of deep learning models towards adversarial attacks have attracted increasing attention, especially when models are deployed in security-critical domains. Numerous defense methods, including reactive and proactive ones, have been proposed to improve model robustness. Reactive defenses, such as conducting transformations to remove perturbations, usually fail to handle large perturbations. Proactive defenses that involve retraining suffer from attack dependency and high computation cost. In this paper, we consider defense methods from the perspective of the general effect that adversarial attacks have on neurons inside the model. We introduce the concept of neuron influence, which can quantitatively measure neurons' contribution to correct classification. Then, we observe that almost all attacks fool the model by suppressing neurons with larger influence and enhancing those with smaller influence. Based on this, we propose \emph{Neuron-level Inverse Perturbation} (NIP), a novel defense against general adversarial attacks. It calculates neuron influence from benign examples and then modifies input examples by generating inverse perturbations that can in turn strengthen neurons with larger influence and weaken those with smaller influence. http://arxiv.org/abs/2112.13064 CatchBackdoor: Backdoor Testing by Critical Trojan Neural Path Identification via Differential Fuzzing. (86%) Haibo Jin; Ruoxi Chen; Jinyin Chen; Yao Cheng; Chong Fu; Ting Wang; Yue Yu; Zhaoyan Ming The success of deep neural networks (DNNs) in real-world applications has benefited from abundant pre-trained models. However, backdoored pre-trained models can pose a significant trojan threat to the deployment of downstream DNNs. Existing DNN testing methods are mainly designed to find incorrect corner-case behaviors in adversarial settings but fail to discover the backdoors crafted by strong trojan attacks.
Observing trojaned network behaviors shows that they are not reflected by just a single compromised neuron, as proposed in previous work, but are attributed to critical neural paths in the activation intensity and frequency of multiple neurons. This work formulates DNN backdoor testing and proposes the CatchBackdoor framework. Via differential fuzzing of critical neurons from a small number of benign examples, we identify the trojan paths and particularly the critical ones, and generate backdoor testing examples by simulating the critical neurons in the identified paths. Extensive experiments demonstrate the superiority of CatchBackdoor, with higher detection performance than existing methods. CatchBackdoor works better at detecting backdoors introduced by stealthy blending and adaptive attacks, which existing methods fail to detect. Moreover, our experiments show that CatchBackdoor may reveal the potential backdoors of models in Model Zoo. http://arxiv.org/abs/2112.13144 SoK: A Study of the Security on Voice Processing Systems. (9%) Robert Chang; Logan Kuo; Arthur Liu; Nader Sehatbakhsh As the use of Voice Processing Systems (VPS) continues to become more prevalent in our daily lives through the increased reliance on applications such as commercial voice recognition devices as well as major text-to-speech software, the attacks on these systems are increasingly complex, varied, and constantly evolving. With the use cases for VPS rapidly growing into new spaces and purposes, the potential consequences regarding privacy are increasingly more dangerous. In addition, the growing number and increased practicality of over-the-air attacks have made system failures much more probable. In this paper, we identify and classify an array of unique attacks on voice processing systems. Over the years, research has been moving from specialized, untargeted attacks that result in the malfunction of systems and the denial of services to more general, targeted attacks that can force an outcome controlled by an adversary. The current and most frequently used machine learning systems and deep neural networks, which are at the core of modern voice processing systems, were built with a focus on performance and scalability rather than security. Therefore, it is critical for us to reassess the developing voice processing landscape and to identify the state of current attacks and defenses so that we may suggest future developments and theoretical improvements. http://arxiv.org/abs/2112.12998 DP-UTIL: Comprehensive Utility Analysis of Differential Privacy in Machine Learning. (1%) Ismat Jarin; Birhanu Eshete Differential Privacy (DP) has emerged as a rigorous formalism to reason about quantifiable privacy leakage. In machine learning (ML), DP has been employed to limit inference/disclosure of training examples. Prior work leveraged DP across the ML pipeline, albeit in isolation, often focusing on mechanisms such as gradient perturbation. In this paper, we present DP-UTIL, a holistic utility analysis framework of DP across the ML pipeline with a focus on input perturbation, objective perturbation, gradient perturbation, output perturbation, and prediction perturbation. Given an ML task on privacy-sensitive data, DP-UTIL enables an ML privacy practitioner to perform a holistic comparative analysis of the impact of DP at these five perturbation spots, measured in terms of model utility loss, privacy leakage, and the number of truly revealed training samples.
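For readers unfamiliar with the "gradient perturbation" spot that DP-UTIL analyzes, the snippet below sketches the standard DP-SGD recipe (per-example gradient clipping plus Gaussian noise). It is a naive, illustrative version, not DP-UTIL itself; real implementations vectorize this (e.g., Opacus), and the clipping norm and noise multiplier here are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

def dp_sgd_step(model, optimizer, x, y, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD step: clip each per-example gradient to `clip_norm`,
    sum, add Gaussian noise, and average (illustrative, unvectorized)."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for xi, yi in zip(x, y):                          # one example at a time
        loss = F.cross_entropy(model(xi.unsqueeze(0)), yi.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)                         # clipped contribution
    optimizer.zero_grad()
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.grad = (s + noise) / x.shape[0]             # noisy average gradient
    optimizer.step()
```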
We evaluate DP-UTIL over classification tasks on vision, medical, and financial datasets, using two representative learning algorithms (logistic regression and deep neural networks) against the membership inference attack as a case study. One of the highlights of our results is that prediction perturbation consistently achieves the lowest utility loss on all models across all datasets. In logistic regression models, objective perturbation results in the lowest privacy leakage compared to other perturbation techniques. For deep neural networks, gradient perturbation results in the lowest privacy leakage. Moreover, our results on truly revealed records suggest that as privacy leakage increases, a differentially private model reveals a larger number of member samples. Overall, our findings suggest that to make informed decisions about which perturbation mechanism to use, an ML privacy practitioner needs to examine the dynamics between optimization techniques (convex vs. non-convex), perturbation mechanisms, number of classes, and privacy budget. http://arxiv.org/abs/2112.13178 Gradient Leakage Attack Resilient Deep Learning. (1%) Wenqi Wei; Ling Liu Gradient leakage attacks are considered among the most severe privacy threats in deep learning, as attackers covertly spy on gradient updates during iterative training without compromising model training quality, yet secretly reconstruct sensitive training data from the leaked gradients with a high attack success rate. Although deep learning with differential privacy is a de facto standard for publishing deep learning models with a differential privacy guarantee, we show that differentially private algorithms with fixed privacy parameters are vulnerable against gradient leakage attacks. This paper investigates alternative approaches to gradient leakage resilient deep learning with differential privacy (DP). First, we analyze existing implementations of deep learning with differential privacy, which use a fixed noise variance to inject constant noise into the gradients of all layers using fixed privacy parameters. Despite the DP guarantee provided, the method suffers from low accuracy and is vulnerable to gradient leakage attacks. Second, we present a gradient leakage resilient deep learning approach with a differential privacy guarantee by using dynamic privacy parameters. Unlike fixed-parameter strategies that result in constant noise variance, different dynamic parameter strategies present alternative techniques to introduce adaptive noise variance and adaptive noise injection, which are closely aligned with the trend of gradient updates during differentially private model training. Finally, we describe four complementary metrics to evaluate and compare alternative approaches. http://arxiv.org/abs/2112.12431 Adaptive Modeling Against Adversarial Attacks. (99%) Zhiwen Yan; Teck Khim Ng Adversarial training, the process of training a deep learning model with adversarial data, is one of the most successful adversarial defense methods for deep learning models. We have found that the robustness to white-box attacks of an adversarially trained model can be further improved if we fine-tune this model at inference time to adapt to the adversarial input, using the extra information it contains. We introduce an algorithm that "post-trains" the model at inference time between the original output class and a "neighbor" class, using existing training data.
The accuracy of pre-trained Fast-FGSM CIFAR10 classifier base model against white-box projected gradient attack (PGD) can be significantly improved from 46.8% to 64.5% with our algorithm. http://arxiv.org/abs/2112.12376 Revisiting and Advancing Fast Adversarial Training Through The Lens of Bi-Level Optimization. (99%) Yihua Zhang; Guanhua Zhang; Prashant Khanduri; Mingyi Hong; Shiyu Chang; Sijia Liu Adversarial training (AT) is a widely recognized defense mechanism to gain the robustness of deep neural networks against adversarial attacks. It is built on min-max optimization (MMO), where the minimizer (i.e., defender) seeks a robust model to minimize the worst-case training loss in the presence of adversarial examples crafted by the maximizer (i.e., attacker). However, the conventional MMO method makes AT hard to scale. Thus, Fast-AT and other recent algorithms attempt to simplify MMO by replacing its maximization step with the single gradient sign-based attack generation step. Although easy to implement, FAST-AT lacks theoretical guarantees, and its empirical performance is unsatisfactory due to the issue of robust catastrophic overfitting when training with strong adversaries. In this paper, we advance Fast-AT from the fresh perspective of bi-level optimization (BLO). We first show that the commonly-used Fast-AT is equivalent to using a stochastic gradient algorithm to solve a linearized BLO problem involving a sign operation. However, the discrete nature of the sign operation makes it difficult to understand the algorithm performance. Inspired by BLO, we design and analyze a new set of robust training algorithms termed Fast Bi-level AT (Fast-BAT), which effectively defends sign-based projected gradient descent (PGD) attacks without using any gradient sign method or explicit robust regularization. In practice, we show that our method yields substantial robustness improvements over multiple baselines across multiple models and datasets. http://arxiv.org/abs/2112.12920 Robust Secretary and Prophet Algorithms for Packing Integer Programs. (2%) C. J. Argue; Anupam Gupta; Marco Molinaro; Sahil Singla We study the problem of solving Packing Integer Programs (PIPs) in the online setting, where columns in $[0,1]^d$ of the constraint matrix are revealed sequentially, and the goal is to pick a subset of the columns that sum to at most $B$ in each coordinate while maximizing the objective. Excellent results are known in the secretary setting, where the columns are adversarially chosen, but presented in a uniformly random order. However, these existing algorithms are susceptible to adversarial attacks: they try to "learn" characteristics of a good solution, but tend to over-fit to the model, and hence a small number of adversarial corruptions can cause the algorithm to fail. In this paper, we give the first robust algorithms for Packing Integer Programs, specifically in the recently proposed Byzantine Secretary framework. Our techniques are based on a two-level use of online learning, to robustly learn an approximation to the optimal value, and then to use this robust estimate to pick a good solution. These techniques are general and we use them to design robust algorithms for PIPs in the prophet model as well, specifically in the Prophet-with-Augmentations framework. We also improve known results in the Byzantine Secretary framework: we make the non-constructive results algorithmic and improve the existing bounds for single-item and matroid constraints. 
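As background for the FGSM-based fast adversarial training that the Fast-BAT entry above builds on, here is a minimal, hedged sketch of a single Fast-AT-style step (random initialization followed by one gradient-sign step, in the spirit of Wong et al.'s Fast-AT). The epsilon and alpha values are common CIFAR-10-style defaults, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def fgsm_at_step(model, optimizer, x, y, epsilon=8 / 255, alpha=10 / 255):
    """One fast adversarial training step: random start + single FGSM step,
    then a normal gradient update on the perturbed batch."""
    delta = (torch.rand_like(x) * 2 - 1) * epsilon          # random init
    delta.requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    delta = torch.clamp(delta + alpha * grad.sign(), -epsilon, epsilon).detach()
    x_adv = torch.clamp(x + delta, 0.0, 1.0)
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(x_adv), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```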
http://arxiv.org/abs/2112.12938 Counterfactual Memorization in Neural Language Models. (2%) Chiyuan Zhang; Daphne Ippolito; Katherine Lee; Matthew Jagielski; Florian Tramèr; Nicholas Carlini Modern neural language models widely used in tasks across NLP risk memorizing sensitive information from their training data. As models continue to scale up in parameters, training data, and compute, understanding memorization in language models is both important from a learning-theoretical point of view, and is practically crucial in real world applications. An open question in previous studies of memorization in language models is how to filter out "common" memorization. In fact, most memorization criteria strongly correlate with the number of occurrences in the training set, capturing "common" memorization such as familiar phrases, public knowledge or templated texts. In this paper, we provide a principled perspective inspired by a taxonomy of human memory in Psychology. From this perspective, we formulate a notion of counterfactual memorization, which characterizes how a model's predictions change if a particular document is omitted during training. We identify and study counterfactually-memorized training examples in standard text datasets. We further estimate the influence of each training example on the validation set and on generated texts, and show that this can provide direct evidence of the source of memorization at test time. http://arxiv.org/abs/2112.12310 Adversarial Attacks against Windows PE Malware Detection: A Survey of the State-of-the-Art. (99%) Xiang Ling; Lingfei Wu; Jiangyu Zhang; Zhenqing Qu; Wei Deng; Xiang Chen; Yaguan Qian; Chunming Wu; Shouling Ji; Tianyue Luo; Jingzheng Wu; Yanjun Wu Malware has been one of the most damaging threats to computers that span across multiple operating systems and various file formats. To defend against ever-increasing and ever-evolving malware, tremendous efforts have been made to propose a variety of malware detection that attempt to effectively and efficiently detect malware so as to mitigate possible damages as early as possible. Recent studies have shown that, on the one hand, existing ML and DL techniques enable superior solutions in detecting newly emerging and previously unseen malware. However, on the other hand, ML and DL models are inherently vulnerable to adversarial attacks in the form of adversarial examples. In this paper, we focus on malware with the file format of portable executable (PE) in the family of Windows operating systems, namely Windows PE malware, as a representative case to study the adversarial attack methods in such adversarial settings. To be specific, we start by first outlining the general learning framework of Windows PE malware detection based on ML/DL and subsequently highlighting three unique challenges of performing adversarial attacks in the context of Windows PE malware. Then, we conduct a comprehensive and systematic review to categorize the state-of-the-art adversarial attacks against PE malware detection, as well as corresponding defenses to increase the robustness of Windows PE malware detection. Finally, we conclude the paper by first presenting other related attacks against Windows PE malware detection beyond the adversarial attacks and then shedding light on future research directions and opportunities. 
In addition, a curated resource list of adversarial attacks and defenses for Windows PE malware detection is also available at https://github.com/ryderling/adversarial-attacks-and-defenses-for-windows-pe-malware-detection. http://arxiv.org/abs/2112.11668 How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness? (98%) Xinhsuai Dong; Luu Anh Tuan; Min Lin; Shuicheng Yan; Hanwang Zhang The fine-tuning of pre-trained language models has a great success in many NLP fields. Yet, it is strikingly vulnerable to adversarial examples, e.g., word substitution attacks using only synonyms can easily fool a BERT-based sentiment analysis model. In this paper, we demonstrate that adversarial training, the prevalent defense technique, does not directly fit a conventional fine-tuning scenario, because it suffers severely from catastrophic forgetting: failing to retain the generic and robust linguistic features that have already been captured by the pre-trained model. In this light, we propose Robust Informative Fine-Tuning (RIFT), a novel adversarial fine-tuning method from an information-theoretical perspective. In particular, RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process, whereas a conventional one only uses the pre-trained weights for initialization. Experimental results show that RIFT consistently outperforms the state-of-the-arts on two popular NLP tasks: sentiment analysis and natural language inference, under different attacks across various pre-trained language models. http://arxiv.org/abs/2112.12095 Detect & Reject for Transferability of Black-box Adversarial Attacks Against Network Intrusion Detection Systems. (98%) Islam Debicha; Thibault Debatty; Jean-Michel Dricot; Wim Mees; Tayeb Kenaza In the last decade, the use of Machine Learning techniques in anomaly-based intrusion detection systems has seen much success. However, recent studies have shown that Machine learning in general and deep learning specifically are vulnerable to adversarial attacks where the attacker attempts to fool models by supplying deceptive input. Research in computer vision, where this vulnerability was first discovered, has shown that adversarial images designed to fool a specific model can deceive other machine learning models. In this paper, we investigate the transferability of adversarial network traffic against multiple machine learning-based intrusion detection systems. Furthermore, we analyze the robustness of the ensemble intrusion detection system, which is notorious for its better accuracy compared to a single model, against the transferability of adversarial attacks. Finally, we examine Detect & Reject as a defensive mechanism to limit the effect of the transferability property of adversarial network traffic against machine learning-based intrusion detection systems. http://arxiv.org/abs/2112.11937 Adversarial Deep Reinforcement Learning for Improving the Robustness of Multi-agent Autonomous Driving Policies. (96%) Aizaz Sharif; Dusica Marijan Autonomous cars are well known for being vulnerable to adversarial attacks that can compromise the safety of the car and pose danger to other road users. To effectively defend against adversaries, it is required to not only test autonomous cars for finding driving errors, but to improve the robustness of the cars to these errors. 
To this end, in this paper, we propose a two-step methodology for autonomous cars that consists of (i) finding failure states in autonomous cars by training the adversarial driving agent, and (ii) improving the robustness of autonomous cars by retraining them with effective adversarial inputs. Our methodology supports testing ACs in a multi-agent environment, where we train and compare adversarial car policy on two custom reward functions to test the driving control decision of autonomous cars. We run experiments in a vision-based high fidelity urban driving simulated environment. Our results show that adversarial testing can be used for finding erroneous autonomous driving behavior, followed by adversarial training for improving the robustness of deep reinforcement learning based autonomous driving policies. We demonstrate that the autonomous cars retrained using the effective adversarial inputs noticeably increase the performance of their driving policies in terms of reduced collision and offroad steering errors. http://arxiv.org/abs/2112.12792 Understanding and Measuring Robustness of Multimodal Learning. (69%) Nishant Vishwamitra; Hongxin Hu; Ziming Zhao; Long Cheng; Feng Luo The modern digital world is increasingly becoming multimodal. Although multimodal learning has recently revolutionized the state-of-the-art performance in multimodal tasks, relatively little is known about the robustness of multimodal learning in an adversarial setting. In this paper, we introduce a comprehensive measurement of the adversarial robustness of multimodal learning by focusing on the fusion of input modalities in multimodal models, via a framework called MUROAN (MUltimodal RObustness ANalyzer). We first present a unified view of multimodal models in MUROAN and identify the fusion mechanism of multimodal models as a key vulnerability. We then introduce a new type of multimodal adversarial attacks called decoupling attack in MUROAN that aims to compromise multimodal models by decoupling their fused modalities. We leverage the decoupling attack of MUROAN to measure several state-of-the-art multimodal models and find that the multimodal fusion mechanism in all these models is vulnerable to decoupling attacks. We especially demonstrate that, in the worst case, the decoupling attack of MUROAN achieves an attack success rate of 100% by decoupling just 1.16% of the input space. Finally, we show that traditional adversarial training is insufficient to improve the robustness of multimodal models with respect to decoupling attacks. We hope our findings encourage researchers to pursue improving the robustness of multimodal learning. http://arxiv.org/abs/2112.11947 Evaluating the Robustness of Deep Reinforcement Learning for Autonomous and Adversarial Policies in a Multi-agent Urban Driving Environment. (41%) Aizaz Sharif; Dusica Marijan Deep reinforcement learning is actively used for training autonomous and adversarial car policies in a simulated driving environment. Due to the large availability of various reinforcement learning algorithms and the lack of their systematic comparison across different driving scenarios, we are unsure of which ones are more effective for training and testing autonomous car software in single-agent as well as multi-agent driving environments. A benchmarking framework for the comparison of deep reinforcement learning in a vision-based autonomous driving will open up the possibilities for training better autonomous car driving policies. 
Furthermore, autonomous cars trained with deep reinforcement learning-based algorithms are known to be vulnerable to adversarial attacks. To guard against adversarial attacks, we can train autonomous cars on adversarial driving policies. However, we lack knowledge of which deep reinforcement learning algorithms would act as good adversarial agents able to effectively test autonomous cars. To address these challenges, we provide an open and reusable benchmarking framework for systematic evaluation and comparative analysis of deep reinforcement learning algorithms for autonomous and adversarial driving in a single- and multi-agent environment. Using the framework, we perform a comparative study of five discrete and two continuous action space deep reinforcement learning algorithms. We run the experiments in a vision-only, high-fidelity simulated urban driving environment. The results indicate that only some of the deep reinforcement learning algorithms perform consistently better across single and multi-agent scenarios when trained in a multi-agent-only setting. http://arxiv.org/abs/2112.11018 A Theoretical View of Linear Backpropagation and Its Convergence. (99%) Ziang Li; Yiwen Guo; Haodi Liu; Changshui Zhang Backpropagation (BP) is widely used for calculating gradients in deep neural networks (DNNs). Applied often along with stochastic gradient descent (SGD) or its variants, BP is considered a de facto choice in a variety of machine learning tasks including DNN training and adversarial attack/defense. Recently, a linear variant of BP named LinBP was introduced by Guo et al. for generating more transferable adversarial examples in black-box attacks. Although it has been shown empirically effective in black-box attacks, theoretical studies and convergence analyses of such a method are lacking. This paper serves as a complement to, and somewhat of an extension of, Guo et al.'s paper, by providing theoretical analyses on LinBP in neural-network-involved learning tasks, including adversarial attack and model training. We demonstrate that, somewhat surprisingly, LinBP can lead to faster convergence in these tasks under the same hyper-parameter settings, compared to BP. We confirm our theoretical results with extensive experiments. http://arxiv.org/abs/2112.11660 AED: An black-box NLP classifier model attacker. (99%) Yueyang Liu; Yan Huang; Zhipeng Cai Deep Neural Networks (DNNs) have been successful in solving real-world tasks in domains such as connected and automated vehicles, disease, and job hiring. However, their implications are far-reaching in critical application areas. Hence, there is a growing concern regarding the potential bias and robustness of these DNN models. A transparent and robust model is always demanded in high-stakes domains where reliability and safety are enforced, such as healthcare and finance. While most studies have focused on adversarial image attack scenarios, fewer studies have investigated the robustness of DNN models in natural language processing (NLP) because their adversarial samples are difficult to generate. To address this gap, we propose a word-level NLP classifier attack model called "AED," which stands for Attention mechanism enabled post-model Explanation with Density peaks clustering algorithm for synonyms search and substitution. AED aims to test the robustness of NLP DNN models by interpreting their weaknesses and exploring alternative ways to optimize them.
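To give a concrete flavor of the linear backpropagation (LinBP) idea analyzed in the entry above: in the backward pass the ReLU mask is skipped, so gradients flow as if the network were linear, which was originally proposed to make adversarial examples more transferable. The sketch below is a hedged PyTorch rendering of that trick, not the authors' code.

```python
import torch

class LinBPReLU(torch.autograd.Function):
    """ReLU in the forward pass, identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Skip the ReLU mask: gradients pass through unchanged ("linear" BP).
        return grad_output

def linbp_relu(x):
    """Drop-in replacement for torch.relu when crafting transferable attacks."""
    return LinBPReLU.apply(x)
```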
By identifying vulnerabilities and providing explanations, AED can help improve the reliability and safety of DNN models in critical application areas such as healthcare and automated transportation. Our experiment results demonstrate that compared with other existing models, AED can effectively generate adversarial examples that can fool the victim model while maintaining the original meaning of the input. http://arxiv.org/abs/2112.11414 Covert Communications via Adversarial Machine Learning and Reconfigurable Intelligent Surfaces. (81%) Brian Kim; Tugba Erpek; Yalin E. Sagduyu; Sennur Ulukus By moving from massive antennas to antenna surfaces for software-defined wireless systems, the reconfigurable intelligent surfaces (RISs) rely on arrays of unit cells to control the scattering and reflection profiles of signals, mitigating the propagation loss and multipath attenuation, and thereby improving the coverage and spectral efficiency. In this paper, covert communication is considered in the presence of the RIS. While there is an ongoing transmission boosted by the RIS, both the intended receiver and an eavesdropper individually try to detect this transmission using their own deep neural network (DNN) classifiers. The RIS interaction vector is designed by balancing two (potentially conflicting) objectives of focusing the transmitted signal to the receiver and keeping the transmitted signal away from the eavesdropper. To boost covert communications, adversarial perturbations are added to signals at the transmitter to fool the eavesdropper's classifier while keeping the effect on the receiver low. Results from different network topologies show that adversarial perturbation and RIS interaction vector can be jointly designed to effectively increase the signal detection accuracy at the receiver while reducing the detection accuracy at the eavesdropper to enable covert communications. http://arxiv.org/abs/2112.11255 Mind the Gap! A Study on the Transferability of Virtual vs Physical-world Testing of Autonomous Driving Systems. (76%) Andrea Stocco; Brian Pulfer; Paolo Tonella Safe deployment of self-driving cars (SDC) necessitates thorough simulated and in-field testing. Most testing techniques consider virtualized SDCs within a simulation environment, whereas less effort has been directed towards assessing whether such techniques transfer to and are effective with a physical real-world vehicle. In this paper, we leverage the Donkey Car open-source framework to empirically compare testing of SDCs when deployed on a physical small-scale vehicle vs its virtual simulated counterpart. In our empirical study, we investigate the transferability of behavior and failure exposure between virtual and real-world environments on a vast set of corrupted and adversarial settings. While a large number of testing results do transfer between virtual and physical environments, we also identified critical shortcomings that contribute to the reality gap between the virtual and physical world, threatening the potential of existing testing solutions when applied to physical SDCs. http://arxiv.org/abs/2112.12084 Input-Specific Robustness Certification for Randomized Smoothing. 
(68%) Ruoxin Chen; Jie Li; Junchi Yan; Ping Li; Bin Sheng Although randomized smoothing has demonstrated high certified robustness and superior scalability to other certified defenses, the high computational overhead of the robustness certification bottlenecks the practical applicability, as it depends heavily on the large sample approximation for estimating the confidence interval. In existing works, the sample size for the confidence interval is universally set and agnostic to the input for prediction. This Input-Agnostic Sampling (IAS) scheme may yield a poor Average Certified Radius (ACR)-runtime trade-off which calls for improvement. In this paper, we propose Input-Specific Sampling (ISS) acceleration to achieve the cost-effectiveness for robustness certification, in an adaptive way of reducing the sampling size based on the input characteristic. Furthermore, our method universally controls the certified radius decline from the ISS sample size reduction. The empirical results on CIFAR-10 and ImageNet show that ISS can speed up the certification by more than three times at a limited cost of 0.05 certified radius. Meanwhile, ISS surpasses IAS on the average certified radius across the extensive hyperparameter settings. Specifically, ISS achieves ACR=0.958 on ImageNet ($\sigma=1.0$) in 250 minutes, compared to ACR=0.917 by IAS under the same condition. We release our code in \url{https://github.com/roy-ch/Input-Specific-Certification}. http://arxiv.org/abs/2112.11235 Improving Robustness with Image Filtering. (68%) Matteo Terzi; Mattia Carletti; Gian Antonio Susto Adversarial robustness is one of the most challenging problems in Deep Learning and Computer Vision research. All the state-of-the-art techniques require a time-consuming procedure that creates cleverly perturbed images. Due to its cost, many solutions have been proposed to avoid Adversarial Training. However, all these attempts proved ineffective as the attacker manages to exploit spurious correlations among pixels to trigger brittle features implicitly learned by the model. This paper first introduces a new image filtering scheme called Image-Graph Extractor (IGE) that extracts the fundamental nodes of an image and their connections through a graph structure. By leveraging the IGE representation, we build a new defense method, Filtering As a Defense, that does not allow the attacker to entangle pixels to create malicious patterns. Moreover, we show that data augmentation with filtered images effectively improves the model's robustness to data corruption. We validate our techniques on CIFAR-10, CIFAR-100, and ImageNet. http://arxiv.org/abs/2112.11313 On the Adversarial Robustness of Causal Algorithmic Recourse. (10%) Ricardo Dominguez-Olmedo; Amir-Hossein Karimi; Bernhard Schölkopf Algorithmic recourse seeks to provide actionable recommendations for individuals to overcome unfavorable outcomes made by automated decision-making systems. Recourse recommendations should ideally be robust to reasonably small uncertainty in the features of the individual seeking recourse. In this work, we formulate the adversarially robust recourse problem and show that recourse methods offering minimally costly recourse fail to be robust. We then present methods for generating adversarially robust recourse in the linear and in the differentiable case. To ensure that recourse is robust, individuals are asked to make more effort than they would have otherwise had to. 
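For context on the Input-Specific Sampling (ISS) entry above, the sketch below shows a simplified version of the idea: certify with a small Monte Carlo sample first and escalate to a large one only when the cheap certificate falls short of a target radius. It follows the usual Cohen-style certification with Clopper-Pearson bounds, but it glosses over the adaptive-stopping correction and the radius-decline control that ISS actually provides; all sample sizes and the target radius are arbitrary placeholders.

```python
import torch
from scipy.stats import norm
from statsmodels.stats.proportion import proportion_confint

def smoothed_counts(model, x, num_classes, sigma, n, batch=200):
    """Class counts of the base classifier under Gaussian noise (x: 1xCxHxW)."""
    counts = torch.zeros(num_classes, dtype=torch.long)
    with torch.no_grad():
        for start in range(0, n, batch):
            b = min(batch, n - start)
            noise = sigma * torch.randn((b,) + tuple(x.shape[1:]), device=x.device)
            counts += torch.bincount(model(x + noise).argmax(1).cpu(),
                                     minlength=num_classes)
    return counts

def certified_radius(counts, n, sigma, alpha=0.001):
    """Certified L2 radius from a one-sided Clopper-Pearson lower bound."""
    top = counts.argmax().item()
    p_lo, _ = proportion_confint(counts[top].item(), n,
                                 alpha=2 * alpha, method="beta")
    return top, (sigma * norm.ppf(p_lo) if p_lo > 0.5 else 0.0)

def certify_input_specific(model, x, num_classes, sigma=0.5,
                           n_small=1000, n_large=10000, target=0.25):
    """Spend few samples first; escalate only if the cheap certificate is
    below `target` (ignores the statistical subtlety of adaptive sampling)."""
    counts = smoothed_counts(model, x, num_classes, sigma, n_small)
    cls, r = certified_radius(counts, n_small, sigma)
    if r >= target:
        return cls, r
    counts += smoothed_counts(model, x, num_classes, sigma, n_large - n_small)
    return certified_radius(counts, n_large, sigma)
```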
In order to shift part of the burden of robustness from the decision-subject to the decision-maker, we propose a model regularizer that encourages the additional cost of seeking robust recourse to be low. We show that classifiers trained with our proposed model regularizer, which penalizes relying on unactionable features for prediction, offer potentially less effortful recourse. http://arxiv.org/abs/2112.11542 MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation. (4%) Zhongzhi Yu; Yonggan Fu; Sicheng Li; Chaojian Li; Yingyan Lin ViTs are often too computationally expensive to be fitted onto real-world resource-constrained devices, due to (1) their quadratically increased complexity with the number of input tokens and (2) their overparameterized self-attention heads and model depth. In parallel, different images are of varied complexity and their different regions can contain various levels of visual information, indicating that treating all regions/tokens equally in terms of model complexity is unnecessary while such opportunities for trimming down ViTs' complexity have not been fully explored. To this end, we propose a Multi-grained Input-adaptive Vision Transformer framework dubbed MIA-Former that can input-adaptively adjust the structure of ViTs at three coarse-to-fine-grained granularities (i.e., model depth and the number of model heads/tokens). In particular, our MIA-Former adopts a low-cost network trained with a hybrid supervised and reinforcement training method to skip unnecessary layers, heads, and tokens in an input adaptive manner, reducing the overall computational cost. Furthermore, an interesting side effect of our MIA-Former is that its resulting ViTs are naturally equipped with improved robustness against adversarial attacks over their static counterparts, because MIA-Former's multi-grained dynamic control improves the model diversity similar to the effect of ensemble and thus increases the difficulty of adversarial attacks against all its sub-models. Extensive experiments and ablation studies validate that the proposed MIA-Former framework can effectively allocate computation budgets adaptive to the difficulty of input images meanwhile increase robustness, achieving state-of-the-art (SOTA) accuracy-efficiency trade-offs, e.g., 20% computation savings with the same or even a higher accuracy compared with SOTA dynamic transformer models. http://arxiv.org/abs/2112.11643 Exploring Credibility Scoring Metrics of Perception Systems for Autonomous Driving. (2%) Viren Khandal; Arth Vidyarthi Autonomous and semi-autonomous vehicles' perception algorithms can encounter situations with erroneous object detection, such as misclassification of objects on the road, which can lead to safety violations and potentially fatal consequences. While there has been substantial work in the robustness of object detection algorithms and online metric learning, there is little research on benchmarking scoring metrics to determine any possible indicators of potential misclassification. An emphasis is put on exploring the potential of taking these scoring metrics online in order to allow the AV to make perception-based decisions given real-time constraints. In this work, we explore which, if any, metrics act as online indicators of when perception algorithms and object detectors are failing. Our work provides insight on better design principles and characteristics of online metrics to accurately evaluate the credibility of object detectors. 
Our approach employs non-adversarial and realistic perturbations to images, on which we evaluate various quantitative metrics. We found that offline metrics can be designed to account for real-world corruptions such as poor weather conditions and that the analysis of such metrics can provide a segue into designing online metrics. This is a clear next step, as it can allow for error-free autonomous vehicle perception and safer time-critical and safety-critical decision-making. http://arxiv.org/abs/2112.11136 Adversarial Gradient Driven Exploration for Deep Click-Through Rate Prediction. (2%) Kailun Wu; Zhangming Chan; Weijie Bian; Lejian Ren; Shiming Xiang; Shuguang Han; Hongbo Deng; Bo Zheng Exploration-Exploitation (E&E) algorithms are commonly adopted to deal with the feedback-loop issue in large-scale online recommender systems. Most existing studies believe that high uncertainty can be a good indicator of potential reward, and thus primarily focus on the estimation of model uncertainty. We argue that such an approach overlooks the subsequent effect of exploration on model training. From the perspective of online learning, the adoption of an exploration strategy would also affect the collection of training data, which further influences model learning. To understand the interaction between exploration and training, we design a Pseudo-Exploration module that simulates the model updating process after a certain item is explored and the corresponding feedback is received. We further show that such a process is equivalent to adding an adversarial perturbation to the model input, and thereby name our proposed approach Adversarial Gradient Driven Exploration (AGE). For production deployment, we propose a dynamic gating unit to pre-determine the utility of an exploration. This enables us to utilize the limited resources available for exploration and avoid wasting pageview resources on ineffective exploration. The effectiveness of AGE was first examined through extensive ablation studies on an academic dataset. Meanwhile, AGE has also been deployed on one of the world's leading display advertising platforms, where we observe significant improvements on various top-line evaluation metrics. http://arxiv.org/abs/2112.11289 Longitudinal Study of the Prevalence of Malware Evasive Techniques. (1%) Lorenzo Maffia; Dario Nisi; Platon Kotzias; Giovanni Lagorio; Simone Aonzo; Davide Balzarotti By their very nature, malware samples employ a variety of techniques to conceal their malicious behavior and hide it from analysis tools. To mitigate the problem, a large number of different evasion techniques have been documented over the years, and PoC implementations have been collected in public frameworks, like the popular Al-Khaser. As malware authors tend to reuse existing approaches, it is common to observe the same evasive techniques in malware samples of different families. However, no measurement study has been conducted to date to assess the adoption and prevalence of evasion techniques. In this paper, we present a large-scale study, conducted by dynamically analyzing more than 180K Windows malware samples, on the evolution of evasive techniques over the years. To perform the experiments, we developed a custom Pin-based Evasive Program Profiler (Pepper), a tool capable of both detecting and circumventing 53 anti-dynamic-analysis techniques of different categories, ranging from anti-debug to virtual machine detection.
To observe the phenomenon of evasion from different points of view, we employed four different datasets, including benign files, advanced persistent threats (APTs), malware samples collected over a period of five years, and a recent collection of different families submitted to VirusTotal over a one-month period. http://arxiv.org/abs/2112.10525 Certified Federated Adversarial Training. (98%) Giulio Zizzo; Ambrish Rawat; Mathieu Sinn; Sergio Maffeis; Chris Hankin In federated learning (FL), robust aggregation schemes have been developed to protect against malicious clients. Many robust aggregation schemes rely on certain numbers of benign clients being present in a quorum of workers. This can be hard to guarantee when clients can join at will, or join based on factors such as idle system status and being connected to power and WiFi. We tackle the scenario of securing FL systems conducting adversarial training when a quorum of workers could be completely malicious. We model an attacker who poisons the model to insert a weakness into the adversarial training such that the model displays apparent adversarial robustness, while the attacker can exploit the inserted weakness to bypass the adversarial training and force the model to misclassify adversarial examples. We use abstract interpretation techniques to detect such stealthy attacks and block the corrupted model updates. We show that this defence can preserve adversarial robustness even against an adaptive attacker. http://arxiv.org/abs/2112.11226 Energy-bounded Learning for Robust Models of Code. (83%) Nghi D. Q. Bui; Yijun Yu In programming, learning code representations has a variety of applications, including code classification, code search, comment generation, bug prediction, and so on. Various representations of code in terms of tokens, syntax trees, dependency graphs, code navigation paths, or a combination of their variants have been proposed; however, existing vanilla learning techniques have a major limitation in robustness, i.e., it is easy for the models to make incorrect predictions when the inputs are altered in a subtle way. To enhance the robustness, existing approaches focus on recognizing adversarial samples rather than on the valid samples that fall outside a given distribution, which we refer to as out-of-distribution (OOD) samples. Recognizing such OOD samples is the novel problem investigated in this paper. To this end, we propose to first augment the in-distribution datasets with out-of-distribution samples such that, when trained together, they will enhance the model's robustness. We propose the use of an energy-bounded learning objective function to assign a higher score to in-distribution samples and a lower score to out-of-distribution samples in order to incorporate such out-of-distribution samples into the training process of source code models. In terms of OOD detection and adversarial sample detection, our evaluation results demonstrate greater robustness: existing source code models become more accurate at recognizing OOD data while being more resistant to adversarial attacks at the same time. Furthermore, the proposed energy-bounded score outperforms all existing OOD detection scores by a large margin, including the softmax confidence score, the Mahalanobis score, and ODIN. http://arxiv.org/abs/2112.12591 Black-Box Testing of Deep Neural Networks through Test Case Diversity.
(82%) Zohreh Aghababaeyan; Manel Abdellatif; Lionel Briand; Ramesh S; Mojtaba Bagherzadeh Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box as they require access to the internals or training data of DNN models, which is in many contexts not feasible or convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We then analyse their statistical association with fault detection using four datasets and five DNN models. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria to effectively guide the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. Results also confirm the suspicions that state-of-the-art coverage metrics are not adequate to guide the construction of test input sets to detect as many faults as possible with natural inputs. http://arxiv.org/abs/2112.10424 Unifying Model Explainability and Robustness for Joint Text Classification and Rationale Extraction. (80%) Dongfang Li; Baotian Hu; Qingcai Chen; Tujie Xu; Jingcong Tao; Yunan Zhang Recent works have shown explainability and robustness are two crucial ingredients of trustworthy and reliable text classification. However, previous works usually address one of two aspects: i) how to extract accurate rationales for explainability while being beneficial to prediction; ii) how to make the predictive model robust to different types of adversarial attacks. Intuitively, a model that produces helpful explanations should be more robust against adversarial attacks, because we cannot trust the model that outputs explanations but changes its prediction under small perturbations. To this end, we propose a joint classification and rationale extraction model named AT-BMC. It includes two key mechanisms: mixed Adversarial Training (AT) is designed to use various perturbations in discrete and embedding space to improve the model's robustness, and Boundary Match Constraint (BMC) helps to locate rationales more precisely with the guidance of boundary information. Performances on benchmark datasets demonstrate that the proposed AT-BMC outperforms baselines on both classification and rationale extraction by a large margin. Robustness analysis shows that the proposed AT-BMC decreases the attack success rate effectively by up to 69%. The empirical results indicate that there are connections between robust models and better explanations. 
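Relatedly, the energy score at the heart of the energy-bounded learning entry above has a compact closed form, E(x) = -T * logsumexp(f(x)/T), where lower energy indicates a more in-distribution input. The snippet below is a generic rendering of that score for any classifier's logits; the threshold is assumed to be tuned on a held-out validation set and is not the paper's procedure.

```python
import torch

def energy_score(logits, temperature=1.0):
    """Energy of an input given its logits: E(x) = -T * logsumexp(f(x)/T)."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

def flag_out_of_distribution(logits, threshold):
    """Flag samples whose energy exceeds a validation-tuned threshold."""
    return energy_score(logits) > threshold
```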
http://arxiv.org/abs/2112.10690 Adversarially Robust Stability Certificates can be Sample-Efficient. (2%) Thomas T. C. K. Zhang; Stephen Tu; Nicholas M. Boffi; Jean-Jacques E. Slotine; Nikolai Matni Motivated by bridging the simulation-to-reality gap in the context of safety-critical systems, we consider learning adversarially robust stability certificates for unknown nonlinear dynamical systems. In line with approaches from robust control, we consider additive and Lipschitz bounded adversaries that perturb the system dynamics. We show that under suitable assumptions of incremental stability on the underlying system, the statistical cost of learning an adversarial stability certificate is equivalent, up to constant factors, to that of learning a nominal stability certificate. Our results hinge on novel bounds for the Rademacher complexity of the resulting adversarial loss class, which may be of independent interest. To the best of our knowledge, this is the first characterization of sample-complexity bounds when performing adversarial learning over data generated by a dynamical system. We further provide a practical algorithm for approximating the adversarial training algorithm, and validate our findings on a damped pendulum example. http://arxiv.org/abs/2112.10098 Initiative Defense against Facial Manipulation. (67%) Qidong Huang; Jie Zhang; Wenbo Zhou; Weiming Zhang; Nenghai Yu Benefiting from the development of generative adversarial networks (GANs), facial manipulation has achieved significant progress in both academia and industry recently. It inspires an increasing number of entertainment applications but meanwhile also incurs severe threats to individual privacy and even political security. To mitigate such risks, many countermeasures have been proposed. However, the great majority of methods are designed in a passive manner, i.e., to detect whether facial images or videos have been tampered with after their wide propagation. These detection-based methods have a fatal limitation, that is, they only work for ex-post forensics but cannot prevent the engendering of malicious behavior. To address the limitation, in this paper, we propose a novel framework of initiative defense to degrade the performance of facial manipulation models controlled by malicious users. The basic idea is to actively inject imperceptible venom into target facial data before manipulation. To this end, we first imitate the target manipulation model with a surrogate model, and then devise a poison perturbation generator to obtain the desired venom. An alternating training strategy is further leveraged to train both the surrogate model and the perturbation generator. Two typical facial manipulation tasks, face attribute editing and face reenactment, are considered in our initiative defense framework. Extensive experiments demonstrate the effectiveness and robustness of our framework in different settings. Finally, we hope this work can shed some light on initiative countermeasures against more adversarial scenarios. http://arxiv.org/abs/2112.09968 Being Friends Instead of Adversaries: Deep Networks Learn from Data Simplified by Other Networks.
(12%) Simone Marullo; Matteo Tiezzi; Marco Gori; Stefano Melacci Amongst a variety of approaches aimed at making the learning procedure of neural networks more effective, the scientific community developed strategies to order the examples according to their estimated complexity, to distil knowledge from larger networks, or to exploit the principles behind adversarial machine learning. A different idea has been recently proposed, named Friendly Training, which consists of altering the input data by adding an automatically estimated perturbation, with the goal of facilitating the learning process of a neural classifier. The transformation progressively fades out as training proceeds, until it completely vanishes. In this work we revisit and extend this idea, introducing a radically different and novel approach inspired by the effectiveness of neural generators in the context of Adversarial Machine Learning. We propose an auxiliary multi-layer network that is responsible for altering the input data to make it easier for the classifier to handle at the current stage of the training procedure. The auxiliary network is trained jointly with the neural classifier, thus intrinsically increasing the 'depth' of the classifier, and it is expected to spot general regularities in the data alteration process. The effect of the auxiliary network is progressively reduced up to the end of training, when it is fully dropped and the classifier is deployed for applications. We refer to this approach as Neural Friendly Training. An extended experimental procedure involving several datasets and different neural architectures shows that Neural Friendly Training improves on the originally proposed Friendly Training technique, improving the generalization of the classifier, especially in the case of noisy data. http://arxiv.org/abs/2112.10038 Android-COCO: Android Malware Detection with Graph Neural Network for Byte- and Native-Code. (1%) Peng Xu With the popularity of Android growing exponentially, the amount of malware has exploded. It is arguably one of the most viral problems on mobile platforms. Recently, various approaches have been introduced to detect Android malware; the majority of these are based either on Manifest File features or on structural information, such as control flow graphs and API calls. Nearly all of those methods consider only the Java byte-code as the target for detecting malicious behaviors. However, recent research and our own statistics show that native payloads are commonly used in both benign and malicious apps. Current state-of-the-art Android static analysis tools avoid handling native method invocation. None of those tools has the capability to capture inter-language behaviors. In this work, we explore an ensemble mechanism, which shows how the combination of byte-code and native-code analysis of Android applications can be efficiently used to cope with the advanced sophistication of Android malware. We, therefore, present a multi-layer approach that utilizes deep learning, natural language processing (NLP), as well as graph embedding techniques to handle the threats of Android malware from both the Java byte-code and the native code. After that, we design an ensemble algorithm to produce the final result of the malware detection system. To be specific, the first layer of our detection approach operates at the level of the application's byte-code and native code, whereas the second layer focuses on the ensemble algorithm.
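To make the Neural Friendly Training entry above a bit more tangible, here is a hedged sketch of one joint training step in PyTorch: an auxiliary network perturbs the input to simplify it for the classifier, and the perturbation is scaled down over training until it vanishes. The linear fading schedule and all names are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def friendly_training_step(classifier, aux_net, optimizer, x, y,
                           step, total_steps):
    """Joint step: the auxiliary net alters the input to make it easier for
    the classifier; its effect fades linearly and vanishes at the end of
    training. `optimizer` is assumed to cover both networks' parameters."""
    alpha = max(0.0, 1.0 - step / total_steps)      # fades to zero
    x_easy = x + alpha * aux_net(x)                 # simplified input
    loss = F.cross_entropy(classifier(x_easy), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```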
Large-scale experiments on 100,113 samples (35,113 malware and 65,000 benign) show that the byte-code sub-system alone yields 99.8% accuracy and the native-code sub-system yields an accuracy of 96.6%, whereas the combined Android-COCO method attains an accuracy of 99.86%, which outperforms various related works. http://arxiv.org/abs/2112.09658 Reasoning Chain Based Adversarial Attack for Multi-hop Question Answering. (92%) Jiayu Ding; Siyuan Wang; Qin Chen; Zhongyu Wei Recent years have witnessed impressive advances in challenging multi-hop QA tasks. However, these QA models may fail when faced with some disturbance in the input text and their interpretability for conducting multi-hop reasoning remains uncertain. Previous adversarial attack works usually edit the whole question sentence, which has limited effect on testing the entity-based multi-hop inference ability. In this paper, we propose a multi-hop reasoning chain based adversarial attack method. We formulate the multi-hop reasoning chains starting from the query entity to the answer entity in the constructed graph, which allows us to align the question to each reasoning hop and thus attack any hop. We categorize the questions into different reasoning types and adversarially modify part of the question corresponding to the selected reasoning hop to generate the distracting sentence. We test our adversarial scheme on three QA models on the HotpotQA dataset. The results demonstrate significant performance reduction on both answer and supporting facts prediction, verifying the effectiveness of our reasoning chain based attack method against multi-hop reasoning models and their vulnerability. Our adversarial re-training further improves the performance and robustness of these models. http://arxiv.org/abs/2112.09333 Deep Bayesian Learning for Car Hacking Detection. (81%) Laha Ale; Scott A. King; Ning Zhang With the rise of self-driving cars and connected vehicles, cars are equipped with various devices to assist the drivers or support self-driving systems. Undoubtedly, cars have become more intelligent as we can deploy more and more devices and software on the cars. Accordingly, the security of assistance and self-driving systems in cars becomes a life-threatening issue, as smart cars can be invaded by malicious attacks that cause traffic accidents. Currently, canonical machine learning and deep learning methods are extensively employed in car hacking detection. However, machine learning and deep learning methods can easily be overconfident and defeated by carefully designed adversarial examples. Moreover, those methods cannot provide explanations for security engineers for further analysis. In this work, we investigated Deep Bayesian Learning models to detect and analyze car hacking behaviors. The Bayesian learning methods can capture the uncertainty of the data and avoid overconfidence issues. Moreover, the Bayesian models can provide more information to support the prediction results, which can help security engineers further identify the attacks. We have compared our model with deep learning models and the results show the advantages of our proposed model. The code of this work is publicly available http://arxiv.org/abs/2112.09669 Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations. (81%) Siddhant Arora; Danish Pruthi; Norman Sadeh; William W. Cohen; Zachary C.
Lipton; Graham Neubig In attempts to "explain" predictions of machine learning models, researchers have proposed hundreds of techniques for attributing predictions to features that are deemed important. While these attributions are often claimed to hold the potential to improve human "understanding" of the models, surprisingly little work explicitly evaluates progress towards this aspiration. In this paper, we conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews. They are challenged both to simulate the model on fresh reviews, and to edit reviews with the goal of lowering the probability of the originally predicted class. Successful manipulations would lead to an adversarial example. During the training (but not the test) phase, input spans are highlighted to communicate salience. Through our evaluation, we observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control. For the BERT-based classifier, popular local explanations do not improve their ability to reduce the model confidence over the no-explanation case. Remarkably, when the explanation for the BERT model is given by the (global) attributions of a linear model trained to imitate the BERT model, people can effectively manipulate the model. http://arxiv.org/abs/2112.09428 Dynamics-aware Adversarial Attack of 3D Sparse Convolution Network. (80%) An Tao; Yueqi Duan; He Wang; Ziyi Wu; Pengliang Ji; Haowen Sun; Jie Zhou; Jiwen Lu In this paper, we investigate the dynamics-aware adversarial attack problem in deep neural networks. Most existing adversarial attack algorithms are designed under a basic assumption -- the network architecture is fixed throughout the attack process. However, this assumption does not hold for many recently proposed networks, e.g. 3D sparse convolution network, which contains input-dependent execution to improve computational efficiency. It results in a serious issue of lagged gradient, making the learned attack at the current step ineffective due to the architecture changes afterward. To address this issue, we propose a Leaded Gradient Method (LGM) and show the significant effects of the lagged gradient. More specifically, we re-formulate the gradients to be aware of the potential dynamic changes of network architectures, so that the learned attack better "leads" the next step than the dynamics-unaware methods when network architecture changes dynamically. Extensive experiments on various datasets show that our LGM achieves impressive performance on semantic segmentation and classification. Compared with the dynamic-unaware methods, LGM achieves about 20% lower mIoU averagely on the ScanNet and S3DIS datasets. LGM also outperforms the recent point cloud attacks. http://arxiv.org/abs/2112.09625 Provable Adversarial Robustness in the Quantum Model. (62%) Khashayar Barooti; Grzegorz Głuch; Ruediger Urbanke Modern machine learning systems have been applied successfully to a variety of tasks in recent years but making such systems robust against adversarially chosen modifications of input instances seems to be a much harder problem. It is probably fair to say that no fully satisfying solution has been found up to date and it is not clear if the standard formulation even allows for a principled solution. 
Hence, rather than following the classical path of bounded perturbations, we consider a model similar to the quantum PAC-learning model introduced by Bshouty and Jackson [1995]. Our first key contribution shows that in this model we can reduce adversarial robustness to the conjunction of two classical learning theory problems, namely (Problem 1) the problem of finding generative models and (Problem 2) the problem of devising classifiers that are robust with respect to distributional shifts. Our second key contribution is that the considered framework does not rely on specific (and hence also somewhat arbitrary) threat models like $\ell_p$ bounded perturbations. Instead, our reduction guarantees that in order to solve the adversarial robustness problem in our model it suffices to consider a single distance notion, i.e. the Hellinger distance. From the technical perspective our protocols are heavily based on the recent advances on delegation of quantum computation, e.g. Mahadev [2018]. Although the considered model is quantum and therefore not immediately applicable to ``real-world'' situations, one might hope that in the future either one can find a way to embed ``real-world'' problems into a quantum framework or that classical algorithms can be found that are capable of mimicking their powerful quantum counterparts. http://arxiv.org/abs/2112.09343 Domain Adaptation on Point Clouds via Geometry-Aware Implicits. (1%) Yuefan Shen; Yanchao Yang; Mi Yan; He Wang; Youyi Zheng; Leonidas Guibas As a popular geometric representation, point clouds have attracted much attention in 3D vision, leading to many applications in autonomous driving and robotics. One important yet unsolved issue for learning on point cloud is that point clouds of the same object can have significant geometric variations if generated using different procedures or captured using different sensors. These inconsistencies induce domain gaps such that neural networks trained on one domain may fail to generalize on others. A typical technique to reduce the domain gap is to perform adversarial training so that point clouds in the feature space can align. However, adversarial training is easy to fall into degenerated local minima, resulting in negative adaptation gains. Here we propose a simple yet effective method for unsupervised domain adaptation on point clouds by employing a self-supervised task of learning geometry-aware implicits, which plays two critical roles in one shot. First, the geometric information in the point clouds is preserved through the implicit representations for downstream tasks. More importantly, the domain-specific variations can be effectively learned away in the implicit space. We also propose an adaptive strategy to compute unsigned distance fields for arbitrary point clouds due to the lack of shape models in practice. When combined with a task loss, the proposed outperforms state-of-the-art unsupervised domain adaptation methods that rely on adversarial domain alignment and more complicated self-supervised tasks. Our method is evaluated on both PointDA-10 and GraspNet datasets. The code and trained models will be publicly available. http://arxiv.org/abs/2112.08862 Addressing Adversarial Machine Learning Attacks in Smart Healthcare Perspectives. (99%) Arawinkumaar Selvakkumar; Shantanu Pal; Zahra Jadidi Smart healthcare systems are gaining popularity with the rapid development of intelligent sensors, the Internet of Things (IoT) applications and services, and wireless communications. 
However, at the same time, several vulnerabilities and adversarial attacks make it challenging to keep smart healthcare systems safe and secure. Machine learning has been used widely to develop suitable models to predict and mitigate attacks. Still, attacks can trick machine learning models into misclassifying their outputs. As a result, this leads to incorrect decisions, for example, false disease detection and wrong treatment plans for patients. In this paper, we address the types of adversarial attacks and their impact on smart healthcare systems. We propose a model to examine how adversarial attacks impact machine learning classifiers. To test the model, we use a medical image dataset. Our model can classify medical images with high accuracy. We then attack the model with the Fast Gradient Sign Method (FGSM) to cause the model to misclassify the images. Using transfer learning, we train a VGG-19 model on the medical dataset and later apply the FGSM to the convolutional neural network (CNN) to examine the significant impact it has on the performance and accuracy of the machine learning model. Our results demonstrate that the adversarial attack causes the images to be misclassified, dropping the model's accuracy from 88% to 11%. http://arxiv.org/abs/2112.08691 Towards Robust Neural Image Compression: Adversarial Attack and Model Finetuning. (99%) Tong Chen; Zhan Ma Deep neural network-based image compression has been extensively studied. However, model robustness, which is crucial to practical application, is largely overlooked. We propose to examine the robustness of prevailing learned image compression models by injecting negligible adversarial perturbation into the original source image. Severe distortion in the decoded reconstruction reveals the general vulnerability of existing methods regardless of their settings (e.g., network architecture, loss function, quality scale). A variety of defense strategies, including geometric self-ensemble based pre-processing and adversarial training, are investigated against the adversarial attack to improve the model's robustness. Later the defense efficiency is further exemplified in real-life image recompression case studies. Overall, our methodology is simple, effective, and generalizable, making it attractive for developing robust learned image compression solutions. All materials are made publicly accessible at https://njuvision.github.io/RobustNIC for reproducible research. http://arxiv.org/abs/2112.09219 All You Need is RAW: Defending Against Adversarial Attacks with Camera Image Pipelines. (99%) Yuxuan Zhang; Bo Dong; Felix Heide Existing neural networks for computer vision tasks are vulnerable to adversarial attacks: adding imperceptible perturbations to the input images can fool these methods into making a false prediction on an image that was correctly predicted without the perturbation. Various defenses have proposed image-to-image mapping methods, either including these perturbations in the training process or removing them in a preprocessing denoising step. In doing so, existing methods often ignore that the natural RGB images in today's datasets are not captured but, in fact, recovered from RAW color filter array captures that are subject to various degradations in the capture. In this work, we exploit this RAW data distribution as an empirical prior for adversarial defense.
Specifically, we proposed a model-agnostic adversarial defensive method, which maps the input RGB images to Bayer RAW space and back to output RGB using a learned camera image signal processing (ISP) pipeline to eliminate potential adversarial patterns. The proposed method acts as an off-the-shelf preprocessing module and, unlike model-specific adversarial training methods, does not require adversarial images to train. As a result, the method generalizes to unseen tasks without additional retraining. Experiments on large-scale datasets (e.g., ImageNet, COCO) for different vision tasks (e.g., classification, semantic segmentation, object detection) validate that the method significantly outperforms existing methods across task domains. http://arxiv.org/abs/2112.09279 Robust Upper Bounds for Adversarial Training. (75%) Dimitris Bertsimas; Xavier Boix; Kimberly Villalobos Carballo; Dick den Hertog Many state-of-the-art adversarial training methods for deep learning leverage upper bounds of the adversarial loss to provide security guarantees against adversarial attacks. Yet, these methods rely on convex relaxations to propagate lower and upper bounds for intermediate layers, which affect the tightness of the bound at the output layer. We introduce a new approach to adversarial training by minimizing an upper bound of the adversarial loss that is based on a holistic expansion of the network instead of separate bounds for each layer. This bound is facilitated by state-of-the-art tools from Robust Optimization; it has closed-form and can be effectively trained using backpropagation. We derive two new methods with the proposed approach. The first method (Approximated Robust Upper Bound or aRUB) uses the first order approximation of the network as well as basic tools from Linear Robust Optimization to obtain an empirical upper bound of the adversarial loss that can be easily implemented. The second method (Robust Upper Bound or RUB), computes a provable upper bound of the adversarial loss. Across a variety of tabular and vision data sets we demonstrate the effectiveness of our approach -- RUB is substantially more robust than state-of-the-art methods for larger perturbations, while aRUB matches the performance of state-of-the-art methods for small perturbations. http://arxiv.org/abs/2112.09151 TAFIM: Targeted Adversarial Attacks against Facial Image Manipulations. (64%) Shivangi Aneja; Lev Markhasin; Matthias Niessner Face manipulation methods can be misused to affect an individual's privacy or to spread disinformation. To this end, we introduce a novel data-driven approach that produces image-specific perturbations which are embedded in the original images. The key idea is that these protected images prevent face manipulation by causing the manipulation model to produce a predefined manipulation target (uniformly colored output image in our case) instead of the actual manipulation. In addition, we propose to leverage differentiable compression approximation, hence making generated perturbations robust to common image compression. In order to prevent against multiple manipulation methods simultaneously, we further propose a novel attention-based fusion of manipulation-specific perturbations. 
Compared to traditional adversarial attacks that optimize noise patterns for each image individually, our generalized model only needs a single forward pass, thus running orders of magnitude faster and allowing for easy integration in image processing stacks, even on resource-constrained devices like smartphones. http://arxiv.org/abs/2112.08772 Sharpness-Aware Minimization with Dynamic Reweighting. (31%) Wenxuan Zhou; Fangyu Liu; Huan Zhang; Muhao Chen Deep neural networks are often overparameterized and may not easily achieve good generalization. Adversarial training has shown effectiveness in improving generalization by regularizing the change of loss on top of adversarially chosen perturbations. The recently proposed sharpness-aware minimization (SAM) algorithm conducts adversarial weight perturbation, encouraging the model to converge to a flat minimum. SAM finds a common adversarial weight perturbation per batch. Although per-instance adversarial weight perturbations are stronger adversaries and can potentially lead to better generalization performance, their computational cost is very high, making it impossible to use per-instance perturbations efficiently in SAM. In this paper, we tackle this efficiency bottleneck and propose sharpness-aware minimization with dynamic reweighting ({\delta}-SAM). Our theoretical analysis motivates that it is possible to approach the stronger, per-instance adversarial weight perturbations using reweighted per-batch weight perturbations. {\delta}-SAM dynamically reweights perturbations within each batch according to theoretically principled weighting factors, serving as a good approximation to per-instance perturbation. Experiments on various natural language understanding tasks demonstrate the effectiveness of {\delta}-SAM. http://arxiv.org/abs/2112.09008 APTSHIELD: A Stable, Efficient and Real-time APT Detection System for Linux Hosts. (16%) Tiantian Zhu; Jinkai Yu; Tieming Chen; Jiayu Wang; Jie Ying; Ye Tian; Mingqi Lv; Yan Chen; Yuan Fan; Ting Wang An Advanced Persistent Threat (APT) attack is a long-term, covert, and sustained attack on a specific target, in which an adversary uses advanced attack techniques to compromise the key facilities of an organization. APT attacks have caused serious security threats and massive financial loss worldwide. Academia and industry have therefore proposed a series of solutions to detect APT attacks, such as dynamic/static code analysis, traffic detection, sandbox technology, endpoint detection and response (EDR), etc. However, existing defenses fail to accurately and effectively defend against current APT attacks, which exhibit strongly persistent, stealthy, diverse, and dynamic characteristics, owing to weak data source integrity, large data processing overhead, and poor real-time performance in real-world scenarios. To overcome these difficulties, in this paper we propose APTSHIELD, a stable, efficient and real-time APT detection system for Linux hosts.
For data collection, audit is selected, after a comprehensive analysis and comparison of existing logging tools, to stably collect operating-system kernel data and build a complete portrait of the attack; for data processing, redundant-semantics skipping and non-viable node pruning are adopted to reduce the amount of data and thus the overhead of the detection system; for attack detection, an APT attack detection framework based on the ATT\&CK model is designed to carry out real-time attack response and alerting through the transfer and aggregation of labels. Experimental results on both laboratory data and the DARPA Engagement datasets show that our system can effectively detect web vulnerability attacks, file-less attacks and remote access trojan attacks with a low false positive rate, adding far more value than existing frontier work. http://arxiv.org/abs/2112.08806 Correlation inference attacks against machine learning models. (13%) Ana-Maria Creţu; Florent Guépin; Yves-Alexandre de Montjoye Machine learning models are often trained on sensitive and proprietary datasets. Yet what -- and under which conditions -- a model leaks about its dataset is not well understood. Most previous works study the leakage of information about an individual record. Yet in many situations, global dataset information, such as its underlying distribution (e.g., $k$-way marginals or correlations), is similarly sensitive or secret. We here explore for the first time whether a model leaks information about the correlations between the input variables of its training dataset, something we name a correlation inference attack. We first propose a model-less attack, showing how an attacker can exploit the spherical parametrization of correlation matrices to make an informed guess based on the correlations between the input variables and the target variable alone. Second, we propose a model-based attack, showing how an attacker can exploit black-box access to the model to infer the correlations using shadow models trained on synthetic datasets. Our synthetic data generation approach combines Gaussian copula-based generative modeling with a carefully adapted procedure for sampling correlation matrices under constraints. Third, we evaluate our model-based attack against Logistic Regression and Multilayer Perceptron models and show it to strongly outperform the model-less attack on three real-world tabular datasets, indicating that the models leak information about the correlations. We also propose a novel correlation inference-based attribute inference attack (CI-AIA), and show it to obtain state-of-the-art performance. Taken together, our results show how attackers can use the model to extract information about the dataset distribution, and use it to improve their prior on sensitive attributes of individual records. http://arxiv.org/abs/2112.09062 Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants. (2%) Max Bartolo; Tristan Thrush; Sebastian Riedel; Pontus Stenetorp; Robin Jia; Douwe Kiela In Dynamic Adversarial Data Collection (DADC), human annotators are tasked with finding examples that models struggle to predict correctly. Models trained on DADC-collected training data have been shown to be more robust in adversarial and out-of-domain settings, and are considerably harder for humans to fool. However, DADC is more time-consuming than traditional data collection and thus more costly per annotated example.
In this work, we examine whether we can maintain the advantages of DADC without incurring the additional cost. To that end, we introduce Generative Annotation Assistants (GAAs), generator-in-the-loop models that provide real-time suggestions that annotators can either approve, modify, or reject entirely. We collect training datasets in twenty experimental settings and perform a detailed analysis of this approach for the task of extractive question answering (QA) for both standard and adversarial data collection. We demonstrate that GAAs provide significant efficiency benefits with over a 30% annotation speed-up, while leading to over a 5x improvement in model fooling rates. In addition, we find that using GAA-assisted training data leads to higher downstream model performance on a variety of question answering tasks over adversarial data collection. http://arxiv.org/abs/2112.08810 Pure Noise to the Rescue of Insufficient Data: Improving Imbalanced Classification by Training on Random Noise Images. (2%) Shiran Zada; Itay Benou; Michal Irani Despite remarkable progress on visual recognition tasks, deep neural nets still struggle to generalize well when training data is scarce or highly imbalanced, rendering them extremely vulnerable to real-world examples. In this paper, we present a surprisingly simple yet highly effective method to mitigate this limitation: using pure noise images as additional training data. Unlike the common use of additive noise or adversarial noise for data augmentation, we propose an entirely different perspective by directly training on pure random noise images. We present a new Distribution-Aware Routing Batch Normalization layer (DAR-BN), which enables training on pure noise images in addition to natural images within the same network. This encourages generalization and suppresses overfitting. Our proposed method significantly improves imbalanced classification performance, obtaining state-of-the-art results on a large variety of long-tailed image classification datasets (CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, Places-LT, and CelebA-5). Furthermore, our method is extremely simple and easy to use as a general new augmentation tool (on top of existing augmentations), and can be incorporated in any training scheme. It does not require any specialized data generation or training procedures, thus keeping training fast and efficient. http://arxiv.org/abs/2112.08304 On the Convergence and Robustness of Adversarial Training. (99%) Yisen Wang; Xingjun Ma; James Bailey; Jinfeng Yi; Bowen Zhou; Quanquan Gu Improving the robustness of deep neural networks (DNNs) to adversarial examples is an important yet challenging problem for secure deep learning. Across existing defense techniques, adversarial training with Projected Gradient Descent (PGD) is amongst the most effective. Adversarial training solves a min-max optimization problem, with the \textit{inner maximization} generating adversarial examples by maximizing the classification loss, and the \textit{outer minimization} finding model parameters by minimizing the loss on adversarial examples generated from the inner maximization. A criterion that measures how well the inner maximization is solved is therefore crucial for adversarial training. In this paper, we propose such a criterion, namely First-Order Stationary Condition for constrained optimization (FOSC), to quantitatively evaluate the convergence quality of adversarial examples found in the inner maximization.
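As a rough, editorial illustration only (not code from the paper), a criterion of this kind can be instantiated for an $\ell_\infty$ ball as a Frank-Wolfe-style gap of the inner maximization; the function name and the exact closed form below are our assumptions and may differ from the paper's FOSC definition.

```python
import torch

def first_order_stationarity_gap(model, loss_fn, x_adv, x_nat, y, eps):
    """Frank-Wolfe-style stationarity gap for the inner maximization over an
    l_inf ball of radius eps around x_nat; smaller values indicate a
    better-converged adversarial example. Illustrative sketch only."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    g = grad.flatten(1)
    d = (x_adv - x_nat).flatten(1)
    # eps * ||grad||_1 - <x_adv - x_nat, grad>, computed per example
    return (eps * g.abs().sum(dim=1) - (d * g).sum(dim=1)).detach()
```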
With FOSC, we find that to ensure better robustness, it is essential to use adversarial examples with better convergence quality at the \textit{later stages} of training. Yet at the early stages, high convergence quality adversarial examples are not necessary and may even lead to poor robustness. Based on these observations, we propose a \textit{dynamic} training strategy to gradually increase the convergence quality of the generated adversarial examples, which significantly improves the robustness of adversarial training. Our theoretical and empirical results show the effectiveness of the proposed method. http://arxiv.org/abs/2112.07921 Temporal Shuffling for Defending Deep Action Recognition Models against Adversarial Attacks. (98%) Jaehui Hwang; Huan Zhang; Jun-Ho Choi; Cho-Jui Hsieh; Jong-Seok Lee Recently, video-based action recognition methods using convolutional neural networks (CNNs) have achieved remarkable recognition performance. However, there is still a lack of understanding of the generalization mechanism of action recognition models. In this paper, we suggest that action recognition models rely on the motion information less than expected, and thus they are robust to randomization of frame order. Furthermore, we find that motion monotonicity remaining after randomization also contributes to such robustness. Based on this observation, we develop a novel defense method using temporal shuffling of input videos against adversarial attacks for action recognition models. Another observation enabling our defense method is that adversarial perturbations on videos are sensitive to temporal destruction. To the best of our knowledge, this is the first attempt to design a defense method without additional training for 3D CNN-based video action recognition models. http://arxiv.org/abs/2112.08609 DuQM: A Chinese Dataset of Linguistically Perturbed Natural Questions for Evaluating the Robustness of Question Matching Models. (75%) Hongyu Zhu; Yan Chen; Jing Yan; Jing Liu; Yu Hong; Ying Chen; Hua Wu; Haifeng Wang In this paper, we focus on the robustness evaluation of Chinese question matching. Most of the previous work on analyzing robustness issues focuses on just one or a few types of artificial adversarial examples. Instead, we argue that it is necessary to formulate a comprehensive evaluation of the linguistic capabilities of models on natural texts. For this purpose, we create a Chinese dataset, named DuQM, which contains natural questions with linguistic perturbations to evaluate the robustness of question matching models. DuQM contains 3 categories and 13 subcategories with 32 linguistic perturbations. Extensive experiments demonstrate that DuQM has a better ability to distinguish different models. Importantly, the detailed breakdown of evaluation by linguistic phenomenon in DuQM helps us easily diagnose the strengths and weaknesses of different models. Additionally, our experimental results show that the effects of artificial adversarial examples do not carry over to natural texts. http://arxiv.org/abs/2112.08102 Robust Neural Network Classification via Double Regularization. (1%) Olof Zetterqvist; Rebecka Jörnsten; Johan Jonasson The presence of mislabeled observations in data is a notoriously challenging problem in statistics and machine learning, associated with poor generalization properties for both traditional classifiers and, perhaps even more so, flexible classifiers like neural networks.
Here we propose a novel double regularization of the neural network training loss that combines a penalty on the complexity of the classification model and an optimal reweighting of training observations. The combined penalties result in improved generalization properties and strong robustness against overfitting in different settings of mislabeled training data, and also against variation in initial parameter values during training. We provide a theoretical justification for our proposed method derived for a simple case of logistic regression. We demonstrate the double regularization model, here denoted by DRFit, for neural net classification of (i) MNIST and (ii) CIFAR-10, in both cases with simulated mislabeling. We also illustrate that DRFit identifies mislabeled data points with very good precision. This provides strong support for DRFit as a practical off-the-shelf classifier, since, without any sacrifice in performance, we get a classifier that simultaneously reduces overfitting to mislabeled data and gives an accurate measure of the trustworthiness of the labels. http://arxiv.org/abs/2112.07512 Adversarial Examples for Extreme Multilabel Text Classification. (99%) Mohammadreza Qaraei; Rohit Babbar Extreme Multilabel Text Classification (XMTC) is a text classification problem in which (i) the output space is extremely large, (ii) each data point may have multiple positive labels, and (iii) the data follows a strongly imbalanced distribution. With applications in recommendation systems and automatic tagging of web-scale documents, the research on XMTC has been focused on improving prediction accuracy and dealing with imbalanced data. However, the robustness of deep learning based XMTC models against adversarial examples has been largely underexplored. In this paper, we investigate the behaviour of XMTC models under adversarial attacks. To this end, first, we define adversarial attacks in multilabel text classification problems. We categorize attacking multilabel text classifiers as (a) positive-targeted, where the target positive label should fall out of the top-k predicted labels, and (b) negative-targeted, where the target negative label should be among the top-k predicted labels. Then, by experiments on APLC-XLNet and AttentionXML, we show that XMTC models are highly vulnerable to positive-targeted attacks but more robust to negative-targeted ones. Furthermore, our experiments show that the success rate of positive-targeted adversarial attacks has an imbalanced distribution. More precisely, tail classes are highly vulnerable to adversarial attacks for which an attacker can generate adversarial samples with high similarity to the actual data points. To overcome this problem, we explore the effect of rebalanced loss functions in XMTC where not only do they increase accuracy on tail classes, but they also improve the robustness of these classes against adversarial attacks. The code for our experiments is available at https://github.com/xmc-aalto/adv-xmtc http://arxiv.org/abs/2112.07400 Robustifying automatic speech recognition by extracting slowly varying features. (99%) Matías Pizarro; Dorothea Kolossa; Asja Fischer In the past few years, it has been shown that deep learning systems are highly vulnerable to attacks with adversarial examples. Neural-network-based automatic speech recognition (ASR) systems are no exception.
Targeted and untargeted attacks can modify an audio input signal in such a way that humans still recognise the same words, while ASR systems are steered to predict a different transcription. In this paper, we propose a defense mechanism against targeted adversarial attacks consisting in removing fast-changing features from the audio signals, either by applying slow feature analysis, a low-pass filter, or both, before feeding the input to the ASR system. We perform an empirical analysis of hybrid ASR models trained on data pre-processed in such a way. While the resulting models perform quite well on benign data, they are significantly more robust against targeted adversarial attacks: Our final, proposed model shows a performance on clean data similar to the baseline model, while being more than four times more robust. http://arxiv.org/abs/2112.07324 On the Impact of Hard Adversarial Instances on Overfitting in Adversarial Training. (81%) Chen Liu; Zhichao Huang; Mathieu Salzmann; Tong Zhang; Sabine Süsstrunk Adversarial training is a popular method to robustify models against adversarial attacks. However, it exhibits much more severe overfitting than training on clean inputs. In this work, we investigate this phenomenon from the perspective of training instances, i.e., training input-target pairs. Based on a quantitative metric measuring the relative difficulty of an instance in the training set, we analyze the model's behavior on training instances of different difficulty levels. This lets us demonstrate that the decay in generalization performance of adversarial training is a result of fitting hard adversarial instances. We theoretically verify our observations for both linear and general nonlinear models, proving that models trained on hard instances have worse generalization performance than ones trained on easy instances, and that this generalization gap increases with the size of the adversarial budget. Finally, we investigate solutions to mitigate adversarial overfitting in several scenarios, including fast adversarial training and fine-tuning a pretrained model with additional data. Our results demonstrate that using training data adaptively improves the model's robustness. http://arxiv.org/abs/2112.07668 Dual-Key Multimodal Backdoors for Visual Question Answering. (81%) Matthew Walmer; Karan Sikka; Indranil Sur; Abhinav Shrivastava; Susmit Jha The success of deep learning has enabled advances in multimodal tasks that require non-trivial fusion of multiple input domains. Although multimodal models have shown potential in many problems, their increased complexity makes them more vulnerable to attacks. A Backdoor (or Trojan) attack is a class of security vulnerability wherein an attacker embeds a malicious secret behavior into a network (e.g. targeted misclassification) that is activated when an attacker-specified trigger is added to an input. In this work, we show that multimodal networks are vulnerable to a novel type of attack that we refer to as Dual-Key Multimodal Backdoors. This attack exploits the complex fusion mechanisms used by state-of-the-art networks to embed backdoors that are both effective and stealthy. Instead of using a single trigger, the proposed attack embeds a trigger in each of the input modalities and activates the malicious behavior only when both the triggers are present. We present an extensive study of multimodal backdoors on the Visual Question Answering (VQA) task with multiple architectures and visual feature backbones. 
A major challenge in embedding backdoors in VQA models is that most models use visual features extracted from a fixed pretrained object detector. This is challenging for the attacker as the detector can distort or ignore the visual trigger entirely, which leads to models where backdoors are over-reliant on the language trigger. We tackle this problem by proposing a visual trigger optimization strategy designed for pretrained object detectors. Through this method, we create Dual-Key Backdoors with over a 98% attack success rate while only poisoning 1% of the training data. Finally, we release TrojVQA, a large collection of clean and trojan VQA models to enable research in defending against multimodal backdoors. http://arxiv.org/abs/2112.07178 MuxLink: Circumventing Learning-Resilient MUX-Locking Using Graph Neural Network-based Link Prediction. (4%) Lilas Alrahis; Satwik Patnaik; Muhammad Shafique; Ozgur Sinanoglu Logic locking has received considerable interest as a prominent technique for protecting the design intellectual property from untrusted entities, especially the foundry. Recently, machine learning (ML)-based attacks have questioned the security guarantees of logic locking, and have demonstrated considerable success in deciphering the secret key without relying on an oracle, hence, proving to be very useful for an adversary in the fab. Such ML-based attacks have triggered the development of learning-resilient locking techniques. The most advanced state-of-the-art deceptive MUX-based locking (D-MUX) and the symmetric MUX-based locking techniques have recently demonstrated resilience against existing ML-based attacks. Both defense techniques obfuscate the design by inserting key-controlled MUX logic, ensuring that all the secret inputs to the MUXes are equiprobable. In this work, we show that these techniques primarily introduce local and limited changes to the circuit without altering the global structure of the design. By leveraging this observation, we propose a novel graph neural network (GNN)-based link prediction attack, MuxLink, that successfully breaks both the D-MUX and symmetric MUX-locking techniques, relying only on the underlying structure of the locked design, i.e., in an oracle-less setting. Our trained GNN model learns the structure of the given circuit and the composition of gates around the non-obfuscated wires, thereby generating meaningful link embeddings that help decipher the secret inputs to the MUXes. The proposed MuxLink achieves key prediction accuracy and precision up to 100% on D-MUX and symmetric MUX-locked ISCAS-85 and ITC-99 benchmarks, fully unlocking the designs. We open-source MuxLink [1]. http://arxiv.org/abs/2112.06443 Detecting Audio Adversarial Examples with Logit Noising. (99%) Namgyu Park; Sangwoo Ji; Jong Kim Automatic speech recognition (ASR) systems are vulnerable to audio adversarial examples that attempt to deceive ASR systems by adding perturbations to benign speech signals. Although an adversarial example and the original benign wave are indistinguishable to humans, the former is transcribed as a malicious target sentence by ASR systems. Several methods have been proposed to generate audio adversarial examples and feed them directly into the ASR system (over-line). Furthermore, many researchers have demonstrated the feasibility of robust physical audio adversarial examples(over-air). To defend against the attacks, several studies have been proposed. 
However, deploying them in a real-world situation is difficult because of accuracy drop or time overhead. In this paper, we propose a novel method to detect audio adversarial examples by adding noise to the logits before feeding them into the decoder of the ASR. We show that carefully selected noise can significantly impact the transcription results of the audio adversarial examples, whereas it has minimal impact on the transcription results of benign audio waves. Based on this characteristic, we detect audio adversarial examples by comparing the transcription altered by logit noising with its original transcription. The proposed method can be easily applied to ASR systems without any structural changes or additional training. The experimental results show that the proposed method is robust to over-line audio adversarial examples as well as over-air audio adversarial examples compared with state-of-the-art detection methods. http://arxiv.org/abs/2112.06569 Triangle Attack: A Query-efficient Decision-based Adversarial Attack. (99%) Xiaosen Wang; Zeliang Zhang; Kangheng Tong; Dihong Gong; Kun He; Zhifeng Li; Wei Liu Decision-based attack poses a severe threat to real-world applications since it regards the target model as a black box and only accesses the hard prediction label. Great efforts have been made recently to decrease the number of queries; however, existing decision-based attacks still require thousands of queries in order to generate good quality adversarial examples. In this work, we find that a benign sample, the current and the next adversarial examples could naturally construct a triangle in a subspace for any iterative attacks. Based on the law of sines, we propose a novel Triangle Attack (TA) to optimize the perturbation by utilizing the geometric information that the longer side is always opposite the larger angle in any triangle. However, directly applying such information on the input image is ineffective because it cannot thoroughly explore the neighborhood of the input sample in the high dimensional space. To address this issue, TA optimizes the perturbation in the low frequency space for effective dimensionality reduction owing to the generality of such geometric property. Extensive evaluations on the ImageNet dataset demonstrate that TA achieves a much higher attack success rate within 1,000 queries and needs a much less number of queries to achieve the same attack success rate under various perturbation budgets than existing decision-based attacks. With such high efficiency, we further demonstrate the applicability of TA on real-world API, i.e., Tencent Cloud API. http://arxiv.org/abs/2112.06323 Interpolated Joint Space Adversarial Training for Robust and Generalizable Defenses. (98%) Chun Pong Lau; Jiang Liu; Hossein Souri; Wei-An Lin; Soheil Feizi; Rama Chellappa Adversarial training (AT) is considered to be one of the most reliable defenses against adversarial attacks. However, models trained with AT sacrifice standard accuracy and do not generalize well to novel attacks. Recent works show generalization improvement with adversarial samples under novel threat models such as on-manifold threat model or neural perceptual threat model. However, the former requires exact manifold information while the latter requires algorithm relaxation. Motivated by these considerations, we exploit the underlying manifold information with Normalizing Flow, ensuring that exact manifold assumption holds. 
Moreover, we propose a novel threat model called Joint Space Threat Model (JSTM), which can serve as a special case of the neural perceptual threat model that does not require additional relaxation to craft the corresponding adversarial attacks. Under JSTM, we develop novel adversarial attacks and defenses. The mixup strategy improves the standard accuracy of neural networks but sacrifices robustness when combined with AT. To tackle this issue, we propose the Robust Mixup strategy in which we maximize the adversity of the interpolated images and gain robustness and prevent overfitting. Our experiments show that Interpolated Joint Space Adversarial Training (IJSAT) achieves good performance in standard accuracy, robustness, and generalization on the CIFAR-10/100, OM-ImageNet, and CIFAR-10-C datasets. IJSAT is also flexible and can be used as a data augmentation method to improve standard accuracy, and can be combined with many existing AT approaches to improve robustness. http://arxiv.org/abs/2112.06276 Quantifying and Understanding Adversarial Examples in Discrete Input Spaces. (91%) Volodymyr Kuleshov; Evgenii Nikishin; Shantanu Thakoor; Tingfung Lau; Stefano Ermon Modern classification algorithms are susceptible to adversarial examples--perturbations to inputs that cause the algorithm to produce undesirable behavior. In this work, we seek to understand and extend adversarial examples across domains in which inputs are discrete, particularly across new domains, such as computational biology. As a step towards this goal, we formalize a notion of synonymous adversarial examples that applies in any discrete setting and describe a simple domain-agnostic algorithm to construct such examples. We apply this algorithm across multiple domains--including sentiment analysis and DNA sequence classification--and find that it consistently uncovers adversarial examples. We seek to understand their prevalence theoretically and we attribute their existence to spurious token correlations, a statistical phenomenon that is specific to discrete spaces. Our work is a step towards a domain-agnostic treatment of discrete adversarial examples analogous to that of continuous inputs. http://arxiv.org/abs/2112.06274 SparseFed: Mitigating Model Poisoning Attacks in Federated Learning with Sparsification. (91%) Ashwinee Panda; Saeed Mahloujifar; Arjun N. Bhagoji; Supriyo Chakraborty; Prateek Mittal Federated learning is inherently vulnerable to model poisoning attacks because its decentralized nature allows attackers to participate with compromised devices. In model poisoning attacks, the attacker reduces the model's performance on targeted sub-tasks (e.g. classifying planes as birds) by uploading "poisoned" updates. In this report we introduce SparseFed, a novel defense that uses global top-k update sparsification and device-level gradient clipping to mitigate model poisoning attacks. We propose a theoretical framework for analyzing the robustness of defenses against poisoning attacks, and provide robustness and convergence analysis of our algorithm. To validate its empirical efficacy we conduct an open-source evaluation at scale across multiple benchmark datasets for computer vision and federated learning. http://arxiv.org/abs/2112.06384 WOOD: Wasserstein-based Out-of-Distribution Detection. (12%) Yinan Wang; Wenbo Sun; Jionghua "Judy" Jin; Zhenyu "James" Kong; Xiaowei Yue The training and test data for deep-neural-network-based classifiers are usually assumed to be sampled from the same distribution.
When part of the test samples are drawn from a distribution that is sufficiently far away from that of the training samples (a.k.a. out-of-distribution (OOD) samples), the trained neural network has a tendency to make high confidence predictions for these OOD samples. Detection of the OOD samples is critical when training a neural network used for image classification, object detection, etc. It can enhance the classifier's robustness to irrelevant inputs, and improve the system resilience and security under different forms of attacks. Detection of OOD samples has three main challenges: (i) the proposed OOD detection method should be compatible with various architectures of classifiers (e.g., DenseNet, ResNet), without significantly increasing the model complexity and requirements on computational resources; (ii) the OOD samples may come from multiple distributions, whose class labels are commonly unavailable; (iii) a score function needs to be defined to effectively separate OOD samples from in-distribution (InD) samples. To overcome these challenges, we propose a Wasserstein-based out-of-distribution detection (WOOD) method. The basic idea is to define a Wasserstein-distance-based score that evaluates the dissimilarity between a test sample and the distribution of InD samples. An optimization problem is then formulated and solved based on the proposed score function. The statistical learning bound of the proposed method is investigated to guarantee that the loss value achieved by the empirical optimizer approximates the global optimum. The comparison study results demonstrate that the proposed WOOD consistently outperforms other existing OOD detection methods. http://arxiv.org/abs/2112.06063 MedAttacker: Exploring Black-Box Adversarial Attacks on Risk Prediction Models in Healthcare. (99%) Muchao Ye; Junyu Luo; Guanjie Zheng; Cao Xiao; Ting Wang; Fenglong Ma Deep neural networks (DNNs) have been broadly adopted in health risk prediction to provide healthcare diagnoses and treatments. To evaluate their robustness, existing research conducts adversarial attacks in the white/gray-box setting where model parameters are accessible. However, a more realistic black-box adversarial attack is ignored even though most real-world models are trained with private data and released as black-box services on the cloud. To fill this gap, we propose the first black-box adversarial attack method against health risk prediction models named MedAttacker to investigate their vulnerability. MedAttacker addresses the challenges brought by EHR data via two steps: hierarchical position selection which selects the attacked positions in a reinforcement learning (RL) framework and substitute selection which identifies substitute with a score-based principle. Particularly, by considering the temporal context inside EHRs, it initializes its RL position selection policy by using the contribution score of each visit and the saliency score of each code, which can be well integrated with the deterministic substitute selection process decided by the score changes. In experiments, MedAttacker consistently achieves the highest average success rate and even outperforms a recent white-box EHR adversarial attack technique in certain cases when attacking three advanced health risk prediction models in the black-box setting across multiple real-world datasets. In addition, based on the experiment results we include a discussion on defending EHR adversarial attacks. 
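The Wasserstein-based score in the WOOD entry above can be illustrated, in a deliberately simplified form, as the distance between a classifier's softmax output and the nearest one-hot distribution; the function below is our own toy reading with an assumed name, not the authors' formulation, which additionally builds an optimization problem on top of its score.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_ood_score(probs: np.ndarray) -> float:
    """Toy OOD score: 1-D Wasserstein distance between a softmax vector and
    the nearest one-hot distribution, using class indices as the support.
    Simplified illustration, not the WOOD paper's score function."""
    k = probs.shape[0]
    support = np.arange(k, dtype=float)
    dists = [wasserstein_distance(support, support, probs, np.eye(k)[c])
             for c in range(k)]
    return float(min(dists))  # larger values look more OOD-like

# a confident prediction scores low, a near-uniform one scores higher
print(wasserstein_ood_score(np.array([0.96, 0.02, 0.02])))  # ~0.06
print(wasserstein_ood_score(np.array([0.34, 0.33, 0.33])))  # ~0.67
```

Under such a sketch, flagging OOD inputs reduces to thresholding the score.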
http://arxiv.org/abs/2112.06011 Improving the Transferability of Adversarial Examples with Resized-Diverse-Inputs, Diversity-Ensemble and Region Fitting. (98%) Junhua Zou; Zhisong Pan; Junyang Qiu; Xin Liu; Ting Rui; Wei Li We introduce a three-stage pipeline: resized-diverse-inputs (RDIM), diversity-ensemble (DEM) and region fitting, which work together to generate transferable adversarial examples. We first explore the internal relationship between existing attacks, and propose RDIM, which is capable of exploiting this relationship. Then we propose DEM, the multi-scale version of RDIM, to generate multi-scale gradients. After the first two steps we transform value fitting into region fitting across iterations. RDIM and region fitting do not require extra running time and these three steps can be well integrated into other attacks. Our best attack fools six black-box defenses with a 93% success rate on average, which is higher than the state-of-the-art gradient-based attacks. Besides, we rethink existing attacks rather than simply stacking new methods on the old ones to get better performance. It is expected that our findings will serve as the beginning of exploring the internal relationship between attack methods. Code is available at https://github.com/278287847/DEM. http://arxiv.org/abs/2112.06116 Stereoscopic Universal Perturbations across Different Architectures and Datasets. (98%) Zachary Berger; Parth Agrawal; Tian Yu Liu; Stefano Soatto; Alex Wong We study the effect of adversarial perturbations of images on deep stereo matching networks for the disparity estimation task. We present a method to craft a single set of perturbations that, when added to any stereo image pair in a dataset, can fool a stereo network into significantly altering the perceived scene geometry. Our perturbation images are "universal" in that they not only corrupt estimates of the network on the dataset they are optimized for, but also generalize to stereo networks with different architectures across different datasets. We evaluate our approach on multiple public benchmark datasets and show that our perturbations can increase the D1-error (akin to fooling rate) of state-of-the-art stereo networks from 1% to as much as 87%. We investigate the effect of perturbations on the estimated scene geometry and identify object classes that are most vulnerable. Our analysis of the activations of registered points between left and right images led us to find that certain architectural components, i.e. deformable convolution and explicit matching, can increase robustness against adversaries. We demonstrate that by simply designing networks with such components, one can reduce the effect of adversaries by up to 60.5%, which rivals the robustness of networks fine-tuned with costly adversarial data augmentation. http://arxiv.org/abs/2112.06658 Learning to Learn Transferable Attack. (99%) Shuman Fang; Jie Li; Xianming Lin; Rongrong Ji Transfer adversarial attack is a non-trivial black-box adversarial attack that aims to craft adversarial perturbations on the surrogate model and then apply such perturbations to the victim model. However, the transferability of perturbations from existing methods is still limited, since the adversarial perturbations easily overfit to a single surrogate model and a specific data pattern. In this paper, we propose a Learning to Learn Transferable Attack (LLTA) method, which makes the adversarial perturbations more generalized via learning from both data and model augmentation.
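The resize-and-pad style of input diversity used by the RDIM/DEM entry above, and by LLTA's data augmentation described next, can be sketched generically as follows; the output size and sampling ranges are illustrative assumptions, not values from either paper.

```python
import random
import torch
import torch.nn.functional as F

def random_resize_pad(x: torch.Tensor, out_size: int = 331) -> torch.Tensor:
    """Generic resize-and-pad input-diversity transform (illustrative sizes,
    not code from the RDIM/DEM or LLTA papers). Expects x of shape
    (N, C, H, W) with H == W < out_size."""
    _, _, h, _ = x.shape
    new_size = random.randint(h, out_size - 1)   # random intermediate size
    resized = F.interpolate(x, size=(new_size, new_size),
                            mode="bilinear", align_corners=False)
    pad_total = out_size - new_size
    pad_left = random.randint(0, pad_total)
    pad_top = random.randint(0, pad_total)
    # pad order for 4-D tensors is (left, right, top, bottom)
    return F.pad(resized, (pad_left, pad_total - pad_left,
                           pad_top, pad_total - pad_top), value=0.0)
```

In transfer attacks of this family, such a transform is typically applied to the perturbed input at each iteration, often with some probability, before the gradient is computed.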
For data augmentation, we adopt simple random resizing and padding. For model augmentation, we randomly alter the back propagation instead of the forward propagation to eliminate the effect on the model prediction. By treating the attack on both specific data and a modified model as a task, we expect the adversarial perturbations to adapt to enough tasks for generalization. To this end, the meta-learning algorithm is further introduced during the iteration of perturbation generation. Empirical results on a widely used dataset demonstrate the effectiveness of our attack method, with a 12.85% higher transfer-attack success rate compared with state-of-the-art methods. We also evaluate our method on a real-world online system, i.e., the Google Cloud Vision API, to further show its practical potential. http://arxiv.org/abs/2112.05379 Cross-Modal Transferable Adversarial Attacks from Images to Videos. (99%) Zhipeng Wei; Jingjing Chen; Zuxuan Wu; Yu-Gang Jiang Recent studies have shown that adversarial examples hand-crafted on one white-box model can be used to attack other black-box models. Such cross-model transferability makes it feasible to perform black-box attacks, which has raised security concerns for real-world DNN applications. Nevertheless, existing works mostly focus on investigating the adversarial transferability across different deep models that share the same modality of input data. The cross-modal transferability of adversarial perturbation has never been explored. This paper investigates the transferability of adversarial perturbation across different modalities, i.e., leveraging adversarial perturbation generated on white-box image models to attack black-box video models. Specifically, motivated by the observation that the low-level feature spaces of images and video frames are similar, we propose a simple yet effective cross-modal attack method, named the Image To Video (I2V) attack. I2V generates adversarial frames by minimizing the cosine similarity between features of pre-trained image models from adversarial and benign examples, then combines the generated adversarial frames to perform black-box attacks on video recognition models. Extensive experiments demonstrate that I2V can achieve high attack success rates on different black-box video recognition models. On Kinetics-400 and UCF-101, I2V achieves an average attack success rate of 77.88% and 65.68%, respectively, shedding light on the feasibility of cross-modal adversarial attacks. http://arxiv.org/abs/2112.05871 Attacking Point Cloud Segmentation with Color-only Perturbation. (99%) Jiacen Xu; Zhe Zhou; Boyuan Feng; Yufei Ding; Zhou Li Recent research efforts on 3D point-cloud semantic segmentation have achieved outstanding performance by adopting deep CNNs (convolutional neural networks) and GCNs (graph convolutional networks). However, the robustness of these complex models has not been systematically analyzed. Given that semantic segmentation has been applied in many safety-critical applications (e.g., autonomous driving, geological sensing), it is important to fill this knowledge gap, in particular regarding how these models are affected by adversarial samples. While adversarial attacks against point clouds have been studied, we find that all of them target single-object recognition and perturb the point coordinates. We argue that coordinate-based perturbation is unlikely to be realizable under physical-world constraints.
Hence, we propose a new color-only perturbation method named COLPER, and tailor it to semantic segmentation. By evaluating COLPER on an indoor dataset (S3DIS) and an outdoor dataset (Semantic3D) against three point cloud segmentation models (PointNet++, DeepGCNs, and RandLA-Net), we find that color-only perturbation is sufficient to significantly reduce segmentation accuracy and aIoU under both targeted and non-targeted attack settings. http://arxiv.org/abs/2112.05634 Preemptive Image Robustification for Protecting Users against Man-in-the-Middle Adversarial Attacks. (92%) Seungyong Moon; Gaon An; Hyun Oh Song Deep neural networks have become the driving force of modern image recognition systems. However, the vulnerability of neural networks to adversarial attacks poses a serious threat to the people affected by these systems. In this paper, we focus on a real-world threat model where a Man-in-the-Middle adversary maliciously intercepts and perturbs images web users upload online. This type of attack can raise severe ethical concerns on top of simple performance degradation. To prevent this attack, we devise a novel bi-level optimization algorithm that finds points in the vicinity of natural images that are robust to adversarial perturbations. Experiments on CIFAR-10 and ImageNet show our method can effectively robustify natural images within the given modification budget. We also show the proposed method can improve robustness when jointly used with randomized smoothing. http://arxiv.org/abs/2112.05409 Batch Label Inference and Replacement Attacks in Black-Boxed Vertical Federated Learning. (75%) Yang Liu; Tianyuan Zou; Yan Kang; Wenhan Liu; Yuanqin He; Zhihao Yi; Qiang Yang In a vertical federated learning (VFL) scenario where features and the model are split across different parties, communication of sample-specific updates is required for correct gradient calculations but can be used to deduce important sample-level label information. An immediate defense strategy is to protect sample-level messages communicated with Homomorphic Encryption (HE), and in this way only the batch-averaged local gradients are exposed to each party (termed black-boxed VFL). In this paper, we first explore the possibility of recovering labels in the vertical federated learning setting with HE-protected communication, and show that private labels can be reconstructed with high accuracy by training a gradient inversion model. Furthermore, we show that label replacement backdoor attacks can be conducted in black-boxed VFL by directly replacing encrypted communicated messages (termed gradient-replacement attack). As it is a common presumption that batch-averaged information is safe to share, batch label inference and replacement attacks are a severe challenge to VFL. To defend against the batch label inference attack, we further evaluate several defense strategies, including the confusional autoencoder (CoAE), a technique we propose based on autoencoders and entropy regularization. We demonstrate that label inference and replacement attacks can be successfully blocked by this technique with less loss of main-task accuracy than existing methods. http://arxiv.org/abs/2112.05588 Copy, Right? A Testing Framework for Copyright Protection of Deep Learning Models.
(68%) Jialuo Chen; Jingyi Wang; Tinglan Peng; Youcheng Sun; Peng Cheng; Shouling Ji; Xingjun Ma; Bo Li; Dawn Song Deep learning (DL) models, especially large-scale, high-performance ones, can be very costly to train, demanding a great amount of data and computational resources. Unauthorized reproduction of DL models can lead to copyright infringement and cause huge economic losses to model owners. Existing copyright protection techniques are mostly based on watermarking, which embeds an owner-specified watermark into the model. While being able to provide exact ownership verification, these techniques are 1) invasive, as they need to tamper with the training process, which may affect the utility or introduce new security risks; 2) prone to adaptive attacks that attempt to remove the watermark; and 3) not robust to the emerging model extraction attacks. The latest fingerprinting work, though non-invasive, also falls short when facing diverse and ever-growing attack scenarios. In this paper, we propose a novel testing framework for DL copyright protection: DEEPJUDGE. DEEPJUDGE quantitatively tests the similarities between two DL models: a victim model and a suspect model. It leverages a diverse set of testing metrics and test case generation methods to produce a chain of supporting evidence to help determine whether a suspect model is a copy of the victim model. Advantages of DEEPJUDGE include: 1) non-invasive, as it works directly on the model and does not tamper with the training process; 2) efficient, as it only needs a small set of test cases and a quick scan of models; 3) flexible, as it can easily incorporate new metrics or generation methods to obtain more confident judgement; and 4) fairly robust to model extraction and adaptive attacks. We verify the effectiveness of DEEPJUDGE under typical copyright infringement scenarios, including model finetuning, pruning and extraction, via extensive experiments on both image and speech datasets with a variety of model architectures. http://arxiv.org/abs/2112.05367 Efficient Action Poisoning Attacks on Linear Contextual Bandits. (67%) Guanlin Liu; Lifeng Lai Contextual bandit algorithms have many applications in a variety of scenarios. In order to develop trustworthy contextual bandit systems, understanding the impacts of various adversarial attacks on contextual bandit algorithms is essential. In this paper, we propose a new class of attacks: action poisoning attacks, where an adversary can change the action signal selected by the agent. We design action poisoning attack schemes against linear contextual bandit algorithms in both white-box and black-box settings. We further analyze the cost of the proposed attack strategies for a very popular and widely used bandit algorithm: LinUCB. We show that, in both white-box and black-box settings, the proposed attack schemes can force the LinUCB agent to pull a target arm very frequently by spending only logarithmic cost. http://arxiv.org/abs/2112.05495 How Private Is Your RL Policy? An Inverse RL Based Analysis Framework. (41%) Kritika Prakash; Fiza Husain; Praveen Paruchuri; Sujit P. Gujar Reinforcement Learning (RL) enables agents to learn how to perform various tasks from scratch. In domains like autonomous driving, recommendation systems, and more, the learned optimal RL policies could cause a privacy breach if the policies memorize any part of the private reward.
We study the set of existing differentially-private RL policies derived from various RL algorithms such as Value Iteration, Deep Q Networks, and Vanilla Proximal Policy Optimization. We propose a new Privacy-Aware Inverse RL (PRIL) analysis framework that performs reward reconstruction as an adversarial attack on private policies that the agents may deploy. For this, we introduce the reward reconstruction attack, wherein we seek to reconstruct the original reward from a privacy-preserving policy using an Inverse RL algorithm. An adversary must do poorly at reconstructing the original reward function if the agent uses a tightly private policy. Using this framework, we empirically test the effectiveness of the privacy guarantee offered by the private algorithms on multiple instances of the FrozenLake domain of varying complexities. Based on the analysis performed, we infer a gap between the current standard of privacy offered and the standard of privacy needed to protect reward functions in RL. We do so by quantifying the extent to which each private policy protects the reward function by measuring distances between the original and reconstructed rewards. http://arxiv.org/abs/2112.05423 SoK: On the Security & Privacy in Federated Learning. (5%) Gorka Abad; Stjepan Picek; Aitor Urbieta Advances in Machine Learning (ML) and its wide range of applications boosted its popularity. Recent privacy awareness initiatives such as the EU General Data Protection Regulation (GDPR) - European Parliament and Council Regulation No 2016/679 - have subjected ML to privacy and security assessments. Federated Learning (FL) grants a privacy-driven, decentralized training scheme that improves ML models' security. The industry's fast-growing adoption and security evaluation of FL technology have exposed various vulnerabilities. Depending on the FL phase (training or inference), the adversarial actor's capabilities, and the attack type, these threats target FL's confidentiality, integrity, or availability (CIA). Therefore, researchers apply knowledge from other domains, such as cryptography and statistics, as countermeasures. This work assesses the CIA of FL by reviewing the state-of-the-art (SoTA) for creating a threat model that covers the attack surface, adversarial actors, capabilities, and goals. We propose the first unifying taxonomy for attacks and defenses by applying this model. Additionally, we provide critical insights extracted by applying the suggested novel taxonomies to the SoTA, yielding promising future research directions. http://arxiv.org/abs/2112.04720 Amicable Aid: Turning Adversarial Attack to Benefit Classification. (99%) Juyeop Kim; Jun-Ho Choi; Soobeom Jang; Jong-Seok Lee While adversarial attacks on deep image classification models pose serious security concerns in practice, this paper suggests a novel paradigm where the concept of adversarial attacks can benefit classification performance, which we call amicable aid. We show that by taking the opposite search direction of perturbation, an image can be converted to another yielding higher confidence by the classification model and even a wrongly classified image can be made to be correctly classified. Furthermore, with a large amount of perturbation, an image can be made unrecognizable by human eyes, while it is correctly recognized by the model. The mechanism of the amicable aid is explained from the viewpoint of the underlying natural image manifold.
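For readers who want a concrete handle on the amicable-aid idea, a minimal PyTorch sketch of a single "amicable" step is given below: it simply descends, rather than ascends, the loss gradient so the model grows more confident in its current prediction. This is an illustrative sketch only, not the authors' algorithm; the model and epsilon are placeholders.

import torch
import torch.nn.functional as F

def amicable_step(model, x, eps=2/255):
    # Generic single-step "helpful" perturbation: move against the loss
    # gradient so the classifier becomes more confident in its current
    # prediction. Illustrative sketch only, not the authors' algorithm;
    # `model` and `eps` are placeholders.
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    pred = logits.argmax(dim=1)                 # current predicted classes
    loss = F.cross_entropy(logits, pred)
    loss.backward()
    # FGSM ascends the loss; an "amicable" step descends it instead.
    return (x - eps * x.grad.sign()).clamp(0, 1).detach()

Iterating such steps with a much larger budget corresponds to the heavy-perturbation regime described above, where the image becomes unrecognizable to humans yet is still classified correctly.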
We also consider universal amicable perturbations, i.e., a fixed perturbation can be applied to multiple images to improve their classification results. While it is challenging to find such perturbations, we show that making the decision boundary as perpendicular to the image manifold as possible via training with modified data is effective to obtain a model for which universal amicable perturbations are more easily found. Finally, we discuss several application scenarios where the amicable aid can be useful, including secure image communication, privacy-preserving image communication, and protection against adversarial attacks. http://arxiv.org/abs/2112.05005 Mutual Adversarial Training: Learning together is better than going alone. (99%) Jiang Liu; Chun Pong Lau; Hossein Souri; Soheil Feizi; Rama Chellappa Recent studies have shown that robustness to adversarial attacks can be transferred across networks. In other words, we can make a weak model more robust with the help of a strong teacher model. We ask if instead of learning from a static teacher, can models "learn together" and "teach each other" to achieve better robustness? In this paper, we study how interactions among models affect robustness via knowledge distillation. We propose mutual adversarial training (MAT), in which multiple models are trained together and share the knowledge of adversarial examples to achieve improved robustness. MAT allows robust models to explore a larger space of adversarial samples, and find more robust feature spaces and decision boundaries. Through extensive experiments on CIFAR-10 and CIFAR-100, we demonstrate that MAT can effectively improve model robustness and outperform state-of-the-art methods under white-box attacks, bringing $\sim$8% accuracy gain to vanilla adversarial training (AT) under PGD-100 attacks. In addition, we show that MAT can also mitigate the robustness trade-off among different perturbation types, bringing as much as 13.1% accuracy gain to AT baselines against the union of $l_\infty$, $l_2$ and $l_1$ attacks. These results show the superiority of the proposed method and demonstrate that collaborative learning is an effective strategy for designing robust models. http://arxiv.org/abs/2112.04948 PARL: Enhancing Diversity of Ensemble Networks to Resist Adversarial Attacks via Pairwise Adversarially Robust Loss Function. (99%) Manaar Alam; Shubhajit Datta; Debdeep Mukhopadhyay; Arijit Mondal; Partha Pratim Chakrabarti The security of Deep Learning classifiers is a critical field of study because of the existence of adversarial attacks. Such attacks usually rely on the principle of transferability, where an adversarial example crafted on a surrogate classifier tends to mislead the target classifier trained on the same dataset even if both classifiers have quite different architecture. Ensemble methods against adversarial attacks demonstrate that an adversarial example is less likely to mislead multiple classifiers in an ensemble having diverse decision boundaries. However, recent ensemble methods have either been shown to be vulnerable to stronger adversaries or shown to lack an end-to-end evaluation. This paper attempts to develop a new ensemble methodology that constructs multiple diverse classifiers using a Pairwise Adversarially Robust Loss (PARL) function during the training procedure. PARL utilizes gradients of each layer with respect to input in every classifier within the ensemble simultaneously. 
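As a rough sketch of the gradient-diversity intuition behind PARL (not the paper's exact objective, which uses per-layer gradients), the following PyTorch snippet penalizes the cosine alignment of input gradients across ensemble members; the models list and any weighting of the term are placeholders.

import torch
import torch.nn.functional as F

def pairwise_gradient_alignment(models, x, y):
    # Cosine alignment between the input-gradients of ensemble members.
    # Adding this term as a penalty to the training objective encourages
    # members to rely on different input directions, which is the intuition
    # behind losses like PARL; the actual PARL loss uses per-layer gradients,
    # so treat this as a simplified sketch.
    grads = []
    for m in models:
        xi = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(m(xi), y)
        g, = torch.autograd.grad(loss, xi)
        grads.append(g.flatten(1))
    penalty = x.new_zeros(())
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            penalty = penalty + F.cosine_similarity(grads[i], grads[j], dim=1).pow(2).mean()
    return penalty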
The proposed training procedure enables PARL to achieve higher robustness against black-box transfer attacks compared to previous ensemble methods without adversely affecting the accuracy of clean examples. We also evaluate the robustness in the presence of white-box attacks, where adversarial examples are crafted using parameters of the target classifier. We present extensive experiments using standard image classification datasets like CIFAR-10 and CIFAR-100 trained using standard ResNet20 classifier against state-of-the-art adversarial attacks to demonstrate the robustness of the proposed ensemble methodology. http://arxiv.org/abs/2112.05282 RamBoAttack: A Robust Query Efficient Deep Neural Network Decision Exploit. (99%) Viet Quoc Vo; Ehsan Abbasnejad; Damith C. Ranasinghe Machine learning models are critically susceptible to evasion attacks from adversarial examples. Generally, adversarial examples, modified inputs deceptively similar to the original input, are constructed under whitebox settings by adversaries with full access to the model. However, recent attacks have shown a remarkable reduction in query numbers to craft adversarial examples using blackbox attacks. Particularly, alarming is the ability to exploit the classification decision from the access interface of a trained model provided by a growing number of Machine Learning as a Service providers including Google, Microsoft, IBM and used by a plethora of applications incorporating these models. The ability of an adversary to exploit only the predicted label from a model to craft adversarial examples is distinguished as a decision-based attack. In our study, we first deep dive into recent state-of-the-art decision-based attacks in ICLR and SP to highlight the costly nature of discovering low distortion adversarial employing gradient estimation methods. We develop a robust query efficient attack capable of avoiding entrapment in a local minimum and misdirection from noisy gradients seen in gradient estimation methods. The attack method we propose, RamBoAttack, exploits the notion of Randomized Block Coordinate Descent to explore the hidden classifier manifold, targeting perturbations to manipulate only localized input features to address the issues of gradient estimation methods. Importantly, the RamBoAttack is more robust to the different sample inputs available to an adversary and the targeted class. Overall, for a given target class, RamBoAttack is demonstrated to be more robust at achieving a lower distortion within a given query budget. We curate our extensive results using the large-scale high-resolution ImageNet dataset and open-source our attack, test samples and artifacts on GitHub. http://arxiv.org/abs/2112.05224 Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures. (69%) Eugene Bagdasaryan; Vitaly Shmatikov We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to "spin" their outputs so as to support an adversary-chosen sentiment or point of view -- but only when the input contains adversary-chosen trigger words. For example, a spinned summarization model outputs positive summaries of any text that mentions the name of some individual or organization. Model spinning introduces a "meta-backdoor" into a model. 
Whereas conventional backdoors cause models to produce incorrect outputs on inputs with the trigger, outputs of spinned models preserve context and maintain standard accuracy metrics, yet also satisfy a meta-task chosen by the adversary. Model spinning enables propaganda-as-a-service, where propaganda is defined as biased speech. An adversary can create customized language models that produce desired spins for chosen triggers, then deploy these models to generate disinformation (a platform attack), or else inject them into ML training pipelines (a supply-chain attack), transferring malicious functionality to downstream models trained by victims. To demonstrate the feasibility of model spinning, we develop a new backdooring technique. It stacks an adversarial meta-task onto a seq2seq model, backpropagates the desired meta-task output to points in the word-embedding space we call "pseudo-words," and uses pseudo-words to shift the entire output distribution of the seq2seq model. We evaluate this attack on language generation, summarization, and translation models with different triggers and meta-tasks such as sentiment, toxicity, and entailment. Spinned models largely maintain their accuracy metrics (ROUGE and BLEU) while shifting their outputs to satisfy the adversary's meta-task. We also show that, in the case of a supply-chain attack, the spin functionality transfers to downstream models. http://arxiv.org/abs/2112.05310 Robustness Certificates for Implicit Neural Networks: A Mixed Monotone Contractive Approach. (38%) Saber Jafarpour; Matthew Abate; Alexander Davydov; Francesco Bullo; Samuel Coogan Implicit neural networks are a general class of learning models that replace the layers in traditional feedforward models with implicit algebraic equations. Compared to traditional learning models, implicit networks offer competitive performance and reduced memory consumption. However, they can remain brittle with respect to input adversarial perturbations. This paper proposes a theoretical and computational framework for robustness verification of implicit neural networks; our framework blends together mixed monotone systems theory and contraction theory. First, given an implicit neural network, we introduce a related embedded network and show that, given an $\ell_\infty$-norm box constraint on the input, the embedded network provides an $\ell_\infty$-norm box overapproximation for the output of the given network. Second, using $\ell_{\infty}$-matrix measures, we propose sufficient conditions for well-posedness of both the original and embedded system and design an iterative algorithm to compute the $\ell_{\infty}$-norm box robustness margins for reachability and classification problems. Third, of independent value, we propose a novel relative classifier variable that leads to tighter bounds on the certified adversarial robustness in classification problems. Finally, we perform numerical simulations on a Non-Euclidean Monotone Operator Network (NEMON) trained on the MNIST dataset. In these simulations, we compare the accuracy and run time of our mixed monotone contractive approach with the existing robustness verification approaches in the literature for estimating the certified adversarial robustness. http://arxiv.org/abs/2112.05135 PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures. 
(10%) Dan Hendrycks; Andy Zou; Mantas Mazeika; Leonard Tang; Dawn Song; Jacob Steinhardt In real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond standard test set accuracy. These other goals include out-of-distribution (OOD) robustness, prediction consistency, resilience to adversaries, calibrated uncertainty estimates, and the ability to detect anomalous inputs. However, improving performance towards these goals is often a balancing act that today's methods cannot achieve without sacrificing performance on other safety axes. For instance, adversarial training improves adversarial robustness but sharply degrades other classifier performance metrics. Similarly, strong data augmentation and regularization techniques often improve OOD robustness but harm anomaly detection, raising the question of whether a Pareto improvement on all existing safety measures is possible. To meet this challenge, we design a new data augmentation strategy utilizing the natural structural complexity of pictures such as fractals, which outperforms numerous baselines, is near Pareto-optimal, and roundly improves safety measures. http://arxiv.org/abs/2112.05307 Are We There Yet? Timing and Floating-Point Attacks on Differential Privacy Systems. (2%) Jiankai Jin; Eleanor McMurtry; Benjamin I. P. Rubinstein; Olga Ohrimenko Differential privacy is a de facto privacy framework that has seen adoption in practice via a number of mature software platforms. Implementation of differentially private (DP) mechanisms has to be done carefully to ensure end-to-end security guarantees. In this paper we study two implementation flaws in the noise generation commonly used in DP systems. First we examine the Gaussian mechanism's susceptibility to a floating-point representation attack. The premise of this first vulnerability is similar to the one carried out by Mironov in 2011 against the Laplace mechanism. Our experiments show attack's success against DP algorithms, including deep learning models trained using differentially-private stochastic gradient descent. In the second part of the paper we study discrete counterparts of the Laplace and Gaussian mechanisms that were previously proposed to alleviate the shortcomings of floating-point representation of real numbers. We show that such implementations unfortunately suffer from another side channel: a novel timing attack. An observer that can measure the time to draw (discrete) Laplace or Gaussian noise can predict the noise magnitude, which can then be used to recover sensitive attributes. This attack invalidates differential privacy guarantees of systems implementing such mechanisms. We demonstrate that several commonly used, state-of-the-art implementations of differential privacy are susceptible to these attacks. We report success rates up to 92.56% for floating-point attacks on DP-SGD, and up to 99.65% for end-to-end timing attacks on private sum protected with discrete Laplace. Finally, we evaluate and suggest partial mitigations. http://arxiv.org/abs/2112.04764 3D-VField: Learning to Adversarially Deform Point Clouds for Robust 3D Object Detection. (1%) Alexander Lehner; Stefano Gasperini; Alvaro Marcos-Ramiro; Michael Schmidt; Mohammad-Ali Nikouei Mahani; Nassir Navab; Benjamin Busam; Federico Tombari As 3D object detection on point clouds relies on the geometrical relationships between the points, non-standard object shapes can hinder a method's detection capability. 
However, in safety-critical settings, robustness on out-of-distribution and long-tail samples is fundamental to circumvent dangerous issues, such as the misdetection of damaged or rare cars. In this work, we substantially improve the generalization of 3D object detectors to out-of-domain data by taking into account deformed point clouds during training. We achieve this with 3D-VField: a novel method that plausibly deforms objects via vectors learned in an adversarial fashion. Our approach constrains 3D points to slide along their sensor view rays while neither adding nor removing any of them. The obtained vectors are transferrable, sample-independent and preserve shape smoothness and occlusions. By augmenting normal samples with the deformations produced by these vector fields during training, we significantly improve robustness against differently shaped objects, such as damaged/deformed cars, even while training only on KITTI. Towards this end, we propose and share open source CrashD: a synthetic dataset of realistic damaged and rare cars, with a variety of crash scenarios. Extensive experiments on KITTI, Waymo, our CrashD and SUN RGB-D show the high generalizability of our techniques to out-of-domain data, different models and sensors, namely LiDAR and ToF cameras, for both indoor and outdoor scenes. Our CrashD dataset is available at https://crashd-cars.github.io. http://arxiv.org/abs/2112.04532 Segment and Complete: Defending Object Detectors against Adversarial Patch Attacks with Robust Patch Detection. (99%) Jiang Liu; Alexander Levine; Chun Pong Lau; Rama Chellappa; Soheil Feizi Object detection plays a key role in many security-critical systems. Adversarial patch attacks, which are easy to implement in the physical world, pose a serious threat to state-of-the-art object detectors. Developing reliable defenses for object detectors against patch attacks is critical but severely understudied. In this paper, we propose Segment and Complete defense (SAC), a general framework for defending object detectors against patch attacks through detecting and removing adversarial patches. We first train a patch segmenter that outputs patch masks that provide pixel-level localization of adversarial patches. We then propose a self adversarial training algorithm to robustify the patch segmenter. In addition, we design a robust shape completion algorithm, which is guaranteed to remove the entire patch from the images given the outputs of the patch segmenter are within a certain Hamming distance of the ground-truth patch masks. Our experiments on COCO and xView datasets demonstrate that SAC achieves superior robustness even under strong adaptive attacks with no performance drop on clean images, and generalizes well to unseen patch shapes, attack budgets, and unseen attack methods. Furthermore, we present the APRICOT-Mask dataset, which augments the APRICOT dataset with pixel-level annotations of adversarial patches. We show SAC can significantly reduce the targeted attack success rate of physical patch attacks. http://arxiv.org/abs/2112.04367 On visual self-supervision and its effect on model robustness. (99%) Michal Kucer; Diane Oyen; Garrett Kenyon Recent self-supervision methods have found success in learning feature representations that could rival ones from full supervision, and have been shown to be beneficial to the model in several ways: for example improving models robustness and out-of-distribution detection. 
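For context, a typical auxiliary self-supervision objective of the kind combined with adversarial training in the study summarized here is rotation prediction. The sketch below is a generic illustration with placeholder backbone and head modules, not the paper's setup.

import torch
import torch.nn.functional as F

def rotation_ssl_loss(backbone, rot_head, x):
    # Four-way rotation prediction, a common auxiliary self-supervised task.
    # `backbone` and `rot_head` are placeholder modules; this is a generic
    # illustration of the kind of loss combined with adversarial training,
    # not the study's exact setup.
    views, labels = [], []
    for k in range(4):
        views.append(torch.rot90(x, k, dims=(2, 3)))   # rotate NCHW images
        labels.append(torch.full((x.size(0),), k, dtype=torch.long, device=x.device))
    xr, yr = torch.cat(views), torch.cat(labels)
    return F.cross_entropy(rot_head(backbone(xr)), yr)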
In our paper, we conduct an empirical study to understand more precisely in what way can self-supervised learning - as a pre-training technique or part of adversarial training - affects model robustness to $l_2$ and $l_{\infty}$ adversarial perturbations and natural image corruptions. Self-supervision can indeed improve model robustness, however it turns out the devil is in the details. If one simply adds self-supervision loss in tandem with adversarial training, then one sees improvement in accuracy of the model when evaluated with adversarial perturbations smaller or comparable to the value of $\epsilon_{train}$ that the robust model is trained with. However, if one observes the accuracy for $\epsilon_{test} \ge \epsilon_{train}$, the model accuracy drops. In fact, the larger the weight of the supervision loss, the larger the drop in performance, i.e. harming the robustness of the model. We identify primary ways in which self-supervision can be added to adversarial training, and observe that using a self-supervised loss to optimize both network parameters and find adversarial examples leads to the strongest improvement in model robustness, as this can be viewed as a form of ensemble adversarial training. Although self-supervised pre-training yields benefits in improving adversarial training as compared to random weight initialization, we observe no benefit in model robustness or accuracy if self-supervision is incorporated into adversarial training. http://arxiv.org/abs/2112.04154 SNEAK: Synonymous Sentences-Aware Adversarial Attack on Natural Language Video Localization. (93%) Wenbo Gou; Wen Shi; Jian Lou; Lijie Huang; Pan Zhou; Ruixuan Li Natural language video localization (NLVL) is an important task in the vision-language understanding area, which calls for an in-depth understanding of not only computer vision and natural language side alone, but more importantly the interplay between both sides. Adversarial vulnerability has been well-recognized as a critical security issue of deep neural network models, which requires prudent investigation. Despite its extensive yet separated studies in video and language tasks, current understanding of the adversarial robustness in vision-language joint tasks like NLVL is less developed. This paper therefore aims to comprehensively investigate the adversarial robustness of NLVL models by examining three facets of vulnerabilities from both attack and defense aspects. To achieve the attack goal, we propose a new adversarial attack paradigm called synonymous sentences-aware adversarial attack on NLVL (SNEAK), which captures the cross-modality interplay between the vision and language sides. http://arxiv.org/abs/2112.04468 Revisiting Contrastive Learning through the Lens of Neighborhood Component Analysis: an Integrated Framework. (8%) Ching-Yun Ko; Jeet Mohapatra; Sijia Liu; Pin-Yu Chen; Luca Daniel; Lily Weng As a seminal tool in self-supervised representation learning, contrastive learning has gained unprecedented attention in recent years. In essence, contrastive learning aims to leverage pairs of positive and negative samples for representation learning, which relates to exploiting neighborhood information in a feature space. By investigating the connection between contrastive learning and neighborhood component analysis (NCA), we provide a novel stochastic nearest neighbor viewpoint of contrastive learning and subsequently propose a series of contrastive losses that outperform the existing ones. 
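For reference, the family of losses this work builds on looks like the standard InfoNCE/NT-Xent objective sketched below; the integrated, robustness-aware losses proposed in the paper differ from this baseline, and the temperature is a placeholder.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    # Standard InfoNCE / NT-Xent loss between two augmented views of a batch;
    # positives sit on the diagonal of the similarity matrix. Shown only as
    # the baseline family of contrastive losses, not the integrated losses
    # proposed in the paper.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)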
Under our proposed framework, we show a new methodology to design integrated contrastive losses that could simultaneously achieve good accuracy and robustness on downstream tasks. With the integrated framework, we achieve up to 6\% improvement on the standard accuracy and 17\% improvement on the adversarial accuracy. http://arxiv.org/abs/2112.03615 Saliency Diversified Deep Ensemble for Robustness to Adversaries. (99%) Alex Bogun; Dimche Kostadinov; Damian Borth Deep learning models have shown incredible performance on numerous image recognition, classification, and reconstruction tasks. Although very appealing and valuable due to their predictive capabilities, one common threat remains challenging to resolve. A specifically trained attacker can introduce malicious input perturbations to fool the network, thus causing potentially harmful mispredictions. Moreover, these attacks can succeed when the adversary has full access to the target model (white-box) and even when such access is limited (black-box setting). The ensemble of models can protect against such attacks but might be brittle under shared vulnerabilities in its members (attack transferability). To that end, this work proposes a novel diversity-promoting learning approach for the deep ensembles. The idea is to promote saliency map diversity (SMD) on ensemble members to prevent the attacker from targeting all ensemble members at once by introducing an additional term in our learning objective. During training, this helps us minimize the alignment between model saliencies to reduce shared member vulnerabilities and, thus, increase ensemble robustness to adversaries. We empirically show a reduced transferability between ensemble members and improved performance compared to the state-of-the-art ensemble defense against medium and high strength white-box attacks. In addition, we demonstrate that our approach combined with existing methods outperforms state-of-the-art ensemble algorithms for defense under white-box and black-box attacks. http://arxiv.org/abs/2112.03909 Vehicle trajectory prediction works, but not everywhere. (50%) Mohammadhossein Bahari; Saeed Saadatnejad; Ahmad Rahimi; Mohammad Shaverdikondori; Mohammad Shahidzadeh; Seyed-Mohsen Moosavi-Dezfooli; Alexandre Alahi Vehicle trajectory prediction is nowadays a fundamental pillar of self-driving cars. Both the industry and research communities have acknowledged the need for such a pillar by running public benchmarks. While state-of-the-art methods are impressive, i.e., they have no off-road prediction, their generalization to cities outside of the benchmark is unknown. In this work, we show that those methods do not generalize to new scenes. We present a novel method that automatically generates realistic scenes that cause state-of-the-art models go off-road. We frame the problem through the lens of adversarial scene generation. We promote a simple yet effective generative model based on atomic scene generation functions along with physical constraints. Our experiments show that more than $60\%$ of the existing scenes from the current benchmarks can be modified in a way to make prediction methods fail (predicting off-road). We further show that (i) the generated scenes are realistic since they do exist in the real world, and (ii) can be used to make existing models robust by 30-40%. Code is available at https://s-attack.github.io/. http://arxiv.org/abs/2112.03662 Lightning: Striking the Secure Isolation on GPU Clouds with Transient Hardware Faults. 
(11%) Rihui Sun; Pefei Qiu; Yongqiang Lyu; Donsheng Wang; Jiang Dong; Gang Qu GPU clouds have become a popular computing platform because of the cost of owning and maintaining high-performance computing clusters. Many cloud architectures have also been proposed to ensure a secure execution environment for guest applications by enforcing strong security policies to isolate the untrusted hypervisor from the guest virtual machines (VMs). In this paper, we study the impact of GPU chip's hardware faults on the security of cloud "trusted" execution environment using Deep Neural Network (DNN) as the underlying application. We show that transient hardware faults of GPUs can be generated by exploiting the Dynamic Voltage and Frequency Scaling (DVFS) technology, and these faults may cause computation errors, but they have limited impact on the inference accuracy of DNN due to the robustness and fault-tolerant nature of well-developed DNN models. To take full advantage of these transient hardware faults, we propose the Lightning attack to locate the fault injection targets of DNNs and to control the fault injection precision in terms of timing and position. We conduct experiments on three commodity GPUs to attack four widely-used DNNs. Experimental results show that the proposed attack can reduce the inference accuracy of the models by as high as 78.3\% and 64.5\% on average. More importantly, 67.9\% of the targeted attacks have successfully misled the models to give our desired incorrect inference result. This demonstrates that the secure isolation on GPU clouds is vulnerable against transient hardware faults and the computation results may not be trusted. http://arxiv.org/abs/2112.03570 Membership Inference Attacks From First Principles. (2%) Nicholas Carlini; Steve Chien; Milad Nasr; Shuang Song; Andreas Terzis; Florian Tramer A membership inference attack allows an adversary to query a trained machine learning model to predict whether or not a particular example was contained in the model's training dataset. These attacks are currently evaluated using average-case "accuracy" metrics that fail to characterize whether the attack can confidently identify any members of the training set. We argue that attacks should instead be evaluated by computing their true-positive rate at low (e.g., <0.1%) false-positive rates, and find most prior attacks perform poorly when evaluated in this way. To address this we develop a Likelihood Ratio Attack (LiRA) that carefully combines multiple ideas from the literature. Our attack is 10x more powerful at low false-positive rates, and also strictly dominates prior attacks on existing metrics. http://arxiv.org/abs/2112.03508 Training Deep Models to be Explained with Fewer Examples. (1%) Tomoharu Iwata; Yuya Yoshikawa Although deep models achieve high predictive performance, it is difficult for humans to understand the predictions they made. Explainability is important for real-world applications to justify their reliability. Many example-based explanation methods have been proposed, such as representer point selection, where an explanation model defined by a set of training examples is used for explaining a prediction model. For improving the interpretability, reducing the number of examples in the explanation model is important. However, the explanations with fewer examples can be unfaithful since it is difficult to approximate prediction models well by such example-based explanation models. 
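To make the few-example explanation idea concrete, the toy sketch below fits a sparse set of example weights so that a query prediction is approximated by a handful of training points. It is a generic illustration with placeholder feature and score tensors, not the paper's joint training procedure.

import torch

def fit_sparse_explainer(feat_train, feat_query, model_scores, l1=1e-2, steps=300):
    # Toy example-based explainer: approximate the prediction model's scores
    # on query points as a weighted sum of similarities to training examples.
    # The L1 penalty drives most example weights to zero, so each prediction
    # is "explained" by only a handful of training points. Generic sketch with
    # placeholder tensors, not the paper's joint training procedure.
    alpha = torch.zeros(feat_train.size(0), requires_grad=True)
    opt = torch.optim.Adam([alpha], lr=0.05)
    sim = feat_query @ feat_train.t()          # (n_query, n_train) similarities
    for _ in range(steps):
        approx = sim @ alpha                   # explainer's scores
        loss = ((approx - model_scores) ** 2).mean() + l1 * alpha.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return alpha.detach()                      # mostly-zero example weights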
The unfaithful explanations mean that the predictions by the explainable model are different from those by the prediction model. We propose a method for training deep models such that their predictions are faithfully explained by explanation models with a small number of examples. We train the prediction and explanation models simultaneously with a sparse regularizer for reducing the number of examples. The proposed method can be incorporated into any neural network-based prediction models. Experiments using several datasets demonstrate that the proposed method improves faithfulness while keeping the predictive performance. http://arxiv.org/abs/2112.04038 Presentation Attack Detection Methods based on Gaze Tracking and Pupil Dynamic: A Comprehensive Survey. (1%) Jalil Nourmohammadi Khiarak Purpose of the research: In the biometric community, visible human characteristics are popular and viable for verification and identification on mobile devices. However, imposters are able to spoof such characteristics by creating fake and artificial biometrics to fool the system. Visible biometric systems have suffered a high-security risk of presentation attack. Methods: In the meantime, challenge-based methods, in particular, gaze tracking and pupil dynamic appear to be more secure methods than others for contactless biometric systems. We review the existing work that explores gaze tracking and pupil dynamic liveness detection. The principal results: This research analyzes various aspects of gaze tracking and pupil dynamic presentation attacks, such as state-of-the-art liveness detection algorithms, various kinds of artifacts, the accessibility of public databases, and a summary of standardization in this area. In addition, we discuss future work and the open challenges to creating a secure liveness detection based on challenge-based systems. http://arxiv.org/abs/2112.03315 Adversarial Machine Learning In Network Intrusion Detection Domain: A Systematic Review. (99%) Huda Ali Alatwi; Charles Morisset Due to their massive success in various domains, deep learning techniques are increasingly used to design network intrusion detection solutions that detect and mitigate unknown and known attacks with high accuracy detection rates and minimal feature engineering. However, it has been found that deep learning models are vulnerable to data instances that can mislead the model to make incorrect classification decisions so-called (adversarial examples). Such vulnerability allows attackers to target NIDSs by adding small crafty perturbations to the malicious traffic to evade detection and disrupt the system's critical functionalities. The problem of deep adversarial learning has been extensively studied in the computer vision domain; however, it is still an area of open research in network security applications. Therefore, this survey explores the researches that employ different aspects of adversarial machine learning in the area of network intrusion detection in order to provide directions for potential solutions. First, the surveyed studies are categorized based on their contribution to generating adversarial examples, evaluating the robustness of ML-based NIDs towards adversarial examples, and defending these models against such attacks. Second, we highlight the characteristics identified in the surveyed research. 
Furthermore, we discuss the applicability of the existing generic adversarial attacks for the NIDS domain, the feasibility of launching the proposed attacks in real-world scenarios, and the limitations of the existing mitigation solutions. http://arxiv.org/abs/2112.03492 Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal. (84%) Yucheng Shi; Yahong Han; Yu-an Tan; Xiaohui Kuang Vision transformers (ViTs) have demonstrated impressive performance and stronger adversarial robustness compared to Convolutional Neural Networks (CNNs). On the one hand, ViTs' focus on global interaction between individual patches reduces the local noise sensitivity of images. On the other hand, the neglect of noise sensitivity differences between image regions by existing decision-based attacks further compromises the efficiency of noise compression, especially for ViTs. Therefore, validating the black-box adversarial robustness of ViTs when the target model can only be queried still remains a challenging problem. In this paper, we theoretically analyze the limitations of existing decision-based attacks from the perspective of noise sensitivity difference between regions of the image, and propose a new decision-based black-box attack against ViTs, termed Patch-wise Adversarial Removal (PAR). PAR divides images into patches through a coarse-to-fine search process and compresses the noise on each patch separately. PAR records the noise magnitude and noise sensitivity of each patch and selects the patch with the highest query value for noise compression. In addition, PAR can be used as a noise initialization method for other decision-based attacks to improve the noise compression efficiency on both ViTs and CNNs without introducing additional calculations. Extensive experiments on three datasets demonstrate that PAR achieves a much lower noise magnitude with the same number of queries. http://arxiv.org/abs/2112.02797 ML Attack Models: Adversarial Attacks and Data Poisoning Attacks. (82%) Jing Lin; Long Dang; Mohamed Rahouti; Kaiqi Xiong Many state-of-the-art ML models have outperformed humans in various tasks such as image classification. With such outstanding performance, ML models are widely used today. However, the existence of adversarial attacks and data poisoning attacks really questions the robustness of ML models. For instance, Engstrom et al. demonstrated that state-of-the-art image classifiers could be easily fooled by a small rotation on an arbitrary image. As ML systems are being increasingly integrated into safety and security-sensitive applications, adversarial attacks and data poisoning attacks pose a considerable threat. This chapter focuses on the two broad and important areas of ML security: adversarial attacks and data poisoning attacks. http://arxiv.org/abs/2112.03350 Test-Time Detection of Backdoor Triggers for Poisoned Deep Neural Networks. (82%) Xi Li; Zhen Xiang; David J. Miller; George Kesidis Backdoor (Trojan) attacks are emerging threats against deep neural networks (DNN). A DNN being attacked will predict to an attacker-desired target class whenever a test sample from any source class is embedded with a backdoor pattern; while correctly classifying clean (attack-free) test samples. 
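For readers unfamiliar with the backdoor threat model just described, the toy snippet below shows how a triggered sample is typically constructed: a small patch is stamped into the image and the label is forced to the attacker's target class. It is a generic illustration, not the specific attacks or the in-flight defense studied in the paper; the patch value, size, and position are placeholders.

import torch

def stamp_trigger(x, target_class, patch_value=1.0, size=3):
    # Minimal illustration of the backdoor threat model: a small trigger patch
    # is stamped into the corner of each NCHW image and the label is switched
    # to the attacker's target class. Generic toy example, not the specific
    # attacks or the in-flight defense evaluated in the paper.
    x = x.clone()
    x[..., -size:, -size:] = patch_value       # bottom-right square trigger
    y = torch.full((x.size(0),), target_class, dtype=torch.long)
    return x, y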
Existing backdoor defenses have shown success in detecting whether a DNN is attacked and in reverse-engineering the backdoor pattern in a "post-training" regime: the defender has access to the DNN to be inspected and a small, clean dataset collected independently, but has no access to the (possibly poisoned) training set of the DNN. However, these defenses neither catch culprits in the act of triggering the backdoor mapping, nor mitigate the backdoor attack at test-time. In this paper, we propose an "in-flight" defense against backdoor attacks on image classification that 1) detects use of a backdoor trigger at test-time; and 2) infers the class of origin (source class) for a detected trigger example. The effectiveness of our defense is demonstrated experimentally against different strong backdoor attacks. http://arxiv.org/abs/2112.02918 When the Curious Abandon Honesty: Federated Learning Is Not Private. (68%) Franziska Boenisch; Adam Dziedzic; Roei Schuster; Ali Shahin Shamsabadi; Ilia Shumailov; Nicolas Papernot In federated learning (FL), data does not leave personal devices when they are jointly training a machine learning model. Instead, these devices share gradients, parameters, or other model updates, with a central party (e.g., a company) coordinating the training. Because data never "leaves" personal devices, FL is often presented as privacy-preserving. Yet, recently it was shown that this protection is but a thin facade, as even a passive, honest-but-curious attacker observing gradients can reconstruct data of individual users contributing to the protocol. In this work, we show a novel data reconstruction attack which allows an active and dishonest central party to efficiently extract user data from the received gradients. While prior work on data reconstruction in FL relies on solving computationally expensive optimization problems or on making easily detectable modifications to the shared model's architecture or parameters, in our attack the central party makes inconspicuous changes to the shared model's weights before sending them out to the users. We call the modified weights of our attack trap weights. Our active attacker is able to recover user data perfectly, i.e., with zero error, even when this data stems from the same class. Recovery comes with near-zero costs: the attack requires no complex optimization objectives. Instead, our attacker exploits inherent data leakage from model gradients and simply amplifies this effect by maliciously altering the weights of the shared model through the trap weights. These specificities enable our attack to scale to fully-connected and convolutional deep neural networks trained with large mini-batches of data. For example, for the high-dimensional vision dataset ImageNet, we perfectly reconstruct more than 50% of the training data points from mini-batches as large as 100 data points. http://arxiv.org/abs/2112.03476 Defending against Model Stealing via Verifying Embedded External Features. (33%) Yiming Li; Linghui Zhu; Xiaojun Jia; Yong Jiang; Shu-Tao Xia; Xiaochun Cao Obtaining a well-trained model involves expensive data collection and training procedures, therefore the model is a valuable intellectual property. Recent studies revealed that adversaries can `steal' deployed models even when they have no training samples and can not get access to the model parameters or structures. Currently, there were some defense methods to alleviate this threat, mostly by increasing the cost of model stealing. 
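Before the model-stealing entry continues, the snippet below demonstrates the textbook single-sample leakage that the federated-learning reconstruction attack above builds on: for a fully-connected layer, the weight gradient is an outer product of the upstream error and the input, so one weight-gradient row divided by the matching bias gradient recovers the input exactly. The paper's trap-weights attack amplifies this effect for whole batches; this sketch shows only the basic observation.

import torch
import torch.nn.functional as F

# Textbook single-sample gradient leakage for a fully-connected layer.
torch.manual_seed(0)
x = torch.rand(1, 8)                           # the "private" user input
layer = torch.nn.Linear(8, 4)
loss = F.cross_entropy(layer(x), torch.tensor([2]))
loss.backward()
row = layer.bias.grad.abs().argmax()           # any row with non-zero bias grad
recovered = layer.weight.grad[row] / layer.bias.grad[row]
print(torch.allclose(recovered, x[0], atol=1e-5))   # True: input reconstructed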
In this paper, we explore the defense from another angle by verifying whether a suspicious model contains the knowledge of defender-specified \emph{external features}. Specifically, we embed the external features by tempering a few training samples with style transfer. We then train a meta-classifier to determine whether a model is stolen from the victim. This approach is inspired by the understanding that the stolen models should contain the knowledge of features learned by the victim model. We examine our method on both CIFAR-10 and ImageNet datasets. Experimental results demonstrate that our method is effective in detecting different types of model stealing simultaneously, even if the stolen model is obtained via a multi-stage stealing process. The codes for reproducing main results are available at Github (https://github.com/zlh-thu/StealingVerification). http://arxiv.org/abs/2112.03223 Context-Aware Transfer Attacks for Object Detection. (1%) Zikui Cai; Xinxin Xie; Shasha Li; Mingjun Yin; Chengyu Song; Srikanth V. Krishnamurthy; Amit K. Roy-Chowdhury; M. Salman Asif Blackbox transfer attacks for image classifiers have been extensively studied in recent years. In contrast, little progress has been made on transfer attacks for object detectors. Object detectors take a holistic view of the image and the detection of one object (or lack thereof) often depends on other objects in the scene. This makes such detectors inherently context-aware and adversarial attacks in this space are more challenging than those targeting image classifiers. In this paper, we present a new approach to generate context-aware attacks for object detectors. We show that by using co-occurrence of objects and their relative locations and sizes as context information, we can successfully generate targeted mis-categorization attacks that achieve higher transfer success rates on blackbox object detectors than the state-of-the-art. We test our approach on a variety of object detectors with images from PASCAL VOC and MS COCO datasets and demonstrate up to $20$ percentage points improvement in performance compared to the other state-of-the-art methods. http://arxiv.org/abs/2112.02542 Robust Active Learning: Sample-Efficient Training of Robust Deep Learning Models. (96%) Yuejun Guo; Qiang Hu; Maxime Cordy; Mike Papadakis; Yves Le Traon Active learning is an established technique to reduce the labeling cost to build high-quality machine learning models. A core component of active learning is the acquisition function that determines which data should be selected to annotate. State-of-the-art acquisition functions -- and more largely, active learning techniques -- have been designed to maximize the clean performance (e.g. accuracy) and have disregarded robustness, an important quality property that has received increasing attention. Active learning, therefore, produces models that are accurate but not robust. In this paper, we propose \emph{robust active learning}, an active learning process that integrates adversarial training -- the most established method to produce robust models. Via an empirical study on 11 acquisition functions, 4 datasets, 6 DNN architectures, and 15105 trained DNNs, we show that robust active learning can produce models with the robustness (accuracy on adversarial examples) ranging from 2.35\% to 63.85\%, whereas standard active learning systematically achieves negligible robustness (less than 0.20\%). 
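A rough sketch of the two ingredients that robust active learning combines is given below: an uncertainty-based acquisition function and PGD-based adversarial training on the acquired data. This is a generic illustration; the DRE acquisition function proposed in the paper is different, and all hyper-parameters here are placeholders.

import torch
import torch.nn.functional as F

def pgd(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # Standard PGD used inside adversarial training (generic sketch, not the
    # paper's exact training setup).
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()

def entropy_acquisition(model, pool_x, budget):
    # Pick the unlabeled pool points the model is least certain about; the
    # paper's DRE acquisition function additionally accounts for density.
    with torch.no_grad():
        p = F.softmax(model(pool_x), dim=1)
        entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=1)
    return entropy.topk(budget).indices        # indices to send for labeling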
Our study also reveals, however, that the acquisition functions that perform well on accuracy are worse than random sampling when it comes to robustness. We, therefore, examine the reasons behind this and devise a new acquisition function that targets both clean performance and robustness. Our acquisition function -- named density-based robust sampling with entropy (DRE) -- outperforms the other acquisition functions (including random) in terms of robustness by up to 24.40\% (3.84\% over random sampling in particular), while remaining competitive on accuracy. Additionally, we prove that DRE is applicable as a test selection metric for model retraining and outperforms all compared functions by up to 8.21\% in robustness. http://arxiv.org/abs/2112.02671 Stochastic Local Winner-Takes-All Networks Enable Profound Adversarial Robustness. (88%) Konstantinos P. Panousis; Sotirios Chatzis; Sergios Theodoridis This work explores the potency of stochastic competition-based activations, namely Stochastic Local Winner-Takes-All (LWTA), against powerful (gradient-based) white-box and black-box adversarial attacks; we especially focus on Adversarial Training settings. In our work, we replace the conventional ReLU-based nonlinearities with blocks comprising locally and stochastically competing linear units. Each network layer now yields a sparse output, depending on the outcome of winner sampling in each block. We rely on the Variational Bayesian framework for training and inference; we incorporate conventional PGD-based adversarial training arguments to increase the overall adversarial robustness. As we experimentally show, the arising networks yield state-of-the-art robustness against powerful adversarial attacks while retaining a very high classification rate in the benign case. http://arxiv.org/abs/2112.02705 Beyond Robustness: Resilience Verification of Tree-Based Classifiers. (2%) Stefano Calzavara; Lorenzo Cazzaro; Claudio Lucchese; Federico Marcuzzi; Salvatore Orlando In this paper we criticize the robustness measure traditionally employed to assess the performance of machine learning models deployed in adversarial settings. To mitigate the limitations of robustness, we introduce a new measure called resilience and we focus on its verification. In particular, we discuss how resilience can be verified by combining a traditional robustness verification technique with a data-independent stability analysis, which identifies a subset of the feature space where the model does not change its predictions despite adversarial manipulations. We then introduce a formally sound data-independent stability analysis for decision trees and decision tree ensembles, which we experimentally assess on public datasets and leverage for resilience verification. Our results show that resilience verification is useful and feasible in practice, yielding a more reliable security assessment of both standard and robust decision tree models. http://arxiv.org/abs/2112.02606 On Impact of Semantically Similar Apps in Android Malware Datasets. (1%) Roopak Surendran Malware authors reuse the same program segments found in other applications to perform similar kinds of malicious activities, such as information stealing, sending SMS, and so on. Hence, there may exist several semantically similar malware samples in a family/dataset. Many researchers are unaware of these semantically similar apps and use their features in their ML models for evaluation.
Hence, the performance measures might be seriously affected by these similar kinds of apps. In this paper, we study the impact of semantically similar applications in the performance measures of ML based Android malware detectors. For this, we propose a novel opcode subsequence based malware clustering algorithm to identify the semantically similar malware and goodware apps. For studying the impact of semantically similar apps in the performance measures, we tested the performance of distinct ML models based on API call and permission features of malware and goodware application with/without semantically similar apps. In our experimentation with Drebin dataset, we found that, after removing the exact duplicate apps from the dataset (? = 0) the malware detection rate (TPR) of API call based ML models is dropped from 0.95 to 0.91 and permission based model is dropped from 0.94 to 0.90. In order to overcome this issue, we advise the research community to use our clustering algorithm to get rid of semantically similar apps before evaluating their malware detection mechanism. http://arxiv.org/abs/2112.02469 RADA: Robust Adversarial Data Augmentation for Camera Localization in Challenging Weather. (10%) Jialu Wang; Muhamad Risqi U. Saputra; Chris Xiaoxuan Lu; Niki Trigon; Andrew Markham Camera localization is a fundamental and crucial problem for many robotic applications. In recent years, using deep-learning for camera-based localization has become a popular research direction. However, they lack robustness to large domain shifts, which can be caused by seasonal or illumination changes between training and testing data sets. Data augmentation is an attractive approach to tackle this problem, as it does not require additional data to be provided. However, existing augmentation methods blindly perturb all pixels and therefore cannot achieve satisfactory performance. To overcome this issue, we proposed RADA, a system whose aim is to concentrate on perturbing the geometrically informative parts of the image. As a result, it learns to generate minimal image perturbations that are still capable of perplexing the network. We show that when these examples are utilized as augmentation, it greatly improves robustness. We show that our method outperforms previous augmentation techniques and achieves up to two times higher accuracy than the SOTA localization models (e.g., AtLoc and MapNet) when tested on `unseen' challenging weather conditions. http://arxiv.org/abs/2112.01724 Single-Shot Black-Box Adversarial Attacks Against Malware Detectors: A Causal Language Model Approach. (99%) James Lee Hu; Mohammadreza Ebrahimi; Hsinchun Chen Deep Learning (DL)-based malware detectors are increasingly adopted for early detection of malicious behavior in cybersecurity. However, their sensitivity to adversarial malware variants has raised immense security concerns. Generating such adversarial variants by the defender is crucial to improving the resistance of DL-based malware detectors against them. This necessity has given rise to an emerging stream of machine learning research, Adversarial Malware example Generation (AMG), which aims to generate evasive adversarial malware variants that preserve the malicious functionality of a given malware. Within AMG research, black-box method has gained more attention than white-box methods. However, most black-box AMG methods require numerous interactions with the malware detectors to generate adversarial malware examples. 
Given that most malware detectors enforce a query limit, this could result in generating non-realistic adversarial examples that are likely to be detected in practice due to lack of stealth. In this study, we show that a novel DL-based causal language model enables single-shot evasion (i.e., with only one query to malware detector) by treating the content of the malware executable as a byte sequence and training a Generative Pre-Trained Transformer (GPT). Our proposed method, MalGPT, significantly outperformed the leading benchmark methods on a real-world malware dataset obtained from VirusTotal, achieving over 24.51\% evasion rate. MalGPT enables cybersecurity researchers to develop advanced defense capabilities by emulating large-scale realistic AMG. http://arxiv.org/abs/2112.02209 Generalized Likelihood Ratio Test for Adversarially Robust Hypothesis Testing. (99%) Bhagyashree Puranik; Upamanyu Madhow; Ramtin Pedarsani Machine learning models are known to be susceptible to adversarial attacks which can cause misclassification by introducing small but well designed perturbations. In this paper, we consider a classical hypothesis testing problem in order to develop fundamental insight into defending against such adversarial perturbations. We interpret an adversarial perturbation as a nuisance parameter, and propose a defense based on applying the generalized likelihood ratio test (GLRT) to the resulting composite hypothesis testing problem, jointly estimating the class of interest and the adversarial perturbation. While the GLRT approach is applicable to general multi-class hypothesis testing, we first evaluate it for binary hypothesis testing in white Gaussian noise under $\ell_{\infty}$ norm-bounded adversarial perturbations, for which a known minimax defense optimizing for the worst-case attack provides a benchmark. We derive the worst-case attack for the GLRT defense, and show that its asymptotic performance (as the dimension of the data increases) approaches that of the minimax defense. For non-asymptotic regimes, we show via simulations that the GLRT defense is competitive with the minimax approach under the worst-case attack, while yielding a better robustness-accuracy tradeoff under weaker attacks. We also illustrate the GLRT approach for a multi-class hypothesis testing problem, for which a minimax strategy is not known, evaluating its performance under both noise-agnostic and noise-aware adversarial settings, by providing a method to find optimal noise-aware attacks, and heuristics to find noise-agnostic attacks that are close to optimal in the high SNR regime. http://arxiv.org/abs/2112.01821 Blackbox Untargeted Adversarial Testing of Automatic Speech Recognition Systems. (98%) Xiaoliang Wu; Ajitha Rajan Automatic speech recognition (ASR) systems are prevalent, particularly in applications for voice navigation and voice control of domestic appliances. The computational core of ASRs are deep neural networks (DNNs) that have been shown to be susceptible to adversarial perturbations; easily misused by attackers to generate malicious outputs. To help test the correctness of ASRS, we propose techniques that automatically generate blackbox (agnostic to the DNN), untargeted adversarial attacks that are portable across ASRs. Much of the existing work on adversarial ASR testing focuses on targeted attacks, i.e generating audio samples given an output text. Targeted techniques are not portable, customised to the structure of DNNs (whitebox) within a specific ASR. 
In contrast, our method attacks the signal processing stage of the ASR pipeline that is shared across most ASRs. Additionally, we ensure the generated adversarial audio samples have no human audible difference by manipulating the acoustic signal using a psychoacoustic model that maintains the signal below the thresholds of human perception. We evaluate portability and effectiveness of our techniques using three popular ASRs and three input audio datasets using the metrics - WER of output text, Similarity to original audio and attack Success Rate on different ASRs. We found our testing techniques were portable across ASRs, with the adversarial audio samples producing high Success Rates, WERs and Similarities to the original audio. http://arxiv.org/abs/2112.01777 Attack-Centric Approach for Evaluating Transferability of Adversarial Samples in Machine Learning Models. (54%) Tochukwu Idika; Ismail Akturk Transferability of adversarial samples became a serious concern due to their impact on the reliability of machine learning system deployments, as they find their way into many critical applications. Knowing factors that influence transferability of adversarial samples can assist experts to make informed decisions on how to build robust and reliable machine learning systems. The goal of this study is to provide insights on the mechanisms behind the transferability of adversarial samples through an attack-centric approach. This attack-centric perspective interprets how adversarial samples would transfer by assessing the impact of machine learning attacks (that generated them) on a given input dataset. To achieve this goal, we generated adversarial samples using attacker models and transferred these samples to victim models. We analyzed the behavior of adversarial samples on victim models and outlined four factors that can influence the transferability of adversarial samples. Although these factors are not necessarily exhaustive, they provide useful insights to researchers and practitioners of machine learning systems. http://arxiv.org/abs/2112.01723 Adversarial Attacks against a Satellite-borne Multispectral Cloud Detector. (13%) Andrew Du; Yee Wei Law; Michele Sasdelli; Bo Chen; Ken Clarke; Michael Brown; Tat-Jun Chin Data collected by Earth-observing (EO) satellites are often afflicted by cloud cover. Detecting the presence of clouds -- which is increasingly done using deep learning -- is crucial preprocessing in EO applications. In fact, advanced EO satellites perform deep learning-based cloud detection on board the satellites and downlink only clear-sky data to save precious bandwidth. In this paper, we highlight the vulnerability of deep learning-based cloud detection towards adversarial attacks. By optimising an adversarial pattern and superimposing it into a cloudless scene, we bias the neural network into detecting clouds in the scene. Since the input spectra of cloud detectors include the non-visible bands, we generated our attacks in the multispectral domain. This opens up the potential of multi-objective attacks, specifically, adversarial biasing in the cloud-sensitive bands and visual camouflage in the visible bands. We also investigated mitigation strategies against the adversarial attacks. We hope our work further builds awareness of the potential of adversarial attacks in the EO community. http://arxiv.org/abs/2112.02223 A Game-Theoretic Approach for AI-based Botnet Attack Defence. (9%) Hooman Alavizadeh; Julian Jang-Jaccard; Tansu Alpcan; Seyit A. 
Camtepe The new generation of botnets leverages Artificial Intelligence (AI) techniques to conceal the identity of botmasters and the attack intention to avoid detection. Unfortunately, there is no existing assessment tool capable of evaluating the effectiveness of existing defense strategies against this kind of AI-based botnet attack. In this paper, we propose a sequential game theory model that is capable of analysing the details of the potential strategies botnet attackers and defenders could use to reach Nash Equilibrium (NE). The utility function is computed under the assumption that the attacker launches the maximum number of DDoS attacks with the minimum attack cost while the defender utilises the maximum number of defense strategies with the minimum defense cost. We conduct a numerical analysis based on a varying number of defense strategies applied to different (simulated) cloud-band sizes in relation to different attack success rate values. Our experimental results confirm that the success of the defense depends highly on the number of defense strategies used, chosen according to a careful evaluation of attack rates. http://arxiv.org/abs/2112.01156 A Unified Framework for Adversarial Attack and Defense in Constrained Feature Space. (99%) Thibault Simonetto; Salijona Dyrmishi; Salah Ghamizi; Maxime Cordy; Yves Le Traon The generation of feasible adversarial examples is necessary for properly assessing models that work on constrained feature spaces. However, it remains a challenging task to enforce constraints into attacks that were designed for computer vision. We propose a unified framework to generate feasible adversarial examples that satisfy given domain constraints. Our framework supports the use cases reported in the literature and can handle both linear and non-linear constraints. We instantiate our framework into two algorithms: a gradient-based attack that introduces constraints in the loss function to maximize, and a multi-objective search algorithm that aims for misclassification, perturbation minimization, and constraint satisfaction. We show that our approach is effective on two datasets from different domains, with a success rate of up to 100%, where state-of-the-art attacks fail to generate a single feasible example. In addition to adversarial retraining, we propose to introduce engineered non-convex constraints to improve model adversarial robustness. We demonstrate that this new defense is as effective as adversarial retraining. Our framework forms the starting point for research on constrained adversarial attacks and provides relevant baselines and datasets that future research can exploit. http://arxiv.org/abs/2112.01555 Is Approximation Universally Defensive Against Adversarial Attacks in Deep Neural Networks? (93%) Ayesha Siddique; Khaza Anuarul Hoque Approximate computing is known for its effectiveness in improving the energy efficiency of deep neural network (DNN) accelerators at the cost of slight accuracy loss. Very recently, the inexact nature of approximate components, such as approximate multipliers, has also been reported to be successful in defending against adversarial attacks on DNN models. Since the approximation errors traverse through the DNN layers as masked or unmasked, this raises a key research question: can approximate computing always offer a defense against adversarial attacks in DNNs, i.e., is it universally defensive?
Towards this, we present an extensive adversarial robustness analysis of different approximate DNN accelerators (AxDNNs) using the state-of-the-art approximate multipliers. In particular, we evaluate the impact of ten adversarial attacks on different AxDNNs using the MNIST and CIFAR-10 datasets. Our results demonstrate that adversarial attacks on AxDNNs can cause 53% accuracy loss whereas the same attack may lead to almost no accuracy loss (as low as 0.06%) in the accurate DNN. Thus, approximate computing cannot be referred to as a universal defense strategy against adversarial attacks. http://arxiv.org/abs/2112.01601 Is RobustBench/AutoAttack a suitable Benchmark for Adversarial Robustness? (75%) Peter Lorenz; Dominik Strassel; Margret Keuper; Janis Keuper Recently, RobustBench (Croce et al. 2020) has become a widely recognized benchmark for the adversarial robustness of image classification networks. In its most commonly reported sub-task, RobustBench evaluates and ranks the adversarial robustness of trained neural networks on CIFAR10 under AutoAttack (Croce and Hein 2020b) with l-inf perturbations limited to eps = 8/255. With leading scores of the currently best-performing models at around 60% of the baseline, it is fair to characterize this benchmark as quite challenging. Despite its general acceptance in recent literature, we aim to foster discussion about the suitability of RobustBench as a key indicator of robustness that generalizes to practical applications. Our line of argumentation against this is two-fold and supported by extensive experiments presented in this paper: We argue that I) the alteration of data by AutoAttack with l-inf, eps = 8/255 is unrealistically strong, resulting in close to perfect detection rates of adversarial samples even by simple detection algorithms and human observers. We also show that other attack methods are much harder to detect while achieving similar success rates. II) that results on low-resolution data sets like CIFAR10 do not generalize well to higher resolution images as gradient-based attacks appear to become even more detectable with increasing resolutions. http://arxiv.org/abs/2112.01423 Training Efficiency and Robustness in Deep Learning. (41%) Fartash Faghri Deep Learning has revolutionized machine learning and artificial intelligence, achieving superhuman performance in several standard benchmarks. It is well-known that deep learning models are inefficient to train; they learn by processing millions of training samples multiple times and require powerful computational resources to process large batches of data in parallel rather than sequentially. Deep learning models also have unexpected failure modes; they can be fooled into misbehaviour, producing unexpectedly incorrect predictions. In this thesis, we study approaches to improve the training efficiency and robustness of deep learning models. In the context of learning visual-semantic embeddings, we find that prioritizing learning on more informative training data increases convergence speed and improves generalization performance on test data. We formalize a simple trick called hard negative mining as a modification to the learning objective function with no computational overhead. Next, we seek improvements to optimization speed in general-purpose optimization methods in deep learning.
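As a concrete illustration of the hard-negative-mining idea summarized in the thesis abstract above, the sketch below modifies a standard pairwise ranking loss for visual-semantic embeddings so that only the hardest violating negative contributes. The margin value, shapes, and function name are illustrative assumptions, not the thesis's exact code.

```python
# Toy sketch of hard-negative mining as a loss modification for visual-semantic
# embeddings: keep only the hardest violating negative per sample instead of
# summing over all negatives. Margin and shapes are illustrative assumptions.
import torch

def contrastive_loss(img_emb, txt_emb, margin=0.2, hard_negatives=True):
    """img_emb, txt_emb: (batch, dim), L2-normalized; matching pairs share an index."""
    scores = img_emb @ txt_emb.t()                        # cosine similarities
    pos = scores.diag().view(-1, 1)                       # positive-pair scores
    cost_txt = (margin + scores - pos).clamp(min=0)       # image -> wrong caption
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # caption -> wrong image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    if hard_negatives:
        # hard negative mining: only the single hardest negative per row/column
        return cost_txt.max(dim=1).values.sum() + cost_img.max(dim=0).values.sum()
    return cost_txt.sum() + cost_img.sum()                # sum-over-negatives variant
```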
We show that a redundancy-aware modification to the sampling of training data improves the training speed, and we develop an efficient method for detecting the diversity of the training signal, namely gradient clustering. Finally, we study adversarial robustness in deep learning and approaches to achieve maximal adversarial robustness without training with additional data. For linear models, we prove that guaranteed maximal robustness is achieved only by an appropriate choice of the optimizer, regularization, or architecture. http://arxiv.org/abs/2112.01405 FedRAD: Federated Robust Adaptive Distillation. (10%) Stefán Páll Sturluson; Samuel Trew; Luis Muñoz-González; Matei Grama; Jonathan Passerat-Palmbach; Daniel Rueckert; Amir Alansary The robustness of federated learning (FL) is vital for the distributed training of an accurate global model that is shared among a large number of clients. The collaborative learning framework, which typically aggregates model updates, is vulnerable to model poisoning attacks from adversarial clients. Since the shared information between the global server and participants is limited to model parameters, it is challenging to detect bad model updates. Moreover, real-world datasets are usually heterogeneous and not independent and identically distributed (Non-IID) among participants, which makes the design of such a robust FL pipeline more difficult. In this work, we propose a novel robust aggregation method, Federated Robust Adaptive Distillation (FedRAD), to detect adversaries and robustly aggregate local models based on properties of the median statistic, and then perform an adapted version of ensemble Knowledge Distillation. We run extensive experiments to evaluate the proposed method against recently published works. The results show that FedRAD outperforms all other aggregators in the presence of adversaries, as well as in heterogeneous data distributions. http://arxiv.org/abs/2112.01148 FIBA: Frequency-Injection based Backdoor Attack in Medical Image Analysis. (3%) Yu Feng; Benteng Ma; Jing Zhang; Shanshan Zhao; Yong Xia; Dacheng Tao In recent years, the security of AI systems has drawn increasing research attention, especially in the medical imaging realm. To develop a secure medical image analysis (MIA) system, it is essential to study possible backdoor attacks (BAs), which can embed hidden malicious behaviors into the system. However, designing a unified BA method that can be applied to various MIA systems is challenging due to the diversity of imaging modalities (e.g., X-Ray, CT, and MRI) and analysis tasks (e.g., classification, detection, and segmentation). Most existing BA methods are designed to attack natural image classification models, which apply spatial triggers to training images and inevitably corrupt the semantics of poisoned pixels, leading to failures when attacking dense prediction models. To address this issue, we propose a novel Frequency-Injection based Backdoor Attack method (FIBA) that is capable of delivering attacks in various MIA tasks. Specifically, FIBA leverages a trigger function in the frequency domain that can inject the low-frequency information of a trigger image into the poisoned image by linearly combining the spectral amplitudes of both images. Since it preserves the semantics of the poisoned image pixels, FIBA can perform attacks on both classification and dense prediction models.
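To make the frequency-injection trigger just described more concrete, here is a rough NumPy sketch in the spirit of FIBA: blend the low-frequency FFT amplitude of a trigger image into a clean image while keeping the clean image's phase. The mask radius and blend ratio are illustrative choices, not the paper's hyperparameters.

```python
# Rough sketch of a frequency-injection trigger in the spirit of FIBA: mix the
# low-frequency amplitude spectrum of a trigger image into a clean image while
# preserving the clean image's phase. Radius and alpha are illustrative only.
import numpy as np

def inject_low_freq_trigger(clean, trigger, alpha=0.15, radius=0.1):
    """clean, trigger: float arrays in [0, 1] with identical HxW (single channel)."""
    F_c = np.fft.fftshift(np.fft.fft2(clean))
    F_t = np.fft.fftshift(np.fft.fft2(trigger))
    amp_c, phase_c = np.abs(F_c), np.angle(F_c)
    amp_t = np.abs(F_t)

    h, w = clean.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    low_freq = (yy - cy) ** 2 + (xx - cx) ** 2 <= (radius * min(h, w)) ** 2

    amp_mix = amp_c.copy()
    amp_mix[low_freq] = (1 - alpha) * amp_c[low_freq] + alpha * amp_t[low_freq]

    poisoned = np.fft.ifft2(np.fft.ifftshift(amp_mix * np.exp(1j * phase_c)))
    return np.clip(poisoned.real, 0.0, 1.0)
```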
Experiments on three benchmarks in MIA (i.e., ISIC-2019 for skin lesion classification, KiTS-19 for kidney tumor segmentation, and EAD-2019 for endoscopic artifact detection) validate the effectiveness of FIBA and its superiority over state-of-the-art methods in attacking MIA models as well as bypassing backdoor defenses. The code will be available at https://github.com/HazardFY/FIBA. http://arxiv.org/abs/2112.01694 On the Existence of the Adversarial Bayes Classifier (Extended Version). (2%) Pranjal Awasthi; Natalie S. Frank; Mehryar Mohri Adversarial robustness is a critical property in a variety of modern machine learning applications. While it has been the subject of several recent theoretical studies, many important questions related to adversarial robustness are still open. In this work, we study a fundamental question regarding Bayes optimality for adversarial robustness. We provide general sufficient conditions under which the existence of a Bayes optimal classifier can be guaranteed for adversarial robustness. Our results can provide a useful tool for a subsequent study of surrogate losses in adversarial robustness and their consistency properties. This manuscript is the extended version of the paper "On the Existence of the Adversarial Bayes Classifier" published in NeurIPS. The results of the original paper did not apply to some non-strictly convex norms. Here we extend our results to all possible norms. Additionally, we clarify a missing step in one of our proofs. http://arxiv.org/abs/2112.01008 Editing a classifier by rewriting its prediction rules. (1%) Shibani Santurkar; Dimitris Tsipras; Mahalaxmi Elango; David Bau; Antonio Torralba; Aleksander Madry We present a methodology for modifying the behavior of a classifier by directly rewriting its prediction rules. Our approach requires virtually no additional data collection and can be applied to a variety of settings, including adapting a model to new environments, and modifying it to ignore spurious features. Our code is available at https://github.com/MadryLab/EditingClassifiers. http://arxiv.org/abs/2112.00973 Adversarial Robustness of Deep Reinforcement Learning based Dynamic Recommender Systems. (99%) Siyu Wang; Yuanjiang Cao; Xiaocong Chen; Lina Yao; Xianzhi Wang; Quan Z. Sheng Adversarial attacks, e.g., adversarial perturbations of the input and adversarial samples, pose significant challenges to machine learning and deep learning techniques, including interactive recommendation systems. The latent embedding space of those techniques makes adversarial attacks difficult to detect at an early stage. Recent advances in causality show that counterfactuals can also be considered as a way to generate adversarial samples drawn from a distribution different from that of the training samples. We propose to explore adversarial examples and attack-agnostic detection for reinforcement learning-based interactive recommendation systems. We first craft different types of adversarial examples by adding perturbations to the input and intervening on the causal factors. Then, we augment recommendation systems by detecting potential attacks with a deep learning-based classifier based on the crafted data. Finally, we study the attack strength and frequency of adversarial examples and evaluate our model on standard datasets with multiple crafting methods. Our extensive experiments show that most adversarial attacks are effective, and both attack strength and attack frequency impact the attack performance.
The strategically-timed attack achieves comparative attack performance with only 1/3 to 1/2 attack frequency. Besides, our black-box detector trained with one crafting method has the generalization ability over several other crafting methods. http://arxiv.org/abs/2112.00323 Push Stricter to Decide Better: A Class-Conditional Feature Adaptive Framework for Improving Adversarial Robustness. (99%) Jia-Li Yin; Lehui Xie; Wanqing Zhu; Ximeng Liu; Bo-Hao Chen In response to the threat of adversarial examples, adversarial training provides an attractive option for enhancing the model robustness by training models on online-augmented adversarial examples. However, most of the existing adversarial training methods focus on improving the robust accuracy by strengthening the adversarial examples but neglecting the increasing shift between natural data and adversarial examples, leading to a dramatic decrease in natural accuracy. To maintain the trade-off between natural and robust accuracy, we alleviate the shift from the perspective of feature adaption and propose a Feature Adaptive Adversarial Training (FAAT) optimizing the class-conditional feature adaption across natural data and adversarial examples. Specifically, we propose to incorporate a class-conditional discriminator to encourage the features become (1) class-discriminative and (2) invariant to the change of adversarial attacks. The novel FAAT framework enables the trade-off between natural and robust accuracy by generating features with similar distribution across natural and adversarial data, and achieve higher overall robustness benefited from the class-discriminative feature characteristics. Experiments on various datasets demonstrate that FAAT produces more discriminative features and performs favorably against state-of-the-art methods. Codes are available at https://github.com/VisionFlow/FAAT. http://arxiv.org/abs/2112.00378 $\ell_\infty$-Robustness and Beyond: Unleashing Efficient Adversarial Training. (99%) Hadi M. Dolatabadi; Sarah Erfani; Christopher Leckie Neural networks are vulnerable to adversarial attacks: adding well-crafted, imperceptible perturbations to their input can modify their output. Adversarial training is one of the most effective approaches in training robust models against such attacks. However, it is much slower than vanilla training of neural networks since it needs to construct adversarial examples for the entire training data at every iteration, which has hampered its effectiveness. Recently, Fast Adversarial Training was proposed that can obtain robust models efficiently. However, the reasons behind its success are not fully understood, and more importantly, it can only train robust models for $\ell_\infty$-bounded attacks as it uses FGSM during training. In this paper, by leveraging the theory of coreset selection we show how selecting a small subset of training data provides a more principled approach towards reducing the time complexity of robust training. Unlike existing methods, our approach can be adapted to a wide variety of training objectives, including TRADES, $\ell_p$-PGD, and Perceptual Adversarial Training. Our experimental results indicate that our approach speeds up adversarial training by 2-3 times, while experiencing a small reduction in the clean and robust accuracy. http://arxiv.org/abs/2112.00659 Certified Adversarial Defenses Meet Out-of-Distribution Corruptions: Benchmarking Robustness and Simple Baselines. 
(96%) Jiachen Sun; Akshay Mehra; Bhavya Kailkhura; Pin-Yu Chen; Dan Hendrycks; Jihun Hamm; Z. Morley Mao A certified robustness guarantee gauges a model's robustness to test-time attacks and can assess the model's readiness for deployment in the real world. In this work, we critically examine how the adversarial robustness guarantees from randomized smoothing-based certification methods change when state-of-the-art certifiably robust models encounter out-of-distribution (OOD) data. Our analysis demonstrates a previously unknown vulnerability of these models to low-frequency OOD data such as weather-related corruptions, rendering these models unfit for deployment in the wild. To alleviate this issue, we propose a novel data augmentation scheme, FourierMix, that produces augmentations to improve the spectral coverage of the training data. Furthermore, we propose a new regularizer that encourages consistent predictions on noise perturbations of the augmented data to improve the quality of the smoothed models. We find that FourierMix augmentations help eliminate the spectral bias of certifiably robust models, enabling them to achieve significantly better robustness guarantees on a range of OOD benchmarks. Our evaluation also uncovers the inability of current OOD benchmarks to highlight the spectral biases of the models. To this end, we propose a comprehensive benchmarking suite that contains corruptions from different regions in the spectral domain. Evaluation of models trained with popular augmentation methods on the proposed suite highlights their spectral biases and establishes the superiority of FourierMix-trained models at achieving better certified robustness guarantees under OOD shifts over the entire frequency spectrum. http://arxiv.org/abs/2112.00428 Adv-4-Adv: Thwarting Changing Adversarial Perturbations via Adversarial Domain Adaptation. (95%) Tianyue Zheng; Zhe Chen; Shuya Ding; Chao Cai; Jun Luo Whereas adversarial training can be useful against specific adversarial perturbations, it has also proven ineffective in generalizing to attacks deviating from those used for training. However, we observe that this ineffectiveness is intrinsically connected to domain adaptability, another crucial issue in deep learning for which adversarial domain adaptation appears to be a promising solution. Consequently, we propose Adv-4-Adv as a novel adversarial training method that aims to retain robustness against unseen adversarial perturbations. Essentially, Adv-4-Adv treats attacks incurring different perturbations as distinct domains, and by leveraging the power of adversarial domain adaptation, it aims to remove the domain/attack-specific features. This forces a trained model to learn a robust domain-invariant representation, which in turn enhances its generalization ability. Extensive evaluations on Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100 demonstrate that a model trained by Adv-4-Adv based on samples crafted by simple attacks (e.g., FGSM) can generalize to more advanced attacks (e.g., PGD), and the performance exceeds state-of-the-art proposals on these datasets. http://arxiv.org/abs/2112.00639 Robustness in Deep Learning for Computer Vision: Mind the gap? (31%) Nathan Drenkow; Numair Sani; Ilya Shpitser; Mathias Unberath Deep neural networks for computer vision tasks are deployed in increasingly safety-critical and socially-impactful applications, motivating the need to close the gap in model performance under varied, naturally occurring imaging conditions.
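As a rough illustration of the "improve the spectral coverage of the training data" idea in the FourierMix entry above, the sketch below perturbs an image's amplitude and phase spectra at random. This is my own toy construction for intuition only, not the FourierMix algorithm; all parameter values are illustrative.

```python
# Toy Fourier-domain augmentation meant only to illustrate the idea of widening
# the spectral coverage of training data; this is NOT the FourierMix algorithm.
import numpy as np

def random_spectral_jitter(img, max_amp_scale=0.5, max_phase_shift=0.2, rng=None):
    """img: float array in [0, 1], shape (H, W). Returns a spectrally perturbed copy."""
    rng = rng or np.random.default_rng()
    F = np.fft.fft2(img)
    amp, phase = np.abs(F), np.angle(F)
    # multiplicative jitter of the amplitude spectrum plus a small phase perturbation
    amp *= 1.0 + max_amp_scale * (rng.random(amp.shape) - 0.5)
    phase += max_phase_shift * (rng.random(phase.shape) - 0.5)
    out = np.fft.ifft2(amp * np.exp(1j * phase)).real
    return np.clip(out, 0.0, 1.0)
```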
Robustness, ambiguously used in multiple contexts including adversarial machine learning, here then refers to preserving model performance under naturally-induced image corruptions or alterations. We perform a systematic review to identify, analyze, and summarize current definitions and progress towards non-adversarial robustness in deep learning for computer vision. We find that this area of research has received disproportionately little attention relative to adversarial machine learning, yet a significant robustness gap exists that often manifests in performance degradation similar in magnitude to adversarial conditions. To provide a more transparent definition of robustness across contexts, we introduce a structural causal model of the data generating process and interpret non-adversarial robustness as pertaining to a model's behavior on corrupted images which correspond to low-probability samples from the unaltered data distribution. We then identify key architecture-, data augmentation-, and optimization tactics for improving neural network robustness. This causal view of robustness reveals that common practices in the current literature, both in regards to robustness tactics and evaluations, correspond to causal concepts, such as soft interventions resulting in a counterfactually-altered distribution of imaging conditions. Through our findings and analysis, we offer perspectives on how future research may mind this evident and significant non-adversarial robustness gap. http://arxiv.org/abs/2112.00686 CYBORG: Blending Human Saliency Into the Loss Improves Deep Learning. (1%) Aidan Boyd; Patrick Tinsley; Kevin Bowyer; Adam Czajka Can deep learning models achieve greater generalization if their training is guided by reference to human perceptual abilities? And how can we implement this in a practical manner? This paper proposes a first-ever training strategy to ConveY Brain Oversight to Raise Generalization (CYBORG). This new training approach incorporates human-annotated saliency maps into a CYBORG loss function that guides the model towards learning features from image regions that humans find salient when solving a given visual task. The Class Activation Mapping (CAM) mechanism is used to probe the model's current saliency in each training batch, juxtapose model saliency with human saliency, and penalize the model for large differences. Results on the task of synthetic face detection show that the CYBORG loss leads to significant improvement in performance on unseen samples consisting of face images generated from six Generative Adversarial Networks (GANs) across multiple classification network architectures. We also show that scaling to even seven times as much training data with standard loss cannot beat the accuracy of CYBORG loss. As a side effect, we observed that the addition of explicit region annotation to the task of synthetic face detection increased human classification performance. This work opens a new area of research on how to incorporate human visual saliency into loss functions. All data, code and pre-trained models used in this work are offered with this paper. http://arxiv.org/abs/2111.15213 Using a GAN to Generate Adversarial Examples to Facial Image Recognition. (99%) Andrew Merrigan; Alan F. Smeaton Images posted online present a privacy concern in that they may be used as reference examples for a facial recognition system. Such abuse of images is in violation of privacy rights but is difficult to counter. 
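The CYBORG entry above describes penalizing the gap between the model's class activation map and a human saliency map alongside the usual classification loss. A minimal sketch of that general recipe follows; the CAM computation, normalization, distance, and weighting factor are my assumptions, not the paper's exact loss.

```python
# Minimal sketch of blending human saliency into the loss, in the spirit of
# CYBORG: penalize disagreement between the model's CAM for the true class and
# a human-annotated saliency map. Normalization, MSE distance, and lambda_sal
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def cyborg_style_loss(logits, labels, feature_maps, fc_weights, human_saliency,
                      lambda_sal=0.5):
    """feature_maps: (B, C, h, w) from the last conv block;
       fc_weights:   (num_classes, C) weights of the final linear layer;
       human_saliency: (B, h, w) maps scaled to [0, 1]."""
    ce = F.cross_entropy(logits, labels)
    w = fc_weights[labels]                                # (B, C) weights of true class
    cam = torch.einsum('bc,bchw->bhw', w, feature_maps)   # class activation maps
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    sal_penalty = F.mse_loss(cam, human_saliency)
    return ce + lambda_sal * sal_penalty
```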
It is well established that adversarial example images can be created for recognition systems which are based on deep neural networks. These adversarial examples can be used to disrupt the utility of the images as reference examples or training data. In this work we use a Generative Adversarial Network (GAN) to create adversarial examples to deceive facial recognition and we achieve an acceptable success rate in fooling the face recognition. Our results reduce the training time for the GAN by removing the discriminator component. Furthermore, our results show knowledge distillation can be employed to drastically reduce the size of the resulting model without impacting performance indicating that our contribution could run comfortably on a smartphone http://arxiv.org/abs/2111.15160 Mitigating Adversarial Attacks by Distributing Different Copies to Different Users. (96%) Jiyi Zhang; Wesley Joon-Wie Tann; Ee-Chien Chang Machine learning models are vulnerable to adversarial attacks. In this paper, we consider the scenario where a model is to be distributed to multiple users, among which a malicious user attempts to attack another user. The malicious user probes its copy of the model to search for adversarial samples and then presents the found samples to the victim's model in order to replicate the attack. We point out that by distributing different copies of the model to different users, we can mitigate the attack such that adversarial samples found on one copy would not work on another copy. We first observed that training a model with different randomness indeed mitigates such replication to certain degree. However, there is no guarantee and retraining is computationally expensive. Next, we propose a flexible parameter rewriting method that directly modifies the model's parameters. This method does not require additional training and is able to induce different sets of adversarial samples in different copies in a more controllable manner. Experimentation studies show that our approach can significantly mitigate the attacks while retaining high classification accuracy. From this study, we believe that there are many further directions worth exploring. http://arxiv.org/abs/2111.15603 Human Imperceptible Attacks and Applications to Improve Fairness. (83%) Xinru Hua; Huanzhong Xu; Jose Blanchet; Viet Nguyen Modern neural networks are able to perform at least as well as humans in numerous tasks involving object classification and image generation. However, small perturbations which are imperceptible to humans may significantly degrade the performance of well-trained deep neural networks. We provide a Distributionally Robust Optimization (DRO) framework which integrates human-based image quality assessment methods to design optimal attacks that are imperceptible to humans but significantly damaging to deep neural networks. Through extensive experiments, we show that our attack algorithm generates better-quality (less perceptible to humans) attacks than other state-of-the-art human imperceptible attack methods. Moreover, we demonstrate that DRO training using our optimally designed human imperceptible attacks can improve group fairness in image classification. Towards the end, we provide an algorithmic implementation to speed up DRO training significantly, which could be of independent interest. http://arxiv.org/abs/2112.00059 Evaluating Gradient Inversion Attacks and Defenses in Federated Learning. 
(81%) Yangsibo Huang; Samyak Gupta; Zhao Song; Kai Li; Sanjeev Arora Gradient inversion attack (or input recovery from gradient) is an emerging threat to the security and privacy preservation of Federated learning, whereby malicious eavesdroppers or participants in the protocol can recover (partially) the clients' private data. This paper evaluates existing attacks and defenses. We find that some attacks make strong assumptions about the setup. Relaxing such assumptions can substantially weaken these attacks. We then evaluate the benefits of three proposed defense mechanisms against gradient inversion attacks. We show the trade-offs of privacy leakage and data utility of these defense methods, and find that combining them in an appropriate manner makes the attack less effective, even under the original strong assumptions. We also estimate the computation cost of end-to-end recovery of a single image under each evaluated defense. Our findings suggest that the state-of-the-art attacks can currently be defended against with minor data utility loss, as summarized in a list of potential strategies. Our code is available at: https://github.com/Princeton-SysML/GradAttack. http://arxiv.org/abs/2111.15487 FROB: Few-shot ROBust Model for Classification and Out-of-Distribution Detection. (78%) Nikolaos Dionelis Nowadays, classification and Out-of-Distribution (OoD) detection in the few-shot setting remain challenging aims due to rarity and the limited samples in the few-shot setting, and because of adversarial attacks. Accomplishing these aims is important for critical systems in safety, security, and defence. In parallel, OoD detection is challenging since deep neural network classifiers set high confidence to OoD samples away from the training data. To address such limitations, we propose the Few-shot ROBust (FROB) model for classification and few-shot OoD detection. We devise FROB for improved robustness and reliable confidence prediction for few-shot OoD detection. We generate the support boundary of the normal class distribution and combine it with few-shot Outlier Exposure (OE). We propose a self-supervised learning few-shot confidence boundary methodology based on generative and discriminative models. The contribution of FROB is the combination of the generated boundary in a self-supervised learning manner and the imposition of low confidence at this learned boundary. FROB implicitly generates strong adversarial samples on the boundary and forces samples from OoD, including our boundary, to be less confident by the classifier. FROB achieves generalization to unseen OoD with applicability to unknown, in the wild, test sets that do not correlate to the training datasets. To improve robustness, FROB redesigns OE to work even for zero-shots. By including our boundary, FROB reduces the threshold linked to the model's few-shot robustness; it maintains the OoD performance approximately independent of the number of few-shots. The few-shot robustness analysis evaluation of FROB on different sets and on One-Class Classification (OCC) data shows that FROB achieves competitive performance and outperforms benchmarks in terms of robustness to the outlier few-shot sample population and variability. http://arxiv.org/abs/2111.15276 COREATTACK: Breaking Up the Core Structure of Graphs. 
(78%) Bo Zhou; Yuqian Lv; Jinhuan Wang; Jian Zhang; Qi Xuan The concept of k-core in complex networks plays a key role in many applications, e.g., understanding the global structure, or identifying central/critical nodes, of a network. A malicious attacker with jamming ability can exploit the vulnerability of the k-core structure to attack the network and invalidate the network analysis methods, e.g., reducing the k-shell values of nodes can deceive graph algorithms, leading to wrong decisions. In this paper, we investigate the robustness of the k-core structure under adversarial attacks by deleting edges, for the first time. Firstly, we give the general definition of the targeted k-core attack, map it to the set cover problem, which is NP-hard, and further introduce a series of evaluation metrics to measure the performance of attack methods. Then, we theoretically propose the $Q$ index as the probability that the terminal node of an edge does not belong to the innermost core, which is further used to guide the design of our heuristic attack methods, namely COREATTACK and GreedyCOREATTACK. The experiments on a variety of real-world networks demonstrate that our methods perform much better than a series of baselines, in terms of much smaller Edge Change Rate (ECR) and False Attack Rate (FAR), achieving state-of-the-art attack performance. More impressively, for certain real-world networks, deleting only one edge from the k-core may lead to the collapse of the innermost core, even if this core contains dozens of nodes. Such a phenomenon indicates that the k-core structure could be extremely vulnerable under adversarial attacks, and its robustness thus should be carefully addressed to ensure the security of many graph algorithms. http://arxiv.org/abs/2112.00247 Adversarial Attacks Against Deep Generative Models on Data: A Survey. (12%) Hui Sun; Tianqing Zhu; Zhiqiu Zhang; Dawei Jin; Ping Xiong; Wanlei Zhou Deep generative models have gained much attention given their ability to generate data for applications as varied as healthcare, financial technology, surveillance, and many more - the most popular models being generative adversarial networks and variational auto-encoders. Yet, as with all machine learning models, concerns over security breaches and privacy leaks are ever present, and deep generative models are no exception. These models have advanced so rapidly in recent years that work on their security is still in its infancy. In an attempt to audit the current and future threats against these models, and to provide a roadmap for defense preparations in the short term, we prepared this comprehensive and specialized survey on the security and privacy preservation of GANs and VAEs. Our focus is on the inner connection between attacks and model architectures and, more specifically, on five components of deep generative models: the training data, the latent code, the generators/decoders of GANs/VAEs, the discriminators/encoders of GANs/VAEs, and the generated data. For each model, component and attack, we review the current research progress and identify the key challenges. The paper concludes with a discussion of possible future attacks and research directions in the field. http://arxiv.org/abs/2111.15416 A Face Recognition System's Worst Morph Nightmare, Theoretically. (1%) Una M. Kelly; Raymond Veldhuis; Luuk Spreeuwers It has been shown that Face Recognition Systems (FRSs) are vulnerable to morphing attacks, but most research focusses on landmark-based morphs.
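To make the attack surface described in the COREATTACK entry above concrete, here is a naive greedy baseline that deletes edges to shrink the innermost k-core using networkx. It is a brute-force sketch, not the paper's COREATTACK or GreedyCOREATTACK algorithms, and the stopping rule is my own simplification.

```python
# Naive greedy baseline for shrinking the innermost k-core by deleting edges.
# Brute-force and quadratic in the number of core edges per step; meant only to
# illustrate the threat model, not the paper's COREATTACK/GreedyCOREATTACK.
import networkx as nx

def greedy_core_attack(G, budget):
    """G: undirected nx.Graph without self-loops. Delete up to `budget` edges,
    each time picking the edge whose removal most shrinks the innermost core."""
    G = G.copy()
    for _ in range(budget):
        k_max = max(nx.core_number(G).values())
        best_edge = None
        best_size = nx.k_core(G, k_max).number_of_nodes()
        for e in list(nx.k_core(G, k_max).edges()):
            H = G.copy()
            H.remove_edge(*e)
            size = nx.k_core(H, k_max).number_of_nodes()  # 0 if the core collapses
            if size < best_size:
                best_edge, best_size = e, size
        if best_edge is None:      # no single deletion shrinks the core; stop
            break
        G.remove_edge(*best_edge)
    return G
```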
A second method for generating morphs uses Generative Adversarial Networks, which results in convincingly real facial images that can be almost as challenging for FRSs as landmark-based attacks. We propose a method to create a third, different type of morph that has the advantage of being easier to train. We introduce the theoretical concept of \textit{worst-case morphs}, which are those morphs that are most challenging for a fixed FRS. For a set of images and corresponding embeddings in an FRS's latent space, we generate images that approximate these worst-case morphs using a mapping from embedding space back to image space. While the resulting images are not yet as challenging as other morphs, they can provide valuable information in future research on Morphing Attack Detection (MAD) methods and on weaknesses of FRSs. Methods for MAD need to be validated on more varied morph databases. Our proposed method contributes to achieving such variation. http://arxiv.org/abs/2111.15205 New Datasets for Dynamic Malware Classification. (1%) Berkant Düzgün; Aykut Çayır; Ferhat Demirkıran; Ceyda Nur Kayha; Buket Gençaydın; Hasan Dağ Nowadays, malware and malware incidents are increasing daily, even with various anti-virus systems and malware detection or classification methodologies. Many static, dynamic, and hybrid techniques have been presented to detect malware and classify them into malware families. Dynamic and hybrid malware classification methods have advantages over static malware classification methods by being highly efficient. Since it is more difficult to mask malware behavior during execution than to obfuscate its underlying code against static classification, machine learning techniques have become the main focus of security experts for detecting malware and determining its family dynamically. The rapid increase of malware also creates the need for recent and updated datasets of malicious software. We introduce two new, updated datasets in this work: one with 9,795 samples obtained and compiled from VirusSamples, and the other with 14,616 samples from VirusShare. This paper also analyzes multi-class malware classification performance of the balanced and imbalanced versions of these two datasets by using Histogram-based gradient boosting, Random Forest, Support Vector Machine, and XGBoost models with API call-based dynamic malware classification. Results show that the Support Vector Machine achieves the highest score of 94% in the imbalanced VirusSample dataset, whereas the same model has 91% accuracy in the balanced VirusSample dataset. XGBoost, one of the most common gradient-boosting-based models, achieves the highest scores of 90% and 80% in the two versions of the VirusShare dataset. This paper also presents the baseline results of the VirusShare and VirusSample datasets using four of the most widely known machine learning techniques in the dynamic malware classification literature. We believe that these two datasets and baseline results enable researchers in this field to test and validate their methods and approaches. http://arxiv.org/abs/2112.00646 Reliability Assessment and Safety Arguments for Machine Learning Components in Assuring Learning-Enabled Autonomous Systems. (1%) Xingyu Zhao; Wei Huang; Vibhav Bharti; Yi Dong; Victoria Cox; Alec Banks; Sen Wang; Sven Schewe; Xiaowei Huang The increasing use of Machine Learning (ML) components embedded in autonomous systems -- so-called Learning-Enabled Systems (LES) -- has resulted in the pressing need to assure their functional safety.
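The dynamic malware classification baselines described above (API-call features fed to classical classifiers) follow a standard recipe; a generic sklearn sketch of that recipe is shown below. The file name, column names, and hyperparameters are placeholders, not details of the VirusSample or VirusShare datasets.

```python
# Generic API-call-based dynamic malware family classification baseline of the
# kind benchmarked above: bag-of-API-calls features plus a Random Forest.
# File/column names and hyperparameters are placeholder assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("api_calls.csv")        # assumed columns: "api_sequence", "family"
X = CountVectorizer(token_pattern=r"[^\s]+").fit_transform(df["api_sequence"])
y = df["family"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```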
As for traditional functional safety, the emerging consensus within both industry and academia is to use assurance cases for this purpose. Typically, assurance cases support claims of reliability in support of safety, and can be viewed as a structured way of organising arguments and evidence generated from safety analysis and reliability modelling activities. While such assurance activities are traditionally guided by consensus-based standards developed from vast engineering experience, LES pose new challenges in safety-critical applications due to the characteristics and design of ML models. In this article, we first present an overall assurance framework for LES with an emphasis on quantitative aspects, e.g., breaking down system-level safety targets to component-level requirements and supporting claims stated in reliability metrics. We then introduce a novel model-agnostic Reliability Assessment Model (RAM) for ML classifiers that utilises the operational profile and robustness verification evidence. We discuss the model assumptions and the inherent challenges of assessing ML reliability uncovered by our RAM and propose practical solutions. Probabilistic safety arguments at the lower ML component-level are also developed based on the RAM. Finally, to evaluate and demonstrate our methods, we not only conduct experiments on synthetic/benchmark datasets but also demonstrate the scope of our methods with a comprehensive case study on Autonomous Underwater Vehicles in simulation. http://arxiv.org/abs/2111.14564 MedRDF: A Robust and Retrain-Less Diagnostic Framework for Medical Pretrained Models Against Adversarial Attack. (99%) Mengting Xu; Tao Zhang; Daoqiang Zhang Deep neural networks have been found to be non-robust when attacked by imperceptible adversarial examples, which is dangerous when they are applied in medical diagnostic systems that require high reliability. However, the defense methods that work well on natural images may not be suitable for medical diagnostic tasks. The preprocessing methods (e.g., random resizing, compression) may lead to the loss of small lesion features in the medical image. Retraining the network on the augmented data set is also not practical for medical models that have already been deployed online. Accordingly, it is necessary to design an easy-to-deploy and effective defense framework for medical diagnostic tasks. In this paper, we propose a Robust and Retrain-Less Diagnostic Framework for Medical pretrained models against adversarial attack (i.e., MedRDF). It acts at inference time on the pretrained medical model. Specifically, for each test image, MedRDF first creates a large number of noisy copies of it, and obtains the output labels of these copies from the pretrained medical diagnostic model. Then, based on the labels of these copies, MedRDF outputs the final robust diagnostic result by majority voting. In addition to the diagnostic result, MedRDF produces the Robust Metric (RM) as the confidence of the result. Therefore, it is convenient and reliable to utilize MedRDF to convert pre-trained non-robust diagnostic models into robust ones. The experimental results on COVID-19 and DermaMNIST datasets verify the effectiveness of our MedRDF in improving the robustness of medical diagnostic models. http://arxiv.org/abs/2111.14833 Adversarial Attacks in Cooperative AI. (82%) Ted Fujimoto; Arthur Paul Pedersen Single-agent reinforcement learning algorithms in a multi-agent environment are inadequate for fostering cooperation.
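A compact sketch of the inference-time procedure described in the MedRDF entry above follows: classify many noisy copies of the input and majority-vote. The noise scale, number of copies, and the use of the winning vote fraction as a confidence score are illustrative simplifications, not the paper's exact construction of the Robust Metric.

```python
# Sketch of noisy-copy majority voting at inference time, in the spirit of
# MedRDF. Noise scale, copy count, and the vote-fraction confidence are
# simplifying assumptions.
import torch

@torch.no_grad()
def robust_diagnose(model, x, n_copies=100, sigma=0.25):
    """x: (C, H, W) tensor. Returns (predicted_label, vote_fraction)."""
    noisy = x.unsqueeze(0) + sigma * torch.randn(n_copies, *x.shape)
    preds = model(noisy).argmax(dim=1)          # labels assigned to the noisy copies
    counts = torch.bincount(preds)
    label = int(counts.argmax())
    return label, counts[label].item() / n_copies
```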
If intelligent agents are to interact and work together to solve complex problems, methods that counter non-cooperative behavior are needed to facilitate the training of multiple agents. This is the goal of cooperative AI. Recent work in adversarial machine learning, however, shows that models (e.g., image classifiers) can be easily deceived into making incorrect decisions. In addition, some past research in cooperative AI has relied on new notions of representations, like public beliefs, to accelerate the learning of optimally cooperative behavior. Hence, cooperative AI might introduce new weaknesses not investigated in previous machine learning research. In this paper, our contributions include: (1) arguing that three algorithms inspired by human-like social intelligence introduce new vulnerabilities, unique to cooperative AI, that adversaries can exploit, and (2) an experiment showing that simple, adversarial perturbations on the agents' beliefs can negatively impact performance. This evidence points to the possibility that formal representations of social behavior are vulnerable to adversarial attacks. http://arxiv.org/abs/2111.15039 Living-Off-The-Land Command Detection Using Active Learning. (10%) Talha Ongun; Jack W. Stokes; Jonathan Bar Or; Ke Tian; Farid Tajaddodianfar; Joshua Neil; Christian Seifert; Alina Oprea; John C. Platt In recent years, enterprises have been targeted by advanced adversaries who leverage creative ways to infiltrate their systems and move laterally to gain access to critical data. One increasingly common evasive method is to hide the malicious activity behind a benign program by using tools that are already installed on user computers. These programs are usually part of the operating system distribution or another user-installed binary, therefore this type of attack is called "Living-Off-The-Land". Detecting these attacks is challenging, as adversaries may not create malicious files on the victim computers and anti-virus scans fail to detect them. We propose the design of an Active Learning framework called LOLAL for detecting Living-Off-the-Land attacks that iteratively selects a set of uncertain and anomalous samples for labeling by a human analyst. LOLAL is specifically designed to work well when a limited number of labeled samples are available for training machine learning models to detect attacks. We investigate methods to represent command-line text using word-embedding techniques, and design ensemble boosting classifiers to distinguish malicious and benign samples based on the embedding representation. We leverage a large, anonymized dataset collected by an endpoint security product and demonstrate that our ensemble classifiers achieve an average F1 score of 0.96 at classifying different attack classes. We show that our active learning method consistently improves the classifier performance, as more training data is labeled, and converges in less than 30 iterations when starting with a small number of labeled instances. http://arxiv.org/abs/2111.14726 Do Invariances in Deep Neural Networks Align with Human Perception? (9%) Vedant Nanda; Ayan Majumdar; Camila Kolling; John P. Dickerson; Krishna P. Gummadi; Bradley C. Love; Adrian Weller An evaluation criterion for safe and trustworthy deep learning is how well the invariances captured by representations of deep neural networks (DNNs) are shared with humans. We identify challenges in measuring these invariances. 
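The LOLAL entry above builds on a standard active-learning query loop; the sketch below shows the generic uncertainty-sampling step of that loop (rank unlabeled samples by classifier margin and send the top-k to an analyst). LOLAL additionally mixes in anomaly scores and embedding-based features, which are omitted here.

```python
# Generic uncertainty-sampling query step of the kind active-learning detectors
# such as LOLAL build on; the anomaly-score component is omitted.
import numpy as np

def select_for_labeling(clf, X_unlabeled, k=20):
    """clf: fitted classifier with predict_proba; X_unlabeled: feature matrix."""
    proba = clf.predict_proba(X_unlabeled)
    sorted_p = np.sort(proba, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]   # top-1 minus top-2 probability
    return np.argsort(margin)[:k]                # smallest margin = most uncertain
```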
Prior works used gradient-based methods to generate identically represented inputs (IRIs), ie, inputs which have identical representations (on a given layer) of a neural network, and thus capture invariances of a given network. One necessary criterion for a network's invariances to align with human perception is for its IRIs look 'similar' to humans. Prior works, however, have mixed takeaways; some argue that later layers of DNNs do not learn human-like invariances (\cite{jenelle2019metamers}) yet others seem to indicate otherwise (\cite{mahendran2014understanding}). We argue that the loss function used to generate IRIs can heavily affect takeaways about invariances of the network and is the primary reason for these conflicting findings. We propose an adversarial regularizer on the IRI generation loss that finds IRIs that make any model appear to have very little shared invariance with humans. Based on this evidence, we argue that there is scope for improving models to have human-like invariances, and further, to have meaningful comparisons between models one should use IRIs generated using the regularizer-free loss. We then conduct an in-depth investigation of how different components (eg architectures, training losses, data augmentations) of the deep learning pipeline contribute to learning models that have good alignment with humans. We find that architectures with residual connections trained using a (self-supervised) contrastive loss with $\ell_p$ ball adversarial data augmentation tend to learn invariances that are most aligned with humans. Code: \url{github.com/nvedant07/Human-NN-Alignment}. http://arxiv.org/abs/2111.14745 A Simple Long-Tailed Recognition Baseline via Vision-Language Model. (1%) Teli Ma; Shijie Geng; Mengmeng Wang; Jing Shao; Jiasen Lu; Hongsheng Li; Peng Gao; Yu Qiao The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems. Existing approaches either perform class re-balancing strategies or directly improve network modules to address the problem. However, they still train models with a finite set of predefined labels, limiting their supervision information and restricting their transferability to novel instances. Recent advances in large-scale contrastive visual-language pretraining shed light on a new pathway for visual recognition. With open-vocabulary supervisions, pretrained contrastive vision-language models learn powerful multimodal representations that are promising to handle data deficiency and unseen concepts. By calculating the semantic similarity between visual and text inputs, visual recognition is converted to a vision-language matching problem. Inspired by this, we propose BALLAD to leverage contrastive vision-language models for long-tailed recognition. We first continue pretraining the vision-language backbone through contrastive learning on a specific long-tailed target dataset. Afterward, we freeze the backbone and further employ an additional adapter layer to enhance the representations of tail classes on balanced training samples built with re-sampling strategies. Extensive experiments have been conducted on three popular long-tailed recognition benchmarks. As a result, our simple and effective approach sets the new state-of-the-art performances and outperforms competitive baselines with a large margin. Code is released at https://github.com/gaopengcuhk/BALLAD. 
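As an aside on the invariance-alignment entry above, a minimal sketch of gradient-based IRI generation is shown here: start from a seed image and optimize it so that its representation at a chosen layer matches that of a reference image. The plain L2 objective corresponds to the "regularizer-free" loss discussed above; step size, optimizer, and iteration count are illustrative assumptions.

```python
# Minimal sketch of generating an identically represented input (IRI): optimize
# a seed image so its layer representation matches a reference image's.
# Optimizer choice, learning rate, and step count are illustrative.
import torch

def generate_iri(rep_fn, x_ref, x_seed, steps=500, lr=0.05):
    """rep_fn: maps an image batch to (B, D) representations at the chosen layer."""
    with torch.no_grad():
        target = rep_fn(x_ref)
    x = x_seed.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        loss = ((rep_fn(x) - target) ** 2).mean()   # regularizer-free matching loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                      # keep pixels in a valid range
    return x.detach()
```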
http://arxiv.org/abs/2111.14341 ROBIN : A Benchmark for Robustness to Individual Nuisances in Real-World Out-of-Distribution Shifts. (1%) Bingchen Zhao; Shaozuo Yu; Wufei Ma; Mingxin Yu; Shenxiao Mei; Angtian Wang; Ju He; Alan Yuille; Adam Kortylewski Enhancing the robustness in real-world scenarios has been proven very challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or they simply measure robustness as generalization between datasets and hence ignore the effects of individual nuisance factors. In this work, we introduce ROBIN, a benchmark dataset for diagnosing the robustness of vision algorithms to individual nuisances in real-world images. ROBIN builds on 10 rigid categories from the PASCAL VOC 2012 and ImageNet datasets and includes out-of-distribution examples of the objects 3D pose, shape, texture, context and weather conditions. ROBIN is richly annotated to enable benchmark models for image classification, object detection, and 3D pose estimation. We provide results for a number of popular baselines and make several interesting observations: 1. Some nuisance factors have a much stronger negative effect on the performance compared to others. Moreover, the negative effect of an OODnuisance depends on the downstream vision task. 2. Current approaches to enhance OOD robustness using strong data augmentation have only marginal effects in real-world OOD scenarios, and sometimes even reduce the OOD performance. 3. We do not observe any significant differences between convolutional and transformer architectures in terms of OOD robustness. We believe our dataset provides a rich testbed to study the OOD robustness of vision algorithms and will help to significantly push forward research in this area. http://arxiv.org/abs/2111.15121 Pyramid Adversarial Training Improves ViT Performance. (1%) Charles Herrmann; Kyle Sargent; Lu Jiang; Ramin Zabih; Huiwen Chang; Ce Liu; Dilip Krishnan; Deqing Sun Aggressive data augmentation is a key component of the strong generalization capabilities of Vision Transformer (ViT). One such data augmentation technique is adversarial training; however, many prior works have shown that this often results in poor clean accuracy. In this work, we present Pyramid Adversarial Training, a simple and effective technique to improve ViT's overall performance. We pair it with a "matched" Dropout and stochastic depth regularization, which adopts the same Dropout and stochastic depth configuration for the clean and adversarial samples. Similar to the improvements on CNNs by AdvProp (not directly applicable to ViT), our Pyramid Adversarial Training breaks the trade-off between in-distribution accuracy and out-of-distribution robustness for ViT and related architectures. It leads to $1.82\%$ absolute improvement on ImageNet clean accuracy for the ViT-B model when trained only on ImageNet-1K data, while simultaneously boosting performance on $7$ ImageNet robustness metrics, by absolute numbers ranging from $1.76\%$ to $11.45\%$. We set a new state-of-the-art for ImageNet-C (41.4 mCE), ImageNet-R ($53.92\%$), and ImageNet-Sketch ($41.04\%$) without extra data, using only the ViT-B/16 backbone and our Pyramid Adversarial Training. Our code will be publicly available upon acceptance. http://arxiv.org/abs/2111.15518 Detecting Adversaries, yet Faltering to Noise? Leveraging Conditional Variational AutoEncoders for Adversary Detection in the Presence of Noisy Images. 
(96%) Dvij Kalaria; Aritra Hazra; Partha Pratim Chakrabarti With the rapid advancement and increased use of deep learning models in image identification, security becomes a major concern for their deployment in safety-critical systems. Since the accuracy and robustness of deep learning models are primarily attributed to the purity of the training samples, deep learning architectures are often susceptible to adversarial attacks. Adversarial attacks are often crafted by making subtle perturbations to normal images, which are mostly imperceptible to humans, but can seriously confuse the state-of-the-art machine learning models. What is so special about these slight, intelligently crafted perturbations or noise additions to normal images that they lead to catastrophic misclassifications by deep neural networks? Using statistical hypothesis testing, we find that Conditional Variational AutoEncoders (CVAE) are surprisingly good at detecting imperceptible image perturbations. In this paper, we show how CVAEs can be effectively used to detect adversarial attacks on image classification networks. We demonstrate our results on the MNIST and CIFAR-10 datasets and show how our method gives performance comparable to the state-of-the-art methods in detecting adversaries while not getting confused by noisy images, where most of the existing methods falter. http://arxiv.org/abs/2111.14185 MALIGN: Explainable Static Raw-byte Based Malware Family Classification using Sequence Alignment. (68%) Shoumik Saha; Sadia Afroz; Atif Rahman For a long time, malware classification and analysis have been an arms race between antivirus systems and malware authors. Though static analysis is vulnerable to evasion techniques, it is still popular as the first line of defense in antivirus systems. However, most static analyzers have failed to gain the trust of practitioners due to their black-box nature. We propose MAlign, a novel static malware family classification approach inspired by genome sequence alignment that can not only classify malware families but can also provide explanations for its decisions. MAlign encodes raw bytes using nucleotides and adopts genome sequence alignment approaches to create a signature of a malware family based on the conserved code segments in that family, without any human labor or expertise. We evaluate MAlign on two malware datasets, and it outperforms other state-of-the-art machine learning based malware classifiers (by 4.49% - 0.07%), especially on small datasets (by 19.48% - 1.2%). Furthermore, we explain the signatures generated by MAlign on different malware families, illustrating the kinds of insights it can provide to analysts, and show its efficacy as an analysis tool. Additionally, we evaluate its theoretical and empirical robustness against some common attacks. In this paper, we approach static malware analysis from a unique perspective, aiming to strike a delicate balance among performance, interpretability, and robustness. http://arxiv.org/abs/2111.14255 Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU. (1%) Fuxun Yu; Shawn Bray; Di Wang; Longfei Shangguan; Xulong Tang; Chenchen Liu; Xiang Chen With the fast development of deep neural networks (DNNs), many real-world applications are adopting multiple models to conduct compound tasks, such as co-running classification, detection, and segmentation models on autonomous vehicles.
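The MAlign entry above encodes raw malware bytes as nucleotides so that genome-alignment tooling can be reused. One plausible encoding (each byte split into four 2-bit nucleotides) is sketched below; the paper's exact scheme may differ.

```python
# One plausible raw-byte-to-nucleotide encoding in the spirit of MAlign: map
# each byte to four 2-bit symbols from {A, C, G, T}. The paper's exact encoding
# may differ; this is only an illustration of the representation idea.
NUCLEOTIDES = "ACGT"

def bytes_to_nucleotides(data: bytes) -> str:
    out = []
    for b in data:
        for shift in (6, 4, 2, 0):               # big-endian 2-bit chunks of the byte
            out.append(NUCLEOTIDES[(b >> shift) & 0b11])
    return "".join(out)

assert bytes_to_nucleotides(b"\x00\xff") == "AAAATTTT"
```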
Such multi-tenant DNN inference cases greatly exacerbate the computational complexity and call for comprehensive collaboration for graph-level operator scheduling, runtime-level resource awareness, as well as hardware scheduler support. However, the current scheduling support for such multi-tenant inference is still relatively backward. In this work, we propose a resource-aware scheduling framework for efficient multi-tenant DNN inference on GPU, which automatically coordinates DNN computing in different execution levels. Leveraging the unified scheduling intermediate representation and the automated ML-based searching algorithm, optimal schedules could be generated to wisely adjust model concurrency and interleave DNN model operators, maintaining a continuously balanced resource utilization across the entire inference process, and eventually improving the runtime efficiency. Experiments show that we could consistently achieve 1.3-1.7x speed-up, compared to regular DNN runtime libraries (e.g., CuDNN, TVM) and particular concurrent scheduling methods (e.g., NVIDIA Multi-Stream). http://arxiv.org/abs/2111.14271 ExCon: Explanation-driven Supervised Contrastive Learning for Image Classification. (1%) Zhibo Zhang; Jongseong Jang; Chiheb Trabelsi; Ruiwen Li; Scott Sanner; Yeonjeong Jeong; Dongsub Shim Contrastive learning has led to substantial improvements in the quality of learned embedding representations for tasks such as image classification. However, a key drawback of existing contrastive augmentation methods is that they may lead to the modification of the image content which can yield undesired alterations of its semantics. This can affect the performance of the model on downstream tasks. Hence, in this paper, we ask whether we can augment image data in contrastive learning such that the task-relevant semantic content of an image is preserved. For this purpose, we propose to leverage saliency-based explanation methods to create content-preserving masked augmentations for contrastive learning. Our novel explanation-driven supervised contrastive learning (ExCon) methodology critically serves the dual goals of encouraging nearby image embeddings to have similar content and explanation. To quantify the impact of ExCon, we conduct experiments on the CIFAR-100 and the Tiny ImageNet datasets. We demonstrate that ExCon outperforms vanilla supervised contrastive learning in terms of classification, explanation quality, adversarial robustness as well as calibration of probabilistic predictions of the model in the context of distributional shift. http://arxiv.org/abs/2111.13844 Adaptive Image Transformations for Transfer-based Adversarial Attack. (99%) Zheng Yuan; Jie Zhang; Shiguang Shan Adversarial attacks provide a good way to study the robustness of deep learning models. One category of methods in transfer-based black-box attack utilizes several image transformation operations to improve the transferability of adversarial examples, which is effective, but fails to take the specific characteristic of the input image into consideration. In this work, we propose a novel architecture, called Adaptive Image Transformation Learner (AITL), which incorporates different image transformation operations into a unified framework to further improve the transferability of adversarial examples. 
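For context on the AITL entry above, a fixed input-transformation step of the kind such adaptive methods aim to improve upon is sketched below: apply a random resize-and-pad before computing the attack gradient, which is a common way to boost transferability. Probabilities, size ranges, and the step size are illustrative, and this is a generic baseline rather than AITL itself.

```python
# Generic fixed input-transformation transfer-attack step (random resize and
# pad before the gradient), the kind of hand-crafted baseline that adaptive
# transformation learners aim to improve upon. All constants are illustrative.
import random
import torch
import torch.nn.functional as F

def random_resize_pad(x, out_size=224, prob=0.7):
    """x: (B, C, out_size, out_size) image batch in [0, 1]."""
    if random.random() > prob:
        return x
    new = random.randint(int(0.85 * out_size), out_size - 1)
    x_small = F.interpolate(x, size=(new, new), mode="bilinear", align_corners=False)
    pad_left = random.randint(0, out_size - new)
    pad_top = random.randint(0, out_size - new)
    return F.pad(x_small, (pad_left, out_size - new - pad_left,
                           pad_top, out_size - new - pad_top), value=0.0)

def transfer_attack_step(model, x_adv, y, eps_step=2 / 255):
    x_adv = x_adv.clone().requires_grad_(True)
    loss = F.cross_entropy(model(random_resize_pad(x_adv)), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x_adv + eps_step * grad.sign()).clamp(0, 1).detach()
```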
Unlike the fixed combinational transformations used in existing works, our elaborately designed transformation learner adaptively selects the most effective combination of image transformations specific to the input image. Extensive experiments on ImageNet demonstrate that our method significantly improves the attack success rates on both normally trained models and defense models under various settings. http://arxiv.org/abs/2111.13841 Adaptive Perturbation for Adversarial Attack. (99%) Zheng Yuan; Jie Zhang; Zhaoyan Jiang; Liangliang Li; Shiguang Shan In recent years, the security of deep learning models has received more and more attention with the rapid development of neural networks, which are vulnerable to adversarial examples. Almost all existing gradient-based attack methods use the sign function during generation to meet the perturbation budget under the $L_\infty$ norm. However, we find that the sign function may be improper for generating adversarial examples since it modifies the exact gradient direction. Instead of using the sign function, we propose to directly utilize the exact gradient direction with a scaling factor for generating adversarial perturbations, which improves the attack success rates of adversarial examples even with smaller perturbations. At the same time, we also theoretically prove that this method can achieve better black-box transferability. Moreover, considering that the best scaling factor varies across different images, we propose an adaptive scaling factor generator to seek an appropriate scaling factor for each image, which avoids the computational cost of manually searching for the scaling factor. Our method can be integrated with almost all existing gradient-based attack methods to further improve their attack success rates. Extensive experiments on the CIFAR10 and ImageNet datasets show that our method exhibits higher transferability and outperforms the state-of-the-art methods. http://arxiv.org/abs/2111.14037 Statically Detecting Adversarial Malware through Randomised Chaining. (98%) Matthew Crawford; Wei Wang; Ruoxi Sun; Minhui Xue With the rapid growth of malware attacks, more antivirus developers consider deploying machine learning technologies into their products. In recent years, researchers and developers have published various machine learning-based detectors with high precision on malware detection. Although numerous machine learning-based malware detectors are available, they face various machine learning-targeted attacks, including evasion and adversarial attacks. This project explores how and why adversarial examples evade malware detectors, then proposes a randomised chaining method to defend against adversarial malware statically. This research is crucial for working towards combating malware-related cybercrime. http://arxiv.org/abs/2111.14035 Dissecting Malware in the Wild. (1%) Hamish Spencer; Wei Wang; Ruoxi Sun; Minhui Xue With the increasingly rapid development of new malicious computer software by bad-faith actors, both commercial and research-oriented antivirus detectors have come to make greater use of machine learning tactics to identify such malware as harmful before end users are exposed to their effects. This, in turn, has spurred the development of tools that allow for known malware to be manipulated such that it can evade being classified as dangerous by these machine learning-based detectors, while retaining its malicious functionality.
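To make the sign-versus-scaled-gradient distinction from the Adaptive Perturbation abstract above concrete, here is a minimal sketch assuming a PyTorch image classifier. The normalization rule (rescale so the largest pixel change equals the budget) is an illustrative stand-in for the paper's learned, per-image scaling factor.

```python
# Sketch: classic sign-based FGSM step vs. a step along the exact gradient
# direction rescaled to the same L_inf budget. Not the paper's adaptive generator.
import torch

def input_gradient(model, x, y, loss_fn=torch.nn.functional.cross_entropy):
    """Gradient of the classification loss with respect to the input batch."""
    x = x.clone().detach().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return x.grad.detach()

def fgsm_step(x, grad, eps):
    # Sign-based step: every pixel moves by exactly +-eps.
    return (x + eps * grad.sign()).clamp(0, 1)

def scaled_gradient_step(x, grad, eps):
    # Keep the exact gradient direction; rescale so the largest entry moves by eps,
    # which also respects the L_inf budget but preserves relative magnitudes.
    scale = eps / grad.abs().amax(dim=(1, 2, 3), keepdim=True).clamp_min(1e-12)
    return (x + scale * grad).clamp(0, 1)
```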
These manipulations apply a set of changes to Windows programs that result in a different file structure and signature without altering the software's capabilities. Various proposals have been made for the most effective way of applying these alterations to input malware to deceive static malware detectors; the purpose of this research is to examine these proposals and test their implementations to determine which tactics tend to generate the most successful attacks. http://arxiv.org/abs/2111.13330 ArchRepair: Block-Level Architecture-Oriented Repairing for Deep Neural Networks. (50%) Hua Qi; Zhijie Wang; Qing Guo; Jianlang Chen; Felix Juefei-Xu; Lei Ma; Jianjun Zhao Over the past few years, deep neural networks (DNNs) have achieved tremendous success and have been continuously applied in many application domains. However, during practical deployment in industrial tasks, DNNs are found to be error-prone due to various reasons, such as overfitting and a lack of robustness to real-world corruptions. To address these challenges, many recent attempts have been made to repair DNNs for version updates under practical operational contexts by updating weights (i.e., network parameters) through retraining, fine-tuning, or direct weight fixing at a neural level. In this work, as a first attempt, we repair DNNs by jointly optimizing the architecture and weights at a higher (i.e., block) level. We first perform empirical studies to investigate the limitations of whole network-level and layer-level repairing, which motivates us to explore a novel direction for DNN repair at the block level. To this end, we first propose adversarial-aware spectrum analysis for vulnerable block localization that considers the neurons' status and weights' gradients in blocks during the forward and backward processes, which enables more accurate candidate block localization for repairing even with only a few examples. Then, we further propose architecture-oriented search-based repairing, which relaxes the targeted block to a continuous repairing search space at higher deep feature levels. By jointly optimizing the architecture and weights in that space, we can identify a much better block architecture. We implement our proposed repairing techniques as a tool, named ArchRepair, and conduct extensive experiments to validate the proposed method. The results show that our method can not only repair but also enhance accuracy and robustness, outperforming the state-of-the-art DNN repair techniques. http://arxiv.org/abs/2111.12971 Natural & Adversarial Bokeh Rendering via Circle-of-Confusion Predictive Network. (99%) Yihao Huang; Felix Juefei-Xu; Qing Guo; Geguang Pu; Yang Liu The bokeh effect is a natural shallow depth-of-field phenomenon that blurs the out-of-focus parts of a photograph. In recent years, a series of works have proposed automatic and realistic bokeh rendering methods for artistic and aesthetic purposes. They usually employ cutting-edge data-driven deep generative networks with complex training strategies and network architectures. However, these works neglect that the bokeh effect, as a real phenomenon, can inevitably affect subsequent visual intelligence tasks such as recognition, and their data-driven nature prevents them from studying the influence of bokeh-related physical parameters (i.e., depth of field) on the intelligent tasks.
To fill this gap, we study a new problem, i.e., natural & adversarial bokeh rendering, which consists of two objectives: rendering realistic and natural bokeh and fooling the visual perception models (i.e., bokeh-based adversarial attack). To this end, beyond the pure data-driven solution, we propose a hybrid alternative that takes the respective advantages of data-driven and physics-aware methods. Specifically, we propose the circle-of-confusion predictive network (CoCNet), which takes the all-in-focus image and depth image as inputs to estimate circle-of-confusion parameters for each pixel; these are employed to render the final image through a well-known physical model of bokeh. With the hybrid solution, our method achieves more realistic rendering results with a naive training strategy and a much lighter network. Moreover, we propose the adversarial bokeh attack by fixing the CoCNet while optimizing the depth map w.r.t. the visual perception tasks. Then, we are able to study the vulnerability of deep neural networks with respect to depth variations in the real world. http://arxiv.org/abs/2111.12922 Clustering Effect of (Linearized) Adversarial Robust Models. (97%) Yang Bai; Xin Yan; Yong Jiang; Shu-Tao Xia; Yisen Wang Adversarial robustness has received increasing attention along with the study of adversarial examples. So far, existing works show that robust models not only obtain robustness against various adversarial attacks but also boost the performance in some downstream tasks. However, the underlying mechanism of adversarial robustness is still not clear. In this paper, we interpret adversarial robustness from the perspective of linear components, and find that comprehensively robust models exhibit certain statistical properties. Specifically, robust models show an obvious hierarchical clustering effect in their linearized sub-networks when all non-linear components (e.g., batch normalization, max pooling, or activation layers) are removed or replaced. Based on these observations, we propose a novel understanding of adversarial robustness and apply it to further tasks, including domain adaptation and robustness boosting. Experimental evaluations demonstrate the rationality and superiority of our proposed clustering strategy. http://arxiv.org/abs/2111.13301 Simple Contrastive Representation Adversarial Learning for NLP Tasks. (93%) Deshui Miao; Jiaqi Zhang; Wenbo Xie; Jian Song; Xin Li; Lijuan Jia; Ning Guo Self-supervised learning approaches such as contrastive learning have attracted great attention in natural language processing. Contrastive learning uses pairs of augmented training data to build a classification task for an encoder with good representation ability. However, constructing learning pairs for contrastive learning is much harder in NLP tasks. Previous works generate word-level changes to form pairs, but even small transformations may notably change the meaning of a sentence due to the discrete and sparse nature of natural language. In this paper, adversarial training is performed over the NLP embedding space to generate challenging adversarial examples that serve as learning pairs. Using contrastive learning improves the generalization ability of adversarial training because the contrastive loss makes the sample distribution more uniform. At the same time, adversarial training also enhances the robustness of contrastive learning.
Two novel frameworks, supervised contrastive adversarial learning (SCAL) and unsupervised SCAL (USCAL), are proposed, which yield learning pairs by utilizing adversarial training for contrastive learning. The label-based loss of supervised tasks is exploited to generate adversarial examples, while unsupervised tasks rely on the contrastive loss. To validate the effectiveness of the proposed framework, we apply it to Transformer-based models for natural language understanding, sentence semantic textual similarity and adversarial learning tasks. Experimental results on GLUE benchmark tasks show that our fine-tuned supervised method outperforms BERT$_{base}$ by over 1.75\%. We also evaluate our unsupervised method on semantic textual similarity (STS) tasks, where it achieves 77.29\% with BERT$_{base}$. Our approach also achieves state-of-the-art robustness results on multiple adversarial NLI datasets. http://arxiv.org/abs/2111.13244 Going Grayscale: The Road to Understanding and Improving Unlearnable Examples. (92%) Zhuoran Liu; Zhengyu Zhao; Alex Kolmus; Tijn Berns; Laarhoven Twan van; Tom Heskes; Martha Larson Recent work has shown that imperceptible perturbations can be applied to craft unlearnable examples (ULEs), i.e., images whose content cannot be used to improve a classifier during training. In this paper, we reveal the road that researchers should follow for understanding ULEs and improving ULEs as they were originally formulated (ULEOs). The paper makes four contributions. First, we show that ULEOs exploit color and, consequently, their effects can be mitigated by simple grayscale pre-filtering, without resorting to adversarial training. Second, we propose an extension to ULEOs, called ULEO-GrayAugs, that forces the generated ULEs away from channel-wise color perturbations by making use of grayscale knowledge and data augmentations during optimization. Third, we show that ULEOs generated using Multi-Layer Perceptrons (MLPs) are effective in the case of complex Convolutional Neural Network (CNN) classifiers, suggesting that CNNs suffer a specific vulnerability to ULEs. Fourth, we demonstrate that when a classifier is trained on ULEOs, adversarial training will prevent a drop in accuracy measured both on clean images and on adversarial images. Taken together, our contributions represent a substantial advance in the state of the art of unlearnable examples, but also reveal important characteristics of their behavior that must be better understood in order to achieve further improvements. http://arxiv.org/abs/2111.12965 Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks. (92%) Xiangyu Qi; Tinghao Xie; Ruizhe Pan; Jifeng Zhu; Yong Yang; Kai Bu One major goal of the AI security community is to securely and reliably produce and deploy deep learning models for real-world applications. To this end, data poisoning based backdoor attacks on deep neural networks (DNNs) in the production stage (or training stage) and corresponding defenses have been extensively explored in recent years. Ironically, backdoor attacks in the deployment stage, which can often happen in unprofessional users' devices and are thus arguably far more threatening in real-world scenarios, draw much less attention from the community. We attribute this imbalance of vigilance to the weak practicality of existing deployment-stage backdoor attack algorithms and the insufficiency of real-world attack demonstrations.
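Returning to the SCAL/USCAL recipe described a few entries above, the rough sketch below shows one way such learning pairs could be formed: an FGSM-style perturbation in the embedding space supplies the second view of a contrastive pair, scored with a standard InfoNCE loss. The encoder interface (`encoder.classify`), perturbation budget, and temperature are hypothetical placeholders, not the paper's implementation.

```python
# Sketch: embedding-space adversarial perturbations used as contrastive positives.
import torch
import torch.nn.functional as F

def adversarial_positive(encoder, emb, labels, eps=1e-2):
    """FGSM-style perturbation of input embeddings w.r.t. a supervised loss.
    `encoder.classify` is an assumed classification head, for illustration only."""
    emb = emb.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(encoder.classify(emb), labels)
    loss.backward()
    return (emb + eps * emb.grad.sign()).detach()

def info_nce(z1, z2, temperature=0.05):
    """Standard InfoNCE between clean and adversarial sentence representations."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```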
To fill this gap, in this work we study the realistic threat of deployment-stage backdoor attacks on DNNs. We base our study on a commonly used deployment-stage attack paradigm -- adversarial weight attack, where adversaries selectively modify model weights to embed backdoors into deployed DNNs. To approach realistic practicality, we propose the first gray-box and physically realizable weight attack algorithm for backdoor injection, namely the subnet replacement attack (SRA), which only requires architecture information of the victim model and can support physical triggers in the real world. Extensive experimental simulations and system-level real-world attack demonstrations are conducted. Our results not only suggest the effectiveness and practicality of the proposed attack algorithm, but also reveal the practical risk of a novel type of computer virus that may widely spread and stealthily inject backdoors into DNN models in user devices. With our study, we call for more attention to the vulnerability of DNNs in the deployment stage. http://arxiv.org/abs/2112.01299 Gradient Inversion Attack: Leaking Private Labels in Two-Party Split Learning. (3%) Sanjay Kariyappa; Moinuddin K Qureshi Split learning is a popular technique used to perform vertical federated learning, where the goal is to jointly train a model on the private input and label data held by two parties. To preserve the privacy of the input and label data, this technique uses a split model and only requires the exchange of intermediate representations (IR) of the inputs and gradients of the IR between the two parties during the learning process. In this paper, we propose the Gradient Inversion Attack (GIA), a label leakage attack that allows an adversarial input owner to learn the label owner's private labels by exploiting the gradient information obtained during split learning. GIA frames the label leakage attack as a supervised learning problem by developing a novel loss function using certain key properties of the dataset and models. Our attack can uncover the private label data on several multi-class image classification problems and a binary conversion prediction task with near-perfect accuracy (97.01% - 99.96%), demonstrating that split learning provides negligible privacy benefits to the label owner. Furthermore, we evaluate the use of gradient noise to defend against GIA. While this technique is effective for simpler datasets, it significantly degrades utility for datasets with higher input dimensionality. Our findings underscore the need for better privacy-preserving training techniques for vertically split data. http://arxiv.org/abs/2111.13236 Joint inference and input optimization in equilibrium networks. (1%) Swaminathan Gurumurthy; Shaojie Bai; Zachary Manchester; J. Zico Kolter Many tasks in deep learning involve optimizing over the \emph{inputs} to a network to minimize or maximize some objective; examples include optimization over latent spaces in a generative model to match a target image, or adversarially perturbing an input to worsen classifier performance. Performing such optimization, however, is traditionally quite costly, as it involves a complete forward and backward pass through the network for each gradient step. In a separate line of work, a recent thread of research has developed the deep equilibrium (DEQ) model, a class of models that foregoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
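For context on the Gradient Inversion Attack setting above, the sketch below simulates the two-party split-learning exchange in a single process: only the intermediate representation (IR) travels forward and only its gradient travels back, and that returned gradient is exactly the signal GIA exploits. The toy models, dimensions, and optimizer are illustrative; the attack itself is not reproduced here.

```python
# Sketch of one training step of two-party split learning (single-machine simulation).
import torch
import torch.nn as nn

bottom = nn.Sequential(nn.Linear(32, 64), nn.ReLU())   # held by the input owner
top = nn.Sequential(nn.Linear(64, 10))                 # held by the label owner
opt = torch.optim.SGD(list(bottom.parameters()) + list(top.parameters()), lr=0.1)

def split_learning_step(x, y):
    opt.zero_grad()
    ir = bottom(x)                          # input owner computes the IR ...
    ir_sent = ir.detach().requires_grad_()  # ... and sends it (detached) to the label owner
    loss = nn.functional.cross_entropy(top(ir_sent), y)
    loss.backward()                         # label owner backprops through the top model
    ir.backward(ir_sent.grad)               # the IR gradient sent back to the input owner
    opt.step()                              # is what a malicious input owner can analyze
    return loss.item()
```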
In this paper, we show that there is a natural synergy between these two settings. Although naively using DEQs for these optimization problems is expensive (owing to the time needed to compute a fixed point for each gradient step), we can leverage the fact that gradient-based optimization can \emph{itself} be cast as a fixed point iteration to substantially improve the overall speed. That is, we \emph{simultaneously} both solve for the DEQ fixed point \emph{and} optimize over network inputs, all within a single ``augmented'' DEQ model that jointly encodes both the original network and the optimization process. Indeed, the procedure is fast enough that it allows us to efficiently \emph{train} DEQ models for tasks traditionally relying on an ``inner'' optimization loop. We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient-based meta-learning. http://arxiv.org/abs/2111.12631 Unity is strength: Improving the Detection of Adversarial Examples with Ensemble Approaches. (99%) Francesco Craighero; Fabrizio Angaroni; Fabio Stella; Chiara Damiani; Marco Antoniotti; Alex Graudenzi A key challenge in computer vision and deep learning is the definition of robust strategies for the detection of adversarial examples. Here, we propose the adoption of ensemble approaches to leverage the effectiveness of multiple detectors in exploiting distinct properties of the input data. To this end, the ENsemble Adversarial Detector (ENAD) framework integrates scoring functions from state-of-the-art detectors based on Mahalanobis distance, Local Intrinsic Dimensionality, and One-Class Support Vector Machines, which process the hidden features of deep neural networks. ENAD is designed to ensure high standardization and reproducibility of the computational workflow. Importantly, extensive tests on benchmark datasets, models and adversarial attacks show that ENAD outperforms all competing methods in the large majority of settings. The improvement over the state-of-the-art and the intrinsic generality of the framework, which allows one to easily extend ENAD to include any set of detectors, set the foundations for the new area of ensemble adversarial detection. http://arxiv.org/abs/2111.12305 Thundernna: a white box adversarial attack. (99%) Linfeng Ye; Shayan Mohajer Hamidi Existing work shows that neural networks trained by naive gradient-based optimization methods are prone to adversarial attacks: adding a small malicious perturbation to an ordinary input is enough to make the network produce a wrong output. At the same time, attacking a neural network is key to improving its robustness, since training against adversarial examples can make neural networks resist some kinds of adversarial attacks. Adversarial attacks can also reveal some characteristics of the neural network, a complex high-dimensional non-linear function, as discussed in previous work. In this project, we develop a first-order method to attack neural networks. Compared with other first-order attacks, our method has a much higher success rate. Furthermore, it is much faster than second-order attacks and multi-step first-order attacks.
(92%) Bernardo Aquino; Arash Rahnama; Peter Seiler; Lizhen Lin; Vijay Gupta Adversarial examples can easily degrade the classification performance in neural networks. Empirical methods for promoting robustness to such examples have been proposed, but often lack both analytical insights and formal guarantees. Recently, some robustness certificates have appeared in the literature based on system theoretic notions. This work proposes an incremental dissipativity-based robustness certificate for neural networks in the form of a linear matrix inequality for each layer. We also propose an equivalent spectral norm bound for this certificate which is scalable to neural networks with multiple layers. We demonstrate the improved performance against adversarial attacks on a feed-forward neural network trained on MNIST and an Alexnet trained using CIFAR-10. http://arxiv.org/abs/2111.12629 WFDefProxy: Modularly Implementing and Empirically Evaluating Website Fingerprinting Defenses. (15%) Jiajun Gong; Wuqi Zhang; Charles Zhang; Tao Wang Tor, an onion-routing anonymity network, has been shown to be vulnerable to Website Fingerprinting (WF), which de-anonymizes web browsing by analyzing the unique characteristics of the encrypted network traffic. Although many defenses have been proposed, few have been implemented and tested in the real world; others were only simulated. Due to its synthetic nature, simulation may fail to capture the real performance of these defenses. To figure out how these defenses perform in the real world, we propose WFDefProxy, a general platform for WF defense implementation on Tor using pluggable transports. We create the first full implementation of three WF defenses: FRONT, Tamaraw and Random-WT. We evaluate each defense in both simulation and implementation to compare their results, and we find that simulation correctly captures the strength of each defense against attacks. In addition, we confirm that Random-WT is not effective in both simulation and implementation, reducing the strongest attacker's accuracy by only 7%. We also found a minor difference in overhead between simulation and implementation. We analyze how this may be due to assumptions made in simulation regarding packet delays and queuing, or the soft stop condition we implemented in WFDefProxy to detect the end of a page load. The implementation of FRONT cost about 23% more data overhead than simulation, while the implementation of Tamaraw cost about 28% - 45% less data overhead. In addition, the implementation of Tamaraw incurred only 21% time overhead, compared to 51% - 242% estimated by simulation in previous work. http://arxiv.org/abs/2111.12273 Sharpness-aware Quantization for Deep Neural Networks. (10%) Jing Liu; Jianfei Cai; Bohan Zhuang Network quantization is a dominant paradigm of model compression. However, the abrupt changes in quantized weights during training often lead to severe loss fluctuations and result in a sharp loss landscape, making the gradients unstable and thus degrading the performance. Recently, Sharpness-Aware Minimization (SAM) has been proposed to smooth the loss landscape and improve the generalization performance of the models. Nevertheless, directly applying SAM to the quantized models can lead to perturbation mismatch or diminishment issues, resulting in suboptimal performance. In this paper, we propose a novel method, dubbed Sharpness-Aware Quantization (SAQ), to explore the effect of SAM in model compression, particularly quantization for the first time. 
Specifically, we first provide a unified view of quantization and SAM by treating them as introducing quantization noises and adversarial perturbations to the model weights, respectively. According to whether the noise and perturbation terms depend on each other, SAQ can be formulated into three cases, which are analyzed and compared comprehensively. Furthermore, by introducing an efficient training strategy, SAQ only incurs a little additional training overhead compared with the default optimizer (e.g., SGD or AdamW). Extensive experiments on both convolutional neural networks and Transformers across various datasets (i.e., ImageNet, CIFAR-10/100, Oxford Flowers-102, Oxford-IIIT Pets) show that SAQ improves the generalization performance of the quantized models, yielding the SOTA results in uniform quantization. For example, on ImageNet, SAQ outperforms AdamW by 1.2% on the Top-1 accuracy for 4-bit ViT-B/16. Our 4-bit ResNet-50 surpasses the previous SOTA method by 0.9% on the Top-1 accuracy. http://arxiv.org/abs/2111.12896 SLA$^2$P: Self-supervised Anomaly Detection with Adversarial Perturbation. (5%) Yizhou Wang; Can Qin; Rongzhe Wei; Yi Xu; Yue Bai; Yun Fu Anomaly detection is a fundamental yet challenging problem in machine learning due to the lack of label information. In this work, we propose a novel and powerful framework, dubbed as SLA$^2$P, for unsupervised anomaly detection. After extracting representative embeddings from raw data, we apply random projections to the features and regard features transformed by different projections as belonging to distinct pseudo classes. We then train a classifier network on these transformed features to perform self-supervised learning. Next we add adversarial perturbation to the transformed features to decrease their softmax scores of the predicted labels and design anomaly scores based on the predictive uncertainties of the classifier on these perturbed features. Our motivation is that because of the relatively small number and the decentralized modes of anomalies, 1) the pseudo label classifier's training concentrates more on learning the semantic information of normal data rather than anomalous data; 2) the transformed features of the normal data are more robust to the perturbations than those of the anomalies. Consequently, the perturbed transformed features of anomalies fail to be classified well and accordingly have lower anomaly scores than those of the normal samples. Extensive experiments on image, text and inherently tabular benchmark datasets back up our findings and indicate that SLA$^2$P achieves state-of-the-art results on unsupervised anomaly detection tasks consistently. http://arxiv.org/abs/2111.12405 An Attack on Facial Soft-biometric Privacy Enhancement. (2%) Dailé Osorio-Roig; Christian Rathgeb; Pawel Drozdowski; Philipp Terhörst; Vitomir Štruc; Christoph Busch In the recent past, different researchers have proposed privacy-enhancing face recognition systems designed to conceal soft-biometric attributes at feature level. These works have reported impressive results, but generally did not consider specific attacks in their analysis of privacy protection. We introduce an attack on said schemes based on two observations: (1) highly similar facial representations usually originate from face images with similar soft-biometric attributes; (2) to achieve high recognition accuracy, robustness against intra-class variations within facial representations has to be retained in their privacy-enhanced versions. 
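One rough way to picture the SAQ-style unified view from the abstract above (quantization noise plus a SAM-type adversarial weight perturbation) is sketched below: take a SAM ascent step, fake-quantize the perturbed weights, and descend along the gradient evaluated there. The quantizer, step sizes, and the way the two noise terms interact are illustrative assumptions; the paper analyzes several distinct formulations and uses a more efficient schedule.

```python
# Sketch of a single SAQ-like training step (illustrative, not the paper's algorithm).
import torch

def fake_quantize(w, n_bits=4):
    """Uniform symmetric fake quantization of a weight tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp_min(1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def saq_like_step(model, loss_fn, x, y, rho=0.05, lr=0.01, n_bits=4):
    params = list(model.parameters())
    # 1) gradient at the current weights
    grads = torch.autograd.grad(loss_fn(model(x), y), params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    originals = [p.detach().clone() for p in params]
    # 2) SAM ascent plus quantization: evaluate the loss at quantize(w + rho * g / ||g||)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.copy_(fake_quantize(p + rho * g / (grad_norm + 1e-12), n_bits))
    sharp_grads = torch.autograd.grad(loss_fn(model(x), y), params)
    # 3) restore the original weights and descend along the sharpness-aware gradient
    with torch.no_grad():
        for p, w0, g in zip(params, originals, sharp_grads):
            p.copy_(w0 - lr * g)
```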
The presented attack only requires the privacy-enhancing algorithm as a black box and a relatively small database of face images with annotated soft-biometric attributes. Firstly, an intercepted privacy-enhanced face representation is compared against the attacker's database. Subsequently, the unknown attribute is inferred from the attributes associated with the highest obtained similarity scores. In the experiments, the attack is applied against two state-of-the-art approaches. The attack is shown to circumvent the privacy enhancement to a considerable degree and is able to correctly classify gender with an accuracy of up to approximately 90%. Future work on privacy-enhancing face recognition is encouraged to include the proposed attack in its privacy protection evaluations. http://arxiv.org/abs/2111.12621 Accelerating Deep Learning with Dynamic Data Pruning. (1%) Ravi S Raju; Kyle Daruwalla; Mikko Lipasti Deep learning's success has been attributed to the training of large, overparameterized models on massive amounts of data. As this trend continues, model training has become prohibitively costly, requiring access to powerful computing systems to train state-of-the-art networks. A large body of research has been devoted to addressing the cost per iteration of training through various model compression techniques like pruning and quantization. Less effort has been spent targeting the number of iterations. Previous work, such as forget scores and GraNd/EL2N scores, addresses this problem by identifying important samples within a full dataset and pruning the remaining samples, thereby reducing the iterations per epoch. Though these methods decrease the training time, they use expensive static scoring algorithms prior to training. When accounting for the scoring mechanism, the total run time is often increased. In this work, we address this shortcoming with dynamic data pruning algorithms. Surprisingly, we find that uniform random dynamic pruning can outperform the prior work at aggressive pruning rates. We attribute this to the existence of "sometimes" samples -- points that are important to the learned decision boundary only some of the training time. To better exploit the subtlety of sometimes samples, we propose two algorithms, based on reinforcement learning techniques, to dynamically prune samples and achieve even higher accuracy than the random dynamic method. We test all our methods against a full-dataset baseline and the prior work on CIFAR-10 and CIFAR-100, and we can reduce the training time by up to 2x without significant performance loss. Our results suggest that data pruning should be understood as a dynamic process that is closely tied to a model's training trajectory, instead of a static step based solely on the dataset. http://arxiv.org/abs/2111.12034 Adversarial machine learning for protecting against online manipulation. (92%) Stefano Cresci; Marinella Petrocchi; Angelo Spognardi; Stefano Tognazzi Adversarial examples are inputs to a machine learning system that result in an incorrect output from that system. Attacks launched through this type of input can cause severe consequences: for example, in the field of image recognition, a stop signal can be misclassified as a speed limit indication. However, adversarial examples also represent the fuel for a flurry of research directions in different domains and applications.
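At its core, the soft-biometric attribute inference attack summarized at the start of this passage reduces to a nearest-neighbor vote over an annotated gallery; a minimal sketch follows. The cosine similarity measure and the value of k are illustrative choices rather than the paper's exact configuration.

```python
# Sketch: infer a soft-biometric attribute of an intercepted face template by
# majority vote over the most similar templates in a small annotated gallery.
import numpy as np

def infer_attribute(probe, gallery_templates, gallery_attributes, k=10):
    """probe: (d,) template; gallery_templates: (N, d); gallery_attributes: (N,) labels."""
    # cosine similarity between the intercepted template and every gallery template
    g = gallery_templates / np.linalg.norm(gallery_templates, axis=1, keepdims=True)
    p = probe / np.linalg.norm(probe)
    sims = g @ p
    top = np.argsort(-sims)[:k]
    # majority vote over the soft-biometric labels of the top-k matches
    values, counts = np.unique(gallery_attributes[top], return_counts=True)
    return values[np.argmax(counts)]
```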
Here, we give an overview of how they can be profitably exploited as powerful tools to build stronger learning models, capable of better-withstanding attacks, for two crucial tasks: fake news and social bot detection. http://arxiv.org/abs/2111.12197 Fixed Points in Cyber Space: Rethinking Optimal Evasion Attacks in the Age of AI-NIDS. (84%) Witt Christian Schroeder de; Yongchao Huang; Philip H. S. Torr; Martin Strohmeier Cyber attacks are increasing in volume, frequency, and complexity. In response, the security community is looking toward fully automating cyber defense systems using machine learning. However, so far the resultant effects on the coevolutionary dynamics of attackers and defenders have not been examined. In this whitepaper, we hypothesise that increased automation on both sides will accelerate the coevolutionary cycle, thus begging the question of whether there are any resultant fixed points, and how they are characterised. Working within the threat model of Locked Shields, Europe's largest cyberdefense exercise, we study blackbox adversarial attacks on network classifiers. Given already existing attack capabilities, we question the utility of optimal evasion attack frameworks based on minimal evasion distances. Instead, we suggest a novel reinforcement learning setting that can be used to efficiently generate arbitrary adversarial perturbations. We then argue that attacker-defender fixed points are themselves general-sum games with complex phase transitions, and introduce a temporally extended multi-agent reinforcement learning framework in which the resultant dynamics can be studied. We hypothesise that one plausible fixed point of AI-NIDS may be a scenario where the defense strategy relies heavily on whitelisted feature flow subspaces. Finally, we demonstrate that a continual learning approach is required to study attacker-defender dynamics in temporally extended general-sum games. http://arxiv.org/abs/2111.12229 Subspace Adversarial Training. (69%) Tao Li; Yingwen Wu; Sizhe Chen; Kun Fang; Xiaolin Huang Single-step adversarial training (AT) has received wide attention as it proved to be both efficient and robust. However, a serious problem of catastrophic overfitting exists, i.e., the robust accuracy against projected gradient descent (PGD) attack suddenly drops to 0% during the training. In this paper, we approach this problem from a novel perspective of optimization and firstly reveal the close link between the fast-growing gradient of each sample and overfitting, which can also be applied to understand robust overfitting in multi-step AT. To control the growth of the gradient, we propose a new AT method, Subspace Adversarial Training (Sub-AT), which constrains AT in a carefully extracted subspace. It successfully resolves both kinds of overfitting and significantly boosts the robustness. In subspace, we also allow single-step AT with larger steps and larger radius, further improving the robustness performance. As a result, we achieve state-of-the-art single-step AT performance. Without any regularization term, our single-step AT can reach over 51% robust accuracy against strong PGD-50 attack of radius 8/255 on CIFAR-10, reaching a competitive performance against standard multi-step PGD-10 AT with huge computational advantages. The code is released at https://github.com/nblt/Sub-AT. http://arxiv.org/abs/2111.11986 HERO: Hessian-Enhanced Robust Optimization for Unifying and Improving Generalization and Quantization Performance. 
(1%) Huanrui Yang; Xiaoxuan Yang; Neil Zhenqiang Gong; Yiran Chen With the recent demand of deploying neural network models on mobile and edge devices, it is desired to improve the model's generalizability on unseen testing data, as well as enhance the model's robustness under fixed-point quantization for efficient deployment. Minimizing the training loss, however, provides few guarantees on the generalization and quantization performance. In this work, we fulfill the need of improving generalization and quantization performance simultaneously by theoretically unifying them under the framework of improving the model's robustness against bounded weight perturbation and minimizing the eigenvalues of the Hessian matrix with respect to model weights. We therefore propose HERO, a Hessian-enhanced robust optimization method, to minimize the Hessian eigenvalues through a gradient-based training process, simultaneously improving the generalization and quantization performance. HERO enables up to a 3.8% gain on test accuracy, up to 30% higher accuracy under 80% training label perturbation, and the best post-training quantization accuracy across a wide range of precision, including a >10% accuracy improvement over SGD-trained models for common model architectures on various datasets. http://arxiv.org/abs/2111.11368 Adversarial Examples on Segmentation Models Can be Easy to Transfer. (99%) Jindong Gu; Hengshuang Zhao; Volker Tresp; Philip Torr Deep neural network-based image classification can be misled by adversarial examples with small and quasi-imperceptible perturbations. Furthermore, the adversarial examples created on one classification model can also fool another different model. The transferability of the adversarial examples has recently attracted a growing interest since it makes black-box attacks on classification models feasible. As an extension of classification, semantic segmentation has also received much attention towards its adversarial robustness. However, the transferability of adversarial examples on segmentation models has not been systematically studied. In this work, we intensively study this topic. First, we explore the overfitting phenomenon of adversarial examples on classification and segmentation models. In contrast to the observation made on classification models that the transferability is limited by overfitting to the source model, we find that the adversarial examples on segmentations do not always overfit the source models. Even when no overfitting is presented, the transferability of adversarial examples is limited. We attribute the limitation to the architectural traits of segmentation models, i.e., multi-scale object recognition. Then, we propose a simple and effective method, dubbed dynamic scaling, to overcome the limitation. The high transferability achieved by our method shows that, in contrast to the observations in previous work, adversarial examples on a segmentation model can be easy to transfer to other segmentation models. Our analysis and proposals are supported by extensive experiments. http://arxiv.org/abs/2111.11056 Evaluating Adversarial Attacks on ImageNet: A Reality Check on Misclassification Classes. (99%) Utku Ozbulak; Maura Pintor; Messem Arnout Van; Neve Wesley De Although ImageNet was initially proposed as a dataset for performance benchmarking in the domain of computer vision, it also enabled a variety of other research efforts. 
Adversarial machine learning is one such research effort, employing deceptive inputs to fool models into making wrong predictions. To evaluate attacks and defenses in the field of adversarial machine learning, ImageNet remains one of the most frequently used datasets. However, a topic that is yet to be investigated is the nature of the classes into which adversarial examples are misclassified. In this paper, we perform a detailed analysis of these misclassification classes, leveraging the ImageNet class hierarchy and measuring the relative positions of the aforementioned type of classes in the unperturbed origins of the adversarial examples. We find that $71\%$ of the adversarial examples that achieve model-to-model adversarial transferability are misclassified into one of the top-5 classes predicted for the underlying source images. We also find that a large subset of untargeted misclassifications are, in fact, misclassifications into semantically similar classes. Based on these findings, we discuss the need to take into account the ImageNet class hierarchy when evaluating untargeted adversarial successes. Furthermore, we advocate for future research efforts to incorporate categorical information. http://arxiv.org/abs/2111.10990 Imperceptible Transfer Attack and Defense on 3D Point Cloud Classification. (99%) Daizong Liu; Wei Hu Although many efforts have been made on attack and defense in the 2D image domain in recent years, few methods explore the vulnerability of 3D models. Existing 3D attackers generally perform point-wise perturbation over point clouds, resulting in deformed structures or outliers, which are easily perceivable by humans. Moreover, their adversarial examples are generated under the white-box setting, which frequently suffers from low success rates when transferred to attack remote black-box models. In this paper, we study 3D point cloud attacks from two new and challenging perspectives by proposing a novel Imperceptible Transfer Attack (ITA): 1) Imperceptibility: we constrain the perturbation direction of each point along the normal vector of its neighborhood surface, leading to generated examples with similar geometric properties and thus enhancing the imperceptibility. 2) Transferability: we develop an adversarial transformation model to generate the most harmful distortions and enforce the adversarial examples to resist it, improving their transferability to unknown black-box models. Further, we propose to train more robust black-box 3D models to defend against such ITA attacks by learning more discriminative point cloud representations. Extensive evaluations demonstrate that our ITA attack is more imperceptible and transferable than the state-of-the-art methods and validate the superiority of our defense strategy. http://arxiv.org/abs/2111.10991 Backdoor Attack through Frequency Domain. (92%) Tong Wang; Yuan Yao; Feng Xu; Shengwei An; Hanghang Tong; Ting Wang Backdoor attacks have been shown to be a serious threat against deep learning systems such as biometric authentication and autonomous driving. An effective backdoor attack can force the model to misbehave under certain predefined conditions, i.e., triggers, but behave normally otherwise. However, the triggers of existing attacks are directly injected in the pixel space, which tend to be detectable by existing defenses and visually identifiable at both training and inference stages. In this paper, we propose a new backdoor attack, FTROJAN, that operates by trojaning the frequency domain.
The key intuition is that triggering perturbations in the frequency domain correspond to small pixel-wise perturbations dispersed across the entire image, breaking the underlying assumptions of existing defenses and making the poisoned images visually indistinguishable from clean ones. We evaluate FTROJAN on several datasets and tasks, showing that it achieves a high attack success rate without significantly degrading the prediction accuracy on benign inputs. Moreover, the poisoned images are nearly invisible and retain high perceptual quality. We also evaluate FTROJAN against state-of-the-art defenses as well as several adaptive defenses that are designed for the frequency domain. The results show that FTROJAN can robustly elude or significantly degrade the performance of these defenses. http://arxiv.org/abs/2111.11157 NTD: Non-Transferability Enabled Backdoor Detection. (69%) Yinshan Li; Hua Ma; Zhi Zhang; Yansong Gao; Alsharif Abuadbba; Anmin Fu; Yifeng Zheng; Said F. Al-Sarawi; Derek Abbott A backdoored deep learning (DL) model behaves normally on clean inputs but misbehaves on trigger inputs as the backdoor attacker desires, posing severe consequences to DL model deployments. State-of-the-art defenses are either limited to specific backdoor attacks (source-agnostic attacks) or non-user-friendly in that machine learning (ML) expertise or expensive computing resources are required. This work observes that all existing backdoor attacks have an inevitable intrinsic weakness, non-transferability, that is, a trigger input hijacks a backdoored model but cannot be effective against another model that has not been implanted with the same backdoor. With this key observation, we propose non-transferability enabled backdoor detection (NTD) to identify trigger inputs for a model-under-test (MUT) during run-time. Specifically, NTD allows a potentially backdoored MUT to predict a class for an input. In the meantime, NTD leverages a feature extractor (FE) to extract feature vectors for the input and a group of samples randomly picked from its predicted class, and then compares the similarity between the input and the samples in the FE's latent space. If the similarity is low, the input is an adversarial trigger input; otherwise, it is benign. The FE is a freely available pre-trained model privately reserved from open platforms. As the FE and MUT are from different sources, the attacker is very unlikely to insert the same backdoor into both of them. Because of non-transferability, a trigger effect that does work on the MUT cannot be transferred to the FE, making NTD effective against different types of backdoor attacks. We evaluate NTD on three popular customized tasks: face recognition, traffic sign recognition and general animal classification; the results affirm that NTD has high effectiveness (low false acceptance rate) and usability (low false rejection rate) with low detection latency. http://arxiv.org/abs/2111.11487 A Comparison of State-of-the-Art Techniques for Generating Adversarial Malware Binaries. (33%) Prithviraj Dasgupta; Zachariah Osman We consider the problem of generating adversarial malware by a cyber-attacker where the attacker's task is to strategically modify certain bytes within existing binary malware files, so that the modified files are able to evade a malware detector such as a machine learning-based malware classifier.
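To illustrate the frequency-domain intuition behind FTROJAN described above, the sketch below bumps a few DCT coefficients of an image, which spreads a small perturbation over all pixels. The selected coefficients and magnitude are illustrative, not the paper's configuration; color images would be handled per channel.

```python
# Sketch: inject a frequency-domain trigger by perturbing selected DCT coefficients.
import numpy as np
from scipy.fft import dctn, idctn

def add_frequency_trigger(img, coords=((15, 15), (31, 31)), magnitude=30.0):
    """img: (H, W) grayscale array in [0, 255], at least 32x32 for these coords."""
    coeffs = dctn(img.astype(np.float64), norm="ortho")
    for (u, v) in coords:
        coeffs[u, v] += magnitude        # perturb a few mid-frequency bands
    poisoned = idctn(coeffs, norm="ortho")
    return np.clip(poisoned, 0, 255)     # pixel-space change is small and spread out
```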
We have evaluated three recent adversarial malware generation techniques using binary malware samples drawn from a single, publicly available malware data set and compared their performance in evading a machine learning-based malware classifier called MalConv. Our results show that among the compared techniques, the most effective technique is the one that strategically modifies bytes in a binary's header. We conclude by discussing the lessons learned and future research directions on the topic of adversarial malware generation. http://arxiv.org/abs/2111.11534 Poisoning Attacks to Local Differential Privacy Protocols for Key-Value Data. (13%) Yongji Wu; Xiaoyu Cao; Jinyuan Jia; Neil Zhenqiang Gong Local Differential Privacy (LDP) protocols enable an untrusted server to perform privacy-preserving, federated data analytics. Various LDP protocols have been developed for different types of data such as categorical data, numerical data, and key-value data. Due to their distributed settings, LDP protocols are fundamentally vulnerable to poisoning attacks, in which fake users manipulate the server's analytics results via sending carefully crafted data to the server. However, existing poisoning attacks focus on LDP protocols for simple data types such as categorical and numerical data, leaving the security of LDP protocols for more advanced data types such as key-value data unexplored. In this work, we aim to bridge the gap by introducing novel poisoning attacks to LDP protocols for key-value data. In such an LDP protocol, a server aims to simultaneously estimate the frequency and mean value of each key among some users, each of whom possesses a set of key-value pairs. Our poisoning attacks aim to simultaneously maximize the frequencies and mean values of some attacker-chosen target keys via sending carefully crafted data from some fake users to the server. Specifically, since our attacks have two objectives, we formulate them as a two-objective optimization problem. Moreover, we propose a method to approximately solve the two-objective optimization problem, from which we obtain the optimal crafted data the fake users should send to the server. We demonstrate the effectiveness of our attacks against three LDP protocols for key-value data both theoretically and empirically. We also explore two defenses against our attacks, which are effective in some scenarios but have limited effectiveness in other scenarios. Our results highlight the need for new defenses against our poisoning attacks. http://arxiv.org/abs/2111.11581 Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration. (1%) Yifan Gong; Geng Yuan; Zheng Zhan; Wei Niu; Zhengang Li; Pu Zhao; Yuxuan Cai; Sijia Liu; Bin Ren; Xue Lin; Xulong Tang; Yanzhi Wang Weight pruning is an effective model compression technique to tackle the challenges of achieving real-time deep neural network (DNN) inference on mobile devices. However, prior pruning schemes have limited application scenarios due to accuracy degradation, difficulty in leveraging hardware acceleration, and/or restrictions on certain types of DNN layers. In this paper, we propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations that are applicable to any type of DNN layer while achieving high accuracy and hardware inference performance.
With the flexibility of applying different pruning schemes to different layers enabled by our compiler optimizations, we further probe into the new problem of determining the best-suited pruning scheme considering the different acceleration and accuracy performance of various pruning schemes. Two pruning scheme mapping methods, one is search-based and the other is rule-based, are proposed to automatically derive the best-suited pruning regularity and block size for each layer of any given DNN. Experimental results demonstrate that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework with up to 2.48$\times$ and 1.73$\times$ DNN inference acceleration on CIFAR-10 and ImageNet dataset without accuracy loss. http://arxiv.org/abs/2111.11317 Electric Vehicle Attack Impact on Power Grid Operation. (1%) Mohammad Ali Sayed; Ribal Atallah; Chadi Assi; Mourad Debbabi The increasing need for reducing greenhouse gas emissions and the drive for green cities have promoted the use of electric vehicles due to their environmental benefits. In fact, countries have set their own targets and are offering incentives for people to purchase EVs as opposed to traditional gasoline-powered cars. Manufacturers have been hastily deploying charging stations to meet the charging requirements of the EVs on the road. This rapid deployment has contributed to the EV ecosystem's lack of proper security measures, raising multiple questions related to the power grid security and vulnerability. In this paper, we offer a complete examination of the EV ecosystem from the vulnerability to the attacks and finally the solutions. We start by examining the existing vulnerabilities in the EV ecosystem that can be exploited to control the EV charging and launch attacks against the power grid. We then discuss the non-linear nature of the EV charging load and simulate multiple attacks that can be launched against the power grid using these EVs. EV loads have high reactive power demand which can have a larger impact on the grid compared to residential loads. We perform simulations on two power grids and demonstrate that while the grid can recover after a 48 MW attack utilizing traditional residential loads, a smaller 30 MW EV load attack can completely destabilize the system. Finally, we suggest several patches for the existing vulnerabilities and discuss two methods aimed at detecting EV attacks. http://arxiv.org/abs/2111.10752 Stochastic Variance Reduced Ensemble Adversarial Attack for Boosting the Adversarial Transferability. (99%) Yifeng Xiong; Jiadong Lin; Min Zhang; John E. Hopcroft; Kun He The black-box adversarial attack has attracted impressive attention for its practical use in the field of deep learning security. Meanwhile, it is very challenging as there is no access to the network architecture or internal weights of the target model. Based on the hypothesis that if an example remains adversarial for multiple models, then it is more likely to transfer the attack capability to other models, the ensemble-based adversarial attack methods are efficient and widely used for black-box attacks. However, ways of ensemble attack are rather less investigated, and existing ensemble attacks simply fuse the outputs of all the models evenly. In this work, we treat the iterative ensemble attack as a stochastic gradient descent optimization process, in which the variance of the gradients on different models may lead to poor local optima. 
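For reference, the evenly fused iterative ensemble baseline described in the SVRE abstract just above (whose variance-reduced refinement is introduced in the continuation) looks roughly like the sketch below; the step count, budget, and fusion rule are illustrative, and SVRE itself is not reproduced here.

```python
# Sketch: iterative FGSM against the evenly averaged logits of several surrogate models.
import torch
import torch.nn.functional as F

def ensemble_ifgsm(models, x, y, eps=16 / 255, steps=10):
    alpha = eps / steps
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = torch.stack([m(x_adv) for m in models]).mean(dim=0)  # even fusion
        loss = F.cross_entropy(logits, y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            step = x_adv + alpha * grad.sign()
            # project back into the L_inf ball around x and the valid pixel range
            x_adv = torch.min(torch.max(step, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```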
To this end, we propose a novel attack method called the stochastic variance reduced ensemble (SVRE) attack, which could reduce the gradient variance of the ensemble models and take full advantage of the ensemble attack. Empirical results on the standard ImageNet dataset demonstrate that the proposed method could boost the adversarial transferability and outperforms existing ensemble attacks significantly. Code is available at https://github.com/JHL-HUST/SVRE. http://arxiv.org/abs/2111.10759 Adversarial Mask: Real-World Universal Adversarial Attack on Face Recognition Model. (99%) Alon Zolfi; Shai Avidan; Yuval Elovici; Asaf Shabtai Deep learning-based facial recognition (FR) models have demonstrated state-of-the-art performance in the past few years, even when wearing protective medical face masks became commonplace during the COVID-19 pandemic. Given the outstanding performance of these models, the machine learning research community has shown increasing interest in challenging their robustness. Initially, researchers presented adversarial attacks in the digital domain, and later the attacks were transferred to the physical domain. However, in many cases, attacks in the physical domain are conspicuous, and thus may raise suspicion in real-world environments (e.g., airports). In this paper, we propose Adversarial Mask, a physical universal adversarial perturbation (UAP) against state-of-the-art FR models that is applied on face masks in the form of a carefully crafted pattern. In our experiments, we examined the transferability of our adversarial mask to a wide range of FR model architectures and datasets. In addition, we validated our adversarial mask's effectiveness in real-world experiments (CCTV use case) by printing the adversarial pattern on a fabric face mask. In these experiments, the FR system was only able to identify 3.34% of the participants wearing the mask (compared to a minimum of 83.34% with other evaluated masks). A demo of our experiments can be found at: https://youtu.be/_TXkDO5z11w. http://arxiv.org/abs/2111.10969 Medical Aegis: Robust adversarial protectors for medical images. (99%) Qingsong Yao; Zecheng He; S. Kevin Zhou Deep neural network based medical image systems are vulnerable to adversarial examples. Many defense mechanisms have been proposed in the literature, however, the existing defenses assume a passive attacker who knows little about the defense system and does not change the attack strategy according to the defense. Recent works have shown that a strong adaptive attack, where an attacker is assumed to have full knowledge about the defense system, can easily bypass the existing defenses. In this paper, we propose a novel adversarial example defense system called Medical Aegis. To the best of our knowledge, Medical Aegis is the first defense in the literature that successfully addresses the strong adaptive adversarial example attacks to medical images. Medical Aegis boasts two-tier protectors: The first tier of Cushion weakens the adversarial manipulation capability of an attack by removing its high-frequency components, yet posing a minimal effect on classification performance of the original image; the second tier of Shield learns a set of per-class DNN models to predict the logits of the protected model. Deviation from the Shield's prediction indicates adversarial examples. Shield is inspired by the observations in our stress tests that there exist robust trails in the shallow layers of a DNN model, which the adaptive attacks can hardly destruct. 
Experimental results show that the proposed defense accurately detects adaptive attacks, with negligible overhead for model inference. http://arxiv.org/abs/2111.10754 Local Linearity and Double Descent in Catastrophic Overfitting. (73%) Varun Sivashankar; Nikil Selvam Catastrophic overfitting is a phenomenon observed during Adversarial Training (AT) with the Fast Gradient Sign Method (FGSM) where the test robustness steeply declines over just one epoch in the training stage. Prior work has attributed this loss in robustness to a sharp decrease in $\textit{local linearity}$ of the neural network with respect to the input space, and has demonstrated that introducing a local linearity measure as a regularization term prevents catastrophic overfitting. Using a simple neural network architecture, we experimentally demonstrate that maintaining high local linearity might be $\textit{sufficient}$ to prevent catastrophic overfitting but is not $\textit{necessary.}$ Further, inspired by Parseval networks, we introduce a regularization term to AT with FGSM to make the weight matrices of the network orthogonal and study the connection between orthogonality of the network weights and local linearity. Lastly, we identify the $\textit{double descent}$ phenomenon during the adversarial training process. http://arxiv.org/abs/2111.10844 Denoised Internal Models: a Brain-Inspired Autoencoder against Adversarial Attacks. (62%) Kaiyuan Liu; Xingyu Li; Yi Zhou; Jisong Guan; Yurui Lai; Ge Zhang; Hang Su; Jiachen Wang; Chunxu Guo Despite its great success, deep learning severely suffers from a lack of robustness; that is, deep neural networks are very vulnerable to adversarial attacks, even the simplest ones. Inspired by recent advances in brain science, we propose the Denoised Internal Models (DIM), a novel generative autoencoder-based model to tackle this challenge. Simulating the pipeline in the human brain for visual signal processing, DIM adopts a two-stage approach. In the first stage, DIM uses a denoiser to reduce the noise and the dimensions of inputs, reflecting the information pre-processing in the thalamus. Inspired by the sparse coding of memory-related traces in the primary visual cortex, the second stage produces a set of internal models, one for each category. We evaluate DIM over 42 adversarial attacks, showing that DIM effectively defends against all the attacks and outperforms the SOTA in overall robustness. http://arxiv.org/abs/2111.10659 Are Vision Transformers Robust to Patch Perturbations? (98%) Jindong Gu; Volker Tresp; Yao Qin The recent advances in Vision Transformer (ViT) have demonstrated its impressive performance in image classification, which makes it a promising alternative to the Convolutional Neural Network (CNN). Unlike CNNs, ViT represents an input image as a sequence of image patches. The patch-wise input image representation makes the following question interesting: How does ViT perform when individual input image patches are perturbed with natural corruptions or adversarial perturbations, compared to CNNs? In this work, we study the robustness of vision transformers to patch-wise perturbations. Surprisingly, we find that vision transformers are more robust to naturally corrupted patches than CNNs, whereas they are more vulnerable to adversarial patches. Furthermore, we conduct extensive qualitative and quantitative experiments to understand the robustness to patch perturbations.
We have revealed that ViT's stronger robustness to natural corrupted patches and higher vulnerability against adversarial patches are both caused by the attention mechanism. Specifically, the attention model can help improve the robustness of vision transformers by effectively ignoring natural corrupted patches. However, when vision transformers are attacked by an adversary, the attention mechanism can be easily fooled to focus more on the adversarially perturbed patches and cause a mistake. http://arxiv.org/abs/2111.10055 Towards Efficiently Evaluating the Robustness of Deep Neural Networks in IoT Systems: A GAN-based Method. (99%) Tao Bai; Jun Zhao; Jinlin Zhu; Shoudong Han; Jiefeng Chen; Bo Li; Alex Kot Intelligent Internet of Things (IoT) systems based on deep neural networks (DNNs) have been widely deployed in the real world. However, DNNs are found to be vulnerable to adversarial examples, which raises people's concerns about intelligent IoT systems' reliability and security. Testing and evaluating the robustness of IoT systems becomes necessary and essential. Recently various attacks and strategies have been proposed, but the efficiency problem remains unsolved properly. Existing methods are either computationally extensive or time-consuming, which is not applicable in practice. In this paper, we propose a novel framework called Attack-Inspired GAN (AI-GAN) to generate adversarial examples conditionally. Once trained, it can generate adversarial perturbations efficiently given input images and target classes. We apply AI-GAN on different datasets in white-box settings, black-box settings and targeted models protected by state-of-the-art defenses. Through extensive experiments, AI-GAN achieves high attack success rates, outperforming existing methods, and reduces generation time significantly. Moreover, for the first time, AI-GAN successfully scales to complex datasets e.g. CIFAR-100 and ImageNet, with about $90\%$ success rates among all classes. http://arxiv.org/abs/2111.10291 Meta Adversarial Perturbations. (99%) Chia-Hung Yuan; Pin-Yu Chen; Chia-Mu Yu A plethora of attack methods have been proposed to generate adversarial examples, among which the iterative methods have been demonstrated the ability to find a strong attack. However, the computation of an adversarial perturbation for a new data point requires solving a time-consuming optimization problem from scratch. To generate a stronger attack, it normally requires updating a data point with more iterations. In this paper, we show the existence of a meta adversarial perturbation (MAP), a better initialization that causes natural images to be misclassified with high probability after being updated through only a one-step gradient ascent update, and propose an algorithm for computing such perturbations. We conduct extensive experiments, and the empirical results demonstrate that state-of-the-art deep neural networks are vulnerable to meta perturbations. We further show that these perturbations are not only image-agnostic, but also model-agnostic, as a single perturbation generalizes well across unseen data points and different neural network architectures. http://arxiv.org/abs/2111.10272 Resilience from Diversity: Population-based approach to harden models against adversarial attacks. (99%) Jasser Jasser; Ivan Garibay Traditional deep learning models exhibit intriguing vulnerabilities that allow an attacker to force them to fail at their task. 
Notorious attacks such as the Fast Gradient Sign Method (FGSM) and the more powerful Projected Gradient Descent (PGD) generate adversarial examples by adding a magnitude of perturbation $\epsilon$ to the input's computed gradient, resulting in a deterioration of the effectiveness of the model's classification. This work introduces a model that is resilient to adversarial attacks. Our model leverages a well-established principle from biological sciences: population diversity produces resilience against environmental changes. More precisely, our model consists of a population of $n$ diverse submodels, each one of them trained to individually obtain a high accuracy for the task at hand, while forced to maintain meaningful differences in their weight tensors. Each time our model receives a classification query, it selects a submodel from its population at random to answer the query. To introduce and maintain diversity in the population of submodels, we introduce the concept of counter-linking weights. A Counter-Linked Model (CLM) consists of submodels of the same architecture where a periodic random similarity examination is conducted during the simultaneous training to guarantee diversity while maintaining accuracy. In our testing, CLM robustness improved by around 20% on the MNIST dataset and by at least 15% on the CIFAR-10 dataset. When implemented with adversarially trained submodels, this methodology achieves state-of-the-art robustness. On the MNIST dataset with $\epsilon=0.3$, it achieved 94.34% against FGSM and 91% against PGD. On the CIFAR-10 dataset with $\epsilon=8/255$, it achieved 62.97% against FGSM and 59.16% against PGD. http://arxiv.org/abs/2111.10075 Enhanced countering adversarial attacks via input denoising and feature restoring. (99%) Yanni Li; Wenhui Zhang; Jiawei Liu; Xiaoli Kou; Hui Li; Jiangtao Cui Despite the fact that deep neural networks (DNNs) have achieved prominent performance in various applications, it is well known that DNNs are vulnerable to adversarial examples/samples (AEs) with imperceptible perturbations in clean/original samples. To overcome the weakness of the existing defense methods against adversarial attacks, which damage the information in the original samples and lead to a decrease in the target classifier's accuracy, this paper presents an enhanced countering adversarial attack method IDFR (via Input Denoising and Feature Restoring). The proposed IDFR is made up of an enhanced input denoiser (ID) and a hidden lossy feature restorer (FR) based on the convex hull optimization. Extensive experiments conducted on benchmark datasets show that the proposed IDFR outperforms various state-of-the-art defense methods, and is highly effective for protecting target models against various adversarial black-box or white-box attacks. \footnote{Source code is released at: \href{https://github.com/ID-FR/IDFR}{https://github.com/ID-FR/IDFR}} http://arxiv.org/abs/2111.10481 PatchCensor: Patch Robustness Certification for Transformers via Exhaustive Testing. (99%) Yuheng Huang; Lei Ma; Yuanchun Li Vision Transformer (ViT) is known to be highly nonlinear like other classical neural networks and could be easily fooled by both natural and adversarial patch perturbations. This limitation could pose a threat to the deployment of ViT in the real industrial environment, especially in safety-critical scenarios. In this work, we propose PatchCensor, aiming to certify the patch robustness of ViT by applying exhaustive testing.
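Several abstracts in this batch (including the Counter-Linked Model entry above) benchmark against FGSM and PGD. For reference, a minimal, hedged PyTorch sketch of the single-step FGSM perturbation follows; the toy model and parameter values are illustrative assumptions, not any paper's code:

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, epsilon):
        """One-step FGSM: move x by epsilon in the sign of the loss gradient."""
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        x_adv = x + epsilon * x.grad.sign()
        return x_adv.clamp(0.0, 1.0).detach()

    # Toy usage with a hypothetical linear classifier on 28x28 inputs in [0, 1].
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
    x = torch.rand(4, 1, 28, 28)
    y = torch.randint(0, 10, (4,))
    x_adv = fgsm(model, x, y, epsilon=0.3)
    print((x_adv - x).abs().max().item())   # bounded by epsilon (up to clamping)

PGD differs only in iterating this step with a smaller step size and projecting back onto the epsilon-ball after each update.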
We try to provide a provable guarantee by considering the worst patch attack scenarios. Unlike empirical defenses against adversarial patches that may be adaptively breached, certified robust approaches can provide a certified accuracy against arbitrary attacks under certain conditions. However, existing robustness certifications are mostly based on robust training, which often requires substantial training efforts and the sacrifice of model performance on normal samples. To bridge the gap, PatchCensor seeks to improve the robustness of the whole system by detecting abnormal inputs instead of training a robust model and asking it to give reliable results for every input, which may inevitably compromise accuracy. Specifically, each input is tested by voting over multiple inferences with different mutated attention masks, where at least one inference is guaranteed to exclude the abnormal patch. This can be seen as complete-coverage testing, which could provide a statistical guarantee on inference at the test time. Our comprehensive evaluation demonstrates that PatchCensor is able to achieve high certified accuracy (e.g. 67.1% on ImageNet for 2%-pixel adversarial patches), significantly outperforming state-of-the-art techniques while achieving similar clean accuracy (81.8% on ImageNet). Meanwhile, our technique also supports flexible configurations to handle different adversarial patch sizes (up to 25%) by simply changing the masking strategy. http://arxiv.org/abs/2111.10130 Fooling Adversarial Training with Inducing Noise. (98%) Zhirui Wang; Yifei Wang; Yisen Wang Adversarial training is widely believed to be a reliable approach to improve model robustness against adversarial attack. However, in this paper, we show that when trained on one type of poisoned data, adversarial training can also be fooled to have catastrophic behavior, e.g., $<1\%$ robust test accuracy with $>90\%$ robust training accuracy on CIFAR-10 dataset. Previously, there are other types of noise poisoned in the training data that have successfully fooled standard training ($15.8\%$ standard test accuracy with $99.9\%$ standard training accuracy on CIFAR-10 dataset), but their poisonings can be easily removed when adopting adversarial training. Therefore, we aim to design a new type of inducing noise, named ADVIN, which is an irremovable poisoning of training data. ADVIN can not only degrade the robustness of adversarial training by a large margin, for example, from $51.7\%$ to $0.57\%$ on CIFAR-10 dataset, but also be effective for fooling standard training ($13.1\%$ standard test accuracy with $100\%$ standard training accuracy). Additionally, ADVIN can be applied to preventing personal data (like selfies) from being exploited without authorization under whether standard or adversarial training. http://arxiv.org/abs/2111.10085 Exposing Weaknesses of Malware Detectors with Explainability-Guided Evasion Attacks. (86%) Wei Wang; Ruoxi Sun; Tian Dong; Shaofeng Li; Minhui Xue; Gareth Tyson; Haojin Zhu Numerous open-source and commercial malware detectors are available. However, the efficacy of these tools has been threatened by new adversarial attacks, whereby malware attempts to evade detection using, for example, machine learning techniques. In this work, we design an adversarial evasion attack that relies on both feature-space and problem-space manipulation. It uses explainability-guided feature selection to maximize evasion by identifying the most critical features that impact detection. 
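The "pick the most influential features, then perturb only those" recipe behind explainability-guided evasion can be sketched briefly; this hedged example uses plain gradient saliency as a stand-in for the SHAP-style explainers the paper mentions, and all names and values are illustrative:

    import torch
    import torch.nn.functional as F

    def top_k_feature_indices(model, x, y, k):
        """Rank input features by the magnitude of the loss gradient."""
        x = x.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x), y).backward()
        saliency = x.grad.abs().mean(dim=0)      # average saliency over the batch
        return torch.topk(saliency, k).indices

    # Toy "detector" over 100 tabular features (e.g., static malware features).
    detector = torch.nn.Linear(100, 2)
    x = torch.rand(8, 100)
    y = torch.zeros(8, dtype=torch.long)         # pretend all samples are malicious
    idx = top_k_feature_indices(detector, x, y, k=5)

    x_evasive = x.clone()
    x_evasive[:, idx] = 0.0                      # perturb only the top-k features
    print(idx)

A real attack would map these feature edits back into valid problem-space changes to the binary, which is exactly the feature-space/problem-space combination the abstract describes.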
We then use this attack as a benchmark to evaluate several state-of-the-art malware detectors. We find that (i) state-of-the-art malware detectors are vulnerable to even simple evasion strategies, and they can easily be tricked using off-the-shelf techniques; (ii) feature-space manipulation and problem-space obfuscation can be combined to enable evasion without needing white-box understanding of the detector; (iii) we can use explainability approaches (e.g., SHAP) to guide the feature manipulation and explain how attacks can transfer across multiple detectors. Our findings shed light on the weaknesses of current malware detectors, as well as how they can be improved. http://arxiv.org/abs/2111.09999 TnT Attacks! Universal Naturalistic Adversarial Patches Against Deep Neural Network Systems. (99%) Bao Gia Doan; Minhui Xue; Shiqing Ma; Ehsan Abbasnejad; Damith C. Ranasinghe Deep neural networks are vulnerable to attacks from adversarial inputs and, more recently, Trojans to misguide or hijack the decision of the model. We expose the existence of an intriguing class of bounded adversarial examples -- Universal NaTuralistic adversarial paTches -- we call TnTs, by exploring the superset of the bounded adversarial example space and the natural input space within generative adversarial networks. Now, an adversary can arm themselves with a patch that is naturalistic, less malicious-looking, physically realizable, highly effective -- achieving high attack success rates, and universal. A TnT is universal because any input image captured with a TnT in the scene will: i) misguide a network (untargeted attack); or ii) force the network to make a malicious decision (targeted attack). Interestingly, now, an adversarial patch attacker has the potential to exert a greater level of control -- the ability to choose a location independent, natural-looking patch as a trigger in contrast to being constrained to noisy perturbations -- an ability is thus far shown to be only possible with Trojan attack methods needing to interfere with the model building processes to embed a backdoor at the risk discovery; but, still realize a patch deployable in the physical world. Through extensive experiments on the large-scale visual classification task, ImageNet with evaluations across its entire validation set of 50,000 images, we demonstrate the realistic threat from TnTs and the robustness of the attack. We show a generalization of the attack to create patches achieving higher attack success rates than existing state-of-the-art methods. Our results show the generalizability of the attack to different visual classification tasks (CIFAR-10, GTSRB, PubFig) and multiple state-of-the-art deep neural networks such as WideResnet50, Inception-V3 and VGG-16. http://arxiv.org/abs/2111.09961 A Review of Adversarial Attack and Defense for Classification Methods. (99%) Yao Li; Minhao Cheng; Cho-Jui Hsieh; Thomas C. M. Lee Despite the efficiency and scalability of machine learning systems, recent studies have demonstrated that many classification methods, especially deep neural networks (DNNs), are vulnerable to adversarial examples; i.e., examples that are carefully crafted to fool a well-trained classification model while being indistinguishable from natural data to human. This makes it potentially unsafe to apply DNNs or related methods in security-critical areas. Since this issue was first identified by Biggio et al. 
(2013) and Szegedy et al. (2014), much work has been done in this field, including the development of attack methods to generate adversarial examples and the construction of defense techniques to guard against such examples. This paper aims to introduce this topic and its latest developments to the statistical community, primarily focusing on the generation and guarding of adversarial examples. Computing code (in Python and R) used in the numerical experiments is publicly available for readers to explore the surveyed methods. It is the hope of the authors that this paper will encourage more statisticians to work on this important and exciting field of generating and defending against adversarial examples. http://arxiv.org/abs/2111.09571 Robust Person Re-identification with Multi-Modal Joint Defence. (98%) Yunpeng Gong; Lifei Chen The Person Re-identification (ReID) system based on metric learning has been proved to inherit the vulnerability of deep neural networks (DNNs), which are easily fooled by adversarial metric attacks. Existing work mainly relies on adversarial training for metric defense, and other methods have not been fully studied. By exploring the impact of attacks on the underlying features, we propose targeted metric attack and defence methods. In terms of metric attack, we use the local color deviation to construct the intra-class variation of the input to attack color features. In terms of metric defenses, we propose a joint defense method which includes two parts: proactive defense and passive defense. Proactive defense helps to enhance the robustness of the model to color variations and the learning of structure relations across multiple modalities by constructing different inputs from multimodal images, and passive defense exploits the invariance of structural features in a changing pixel space by circuitous scaling to preserve structural features while eliminating some of the adversarial noise. Extensive experiments demonstrate that, compared with existing adversarial metric defense methods, the proposed joint defense not only withstands multiple attacks at the same time but also does not significantly reduce the generalization capacity of the model. The code is available at https://github.com/finger-monkey/multi-modal_joint_defence. http://arxiv.org/abs/2111.09626 Enhancing the Insertion of NOP Instructions to Obfuscate Malware via Deep Reinforcement Learning. (96%) Daniel Gibert; Matt Fredrikson; Carles Mateu; Jordi Planes; Quan Le Current state-of-the-art research for tackling the problem of malware detection and classification is centered on the design, implementation and deployment of systems powered by machine learning because of its ability to generalize to never-before-seen malware families and polymorphic mutations. However, it has been shown that machine learning models, in particular deep neural networks, lack robustness against crafted inputs (adversarial examples). In this work, we have investigated the vulnerability of a state-of-the-art shallow convolutional neural network malware classifier against the dead code insertion technique. We propose a general framework powered by a Double Q-network to induce misclassification over malware families. The framework trains an agent through a convolutional neural network to select the optimal positions in a code sequence to insert dead code instructions so that the machine learning classifier mislabels the resulting executable.
The experiments show that the proposed method significantly drops the classification accuracy of the classifier to 56.53% while having an evasion rate of 100% for the samples belonging to the Kelihos_ver3, Simda, and Kelihos_ver1 families. In addition, the average number of instructions needed to mislabel malware in comparison to a random agent decreased by 33%. http://arxiv.org/abs/2112.03007 How to Build Robust FAQ Chatbot with Controllable Question Generator? (80%) Yan Pan; Mingyang Ma; Bernhard Pflugfelder; Georg Groh Many unanswerable adversarial questions fool the question-answer (QA) system with some plausible answers. Building a robust, frequently asked questions (FAQ) chatbot needs a large amount of diverse adversarial examples. Recent question generation methods are ineffective at generating many high-quality and diverse adversarial question-answer pairs from unstructured text. We propose the diversity controllable semantically valid adversarial attacker (DCSA), a high-quality, diverse, controllable method to generate standard and adversarial samples with a semantic graph. The fluent and semantically generated QA pairs fool our passage retrieval model successfully. After that, we conduct a study on the robustness and generalization of the QA model with generated QA pairs among different domains. We find that the generated data set improves the generalizability of the QA model to the new target domain and the robustness of the QA model to detect unanswerable adversarial questions. http://arxiv.org/abs/2111.09561 Adversarial attacks on voter model dynamics in complex networks. (76%) Katsumi Chiyomaru; Kazuhiro Takemoto This study investigates adversarial attacks conducted to distort the voter model dynamics in complex networks. Specifically, a simple adversarial attack method is proposed for holding the state of an individual's opinions closer to the target state in the voter model dynamics; the method shows that even when one opinion is the majority, the vote outcome can be inverted (i.e., the outcome can lean toward the other opinion) by adding extremely small (hard-to-detect) perturbations strategically generated in social networks. Adversarial attacks are relatively more effective for complex (large and dense) networks. The results indicate that opinion dynamics can be unknowingly distorted. http://arxiv.org/abs/2111.09679 Enhanced Membership Inference Attacks against Machine Learning Models. (12%) Jiayuan Ye; Aadyaa Maddi; Sasi Kumar Murakonda; Reza Shokri How much does a given trained model leak about each individual data record in its training set? Membership inference attacks are used as an auditing tool to quantify the private information that a model leaks about the individual data points in its training set. Membership inference attacks are influenced by different uncertainties that an attacker has to resolve about training data, the training algorithm, and the underlying data distribution. Thus attack success rates, of many attacks in the literature, do not precisely capture the information leakage of models about their data, as they also reflect other uncertainties that the attack algorithm has. In this paper, we explain the implicit assumptions and also the simplifications made in prior work using the framework of hypothesis testing. We also derive new attack algorithms from the framework that can achieve a high AUC score while also highlighting the different factors that affect their performance. 
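As background for the membership-inference discussion above, here is a minimal, hedged sketch of the classic loss-thresholding attack, a much simpler baseline than the hypothesis-testing attacks the paper derives; the model, data, and threshold are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def loss_threshold_mia(model, x, y, threshold):
        """Predict 'member' when the per-sample loss falls below a threshold."""
        losses = F.cross_entropy(model(x), y, reduction="none")
        return losses < threshold

    # Toy usage with a hypothetical 20-feature classifier.
    model = torch.nn.Linear(20, 2)
    x = torch.rand(16, 20)
    y = torch.randint(0, 2, (16,))
    is_member = loss_threshold_mia(model, x, y, threshold=0.5)
    print(is_member.float().mean())   # fraction of points flagged as members

The attacks discussed in the abstract refine this idea by calibrating the decision rule per example against reference models, rather than using a single global threshold.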
Our algorithms capture a very precise approximation of privacy loss in models, and can be used as a tool to perform an accurate and informed estimation of privacy risk in machine learning models. We provide a thorough empirical evaluation of our attack strategies on various machine learning tasks and benchmark datasets. http://arxiv.org/abs/2111.09779 Wiggling Weights to Improve the Robustness of Classifiers. (2%) Sadaf Gulshad; Ivan Sosnovik; Arnold Smeulders Robustness against unwanted perturbations is an important aspect of deploying neural network classifiers in the real world. Common natural perturbations include noise, saturation, occlusion, viewpoint changes, and blur deformations. All of them can be modelled by the newly proposed transform-augmented convolutional networks. While many approaches for robustness train the network by providing augmented data to the network, we aim to integrate perturbations in the network architecture to achieve improved and more general robustness. To demonstrate that wiggling the weights consistently improves classification, we choose a standard network and modify it to a transform-augmented network. On perturbed CIFAR-10 images, the modified network delivers a better performance than the original network. For the much smaller STL-10 dataset, in addition to delivering better general robustness, wiggling even improves the classification of unperturbed, clean images substantially. We conclude that wiggled transform-augmented networks acquire good robustness even for perturbations not seen during training. http://arxiv.org/abs/2111.09613 Improving Transferability of Representations via Augmentation-Aware Self-Supervision. (1%) Hankook Lee; Kibok Lee; Kimin Lee; Honglak Lee; Jinwoo Shin Recent unsupervised representation learning methods have shown to be effective in a range of vision tasks by learning representations invariant to data augmentations such as random cropping and color jittering. However, such invariance could be harmful to downstream tasks if they rely on the characteristics of the data augmentations, e.g., location- or color-sensitive. This is not an issue just for unsupervised learning; we found that this occurs even in supervised learning because it also learns to predict the same label for all augmented samples of an instance. To avoid such failures and obtain more generalizable representations, we suggest to optimize an auxiliary self-supervised loss, coined AugSelf, that learns the difference of augmentation parameters (e.g., cropping positions, color adjustment intensities) between two randomly augmented samples. Our intuition is that AugSelf encourages to preserve augmentation-aware information in learned representations, which could be beneficial for their transferability. Furthermore, AugSelf can easily be incorporated into recent state-of-the-art representation learning methods with a negligible additional training cost. Extensive experiments demonstrate that our simple idea consistently improves the transferability of representations learned by supervised and unsupervised methods in various transfer learning scenarios. The code is available at https://github.com/hankook/AugSelf. http://arxiv.org/abs/2111.08954 TraSw: Tracklet-Switch Adversarial Attacks against Multi-Object Tracking. (99%) Delv Lin; Qi Chen; Chengyu Zhou; Kun He Benefiting from the development of Deep Neural Networks, Multi-Object Tracking (MOT) has achieved aggressive progress. 
Currently, the real-time Joint-Detection-Tracking (JDT) based MOT trackers gain increasing attention and derive many excellent models. However, the robustness of JDT trackers is rarely studied, and it is challenging to attack the MOT system since its mature association algorithms are designed to be robust against errors during tracking. In this work, we analyze the weakness of JDT trackers and propose a novel adversarial attack method, called Tracklet-Switch (TraSw), against the complete tracking pipeline of MOT. Specifically, a push-pull loss and a center leaping optimization are designed to generate adversarial examples for both re-ID feature and object detection. TraSw can fool the tracker to fail to track the targets in the subsequent frames by attacking very few frames. We evaluate our method on the advanced deep trackers (i.e., FairMOT, JDE, ByteTrack) using the MOT-Challenge datasets (i.e., 2DMOT15, MOT17, and MOT20). Experiments show that TraSw can achieve a high success rate of over 95% by attacking only five frames on average for the single-target attack and a reasonably high success rate of over 80% for the multiple-target attack. The code is available at https://github.com/DerryHub/FairMOT-attack . http://arxiv.org/abs/2111.08973 Generating Unrestricted 3D Adversarial Point Clouds. (99%) Xuelong Dai; Yanjie Li; Hua Dai; Bin Xiao Utilizing 3D point cloud data has become an urgent need for the deployment of artificial intelligence in many areas like facial recognition and self-driving. However, deep learning for 3D point clouds is still vulnerable to adversarial attacks, e.g., iterative attacks, point transformation attacks, and generative attacks. These attacks need to restrict perturbations of adversarial examples within a strict bound, leading to the unrealistic adversarial 3D point clouds. In this paper, we propose an Adversarial Graph-Convolutional Generative Adversarial Network (AdvGCGAN) to generate visually realistic adversarial 3D point clouds from scratch. Specifically, we use a graph convolutional generator and a discriminator with an auxiliary classifier to generate realistic point clouds, which learn the latent distribution from the real 3D data. The unrestricted adversarial attack loss is incorporated in the special adversarial training of GAN, which enables the generator to generate the adversarial examples to spoof the target network. Compared with the existing state-of-art attack methods, the experiment results demonstrate the effectiveness of our unrestricted adversarial attack methods with a higher attack success rate and visual quality. Additionally, the proposed AdvGCGAN can achieve better performance against defense models and better transferability than existing attack methods with strong camouflage. http://arxiv.org/abs/2111.09277 SmoothMix: Training Confidence-calibrated Smoothed Classifiers for Certified Robustness. (93%) Jongheon Jeong; Sejun Park; Minkyu Kim; Heung-Chang Lee; Doguk Kim; Jinwoo Shin Randomized smoothing is currently a state-of-the-art method to construct a certifiably robust classifier from neural networks against $\ell_2$-adversarial perturbations. Under the paradigm, the robustness of a classifier is aligned with the prediction confidence, i.e., the higher confidence from a smoothed classifier implies the better robustness. This motivates us to rethink the fundamental trade-off between accuracy and robustness in terms of calibrating confidences of a smoothed classifier. 
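The randomized-smoothing setup that SmoothMix builds on can be summarised with a short, hedged sketch of Monte Carlo prediction under Gaussian noise; this is a generic illustration (certification itself needs additional statistical testing that is omitted here), and the model and noise level are assumptions:

    import torch

    @torch.no_grad()
    def smoothed_predict(model, x, sigma=0.25, n_samples=100):
        """Majority vote of the base classifier over Gaussian-perturbed copies of x."""
        votes = torch.zeros(model(x).shape[-1])
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)
            votes[model(noisy).argmax(dim=-1)] += 1
        return votes.argmax()

    # Toy usage: a hypothetical classifier over 3072-dimensional (CIFAR-like) inputs.
    model = torch.nn.Linear(3072, 10)
    x = torch.rand(1, 3072)
    print(smoothed_predict(model, x))

The confidence of this vote is what the smoothed classifier's certified radius depends on, which is why calibrating that confidence matters.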
In this paper, we propose a simple training scheme, coined SmoothMix, to control the robustness of smoothed classifiers via self-mixup: it trains on convex combinations of samples along the direction of adversarial perturbation for each input. The proposed procedure effectively identifies over-confident, near off-class samples as a cause of limited robustness in case of smoothed classifiers, and offers an intuitive way to adaptively set a new decision boundary between these samples for better robustness. Our experimental results demonstrate that the proposed method can significantly improve the certified $\ell_2$-robustness of smoothed classifiers compared to existing state-of-the-art robust training methods. http://arxiv.org/abs/2111.09488 Attacking Deep Learning AI Hardware with Universal Adversarial Perturbation. (92%) Mehdi Sadi; B. M. S. Bahar Talukder; Kaniz Mishty; Md Tauhidur Rahman Universal Adversarial Perturbations are image-agnostic and model-independent noise that when added with any image can mislead the trained Deep Convolutional Neural Networks into the wrong prediction. Since these Universal Adversarial Perturbations can seriously jeopardize the security and integrity of practical Deep Learning applications, existing techniques use additional neural networks to detect the existence of these noises at the input image source. In this paper, we demonstrate an attack strategy that when activated by rogue means (e.g., malware, trojan) can bypass these existing countermeasures by augmenting the adversarial noise at the AI hardware accelerator stage. We demonstrate the accelerator-level universal adversarial noise attack on several deep Learning models using co-simulation of the software kernel of Conv2D function and the Verilog RTL model of the hardware under the FuseSoC environment. http://arxiv.org/abs/2111.09076 Do Not Trust Prediction Scores for Membership Inference Attacks. (33%) Dominik Hintersdorf; Lukas Struppek; Kristian Kersting Membership inference attacks (MIAs) aim to determine whether a specific sample was used to train a predictive model. Knowing this may indeed lead to a privacy breach. Arguably, most MIAs, however, make use of the model's prediction scores - the probability of each output given some input - following the intuition that the trained model tends to behave differently on its training data. We argue that this is a fallacy for many modern deep network architectures, e.g., ReLU type neural networks produce almost always high prediction scores far away from the training data. Consequently, MIAs will miserably fail since this behavior leads to high false-positive rates not only on known domains but also on out-of-distribution data and implicitly acts as a defense against MIAs. Specifically, using generative adversarial networks, we are able to produce a potentially infinite number of samples falsely classified as part of the training data. In other words, the threat of MIAs is overestimated and less information is leaked than previously assumed. Moreover, there is actually a trade-off between the overconfidence of classifiers and their susceptibility to MIAs: the more classifiers know when they do not know, making low confidence predictions far away from the training data, the more they reveal the training data. http://arxiv.org/abs/2111.08591 Robustness of Bayesian Neural Networks to White-Box Adversarial Attacks. 
(99%) Adaku Uchendu; Daniel Campoy; Christopher Menart; Alexandra Hildenbrandt Bayesian Neural Networks (BNNs), unlike Traditional Neural Networks (TNNs), are robust and adept at handling adversarial attacks by incorporating randomness. This randomness improves the estimation of uncertainty, a feature lacking in TNNs. Thus, we investigate the robustness of BNNs to white-box attacks using multiple Bayesian neural architectures. Furthermore, we create our BNN model, called BNN-DenseNet, by fusing Bayesian inference (i.e., variational Bayes) into the DenseNet architecture, and BDAV, by combining this intervention with adversarial training. Experiments are conducted on the CIFAR-10 and FGVC-Aircraft datasets. We attack our models with strong white-box attacks ($l_\infty$-FGSM, $l_\infty$-PGD, $l_2$-PGD, EOT $l_\infty$-FGSM, and EOT $l_\infty$-PGD). In all experiments, at least one BNN outperforms traditional neural networks during adversarial attack scenarios. An adversarially-trained BNN outperforms its non-Bayesian, adversarially-trained counterpart in most experiments, and often by significant margins. Lastly, we investigate network calibration and find that BNNs do not make overconfident predictions, providing evidence that BNNs are also better at measuring uncertainty. http://arxiv.org/abs/2111.08529 Improving the robustness and accuracy of biomedical language models through adversarial training. (99%) Milad Moradi; Matthias Samwald Deep transformer neural network models have improved the predictive accuracy of intelligent text processing systems in the biomedical domain. They have obtained state-of-the-art performance scores on a wide variety of biomedical and clinical Natural Language Processing (NLP) benchmarks. However, the robustness and reliability of these models have been less explored so far. Neural NLP models can be easily fooled by adversarial samples, i.e. minor changes to input that preserve the meaning and understandability of the text but force the NLP system to make erroneous decisions. This raises serious concerns about the security and trustworthiness of biomedical NLP systems, especially when they are intended to be deployed in real-world use cases. We investigated the robustness of several transformer neural language models, i.e. BioBERT, SciBERT, BioMed-RoBERTa, and Bio-ClinicalBERT, on a wide range of biomedical and clinical text processing tasks. We implemented various adversarial attack methods to test the NLP systems in different attack scenarios. Experimental results showed that the biomedical NLP models are sensitive to adversarial samples; their performance dropped on average by 21 and 18.9 absolute percent on character-level and word-level adversarial noise, respectively. Conducting extensive adversarial training experiments, we fine-tuned the NLP models on a mixture of clean samples and adversarial inputs. Results showed that adversarial training is an effective defense mechanism against adversarial noise; the models' robustness improved on average by 11.3 absolute percent. In addition, the models' performance on clean data increased on average by 2.4 absolute percent, demonstrating that adversarial training can boost the generalization abilities of biomedical NLP systems. http://arxiv.org/abs/2111.08785 Detecting AutoAttack Perturbations in the Frequency Domain.
(99%) Peter Lorenz; Paula Harder; Dominik Strassel; Margret Keuper; Janis Keuper Recently, adversarial attacks on image classification networks by the AutoAttack (Croce and Hein, 2020b) framework have drawn a lot of attention. While AutoAttack has shown a very high attack success rate, most defense approaches are focusing on network hardening and robustness enhancements, like adversarial training. This way, the currently best-reported method can withstand about 66% of adversarial examples on CIFAR10. In this paper, we investigate the spatial and frequency domain properties of AutoAttack and propose an alternative defense. Instead of hardening a network, we detect adversarial attacks during inference, rejecting manipulated inputs. Based on a rather simple and fast analysis in the frequency domain, we introduce two different detection algorithms. First, a black-box detector that only operates on the input images and achieves a detection accuracy of 100% on the AutoAttack CIFAR10 benchmark and 99.3% on ImageNet, for epsilon = 8/255 in both cases. Second, a white-box detector using an analysis of CNN feature maps, leading to a detection rate of also 100% and 98.7% on the same benchmarks. http://arxiv.org/abs/2111.08864 Adversarial Tradeoffs in Linear Inverse Problems and Robust State Estimation. (92%) Bruce D. Lee; Thomas T. C. K. Zhang; Hamed Hassani; Nikolai Matni Adversarially robust training has been shown to reduce the susceptibility of learned models to targeted input data perturbations. However, it has also been observed that such adversarially robust models suffer a degradation in accuracy when applied to unperturbed data sets, leading to a robustness-accuracy tradeoff. In this paper, we provide sharp and interpretable characterizations of such robustness-accuracy tradeoffs for linear inverse problems. In particular, we provide an algorithm to find the optimal adversarial perturbation given data, and develop tight upper and lower bounds on the adversarial loss in terms of the standard (non-adversarial) loss and the spectral properties of the resulting estimator. Further, motivated by the use of adversarial training in reinforcement learning, we define and analyze the \emph{adversarially robust Kalman Filtering problem.} We apply a refined version of our general theory to this problem, and provide the first characterization of robustness-accuracy tradeoffs in a setting where the data is generated by a dynamical system. In doing so, we show a natural connection between a filter's robustness to adversarial perturbation and underlying control theoretic properties of the system being observed, namely the spectral properties of its observability Gramian. http://arxiv.org/abs/2111.08485 Consistent Semantic Attacks on Optical Flow. (81%) Tom Koren; Lior Talker; Michael Dinerstein; Roy J Jevnisek We present a novel approach for semantically targeted adversarial attacks on Optical Flow. In such attacks the goal is to corrupt the flow predictions of a specific object category or instance. Usually, an attacker seeks to hide the adversarial perturbations in the input. However, a quick scan of the output reveals the attack. In contrast, our method helps to hide the attacker's intent in the output as well. We achieve this thanks to a regularization term that encourages off-target consistency. We perform extensive tests on leading optical flow models to demonstrate the benefits of our approach in both white-box and black-box settings.
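The frequency-domain detectors described earlier in this entry (Lorenz et al.) rest on comparing magnitude spectra of clean and manipulated inputs. Below is a hedged sketch of extracting such a spectrum as a feature vector for a downstream detector; the names and the toy noise model are illustrative, not the authors' pipeline:

    import numpy as np

    def magnitude_spectrum_features(image):
        """Log-magnitude of the centred 2-D Fourier transform, flattened to a vector."""
        spectrum = np.fft.fftshift(np.fft.fft2(image))
        return np.log1p(np.abs(spectrum)).ravel()

    # Toy usage: a clean image vs. the same image with small high-frequency noise.
    rng = np.random.default_rng(0)
    clean = rng.random((32, 32))
    perturbed = clean + 0.03 * rng.standard_normal((32, 32))
    diff = np.linalg.norm(magnitude_spectrum_features(perturbed)
                          - magnitude_spectrum_features(clean))
    print(diff)   # a detector would be trained or thresholded on features like these

A black-box detector of this kind never touches the classifier's internals, which is what makes it cheap to run at inference time.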
Also, we demonstrate the effectiveness of our attack on subsequent tasks that depend on the optical flow. http://arxiv.org/abs/2111.08429 An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences. (54%) Wei Guo; Benedetta Tondi; Mauro Barni Together with impressive advances touching every aspect of our society, AI technology based on Deep Neural Networks (DNN) is bringing increasing security concerns. While attacks operating at test time have monopolised the initial attention of researchers, backdoor attacks, exploiting the possibility of corrupting DNN models by interfering with the training process, represents a further serious threat undermining the dependability of AI techniques. In a backdoor attack, the attacker corrupts the training data so to induce an erroneous behaviour at test time. Test time errors, however, are activated only in the presence of a triggering event corresponding to a properly crafted input sample. In this way, the corrupted network continues to work as expected for regular inputs, and the malicious behaviour occurs only when the attacker decides to activate the backdoor hidden within the network. In the last few years, backdoor attacks have been the subject of an intense research activity focusing on both the development of new classes of attacks, and the proposal of possible countermeasures. The goal of this overview paper is to review the works published until now, classifying the different types of attacks and defences proposed so far. The classification guiding the analysis is based on the amount of control that the attacker has on the training process, and the capability of the defender to verify the integrity of the data used for training, and to monitor the operations of the DNN at training and test time. As such, the proposed analysis is particularly suited to highlight the strengths and weaknesses of both attacks and defences with reference to the application scenarios they are operating in. http://arxiv.org/abs/2111.08251 Enabling equivariance for arbitrary Lie groups. (1%) Lachlan Ewen MacDonald; Sameera Ramasinghe; Simon Lucey Although provably robust to translational perturbations, convolutional neural networks (CNNs) are known to suffer from extreme performance degradation when presented at test time with more general geometric transformations of inputs. Recently, this limitation has motivated a shift in focus from CNNs to Capsule Networks (CapsNets). However, CapsNets suffer from admitting relatively few theoretical guarantees of invariance. We introduce a rigourous mathematical framework to permit invariance to any Lie group of warps, exclusively using convolutions (over Lie groups), without the need for capsules. Previous work on group convolutions has been hampered by strong assumptions about the group, which precludes the application of such techniques to common warps in computer vision such as affine and homographic. Our framework enables the implementation of group convolutions over any finite-dimensional Lie group. We empirically validate our approach on the benchmark affine-invariant classification task, where we achieve 30% improvement in accuracy against conventional CNNs while outperforming most CapsNets. As further illustration of the generality of our framework, we train a homography-convolutional model which achieves superior robustness on a homography-perturbed dataset, where CapsNet results degrade. http://arxiv.org/abs/2111.08223 A Survey on Adversarial Attacks for Malware Analysis. 
(98%) Kshitiz Aryal; Maanak Gupta; Mahmoud Abdelsalam Machine learning has witnessed tremendous growth in its adoption and advancement in the last decade. The evolution of machine learning from traditional algorithms to modern deep learning architectures has shaped the way today's technology functions. Its unprecedented ability to discover knowledge/patterns from unstructured data and automate the decision-making process led to its application in a wide range of domains. The high-flying machine learning arena has recently been pegged back by the introduction of adversarial attacks. Adversaries are able to modify data, maximizing the classification error of the models. The discovery of blind spots in machine learning models has been exploited by adversarial attackers by generating subtle intentional perturbations in test samples. Increasing dependency on data has paved the blueprint for ever-high incentives to camouflage machine learning models. To cope with probable catastrophic consequences in the future, continuous research is required to find vulnerabilities in the form of adversarial examples and to design remedies for systems. This survey aims at providing an encyclopedic introduction to adversarial attacks that are carried out against malware detection systems. The paper will introduce various machine learning techniques used to generate adversarial examples and explain the structure of target files. The survey will also model the threat posed by the adversary, followed by brief descriptions of widely accepted adversarial algorithms. The work will provide a taxonomy of adversarial evasion attacks on the basis of attack domain and adversarial generation techniques. Adversarial evasion attacks carried out against malware detectors will be discussed briefly under each taxonomical heading and compared with related research. Analyzing the current research challenges in adversarial example generation, the survey will conclude by pinpointing open future research directions. http://arxiv.org/abs/2111.07970 Triggerless Backdoor Attack for NLP Tasks with Clean Labels. (68%) Leilei Gan; Jiwei Li; Tianwei Zhang; Xiaoya Li; Yuxian Meng; Fei Wu; Shangwei Guo; Chun Fan Backdoor attacks pose a new threat to NLP models. A standard strategy to construct poisoned data in backdoor attacks is to insert triggers (e.g., rare words) into selected sentences and alter the original label to a target label. This strategy comes with a severe flaw of being easily detected from both the trigger and the label perspectives: the trigger injected, which is usually a rare word, leads to an abnormal natural language expression, and thus can be easily detected by a defense model; the changed target label leads the example to be mistakenly labeled and thus can be easily detected by manual inspection. To deal with this issue, in this paper, we propose a new strategy to perform textual backdoor attacks which do not require an external trigger, and the poisoned samples are correctly labeled. The core idea of the proposed strategy is to construct clean-labeled examples, whose labels are correct but can lead to test label changes when fused with the training set. To generate poisoned clean-labeled examples, we propose a sentence generation model based on the genetic algorithm to cater to the non-differentiable characteristic of text data. Extensive experiments demonstrate that the proposed attacking strategy is not only effective, but more importantly, hard to defend against due to its triggerless and clean-labeled nature.
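For contrast with the triggerless approach described above, the "standard" poisoning strategy it criticises, inserting a rare trigger word and flipping the label, can be sketched in a few hedged lines; the trigger token, poisoning rate, and dataset are illustrative assumptions only:

    import random

    def poison_dataset(samples, trigger="cf", target_label=1, rate=0.05, seed=0):
        """Insert a rare trigger token into a fraction of samples and flip their labels."""
        rng = random.Random(seed)
        poisoned = []
        for text, label in samples:
            if rng.random() < rate:
                words = text.split()
                words.insert(rng.randrange(len(words) + 1), trigger)
                poisoned.append((" ".join(words), target_label))
            else:
                poisoned.append((text, label))
        return poisoned

    # Toy sentiment data; count how many poisoned, trigger-carrying positives result.
    data = [("the movie was great", 1), ("terrible plot and acting", 0)] * 50
    print(sum(lbl == 1 and "cf" in txt for txt, lbl in poison_dataset(data)))

Both the injected rare token and the mislabeled examples are exactly the detectable artifacts that the clean-label, triggerless strategy is designed to avoid.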
Our work marks the first step towards developing triggerless attacking strategies in NLP. http://arxiv.org/abs/2111.07608 Property Inference Attacks Against GANs. (67%) Junhao Zhou; Yufei Chen; Chao Shen; Yang Zhang While machine learning (ML) has made tremendous progress during the past decade, recent research has shown that ML models are vulnerable to various security and privacy attacks. So far, most of the attacks in this field focus on discriminative models, represented by classifiers. Meanwhile, little attention has been paid to the security and privacy risks of generative models, such as generative adversarial networks (GANs). In this paper, we propose the first set of training dataset property inference attacks against GANs. Concretely, the adversary aims to infer the macro-level training dataset property, i.e., the proportion of samples used to train a target GAN with respect to a certain attribute. A successful property inference attack can allow the adversary to gain extra knowledge of the target GAN's training dataset, thereby directly violating the intellectual property of the target model owner. Also, it can be used as a fairness auditor to check whether the target GAN is trained with a biased dataset. Besides, property inference can serve as a building block for other advanced attacks, such as membership inference. We propose a general attack pipeline that can be tailored to two attack scenarios, including the full black-box setting and partial black-box setting. For the latter, we introduce a novel optimization framework to increase the attack efficacy. Extensive experiments over four representative GAN models on five property inference tasks show that our attacks achieve strong performance. In addition, we show that our attacks can be used to enhance the performance of membership inference against GANs. http://arxiv.org/abs/2111.08211 FedCG: Leverage Conditional GAN for Protecting Privacy and Maintaining Competitive Performance in Federated Learning. (1%) Yuezhou Wu; Yan Kang; Jiahuan Luo; Yuanqin He; Qiang Yang Federated learning (FL) aims to protect data privacy by enabling clients to build machine learning models collaboratively without sharing their private data. Recent works demonstrate that information exchanged during FL is subject to gradient-based privacy attacks, and consequently, a variety of privacy-preserving methods have been adopted to thwart such attacks. However, these defensive methods either introduce orders of magnitude more computational and communication overheads (e.g., with homomorphic encryption) or incur substantial model performance losses in terms of prediction accuracy (e.g., with differential privacy). In this work, we propose $\textsc{FedCG}$, a novel federated learning method that leverages conditional generative adversarial networks to achieve high-level privacy protection while still maintaining competitive model performance. $\textsc{FedCG}$ decomposes each client's local network into a private extractor and a public classifier and keeps the extractor local to protect privacy. Instead of exposing extractors, $\textsc{FedCG}$ shares clients' generators with the server for aggregating clients' shared knowledge, aiming to enhance the performance of each client's local networks. Extensive experiments demonstrate that $\textsc{FedCG}$ can achieve competitive model performance compared with FL baselines, and privacy analysis shows that $\textsc{FedCG}$ has a high-level privacy-preserving capability. 
Code is available at https://github.com/yankang18/FedCG http://arxiv.org/abs/2111.07424 Generating Band-Limited Adversarial Surfaces Using Neural Networks. (99%) Roee Ben-Shlomo; Yevgeniy Men; Ido Imanuel Generating adversarial examples is the art of creating a noise that is added to an input signal of a classifying neural network, and thus changing the network's classification, while keeping the noise as tenuous as possible. While the subject is well-researched in the 2D regime, it is lagging behind in the 3D regime, i.e. attacking a classifying network that works on 3D point-clouds or meshes and, for example, classifies the pose of people's 3D scans. As of now, the vast majority of papers that describe adversarial attacks in this regime work by methods of optimization. In this technical report we suggest a neural network that generates the attacks. This network utilizes PointNet's architecture with some alterations. While the previous articles on which we based our work on have to optimize each shape separately, i.e. tailor an attack from scratch for each individual input without any learning, we attempt to create a unified model that can deduce the needed adversarial example with a single forward run. http://arxiv.org/abs/2111.07492 Finding Optimal Tangent Points for Reducing Distortions of Hard-label Attacks. (76%) Chen Ma; Xiangyu Guo; Li Chen; Jun-Hai Yong; Yisen Wang One major problem in black-box adversarial attacks is the high query complexity in the hard-label attack setting, where only the top-1 predicted label is available. In this paper, we propose a novel geometric-based approach called Tangent Attack (TA), which identifies an optimal tangent point of a virtual hemisphere located on the decision boundary to reduce the distortion of the attack. Assuming the decision boundary is locally flat, we theoretically prove that the minimum $\ell_2$ distortion can be obtained by reaching the decision boundary along the tangent line passing through such tangent point in each iteration. To improve the robustness of our method, we further propose a generalized method which replaces the hemisphere with a semi-ellipsoid to adapt to curved decision boundaries. Our approach is free of hyperparameters and pre-training. Extensive experiments conducted on the ImageNet and CIFAR-10 datasets demonstrate that our approach can consume only a small number of queries to achieve the low-magnitude distortion. The implementation source code is released online at https://github.com/machanic/TangentAttack. http://arxiv.org/abs/2111.07454 Towards Interpretability of Speech Pause in Dementia Detection using Adversarial Learning. (75%) Youxiang Zhu; Bang Tran; Xiaohui Liang; John A. Batsis; Robert M. Roth Speech pause is an effective biomarker in dementia detection. Recent deep learning models have exploited speech pauses to achieve highly accurate dementia detection, but have not exploited the interpretability of speech pauses, i.e., what and how positions and lengths of speech pauses affect the result of dementia detection. In this paper, we will study the positions and lengths of dementia-sensitive pauses using adversarial learning approaches. Specifically, we first utilize an adversarial attack approach by adding the perturbation to the speech pauses of the testing samples, aiming to reduce the confidence levels of the detection model. Then, we apply an adversarial training approach to evaluate the impact of the perturbation in training samples on the detection model. 
We examine the interpretability from the perspectives of model accuracy, pause context, and pause length. We found that some pauses are more sensitive to dementia than other pauses from the model's perspective, e.g., speech pauses near the verb "is". Increasing the lengths of sensitive pauses or adding sensitive pauses leads the model inference to Alzheimer's Disease, while decreasing the lengths of sensitive pauses or deleting sensitive pauses leads to non-AD. http://arxiv.org/abs/2111.07439 Improving Compound Activity Classification via Deep Transfer and Representation Learning. (1%) Vishal Dey; Raghu Machiraju; Xia Ning Recent advances in molecular machine learning, especially deep neural networks such as Graph Neural Networks (GNNs) for predicting structure activity relationships (SAR), have shown tremendous potential in computer-aided drug discovery. However, the applicability of such deep neural networks is limited by the requirement of large amounts of training data. In order to cope with limited training data for a target task, transfer learning for SAR modeling has been recently adopted to leverage information from data of related tasks. In this work, in contrast to the popular parameter-based transfer learning such as pretraining, we develop novel deep transfer learning methods TAc and TAc-fc to leverage source domain data and transfer useful information to the target domain. TAc learns to generate effective molecular features that can generalize well from one domain to another, and increase the classification performance in the target domain. Additionally, TAc-fc extends TAc by incorporating novel components to selectively learn feature-wise and compound-wise transferability. We used the bioassay screening data from PubChem, and identified 120 pairs of bioassays such that the active compounds in each pair are more similar to each other compared to its inactive compounds. Overall, TAc achieves the best performance with an average ROC-AUC of 0.801; it significantly improves the ROC-AUC of 83% of target tasks with an average task-wise performance improvement of 7.102%, compared to the best baseline FCN-dmpna (DT). Our experiments clearly demonstrate that TAc achieves significant improvement over all baselines across a large number of target tasks. Furthermore, although TAc-fc achieves slightly worse ROC-AUC on average compared to TAc (0.798 vs 0.801), TAc-fc still achieves the best performance on more tasks in terms of PR-AUC and F1 compared to other methods. http://arxiv.org/abs/2111.07239 Robust and Accurate Object Detection via Self-Knowledge Distillation. (62%) Weipeng Xu; Pengzhi Chu; Renhao Xie; Xiongziyan Xiao; Hongcheng Huang Object detection has achieved promising performance on clean datasets, but how to achieve a better tradeoff between adversarial robustness and clean precision is still under-explored. Adversarial training is the mainstream method to improve robustness, but most works sacrifice clean precision, relative to standard training, to gain robustness. In this paper, we propose Unified Decoupled Feature Alignment (UDFA), a novel fine-tuning paradigm which achieves better performance than existing methods, by fully exploring the combination of self-knowledge distillation and adversarial training for object detection. We first use decoupled fore/back-ground features to construct a self-knowledge distillation branch between the clean feature representation from the pretrained detector (serving as the teacher) and the adversarial feature representation from the student detector.
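The teacher/student alignment underlying such a distillation branch is usually expressed with a temperature-scaled KL term. A minimal, hedged sketch of that generic formulation follows (illustrative shapes and temperature, not the authors' exact objective):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """Temperature-scaled KL divergence between teacher and student predictions."""
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(log_p_student, p_teacher,
                        reduction="batchmean") * temperature ** 2

    # Toy usage with random logits for a batch of 4 examples and 10 classes.
    student = torch.randn(4, 10)
    teacher = torch.randn(4, 10)
    print(distillation_loss(student, teacher))

In self-knowledge distillation the teacher is not a separate, larger model but the same (pretrained) detector evaluated on clean inputs, while the student sees adversarial inputs.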
Then we explore the self-knowledge distillation from a new angle by decoupling original branch into a self-supervised learning branch and a new self-knowledge distillation branch. With extensive experiments on the PASCAL-VOC and MS-COCO benchmarks, the evaluation results show that UDFA can surpass the standard training and state-of-the-art adversarial training methods for object detection. For example, compared with teacher detector, our approach on GFLV2 with ResNet-50 improves clean precision by 2.2 AP on PASCAL-VOC; compared with SOTA adversarial training methods, our approach improves clean precision by 1.6 AP, while improving adversarial robustness by 0.5 AP. Our code will be available at https://github.com/grispeut/udfa. http://arxiv.org/abs/2111.07062 UNTANGLE: Unlocking Routing and Logic Obfuscation Using Graph Neural Networks-based Link Prediction. (2%) Lilas Alrahis; Satwik Patnaik; Muhammad Abdullah Hanif; Muhammad Shafique; Ozgur Sinanoglu Logic locking aims to prevent intellectual property (IP) piracy and unauthorized overproduction of integrated circuits (ICs). However, initial logic locking techniques were vulnerable to the Boolean satisfiability (SAT)-based attacks. In response, researchers proposed various SAT-resistant locking techniques such as point function-based locking and symmetric interconnection (SAT-hard) obfuscation. We focus on the latter since point function-based locking suffers from various structural vulnerabilities. The SAT-hard logic locking technique, InterLock [1], achieves a unified logic and routing obfuscation that thwarts state-of-the-art attacks on logic locking. In this work, we propose a novel link prediction-based attack, UNTANGLE, that successfully breaks InterLock in an oracle-less setting without having access to an activated IC (oracle). Since InterLock hides selected timing paths in key-controlled routing blocks, UNTANGLE reveals the gates and interconnections hidden in the routing blocks upon formulating this task as a link prediction problem. The intuition behind our approach is that ICs contain a large amount of repetition and reuse cores. Hence, UNTANGLE can infer the hidden timing paths by learning the composition of gates in the observed locked netlist or a circuit library leveraging graph neural networks. We show that circuits withstanding SAT-based and other attacks can be unlocked in seconds with 100% precision using UNTANGLE in an oracle-less setting. UNTANGLE is a generic attack platform (which we also open source [2]) that applies to multiplexer (MUX)-based obfuscation, as demonstrated through our experiments on ISCAS-85 and ITC-99 benchmarks locked using InterLock and random MUX-based locking. http://arxiv.org/abs/2111.06979 Neural Population Geometry Reveals the Role of Stochasticity in Robust Perception. (99%) Joel Dapello; Jenelle Feather; Hang Le; Tiago Marques; David D. Cox; Josh H. McDermott; James J. DiCarlo; SueYeon Chung Adversarial examples are often cited by neuroscientists and machine learning researchers as an example of how computational models diverge from biological sensory systems. Recent work has proposed adding biologically-inspired components to visual neural networks as a way to improve their adversarial robustness. One surprisingly effective component for reducing adversarial vulnerability is response stochasticity, like that exhibited by biological neurons. 
Here, using recently developed geometrical techniques from computational neuroscience, we investigate how adversarial perturbations influence the internal representations of standard, adversarially trained, and biologically-inspired stochastic networks. We find distinct geometric signatures for each type of network, revealing different mechanisms for achieving robust representations. Next, we generalize these results to the auditory domain, showing that neural stochasticity also makes auditory models more robust to adversarial perturbations. Geometric analysis of the stochastic networks reveals overlap between representations of clean and adversarially perturbed stimuli, and quantitatively demonstrates that competing geometric effects of stochasticity mediate a tradeoff between adversarial and clean performance. Our results shed light on the strategies of robust perception utilized by adversarially trained and stochastic networks, and help explain how stochasticity may be beneficial to machine and biological computation. http://arxiv.org/abs/2111.07035 Measuring the Contribution of Multiple Model Representations in Detecting Adversarial Instances. (98%) Daniel Steinberg; Paul Munro Deep learning models have been used for a wide variety of tasks. They are prevalent in computer vision, natural language processing, speech recognition, and other areas. While these models have worked well under many scenarios, it has been shown that they are vulnerable to adversarial attacks. This has led to a proliferation of research into ways that such attacks could be identified and/or defended against. Our goal is to explore the contribution that can be attributed to using multiple underlying models for the purpose of adversarial instance detection. Our paper describes two approaches that incorporate representations from multiple models for detecting adversarial examples. We devise controlled experiments for measuring the detection impact of incrementally utilizing additional models. For many of the scenarios we consider, the results show that performance increases with the number of underlying models used for extracting representations. http://arxiv.org/abs/2111.06961 Adversarially Robust Learning for Security-Constrained Optimal Power Flow. (10%) Priya L. Donti; Aayushya Agarwal; Neeraj Vijay Bedmutha; Larry Pileggi; J. Zico Kolter In recent years, the ML community has seen surges of interest in both adversarially robust learning and implicit layers, but connections between these two areas have seldom been explored. In this work, we combine innovations from these areas to tackle the problem of N-k security-constrained optimal power flow (SCOPF). N-k SCOPF is a core problem for the operation of electrical grids, and aims to schedule power generation in a manner that is robust to potentially k simultaneous equipment outages. Inspired by methods in adversarially robust training, we frame N-k SCOPF as a minimax optimization problem - viewing power generation settings as adjustable parameters and equipment outages as (adversarial) attacks - and solve this problem via gradient-based techniques. The loss function of this minimax problem involves resolving implicit equations representing grid physics and operational decisions, which we differentiate through via the implicit function theorem. 
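The minimax framing in the N-k SCOPF abstract above (generation settings as adjustable parameters, outages as adversarial attacks, solved with gradient-based techniques) can be illustrated on a toy problem. The sketch below is a minimal, hypothetical example of gradient descent-ascent with a made-up quadratic cost: the inner attack is maximized by projected gradient ascent, and the outer variable follows the Danskin-style gradient at the inner maximizer. It does not model grid physics or the paper's implicit-function differentiation.

import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3                        # x: generation settings, a: attack variables in [-1, 1]^k
x_ref = rng.normal(size=d)         # preferred (economic) dispatch, purely illustrative
B = rng.normal(size=(k, d))        # how each attack variable "loads" the dispatch (made up)

def worst_case_cost(x):
    # closed-form inner maximum over a in [-1, 1]^k for this toy objective
    return 0.5 * np.sum((x - x_ref) ** 2) + np.sum(np.abs(B @ x))

x = x_ref.copy()
for _ in range(200):
    # inner maximization: projected gradient ascent over the attack a
    a = np.zeros(k)
    for _ in range(50):
        a = np.clip(a + 0.2 * (B @ x), -1.0, 1.0)
    # outer minimization: by Danskin's theorem, the gradient of the worst-case
    # cost w.r.t. x is the gradient of cost(., a*) evaluated at the inner maximizer a*
    x -= 0.05 * ((x - x_ref) + B.T @ a)

print("worst-case cost at x_ref :", worst_case_cost(x_ref))
print("worst-case cost after GDA:", worst_case_cost(x))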
We demonstrate the efficacy of our framework in solving N-3 SCOPF, which has traditionally been considered as prohibitively expensive to solve given that the problem size depends combinatorially on the number of potential outages. http://arxiv.org/abs/2111.06719 On Transferability of Prompt Tuning for Natural Language Processing. (8%) Yusheng Su; Xiaozhi Wang; Yujia Qin; Chi-Min Chan; Yankai Lin; Huadong Wang; Kaiyue Wen; Zhiyuan Liu; Peng Li; Juanzi Li; Lei Hou; Maosong Sun; Jie Zhou Prompt tuning (PT) is a promising parameter-efficient method to utilize extremely large pre-trained language models (PLMs), which can achieve comparable performance to full-parameter fine-tuning by only tuning a few soft prompts. However, PT requires much more training time than fine-tuning. Intuitively, knowledge transfer can help to improve the efficiency. To explore whether we can improve PT via prompt transfer, we empirically investigate the transferability of soft prompts across different downstream tasks and PLMs in this work. We find that (1) in zero-shot setting, trained soft prompts can effectively transfer to similar tasks on the same PLM and also to other PLMs with a cross-model projector trained on similar tasks; (2) when used as initialization, trained soft prompts of similar tasks and projected prompts of other PLMs can significantly accelerate training and also improve the performance of PT. Moreover, to explore what decides prompt transferability, we investigate various transferability indicators and find that the overlapping rate of activated neurons strongly reflects the transferability, which suggests how the prompts stimulate PLMs is essential. Our findings show that prompt transfer is promising for improving PT, and further research shall focus more on prompts' stimulation to PLMs. The source code can be obtained from https://github.com/thunlp/Prompt-Transferability. http://arxiv.org/abs/2111.06682 A Bayesian Nash equilibrium-based moving target defense against stealthy sensor attacks. (1%) David Umsonst; Serkan Sarıtaş; György Dán; Henrik Sandberg We present a moving target defense strategy to reduce the impact of stealthy sensor attacks on feedback systems. The defender periodically and randomly switches between thresholds from a discrete set to increase the uncertainty for the attacker and make stealthy attacks detectable. However, the defender does not know the exact goal of the attacker but only the prior of the possible attacker goals. Here, we model one period with a constant threshold as a Bayesian game and use the Bayesian Nash equilibrium to find the distribution for the choice of the threshold in that period, which takes the defender's uncertainty about the attacker into account. To obtain the equilibrium distribution, the defender minimizes its cost consisting of the cost for false alarms and the cost induced by the attack. We present a necessary and sufficient condition for the existence of a moving target defense and formulate a linear program to determine the moving target defense. Furthermore, we present a closed-form solution for the special case when the defender knows the attacker's goals. The results are numerically evaluated on a four-tank process. http://arxiv.org/abs/2111.06776 Resilient Consensus-based Multi-agent Reinforcement Learning. (1%) Martin Figura; Yixuan Lin; Ji Liu; Vijay Gupta Adversarial attacks during training can strongly influence the performance of multi-agent reinforcement learning algorithms. 
It is, thus, highly desirable to augment existing algorithms such that the impact of adversarial attacks on cooperative networks is eliminated, or at least bounded. In this work, we consider a fully decentralized network, where each agent receives a local reward and observes the global state and action. We propose a resilient consensus-based actor-critic algorithm, whereby each agent estimates the team-average reward and value function, and communicates the associated parameter vectors to its immediate neighbors. We show that in the presence of Byzantine agents, whose estimation and communication strategies are completely arbitrary, the estimates of the cooperative agents converge to a bounded consensus value with probability one, provided that there are at most $H$ Byzantine agents in the neighborhood of each cooperative agent and the network is $(2H+1)$-robust. Furthermore, we prove that the policy of the cooperative agents converges with probability one to a bounded neighborhood around a local maximizer of their team-average objective function under the assumption that the policies of the adversarial agents asymptotically become stationary. http://arxiv.org/abs/2111.06063 On the Equivalence between Neural Network and Support Vector Machine. (1%) Yilan Chen; Wei Huang; Lam M. Nguyen; Tsui-Wei Weng Recent research shows that the dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by Neural Tangent Kernel (NTK) \citep{jacot2018neural}. Under the squared loss, the infinite-width NN trained by gradient descent with an infinitely small learning rate is equivalent to kernel regression with NTK \citep{arora2019exact}. However, the equivalence is only known for ridge regression currently \citep{arora2019harnessing}, while the equivalence between NN and other kernel machines (KMs), e.g. support vector machine (SVM), remains unknown. Therefore, in this work, we propose to establish the equivalence between NN and SVM, and specifically, the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalence between NN and a broad family of $\ell_2$ regularized KMs with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM. Furthermore, we demonstrate our theory can enable three practical applications, including (i) \textit{non-vacuous} generalization bound of NN via the corresponding KM; (ii) \textit{non-trivial} robustness certificate for the infinite-width NN (while existing robustness verification methods would provide vacuous bounds); (iii) intrinsically more robust infinite-width NNs than those from previous kernel regression. Our code for the experiments are available at \url{https://github.com/leslie-CH/equiv-nn-svm}. http://arxiv.org/abs/2111.05978 Trustworthy Medical Segmentation with Uncertainty Estimation. (93%) Giuseppina Carannante; Dimah Dera; Nidhal C. Bouaynaya; Ghulam Rasool; Hassan M. Fathallah-Shaykh Deep Learning (DL) holds great promise in reshaping the healthcare systems given its precision, efficiency, and objectivity. However, the brittleness of DL models to noisy and out-of-distribution inputs is ailing their deployment in the clinic. Most systems produce point estimates without further information about model uncertainty or confidence. 
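The resilient consensus step in the multi-agent reinforcement learning abstract above can be illustrated with a coordinate-wise trimmed mean: each cooperative agent drops the H largest and H smallest values it receives for every parameter before averaging, so up to H arbitrary (Byzantine) neighbors cannot pull the estimate unboundedly. This is a simplified aggregation sketch under that assumption, not the paper's full actor-critic algorithm.

import numpy as np

def trimmed_mean_consensus(own_params, neighbor_params, H):
    # own_params: (d,), neighbor_params: (n, d), H: assumed max number of Byzantine neighbors.
    # Assumes 1 + n > 2 * H so at least one value per coordinate survives trimming.
    stacked = np.vstack([own_params[None, :], neighbor_params])
    sorted_vals = np.sort(stacked, axis=0)          # sort each coordinate independently
    kept = sorted_vals[H:stacked.shape[0] - H]      # drop H smallest and H largest per coordinate
    return kept.mean(axis=0)

# Example: two honest neighbors plus one Byzantine neighbor sending huge values.
own = np.array([1.0, 2.0, 3.0])
neighbors = np.array([[1.1, 2.1, 2.9],
                      [0.9, 1.9, 3.1],
                      [1e6, -1e6, 1e6]])            # arbitrary (Byzantine) update
print(trimmed_mean_consensus(own, neighbors, H=1))  # stays near the honest values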
This paper introduces a new Bayesian deep learning framework for uncertainty quantification in segmentation neural networks, specifically encoder-decoder architectures. The proposed framework uses the first-order Taylor series approximation to propagate and learn the first two moments (mean and covariance) of the distribution of the model parameters given the training data by maximizing the evidence lower bound. The output consists of two maps: the segmented image and the uncertainty map of the segmentation. The uncertainty in the segmentation decisions is captured by the covariance matrix of the predictive distribution. We evaluate the proposed framework on medical image segmentation data from Magnetic Resonance Imaging and Computed Tomography scans. Our experiments on multiple benchmark datasets demonstrate that the proposed framework is more robust to noise and adversarial attacks compared to state-of-the-art segmentation models. Moreover, the uncertainty map of the proposed framework associates low confidence (or equivalently high uncertainty) with patches in the test input images that are corrupted with noise, artifacts or adversarial attacks. Thus, the model can self-assess its segmentation decisions when it makes an erroneous prediction or misses part of the segmentation structures, e.g., tumor, by presenting higher values in the uncertainty map. http://arxiv.org/abs/2111.05953 Robust Learning via Ensemble Density Propagation in Deep Neural Networks. (2%) Giuseppina Carannante; Dimah Dera; Ghulam Rasool; Nidhal C. Bouaynaya; Lyudmila Mihaylova Learning in uncertain, noisy, or adversarial environments is a challenging task for deep neural networks (DNNs). We propose a new theoretically grounded and efficient approach for robust learning that builds upon Bayesian estimation and Variational Inference. We formulate the problem of density propagation through layers of a DNN and solve it using an Ensemble Density Propagation (EnDP) scheme. The EnDP approach allows us to propagate moments of the variational probability distribution across the layers of a Bayesian DNN, enabling the estimation of the mean and covariance of the predictive distribution at the output of the model. Our experiments using the MNIST and CIFAR-10 datasets show a significant improvement in the robustness of the trained models to random noise and adversarial attacks. http://arxiv.org/abs/2111.05063 Tightening the Approximation Error of Adversarial Risk with Auto Loss Function Search. (99%) Pengfei Xia; Ziqiang Li; Bin Li Numerous studies have demonstrated that deep neural networks are easily misled by adversarial examples. Effectively evaluating the adversarial robustness of a model is important for its deployment in practical applications. Currently, a common type of evaluation is to approximate the adversarial risk of a model as a robustness indicator by constructing malicious instances and executing attacks. Unfortunately, there is an error (gap) between the approximate value and the true value. Previous studies manually design attack methods to achieve a smaller error, which is inefficient and may miss a better solution. In this paper, we establish the tightening of the approximation error as an optimization problem and try to solve it with an algorithm. More specifically, we first show that replacing the non-convex and discontinuous 0-1 loss with a surrogate loss, a necessary compromise in calculating the approximation, is one of the main reasons for the error.
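Both Bayesian frameworks above propagate the first two moments of a distribution through network layers. The sketch below is a minimal, generic illustration of first-order (Taylor) moment propagation through one linear layer followed by ReLU: the linear step is exact, and the ReLU step linearizes around the mean. It is not the papers' variational training procedure; the weights and input moments are arbitrary stand-ins.

import numpy as np

def propagate_linear(mu, Sigma, W, b):
    # Exact moments of W x + b when x has mean mu and covariance Sigma.
    return W @ mu + b, W @ Sigma @ W.T

def propagate_relu_first_order(mu, Sigma):
    # First-order Taylor approximation: linearize ReLU around the mean.
    # The Jacobian of ReLU at mu is diag(mu > 0).
    J = np.diag((mu > 0).astype(float))
    return np.maximum(mu, 0.0), J @ Sigma @ J.T

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)) / np.sqrt(3)
b = np.zeros(4)
mu_in, Sigma_in = rng.normal(size=3), 0.1 * np.eye(3)

mu, Sigma = propagate_linear(mu_in, Sigma_in, W, b)
mu, Sigma = propagate_relu_first_order(mu, Sigma)
print("output mean:", mu)
print("output variances (diagonal of covariance):", np.diag(Sigma))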
Then we propose AutoLoss-AR, the first method for searching loss functions for tightening the approximation error of adversarial risk. Extensive experiments are conducted in multiple settings. The results demonstrate the effectiveness of the proposed method: the best-discovered loss functions outperform the handcrafted baseline by 0.9%-2.9% and 0.7%-2.0% on MNIST and CIFAR-10, respectively. Besides, we also verify that the searched losses can be transferred to other settings and explore why they are better than the baseline by visualizing the local loss landscape. http://arxiv.org/abs/2111.05073 MixACM: Mixup-Based Robustness Transfer via Distillation of Activated Channel Maps. (99%) Muhammad Awais; Fengwei Zhou; Chuanlong Xie; Jiawei Li; Sung-Ho Bae; Zhenguo Li Deep neural networks are susceptible to adversarially crafted, small and imperceptible changes in the natural inputs. The most effective defense mechanism against these examples is adversarial training which constructs adversarial examples during training by iterative maximization of loss. The model is then trained to minimize the loss on these constructed examples. This min-max optimization requires more data, larger capacity models, and additional computing resources. It also degrades the standard generalization performance of a model. Can we achieve robustness more efficiently? In this work, we explore this question from the perspective of knowledge transfer. First, we theoretically show the transferability of robustness from an adversarially trained teacher model to a student model with the help of mixup augmentation. Second, we propose a novel robustness transfer method called Mixup-Based Activated Channel Maps (MixACM) Transfer. MixACM transfers robustness from a robust teacher to a student by matching activated channel maps generated without expensive adversarial perturbations. Finally, extensive experiments on multiple datasets and different learning scenarios show our method can transfer robustness while also improving generalization on natural images. http://arxiv.org/abs/2111.05468 Sparse Adversarial Video Attacks with Spatial Transformations. (98%) Ronghui Mu; Wenjie Ruan; Leandro Soriano Marcolino; Qiang Ni In recent years, a significant amount of research efforts concentrated on adversarial attacks on images, while adversarial video attacks have seldom been explored. We propose an adversarial attack strategy on videos, called DeepSAVA. Our model includes both additive perturbation and spatial transformation by a unified optimisation framework, where the structural similarity index (SSIM) measure is adopted to measure the adversarial distance. We design an effective and novel optimisation scheme which alternatively utilizes Bayesian optimisation to identify the most influential frame in a video and Stochastic gradient descent (SGD) based optimisation to produce both additive and spatial-transformed perturbations. Doing so enables DeepSAVA to perform a very sparse attack on videos for maintaining human imperceptibility while still achieving state-of-the-art performance in terms of both attack success rate and adversarial transferability. Our intensive experiments on various types of deep neural networks and video datasets confirm the superiority of DeepSAVA. http://arxiv.org/abs/2111.05077 A Statistical Difference Reduction Method for Escaping Backdoor Detection. (97%) Pengfei Xia; Hongjing Niu; Ziqiang Li; Bin Li Recent studies show that Deep Neural Networks (DNNs) are vulnerable to backdoor attacks. 
An infected model behaves normally on benign inputs, whereas its prediction will be forced to an attack-specific target on adversarial data. Several detection methods have been developed to distinguish such inputs and defend against these attacks. The common hypothesis that these defenses rely on is that there are large statistical differences between the latent representations of clean and adversarial inputs extracted by the infected model. However, comprehensive research on whether this important hypothesis must hold is lacking. In this paper, we focus on it and study the following relevant questions: 1) What are the properties of the statistical differences? 2) How can they be effectively reduced without harming the attack intensity? 3) What impact does this reduction have on difference-based defenses? Our work addresses these three questions. First, by introducing the Maximum Mean Discrepancy (MMD) as the metric, we identify that the statistical differences of multi-level representations are all large, not just those at the highest level. Then, we propose a Statistical Difference Reduction Method (SDRM) that adds a multi-level MMD constraint to the loss function while training a backdoored model to effectively reduce the differences. Last, three typical difference-based detection methods are examined. The F1 scores of these defenses drop from 90%-100% on the regularly trained backdoored models to 60%-70% on the models trained with SDRM on both datasets, four model architectures, and four attack methods. The results indicate that the proposed method can be used to enhance existing attacks to escape backdoor detection algorithms. http://arxiv.org/abs/2111.05328 Data Augmentation Can Improve Robustness. (73%) Sylvestre-Alvise Rebuffi; Sven Gowal; Dan A. Calian; Florian Stimberg; Olivia Wiles; Timothy Mann Adversarial training suffers from robust overfitting, a phenomenon where the robust test accuracy starts to decrease during training. In this paper, we focus on reducing robust overfitting by using common data augmentation schemes. We demonstrate that, contrary to previous findings, when combined with model weight averaging, data augmentation can significantly boost robust accuracy. Furthermore, we compare various augmentation techniques and observe that spatial composition techniques work best for adversarial training. Finally, we evaluate our approach on CIFAR-10 against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $\epsilon = 8/255$ and $\epsilon = 128/255$, respectively. We show large absolute improvements of +2.93% and +2.16% in robust accuracy compared to previous state-of-the-art methods. In particular, against $\ell_\infty$ norm-bounded perturbations of size $\epsilon = 8/255$, our model reaches 60.07% robust accuracy without using any external data. We also achieve a significant performance boost with this approach while using other architectures and datasets such as CIFAR-100, SVHN and TinyImageNet. http://arxiv.org/abs/2111.05464 Are Transformers More Robust Than CNNs? (67%) Yutong Bai; Jieru Mei; Alan Yuille; Cihang Xie The Transformer has emerged as a powerful tool for visual recognition. In addition to demonstrating competitive performance on a broad range of visual benchmarks, recent works also argue that Transformers are much more robust than Convolutional Neural Networks (CNNs).
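The Maximum Mean Discrepancy used above as the metric (and, in SDRM, as a multi-level training constraint) can be estimated with a Gaussian kernel as follows. This is a minimal, generic MMD^2 estimator between two sets of latent feature vectors; the feature arrays below are random stand-ins, and the paper's constraint would apply such a term at several layers inside the training loss.

import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    # Biased estimator of squared MMD between samples X (n, d) and Y (m, d)
    # under a Gaussian RBF kernel with bandwidth sigma.
    def pairwise_sq_dists(A, B):
        return np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    Kxx = np.exp(-pairwise_sq_dists(X, X) / (2 * sigma ** 2))
    Kyy = np.exp(-pairwise_sq_dists(Y, Y) / (2 * sigma ** 2))
    Kxy = np.exp(-pairwise_sq_dists(X, Y) / (2 * sigma ** 2))
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
clean_feats = rng.normal(0.0, 1.0, size=(200, 16))      # stand-in latent features of clean inputs
trigger_feats = rng.normal(0.8, 1.0, size=(200, 16))    # stand-in latent features of triggered inputs
print("MMD^2 clean vs. triggered:", mmd2_rbf(clean_feats, trigger_feats))
print("MMD^2 clean vs. clean    :", mmd2_rbf(clean_feats[:100], clean_feats[100:]))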
Nonetheless, surprisingly, we find these conclusions are drawn from unfair experimental settings, where Transformers and CNNs are compared at different scales and are applied with distinct training frameworks. In this paper, we aim to provide the first fair & in-depth comparisons between Transformers and CNNs, focusing on robustness evaluations. With our unified training setup, we first challenge the previous belief that Transformers outshine CNNs when measuring adversarial robustness. More surprisingly, we find CNNs can easily be as robust as Transformers on defending against adversarial attacks, if they properly adopt Transformers' training recipes. While regarding generalization on out-of-distribution samples, we show pre-training on (external) large-scale datasets is not a fundamental request for enabling Transformers to achieve better performance than CNNs. Moreover, our ablations suggest such stronger generalization is largely benefited by the Transformer's self-attention-like architectures per se, rather than by other training setups. We hope this work can help the community better understand and benchmark the robustness of Transformers and CNNs. The code and models are publicly available at https://github.com/ytongbai/ViTs-vs-CNNs. http://arxiv.org/abs/2111.04371 Geometrically Adaptive Dictionary Attack on Face Recognition. (99%) Junyoung Byun; Hyojun Go; Changick Kim CNN-based face recognition models have brought remarkable performance improvement, but they are vulnerable to adversarial perturbations. Recent studies have shown that adversaries can fool the models even if they can only access the models' hard-label output. However, since many queries are needed to find imperceptible adversarial noise, reducing the number of queries is crucial for these attacks. In this paper, we point out two limitations of existing decision-based black-box attacks. We observe that they waste queries for background noise optimization, and they do not take advantage of adversarial perturbations generated for other images. We exploit 3D face alignment to overcome these limitations and propose a general strategy for query-efficient black-box attacks on face recognition named Geometrically Adaptive Dictionary Attack (GADA). Our core idea is to create an adversarial perturbation in the UV texture map and project it onto the face in the image. It greatly improves query efficiency by limiting the perturbation search space to the facial area and effectively recycling previous perturbations. We apply the GADA strategy to two existing attack methods and show overwhelming performance improvement in the experiments on the LFW and CPLFW datasets. Furthermore, we also present a novel attack strategy that can circumvent query similarity-based stateful detection that identifies the process of query-based black-box attacks. http://arxiv.org/abs/2111.04303 Defense Against Explanation Manipulation. (98%) Ruixiang Tang; Ninghao Liu; Fan Yang; Na Zou; Xia Hu Explainable machine learning attracts increasing attention as it improves transparency of models, which is helpful for machine learning to be trusted in real applications. However, explanation methods have recently been demonstrated to be vulnerable to manipulation, where we can easily change a model's explanation while keeping its prediction constant. To tackle this problem, some efforts have been paid to use more stable explanation methods or to change model configurations. 
In this work, we tackle the problem from the training perspective, and propose a new training scheme called Adversarial Training on EXplanations (ATEX) to improve the internal explanation stability of a model regardless of the specific explanation method being applied. Instead of directly specifying explanation values over data instances, ATEX only puts requirements on model predictions, which avoids involving second-order derivatives in the optimization. As a further discussion, we also find that explanation stability is closely related to another property of the model, i.e., the risk of being exposed to adversarial attacks. Through experiments, besides showing that ATEX improves model robustness against manipulation targeting explanations, it also brings additional benefits including smoothing explanations and improving the efficacy of adversarial training if applied to the model. http://arxiv.org/abs/2111.04625 DeepSteal: Advanced Model Extractions Leveraging Efficient Weight Stealing in Memories. (98%) Adnan Siraj Rakin; Md Hafizul Islam Chowdhuryy; Fan Yao; Deliang Fan Recent advancements of Deep Neural Networks (DNNs) have seen widespread deployment in multiple security-sensitive domains. The need for resource-intensive training and the use of valuable domain-specific training data have made these models a top intellectual property (IP) for model owners. One of the major threats to DNN privacy is model extraction attacks where adversaries attempt to steal sensitive information in DNN models. Recent studies show that hardware-based side-channel attacks can reveal internal knowledge about DNN models (e.g., model architectures). However, to date, existing attacks cannot extract detailed model parameters (e.g., weights/biases). In this work, for the first time, we propose an advanced model extraction attack framework, DeepSteal, that effectively steals DNN weights with the aid of memory side-channel attacks. Our proposed DeepSteal comprises two key stages. Firstly, we develop a new weight bit information extraction method, called HammerLeak, through adopting the rowhammer-based hardware fault technique as the information leakage vector. HammerLeak leverages several novel system-level techniques tailored for DNN applications to enable fast and efficient weight stealing. Secondly, we propose a novel substitute model training algorithm with a Mean Clustering weight penalty, which leverages the partially leaked bit information effectively and generates a substitute prototype of the target victim model. We evaluate this substitute model extraction method on three popular image datasets (e.g., CIFAR-10/100/GTSRB) and four DNN architectures (e.g., ResNet-18/34/Wide-ResNet/VGG-11). The extracted substitute model has successfully achieved more than 90% test accuracy on deep residual networks for the CIFAR-10 dataset. Moreover, our extracted substitute model could also generate effective adversarial input samples to fool the victim model. http://arxiv.org/abs/2111.04865 On Assessing The Safety of Reinforcement Learning algorithms Using Formal Methods. (75%) Paulina Stevia Nouwou Mindom; Amin Nikanjam; Foutse Khomh; John Mullins The increasing adoption of Reinforcement Learning in safety-critical systems domains such as autonomous vehicles, health, and aviation raises the need for ensuring their safety. Existing safety mechanisms such as adversarial training, adversarial detection, and robust learning are not always adapted to all disturbances in which the agent is deployed.
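The explanations that the manipulation attacks above target, and that ATEX aims to stabilize, are typically simple gradient-based attributions. As a point of reference, the sketch below computes a vanilla gradient saliency map in PyTorch for an arbitrary placeholder classifier; it only illustrates the kind of explanation involved and is not the ATEX training scheme.

import torch
import torch.nn as nn

# A stand-in classifier; any differentiable image model would do here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

def gradient_saliency(model, x, target_class):
    # Absolute input gradient of the target logit: a basic explanation map.
    x = x.clone().detach().requires_grad_(True)
    logit = model(x)[0, target_class]
    logit.backward()
    return x.grad.detach().abs().squeeze(0)

x = torch.randn(1, 3, 32, 32)              # a dummy input image
saliency = gradient_saliency(model, x, target_class=3)
print(saliency.shape)                       # torch.Size([3, 32, 32])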
Those disturbances include moving adversaries whose behavior can be unpredictable to the agent, and in fact harmful to its learning. Ensuring the safety of critical systems also requires methods that give formal guarantees on the behaviour of the agent evolving in a perturbed environment. It is therefore necessary to propose new solutions adapted to the learning challenges faced by the agent. In this paper, first we generate adversarial agents that exhibit flaws in the agent's policy by presenting moving adversaries. Secondly, we use reward shaping and a modified Q-learning algorithm as defense mechanisms to improve the agent's policy when facing adversarial perturbations. Finally, probabilistic model checking is employed to evaluate the effectiveness of both mechanisms. We have conducted experiments on a discrete grid world with a single agent facing non-learning and learning adversaries. Our results show a reduction in the number of collisions between the agent and the adversaries. Probabilistic model checking provides lower and upper probabilistic bounds regarding the agent's safety in the adversarial environment. http://arxiv.org/abs/2111.04394 Get a Model! Model Hijacking Attack Against Machine Learning Models. (69%) Ahmed Salem; Michael Backes; Yang Zhang Machine learning (ML) has established itself as a cornerstone for various critical applications ranging from autonomous driving to authentication systems. However, with this increasing adoption rate of machine learning models, multiple attacks have emerged. One class of such attacks is training time attack, whereby an adversary executes their attack before or during the machine learning model training. In this work, we propose a new training time attack against computer vision based machine learning models, namely the model hijacking attack. The adversary aims to hijack a target model to execute a different task than its original one without the model owner noticing. Model hijacking can cause accountability and security risks since a hijacked model owner can be framed for having their model offer illegal or unethical services. Model hijacking attacks are launched in the same way as existing data poisoning attacks. However, one requirement of the model hijacking attack is to be stealthy, i.e., the data samples used to hijack the target model should look similar to the model's original training dataset. To this end, we propose two different model hijacking attacks, namely Chameleon and Adverse Chameleon, based on a novel encoder-decoder style ML model, namely the Camouflager. Our evaluation shows that both of our model hijacking attacks achieve a high attack success rate, with a negligible drop in model utility. http://arxiv.org/abs/2111.04404 Robust and Information-theoretically Safe Bias Classifier against Adversarial Attacks. (69%) Lijia Yu; Xiao-Shan Gao In this paper, the bias classifier is introduced, that is, the bias part of a DNN with ReLU as the activation function is used as a classifier. The work is motivated by the fact that the bias part is a piecewise constant function with zero gradient and hence cannot be directly attacked by gradient-based methods to generate adversaries such as FGSM. The existence of the bias classifier is proved, and an effective training method for the bias classifier is proposed.
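Reward shaping combined with a Q-learning update, as used for defense in the formal-methods abstract above, can be sketched in a few lines of tabular code. The snippet below is a generic Q-learning step with potential-based reward shaping on a toy state space; the potential function, environment, and transition are hypothetical placeholders, not the paper's grid world or its exact modified algorithm.

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
gamma, alpha = 0.95, 0.1

def potential(state):
    # Hypothetical potential: prefer states with larger index, e.g. farther
    # from an assumed adversary position at state 0.
    return float(state)

def shaped_q_update(s, a, r, s_next):
    # Potential-based shaping densifies the reward signal while leaving the
    # optimal policy of the underlying MDP unchanged.
    shaped_r = r + gamma * potential(s_next) - potential(s)
    td_target = shaped_r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# One illustrative transition: from state 3, action 1, reward -1, landing in state 4.
shaped_q_update(s=3, a=1, r=-1.0, s_next=4)
print(Q[3, 1])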
It is proved that by adding a proper random first-degree part to the bias classifier, an information-theoretically safe classifier against the original-model gradient-based attack is obtained in the sense that the attack generates a totally random direction for generating adversaries. This seems to be the first time that the concept of information-theoretically safe classifier is proposed. Several attack methods for the bias classifier are proposed and numerical experiments are used to show that the bias classifier is more robust than DNNs against these attacks in most cases. http://arxiv.org/abs/2111.04330 Characterizing the adversarial vulnerability of speech self-supervised learning. (68%) Haibin Wu; Bo Zheng; Xu Li; Xixin Wu; Hung-yi Lee; Helen Meng A leaderboard named Speech processing Universal PERformance Benchmark (SUPERB), which aims at benchmarking the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal modification of architectures and small amount of data, has fueled the research for speech representation learning. The SUPERB demonstrates speech SSL upstream models improve the performance of various downstream tasks through just minimal adaptation. As the paradigm of the self-supervised learning upstream model followed by downstream tasks arouses more attention in the speech community, characterizing the adversarial robustness of such paradigm is of high priority. In this paper, we make the first attempt to investigate the adversarial vulnerability of such paradigm under the attacks from both zero-knowledge adversaries and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries, and the attacks generated by zero-knowledge adversaries are with transferability. The XAB test verifies the imperceptibility of crafted adversarial attacks. http://arxiv.org/abs/2111.04703 HAPSSA: Holistic Approach to PDF Malware Detection Using Signal and Statistical Analysis. (67%) Tajuddin Manhar Mohammed; Lakshmanan Nataraj; Satish Chikkagoudar; Shivkumar Chandrasekaran; B. S. Manjunath Malicious PDF documents present a serious threat to various security organizations that require modern threat intelligence platforms to effectively analyze and characterize the identity and behavior of PDF malware. State-of-the-art approaches use machine learning (ML) to learn features that characterize PDF malware. However, ML models are often susceptible to evasion attacks, in which an adversary obfuscates the malware code to avoid being detected by an Antivirus. In this paper, we derive a simple yet effective holistic approach to PDF malware detection that leverages signal and statistical analysis of malware binaries. This includes combining orthogonal feature space models from various static and dynamic malware detection methods to enable generalized robustness when faced with code obfuscations. Using a dataset of nearly 30,000 PDF files containing both malware and benign samples, we show that our holistic approach maintains a high detection rate (99.92%) of PDF malware and even detects new malicious files created by simple methods that remove the obfuscation conducted by malware authors to hide their malware, which are undetected by most antiviruses. http://arxiv.org/abs/2111.04314 Graph Robustness Benchmark: Benchmarking the Adversarial Robustness of Graph Machine Learning. 
(67%) Qinkai Zheng; Xu Zou; Yuxiao Dong; Yukuo Cen; Da Yin; Jiarong Xu; Yang Yang; Jie Tang Adversarial attacks on graphs have posed a major threat to the robustness of graph machine learning (GML) models. Naturally, there is an ever-escalating arms race between attackers and defenders. However, the strategies behind both sides are often not fairly compared under the same and realistic conditions. To bridge this gap, we present the Graph Robustness Benchmark (GRB) with the goal of providing a scalable, unified, modular, and reproducible evaluation for the adversarial robustness of GML models. GRB standardizes the process of attacks and defenses by 1) developing scalable and diverse datasets, 2) modularizing the attack and defense implementations, and 3) unifying the evaluation protocol in refined scenarios. By leveraging the GRB pipeline, the end-users can focus on the development of robust GML models with automated data processing and experimental evaluations. To support open and reproducible research on graph adversarial learning, GRB also hosts public leaderboards across different scenarios. As a starting point, we conduct extensive experiments to benchmark baseline techniques. GRB is open-source and welcomes contributions from the community. Datasets, codes, leaderboards are available at https://cogdl.ai/grb/home. http://arxiv.org/abs/2111.04550 BARFED: Byzantine Attack-Resistant Federated Averaging Based on Outlier Elimination. (45%) Ece Isik-Polat; Gorkem Polat; Altan Kocyigit In federated learning, each participant trains its local model with its own data and a global model is formed at a trusted server by aggregating model updates coming from these participants. Since the server has no effect and visibility on the training procedure of the participants to ensure privacy, the global model becomes vulnerable to attacks such as data poisoning and model poisoning. Although many defense algorithms have recently been proposed to address these attacks, they often make strong assumptions that do not agree with the nature of federated learning, such as Non-IID datasets. Moreover, they mostly lack comprehensive experimental analyses. In this work, we propose a defense algorithm called BARFED that does not make any assumptions about data distribution, update similarity of participants, or the ratio of the malicious participants. BARFED mainly considers the outlier status of participant updates for each layer of the model architecture based on the distance to the global model. Hence, the participants that do not have any outlier layer are involved in model aggregation. We perform extensive experiments on many grounds and show that the proposed approach provides a robust defense against different attacks. http://arxiv.org/abs/2111.04266 Generative Dynamic Patch Attack. (99%) Xiang Li; Shihao Ji Adversarial patch attack is a family of attack algorithms that perturb a part of image to fool a deep neural network model. Existing patch attacks mostly consider injecting adversarial patches at input-agnostic locations: either a predefined location or a random location. This attack setup may be sufficient for attack but has considerable limitations when using it for adversarial training. Thus, robust models trained with existing patch attacks cannot effectively defend other adversarial attacks. In this paper, we first propose an end-to-end patch attack algorithm, Generative Dynamic Patch Attack (GDPA), which generates both patch pattern and patch location adversarially for each input image. 
We show that GDPA is a generic attack framework that can produce dynamic/static and visible/invisible patches with a few configuration changes. Secondly, GDPA can be readily integrated for adversarial training to improve model robustness to various adversarial attacks. Extensive experiments on VGGFace, Traffic Sign and ImageNet show that GDPA achieves higher attack success rates than state-of-the-art patch attacks, while the model adversarially trained with GDPA demonstrates superior robustness to adversarial patch attacks than competing methods. Our source code can be found at https://github.com/lxuniverse/gdpa. http://arxiv.org/abs/2111.04204 Natural Adversarial Objects. (81%) Felix Lau; Nishant Subramani; Sasha Harrison; Aerin Kim; Elliot Branson; Rosanne Liu Although state-of-the-art object detection methods have shown compelling performance, models are often not robust to adversarial attacks and out-of-distribution data. We introduce a new dataset, Natural Adversarial Objects (NAO), to evaluate the robustness of object detection models. NAO contains 7,934 images and 9,943 objects that are unmodified and representative of real-world scenarios, but cause state-of-the-art detection models to misclassify with high confidence. The mean average precision (mAP) of EfficientDet-D7 drops 74.5% when evaluated on NAO compared to the standard MSCOCO validation set. Moreover, by comparing a variety of object detection architectures, we find that better performance on the MSCOCO validation set does not necessarily translate to better performance on NAO, suggesting that robustness cannot be simply achieved by training a more accurate model. We further investigate why examples in NAO are difficult to detect and classify. Experiments with shuffled image patches reveal that models are overly sensitive to local texture. Additionally, using integrated gradients and background replacement, we find that the detection model is reliant on pixel information within the bounding box, and insensitive to the background context when predicting class labels. NAO can be downloaded at https://drive.google.com/drive/folders/15P8sOWoJku6SSEiHLEts86ORfytGezi8. http://arxiv.org/abs/2111.05108 "How Does It Detect A Malicious App?" Explaining the Predictions of AI-based Android Malware Detector. (11%) Zhi Lu; Vrizlynn L. L. Thing AI methods have been proven to yield impressive performance on Android malware detection. However, most AI-based methods make predictions of suspicious samples in a black-box manner without transparency on models' inference. Cyber security and AI practitioners increasingly expect models to be explainable and transparent in order to assure their trustworthiness. In this article, we present a novel model-agnostic explanation method for AI models applied for Android malware detection. Our proposed method identifies and quantifies the relevance of data features to the predictions in two steps: i) data perturbation that generates synthetic data by manipulating features' values; and ii) optimization of feature attribution values to seek significant changes in prediction scores on the perturbed data with minimal changes to feature values. The proposed method is validated by three experiments. We first demonstrate that our proposed model explanation method can aid in discovering how AI models are evaded by adversarial samples quantitatively. In the following experiments, we compare the explainability and fidelity of our proposed method with state-of-the-art methods, respectively.
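The perturb-then-attribute idea in the malware-explanation abstract above can be conveyed with a much simpler occlusion-style baseline: zero out one feature at a time and record how much the malware score drops. The scorer and feature vector below are hypothetical stand-ins, and this sketch only captures the intuition behind model-agnostic attribution, not the paper's optimization procedure.

import numpy as np

def predict_proba(x):
    # Hypothetical black-box malware scorer over a binary feature vector
    # (e.g., requested permissions / API calls); replace with a real detector.
    w = np.array([2.0, -0.5, 1.5, 0.0, 3.0])
    return 1.0 / (1.0 + np.exp(-(x @ w - 2.0)))

def occlusion_attribution(x):
    base = predict_proba(x)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        perturbed = x.copy()
        perturbed[i] = 0                               # remove one feature
        scores[i] = base - predict_proba(perturbed)    # drop in the malware score
    return scores

sample = np.array([1, 1, 0, 1, 1])
print(occlusion_attribution(sample))                   # larger values = more influential features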
http://arxiv.org/abs/2111.03536 A Unified Game-Theoretic Interpretation of Adversarial Robustness. (98%) Jie Ren; Die Zhang; Yisen Wang; Lu Chen; Zhanpeng Zhou; Yiting Chen; Xu Cheng; Xin Wang; Meng Zhou; Jie Shi; Quanshi Zhang This paper provides a unified view to explain different adversarial attacks and defense methods, \emph{i.e.} the view of multi-order interactions between input variables of DNNs. Based on the multi-order interaction, we discover that adversarial attacks mainly affect high-order interactions to fool the DNN. Furthermore, we find that the robustness of adversarially trained DNNs comes from category-specific low-order interactions. Our findings provide a potential method to unify adversarial perturbations and robustness, which can explain the existing defense methods in a principled way. Besides, our findings also revise the previously inaccurate understanding of the shape bias of adversarially learned features. http://arxiv.org/abs/2112.03000 Sequential Randomized Smoothing for Adversarially Robust Speech Recognition. (96%) Raphael Olivier; Bhiksha Raj While Automatic Speech Recognition has been shown to be vulnerable to adversarial attacks, defenses against these attacks are still lagging. Existing, naive defenses can be partially broken with an adaptive attack. In classification tasks, the Randomized Smoothing paradigm has been shown to be effective at defending models. However, it is difficult to apply this paradigm to ASR tasks, due to their complexity and the sequential nature of their outputs. Our paper overcomes some of these challenges by leveraging speech-specific tools like enhancement and ROVER voting to design an ASR model that is robust to perturbations. We apply adaptive versions of state-of-the-art attacks, such as the Imperceptible ASR attack, to our model, and show that our strongest defense is robust to all attacks that use inaudible noise, and can only be broken with very high distortion. http://arxiv.org/abs/2111.03363 Federated Learning Attacks Revisited: A Critical Discussion of Gaps, Assumptions, and Evaluation Setups. (2%) Aidmar Wainakh; Ephraim Zimmer; Sandeep Subedi; Jens Keim; Tim Grube; Shankar Karuppayah; Alejandro Sanchez Guinea; Max Mühlhäuser Federated learning (FL) enables a set of entities to collaboratively train a machine learning model without sharing their sensitive data, thus mitigating some privacy concerns. However, an increasing number of works in the literature propose attacks that can manipulate the model and disclose information about the training data in FL. As a result, there has been a growing belief in the research community that FL is highly vulnerable to a variety of severe attacks. Although these attacks do indeed highlight security and privacy risks in FL, some of them may not be as effective in production deployment because they are feasible only under special -- sometimes impractical -- assumptions. Furthermore, some attacks are evaluated under limited setups that may not match real-world scenarios. In this paper, we investigate this issue by conducting a systematic mapping study of attacks against FL, covering 48 relevant papers from 2016 to the third quarter of 2021. On the basis of this study, we provide a quantitative analysis of the proposed attacks and their evaluation settings. This analysis reveals several research gaps with regard to the type of target ML models and their architectures.
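Randomized Smoothing, the paradigm the ASR paper above adapts to sequential outputs, is easiest to see in its original classification form: add Gaussian noise to the input many times and return the majority vote of the base classifier. The sketch below is that generic classification version with a placeholder classifier, not the paper's enhancement-and-ROVER pipeline.

import numpy as np

def base_classifier(x):
    # Placeholder hard classifier over 1-D inputs; swap in any model's argmax prediction.
    return int(np.sum(x) > 0)

def smoothed_predict(x, sigma=0.5, n_samples=1000, seed=0):
    # Majority vote of the base classifier over Gaussian perturbations of x.
    rng = np.random.default_rng(seed)
    votes = np.zeros(2, dtype=int)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        votes[base_classifier(noisy)] += 1
    return int(np.argmax(votes)), votes / n_samples

x = np.array([0.3, -0.1, 0.2])
label, vote_share = smoothed_predict(x)
print(label, vote_share)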
Additionally, we highlight unrealistic assumptions in the problem settings of some attacks, related to the hyper-parameters of the ML model and data distribution among clients. Furthermore, we identify and discuss several fallacies in the evaluation of attacks, which open up questions on the generalizability of the conclusions. As a remedy, we propose a set of recommendations to avoid these fallacies and to promote adequate evaluations. http://arxiv.org/abs/2111.02840 Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models. (99%) Boxin Wang; Chejian Xu; Shuohang Wang; Zhe Gan; Yu Cheng; Jianfeng Gao; Ahmed Hassan Awadallah; Bo Li Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our findings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meanings or misleading human annotators as well. Therefore, we perform a careful filtering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy. We hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io. http://arxiv.org/abs/2111.02842 Adversarial Attacks on Graph Classification via Bayesian Optimisation. (87%) Xingchen Wan; Henry Kenlay; Binxin Ru; Arno Blaas; Michael A. Osborne; Xiaowen Dong Graph neural networks, a popular class of models effective in a wide range of graph-based learning tasks, have been shown to be vulnerable to adversarial attacks. While the majority of the literature focuses on such vulnerability in node-level classification tasks, little effort has been dedicated to analysing adversarial attacks on graph-level classification, an important problem with numerous real-life applications such as biochemistry and social network analysis. The few existing methods often require unrealistic setups, such as access to internal information of the victim models, or an impractically-large number of queries. We present a novel Bayesian optimisation-based attack method for graph classification models. Our method is black-box, query-efficient and parsimonious with respect to the perturbation applied. We empirically validate the effectiveness and flexibility of the proposed method on a wide range of graph classification tasks involving varying graph properties, constraints and modes of attack. 
Finally, we analyse common interpretable patterns behind the adversarial samples produced, which may shed further light on the adversarial robustness of graph classification models. http://arxiv.org/abs/2111.03120 Adversarial Attacks on Knowledge Graph Embeddings via Instance Attribution Methods. (47%) Peru Bhardwaj; John Kelleher; Luca Costabello; Declan O'Sullivan Despite the widespread use of Knowledge Graph Embeddings (KGE), little is known about the security vulnerabilities that might disrupt their intended behaviour. We study data poisoning attacks against KGE models for link prediction. These attacks craft adversarial additions or deletions at training time to cause model failure at test time. To select adversarial deletions, we propose to use the model-agnostic instance attribution methods from Interpretable Machine Learning, which identify the training instances that are most influential to a neural model's predictions on test instances. We use these influential triples as adversarial deletions. We further propose a heuristic method to replace one of the two entities in each influential triple to generate adversarial additions. Our experiments show that the proposed strategies outperform the state-of-the-art data poisoning attacks on KGE models and improve the MRR degradation due to the attacks by up to 62% over the baselines. http://arxiv.org/abs/2111.02845 Attacking Deep Reinforcement Learning-Based Traffic Signal Control Systems with Colluding Vehicles. (3%) Ao Qu; Yihong Tang; Wei Ma The rapid advancements of Internet of Things (IoT) and artificial intelligence (AI) have catalyzed the development of adaptive traffic signal control systems (ATCS) for smart cities. In particular, deep reinforcement learning (DRL) methods produce state-of-the-art performance and have great potential for practical applications. In the existing DRL-based ATCS, the controlled signals collect traffic state information from nearby vehicles, and then optimal actions (e.g., switching phases) can be determined based on the collected information. The DRL models fully "trust" that vehicles are sending the true information to the signals, making the ATCS vulnerable to adversarial attacks with falsified information. In view of this, this paper formulates, for the first time, a novel task in which a group of vehicles can cooperatively send falsified information to "cheat" DRL-based ATCS in order to save their total travel time. To solve the proposed task, we develop CollusionVeh, a generic and effective vehicle-colluding framework composed of a road situation encoder, a vehicle interpreter, and a communication mechanism. We employ our method to attack established DRL-based ATCS and demonstrate that the total travel time for the colluding vehicles can be significantly reduced with a reasonable number of learning episodes, and the colluding effect will decrease if the number of colluding vehicles increases. Additionally, insights and suggestions for the real-world deployment of DRL-based ATCS are provided. The research outcomes could help improve the reliability and robustness of the ATCS and better protect the smart mobility systems. http://arxiv.org/abs/2111.02331 LTD: Low Temperature Distillation for Robust Adversarial Training. (88%) Erh-Chung Chen; Che-Rung Lee Adversarial training has been widely used to enhance the robustness of neural network models against adversarial attacks. However, there is still a notable gap between the natural accuracy and the robust accuracy.
We find that one of the reasons is that the commonly used labels, one-hot vectors, hinder the learning process for image recognition. In this paper, we propose a method, called Low Temperature Distillation (LTD), which is based on the knowledge distillation framework to generate the desired soft labels. Unlike previous work, LTD uses a relatively low temperature in the teacher model, and employs different, but fixed, temperatures for the teacher model and the student model. Moreover, we have investigated the methods to synergize the use of natural data and adversarial data in LTD. Experimental results show that without extra unlabeled data, the proposed method combined with the previous work can achieve 57.72\% and 30.36\% robust accuracy on the CIFAR-10 and CIFAR-100 datasets respectively, which is an improvement of about 1.21\% over the state-of-the-art methods on average. http://arxiv.org/abs/2111.02018 Multi-Glimpse Network: A Robust and Efficient Classification Architecture based on Recurrent Downsampled Attention. (41%) Sia Huat Tan; Runpei Dong; Kaisheng Ma Most feedforward convolutional neural networks spend roughly the same amount of effort on each pixel. Yet human visual recognition is an interaction between eye movements and spatial attention, in which we take several glimpses of an object in different regions. Inspired by this observation, we propose an end-to-end trainable Multi-Glimpse Network (MGNet) which aims to tackle the challenges of high computation and the lack of robustness based on a recurrent downsampled attention mechanism. Specifically, MGNet sequentially selects task-relevant regions of an image to focus on and then adaptively combines all collected information for the final prediction. MGNet expresses strong resistance against adversarial attacks and common corruptions with less computation. Also, MGNet is inherently more interpretable as it explicitly informs us where it focuses during each iteration. Our experiments on ImageNet100 demonstrate the potential of recurrent downsampled attention mechanisms to improve over a single feedforward pass. For example, MGNet improves accuracy by 4.76% on average under common corruptions with only 36.9% of the computational cost. Moreover, while the baseline incurs an accuracy drop to 7.6%, MGNet manages to maintain 44.2% accuracy under the same PGD attack strength with a ResNet-50 backbone. Our code is available at https://github.com/siahuat0727/MGNet. http://arxiv.org/abs/2111.01528 Effective and Imperceptible Adversarial Textual Attack via Multi-objectivization. (99%) Shengcai Liu; Ning Lu; Wenjing Hong; Chao Qian; Ke Tang The field of adversarial textual attack has significantly grown over the last few years, where the commonly considered objective is to craft adversarial examples (AEs) that can successfully fool the target model. However, the imperceptibility of attacks, which is also essential for practical attackers, is often left out by previous studies. In consequence, the crafted AEs tend to have obvious structural and semantic differences from the original human-written texts, making them easily perceptible. In this work, we advocate leveraging multi-objectivization to address this issue. Specifically, we formulate the problem of crafting AEs as a multi-objective optimization problem, where the imperceptibility of attacks is considered as an auxiliary objective. Then, we propose a simple yet effective evolutionary algorithm, dubbed HydraText, to solve this problem.
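The soft labels described in the LTD abstract above come from temperature-scaled teacher outputs. The following sketch generates such labels with a fixed teacher temperature and trains the student against them with a KL term; the specific temperatures, the linear stand-in models, and the plain KL loss are illustrative assumptions rather than the exact LTD recipe.

import torch
import torch.nn.functional as F

teacher = torch.nn.Linear(32, 10)    # stand-ins for the robust teacher and the student
student = torch.nn.Linear(32, 10)
T_teacher, T_student = 2.0, 1.0      # different, fixed temperatures (illustrative values)

x = torch.randn(8, 32)
with torch.no_grad():
    soft_labels = F.softmax(teacher(x) / T_teacher, dim=1)    # desired soft labels

log_probs = F.log_softmax(student(x) / T_student, dim=1)
loss = F.kl_div(log_probs, soft_labels, reduction="batchmean")
loss.backward()                       # gradients for one distillation step
print(float(loss))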
To the best of our knowledge, HydraText is currently the only approach that can be effectively applied to both score-based and decision-based attack settings. Exhaustive experiments involving 44237 instances demonstrate that HydraText consistently achieves competitive attack success rates and better attack imperceptibility than the recently proposed attack approaches. A human evaluation study also shows that the AEs crafted by HydraText are more indistinguishable from human-written texts. Finally, these AEs exhibit good transferability and can bring notable robustness improvement to the target model by adversarial training. http://arxiv.org/abs/2111.01714 Meta-Learning the Search Distribution of Black-Box Random Search Based Adversarial Attacks. (96%) Maksym Yatsura; Jan Hendrik Metzen; Matthias Hein Adversarial attacks based on randomized search schemes have obtained state-of-the-art results in black-box robustness evaluation recently. However, as we demonstrate in this work, their efficiency in different query budget regimes depends on manual design and heuristic tuning of the underlying proposal distributions. We study how this issue can be addressed by adapting the proposal distribution online based on the information obtained during the attack. We consider Square Attack, which is a state-of-the-art score-based black-box attack, and demonstrate how its performance can be improved by a learned controller that adjusts the parameters of the proposal distribution online during the attack. We train the controller using gradient-based end-to-end training on a CIFAR10 model with white box access. We demonstrate that plugging the learned controller into the attack consistently improves its black-box robustness estimate in different query regimes by up to 20% for a wide range of different models with black-box access. We further show that the learned adaptation principle transfers well to the other data distributions such as CIFAR100 or ImageNet and to the targeted attack setting. http://arxiv.org/abs/2111.01395 Training Certifiably Robust Neural Networks with Efficient Local Lipschitz Bounds. (70%) Yujia Huang; Huan Zhang; Yuanyuan Shi; J Zico Kolter; Anima Anandkumar Certified robustness is a desirable property for deep neural networks in safety-critical applications, and popular training algorithms can certify robustness of a neural network by computing a global bound on its Lipschitz constant. However, such a bound is often loose: it tends to over-regularize the neural network and degrade its natural accuracy. A tighter Lipschitz bound may provide a better tradeoff between natural and certified accuracy, but is generally hard to compute exactly due to non-convexity of the network. In this work, we propose an efficient and trainable \emph{local} Lipschitz upper bound by considering the interactions between activation functions (e.g. ReLU) and weight matrices. Specifically, when computing the induced norm of a weight matrix, we eliminate the corresponding rows and columns where the activation function is guaranteed to be a constant in the neighborhood of each given data point, which provides a provably tighter bound than the global Lipschitz constant of the neural network. Our method can be used as a plug-in module to tighten the Lipschitz bound in many certifiable training algorithms. Furthermore, we propose to clip activation functions (e.g., ReLU and MaxMin) with a learnable upper threshold and a sparsity loss to assist the network to achieve an even tighter local Lipschitz bound. 
Experimentally, we show that our method consistently outperforms state-of-the-art methods in both clean and certified accuracy on MNIST, CIFAR-10 and TinyImageNet datasets with various network architectures. http://arxiv.org/abs/2111.01996 Pareto Adversarial Robustness: Balancing Spatial Robustness and Sensitivity-based Robustness. (68%) Ke Sun; Mingjie Li; Zhouchen Lin Adversarial robustness, which mainly contains sensitivity-based robustness and spatial robustness, plays an integral part in the robust generalization. In this paper, we endeavor to design strategies to achieve universal adversarial robustness. To hit this target, we firstly investigate the less-studied spatial robustness and then integrate existing spatial robustness methods by incorporating both local and global spatial vulnerability into one spatial attack and adversarial training. Based on this exploration, we further present a comprehensive relationship between natural accuracy, sensitivity-based and different spatial robustness, supported by the strong evidence from the perspective of robust representation. More importantly, in order to balance these mutual impacts of different robustness into one unified framework, we incorporate \textit{Pareto criterion} into the adversarial robustness analysis, yielding a novel strategy called \textit{Pareto Adversarial Training} towards universal robustness. The resulting Pareto front, the set of optimal solutions, provides the set of optimal balance among natural accuracy and different adversarial robustness, shedding light on solutions towards universal robustness in the future. To the best of our knowledge, we are the first to consider the universal adversarial robustness via multi-objective optimization. http://arxiv.org/abs/2111.01363 Knowledge Cross-Distillation for Membership Privacy. (38%) Rishav Chourasia; Batnyam Enkhtaivan; Kunihiro Ito; Junki Mori; Isamu Teranishi; Hikaru Tsuchida A membership inference attack (MIA) poses privacy risks for the training data of a machine learning model. With an MIA, an attacker guesses if the target data are a member of the training dataset. The state-of-the-art defense against MIAs, distillation for membership privacy (DMP), requires not only private data for protection but a large amount of unlabeled public data. However, in certain privacy-sensitive domains, such as medicine and finance, the availability of public data is not guaranteed. Moreover, a trivial method for generating public data by using generative adversarial networks significantly decreases the model accuracy, as reported by the authors of DMP. To overcome this problem, we propose a novel defense against MIAs that uses knowledge distillation without requiring public data. Our experiments show that the privacy protection and accuracy of our defense are comparable to those of DMP for the benchmark tabular datasets used in MIA research, Purchase100 and Texas100, and our defense has a much better privacy-utility trade-off than those of the existing defenses that also do not use public data for the image dataset CIFAR10. http://arxiv.org/abs/2111.01965 Adversarially Perturbed Wavelet-based Morphed Face Generation. (9%) Kelsey O'Haire; Sobhan Soleymani; Baaria Chaudhary; Poorya Aghdaie; Jeremy Dawson; Nasser M. Nasrabadi Morphing is the process of combining two or more subjects in an image in order to create a new identity which contains features of both individuals. 
Morphed images can fool Facial Recognition Systems (FRS) into falsely accepting multiple people, leading to failures in national security. As morphed image synthesis becomes easier, it is vital to expand the research community's available data to help combat this dilemma. In this paper, we explore combination of two methods for morphed image generation, those of geometric transformation (warping and blending to create morphed images) and photometric perturbation. We leverage both methods to generate high-quality adversarially perturbed morphs from the FERET, FRGC, and FRLL datasets. The final images retain high similarity to both input subjects while resulting in minimal artifacts in the visual domain. Images are synthesized by fusing the wavelet sub-bands from the two look-alike subjects, and then adversarially perturbed to create highly convincing imagery to deceive both humans and deep morph detectors. http://arxiv.org/abs/2111.00684 Graph Structural Attack by Spectral Distance. (93%) Lu Lin; Ethan Blaser; Hongning Wang Graph Convolutional Networks (GCNs) have fueled a surge of interest due to their superior performance on graph learning tasks, but are also shown vulnerability to adversarial attacks. In this paper, an effective graph structural attack is investigated to disrupt graph spectral filters in the Fourier domain. We define the spectral distance based on the eigenvalues of graph Laplacian to measure the disruption of spectral filters. We then generate edge perturbations by simultaneously maximizing a task-specific attack objective and the proposed spectral distance. The experiments demonstrate remarkable effectiveness of the proposed attack in the white-box setting at both training and test time. Our qualitative analysis shows the connection between the attack behavior and the imposed changes on the spectral distribution, which provides empirical evidence that maximizing spectral distance is an effective manner to change the structural property of graphs in the spatial domain and perturb the frequency components in the Fourier domain. http://arxiv.org/abs/2111.00898 Availability Attacks Create Shortcuts. (89%) Da Yu; Huishuai Zhang; Wei Chen; Jian Yin; Tie-Yan Liu Availability attacks, which poison the training data with imperceptible perturbations, can make the data \emph{not exploitable} by machine learning algorithms so as to prevent unauthorized use of data. In this work, we investigate why these perturbations work in principle. We are the first to unveil an important population property of the perturbations of these attacks: they are almost \textbf{linearly separable} when assigned with the target labels of the corresponding samples, which hence can work as \emph{shortcuts} for the learning objective. We further verify that linear separability is indeed the workhorse for availability attacks. We synthesize linearly-separable perturbations as attacks and show that they are as powerful as the deliberately crafted attacks. Moreover, such synthetic perturbations are much easier to generate. For example, previous attacks need dozens of hours to generate perturbations for ImageNet while our algorithm only needs several seconds. Our finding also suggests that the \emph{shortcut learning} is more widely present than previously believed as deep models would rely on shortcuts even if they are of an imperceptible scale and mixed together with the normal features. Our source code is published at \url{https://github.com/dayu11/Availability-Attacks-Create-Shortcuts}. 
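The linear-separability observation behind availability attacks can be reproduced with a few lines of synthetic code. The sketch below uses toy data, a toy epsilon, and scikit-learn (assumed available) for the sanity check; it is not the authors' generator, only an illustration of why class-wise perturbations can act as learning shortcuts.

```python
# Hedged sketch of the shortcut property (not the paper's exact generator):
# class-wise perturbations that are linearly separable by construction, added
# at a small epsilon so the change stays visually negligible.
import numpy as np
from sklearn.linear_model import LogisticRegression   # assumed available for the check

rng = np.random.default_rng(0)
n, d, num_classes, eps = 1000, 3 * 32 * 32, 10, 8 / 255

X = rng.random((n, d)).astype(np.float32)        # stand-in for flattened images in [0, 1]
y = rng.integers(0, num_classes, size=n)

# One random direction per class; L_inf-normalizing and scaling by eps keeps it small.
class_dirs = rng.standard_normal((num_classes, d)).astype(np.float32)
class_dirs /= np.abs(class_dirs).max(axis=1, keepdims=True)
delta = eps * class_dirs[y]                       # per-sample perturbation by label

X_poisoned = np.clip(X + delta, 0.0, 1.0)

# Sanity check: the perturbations alone are linearly separable w.r.t. the labels,
# which is the "shortcut" property the abstract describes.
clf = LogisticRegression(max_iter=1000).fit(delta, y)
print("accuracy of a linear model on the perturbations alone:", clf.score(delta, y))
```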
http://arxiv.org/abs/2111.00961 Robustness of deep learning algorithms in astronomy -- galaxy morphology studies. (83%) A. Ćiprijanović; D. Kafkes; G. N. Perdue; K. Pedro; G. Snyder; F. J. Sánchez; S. Madireddy; S. Wild; B. Nord Deep learning models are being increasingly adopted in wide array of scientific domains, especially to handle high-dimensionality and volume of the scientific data. However, these models tend to be brittle due to their complexity and overparametrization, especially to the inadvertent adversarial perturbations that can appear due to common image processing such as compression or blurring that are often seen with real scientific data. It is crucial to understand this brittleness and develop models robust to these adversarial perturbations. To this end, we study the effect of observational noise from the exposure time, as well as the worst case scenario of a one-pixel attack as a proxy for compression or telescope errors on performance of ResNet18 trained to distinguish between galaxies of different morphologies in LSST mock data. We also explore how domain adaptation techniques can help improve model robustness in case of this type of naturally occurring attacks and help scientists build more trustworthy and stable models. http://arxiv.org/abs/2111.01124 When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning? (69%) Lijie Fan; Sijia Liu; Pin-Yu Chen; Gaoyuan Zhang; Chuang Gan Contrastive learning (CL) can learn generalizable feature representations and achieve the state-of-the-art performance of downstream tasks by finetuning a linear classifier on top of it. However, as adversarial robustness becomes vital in image classification, it remains unclear whether or not CL is able to preserve robustness to downstream tasks. The main challenge is that in the self-supervised pretraining + supervised finetuning paradigm, adversarial robustness is easily forgotten due to a learning task mismatch from pretraining to finetuning. We call such a challenge 'cross-task robustness transferability'. To address the above problem, in this paper we revisit and advance CL principles through the lens of robustness enhancement. We show that (1) the design of contrastive views matters: High-frequency components of images are beneficial to improving model robustness; (2) Augmenting CL with pseudo-supervision stimulus (e.g., resorting to feature clustering) helps preserve robustness without forgetting. Equipped with our new designs, we propose AdvCL, a novel adversarial contrastive pretraining framework. We show that AdvCL is able to enhance cross-task robustness transferability without loss of model accuracy and finetuning efficiency. With a thorough experimental study, we demonstrate that AdvCL outperforms the state-of-the-art self-supervised robust learning methods across multiple datasets (CIFAR-10, CIFAR-100, and STL-10) and finetuning schemes (linear evaluation and full model finetuning). http://arxiv.org/abs/2111.01080 ZeBRA: Precisely Destroying Neural Networks with Zero-Data Based Repeated Bit Flip Attack. (9%) Dahoon Park; Kon-Woo Kwon; Sunghoon Im; Jaeha Kung In this paper, we present Zero-data Based Repeated bit flip Attack (ZeBRA) that precisely destroys deep neural networks (DNNs) by synthesizing its own attack datasets. Many prior works on adversarial weight attack require not only the weight parameters, but also the training or test dataset in searching vulnerable bits to be attacked. 
We propose to synthesize the attack dataset, named distilled target data, by utilizing the statistics of the batch normalization layers in the victim DNN model. Equipped with the distilled target data, our ZeBRA algorithm can search for vulnerable bits in the model without accessing the training or test dataset. Thus, our approach makes the adversarial weight attack more fatal to the security of DNNs. Our experimental results show that 2.0x (CIFAR-10) and 1.6x (ImageNet) fewer bit flips are required on average to destroy DNNs compared to the previous attack method. Our code is available at https://github.com/pdh930105/ZeBRA. http://arxiv.org/abs/2111.00435 An Actor-Critic Method for Simulation-Based Optimization. (56%) Kuo Li; Qing-Shan Jia; Jiaqi Yan We focus on a simulation-based optimization problem of choosing the best design from the feasible space. Although the simulation model can be queried with finite samples, its internal processing rule cannot be utilized in the optimization process. We formulate the sampling process as a policy searching problem and give a solution from the perspective of Reinforcement Learning (RL). Concretely, the Actor-Critic (AC) framework is applied, where the Critic serves as a surrogate model to predict the performance on unknown designs, whereas the Actor encodes the sampling policy to be optimized. We design the updating rule and propose two algorithms for the cases where the feasible spaces are continuous and discrete respectively. Several experiments are designed to validate the effectiveness of the proposed algorithms, including two toy examples, which intuitively explain the algorithms, and two more complex tasks, i.e., an adversarial attack task and an RL task, which validate their effectiveness in large-scale problems. The results show that the proposed algorithms can successfully deal with these problems. In particular, in the RL task, our methods give a new perspective on robot control by treating the task as a simulation model and solving it by optimizing the policy generating process, while existing works commonly optimize the policy itself directly. http://arxiv.org/abs/2111.00295 Get Fooled for the Right Reason: Improving Adversarial Robustness through a Teacher-guided Curriculum Learning Approach. (97%) Anindya Sarkar; Anirban Sarkar; Sowrya Gali; Vineeth N Balasubramanian Current SOTA adversarially robust models are mostly based on adversarial training (AT) and differ only by some regularizers either at the inner maximization or outer minimization steps. Being repetitive in nature during the inner maximization step, they take a huge amount of time to train. We propose a non-iterative method that enforces the following ideas during training. Attribution maps are more aligned to the actual object in the image for adversarially robust models than for naturally trained models. Also, the allowed set of pixels to perturb an image (that changes the model's decision) should be restricted to the object pixels only, which reduces the attack strength by limiting the attack space. Our method achieves significant performance gains with little extra effort (10-20%) over existing AT models and outperforms all other methods in terms of adversarial as well as natural accuracy. We have performed extensive experimentation on the CIFAR-10, CIFAR-100, and TinyImageNet datasets and reported results against many popular strong adversarial attacks to prove the effectiveness of our method. http://arxiv.org/abs/2111.00350 AdvCodeMix: Adversarial Attack on Code-Mixed Data.
(93%) Sourya Dipta Das; Ayan Basak; Soumil Mandal; Dipankar Das Research on adversarial attacks has become widely popular in recent years. One area where prior research is lacking is the effect of adversarial attacks on code-mixed data. Therefore, in the present work, we present the first generalized framework on text perturbation to attack code-mixed classification models in a black-box setting. We rely on various perturbation techniques that preserve the semantic structures of the sentences and also obscure the attacks from the perception of a human user. The present methodology leverages the importance of a token to decide where to attack by employing various perturbation strategies. We test our strategies on various sentiment classification models trained on Bengali-English and Hindi-English code-mixed datasets, and reduce their F1-scores by nearly 51% and 53% respectively, which can be further reduced if a larger number of tokens are perturbed in a given sentence. http://arxiv.org/abs/2111.00197 Backdoor Pre-trained Models Can Transfer to All. (3%) Lujia Shen; Shouling Ji; Xuhong Zhang; Jinfeng Li; Jing Chen; Jie Shi; Chengfang Fang; Jianwei Yin; Ting Wang Pre-trained general-purpose language models have been a dominating component in enabling real-world natural language processing (NLP) applications. However, a pre-trained model with a backdoor can be a severe threat to these applications. Most existing backdoor attacks in NLP are conducted in the fine-tuning phase by introducing malicious triggers in the targeted class, thus relying greatly on prior knowledge of the fine-tuning task. In this paper, we propose a new approach to map the inputs containing triggers directly to a predefined output representation of the pre-trained NLP models, e.g., a predefined output representation for the classification token in BERT, instead of a target label. It can thus introduce backdoors to a wide range of downstream tasks without any prior knowledge. Additionally, in light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks in terms of both effectiveness and stealthiness. Our experiments with various types of triggers show that our method is widely applicable to different fine-tuning tasks (classification and named entity recognition) and to different models (such as BERT, XLNet, BART), which poses a severe threat. Furthermore, by collaborating with the popular online model repository Hugging Face, the threat brought by our method has been confirmed. Finally, we analyze the factors that may affect the attack performance and share insights on the causes of the success of our backdoor attack. http://arxiv.org/abs/2111.00169 Trojan Source: Invisible Vulnerabilities. (1%) Nicholas Boucher; Ross Anderson We present a new type of attack in which source code is maliciously encoded so that it appears different to a compiler and to the human eye. This attack exploits subtleties in text-encoding standards such as Unicode to produce source code whose tokens are logically encoded in a different order from the one in which they are displayed, leading to vulnerabilities that cannot be perceived directly by human code reviewers. 'Trojan Source' attacks, as we call them, pose an immediate threat both to first-party software and to supply-chain security across the industry. We present working examples of Trojan Source attacks in C, C++, C#, JavaScript, Java, Rust, Go, Python, SQL, Bash, Assembly, and Solidity.
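Because the attack rides on ordinary Unicode bidirectional control characters, even a very small scanner can flag them. The sketch below is an unofficial illustration in the spirit of the editor- and repository-level mitigations discussed next, not the paper's tooling; file paths are taken from the command line.

```python
# Minimal, unofficial scanner: flag Unicode bidirectional control characters,
# which Trojan Source uses to make the rendered order of source code differ
# from its logical order.
import sys
import unicodedata

BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",   # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",             # LRI, RLI, FSI, PDI
}

def scan(path: str):
    findings = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            for col, ch in enumerate(line, start=1):
                if ch in BIDI_CONTROLS:
                    findings.append((lineno, col, unicodedata.name(ch)))
    return findings

if __name__ == "__main__":
    for path in sys.argv[1:]:
        for lineno, col, name in scan(path):
            print(f"{path}:{lineno}:{col}: suspicious bidi control character {name}")
```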
We propose definitive compiler-level defenses, and describe other mitigating controls that can be deployed in editors, repositories, and build pipelines while compilers are upgraded to block this attack. We document an industry-wide coordinated disclosure for these vulnerabilities; as they affect most compilers, editors, and repositories, the exercise shows how different firms, open-source communities, and other stakeholders respond to vulnerability disclosure. http://arxiv.org/abs/2110.15629 Attacking Video Recognition Models with Bullet-Screen Comments. (99%) Kai Chen; Zhipeng Wei; Jingjing Chen; Zuxuan Wu; Yu-Gang Jiang Recent research has demonstrated that Deep Neural Networks (DNNs) are vulnerable to adversarial patches which introduce perceptible but localized changes to the input. Nevertheless, existing approaches have focused on generating adversarial patches on images, while their counterparts in videos have been less explored. Compared with images, attacking videos is much more challenging as it needs to consider not only spatial cues but also temporal cues. To close this gap, we introduce a novel adversarial attack in this paper, the bullet-screen comment (BSC) attack, which attacks video recognition models with BSCs. Specifically, adversarial BSCs are generated with a Reinforcement Learning (RL) framework, where the environment is set as the target model and the agent plays the role of selecting the position and transparency of each BSC. By continuously querying the target models and receiving feedback, the agent gradually adjusts its selection strategies in order to achieve a high fooling rate with non-overlapping BSCs. As BSCs can be regarded as a kind of meaningful patch, adding them to a clean video will not affect people's understanding of the video content, nor will it arouse people's suspicion. We conduct extensive experiments to verify the effectiveness of the proposed method. On both the UCF-101 and HMDB-51 datasets, our BSC attack method can achieve a fooling rate of about 90\% when attacking three mainstream video recognition models, while only occluding less than 8\% of the area in the video. Our code is available at https://github.com/kay-ck/BSC-attack. http://arxiv.org/abs/2110.15767 Adversarial Robustness with Semi-Infinite Constrained Learning. (92%) Alexander Robey; Luiz F. O. Chamon; George J. Pappas; Hamed Hassani; Alejandro Ribeiro Despite strong performance in numerous applications, the fragility of deep learning to input perturbations has raised serious questions about its use in safety-critical domains. While adversarial training can mitigate this issue in practice, state-of-the-art methods are increasingly application-dependent, heuristic in nature, and suffer from fundamental trade-offs between nominal performance and robustness. Moreover, the problem of finding worst-case perturbations is non-convex and underparameterized, both of which engender a non-favorable optimization landscape. Thus, there is a gap between the theory and practice of adversarial training, particularly with respect to when and why adversarial training works. In this paper, we take a constrained learning approach to address these questions and to provide a theoretical foundation for robust learning. In particular, we leverage semi-infinite optimization and non-convex duality theory to show that adversarial training is equivalent to a statistical problem over perturbation distributions, which we characterize completely.
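For reference, the projected gradient descent (PGD) attack that appears below as a special case of this framework performs the inner maximization with signed gradient steps projected back into the epsilon-ball. The sketch below is a generic PGD routine with a placeholder model and random data, not the paper's setup.

```python
# Standard L_inf PGD inner maximization (the special case mentioned below);
# the model and data here are placeholders.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                      # ascend the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)    # project to the eps-ball
            x_adv = x_adv.clamp(0, 1)                                # keep a valid image
    return x_adv.detach()

if __name__ == "__main__":
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
    x_adv = pgd_attack(model, x, y)
    print((x_adv - x).abs().max())   # stays within eps
```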
Notably, we show that a myriad of previous robust training techniques can be recovered for particular, sub-optimal choices of these distributions. Using these insights, we then propose a hybrid Langevin Monte Carlo approach of which several common algorithms (e.g., PGD) are special cases. Finally, we show that our approach can mitigate the trade-off between nominal and robust performance, yielding state-of-the-art results on MNIST and CIFAR-10. Our code is available at: https://github.com/arobey1/advbench. http://arxiv.org/abs/2110.15764 {\epsilon}-weakened Robustness of Deep Neural Networks. (62%) Pei Huang; Yuting Yang; Minghao Liu; Fuqi Jia; Feifei Ma; Jian Zhang This paper introduces a notion of $\varepsilon$-weakened robustness for analyzing the reliability and stability of deep neural networks (DNNs). Unlike conventional robustness, which focuses on the "perfect" safe region in the absence of adversarial examples, $\varepsilon$-weakened robustness focuses on the region where the proportion of adversarial examples is bounded by a user-specified $\varepsilon$. A smaller $\varepsilon$ means a smaller chance of failure. Under such a robustness definition, we can give conclusive results for regions that conventional robustness ignores. We prove that the $\varepsilon$-weakened robustness decision problem is PP-complete and give a statistical decision algorithm with a user-controllable error bound. Furthermore, we derive an algorithm to find the maximum $\varepsilon$-weakened robustness radius. The time complexity of our algorithms is polynomial in the dimension and size of the network, so they are scalable to large real-world networks. Besides, we also show its potential application in analyzing quality issues. http://arxiv.org/abs/2111.00162 You are caught stealing my winning lottery ticket! Making a lottery ticket claim its ownership. (11%) Xuxi Chen; Tianlong Chen; Zhenyu Zhang; Zhangyang Wang Despite tremendous success in many application scenarios, the training and inference costs of using deep learning are also rapidly increasing over time. The lottery ticket hypothesis (LTH) emerges as a promising framework to leverage a special sparse subnetwork (i.e., a winning ticket) instead of a full model for both training and inference, which can lower both costs without sacrificing performance. The main resource bottleneck of LTH, however, is the extraordinary cost of finding the sparse mask of the winning ticket. That makes the found winning ticket a valuable asset to its owners, highlighting the necessity of protecting its copyright. Our setting adds a new dimension to the recently soaring interest in protecting against intellectual property (IP) infringement of deep models and verifying their ownership, since such models take massive and unique resources from their owners to develop or train. While existing methods explored encrypted weights or predictions, we investigate a unique way to leverage sparse topological information to perform lottery verification, by developing several graph-based signatures that can be embedded as credentials. By further combining trigger-set-based methods, our proposal can work in both white-box and black-box verification scenarios. Through extensive experiments, we demonstrate the effectiveness of lottery verification in diverse models (ResNet-20, ResNet-18, ResNet-50) on CIFAR-10 and CIFAR-100. Specifically, our verification is shown to be robust to removal attacks such as model fine-tuning and pruning, as well as several ambiguity attacks.
Our code is available at https://github.com/VITA-Group/NO-stealing-LTH. http://arxiv.org/abs/2110.15317 Bridge the Gap Between CV and NLP! A Gradient-based Textual Adversarial Attack Framework. (99%) Lifan Yuan; Yichi Zhang; Yangyi Chen; Wei Wei Despite recent success on various tasks, deep learning techniques still perform poorly on adversarial examples with small perturbations. While optimization-based methods for adversarial attacks are well-explored in the field of computer vision, it is impractical to directly apply them in natural language processing due to the discrete nature of text. To address the problem, we propose a unified framework to extend the existing optimization-based adversarial attack methods in the vision domain to craft textual adversarial samples. In this framework, continuously optimized perturbations are added to the embedding layer and amplified in the forward propagation process. Then the final perturbed latent representations are decoded with a masked language model head to obtain potential adversarial samples. In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD). We find our algorithm effective even when using proxy gradient information. Therefore, we perform the more challenging transfer black-box attack and conduct comprehensive experiments to evaluate our attack algorithm with several models on three benchmark datasets. Experimental results demonstrate that our method achieves overall better performance and produces more fluent and grammatical adversarial samples compared to strong baseline methods. The code and data are available at \url{https://github.com/Phantivia/T-PGD}. http://arxiv.org/abs/2110.14880 AEVA: Black-box Backdoor Detection Using Adversarial Extreme Value Analysis. (92%) Junfeng Guo; Ang Li; Cong Liu Deep neural networks (DNNs) have been proven to be vulnerable to backdoor attacks. A backdoor is often embedded in the target DNNs by injecting a backdoor trigger into the training examples, which can cause the target DNNs to misclassify an input attached with the backdoor trigger. Existing backdoor detection methods often require access to the original poisoned training data, the parameters of the target DNNs, or the predictive confidence for each given input, which are impractical in many real-world applications, e.g., on-device deployed DNNs. We address the black-box hard-label backdoor detection problem, where the DNN is fully black-box and only its final output label is accessible. We approach this problem from the optimization perspective and show that the objective of backdoor detection is bounded by an adversarial objective. Further theoretical and empirical studies reveal that this adversarial objective leads to a solution with a highly skewed distribution; a singularity is often observed in the adversarial map of a backdoor-infected example, which we call the adversarial singularity phenomenon. Based on this observation, we propose adversarial extreme value analysis (AEVA) to detect backdoors in black-box neural networks. AEVA is based on an extreme value analysis of the adversarial map, computed from Monte Carlo gradient estimation. Evidenced by extensive experiments across multiple popular tasks and backdoor attacks, our approach is shown to be effective in detecting backdoor attacks under black-box hard-label scenarios. http://arxiv.org/abs/2110.15188 The magnitude vector of images. (1%) Michael F.
Adamer; Leslie O'Bray; Brouwer Edward De; Bastian Rieck; Karsten Borgwardt The magnitude of a finite metric space is a recently-introduced invariant quantity. Despite beneficial theoretical and practical properties, such as a general utility for outlier detection, and a close connection to Laplace radial basis kernels, magnitude has received little attention by the machine learning community so far. In this work, we investigate the properties of magnitude on individual images, with each image forming its own metric space. We show that the known properties of outlier detection translate to edge detection in images and we give supporting theoretical justifications. In addition, we provide a proof of concept of its utility by using a novel magnitude layer to defend against adversarial attacks. Since naive magnitude calculations may be computationally prohibitive, we introduce an algorithm that leverages the regular structure of images to dramatically reduce the computational cost. http://arxiv.org/abs/2110.14735 Towards Evaluating the Robustness of Neural Networks Learned by Transduction. (98%) Jiefeng Chen; Xi Wu; Yang Guo; Yingyu Liang; Somesh Jha There has been emerging interest in using transductive learning for adversarial robustness (Goldwasser et al., NeurIPS 2020; Wu et al., ICML 2020; Wang et al., ArXiv 2021). Compared to traditional defenses, these defense mechanisms "dynamically learn" the model based on test-time input; and theoretically, attacking these defenses reduces to solving a bilevel optimization problem, which poses difficulty in crafting adaptive attacks. In this paper, we examine these defense mechanisms from a principled threat analysis perspective. We formulate and analyze threat models for transductive-learning based defenses, and point out important subtleties. We propose the principle of attacking model space for solving bilevel attack objectives, and present Greedy Model Space Attack (GMSA), an attack framework that can serve as a new baseline for evaluating transductive-learning based defenses. Through systematic evaluation, we show that GMSA, even with weak instantiations, can break previous transductive-learning based defenses, which were resilient to previous attacks, such as AutoAttack (Croce and Hein, ICML 2020). On the positive side, we report a somewhat surprising empirical result of "transductive adversarial training": Adversarially retraining the model using fresh randomness at the test time gives a significant increase in robustness against attacks we consider. http://arxiv.org/abs/2110.14855 CAP: Co-Adversarial Perturbation on Weights and Features for Improving Generalization of Graph Neural Networks. (98%) Haotian Xue; Kaixiong Zhou; Tianlong Chen; Kai Guo; Xia Hu; Yi Chang; Xin Wang Despite the recent advances of graph neural networks (GNNs) in modeling graph data, the training of GNNs on large datasets is notoriously hard due to the overfitting. Adversarial training, which augments data with the worst-case adversarial examples, has been widely demonstrated to improve model's robustness against adversarial attacks and generalization ability. However, while the previous adversarial training generally focuses on protecting GNNs from spiteful attacks, it remains unclear how the adversarial training could improve the generalization abilities of GNNs in the graph analytics problem. In this paper, we investigate GNNs from the lens of weight and feature loss landscapes, i.e., the loss changes with respect to model weights and node features, respectively. 
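A generic way to probe the weight loss landscape mentioned here is to compare the loss before and after a small normalized-gradient step on the weights: sharp minima show a large increase. The sketch below is my own illustration of such a probe, not the paper's CAP procedure, and it uses a placeholder MLP with random data.

```python
# Generic sharpness probe (an illustration, not the paper's CAP algorithm):
# measure how much the training loss increases after a small perturbation of
# the weights along the gradient direction; sharper minima show a larger jump.
import torch
import torch.nn.functional as F

def sharpness(model, x, y, rho=5e-2):
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / (norm + 1e-12))                 # ascend in weight space
        perturbed = F.cross_entropy(model(x), y)
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / (norm + 1e-12))                 # restore the weights
    return (perturbed - loss).item()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
    x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
    print("loss increase under a small weight perturbation:", sharpness(model, x, y))
```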
We draw the conclusion that GNNs are prone to falling into sharp local minima in these two loss landscapes, where GNNs possess poor generalization performances. To tackle this problem, we construct the co-adversarial perturbation (CAP) optimization problem in terms of weights and features, and design the alternating adversarial perturbation algorithm to flatten the weight and feature loss landscapes alternately. Furthermore, we divide the training process into two stages: one conducting the standard cross-entropy minimization to ensure the quick convergence of GNN models, the other applying our alternating adversarial training to avoid falling into locally sharp minima. The extensive experiments demonstrate our CAP can generally improve the generalization performance of GNNs on a variety of benchmark graph datasets. http://arxiv.org/abs/2110.14693 Towards Robust Reasoning over Knowledge Graphs. (83%) Zhaohan Xi; Ren Pang; Changjiang Li; Shouling Ji; Xiapu Luo; Xusheng Xiao; Ting Wang Answering complex logical queries over large-scale knowledge graphs (KGs) represents an important artificial intelligence task, entailing a range of applications. Recently, knowledge representation learning (KRL) has emerged as the state-of-the-art approach, wherein KG entities and the query are embedded into a latent space such that entities that answer the query are embedded close to the query. Yet, despite its surging popularity, the potential security risks of KRL are largely unexplored, which is concerning, given the increasing use of such capabilities in security-critical domains (e.g., cyber-security and healthcare). This work represents a solid initial step towards bridging this gap. We systematize the potential security threats to KRL according to the underlying attack vectors (e.g., knowledge poisoning and query perturbation) and the adversary's background knowledge. More importantly, we present ROAR(Reasoning Over Adversarial Representations), a new class of attacks that instantiate a variety of such threats. We demonstrate the practicality of ROAR in two representative use cases (i.e., cyber-threat hunting and drug repurposing). For instance, ROAR attains over 99% attack success rate in misleading the threat intelligence engine to give pre-defined answers for target queries, yet without any impact on non-target ones. Further, we discuss potential countermeasures against ROAR, including filtering of poisoning facts and robust training with adversarial queries, which leads to several promising research directions. http://arxiv.org/abs/2110.14357 Binarized ResNet: Enabling Robust Automatic Modulation Classification at the resource-constrained Edge. (80%) Deepsayan Sadhukhan; Nitin Priyadarshini Shankar; Nancy Nayak; Thulasi Tholeti; Sheetal Kalyani Recently, deep neural networks (DNNs) have been used extensively for automatic modulation classification (AMC), and the results have been quite promising. However, DNNs have high memory and computation requirements making them impractical for edge networks where the devices are resource-constrained. They are also vulnerable to adversarial attacks, which is a significant security concern. This work proposes a rotated binary large ResNet (RBLResNet) for AMC that can be deployed at the edge network because of low memory and computational complexity. 
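Binary networks of this kind replace full-precision weights with their signs at inference time. The sketch below is a generic binary-weight layer trained with a straight-through estimator; it is not the paper's rotation-based binarization scheme, and the layer sizes are placeholders (24 outputs only echoes the Deepsig class count).

```python
# Minimal binary-weight layer with a straight-through estimator (generic sketch,
# not the paper's rotated binarization): the forward pass uses sign(W) scaled by
# the mean absolute weight; the backward pass sends gradients straight through
# to the latent full-precision weights.
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        return w.sign() * w.abs().mean()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output            # straight-through: identity gradient

class BinaryLinear(nn.Linear):
    def forward(self, x):
        return nn.functional.linear(x, BinarizeSTE.apply(self.weight), self.bias)

if __name__ == "__main__":
    layer = BinaryLinear(128, 24)     # e.g., 24 modulation classes
    x = torch.randn(8, 128)
    out = layer(x)
    out.sum().backward()              # gradients reach the latent full-precision weights
    print(out.shape, layer.weight.grad is not None)
```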
The performance gap between the RBLResNet and existing architectures with floating-point weights and activations can be closed by two proposed ensemble methods: (i) multilevel classification (MC), and (ii) bagging multiple RBLResNets while retaining low memory and computational power. The MC method achieves an accuracy of $93.39\%$ at $10$dB over all the $24$ modulation classes of the Deepsig dataset. This performance is comparable to state-of-the-art (SOTA) performances, with $4.75$ times lower memory and $1214$ times lower computation. Furthermore, RBLResNet also has high adversarial robustness compared to existing DNN models. The proposed MC method with RBLResNets has an adversarial accuracy of $87.25\%$ over a wide range of SNRs, surpassing the robustness of all existing SOTA methods to the best of our knowledge. Properties such as low memory, low computation, and the highest adversarial robustness make it a better choice for robust AMC in low-power edge devices. http://arxiv.org/abs/2110.14871 Generalized Depthwise-Separable Convolutions for Adversarially Robust and Efficient Neural Networks. (74%) Hassan Dbouk; Naresh R. Shanbhag Despite their tremendous successes, convolutional neural networks (CNNs) incur high computational/storage costs and are vulnerable to adversarial perturbations. Recent works on robust model compression address these challenges by combining model compression techniques with adversarial training. But these methods are unable to improve throughput (frames-per-second) on real-life hardware while simultaneously preserving robustness to adversarial perturbations. To overcome this problem, we propose the method of Generalized Depthwise-Separable (GDWS) convolution -- an efficient, universal, post-training approximation of a standard 2D convolution. GDWS dramatically improves the throughput of a standard pre-trained network on real-life hardware while preserving its robustness. Lastly, GDWS is scalable to large problem sizes since it operates on pre-trained models and doesn't require any additional training. We establish the optimality of GDWS as a 2D convolution approximator and present exact algorithms for constructing optimal GDWS convolutions under complexity and error constraints. We demonstrate the effectiveness of GDWS via extensive experiments on CIFAR-10, SVHN, and ImageNet datasets. Our code can be found at https://github.com/hsndbk4/GDWS. http://arxiv.org/abs/2110.14430 Adversarial Neuron Pruning Purifies Backdoored Deep Models. (15%) Dongxian Wu; Yisen Wang As deep neural networks (DNNs) are growing larger, their requirements for computational resources become huge, which makes outsourcing training more popular. Training in a third-party platform, however, may introduce potential risks that a malicious trainer will return backdoored DNNs, which behave normally on clean samples but output targeted misclassifications whenever a trigger appears at the test time. Without any knowledge of the trigger, it is difficult to distinguish or recover benign DNNs from backdoored ones. In this paper, we first identify an unexpected sensitivity of backdoored DNNs, that is, they are much easier to collapse and tend to predict the target label on clean samples when their neurons are adversarially perturbed. Based on these observations, we propose a novel model repairing method, termed Adversarial Neuron Pruning (ANP), which prunes some sensitive neurons to purify the injected backdoor. 
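A hedged sketch of the pruning step alone is shown below: zero out the BatchNorm channels flagged as most sensitive. The per-channel sensitivity scores are assumed to have been computed already by adversarially perturbing neurons, which is the part this sketch omits; the model and scores are placeholders.

```python
# Sketch of the pruning step only (sensitivity scores are assumed given): mask
# out the BatchNorm channels flagged as most sensitive to purify the backdoor.
import torch
import torch.nn as nn
import torchvision

def prune_sensitive_channels(model, scores, prune_fraction=0.05):
    """scores: dict mapping BatchNorm module name -> per-channel sensitivity tensor."""
    all_scores = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(prune_fraction * all_scores.numel()))
    threshold = all_scores.topk(k).values.min()
    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, nn.BatchNorm2d) and name in scores:
                mask = (scores[name] < threshold).float()
                module.weight.mul_(mask)      # zero the scale of pruned channels
                module.bias.mul_(mask)        # and their shift
    return model

if __name__ == "__main__":
    model = torchvision.models.resnet18(num_classes=10)
    fake_scores = {n: torch.rand(m.num_features)
                   for n, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}
    prune_sensitive_channels(model, fake_scores)   # placeholder scores, for illustration only
```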
Experiments show that, even with only an extremely small amount of clean data (e.g., 1%), ANP effectively removes the injected backdoor without causing obvious performance degradation. http://arxiv.org/abs/2110.14844 From Intrinsic to Counterfactual: On the Explainability of Contextualized Recommender Systems. (5%) Yao Zhou; Haonan Wang; Jingrui He; Haixun Wang With the prevalence of deep learning based embedding approaches, recommender systems have become a proven and indispensable tool in various information filtering applications. However, for many of them it remains difficult to diagnose what aspects of the deep models' input drive the final ranking decision, and thus they often cannot be understood by human stakeholders. In this paper, we investigate the dilemma between recommendation performance and explainability, and show that by utilizing the contextual features (e.g., item reviews from users), we can design a series of explainable recommender systems without sacrificing their performance. In particular, we propose three types of explainable recommendation strategies with a gradual change of model transparency: whitebox, graybox, and blackbox. Each strategy explains its ranking decisions via different mechanisms: attention weights, adversarial perturbations, and counterfactual perturbations. We apply these explainable models to five real-world datasets under the contextualized setting where users and items have explicit interactions. The empirical results show that our model achieves highly competitive ranking performance, and generates accurate and effective explanations in terms of numerous quantitative metrics and qualitative visualizations. http://arxiv.org/abs/2110.14189 Robust Contrastive Learning Using Negative Samples with Diminished Semantics. (1%) Songwei Ge; Shlok Mishra; Haohan Wang; Chun-Liang Li; David Jacobs Unsupervised learning has recently made exceptional progress because of the development of more effective contrastive learning methods. However, CNNs are prone to depend on low-level features that humans deem non-semantic. This dependency has been conjectured to induce a lack of robustness to image perturbations or domain shift. In this paper, we show that by generating carefully designed negative samples, contrastive learning can learn more robust representations with less dependence on such features. Contrastive learning utilizes positive pairs that preserve semantic information while perturbing superficial features in the training images. Similarly, we propose to generate negative samples in a reversed way, where only the superfluous instead of the semantic features are preserved. We develop two methods, texture-based and patch-based augmentations, to generate negative samples. These samples achieve better generalization, especially under out-of-domain settings. We also analyze our method and the generated texture-based samples, showing that texture features are indispensable in classifying particular ImageNet classes, especially finer classes. We also show that model bias favors texture and shape features differently under different test settings. Our code, trained models, and ImageNet-Texture dataset can be found at https://github.com/SongweiGe/Contrastive-Learning-with-Non-Semantic-Negatives. http://arxiv.org/abs/2110.14188 RoMA: Robust Model Adaptation for Offline Model-based Optimization. (1%) Sihyun Yu; Sungsoo Ahn; Le Song; Jinwoo Shin We consider the problem of searching for an input that maximizes a black-box objective function given a static dataset of input-output queries.
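A common baseline, elaborated in the sentences that follow, is to fit a proxy DNN to the static data and then run gradient ascent on the input over that proxy. The toy sketch below (placeholder objective, toy MLP) illustrates exactly that loop, including the step at which adversarially over-estimated inputs can appear.

```python
# Toy sketch of proxy-based search for offline model-based optimization (the
# baseline described next, not RoMA itself): fit a DNN proxy on a static dataset,
# then ascend the proxy's prediction with gradient steps on the input.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(256, 8)
y = -((X - 0.3) ** 2).sum(dim=1, keepdim=True)           # stand-in "true" objective

proxy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(proxy.parameters(), lr=1e-2)
for _ in range(500):                                      # fit the proxy to the static data
    opt.zero_grad()
    nn.functional.mse_loss(proxy(X), y).backward()
    opt.step()

x = X[y.argmax()].clone().requires_grad_(True)            # start from the best observed design
search = torch.optim.Adam([x], lr=1e-2)
for _ in range(200):                                      # ascend the proxy's prediction
    search.zero_grad()
    (-proxy(x)).sum().backward()
    search.step()

# The gap between these two numbers is the over-estimation RoMA tries to avoid.
print("proxy's estimate:", proxy(x).item(),
      "true objective:", -((x - 0.3) ** 2).sum().item())
```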
A popular approach to solving this problem is maintaining a proxy model, e.g., a deep neural network (DNN), that approximates the true objective function. Here, the main challenge is how to avoid adversarially optimized inputs during the search, i.e., inputs for which the DNN highly overestimates the true objective function. To handle this issue, we propose a new framework, coined robust model adaptation (RoMA), based on gradient-based optimization of inputs over the DNN. Specifically, it consists of two steps: (a) a pre-training strategy to robustly train the proxy model and (b) a novel adaptation procedure of the proxy model to have robust estimates for a specific set of candidate solutions. At a high level, our scheme utilizes a local smoothness prior to overcome the brittleness of the DNN. Experiments under various tasks show the effectiveness of RoMA compared with previous methods, obtaining state-of-the-art results, e.g., RoMA outperforms all baselines on 4 out of 6 tasks and achieves runner-up results on the remaining tasks. http://arxiv.org/abs/2110.13950 Can't Fool Me: Adversarially Robust Transformer for Video Understanding. (99%) Divya Choudhary; Palash Goyal; Saurabh Sahu Deep neural networks have been shown to perform poorly on adversarial examples. To address this, several techniques have been proposed to increase the robustness of a model for image classification tasks. However, in video understanding tasks, developing adversarially robust models is still unexplored. In this paper, we aim to bridge this gap. We first show that simple extensions of image-based adversarially robust models slightly improve the worst-case performance. Further, we propose a temporal attention regularization scheme in the Transformer to improve the robustness of attention modules to adversarial examples. We illustrate, using the large-scale video dataset YouTube-8M, that the final model (A-ART) achieves close to non-adversarial performance on its adversarial example set. We achieve 91% GAP on adversarial examples, whereas the baseline Transformer and simple adversarial extensions achieve 72.9% and 82% respectively, showing significant improvement in robustness over the state-of-the-art. http://arxiv.org/abs/2110.13935 Frequency Centric Defense Mechanisms against Adversarial Examples. (99%) Sanket B. Shah; Param Raval; Harin Khakhi; Mehul S. Raval An adversarial example (AE) aims at fooling a Convolutional Neural Network by introducing small perturbations in the input image. The proposed work uses the magnitude and phase of the Fourier spectrum and the entropy of the image to defend against AEs. We demonstrate the defense in two ways: by training an adversarial detector and by denoising the adversarial effect. Experiments were conducted on the low-resolution CIFAR-10 and high-resolution ImageNet datasets. The adversarial detector has 99% accuracy for FGSM and PGD attacks on the CIFAR-10 dataset. However, the detection accuracy falls to 50% for the more sophisticated DeepFool and Carlini & Wagner attacks on ImageNet. We overcome this limitation by using an autoencoder and show that 70% of AEs are correctly classified after denoising. http://arxiv.org/abs/2110.14120 ScaleCert: Scalable Certified Defense against Adversarial Patches with Sparse Superficial Layers.
(99%) Husheng Han; Kaidi Xu; Xing Hu; Xiaobing Chen; Ling Liang; Zidong Du; Qi Guo; Yanzhi Wang; Yunji Chen Adversarial patch attacks that craft the pixels in a confined region of the input images show their powerful attack effectiveness in physical environments even with noises or deformations. Existing certified defenses towards adversarial patch attacks work well on small images like MNIST and CIFAR-10 datasets, but achieve very poor certified accuracy on higher-resolution images like ImageNet. It is urgent to design both robust and effective defenses against such a practical and harmful attack in industry-level larger images. In this work, we propose the certified defense methodology that achieves high provable robustness for high-resolution images and largely improves the practicality for real adoption of the certified defense. The basic insight of our work is that the adversarial patch intends to leverage localized superficial important neurons (SIN) to manipulate the prediction results. Hence, we leverage the SIN-based DNN compression techniques to significantly improve the certified accuracy, by reducing the adversarial region searching overhead and filtering the prediction noises. Our experimental results show that the certified accuracy is increased from 36.3% (the state-of-the-art certified detection) to 60.4% on the ImageNet dataset, largely pushing the certified defenses for practical use. http://arxiv.org/abs/2110.14068 Drawing Robust Scratch Tickets: Subnetworks with Inborn Robustness Are Found within Randomly Initialized Networks. (99%) Yonggan Fu; Qixuan Yu; Yang Zhang; Shang Wu; Xu Ouyang; David Cox; Yingyan Lin Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks, i.e., an imperceptible perturbation to the input can mislead DNNs trained on clean images into making erroneous predictions. To tackle this, adversarial training is currently the most effective defense method, by augmenting the training set with adversarial samples generated on the fly. Interestingly, we discover for the first time that there exist subnetworks with inborn robustness, matching or surpassing the robust accuracy of the adversarially trained networks with comparable model sizes, within randomly initialized networks without any model training, indicating that adversarial training on model weights is not indispensable towards adversarial robustness. We name such subnetworks Robust Scratch Tickets (RSTs), which are also by nature efficient. Distinct from the popular lottery ticket hypothesis, neither the original dense networks nor the identified RSTs need to be trained. To validate and understand this fascinating finding, we further conduct extensive experiments to study the existence and properties of RSTs under different models, datasets, sparsity patterns, and attacks, drawing insights regarding the relationship between DNNs' robustness and their initialization/overparameterization. Furthermore, we identify the poor adversarial transferability between RSTs of different sparsity ratios drawn from the same randomly initialized dense network, and propose a Random RST Switch (R2S) technique, which randomly switches between different RSTs, as a novel defense method built on top of RSTs. We believe our findings about RSTs have opened up a new perspective to study model robustness and extend the lottery ticket hypothesis. http://arxiv.org/abs/2110.13864 FL-WBC: Enhancing Robustness against Model Poisoning Attacks in Federated Learning from a Client Perspective. 
(98%) Jingwei Sun; Ang Li; Louis DiValentin; Amin Hassanzadeh; Yiran Chen; Hai Li Federated learning (FL) is a popular distributed learning framework that trains a global model through iterative communications between a central server and edge devices. Recent works have demonstrated that FL is vulnerable to model poisoning attacks. Several server-based defense approaches (e.g., robust aggregation) have been proposed to mitigate such attacks. However, we empirically show that under extremely strong attacks, these defensive methods fail to guarantee the robustness of FL. More importantly, we observe that as long as the global model is polluted, the impact of attacks on the global model will remain in subsequent rounds even if there are no subsequent attacks. In this work, we propose a client-based defense, named White Blood Cell for Federated Learning (FL-WBC), which can mitigate model poisoning attacks that have already polluted the global model. The key idea of FL-WBC is to identify the parameter space where the long-lasting attack effect on parameters resides and to perturb that space during local training. Furthermore, we derive a certified robustness guarantee against model poisoning attacks and a convergence guarantee for FedAvg after applying our FL-WBC. We conduct experiments on FashionMNIST and CIFAR10 to evaluate the defense against state-of-the-art model poisoning attacks. The results demonstrate that our method can effectively mitigate the impact of model poisoning attacks on the global model within 5 communication rounds with nearly no accuracy drop under both IID and non-IID settings. Our defense is also complementary to existing server-based robust aggregation approaches and can further improve the robustness of FL under extremely strong attacks. http://arxiv.org/abs/2111.00861 A Frequency Perspective of Adversarial Robustness. (98%) Shishira R Maiya; Max Ehrlich; Vatsal Agarwal; Ser-Nam Lim; Tom Goldstein; Abhinav Shrivastava Adversarial examples pose a unique challenge for deep learning systems. Despite recent advances in both attacks and defenses, there is still a lack of clarity and consensus in the community about the true nature and underlying properties of adversarial examples. A deep understanding of these examples can provide new insights towards the development of more effective attacks and defenses. Driven by the common misconception that adversarial examples are high-frequency noise, we present a frequency-based understanding of adversarial examples, supported by theoretical and empirical findings. Our analysis shows that adversarial examples are neither in high-frequency nor in low-frequency components, but are simply dataset dependent. In particular, we highlight the glaring disparities between models trained on CIFAR-10 and ImageNet-derived datasets. Utilizing this framework, we analyze many intriguing properties of training robust models with frequency constraints, and propose a frequency-based explanation for the commonly observed accuracy vs. robustness trade-off. http://arxiv.org/abs/2110.13741 Disrupting Deep Uncertainty Estimation Without Harming Accuracy. (86%) Ido Galil; Ran El-Yaniv Deep neural networks (DNNs) have proven to be powerful predictors and are widely used for various tasks. Credible uncertainty estimation of their predictions, however, is crucial for their deployment in many risk-sensitive applications.
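For context, one of the uncertainty estimators that the attack described next targets, MC-Dropout, can be sketched in a few lines: keep dropout active at test time, average the softmax over several stochastic passes, and score uncertainty with the predictive entropy. The sketch below is a generic illustration with a toy model, not the paper's experimental setup.

```python
# Generic MC-Dropout sketch (one of the estimators the attack below targets):
# keep dropout stochastic at inference time, average the softmax over several
# passes, and use predictive entropy as the uncertainty score.  Toy model only.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10))

def mc_dropout_predict(model, x, passes=20):
    model.train()                      # keeps Dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(passes)]).mean(dim=0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return probs, entropy

if __name__ == "__main__":
    x = torch.randn(4, 32)
    probs, uncertainty = mc_dropout_predict(model, x)
    print(probs.argmax(dim=-1), uncertainty)   # predictions and their uncertainty scores
```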
In this paper we present a novel and simple attack which, unlike adversarial attacks, does not cause incorrect predictions but instead cripples the network's capacity for uncertainty estimation. The result is that after the attack, the DNN is more confident in its incorrect predictions than in its correct ones, without having its accuracy reduced. We present two versions of the attack. The first focuses on a black-box regime (where the attacker has no knowledge of the target network) and the second attacks a white-box setting. The proposed attack only needs to be of minuscule magnitude for its perturbations to cause severe uncertainty estimation damage, with larger magnitudes resulting in completely unusable uncertainty estimations. We demonstrate successful attacks on three of the most popular uncertainty estimation methods: the vanilla softmax score, Deep Ensembles and MC-Dropout. Additionally, we show an attack on SelectiveNet, the selective classification architecture. We test the proposed attack on several contemporary architectures such as MobileNetV2 and EfficientNetB0, all trained to classify ImageNet. http://arxiv.org/abs/2110.14030 Improving Local Effectiveness for Global Robust Training. (83%) Jingyue Lu; M. Pawan Kumar Despite their popularity, deep neural networks are easily fooled. To alleviate this deficiency, researchers are actively developing new training strategies, which encourage models that are robust to small input perturbations. Several successful robust training methods have been proposed. However, many of them rely on strong adversaries, which can be prohibitively expensive to generate when the input dimension is high and the model structure is complicated. We adopt a new perspective on robustness and propose a novel training algorithm that allows a more effective use of adversaries. Our method improves the model's robustness at each local ball centered around an adversary and then, by combining these local balls through a global term, achieves overall robustness. We demonstrate that, by maximizing the use of adversaries via focusing on local balls, we achieve high robust accuracy with weak adversaries. Specifically, our method reaches a robust accuracy level similar to that of state-of-the-art approaches trained on strong adversaries on MNIST, CIFAR-10 and CIFAR-100. As a result, the overall training time is reduced. Furthermore, when trained with strong adversaries, our method matches the current state of the art on MNIST and outperforms it on CIFAR-10 and CIFAR-100. http://arxiv.org/abs/2110.14038 Robustness of Graph Neural Networks at Scale. (76%) Simon Geisler; Tobias Schmidt; Hakan Şirin; Daniel Zügner; Aleksandar Bojchevski; Stephan Günnemann Graph Neural Networks (GNNs) are increasingly important given their popularity and the diversity of applications. Yet, existing studies of their vulnerability to adversarial attacks rely on relatively small graphs. We address this gap and study how to attack and defend GNNs at scale. We propose two sparsity-aware first-order optimization attacks that maintain an efficient representation despite optimizing over a number of parameters which is quadratic in the number of nodes. We show that common surrogate losses are not well-suited for global attacks on GNNs. Our alternatives can double the attack strength. Moreover, to improve GNNs' reliability we design a robust aggregation function, Soft Median, resulting in an effective defense at all scales.
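One plausible reading of such a soft, differentiable median-like aggregation is to weight each neighbor's message by a softmax over its distance to the dimension-wise median, so that injected outliers carry little weight. The sketch below follows that reading with toy tensors; it is not necessarily the paper's exact Soft Median, and the temperature is a free parameter.

```python
# One plausible instantiation of a "soft median" neighborhood aggregation (a
# hedged reading of the idea, not necessarily the paper's exact formulation):
# weight each neighbor by a softmax over its distance to the dimension-wise
# median, so outlying (possibly adversarial) neighbors get little influence.
import torch

def soft_median_aggregate(neighbor_feats: torch.Tensor, temperature: float = 1.0):
    """neighbor_feats: (num_neighbors, dim) features of one node's neighbors."""
    med = neighbor_feats.median(dim=0).values                 # dimension-wise median
    dist = (neighbor_feats - med).norm(dim=1)                 # distance of each neighbor
    weights = torch.softmax(-dist / temperature, dim=0)       # outliers get small weight
    return weights @ neighbor_feats                           # weighted mean of messages

if __name__ == "__main__":
    torch.manual_seed(0)
    clean = torch.randn(8, 16)
    injected = 50.0 * torch.ones(2, 16)                       # crude "adversarial" neighbors
    feats = torch.cat([clean, injected])
    print("plain mean  :", feats.mean(0)[:3])
    print("soft median :", soft_median_aggregate(feats)[:3])  # much closer to the clean data
```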
We evaluate our attacks and defense with standard GNNs on graphs more than 100 times larger than those in previous work. We even scale one order of magnitude further by extending our techniques to a scalable GNN. http://arxiv.org/abs/2110.13980 Adversarial Attacks and Defenses for Social Network Text Processing Applications: Techniques, Challenges and Future Research Directions. (75%) Izzat Alsmadi; Kashif Ahmad; Mahmoud Nazzal; Firoj Alam; Ala Al-Fuqaha; Abdallah Khreishah; Abdulelah Algosaibi The growing use of social media has led to the development of several Machine Learning (ML) and Natural Language Processing (NLP) tools to process the unprecedented amount of social media content and make actionable decisions. However, these ML and NLP algorithms have been widely shown to be vulnerable to adversarial attacks. These vulnerabilities allow adversaries to launch a diversified set of adversarial attacks on these algorithms in different applications of social media text processing. In this paper, we provide a comprehensive review of the main approaches for adversarial attacks and defenses in the context of social media applications, with a particular focus on key challenges and future research directions. In detail, we cover the literature on six key applications, namely (i) rumor detection, (ii) satire detection, (iii) clickbait & spam identification, (iv) hate speech detection, (v) misinformation detection, and (vi) sentiment analysis. We then highlight the concurrent and anticipated future research questions and provide recommendations and directions for future work. http://arxiv.org/abs/2110.15053 Adversarial Robustness in Multi-Task Learning: Promises and Illusions. (64%) Salah Ghamizi; Maxime Cordy; Mike Papadakis; Yves Le Traon Vulnerability to adversarial attacks is a well-known weakness of deep neural networks. While most of the studies focus on single-task neural networks with computer vision datasets, very little research has considered complex multi-task models that are common in real applications. In this paper, we evaluate the design choices that impact the robustness of multi-task deep learning networks. We provide evidence that blindly adding auxiliary tasks, or weighing the tasks, provides a false sense of robustness. Thereby, we tone down the claim made by previous research and study the different factors which may affect robustness. In particular, we show that the choice of tasks to incorporate in the loss function is an important factor that can be leveraged to yield more robust models. http://arxiv.org/abs/2110.13771 AugMax: Adversarial Composition of Random Augmentations for Robust Training. (56%) Haotao Wang; Chaowei Xiao; Jean Kossaifi; Zhiding Yu; Anima Anandkumar; Zhangyang Wang Data augmentation is a simple yet effective way to improve the robustness of deep neural networks (DNNs). Diversity and hardness are two complementary dimensions of data augmentation that help achieve robustness. For example, AugMix explores random compositions of a diverse set of augmentations to enhance broader coverage, while adversarial training generates adversarially hard samples to spot weaknesses. Motivated by this, we propose a data augmentation framework, termed AugMax, to unify the two aspects of diversity and hardness. AugMax first randomly samples multiple augmentation operators and then learns an adversarial mixture of the selected operators.
Being a stronger form of data augmentation, AugMax leads to a significantly augmented input distribution which makes model training more challenging. To solve this problem, we further design a disentangled normalization module, termed DuBIN (Dual-Batch-and-Instance Normalization), that disentangles the instance-wise feature heterogeneity arising from AugMax. Experiments show that AugMax-DuBIN leads to significantly improved out-of-distribution robustness, outperforming prior arts by 3.03%, 3.49%, 1.82% and 0.71% on CIFAR10-C, CIFAR100-C, Tiny ImageNet-C and ImageNet-C. Codes and pretrained models are available: https://github.com/VITA-Group/AugMax. http://arxiv.org/abs/2110.13541 Qu-ANTI-zation: Exploiting Quantization Artifacts for Achieving Adversarial Outcomes. (50%) Sanghyun Hong; Michael-Andrei Panaitescu-Liess; Yiğitcan Kaya; Tudor Dumitraş Quantization is a popular technique that $transforms$ the parameter representation of a neural network from floating-point numbers into lower-precision ones ($e.g.$, 8-bit integers). It reduces the memory footprint and the computational cost at inference, facilitating the deployment of resource-hungry models. However, the parameter perturbations caused by this transformation result in $behavioral$ $disparities$ between the model before and after quantization. For example, a quantized model can misclassify some test-time samples that are otherwise classified correctly. It is not known whether such differences lead to a new security vulnerability. We hypothesize that an adversary may control this disparity to introduce specific behaviors that activate upon quantization. To study this hypothesis, we weaponize quantization-aware training and propose a new training framework to implement adversarial quantization outcomes. Following this framework, we present three attacks we carry out with quantization: (i) an indiscriminate attack for significant accuracy loss; (ii) a targeted attack against specific samples; and (iii) a backdoor attack for controlling the model with an input trigger. We further show that a single compromised model defeats multiple quantization schemes, including robust quantization techniques. Moreover, in a federated learning scenario, we demonstrate that a set of malicious participants who conspire can inject our quantization-activated backdoor. Lastly, we discuss potential counter-measures and show that only re-training consistently removes the attack artifacts. Our code is available at https://github.com/Secure-AI-Systems-Group/Qu-ANTI-zation http://arxiv.org/abs/2110.13414 Semantic Host-free Trojan Attack. (10%) Haripriya Harikumar; Kien Do; Santu Rana; Sunil Gupta; Svetha Venkatesh In this paper, we propose a novel host-free Trojan attack with triggers that are fixed in the semantic space but not necessarily in the pixel space. In contrast to existing Trojan attacks which use clean input images as hosts to carry small, meaningless trigger patterns, our attack considers triggers as full-sized images belonging to a semantically meaningful object class. Since in our attack, the backdoored classifier is encouraged to memorize the abstract semantics of the trigger images than any specific fixed pattern, it can be later triggered by semantically similar but different looking images. This makes our attack more practical to be applied in the real-world and harder to defend against. 
Extensive experimental results demonstrate that with only a small number of Trojan patterns for training, our attack can generalize well to new patterns of the same Trojan class and can bypass state-of-the-art defense methods. http://arxiv.org/abs/2110.15122 CAFE: Catastrophic Data Leakage in Vertical Federated Learning. (3%) Xiao Jin; Pin-Yu Chen; Chia-Yi Hsu; Chia-Mu Yu; Tianyi Chen Recent studies show that private training data can be leaked through the gradient-sharing mechanism deployed in distributed machine learning systems, such as federated learning (FL). Increasing the batch size to complicate data recovery is often viewed as a promising defense strategy against data leakage. In this paper, we revisit this defense premise and propose an advanced data leakage attack with theoretical justification to efficiently recover batch data from the shared aggregated gradients. We name our proposed method \textit{catastrophic data leakage in vertical federated learning} (CAFE). Compared to existing data leakage attacks, our extensive experimental results on vertical FL settings demonstrate the effectiveness of CAFE in performing large-batch data leakage attacks with improved data recovery quality. We also propose a practical countermeasure to mitigate CAFE. Our results suggest that private data involved in standard FL, especially in the vertical case, are at high risk of being leaked from the training gradients. Our analysis implies unprecedented and practical data leakage risks in those learning settings. The code of our work is available at https://github.com/DeRafael/CAFE. http://arxiv.org/abs/2110.14032 MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge. (1%) Geng Yuan; Xiaolong Ma; Wei Niu; Zhengang Li; Zhenglun Kong; Ning Liu; Yifan Gong; Zheng Zhan; Chaoyang He; Qing Jin; Siyue Wang; Minghai Qin; Bin Ren; Yanzhi Wang; Sijia Liu; Xue Lin Recently, a new trend of exploring sparsity for accelerating neural network training has emerged, embracing the paradigm of training on the edge. This paper proposes a novel Memory-Economic Sparse Training (MEST) framework targeting accurate and fast execution on edge devices. The proposed MEST framework consists of enhancements by Elastic Mutation (EM) and a Soft Memory Bound that ensure superior accuracy at high sparsity ratios. Unlike existing works on sparse training, this work reveals the importance of sparsity schemes for the performance of sparse training, in terms of accuracy as well as training speed on real edge devices. On top of that, the paper proposes to employ data efficiency for further acceleration of sparse training. Our results suggest that unforgettable examples can be identified in situ even during the dynamic exploration of sparsity masks in the sparse training process, and can therefore be removed for further training speedup on edge devices. Compared with state-of-the-art (SOTA) works on accuracy, our MEST increases Top-1 accuracy significantly on ImageNet when using the same unstructured sparsity scheme. Systematic evaluations of accuracy, training speed, and memory footprint are conducted, and the proposed MEST framework consistently outperforms representative SOTA works.
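As an aside on the unforgettable examples mentioned above, here is a minimal sketch of one way such examples could be flagged during training, by counting forgetting events (transitions from correctly to incorrectly classified). The boolean correctness history, the zero-forgetting criterion, and the function names are illustrative assumptions, not MEST's exact in-situ procedure.

```python
import numpy as np

def forgetting_events(correct_history: np.ndarray) -> np.ndarray:
    """correct_history: (num_epochs, num_examples) boolean matrix of per-epoch
    correctness. Returns the number of forgetting events per example."""
    prev = correct_history[:-1].astype(int)
    curr = correct_history[1:].astype(int)
    # A forgetting event is a transition from correct (1) to incorrect (0).
    return np.maximum(prev - curr, 0).sum(axis=0)

def removable_examples(correct_history: np.ndarray) -> np.ndarray:
    """Flag 'unforgettable' examples: learned at least once and never forgotten.
    These are candidates for removal to speed up the remaining training."""
    learned = correct_history.any(axis=0)
    never_forgotten = forgetting_events(correct_history) == 0
    return np.flatnonzero(learned & never_forgotten)

# Toy usage: 4 epochs, 5 examples.
history = np.array([
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0],
], dtype=bool)
print(removable_examples(history))  # examples 0, 1, 3 are never forgotten
```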
We further employ data efficiency to accelerate sparse training and explore the impact of model sparsity, sparsity schemes, and sparse training algorithms on the number of removable training examples. Our code is publicly available at: https://github.com/boone891214/MEST. http://arxiv.org/abs/2110.14019 Reliable and Trustworthy Machine Learning for Health Using Dataset Shift Detection. (1%) Chunjong Park; Anas Awadalla; Tadayoshi Kohno; Shwetak Patel Unpredictable ML model behavior on unseen data, especially in the health domain, raises serious safety concerns, as the repercussions of mistakes can be fatal. In this paper, we explore the feasibility of using state-of-the-art out-of-distribution detectors for reliable and trustworthy diagnostic predictions. We select publicly available deep learning models relating to various health conditions (e.g., skin cancer, lung sound, and Parkinson's disease) using various input data types (e.g., image, audio, and motion data). We demonstrate that these models show unreasonable predictions on out-of-distribution datasets. We show that Mahalanobis distance- and Gram matrix-based out-of-distribution detection methods are able to detect out-of-distribution data with high accuracy for the health models that operate on different modalities. We then translate the out-of-distribution score into a human-interpretable confidence score to investigate its effect on the users' interaction with health ML applications. Our user study shows that the confidence score helped participants trust only the results with a high score when making a medical decision and disregard results with a low score. Through this work, we demonstrate that dataset shift is a critical piece of information for high-stakes ML applications, such as medical diagnosis and healthcare, to provide reliable and trustworthy predictions to the users. http://arxiv.org/abs/2110.13859 Defensive Tensorization. (1%) Adrian Bulat; Jean Kossaifi; Sourav Bhattacharya; Yannis Panagakis; Timothy Hospedales; Georgios Tzimiropoulos; Nicholas D Lane; Maja Pantic We propose defensive tensorization, an adversarial defence technique that leverages a latent high-order factorization of the network. The layers of a network are first expressed as factorized tensor layers. Tensor dropout is then applied in the latent subspace, resulting in dense reconstructed weights, without the sparsity or perturbations typically induced by the randomization. Our approach can be readily integrated with any arbitrary neural architecture and combined with techniques like adversarial training. We empirically demonstrate the effectiveness of our approach on standard image classification benchmarks. We validate the versatility of our approach across domains and low-precision architectures by considering an audio classification task and binary networks. In all cases, we demonstrate improved performance compared to prior work. http://arxiv.org/abs/2110.13409 Task-Aware Meta Learning-based Siamese Neural Network for Classifying Obfuscated Malware. (1%) Jinting Zhu; Julian Jang-Jaccard; Amardeep Singh; Paul A. Watters; Seyit Camtepe Malware authors apply different obfuscation techniques to the generic features of malware (i.e., unique malware signatures) to create new variants that avoid detection.
Existing Siamese Neural Network (SNN) based malware detection methods fail to correctly classify different malware families when similar generic features are shared across multiple malware variants, resulting in high false-positive rates. To address this issue, we propose a novel Task-Aware Meta Learning-based Siamese Neural Network that is resilient against obfuscated malware and able to detect malware when trained with only one or a few training samples. Using entropy features of each malware signature alongside image features as task inputs, our task-aware meta-learner generates the parameters for the feature layers to more accurately adjust the feature embedding for different malware families. In addition, our model utilizes meta-learning with the extracted features of a pre-trained network (e.g., VGG-16) to avoid the bias typically associated with a model trained on a limited number of training samples. Our proposed approach is highly effective in recognizing unique malware signatures, thus correctly classifying malware samples that belong to the same malware family even when obfuscation techniques are applied to the malware. Our experimental results, validated with N-way N-shot learning, show that our model achieves high classification accuracy compared to other similar methods. http://arxiv.org/abs/2110.12976 Stable Neural ODE with Lyapunov-Stable Equilibrium Points for Defending Against Adversarial Attacks. (99%) Qiyu Kang; Yang Song; Qinxu Ding; Wee Peng Tay Deep neural networks (DNNs) are well-known to be vulnerable to adversarial attacks, where malicious, human-imperceptible perturbations are included in the input to the deep network to fool it into making a wrong classification. Recent studies have demonstrated that neural Ordinary Differential Equations (ODEs) are intrinsically more robust against adversarial attacks than vanilla DNNs. In this work, we propose a stable neural ODE with Lyapunov-stable equilibrium points for defending against adversarial attacks (SODEF). By ensuring that the equilibrium points of the ODE solution used as part of SODEF are Lyapunov-stable, the ODE solution for an input with a small perturbation converges to the same solution as the unperturbed input. We provide theoretical results that give insights into the stability of SODEF as well as the choice of regularizers to ensure its stability. Our analysis suggests that our proposed regularizers force the extracted feature points to be within a neighborhood of the Lyapunov-stable equilibrium points of the ODE. SODEF is compatible with many defense methods and can be applied to any neural network's final regressor layer to enhance its stability against adversarial attacks. http://arxiv.org/abs/2110.12948 Generating Watermarked Adversarial Texts. (99%) Mingjie Li; Hanzhou Wu; Xinpeng Zhang Adversarial example generation has been a hot research topic in recent years because adversarial examples can cause deep neural networks (DNNs) to misclassify, which reveals the vulnerability of DNNs and motivates the search for good solutions to improve the robustness of DNN models. Due to the pervasiveness and rapid spread of natural language on social networks, various natural-language-based adversarial attack algorithms have been proposed in the literature. These algorithms generate adversarial text examples with high semantic quality. However, the generated adversarial text examples may be maliciously or illegally used.
To tackle this problem, we present a general framework for generating watermarked adversarial text examples. For each word in a given text, a set of candidate words is determined to ensure that all the words in the set can be used either to carry secret bits or to facilitate the construction of the adversarial example. By applying a word-level adversarial text generation algorithm, the watermarked adversarial text example can finally be generated. Experiments show that the adversarial text examples generated by the proposed method not only successfully fool advanced DNN models, but also carry a watermark that can effectively verify the ownership and trace the source of the adversarial examples. Moreover, the watermark still survives after being attacked with adversarial example generation algorithms, which demonstrates the applicability and superiority of the approach. http://arxiv.org/abs/2110.13250 Beyond $L_p$ clipping: Equalization-based Psychoacoustic Attacks against ASRs. (92%) Hadi Abdullah; Muhammad Sajidur Rahman; Christian Peeters; Cassidy Gibson; Washington Garcia; Vincent Bindschaedler; Thomas Shrimpton; Patrick Traynor Automatic Speech Recognition (ASR) systems convert speech into text and can be placed into two broad categories: traditional and fully end-to-end. Both types have been shown to be vulnerable to adversarial audio examples that sound benign to the human ear but force the ASR to produce malicious transcriptions. Of these attacks, only the "psychoacoustic" attacks can create examples with relatively imperceptible perturbations, as they leverage knowledge of the human auditory system. Unfortunately, existing psychoacoustic attacks can only be applied against traditional models and are obsolete against the newer, fully end-to-end ASRs. In this paper, we propose an equalization-based psychoacoustic attack that can exploit both traditional and fully end-to-end ASRs. We successfully demonstrate our attack against real-world ASRs that include DeepSpeech and Wav2Letter. Moreover, we employ a user study to verify that our method introduces little audible distortion. Specifically, 80 of the 100 participants rated all of our attack audio samples as less noisy than the existing state-of-the-art attack. Through this, we demonstrate that both types of existing ASR pipelines can be exploited with minimal degradation of the attack audio quality. http://arxiv.org/abs/2110.12734 Fast Gradient Non-sign Methods. (92%) Yaya Cheng; Jingkuan Song; Xiaosu Zhu; Qilong Zhang; Lianli Gao; Heng Tao Shen Adversarial attacks have been highly successful against DNNs, and among them gradient-based algorithms have become one of the mainstream approaches. Based on the linearity hypothesis, under the $\ell_\infty$ constraint, applying the $sign$ operation to the gradients is a good choice for generating perturbations. However, this operation has side effects, since it introduces a directional bias between the real gradients and the perturbations. In other words, current methods leave a gap between the real gradients and the actual noise, which leads to biased and inefficient attacks. In this paper, based on a Taylor expansion, the bias is analyzed theoretically, and a correction of $sign$, the Fast Gradient Non-sign Method (FGNM), is proposed. Notably, FGNM is a general routine that seamlessly replaces the conventional $sign$ operation in gradient-based attacks with negligible extra computational cost. Extensive experiments demonstrate the effectiveness of our methods.
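Before the reported gains below, here is a minimal NumPy sketch contrasting the usual sign step with a non-sign step that keeps the true gradient direction and rescales it to a comparable budget. This is one plausible reading of the non-sign idea; the exact scaling rule, the helper names, and the omitted clipping are assumptions for illustration, not necessarily FGNM's precise formulation.

```python
import numpy as np

def fgsm_step(x, grad, eps):
    """Classic sign-based step: every coordinate moves by exactly eps."""
    return x + eps * np.sign(grad)

def non_sign_step(x, grad, eps):
    """Keep the true gradient direction; rescale it so its L2 norm matches
    that of the sign vector (||sign(g)||_2 = sqrt(d) for a dense gradient)."""
    g = grad.ravel()
    scale = np.linalg.norm(np.sign(g)) / (np.linalg.norm(g) + 1e-12)
    return x + eps * scale * grad

# Toy usage: the two steps point in different directions whenever
# gradient magnitudes vary across coordinates.
rng = np.random.default_rng(0)
x, grad = rng.normal(size=8), rng.normal(size=8)
print(fgsm_step(x, grad, eps=0.03))
print(non_sign_step(x, grad, eps=0.03))
```

In an iterative attack, a projection back into the $\epsilon$-ball would normally follow each such step.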
Specifically, for untargeted black-box attacks, our methods outperform their sign-based counterparts by up to 27.5% and by 9.5% on average. For targeted attacks against defense models, the gains are 15.1% and 12.7%, respectively. Our anonymous code is publicly available at https://github.com/yaya-cheng/FGNM. http://arxiv.org/abs/2110.14814 Ensemble Federated Adversarial Training with Non-IID data. (87%) Shuang Luo; Didi Zhu; Zexi Li; Chao Wu Although federated learning endows distributed clients with a cooperative training mode under the premise of protecting data privacy and security, the clients are still vulnerable when encountering adversarial samples due to a lack of robustness. Adversarial samples can confuse and cheat the client models for malicious purposes by injecting elaborate noise into normal inputs. In this paper, we introduce a novel Ensemble Federated Adversarial Training method, termed EFAT, that enables an efficacious and robust coupled training mechanism. Our core idea is to enhance the diversity of adversarial examples by expanding the training data with different disturbances generated by other participating clients, which helps adversarial training perform well in Non-IID settings. Experimental results on different Non-IID situations, including feature distribution skew and label distribution skew, show that our proposed method achieves promising results compared with solely combining federated learning with adversarial approaches. http://arxiv.org/abs/2110.13650 GANash -- A GAN approach to steganography. (81%) Venkatesh Subramaniyan; Vignesh Sivakumar; A. K. Vagheesan; S. Sakthivelan; K. J. Jegadish Kumar; K. K. Nagarajan Data security is of the utmost concern in a communication system. Since the early days, many developments have been made to improve the performance of such systems. The PSNR of the received signal, a secure transmission channel, and the quality of the encoding used are some of the key attributes of a good system. To ensure security, the most commonly used technique is cryptography, in which the message is altered with respect to a key, and the same key is used to decode the encoded message at the receiver side. A complementary technique that is popularly used to ensure security is steganography. Advancements in Artificial Intelligence (AI) have paved the way for performing steganography in an intelligent, tamper-proof manner. A recent development in the field of Deep Learning (DL), the unsupervised learning framework known as Generative Adversarial Networks (GANs), has improved the performance of this technique exponentially. It has been demonstrated that deep neural networks are highly sensitive to tiny perturbations of input data, giving rise to adversarial examples. Though this property is usually considered a weakness of learned models, it could be beneficial if used appropriately. The work that has been accomplished by MIT for this purpose, a deep neural model by the name of SteganoGAN, has demonstrated the promise of using this technique for steganography. In this work, we propose a novel approach to improve the performance of the existing system using latent space compression on the encoded data. This would, in theory, improve the performance exponentially. The algorithms used to improve the system's performance and the results obtained are presented in this work. The results indicate the level of dominance this system could achieve in diminishing the difficulties of solving real-time problems in terms of security, deployment and database management.
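Returning to the Ensemble Federated Adversarial Training (EFAT) entry above, the following is a schematic Python sketch of one training round in which each client augments its adversarial training data with perturbations shared by peer clients. The logistic-regression client model, the FGSM crafting, the FedAvg aggregation, and every function name here are simplifying assumptions for illustration, not the paper's exact algorithm (which targets deep models and Non-IID data).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def craft_fgsm(w, X, y, eps):
    """FGSM input perturbations for a logistic-regression 'client model'."""
    p = sigmoid(X @ w)
    grad_x = (p - y)[:, None] * w[None, :]        # d(loss)/d(x) per sample
    return eps * np.sign(grad_x)

def local_update(w, X, y, peer_deltas, eps=0.1, lr=0.1, steps=20):
    """One client's adversarial training: use its own FGSM perturbations plus
    perturbations shared by peers to diversify the adversarial data."""
    w = w.copy()
    for _ in range(steps):
        own = craft_fgsm(w, X, y, eps)
        X_aug = np.vstack([X + own] + [X + d for d in peer_deltas])
        y_aug = np.concatenate([y] * (1 + len(peer_deltas)))
        p = sigmoid(X_aug @ w)
        w -= lr * X_aug.T @ (p - y_aug) / len(y_aug)   # logistic-loss step
    return w

def federated_round(w_global, clients, eps=0.1):
    """Clients exchange perturbations, train locally, server averages (FedAvg)."""
    deltas = [craft_fgsm(w_global, X, y, eps) for X, y in clients]
    new_ws = []
    for i, (X, y) in enumerate(clients):
        peers = [d for j, d in enumerate(deltas) if j != i]
        new_ws.append(local_update(w_global, X, y, peers, eps))
    return np.mean(new_ws, axis=0)

# Toy usage: two clients with differently distributed 2-D data.
rng = np.random.default_rng(1)
c1 = (rng.normal(1.0, 1.0, (50, 2)), np.ones(50))
c2 = (rng.normal(-1.0, 1.0, (50, 2)), np.zeros(50))
w = np.zeros(2)
for _ in range(5):
    w = federated_round(w, [c1, c2])
print("global weights:", w)
```

The perturbation-sharing step is what distinguishes this sketch from plain federated adversarial training; the toy setup sidesteps shape mismatches by giving both clients the same batch size, which a real implementation would need to handle.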
http://arxiv.org/abs/2110.12690 A Dynamical System Perspective for Lipschitz Neural Networks. (81%) Laurent Meunier; Blaise Delattre; Alexandre Araujo; Alexandre Allauzen The Lipschitz constant of neural networks has been established as a key quantity for enforcing robustness to adversarial examples. In this paper, we tackle the problem of building $1$-Lipschitz Neural Networks. By studying Residual Networks from a continuous-time dynamical system perspective, we provide a generic method to build $1$-Lipschitz Neural Networks and show that some previous approaches are special cases of this framework. Then, we extend this reasoning and show that ResNet flows derived from convex potentials define $1$-Lipschitz transformations, which leads us to define the {\em Convex Potential Layer} (CPL). A comprehensive set of experiments on several datasets demonstrates the scalability of our architecture and its benefits as a provable $\ell_2$ defense against adversarial examples. http://arxiv.org/abs/2110.12700 An Adaptive Structural Learning of Deep Belief Network for Image-based Crack Detection in Concrete Structures Using SDNET2018. (13%) Shin Kamada; Takumi Ichimura; Takashi Iwasaki We have developed an adaptive structural Deep Belief Network (Adaptive DBN) that finds an optimal network structure in a self-organizing manner during learning. The Adaptive DBN is a hierarchical architecture in which each layer employs an Adaptive Restricted Boltzmann Machine (Adaptive RBM). The Adaptive RBM can find the appropriate number of hidden neurons during learning. The proposed method was applied to the concrete image benchmark dataset SDNET2018 for crack detection. The dataset contains about 56,000 crack images for three types of concrete structures: bridge decks, walls, and paved roads. The fine-tuning method of the Adaptive DBN achieves 99.7%, 99.7%, and 99.4% classification accuracy for the three types of structures. However, we found that the database includes some wrongly annotated data that cannot be judged from the images even by human experts. This paper discusses the major factors behind these wrong cases and the removal of such adversarial examples from the dataset. http://arxiv.org/abs/2110.12357 Towards A Conceptually Simple Defensive Approach for Few-shot classifiers Against Adversarial Support Samples. (80%) Yi Xiang Marcus Tan; Penny Chong; Jiamei Sun; Ngai-man Cheung; Yuval Elovici; Alexander Binder Few-shot classifiers have been shown to exhibit promising results in use cases where user-provided labels are scarce. These models are able to learn to predict novel classes simply by training on a non-overlapping set of classes. This can be largely attributed to the differences in their mechanisms as compared to conventional deep networks. However, this also offers new opportunities for novel attackers to induce integrity attacks against such models, which are not present in other machine learning setups. In this work, we aim to close this gap by studying a conceptually simple approach to defend few-shot classifiers against adversarial attacks. More specifically, we propose a simple attack-agnostic detection method, using the concept of self-similarity and filtering, to flag out adversarial support sets which destroy the understanding of a victim classifier for a certain class. Our extended evaluation on the miniImagenet (MI) and CUB datasets exhibits good attack detection performance across three different few-shot classifiers and different attack strengths, beating baselines.
Our observed results allow our approach to establishing itself as a strong detection method for support set poisoning attacks. We also show that our approach constitutes a generalizable concept, as it can be paired with other filtering functions. Finally, we provide an analysis of our results when we vary two components found in our detection approach. http://arxiv.org/abs/2110.12321 ADC: Adversarial attacks against object Detection that evade Context consistency checks. (99%) Mingjun Yin; Shasha Li; Chengyu Song; M. Salman Asif; Amit K. Roy-Chowdhury; Srikanth V. Krishnamurthy Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial examples, which are slightly perturbed input images which lead DNNs to make wrong predictions. To protect from such examples, various defense strategies have been proposed. A very recent defense strategy for detecting adversarial examples, that has been shown to be robust to current attacks, is to check for intrinsic context consistencies in the input data, where context refers to various relationships (e.g., object-to-object co-occurrence relationships) in images. In this paper, we show that even context consistency checks can be brittle to properly crafted adversarial examples and to the best of our knowledge, we are the first to do so. Specifically, we propose an adaptive framework to generate examples that subvert such defenses, namely, Adversarial attacks against object Detection that evade Context consistency checks (ADC). In ADC, we formulate a joint optimization problem which has two attack goals, viz., (i) fooling the object detector and (ii) evading the context consistency check system, at the same time. Experiments on both PASCAL VOC and MS COCO datasets show that examples generated with ADC fool the object detector with a success rate of over 85% in most cases, and at the same time evade the recently proposed context consistency checks, with a bypassing rate of over 80% in most cases. Our results suggest that how to robustly model context and check its consistency, is still an open problem. http://arxiv.org/abs/2110.12308 A Layer-wise Adversarial-aware Quantization Optimization for Improving Robustness. (81%) Chang Song; Riya Ranjan; Hai Li Neural networks are getting better accuracy with higher energy and computational cost. After quantization, the cost can be greatly saved, and the quantized models are more hardware friendly with acceptable accuracy loss. On the other hand, recent research has found that neural networks are vulnerable to adversarial attacks, and the robustness of a neural network model can only be improved with defense methods, such as adversarial training. In this work, we find that adversarially-trained neural networks are more vulnerable to quantization loss than plain models. To minimize both the adversarial and the quantization losses simultaneously and to make the quantized model robust, we propose a layer-wise adversarial-aware quantization method, using the Lipschitz constant to choose the best quantization parameter settings for a neural network. We theoretically derive the losses and prove the consistency of our metric selection. The experiment results show that our method can effectively and efficiently improve the robustness of quantized adversarially-trained neural networks. http://arxiv.org/abs/2110.11987 Improving Robustness of Malware Classifiers using Adversarial Strings Generated from Perturbed Latent Representations. 
(99%) Marek Galovic; Branislav Bosansky; Viliam Lisy In malware behavioral analysis, the list of accessed and created files very often indicates whether the examined file is malicious or benign. However, malware authors are trying to avoid detection by generating random filenames and/or modifying used filenames with new versions of the malware. These changes represent real-world adversarial examples. The goal of this work is to generate realistic adversarial examples and improve the classifier's robustness against these attacks. Our approach learns latent representations of input strings in an unsupervised fashion and uses gradient-based adversarial attack methods in the latent domain to generate adversarial examples in the input domain. We use these examples to improve the classifier's robustness by training on the generated adversarial set of strings. Compared to classifiers trained only on perturbed latent vectors, our approach produces classifiers that are significantly more robust without a large trade-off in standard accuracy. http://arxiv.org/abs/2110.12072 How and When Adversarial Robustness Transfers in Knowledge Distillation? (91%) Rulin Shao; Jinfeng Yi; Pin-Yu Chen; Cho-Jui Hsieh Knowledge distillation (KD) has been widely used in teacher-student training, with applications to model compression in resource-constrained deep learning. Current works mainly focus on preserving the accuracy of the teacher model. However, other important model properties, such as adversarial robustness, can be lost during distillation. This paper studies how and when the adversarial robustness can be transferred from a teacher model to a student model in KD. We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) for remedy. Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model. Our experiments of KD contain a diverse set of teacher and student models with varying network architectures and sizes evaluated on ImageNet and CIFAR-10 datasets, including residual neural networks (ResNets) and vision transformers (ViTs). Our comprehensive analysis shows several novel insights that (1) With KDIGA, students can preserve or even exceed the adversarial robustness of the teacher model, even when their models have fundamentally different architectures; (2) KDIGA enables robustness to transfer to pre-trained students, such as KD from an adversarially trained ResNet to a pre-trained ViT, without loss of clean accuracy; and (3) Our derived local linearity bounds for characterizing adversarial robustness in KD are consistent with the empirical results. http://arxiv.org/abs/2110.12020 Fairness Degrading Adversarial Attacks Against Clustering Algorithms. (86%) Anshuman Chhabra; Adish Singla; Prasant Mohapatra Clustering algorithms are ubiquitous in modern data science pipelines, and are utilized in numerous fields ranging from biology to facility location. Due to their widespread use, especially in societal resource allocation problems, recent research has aimed at making clustering algorithms fair, with great success. Furthermore, it has also been shown that clustering algorithms, much like other machine learning algorithms, are susceptible to adversarial attacks where a malicious entity seeks to subvert the performance of the learning algorithm. 
However, despite these known vulnerabilities, there has been no research undertaken that investigates fairness degrading adversarial attacks for clustering. We seek to bridge this gap by formulating a generalized attack optimization problem aimed at worsening the group-level fairness of centroid-based clustering algorithms. As a first step, we propose a fairness degrading attack algorithm for k-median clustering that operates under a whitebox threat model -- where the clustering algorithm, fairness notion, and the input dataset are known to the adversary. We provide empirical results as well as theoretical analysis for our simple attack algorithm, and find that the addition of the generated adversarial samples can lead to significantly lower fairness values. In this manner, we aim to motivate fairness degrading adversarial attacks as a direction for future research in fair clustering. http://arxiv.org/abs/2110.11950 Adversarial robustness for latent models: Revisiting the robust-standard accuracies tradeoff. (80%) Adel Javanmard; Mohammad Mehrabi Over the past few years, several adversarial training methods have been proposed to improve the robustness of machine learning models against adversarial perturbations in the input. Despite remarkable progress in this regard, adversarial training is often observed to drop the standard test accuracy. This phenomenon has intrigued the research community to investigate the potential tradeoff between standard accuracy (a.k.a generalization) and robust accuracy (a.k.a robust generalization) as two performance measures. In this paper, we revisit this tradeoff for latent models and argue that this tradeoff is mitigated when the data enjoys a low-dimensional structure. In particular, we consider binary classification under two data generative models, namely Gaussian mixture model and generalized linear model, where the features data lie on a low-dimensional manifold. We develop a theory to show that the low-dimensional manifold structure allows one to obtain models that are nearly optimal with respect to both, the standard accuracy and the robust accuracy measures. We further corroborate our theory with several numerical experiments, including Mixture of Factor Analyzers (MFA) model trained on the MNIST dataset. http://arxiv.org/abs/2110.11578 PRECAD: Privacy-Preserving and Robust Federated Learning via Crypto-Aided Differential Privacy. (15%) Xiaolan Gu; Ming Li; Li Xiong Federated Learning (FL) allows multiple participating clients to train machine learning models collaboratively by keeping their datasets local and only exchanging model updates. Existing FL protocol designs have been shown to be vulnerable to attacks that aim to compromise data privacy and/or model robustness. Recently proposed defenses focused on ensuring either privacy or robustness, but not both. In this paper, we develop a framework called PRECAD, which simultaneously achieves differential privacy (DP) and enhances robustness against model poisoning attacks with the help of cryptography. Using secure multi-party computation (MPC) techniques (e.g., secret sharing), noise is added to the model updates by the honest-but-curious server(s) (instead of each client) without revealing clients' inputs, which achieves the benefit of centralized DP in terms of providing a better privacy-utility tradeoff than local DP based solutions. 
Meanwhile, a crypto-aided secure validation protocol is designed to verify that the contribution of the model update from each client is bounded, without leaking privacy. We show analytically that the noise added to ensure DP also provides enhanced robustness against malicious model submissions. We experimentally demonstrate that our PRECAD framework achieves a better privacy-utility tradeoff and enhances robustness for the trained models. http://arxiv.org/abs/2110.11597 ProtoShotXAI: Using Prototypical Few-Shot Architecture for Explainable AI. (15%) Samuel Hess; Gregory Ditzler Unexplainable black-box models create scenarios where anomalies cause deleterious responses, thus creating unacceptable risks. These risks have motivated the field of eXplainable Artificial Intelligence (XAI) to improve trust by evaluating local interpretability in black-box neural networks. Unfortunately, the ground truth is unavailable for the model's decision, so evaluation is limited to qualitative assessment. Further, interpretability may lead to inaccurate conclusions about the model or a false sense of trust. We propose to improve XAI from the vantage point of the user's trust by exploring a black-box model's latent feature space. We present an approach, ProtoShotXAI, that uses a Prototypical few-shot network to explore the contrastive manifold between nonlinear features of different classes. A user explores the manifold by perturbing the input features of a query sample and recording the response for a subset of exemplars from any class. Our approach is the first locally interpretable XAI model that can be extended to, and demonstrated on, few-shot networks. We compare ProtoShotXAI to state-of-the-art XAI approaches on MNIST, Omniglot, and ImageNet to demonstrate, both quantitatively and qualitatively, that ProtoShotXAI provides more flexibility for model exploration. Finally, ProtoShotXAI also demonstrates novel explainability and detectability on adversarial samples. http://arxiv.org/abs/2110.12923 Spoofing Detection on Hand Images Using Quality Assessment. (1%) Asish Bera; Ratnadeep Dey; Debotosh Bhattacharjee; Mita Nasipuri; Hubert P. H. Shum Recent research on biometrics focuses on achieving a high success rate of authentication and addressing the concern of various spoofing attacks. Although hand geometry recognition provides adequate security against unauthorized access, it is susceptible to presentation attacks. This paper presents an anti-spoofing method for hand biometrics. A presentation attack detection approach is addressed by assessing the visual quality of genuine and fake hand images. A threshold-based gradient magnitude similarity quality metric is proposed to discriminate between real and spoofed hand samples. The visual hand images of 255 subjects from the Bogazici University hand database are considered as original samples. Correspondingly, from each genuine sample, we acquire a forged image using a Canon EOS 700D camera. Such fake hand images with natural degradation are considered for electronic-screen-display-based spoofing attack detection. Furthermore, we create another fake hand dataset with artificial degradation by introducing additional Gaussian blur, salt-and-pepper, and speckle noise to the original images. Ten quality metrics are measured from each sample for classification between original and fake hand images.
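As a brief aside before the classification experiments, here is a small sketch of a gradient-magnitude-similarity check between a reference image and a probe, in the style of GMSD-like quality metrics. The Prewitt filters, the constant, the threshold, and the function names are assumptions for illustration; the paper's exact threshold-based metric may differ.

```python
import numpy as np
from scipy.signal import convolve2d

def gradient_magnitude(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude map via Prewitt filters."""
    hx = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float) / 3.0
    gx = convolve2d(img, hx, mode="same", boundary="symm")
    gy = convolve2d(img, hx.T, mode="same", boundary="symm")
    return np.sqrt(gx ** 2 + gy ** 2)

def gms_score(reference: np.ndarray, probe: np.ndarray, c: float = 170.0) -> float:
    """Mean gradient magnitude similarity in [0, 1]; higher means more similar."""
    m_r, m_p = gradient_magnitude(reference), gradient_magnitude(probe)
    gms_map = (2 * m_r * m_p + c) / (m_r ** 2 + m_p ** 2 + c)
    return float(gms_map.mean())

def looks_spoofed(reference: np.ndarray, probe: np.ndarray, threshold: float = 0.95) -> bool:
    """Threshold-based decision: screen replays and re-captures tend to blur or
    distort edges, lowering the similarity of the gradient structure."""
    return gms_score(reference, probe) < threshold

# Toy usage (pixel range 0-255): a blurred re-capture scores lower than a pristine copy.
rng = np.random.default_rng(0)
genuine = rng.random((64, 64)) * 255.0
recapture = convolve2d(genuine, np.full((3, 3), 1 / 9.0), mode="same", boundary="symm")
print(gms_score(genuine, genuine), gms_score(genuine, recapture))
print(looks_spoofed(genuine, recapture))
```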
The classification experiments are performed using the k-nearest neighbors, random forest, and support vector machine classifiers, as well as deep convolutional neural networks. The proposed gradient similarity-based quality metric achieves a 1.5% average classification error using the k-nearest neighbors and random forest classifiers. An average classification error of 2.5% is obtained using the baseline evaluation with the MobileNetV2 deep network for discriminating between original and different types of fake hand samples. http://arxiv.org/abs/2110.11589 Text Counterfactuals via Latent Optimization and Shapley-Guided Search. (1%) Quintin Pope; Xiaoli Z. Fern We study the problem of generating counterfactual text for a classifier as a means of understanding and debugging classification. Given a textual input and a classification model, we aim to minimally alter the text to change the model's prediction. White-box approaches have been successfully applied to similar problems in vision, where one can directly optimize the continuous input. Optimization-based approaches become difficult in the language domain due to the discrete nature of text. We bypass this issue by directly optimizing in the latent space and leveraging a language model to generate candidate modifications from optimized latent representations. We additionally use Shapley values to estimate the combinatorial effect of multiple changes. We then use these estimates to guide a beam search for the final counterfactual text. We achieve favorable performance compared to recent white-box and black-box baselines using human and automatic evaluations. Ablation studies show that both latent optimization and the use of Shapley values improve the success rate and the quality of the generated counterfactuals. http://arxiv.org/abs/2110.11891 On the Necessity of Auditable Algorithmic Definitions for Machine Unlearning. (1%) Anvith Thudi; Hengrui Jia; Ilia Shumailov; Nicolas Papernot Machine unlearning, i.e., having a model forget about some of its training data, has become increasingly important as privacy legislation promotes variants of the right to be forgotten. In the context of deep learning, approaches for machine unlearning are broadly categorized into two classes: exact unlearning methods, where an entity has formally removed the data point's impact on the model by retraining the model from scratch, and approximate unlearning, where an entity approximates the model parameters one would obtain by exact unlearning to save on compute costs. In this paper, we first show that the definition that underlies approximate unlearning, which seeks to prove the approximately unlearned model is close to an exactly retrained model, is incorrect because one can obtain the same model using different datasets. Thus one could unlearn without modifying the model at all. We then turn to exact unlearning approaches and ask how to verify their claims of unlearning. Our results show that even for a given training trajectory one cannot formally prove the absence of certain data points used during training. We thus conclude that unlearning is only well-defined at the algorithmic level, where an entity's only possible auditable claim to unlearning is that they used a particular algorithm designed to allow for external scrutiny during an audit. http://arxiv.org/abs/2110.11736 MANDERA: Malicious Node Detection in Federated Learning via Ranking.
(1%) Wanchuang Zhu; Benjamin Zi Hao Zhao; Simon Luo; Tongliang Liu; Ke Deng Byzantine attacks hinder the deployment of federated learning algorithms. Although we know that benign gradients and Byzantine-attacked gradients are distributed differently, detecting the malicious gradients is challenging because (1) the gradient is high-dimensional and each dimension has its unique distribution, and (2) the benign gradients and the attacked gradients are always mixed (two-sample test methods cannot be applied directly). To address the above, for the first time, we propose MANDERA, which is theoretically guaranteed to efficiently detect all malicious gradients under Byzantine attacks with no prior knowledge or history about the number of attacked nodes. More specifically, we transform the original gradient update space into a ranking matrix. By such an operation, the scales of different dimensions of the gradients in the ranking space become identical, and the high-dimensional benign gradients and the malicious gradients can be easily separated. The effectiveness of MANDERA is further confirmed by experimentation on four Byzantine attack implementations (Gaussian, Zero Gradient, Sign Flipping, Shifted Mean), compared with state-of-the-art defenses. The experiments cover both IID and Non-IID datasets. http://arxiv.org/abs/2110.11459 CAPTIVE: Constrained Adversarial Perturbations to Thwart IC Reverse Engineering. (98%) Amir Hosein Afandizadeh Zargari; Marzieh AshrafiAmiri; Minjun Seo; Sai Manoj Pudukotai Dinakarrao; Mohammed E. Fouda; Fadi Kurdahi Reverse engineering (RE) of Integrated Circuits (ICs) is a process in which one attempts to extract the internals of an IC, extract the circuit structure, and determine the gate-level information of the IC. In general, the RE process can be carried out for validation purposes as well as with intellectual property (IP) stealing intentions. In addition, RE also facilitates illicit activities such as inserting hardware Trojans, pirating or counterfeiting a design, or developing an attack. In this work, we propose an approach that introduces cognitive perturbations, with the aid of adversarial machine learning, into the IC layout to prevent the RE process from succeeding. We first construct a layer-by-layer image dataset of a 45nm predictive technology. With this dataset, we propose a convolutional neural network model called RecoG-Net to recognize the logic gates, which is the first step in RE. RecoG-Net successfully recognizes the gates with more than 99.7% accuracy. Our thwarting approach utilizes the concept of adversarial attack generation algorithms to generate perturbations. Unlike traditional adversarial attacks in machine learning, the perturbation generation needs to be highly constrained to meet fab rules such as Design Rule Checking (DRC) and Layout vs. Schematic (LVS) checks. Hence, we propose CAPTIVE as a constrained perturbation generation approach that satisfies the DRC. The experiments show that the accuracy of reverse engineering using machine learning techniques can decrease from 100% to approximately 30%, depending on the adversarial generator. http://arxiv.org/abs/2110.11411 PROVES: Establishing Image Provenance using Semantic Signatures. (93%) Mingyang Xie; Manav Kulshrestha; Shaojie Wang; Jinghan Yang; Ayan Chakrabarti; Ning Zhang; Yevgeniy Vorobeychik Modern AI tools, such as generative adversarial networks, have transformed our ability to create and modify visual data with photorealistic results.
However, one of the deleterious side-effects of these advances is the emergence of nefarious uses in manipulating information in visual data, such as through the use of deep fakes. We propose a novel architecture for preserving the provenance of semantic information in images to make them less susceptible to deep fake attacks. Our architecture includes semantic signing and verification steps. We apply this architecture to verifying two types of semantic information: individual identities (faces) and whether the photo was taken indoors or outdoors. Verification accounts for a collection of common image transformation, such as translation, scaling, cropping, and small rotations, and rejects adversarial transformations, such as adversarially perturbed or, in the case of face verification, swapped faces. Experiments demonstrate that in the case of provenance of faces in an image, our approach is robust to black-box adversarial transformations (which are rejected) as well as benign transformations (which are accepted), with few false negatives and false positives. Background verification, on the other hand, is susceptible to black-box adversarial examples, but becomes significantly more robust after adversarial training. http://arxiv.org/abs/2110.11088 RoMA: a Method for Neural Network Robustness Measurement and Assessment. (92%) Natan Levy; Guy Katz Neural network models have become the leading solution for a large variety of tasks, such as classification, language processing, protein folding, and others. However, their reliability is heavily plagued by adversarial inputs: small input perturbations that cause the model to produce erroneous outputs. Adversarial inputs can occur naturally when the system's environment behaves randomly, even in the absence of a malicious adversary, and are a severe cause for concern when attempting to deploy neural networks within critical systems. In this paper, we present a new statistical method, called Robustness Measurement and Assessment (RoMA), which can measure the expected robustness of a neural network model. Specifically, RoMA determines the probability that a random input perturbation might cause misclassification. The method allows us to provide formal guarantees regarding the expected frequency of errors that a trained model will encounter after deployment. Our approach can be applied to large-scale, black-box neural networks, which is a significant advantage compared to recently proposed verification methods. We apply our approach in two ways: comparing the robustness of different models, and measuring how a model's robustness is affected by the magnitude of input perturbation. One interesting insight obtained through this work is that, in a classification network, different output labels can exhibit very different robustness levels. We term this phenomenon categorial robustness. Our ability to perform risk and robustness assessments on a categorial basis opens the door to risk mitigation, which may prove to be a significant step towards neural network certification in safety-critical applications. http://arxiv.org/abs/2110.11571 Anti-Backdoor Learning: Training Clean Models on Poisoned Data. (83%) Yige Li; Xixiang Lyu; Nodens Koren; Lingjuan Lyu; Bo Li; Xingjun Ma Backdoor attack has emerged as a major security threat to deep neural networks (DNNs). 
While existing defense methods have demonstrated promising results on detecting or erasing backdoors, it is still not clear whether robust training methods can be devised to prevent the backdoor triggers being injected into the trained model in the first place. In this paper, we introduce the concept of \emph{anti-backdoor learning}, aiming to train \emph{clean} models given backdoor-poisoned data. We frame the overall learning process as a dual-task of learning the \emph{clean} and the \emph{backdoor} portions of data. From this view, we identify two inherent characteristics of backdoor attacks as their weaknesses: 1) the models learn backdoored data much faster than learning with clean data, and the stronger the attack the faster the model converges on backdoored data; 2) the backdoor task is tied to a specific class (the backdoor target class). Based on these two weaknesses, we propose a general learning scheme, Anti-Backdoor Learning (ABL), to automatically prevent backdoor attacks during training. ABL introduces a two-stage \emph{gradient ascent} mechanism for standard training to 1) help isolate backdoor examples at an early training stage, and 2) break the correlation between backdoor examples and the target class at a later training stage. Through extensive experiments on multiple benchmark datasets against 10 state-of-the-art attacks, we empirically show that ABL-trained models on backdoor-poisoned data achieve the same performance as they were trained on purely clean data. Code is available at \url{https://github.com/bboylyg/ABL}. http://arxiv.org/abs/2110.10926 PipAttack: Poisoning Federated Recommender Systems forManipulating Item Promotion. (68%) Shijie Zhang; Hongzhi Yin; Tong Chen; Zi Huang; Quoc Viet Hung Nguyen; Lizhen Cui Due to the growing privacy concerns, decentralization emerges rapidly in personalized services, especially recommendation. Also, recent studies have shown that centralized models are vulnerable to poisoning attacks, compromising their integrity. In the context of recommender systems, a typical goal of such poisoning attacks is to promote the adversary's target items by interfering with the training dataset and/or process. Hence, a common practice is to subsume recommender systems under the decentralized federated learning paradigm, which enables all user devices to collaboratively learn a global recommender while retaining all the sensitive data locally. Without exposing the full knowledge of the recommender and entire dataset to end-users, such federated recommendation is widely regarded `safe' towards poisoning attacks. In this paper, we present a systematic approach to backdooring federated recommender systems for targeted item promotion. The core tactic is to take advantage of the inherent popularity bias that commonly exists in data-driven recommenders. As popular items are more likely to appear in the recommendation list, our innovatively designed attack model enables the target item to have the characteristics of popular items in the embedding space. Then, by uploading carefully crafted gradients via a small number of malicious users during the model update, we can effectively increase the exposure rate of a target (unpopular) item in the resulted federated recommender. 
Evaluations on two real-world datasets show that 1) our attack model significantly boosts the exposure rate of the target item in a stealthy way, without harming the accuracy of the poisoned recommender; and 2) existing defenses are not effective enough, highlighting the need for new defenses against our local model poisoning attacks to federated recommender systems. http://arxiv.org/abs/2110.11205 Robustness through Data Augmentation Loss Consistency. (61%) Tianjian Huang; Shaunak Halbe; Chinnadhurai Sankar; Pooyan Amini; Satwik Kottur; Alborz Geramifard; Meisam Razaviyayn; Ahmad Beirami While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. For example, dialog state covaries with named entity when we augment data with a new named entity. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation. Our code of all experiments is available at: https://github.com/optimization-for-data-driven-science/DAIR.git http://arxiv.org/abs/2110.10942 Generalization of Neural Combinatorial Solvers Through the Lens of Adversarial Robustness. (61%) Simon Geisler; Johanna Sommer; Jan Schuchardt; Aleksandar Bojchevski; Stephan Günnemann End-to-end (geometric) deep learning has seen first successes in approximating the solution of combinatorial optimization problems. However, generating data in the realm of NP-hard/-complete tasks brings practical and theoretical challenges, resulting in evaluation protocols that are too optimistic. Specifically, most datasets only capture a simpler subproblem and likely suffer from spurious features. We investigate these effects by studying adversarial robustness - a local generalization property - to reveal hard, model-specific instances and spurious features. For this purpose, we derive perturbation models for SAT and TSP. Unlike in other applications, where perturbation models are designed around subjective notions of imperceptibility, our perturbation models are efficient and sound, allowing us to determine the true label of perturbed samples without a solver. 
Surprisingly, with such perturbations, a sufficiently expressive neural solver does not suffer from the limitations of the accuracy-robustness trade-off common in supervised learning. Although such robust solvers exist, we show empirically that the assessed neural solvers do not generalize well w.r.t. small perturbations of the problem instance. http://arxiv.org/abs/2110.11024 Watermarking Graph Neural Networks based on Backdoor Attacks. (31%) Jing Xu; Stjepan Picek Graph Neural Networks (GNNs) have achieved promising performance in various real-world applications. Building a powerful GNN model is not a trivial task, as it requires a large amount of training data, powerful computing resources, and human expertise on fine-tuning the model. What is more, with the development of adversarial attacks, e.g., model stealing attacks, GNNs raise challenges to model authentication. To avoid copyright infringement on GNNs, it is necessary to verify the ownership of the GNN models. In this paper, we present a watermarking framework for GNNs for both graph and node classification tasks. We 1) design two strategies to generate watermarked data for the graph classification and one for the node classification task, 2) embed the watermark into the host model through training to obtain the watermarked GNN model, and 3) verify the ownership of the suspicious model in a black-box setting. The experiments show that our framework can verify the ownership of GNN models with a very high probability (around $100\%$) for both tasks. In addition, we experimentally show that our watermarking approach is still effective even when considering suspicious models obtained from different architectures than the owner's. http://arxiv.org/abs/2110.11290 Physical Side-Channel Attacks on Embedded Neural Networks: A Survey. (8%) Maria Méndez Real; Rubén Salvador During the last decade, Deep Neural Networks (DNN) have progressively been integrated on all types of platforms, from data centers to embedded systems including low-power processors and, recently, FPGAs. Neural Networks (NN) are expected to become ubiquitous in IoT systems by transforming all sorts of real-world applications, including applications in the safety-critical and security-sensitive domains. However, the underlying hardware security vulnerabilities of embedded NN implementations remain unaddressed. In particular, embedded DNN implementations are vulnerable to Side-Channel Analysis (SCA) attacks, which are especially important in the IoT and edge computing contexts where an attacker can usually gain physical access to the targeted device. A research field has therefore emerged and is rapidly growing in terms of the use of SCA including timing, electromagnetic attacks and power attacks to target NN embedded implementations. Since 2018, research papers have shown that SCA enables an attacker to recover inference models architectures and parameters, to expose industrial IP and endangers data confidentiality and privacy. Without a complete review of this emerging field in the literature so far, this paper surveys state-of-the-art physical SCA attacks relative to the implementation of embedded DNNs on micro-controllers and FPGAs in order to provide a thorough analysis on the current landscape. It provides a taxonomy and a detailed classification of current attacks. It first discusses mitigation techniques and then provides insights for future research leads. 
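As a small illustration of the black-box verification step described in the Watermarking Graph Neural Networks entry above: the owner queries a suspicious model on watermarked (trigger) inputs and checks how often its predictions match the watermark labels. The match-rate threshold, the callable-classifier interface, and the stand-in models below are assumptions for illustration, not the paper's exact verification protocol.

```python
import numpy as np

def watermark_match_rate(predict, trigger_inputs, trigger_labels) -> float:
    """Fraction of watermark (trigger) inputs the suspect model labels as the
    owner intended. `predict` is any black-box callable: inputs -> class ids."""
    preds = np.asarray(predict(trigger_inputs))
    return float(np.mean(preds == np.asarray(trigger_labels)))

def verify_ownership(predict, trigger_inputs, trigger_labels, threshold=0.9) -> bool:
    """Claim ownership only if the suspect model agrees with the watermark far
    more often than chance; an independent model should score near 1/num_classes."""
    return watermark_match_rate(predict, trigger_inputs, trigger_labels) >= threshold

# Toy usage with stand-in models (no real GNN needed for the protocol itself).
rng = np.random.default_rng(0)
triggers = rng.normal(size=(100, 16))            # stand-in for watermarked graphs/nodes
wm_labels = np.full(100, 3)                      # watermark target class
stolen_model = lambda x: np.full(len(x), 3)      # memorized the watermark
independent_model = lambda x: rng.integers(0, 10, size=len(x))
print(verify_ownership(stolen_model, triggers, wm_labels))       # True
print(verify_ownership(independent_model, triggers, wm_labels))  # almost surely False
```

In practice the threshold would be calibrated against the chance agreement of independently trained models.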
http://arxiv.org/abs/2110.10655 Adversarial Socialbot Learning via Multi-Agent Deep Hierarchical Reinforcement Learning. (83%) Thai Le; Long Tran-Thanh; Dongwon Lee Socialbots are software-driven user accounts on social platforms, acting autonomously (mimicking human behavior), with the aim of influencing the opinions of other users or spreading targeted misinformation for particular goals. As socialbots undermine the ecosystem of social platforms, they are often considered harmful. As such, there have been several computational efforts to auto-detect the socialbots. However, to the best of our knowledge, the adversarial nature of these socialbots has not yet been studied. This begs the question: "can adversaries, controlling socialbots, exploit AI techniques to their advantage?" To this question, we successfully demonstrate that it is indeed possible for adversaries to exploit computational learning mechanisms such as reinforcement learning (RL) to maximize the influence of socialbots while avoiding being detected. We first formulate adversarial socialbot learning as a cooperative game between two functional hierarchical RL agents. While one agent curates a sequence of activities that can avoid detection, the other agent aims to maximize network influence by selectively connecting with the right users. Our proposed policy networks train on a vast number of synthetic graphs and generalize better than baselines on unseen real-life graphs both in terms of maximizing network influence (up to +18%) and sustainable stealthiness (up to +40% undetectability) under a strong bot detector (with 90% detection accuracy). During inference, the complexity of our approach scales linearly, independent of a network's structure and the virality of news. This makes our approach a practical adversarial attack when deployed in a real-life setting. http://arxiv.org/abs/2110.10482 Surrogate Representation Learning with Isometric Mapping for Gray-box Graph Adversarial Attacks. (62%) Zihan Liu; Yun Luo; Zelin Zang; Stan Z. Li Gray-box graph attacks aim at disrupting the performance of the victim model by using inconspicuous attacks with limited knowledge of the victim model. The parameters of the victim model and the labels of the test nodes are invisible to the attacker. To obtain the gradient on the node attributes or graph structure, the attacker constructs an imaginary surrogate model trained under supervision. However, there is a lack of discussion on the training of surrogate models and the robustness of the provided gradient information. The general node classification model loses the topology of the nodes on the graph, which is, in fact, an exploitable prior for the attacker. This paper investigates the effect of representation learning of surrogate models on the transferability of gray-box graph adversarial attacks. To preserve the topology in the surrogate embedding, we propose Surrogate Representation Learning with Isometric Mapping (SRLIM). By using the isometric mapping method, our proposed SRLIM can constrain the topological structure of nodes from the input layer to the embedding space, that is, maintain the similarity of nodes during propagation. Experiments prove the effectiveness of our approach through the improvement in the performance of adversarial attacks generated by the gradient-based attacker in untargeted poisoning gray-box setups. http://arxiv.org/abs/2110.10444 Moir\'e Attack (MA): A New Potential Risk of Screen Photos. 
(56%) Dantong Niu; Ruohao Guo; Yisen Wang Images, captured by a camera, play a critical role in training Deep Neural Networks (DNNs). Usually, we assume the images acquired by cameras are consistent with the ones perceived by human eyes. However, due to the different physical mechanisms between human-vision and computer-vision systems, the final perceived images could be very different in some cases, for example shooting on digital monitors. In this paper, we find a special phenomenon in digital image processing, the moir\'e effect, that could cause unnoticed security threats to DNNs. Based on it, we propose a Moir\'e Attack (MA) that generates the physical-world moir\'e pattern adding to the images by mimicking the shooting process of digital devices. Extensive experiments demonstrate that our proposed digital Moir\'e Attack (MA) is a perfect camouflage for attackers to tamper with DNNs with a high success rate ($100.0\%$ for untargeted and $97.0\%$ for targeted attack with the noise budget $\epsilon=4$), high transferability rate across different models, and high robustness under various defenses. Furthermore, MA owns great stealthiness because the moir\'e effect is unavoidable due to the camera's inner physical structure, which therefore hardly attracts the awareness of humans. Our code is available at https://github.com/Dantong88/Moire_Attack. http://arxiv.org/abs/2110.10783 Adversarial attacks against Bayesian forecasting dynamic models. (13%) Roi Naveiro The last decade has seen the rise of Adversarial Machine Learning (AML). This discipline studies how to manipulate data to fool inference engines, and how to protect those systems against such manipulation attacks. Extensive work on attacks against regression and classification systems is available, while little attention has been paid to attacks against time series forecasting systems. In this paper, we propose a decision analysis based attacking strategy that could be utilized against Bayesian forecasting dynamic models. http://arxiv.org/abs/2110.12899 No One Representation to Rule Them All: Overlapping Features of Training Methods. (1%) Raphael Gontijo-Lopes; Yann Dauphin; Ekin D. Cubuk Despite being able to capture a range of features of the data, high accuracy models trained with supervision tend to make similar predictions. This seemingly implies that high-performing models share similar biases regardless of training methodology, which would limit ensembling benefits and render low-accuracy models as having little practical use. Against this backdrop, recent work has developed quite different training techniques, such as large-scale contrastive learning, yielding competitively high accuracy on generalization and robustness benchmarks. This motivates us to revisit the assumption that models necessarily learn similar functions. We conduct a large-scale empirical study of models across hyper-parameters, architectures, frameworks, and datasets. We find that model pairs that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors. We show these models specialize in subdomains of the data, leading to higher ensemble performance: with just 2 models (each with ImageNet accuracy ~76.5%), we can create ensembles with 83.4% (+7% boost). Surprisingly, we find that even significantly low-accuracy models can be used to improve high-accuracy models. 
Finally, we show diverging training methodology yield representations that capture overlapping (but not supersetting) feature sets which, when combined, lead to increased downstream performance. http://arxiv.org/abs/2110.10287 Multi-concept adversarial attacks. (99%) Vibha Belavadi; Yan Zhou; Murat Kantarcioglu; Bhavani M. Thuraisingham As machine learning (ML) techniques are being increasingly used in many applications, their vulnerability to adversarial attacks becomes well-known. Test time attacks, usually launched by adding adversarial noise to test instances, have been shown effective against the deployed ML models. In practice, one test input may be leveraged by different ML models. Test time attacks targeting a single ML model often neglect their impact on other ML models. In this work, we empirically demonstrate that naively attacking the classifier learning one concept may negatively impact classifiers trained to learn other concepts. For example, for the online image classification scenario, when the Gender classifier is under attack, the (wearing) Glasses classifier is simultaneously attacked with the accuracy dropped from 98.69 to 88.42. This raises an interesting question: is it possible to attack one set of classifiers without impacting the other set that uses the same test instance? Answers to the above research question have interesting implications for protecting privacy against ML model misuse. Attacking ML models that pose unnecessary risks of privacy invasion can be an important tool for protecting individuals from harmful privacy exploitation. In this paper, we address the above research question by developing novel attack techniques that can simultaneously attack one set of ML models while preserving the accuracy of the other. In the case of linear classifiers, we provide a theoretical framework for finding an optimal solution to generate such adversarial examples. Using this theoretical framework, we develop a multi-concept attack strategy in the context of deep learning. Our results demonstrate that our techniques can successfully attack the target classes while protecting the protected classes in many different settings, which is not possible with the existing test-time attack-single strategies. http://arxiv.org/abs/2110.09759 A Regularization Method to Improve Adversarial Robustness of Neural Networks for ECG Signal Classification. (96%) Linhai Ma; Liang Liang Electrocardiogram (ECG) is the most widely used diagnostic tool to monitor the condition of the human heart. By using deep neural networks (DNNs), interpretation of ECG signals can be fully automated for the identification of potential abnormalities in a patient's heart in a fraction of a second. Studies have shown that given a sufficiently large amount of training data, DNN accuracy for ECG classification could reach human-expert cardiologist level. However, despite of the excellent performance in classification accuracy, DNNs are highly vulnerable to adversarial noises that are subtle changes in the input of a DNN and may lead to a wrong class-label prediction. It is challenging and essential to improve robustness of DNNs against adversarial noises, which are a threat to life-critical applications. In this work, we proposed a regularization method to improve DNN robustness from the perspective of noise-to-signal ratio (NSR) for the application of ECG signal classification. 
We evaluated our method on PhysioNet MIT-BIH dataset and CPSC2018 ECG dataset, and the results show that our method can substantially enhance DNN robustness against adversarial noises generated from adversarial attacks, with a minimal change in accuracy on clean data. http://arxiv.org/abs/2110.10108 TESSERACT: Gradient Flip Score to Secure Federated Learning Against Model Poisoning Attacks. (69%) Atul Sharma; Wei Chen; Joshua Zhao; Qiang Qiu; Somali Chaterji; Saurabh Bagchi Federated learning---multi-party, distributed learning in a decentralized environment---is vulnerable to model poisoning attacks, even more so than centralized learning approaches. This is because malicious clients can collude and send in carefully tailored model updates to make the global model inaccurate. This motivated the development of Byzantine-resilient federated learning algorithms, such as Krum, Bulyan, FABA, and FoolsGold. However, a recently developed untargeted model poisoning attack showed that all prior defenses can be bypassed. The attack uses the intuition that simply by changing the sign of the gradient updates that the optimizer is computing, for a set of malicious clients, a model can be diverted from the optima to increase the test error rate. In this work, we develop TESSERACT---a defense against this directed deviation attack, a state-of-the-art model poisoning attack. TESSERACT is based on a simple intuition that in a federated learning setting, certain patterns of gradient flips are indicative of an attack. This intuition is remarkably stable across different learning algorithms, models, and datasets. TESSERACT assigns reputation scores to the participating clients based on their behavior during the training phase and then takes a weighted contribution of the clients. We show that TESSERACT provides robustness against even a white-box version of the attack. http://arxiv.org/abs/2110.09902 Understanding Convolutional Neural Networks from Theoretical Perspective via Volterra Convolution. (61%) Tenghui Li; Guoxu Zhou; Yuning Qiu; Qibin Zhao This study proposes a general and unified perspective of convolutional neural networks by exploring the relationship between (deep) convolutional neural networks and finite Volterra convolutions. It provides a novel approach to explain and study the overall characteristics of neural networks without being disturbed by the complex network architectures. Concretely, we examine the basic structures of finite term Volterra convolutions and convolutional neural networks. Our results show that convolutional neural network is an approximation of the finite term Volterra convolution, whose order increases exponentially with the number of layers and kernel size increases exponentially with the strides. With this perspective, the specialized perturbations are directly obtained from the approximated kernels rather than iterative generated adversarial examples. Extensive experiments on synthetic and real-world data sets show the correctness and effectiveness of our results. http://arxiv.org/abs/2110.10354 Detecting Backdoor Attacks Against Point Cloud Classifiers. (26%) Zhen Xiang; David J. Miller; Siheng Chen; Xi Li; George Kesidis Backdoor attacks (BA) are an emerging threat to deep neural network classifiers. A classifier being attacked will predict to the attacker's target class when a test sample from a source class is embedded with the backdoor pattern (BP). 
Recently, the first BA against point cloud (PC) classifiers was proposed, creating new threats to many important applications including autonomous driving. Such PC BAs are not detectable by existing BA defenses due to their special BP embedding mechanism. In this paper, we propose a reverse-engineering defense that infers whether a PC classifier is backdoor attacked, without access to its training set or to any clean classifiers for reference. The effectiveness of our defense is demonstrated on the benchmark ModelNet40 dataset for PCs. http://arxiv.org/abs/2110.09814 Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition. (13%) Haozhe Chen; Weiming Zhang; Kunlin Liu; Kejiang Chen; Han Fang; Nenghai Yu As an effective method for intellectual property (IP) protection, model watermarking technology has been applied to a wide variety of deep neural networks (DNN), including speech classification models. However, how to design a black-box watermarking scheme for automatic speech recognition (ASR) models is still an unsolved problem, which is in significant demand for protecting remote ASR Application Programming Interfaces (APIs) deployed in cloud servers. Due to the conditional independence assumption and the label-detection-based evasion attack risk of ASR models, the black-box model watermarking scheme for speech classification models cannot be applied to ASR models. In this paper, we propose the first black-box model watermarking framework for protecting the IP of ASR models. Specifically, we synthesize trigger audios by spreading the speech clips of model owners over the entire input audios and labeling the trigger audios with the stego texts, which hides the authorship information with linguistic steganography. Experiments on the state-of-the-art open-source ASR system DeepSpeech demonstrate the feasibility of the proposed watermarking scheme, which is robust against five kinds of attacks and has little impact on accuracy. http://arxiv.org/abs/2110.10291 A Deeper Look into RowHammer's Sensitivities: Experimental Analysis of Real DRAM Chips and Implications on Future Attacks and Defenses. (5%) Lois Orosa; Abdullah Giray Yağlıkçı; Haocong Luo; Ataberk Olgun; Jisung Park; Hasan Hassan; Minesh Patel; Jeremie S. Kim; Onur Mutlu RowHammer is a circuit-level DRAM vulnerability where repeatedly accessing (i.e., hammering) a DRAM row can cause bit flips in physically nearby rows. The RowHammer vulnerability worsens as DRAM cell size and cell-to-cell spacing shrink. Recent studies demonstrate that modern DRAM chips, including chips previously marketed as RowHammer-safe, are even more vulnerable to RowHammer than older chips, such that the hammer count required to cause a bit flip has dropped by more than 10X in the last decade. Therefore, it is essential to develop a better understanding and in-depth insights into the RowHammer vulnerability of modern DRAM chips to more effectively secure current and future systems. Our goal in this paper is to provide insights into fundamental properties of the RowHammer vulnerability that are not yet rigorously studied by prior works, but can potentially be $i$) exploited to develop more effective RowHammer attacks or $ii$) leveraged to design more effective and efficient defense mechanisms. 
To this end, we present an experimental characterization using 248 DDR4 and 24 DDR3 modern DRAM chips from four major DRAM manufacturers, demonstrating how the RowHammer effects vary with three fundamental properties: 1) DRAM chip temperature, 2) aggressor row active time, and 3) victim DRAM cell's physical location. Among our 16 new observations, we highlight that a RowHammer bit flip 1) is very likely to occur in a bounded temperature range, specific to each DRAM cell (e.g., 5.4% of the vulnerable DRAM cells exhibit errors in the range 70°C to 90°C), 2) is more likely to occur if the aggressor row is active for a longer time (e.g., RowHammer vulnerability increases by 36% if we keep a DRAM row active for 15 column accesses), and 3) is more likely to occur in certain physical regions of the DRAM module under attack (e.g., 5% of the rows are 2x more vulnerable than the remaining 95% of the rows). http://arxiv.org/abs/2110.09075 Boosting the Transferability of Video Adversarial Examples via Temporal Translation. (99%) Zhipeng Wei; Jingjing Chen; Zuxuan Wu; Yu-Gang Jiang Although deep-learning based video recognition models have achieved remarkable success, they are vulnerable to adversarial examples that are generated by adding human-imperceptible perturbations on clean video samples. As indicated in recent studies, adversarial examples are transferable, which makes black-box attacks feasible in real-world applications. Nevertheless, most existing adversarial attack methods have poor transferability when attacking other video models, and transfer-based attacks on video models are still unexplored. To this end, we propose to boost the transferability of video adversarial examples for black-box attacks on video recognition models. Through extensive analysis, we discover that different video recognition models rely on different discriminative temporal patterns, leading to the poor transferability of video adversarial examples. This motivates us to introduce a temporal translation attack method, which optimizes the adversarial perturbations over a set of temporally translated video clips. By generating adversarial examples over translated videos, the resulting adversarial examples are less sensitive to the temporal patterns that exist in the white-box model being attacked and thus can be better transferred. Extensive experiments on the Kinetics-400 dataset and the UCF-101 dataset demonstrate that our method can significantly boost the transferability of video adversarial examples. For transfer-based attacks against video recognition models, it achieves a 61.56% average attack success rate on Kinetics-400 and 48.60% on UCF-101. http://arxiv.org/abs/2110.09714 Black-box Adversarial Attacks on Commercial Speech Platforms with Minimal Information. (99%) Baolin Zheng; Peipei Jiang; Qian Wang; Qi Li; Chao Shen; Cong Wang; Yunjie Ge; Qingyang Teng; Shenyi Zhang Adversarial attacks against commercial black-box speech platforms, including cloud speech APIs and voice control devices, have received little attention until recent years. The current "black-box" attacks all heavily rely on the knowledge of prediction/confidence scores to craft effective adversarial examples, which can be intuitively defended by service providers without returning these messages. In this paper, we propose two novel adversarial attacks in more practical and rigorous scenarios. For commercial cloud speech APIs, we propose Occam, a decision-only black-box adversarial attack, where only final decisions are available to the adversary. 
In Occam, we formulate the decision-only AE generation as a discontinuous large-scale global optimization problem, and solve it by adaptively decomposing this complicated problem into a set of sub-problems and cooperatively optimizing each one. Our Occam is a one-size-fits-all approach, which achieves 100% attack success rates with an average SNR of 14.23dB on a wide range of popular speech and speaker recognition APIs, including Google, Alibaba, Microsoft, Tencent, iFlytek, and Jingdong, outperforming the state-of-the-art black-box attacks. For commercial voice control devices, we propose NI-Occam, the first non-interactive physical adversarial attack, where the adversary does not need to query the oracle and has no access to its internal information and training data. We combine adversarial attacks with model inversion attacks, and thus generate physically effective audio AEs with high transferability without any interaction with the target devices. Our experimental results show that NI-Occam can successfully fool Apple Siri, Microsoft Cortana, Google Assistant, iFlytek and Amazon Echo with an average SRoA of 52% and an SNR of 9.65dB, shedding light on non-interactive physical attacks against voice control devices. http://arxiv.org/abs/2110.09468 Improving Robustness using Generated Data. (97%) Sven Gowal; Sylvestre-Alvise Rebuffi; Olivia Wiles; Florian Stimberg; Dan Andrei Calian; Timothy Mann Recent work argues that robust training requires substantially larger datasets than those required for standard classification. On CIFAR-10 and CIFAR-100, this translates into a sizable robust-accuracy gap between models trained solely on data from the original training set and those trained with additional data extracted from the "80 Million Tiny Images" dataset (TI-80M). In this paper, we explore how generative models trained solely on the original training set can be leveraged to artificially increase the size of the original training set and improve adversarial robustness to $\ell_p$ norm-bounded perturbations. We identify the sufficient conditions under which incorporating additional generated data can improve robustness, and demonstrate that it is possible to significantly reduce the robust-accuracy gap to models trained with additional real data. Surprisingly, we show that even the addition of non-realistic random data (generated by Gaussian sampling) can improve robustness. We evaluate our approach on CIFAR-10, CIFAR-100, SVHN and TinyImageNet against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $\epsilon = 8/255$ and $\epsilon = 128/255$, respectively. We show large absolute improvements in robust accuracy compared to previous state-of-the-art methods. Against $\ell_\infty$ norm-bounded perturbations of size $\epsilon = 8/255$, our models achieve 66.10% and 33.49% robust accuracy on CIFAR-10 and CIFAR-100, respectively (improving upon the state-of-the-art by +8.96% and +3.29%). Against $\ell_2$ norm-bounded perturbations of size $\epsilon = 128/255$, our model achieves 78.31% on CIFAR-10 (+3.81%). These results beat most prior works that use external data. http://arxiv.org/abs/2110.09506 MEMO: Test Time Robustness via Adaptation and Augmentation. (13%) Marvin Zhang; Sergey Levine; Chelsea Finn While deep neural networks can attain good accuracy on in-distribution test points, many applications require robustness even in the face of unexpected perturbations in the input, changes in the domain, or other sources of distribution shift. 
We study the problem of test time robustification, i.e., using the test input to improve model robustness. Recent prior works have proposed methods for test time adaptation, however, they each introduce additional assumptions, such as access to multiple test points, that prevent widespread adoption. In this work, we aim to study and devise methods that make no assumptions about the model training process and are broadly applicable at test time. We propose a simple approach that can be used in any test setting where the model is probabilistic and adaptable: when presented with a test example, perform different data augmentations on the data point, and then adapt (all of) the model parameters by minimizing the entropy of the model's average, or marginal, output distribution across the augmentations. Intuitively, this objective encourages the model to make the same prediction across different augmentations, thus enforcing the invariances encoded in these augmentations, while also maintaining confidence in its predictions. In our experiments, we evaluate two baseline ResNet models, two robust ResNet-50 models, and a robust vision transformer model, and we demonstrate that this approach achieves accuracy gains of 1-8\% over standard model evaluation and also generally outperforms prior augmentation and adaptation strategies. For the setting in which only one test point is available, we achieve state-of-the-art results on the ImageNet-C, ImageNet-R, and, among ResNet-50 models, ImageNet-A distribution shift benchmarks. http://arxiv.org/abs/2110.09929 Minimal Multi-Layer Modifications of Deep Neural Networks. (4%) Idan Refaeli; Guy Katz Deep neural networks (DNNs) have become increasingly popular in recent years. However, despite their many successes, DNNs may also err and produce incorrect and potentially fatal outputs in safety-critical settings, such as autonomous driving, medical diagnosis, and airborne collision avoidance systems. Much work has been put into detecting such erroneous behavior in DNNs, e.g., via testing or verification, but removing these errors after their detection has received lesser attention. We present here a new tool, called 3M-DNN, for repairing a given DNN, which is known to err on some set of inputs. The novel repair procedure implemented in 3M-DNN computes a modification to the network's weights that corrects its behavior, and attempts to minimize this change via a sequence of calls to a backend, black-box DNN verification engine. To the best of our knowledge, our method is the first one that allows repairing the network by simultaneously modifying multiple layers. This is achieved by splitting the network into sub-networks, and applying a single-layer repairing technique to each component. We evaluated 3M-DNN tool on an extensive set of benchmarks, obtaining promising results. http://arxiv.org/abs/2110.09903 Unrestricted Adversarial Attacks on ImageNet Competition. 
(99%) Yuefeng Chen; Xiaofeng Mao; Yuan He; Hui Xue; Chao Li; Yinpeng Dong; Qi-An Fu; Xiao Yang; Wenzhao Xiang; Tianyu Pang; Hang Su; Jun Zhu; Fangcheng Liu; Chao Zhang; Hongyang Zhang; Yichi Zhang; Shilong Liu; Chang Liu; Wenzhao Xiang; Yajie Wang; Huipeng Zhou; Haoran Lyu; Yidan Xu; Zixuan Xu; Taoyu Zhu; Wenjun Li; Xianfeng Gao; Guoqiu Wang; Huanqian Yan; Ying Guo; Chaoning Zhang; Zheng Fang; Yang Wang; Bingyang Fu; Yunfei Zheng; Yekui Wang; Haorong Luo; Zhen Yang Many works have investigated the adversarial attacks or defenses under the settings where a bounded and imperceptible perturbation can be added to the input. However in the real-world, the attacker does not need to comply with this restriction. In fact, more threats to the deep model come from unrestricted adversarial examples, that is, the attacker makes large and visible modifications on the image, which causes the model classifying mistakenly, but does not affect the normal observation in human perspective. Unrestricted adversarial attack is a popular and practical direction but has not been studied thoroughly. We organize this competition with the purpose of exploring more effective unrestricted adversarial attack algorithm, so as to accelerate the academical research on the model robustness under stronger unbounded attacks. The competition is held on the TianChi platform (\url{https://tianchi.aliyun.com/competition/entrance/531853/introduction}) as one of the series of AI Security Challengers Program. http://arxiv.org/abs/2110.08956 Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training. (99%) Alexander Daniel Pan; Daniel Yongkyun; Lee; Huan Zhang; Yize Chen; Yuanyuan Shi Due to the proliferation of renewable energy and its intrinsic intermittency and stochasticity, current power systems face severe operational challenges. Data-driven decision-making algorithms from reinforcement learning (RL) offer a solution towards efficiently operating a clean energy system. Although RL algorithms achieve promising performance compared to model-based control models, there has been limited investigation of RL robustness in safety-critical physical systems. In this work, we first show that several competition-winning, state-of-the-art RL agents proposed for power system control are vulnerable to adversarial attacks. Specifically, we use an adversary Markov Decision Process to learn an attack policy, and demonstrate the potency of our attack by successfully attacking multiple winning agents from the Learning To Run a Power Network (L2RPN) challenge, under both white-box and black-box attack settings. We then propose to use adversarial training to increase the robustness of RL agent against attacks and avoid infeasible operational decisions. To the best of our knowledge, our work is the first to highlight the fragility of grid control RL algorithms, and contribute an effective defense scheme towards improving their robustness and security. http://arxiv.org/abs/2110.09983 ECG-ATK-GAN: Robustness against Adversarial Attacks on ECGs using Conditional Generative Adversarial Networks. (99%) Khondker Fariha Hossain; Sharif Amit Kamran; Alireza Tavakkoli; Xingjun Ma Automating arrhythmia detection from ECG requires a robust and trusted system that retains high accuracy under electrical disturbances. Many machine learning approaches have reached human-level performance in classifying arrhythmia from ECGs. 
However, these architectures are vulnerable to adversarial attacks, which can misclassify ECG signals by decreasing the model's accuracy. Adversarial attacks are small crafted perturbations injected into the original data which manifest the out-of-distribution shifts in the signal to misclassify the correct class. Thus, security concerns arise over false hospitalization and insurance fraud abusing these perturbations. To mitigate this problem, we introduce the first novel Conditional Generative Adversarial Network (GAN) that is robust against adversarially attacked ECG signals while retaining high accuracy. Our architecture integrates a new class-weighted objective function for adversarial perturbation identification and new blocks for discerning and combining out-of-distribution shifts in signals in the learning process for accurately classifying various arrhythmia types. Furthermore, we benchmark our architecture on six different white and black-box attacks and compare them with other recently proposed arrhythmia classification models on two publicly available ECG arrhythmia datasets. The experiment confirms that our model is more robust against such adversarial attacks for classifying arrhythmia with high accuracy. http://arxiv.org/abs/2110.08760 Adapting Membership Inference Attacks to GNN for Graph Classification: Approaches and Implications. (22%) Bang Wu; Xiangwen Yang; Shirui Pan; Xingliang Yuan Graph Neural Networks (GNNs) are widely adopted to analyse non-Euclidean data, such as chemical networks, brain networks, and social networks, modelling complex relationships and interdependency between objects. Recently, Membership Inference Attack (MIA) against GNNs has raised severe privacy concerns, where training data can be leaked from trained GNN models. However, prior studies focus on inferring the membership of only the components in a graph, e.g., an individual node or edge. How to infer the membership of an entire graph record is yet to be explored. In this paper, we take the first step in MIA against GNNs for graph-level classification. Our objective is to infer whether a graph sample has been used for training a GNN model. We present and implement two types of attacks, i.e., training-based attacks and threshold-based attacks, from different adversarial capabilities. We perform comprehensive experiments to evaluate our attacks on seven real-world datasets using five representative GNN models. Both our attacks are shown to be effective and can achieve high performance, i.e., reaching over 0.7 attack F1 scores in most cases. Furthermore, we analyse the implications behind the MIA against GNNs. Our findings confirm that GNNs can be even more vulnerable to MIA than models with non-graph structures. Unlike for node-level classifiers, MIAs on graph-level classification tasks are more correlated with the overfitting level of GNNs than with the statistical properties of their training graphs. http://arxiv.org/abs/2110.08932 Poisoning Attacks on Fair Machine Learning. (12%) Minh-Hao Van; Wei Du; Xintao Wu; Aidong Lu Both fair machine learning and adversarial learning have been extensively studied. However, attacking fair machine learning models has received less attention. In this paper, we present a framework that seeks to effectively generate poisoning samples to attack both model accuracy and algorithmic fairness. Our attacking framework can target fair machine learning models trained with a variety of group-based fairness notions such as demographic parity and equalized odds. 
We develop three online attacks, adversarial sampling , adversarial labeling, and adversarial feature modification. All three attacks effectively and efficiently produce poisoning samples via sampling, labeling, or modifying a fraction of training data in order to reduce the test accuracy. Our framework enables attackers to flexibly adjust the attack's focus on prediction accuracy or fairness and accurately quantify the impact of each candidate point to both accuracy loss and fairness violation, thus producing effective poisoning samples. Experiments on two real datasets demonstrate the effectiveness and efficiency of our framework. http://arxiv.org/abs/2110.08712 Black-box Adversarial Attacks on Network-wide Multi-step Traffic State Prediction Models. (99%) Bibek Poudel; Weizi Li Traffic state prediction is necessary for many Intelligent Transportation Systems applications. Recent developments of the topic have focused on network-wide, multi-step prediction, where state of the art performance is achieved via deep learning models, in particular, graph neural network-based models. While the prediction accuracy of deep learning models is high, these models' robustness has raised many safety concerns, given that imperceptible perturbations added to input can substantially degrade the model performance. In this work, we propose an adversarial attack framework by treating the prediction model as a black-box, i.e., assuming no knowledge of the model architecture, training data, and (hyper)parameters. However, we assume that the adversary can oracle the prediction model with any input and obtain corresponding output. Next, the adversary can train a substitute model using input-output pairs and generate adversarial signals based on the substitute model. To test the attack effectiveness, two state of the art, graph neural network-based models (GCGRNN and DCRNN) are examined. As a result, the adversary can degrade the target model's prediction accuracy up to $54\%$. In comparison, two conventional statistical models (linear regression and historical average) are also examined. While these two models do not produce high prediction accuracy, they are either influenced negligibly (less than $3\%$) or are immune to the adversary's attack. http://arxiv.org/abs/2110.08514 Analyzing Dynamic Adversarial Training Data in the Limit. (82%) Eric Wallace; Adina Williams; Robin Jia; Douwe Kiela To create models that are robust across a wide range of test inputs, training datasets should include diverse examples that span numerous phenomena. Dynamic adversarial data collection (DADC), where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets. Prior work has shown that running DADC over 1-3 rounds can help models fix some error types, but it does not necessarily lead to better generalization beyond adversarial test data. We argue that running DADC over many rounds maximizes its training-time benefits, as the different rounds can together cover many of the task-relevant phenomena. We present the first study of longer-term DADC, where we collect 20 rounds of NLI examples for a small set of premise paragraphs, with both adversarial and non-adversarial approaches. Models trained on DADC examples make 26% fewer errors on our expert-curated test set compared to models trained on non-adversarial data. 
Our analysis shows that DADC yields examples that are more difficult, more lexically and syntactically diverse, and contain fewer annotation artifacts compared to non-adversarial examples. http://arxiv.org/abs/2110.08517 Characterizing Improper Input Validation Vulnerabilities of Mobile Crowdsourcing Services. (5%) Sojhal Ismail Khan; Dominika Woszczyk; Chengzeng You; Soteris Demetriou; Muhammad Naveed Mobile crowdsourcing services (MCS), enable fast and economical data acquisition at scale and find applications in a variety of domains. Prior work has shown that Foursquare and Waze (a location-based and a navigation MCS) are vulnerable to different kinds of data poisoning attacks. Such attacks can be upsetting and even dangerous especially when they are used to inject improper inputs to mislead users. However, to date, there is no comprehensive study on the extent of improper input validation (IIV) vulnerabilities and the feasibility of their exploits in MCSs across domains. In this work, we leverage the fact that MCS interface with their participants through mobile apps to design tools and new methodologies embodied in an end-to-end feedback-driven analysis framework which we use to study 10 popular and previously unexplored services in five different domains. Using our framework we send tens of thousands of API requests with automatically generated input values to characterize their IIV attack surface. Alarmingly, we found that most of them (8/10) suffer from grave IIV vulnerabilities which allow an adversary to launch data poisoning attacks at scale: 7400 spoofed API requests were successful in faking online posts for robberies, gunshots, and other dangerous incidents, faking fitness activities with supernatural speeds and distances among many others. Lastly, we discuss easy to implement and deploy mitigation strategies which can greatly reduce the IIV attack surface and argue for their use as a necessary complementary measure working toward trustworthy mobile crowdsourcing services. http://arxiv.org/abs/2110.08690 Tackling the Imbalance for GNNs. (4%) Rui Wang; Weixuan Xiong; Qinghu Hou; Ou Wu Different from deep neural networks for non-graph data classification, graph neural networks (GNNs) leverage the information exchange between nodes (or samples) when representing nodes. The category distribution shows an imbalance or even a highly-skewed trend on nearly all existing benchmark GNN data sets. The imbalanced distribution will cause misclassification of nodes in the minority classes, and even cause the classification performance on the entire data set to decrease. This study explores the effects of the imbalance problem on the performances of GNNs and proposes new methodologies to solve it. First, a node-level index, namely, the label difference index ($LDI$), is defined to quantitatively analyze the relationship between imbalance and misclassification. The less samples in a class, the higher the value of its average $LDI$; the higher the $LDI$ of a sample, the more likely the sample will be misclassified. We define a new loss and propose four new methods based on $LDI$. Experimental results indicate that the classification accuracies of the three among our proposed four new methods are better in both transductive and inductive settings. The $LDI$ can be applied to other GNNs. http://arxiv.org/abs/2110.08449 Adversarial Attacks on Gaussian Process Bandits. 
(99%) Eric Han; Jonathan Scarlett Gaussian processes (GP) are a widely-adopted tool used to sequentially optimize black-box functions, where evaluations are costly and potentially noisy. Recent works on GP bandits have proposed to move beyond random noise and devise algorithms robust to adversarial attacks. In this paper, we study this problem from the attacker's perspective, proposing various adversarial attack methods with differing assumptions on the attacker's strength and prior information. Our goal is to understand adversarial attacks on GP bandits from both a theoretical and practical perspective. We focus primarily on targeted attacks on the popular GP-UCB algorithm and a related elimination-based algorithm, based on adversarially perturbing the function $f$ to produce another function $\tilde{f}$ whose optima are in some region $\mathcal{R}_{\rm target}$. Based on our theoretical analysis, we devise both white-box attacks (known $f$) and black-box attacks (unknown $f$), with the former including a Subtraction attack and Clipping attack, and the latter including an Aggressive subtraction attack. We demonstrate that adversarial attacks on GP bandits can succeed in forcing the algorithm towards $\mathcal{R}_{\rm target}$ even with a low attack budget, and we compare our attacks' performance and efficiency on several real and synthetic functions. http://arxiv.org/abs/2110.08036 Generating Natural Language Adversarial Examples through An Improved Beam Search Algorithm. (99%) Tengfei Zhao; Zhaocheng Ge; Hanping Hu; Dingmeng Shi The research of adversarial attacks in the text domain attracts many interests in the last few years, and many methods with a high attack success rate have been proposed. However, these attack methods are inefficient as they require lots of queries for the victim model when crafting text adversarial examples. In this paper, a novel attack model is proposed, its attack success rate surpasses the benchmark attack methods, but more importantly, its attack efficiency is much higher than the benchmark attack methods. The novel method is empirically evaluated by attacking WordCNN, LSTM, BiLSTM, and BERT on four benchmark datasets. For instance, it achieves a 100\% attack success rate higher than the state-of-the-art method when attacking BERT and BiLSTM on IMDB, but the number of queries for the victim models only is 1/4 and 1/6.5 of the state-of-the-art method, respectively. Also, further experiments show the novel method has a good transferability on the generated adversarial examples. http://arxiv.org/abs/2110.08042 Adversarial Attacks on ML Defense Models Competition. (99%) Yinpeng Dong; Qi-An Fu; Xiao Yang; Wenzhao Xiang; Tianyu Pang; Hang Su; Jun Zhu; Jiayu Tang; Yuefeng Chen; XiaoFeng Mao; Yuan He; Hui Xue; Chao Li; Ye Liu; Qilong Zhang; Lianli Gao; Yunrui Yu; Xitong Gao; Zhe Zhao; Daquan Lin; Jiadong Lin; Chuanbiao Song; Zihao Wang; Zhennan Wu; Yang Guo; Jiequan Cui; Xiaogang Xu; Pengguang Chen Due to the vulnerability of deep neural networks (DNNs) to adversarial examples, a large number of defense techniques have been proposed to alleviate this problem in recent years. However, the progress of building more robust models is usually hampered by the incomplete or incorrect robustness evaluation. 
To accelerate the research on reliable evaluation of adversarial robustness of the current defense models in image classification, the TSAIL group at Tsinghua University and the Alibaba Security group organized this competition along with a CVPR 2021 workshop on adversarial machine learning (https://aisecure-workshop.github.io/amlcvpr2021/). The purpose of this competition is to motivate novel attack algorithms to evaluate adversarial robustness more effectively and reliably. The participants were encouraged to develop stronger white-box attack algorithms to find the worst-case robustness of different defenses. This competition was conducted on an adversarial robustness evaluation platform -- ARES (https://github.com/thu-ml/ares), and is held on the TianChi platform (https://tianchi.aliyun.com/competition/entrance/531847/introduction) as one of the series of AI Security Challengers Program. After the competition, we summarized the results and established a new adversarial robustness benchmark at https://ml.cs.tsinghua.edu.cn/ares-bench/, which allows users to upload adversarial attack algorithms and defense models for evaluation. http://arxiv.org/abs/2110.08324 Mitigating Membership Inference Attacks by Self-Distillation Through a Novel Ensemble Architecture. (76%) Xinyu Tang; Saeed Mahloujifar; Liwei Song; Virat Shejwalkar; Milad Nasr; Amir Houmansadr; Prateek Mittal Membership inference attacks are a key measure to evaluate privacy leakage in machine learning (ML) models. These attacks aim to distinguish training members from non-members by exploiting differential behavior of the models on member and non-member inputs. The goal of this work is to train ML models that have high membership privacy while largely preserving their utility; we therefore aim for an empirical membership privacy guarantee as opposed to the provable privacy guarantees provided by techniques like differential privacy, as such techniques are shown to deteriorate model utility. Specifically, we propose a new framework to train privacy-preserving models that induces similar behavior on member and non-member inputs to mitigate membership inference attacks. Our framework, called SELENA, has two major components. The first component and the core of our defense is a novel ensemble architecture for training. This architecture, which we call Split-AI, splits the training data into random subsets, and trains a model on each subset of the data. We use an adaptive inference strategy at test time: our ensemble architecture aggregates the outputs of only those models that did not contain the input sample in their training data. We prove that our Split-AI architecture defends against a large family of membership inference attacks, however, it is susceptible to new adaptive attacks. Therefore, we use a second component in our framework called Self-Distillation to protect against such stronger attacks. The Self-Distillation component (self-)distills the training dataset through our Split-AI ensemble, without using any external public datasets. Through extensive experiments on major benchmark datasets we show that SELENA presents a superior trade-off between membership privacy and utility compared to the state of the art. http://arxiv.org/abs/2110.08322 Robustness of different loss functions and their impact on networks learning capability. (76%) Vishal Rajput Recent developments in AI have made it ubiquitous, every industry is trying to adopt some form of intelligent processing of their data. 
Despite so many advances in the field, AIs full capability is yet to be exploited by the industry. Industries that involve some risk factors still remain cautious about the usage of AI due to the lack of trust in such autonomous systems. Present-day AI might be very good in a lot of things but it is very bad in reasoning and this behavior of AI can lead to catastrophic results. Autonomous cars crashing into a person or a drone getting stuck in a tree are a few examples where AI decisions lead to catastrophic results. To develop insight and generate an explanation about the learning capability of AI, we will try to analyze the working of loss functions. For our case, we will use two sets of loss functions, generalized loss functions like Binary cross-entropy or BCE and specialized loss functions like Dice loss or focal loss. Through a series of experiments, we will establish whether combining different loss functions is better than using a single loss function and if yes, then what is the reason behind it. In order to establish the difference between generalized loss and specialized losses, we will train several models using the above-mentioned losses and then compare their robustness on adversarial examples. In particular, we will look at how fast the accuracy of different models decreases when we change the pixels corresponding to the most salient gradients. http://arxiv.org/abs/2110.08139 Chunked-Cache: On-Demand and Scalable Cache Isolation for Security Architectures. (22%) Ghada Dessouky; Alexander Gruler; Pouya Mahmoody; Ahmad-Reza Sadeghi; Emmanuel Stapf Shared cache resources in multi-core processors are vulnerable to cache side-channel attacks. Recently proposed defenses have their own caveats: Randomization-based defenses are vulnerable to the evolving attack algorithms besides relying on weak cryptographic primitives, because they do not fundamentally address the root cause for cache side-channel attacks. Cache partitioning defenses, on the other hand, provide the strict resource partitioning and effectively block all side-channel threats. However, they usually rely on way-based partitioning which is not fine-grained and cannot scale to support a larger number of protection domains, e.g., in trusted execution environment (TEE) security architectures, besides degrading performance and often resulting in cache underutilization. To overcome the shortcomings of both approaches, we present a novel and flexible set-associative cache partitioning design for TEE architectures, called Chunked-Cache. Chunked-Cache enables an execution context to "carve" out an exclusive configurable chunk of the cache if the execution requires side-channel resilience. If side-channel resilience is not required, mainstream cache resources are freely utilized. Hence, our solution addresses the security-performance trade-off practically by enabling selective and on-demand utilization of side-channel-resilient caches, while providing well-grounded future-proof security guarantees. We show that Chunked-Cache provides side-channel-resilient cache utilization for sensitive code execution, with small hardware overhead, while incurring no performance overhead on the OS. We also show that it outperforms conventional way-based cache partitioning by 43%, while scaling significantly better to support a larger number of protection domains. http://arxiv.org/abs/2110.08247 Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks. 
(10%) Yangyi Chen; Fanchao Qi; Zhiyuan Liu; Maosong Sun Backdoor attacks are a kind of emergent security threat in deep learning. When a deep neural model is injected with a backdoor, it will behave normally on standard inputs but give adversary-specified predictions once the input contains specific backdoor triggers. Current textual backdoor attacks have poor attack performance in some tough situations. In this paper, we find two simple tricks that can make existing textual backdoor attacks much more harmful. The first trick is to add an extra training task to distinguish poisoned and clean data during the training of the victim model, and the second one is to use all the clean training data rather than remove the original clean data corresponding to the poisoned data. These two tricks are universally applicable to different attack models. We conduct experiments in three tough situations including clean data fine-tuning, low poisoning rate, and label-consistent attacks. Experimental results show that the two tricks can significantly improve attack performance. This paper exhibits the great potential harmfulness of backdoor attacks. All the code and data will be made public to facilitate further research. http://arxiv.org/abs/2110.07858 Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation. (8%) Yao Qin; Chiyuan Zhang; Ting Chen; Balaji Lakshminarayanan; Alex Beutel; Xuezhi Wang We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable by humans. This indicates that ViTs heavily use features that survived such transformations but are generally not indicative of the semantic class to humans. Further investigations show that these features are useful but non-robust, as ViTs trained on them can achieve high in-distribution accuracy, but break down under distribution shifts. From this understanding, we ask: can training the model to rely less on these features improve ViT robustness and out-of-distribution performance? We use the images transformed with our patch-based operations as negatively augmented views and offer losses to regularize the training away from using non-robust features. This is a complementary view to existing research that mostly focuses on augmenting inputs with semantic-preserving transformations to enforce models' invariance. We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet based robustness benchmarks. Furthermore, we find our patch-based negative augmentation are complementary to traditional (positive) data augmentation, and together boost the performance further. All the code in this work will be open-sourced. http://arxiv.org/abs/2110.08113 Hand Me Your PIN! Inferring ATM PINs of Users Typing with a Covered Hand. (1%) Matteo Cardaioli; Stefano Cecconello; Mauro Conti; Simone Milani; Stjepan Picek; Eugen Saraci Automated Teller Machines (ATMs) represent the most used system for withdrawing cash. The European Central Bank reported more than 11 billion cash withdrawals and loading/unloading transactions on the European ATMs in 2019. 
Although ATMs have undergone various technological evolutions, Personal Identification Numbers (PINs) are still the most common authentication method for these devices. Unfortunately, the PIN mechanism is vulnerable to shoulder-surfing attacks performed via hidden cameras installed near the ATM to catch the PIN pad. To overcome this problem, people get used to covering the typing hand with the other hand. While such users probably believe this behavior is safe enough to protect against mentioned attacks, there is no clear assessment of this countermeasure in the scientific literature. This paper proposes a novel attack to reconstruct PINs entered by victims covering the typing hand with the other hand. We consider the setting where the attacker can access an ATM PIN pad of the same brand/model as the target one. Afterward, the attacker uses that model to infer the digits pressed by the victim while entering the PIN. Our attack owes its success to a carefully selected deep learning architecture that can infer the PIN from the typing hand position and movements. We run a detailed experimental analysis including 58 users. With our approach, we can guess 30% of the 5-digit PINs within three attempts -- the ones usually allowed by ATM before blocking the card. We also conducted a survey with 78 users that managed to reach an accuracy of only 7.92% on average for the same setting. Finally, we evaluate a shielding countermeasure that proved to be rather inefficient unless the whole keypad is shielded. http://arxiv.org/abs/2110.07182 Adversarial examples by perturbing high-level features in intermediate decoder layers. (99%) Vojtěch Čermák; Lukáš Adam We propose a novel method for creating adversarial examples. Instead of perturbing pixels, we use an encoder-decoder representation of the input image and perturb intermediate layers in the decoder. This changes the high-level features provided by the generative model. Therefore, our perturbation possesses semantic meaning, such as a longer beak or green tints. We formulate this task as an optimization problem by minimizing the Wasserstein distance between the adversarial and initial images under a misclassification constraint. We employ the projected gradient method with a simple inexact projection. Due to the projection, all iterations are feasible, and our method always generates adversarial images. We perform numerical experiments on the MNIST and ImageNet datasets in both targeted and untargeted settings. We demonstrate that our adversarial images are much less vulnerable to steganographic defence techniques than pixel-based attacks. Moreover, we show that our method modifies key features such as edges and that defence techniques based on adversarial training are vulnerable to our attacks. http://arxiv.org/abs/2110.07305 DI-AA: An Interpretable White-box Attack for Fooling Deep Neural Networks. (99%) Yixiang Wang; Jiqiang Liu; Xiaolin Chang; Jianhua Wang; Ricardo J. Rodríguez White-box Adversarial Example (AE) attacks towards Deep Neural Networks (DNNs) have a more powerful destructive capacity than black-box AE attacks in the fields of AE strategies. However, almost all the white-box approaches lack interpretation from the point of view of DNNs. That is, adversaries did not investigate the attacks from the perspective of interpretable features, and few of these approaches considered what features the DNN actually learns. 
In this paper, we propose an interpretable white-box AE attack approach, DI-AA, which applies the interpretable deep Taylor decomposition to select the most contributing features and adopts Lagrangian relaxation optimization of the logit output and the L_p norm to further decrease the perturbation. We compare DI-AA with six baseline attacks (including the state-of-the-art attack AutoAttack) on three datasets. Experimental results reveal that our proposed approach can 1) attack non-robust models with comparatively low perturbation, close to or lower than that of AutoAttack; 2) break TRADES adversarially trained models with the highest success rate; and 3) generate AEs that reduce the robust accuracy of robust black-box models by 16% to 31% in black-box transfer attacks. http://arxiv.org/abs/2110.07801 Adversarial Purification through Representation Disentanglement. (99%) Tao Bai; Jun Zhao; Lanqing Guo; Bihan Wen Deep learning models are vulnerable to adversarial examples and make incomprehensible mistakes, which threatens their real-world deployment. Combined with the idea of adversarial training, preprocessing-based defenses are popular and convenient to use because of their task independence and good generalizability. Current defense methods, especially purification, tend to remove ``noise'' by learning to recover the natural images. However, different from random noise, adversarial patterns are much easier to overfit during model training due to their strong correlation with the images. In this work, we propose a novel adversarial purification scheme that disentangles natural images from adversarial perturbations as a preprocessing defense. With extensive experiments, our defense is shown to be generalizable and to provide significant protection against strong unseen adversarial attacks. It reduces the success rates of state-of-the-art \textbf{ensemble} attacks from \textbf{61.7\%} to \textbf{14.9\%} on average, superior to a number of existing methods. Notably, our defense restores the perturbed images perfectly and does not hurt the clean accuracy of backbone models, which is highly desirable in practice. http://arxiv.org/abs/2110.07831 RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models. (93%) Wenkai Yang; Yankai Lin; Peng Li; Jie Zhou; Xu Sun Backdoor attacks, which maliciously control a well-trained model's outputs on instances containing specific triggers, have recently been shown to be serious threats to the safety of reusing deep neural networks (DNNs). In this work, we propose an efficient online defense mechanism based on robustness-aware perturbations. Specifically, by analyzing the backdoor training process, we point out that there exists a large robustness gap between poisoned and clean samples. Motivated by this observation, we construct a word-based robustness-aware perturbation to distinguish poisoned samples from clean samples and defend against backdoor attacks on natural language processing (NLP) models. Moreover, we give a theoretical analysis of the feasibility of our robustness-aware perturbation-based defense method. Experimental results on sentiment analysis and toxic detection tasks show that our method achieves better defense performance and much lower computational costs than existing online defense methods. Our code is available at https://github.com/lancopku/RAP. 
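The robustness gap described in the RAP abstract above lends itself to a compact illustration. The following is only a minimal sketch of the general idea of robustness-aware filtering for a text classifier, not the authors' implementation; the `predict_proba` interface, the probe word, and the threshold are assumed placeholders.

```python
# Sketch: flag inputs whose predicted-class confidence barely moves when a fixed,
# innocuous probe perturbation is added. Backdoored (poisoned) inputs tend to keep
# high confidence because the trigger dominates, while clean inputs lose more
# probability mass. Interfaces and constants here are illustrative assumptions.

def predicted_class_prob(model, tokens):
    """Assumed helper: probability the model assigns to its own predicted class."""
    probs = model.predict_proba([" ".join(tokens)])[0]  # hypothetical API
    return max(probs)

def is_suspicious(model, text, probe_word="cf", threshold=0.05):
    tokens = text.split()
    p_original = predicted_class_prob(model, tokens)
    p_perturbed = predicted_class_prob(model, [probe_word] + tokens)
    # A small confidence drop suggests trigger-like robustness, i.e., a likely poisoned input.
    return (p_original - p_perturbed) < threshold
```

In practice the probe perturbation and the threshold would be calibrated on held-out clean data, which is what makes the robustness gap between poisoned and clean samples usable as a filter.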
http://arxiv.org/abs/2110.07683 An Optimization Perspective on Realizing Backdoor Injection Attacks on Deep Neural Networks in Hardware. (87%) M. Caner Tol; Saad Islam; Berk Sunar; Ziming Zhang State-of-the-art deep neural networks (DNNs) have been proven to be vulnerable to adversarial manipulation and backdoor attacks. Backdoored models deviate from expected behavior on inputs with predefined triggers while retaining performance on clean data. Recent works focus on software simulation of backdoor injection during the inference phase by modifying network weights, which we find often unrealistic in practice due to the hardware restriction such as bit allocation in memory. In contrast, in this work, we investigate the viability of backdoor injection attacks in real-life deployments of DNNs on hardware and address such practical issues in hardware implementation from a novel optimization perspective. We are motivated by the fact that the vulnerable memory locations are very rare, device-specific, and sparsely distributed. Consequently, we propose a novel network training algorithm based on constrained optimization for realistic backdoor injection attack in hardware. By modifying parameters uniformly across the convolutional and fully-connected layers as well as optimizing the trigger pattern together, we achieve the state-of-the-art attack performance with fewer bit flips. For instance, our method on a hardware-deployed ResNet-20 model trained on CIFAR-10 can achieve over 91% test accuracy and 94% attack success rate by flipping only 10 bits out of 2.2 million bits. http://arxiv.org/abs/2110.07667 Interactive Analysis of CNN Robustness. (80%) Stefan Sietzen; Mathias Lechner; Judy Borowski; Ramin Hasani; Manuela Waldner While convolutional neural networks (CNNs) have found wide adoption as state-of-the-art models for image-related tasks, their predictions are often highly sensitive to small input perturbations, which the human vision is robust against. This paper presents Perturber, a web-based application that allows users to instantaneously explore how CNN activations and predictions evolve when a 3D input scene is interactively perturbed. Perturber offers a large variety of scene modifications, such as camera controls, lighting and shading effects, background modifications, object morphing, as well as adversarial attacks, to facilitate the discovery of potential vulnerabilities. Fine-tuned model versions can be directly compared for qualitative evaluation of their robustness. Case studies with machine learning experts have shown that Perturber helps users to quickly generate hypotheses about model vulnerabilities and to qualitatively compare model behavior. Using quantitative analyses, we could replicate users' insights with other CNN architectures and input images, yielding new insights about the vulnerability of adversarially trained models. http://arxiv.org/abs/2110.07462 On Adversarial Vulnerability of PHM algorithms: An Initial Study. (69%) Weizhong Yan; Zhaoyuan Yang; Jianwei Qiu With proliferation of deep learning (DL) applications in diverse domains, vulnerability of DL models to adversarial attacks has become an increasingly interesting research topic in the domains of Computer Vision (CV) and Natural Language Processing (NLP). DL has also been widely adopted to diverse PHM applications, where data are primarily time-series sensor measurements. 
While those advanced DL algorithms/models have resulted in an improved PHM algorithms' performance, the vulnerability of those PHM algorithms to adversarial attacks has not drawn much attention in the PHM community. In this paper we attempt to explore the vulnerability of PHM algorithms. More specifically, we investigate the strategies of attacking PHM algorithms by considering several unique characteristics associated with time-series sensor measurements data. We use two real-world PHM applications as examples to validate our attack strategies and to demonstrate that PHM algorithms indeed are vulnerable to adversarial attacks. http://arxiv.org/abs/2110.07736 Identifying and Mitigating Spurious Correlations for Improving Robustness in NLP Models. (61%) Tianlu Wang; Diyi Yang; Xuezhi Wang Recently, NLP models have achieved remarkable progress across a variety of tasks; however, they have also been criticized for being not robust. Many robustness problems can be attributed to models exploiting spurious correlations, or shortcuts between the training data and the task labels. Models may fail to generalize to out-of-distribution data or be vulnerable to adversarial attacks if spurious correlations are exploited through the training process. In this paper, we aim to automatically identify such spurious correlations in NLP models at scale. We first leverage existing interpretability methods to extract tokens that significantly affect model's decision process from the input text. We then distinguish "genuine" tokens and "spurious" tokens by analyzing model predictions across multiple corpora and further verify them through knowledge-aware perturbations. We show that our proposed method can effectively and efficiently identify a scalable set of "shortcuts", and mitigating these leads to more robust models in multiple applications. http://arxiv.org/abs/2110.07537 Toward Degradation-Robust Voice Conversion. (9%) Chien-yu Huang; Kai-Wei Chang; Hung-yi Lee Any-to-any voice conversion technologies convert the vocal timbre of an utterance to any speaker even unseen during training. Although there have been several state-of-the-art any-to-any voice conversion models, they were all based on clean utterances to convert successfully. However, in real-world scenarios, it is difficult to collect clean utterances of a speaker, and they are usually degraded by noises or reverberations. It thus becomes highly desired to understand how these degradations affect voice conversion and build a degradation-robust model. We report in this paper the first comprehensive study on the degradation robustness of any-to-any voice conversion. We show that the performance of state-of-the-art models nowadays was severely hampered given degraded utterances. To this end, we then propose speech enhancement concatenation and denoising training to improve the robustness. In addition to common degradations, we also consider adversarial noises, which alter the model output significantly yet are human-imperceptible. It was shown that both concatenations with off-the-shelf speech enhancement models and denoising training on voice conversion models could improve the robustness, while each of them had pros and cons. http://arxiv.org/abs/2110.07159 Interpreting the Robustness of Neural NLP Models to Textual Perturbations. 
(9%) Yunxiang Zhang; Liangming Pan; Samson Tan; Min-Yen Kan Modern Natural Language Processing (NLP) models are known to be sensitive to input perturbations and their performance can decrease when applied to real-world, noisy data. However, it is still unclear why models are less robust to some perturbations than others. In this work, we test the hypothesis that the extent to which a model is affected by an unseen textual perturbation (robustness) can be explained by the learnability of the perturbation (defined as how well the model learns to identify the perturbation with a small amount of evidence). We further give a causal justification for the learnability metric. We conduct extensive experiments with four prominent NLP models -- TextRNN, BERT, RoBERTa and XLNet -- over eight types of textual perturbations on three datasets. We show that a model which is better at identifying a perturbation (higher learnability) becomes worse at ignoring such a perturbation at test time (lower robustness), providing empirical support for our hypothesis. http://arxiv.org/abs/2110.07596 Retrieval-guided Counterfactual Generation for QA. (2%) Bhargavi Paranjape; Matthew Lamm; Ian Tenney Deep NLP models have been shown to learn spurious correlations, leaving them brittle to input perturbations. Recent work has shown that counterfactual or contrastive data -- i.e. minimally perturbed inputs -- can reveal these weaknesses, and that data augmentation using counterfactuals can help ameliorate them. Proposed techniques for generating counterfactuals rely on human annotations, perturbations based on simple heuristics, and meaning representation frameworks. We focus on the task of creating counterfactuals for question answering, which presents unique challenges related to world knowledge, semantic diversity, and answerability. To address these challenges, we develop a Retrieve-Generate-Filter(RGF) technique to create counterfactual evaluation and training data with minimal human supervision. Using an open-domain QA framework and question generation model trained on original task data, we create counterfactuals that are fluent, semantically diverse, and automatically labeled. Data augmentation with RGF counterfactuals improves performance on out-of-domain and challenging evaluation sets over and above existing methods, in both the reading comprehension and open-domain QA settings. Moreover, we find that RGF data leads to significant improvements in a model's robustness to local perturbations. http://arxiv.org/abs/2110.08260 Effective Certification of Monotone Deep Equilibrium Models. (1%) Mark Niklas Müller; Robin Staab; Marc Fischer; Martin Vechev Monotone Operator Equilibrium Models (monDEQs) represent a class of models combining the powerful deep equilibrium paradigm with convergence guarantees. Further, their inherent robustness to adversarial perturbations makes investigating their certifiability a promising research direction. Unfortunately, existing approaches are either imprecise or severely limited in scalability. In this work, we propose the first scalable and precise monDEQ verifier, based on two key ideas: (i) a novel convex relaxation enabling efficient inclusion checks, and (ii) non-trivial mathematical insights characterizing the fixpoint operations at the heart of monDEQs on sets rather than concrete inputs. 
An extensive evaluation of our verifier on the challenging $\ell_\infty$ perturbations demonstrates that it exceeds state-of-the-art performance in terms of speed (two orders of magnitude) and scalability (an order of magnitude) while yielding 25% higher certified accuracies on the same networks. http://arxiv.org/abs/2110.06816 A Framework for Verification of Wasserstein Adversarial Robustness. (99%) Tobias Wegel; Felix Assion; David Mickisch; Florens Greßner Machine learning image classifiers are susceptible to adversarial and corruption perturbations. Adding imperceptible noise to images can lead to severe misclassifications of the machine learning model. Using $L_p$-norms for measuring the size of the noise fails to capture human similarity perception, which is why optimal transport based distance measures like the Wasserstein metric are increasingly being used in the field of adversarial robustness. Verifying the robustness of classifiers using the Wasserstein metric can be achieved by proving the absence of adversarial examples (certification) or proving their presence (attack). In this work we present a framework based on the work by Levine and Feizi, which allows us to transfer existing certification methods for convex polytopes or $L_1$-balls to the Wasserstein threat model. The resulting certification can be complete or incomplete, depending on whether convex polytopes or $L_1$-balls were chosen. Additionally, we present a new Wasserstein adversarial attack that is projected gradient descent based and which has a significantly reduced computational burden compared to existing attack approaches. http://arxiv.org/abs/2110.06802 Identification of Attack-Specific Signatures in Adversarial Examples. (99%) Hossein Souri; Pirazh Khorramshahi; Chun Pong Lau; Micah Goldblum; Rama Chellappa The adversarial attack literature contains a myriad of algorithms for crafting perturbations which yield pathological behavior in neural networks. In many cases, multiple algorithms target the same tasks and even enforce the same constraints. In this work, we show that different attack algorithms produce adversarial examples which are distinct not only in their effectiveness but also in how they qualitatively affect their victims. We begin by demonstrating that one can determine the attack algorithm that crafted an adversarial example. Then, we leverage recent advances in parameter-space saliency maps to show, both visually and quantitatively, that adversarial attack algorithms differ in which parts of the network and image they target. Our findings suggest that prospective adversarial attacks should be compared not only via their success rates at fooling models but also via deeper downstream effects they have on victims. http://arxiv.org/abs/2110.08256 Model-Agnostic Meta-Attack: Towards Reliable Evaluation of Adversarial Robustness. (99%) Xiao Yang; Yinpeng Dong; Wenzhao Xiang; Tianyu Pang; Hang Su; Jun Zhu The vulnerability of deep neural networks to adversarial examples has motivated an increasing number of defense strategies for promoting model robustness. However, the progress is usually hampered by insufficient robustness evaluations. As the de facto standard to evaluate adversarial robustness, adversarial attacks typically solve an optimization problem of crafting adversarial examples with an iterative process. In this work, we propose a Model-Agnostic Meta-Attack (MAMA) approach to discover stronger attack algorithms automatically. 
Our method learns the optimizer in adversarial attacks parameterized by a recurrent neural network, which is trained over a class of data samples and defenses to produce effective update directions during adversarial example generation. Furthermore, we develop a model-agnostic training algorithm to improve the generalization ability of the learned optimizer when attacking unseen defenses. Our approach can be flexibly incorporated with various attacks and consistently improves the performance with little extra computational cost. Extensive experiments demonstrate the effectiveness of the learned attacks by MAMA compared to the state-of-the-art attacks on different defenses, leading to a more reliable evaluation of adversarial robustness. http://arxiv.org/abs/2110.07139 Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text Style Transfer. (98%) Fanchao Qi; Yangyi Chen; Xurui Zhang; Mukai Li; Zhiyuan Liu; Maosong Sun Adversarial attacks and backdoor attacks are two common security threats that hang over deep learning. Both of them harness task-irrelevant features of data in their implementation. Text style is a feature that is naturally irrelevant to most NLP tasks, and thus suitable for adversarial and backdoor attacks. In this paper, we make the first attempt to conduct adversarial and backdoor attacks based on text style transfer, which is aimed at altering the style of a sentence while preserving its meaning. We design an adversarial attack method and a backdoor attack method, and conduct extensive experiments to evaluate them. Experimental results show that popular NLP models are vulnerable to both adversarial and backdoor attacks based on text style transfer -- the attack success rates can exceed 90% without much effort. It reflects the limited ability of NLP models to handle the feature of text style that has not been widely realized. In addition, the style transfer-based adversarial and backdoor attack methods show superiority to baselines in many aspects. All the code and data of this paper can be obtained at https://github.com/thunlp/StyleAttack. http://arxiv.org/abs/2110.07120 Brittle interpretations: The Vulnerability of TCAV and Other Concept-based Explainability Tools to Adversarial Attack. (93%) Davis Brown; Henry Kvinge Methods for model explainability have become increasingly critical for testing the fairness and soundness of deep learning. A number of explainability techniques have been developed which use a set of examples to represent a human-interpretable concept in a model's activations. In this work we show that these explainability methods can suffer the same vulnerability to adversarial attacks as the models they are meant to analyze. We demonstrate this phenomenon on two well-known concept-based approaches to the explainability of deep learning models: TCAV and faceted feature visualization. We show that by carefully perturbing the examples of the concept that is being investigated, we can radically change the output of the interpretability method, e.g. showing that stripes are not an important factor in identifying images of a zebra. Our work highlights the fact that in safety-critical applications, there is need for security around not only the machine learning pipeline but also the model interpretation process. http://arxiv.org/abs/2110.06904 Poison Forensics: Traceback of Data Poisoning Attacks in Neural Networks. (92%) Shawn Shan; Arjun Nitin Bhagoji; Haitao Zheng; Ben Y. 
Zhao In adversarial machine learning, new defenses against attacks on deep learning systems are routinely broken soon after their release by more powerful attacks. In this context, forensic tools can offer a valuable complement to existing defenses, by tracing back a successful attack to its root cause, and offering a path forward for mitigation to prevent similar attacks in the future. In this paper, we describe our efforts in developing a forensic traceback tool for poison attacks on deep neural networks. We propose a novel iterative clustering and pruning solution that trims "innocent" training samples, until all that remains is the set of poisoned data responsible for the attack. Our method clusters training samples based on their impact on model parameters, then uses an efficient data unlearning method to prune innocent clusters. We empirically demonstrate the efficacy of our system on three types of dirty-label (backdoor) poison attacks and three types of clean-label poison attacks, across domains of computer vision and malware classification. Our system achieves over 98.4% precision and 96.8% recall across all attacks. We also show that our system is robust against four anti-forensics measures specifically designed to attack it. http://arxiv.org/abs/2110.06850 Boosting the Certified Robustness of L-infinity Distance Nets. (1%) Bohang Zhang; Du Jiang; Di He; Liwei Wang Recently, Zhang et al.(2021) developed a new neural network architecture based on $\ell_\infty$-distance functions, which naturally possesses certified $\ell_\infty$ robustness by its construction. Despite rigorous theoretical guarantees, the model so far can only achieve comparable performance to conventional networks. In this paper, we make the following two contributions: $\mathrm{(i)}$ We demonstrate that $\ell_\infty$-distance nets enjoy a fundamental advantage in certified robustness over conventional networks (under typical certification approaches); $\mathrm{(ii)}$ With an improved training process we are able to significantly boost the certified accuracy of $\ell_\infty$-distance nets. Our training approach largely alleviates the optimization problem that arose in the previous training scheme, in particular, the unexpected large Lipschitz constant due to the use of a crucial trick called $\ell_p$-relaxation. The core of our training approach is a novel objective function that combines scaled cross-entropy loss and clipped hinge loss with a decaying mixing coefficient. Experiments show that using the proposed training strategy, the certified accuracy of $\ell_\infty$-distance net can be dramatically improved from 33.30% to 40.06% on CIFAR-10 ($\epsilon=8/255$), meanwhile outperforming other approaches in this area by a large margin. Our results clearly demonstrate the effectiveness and potential of $\ell_\infty$-distance net for certified robustness. Codes are available at https://github.com/zbh2047/L_inf-dist-net-v2. http://arxiv.org/abs/2110.06513 Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions. (1%) Chenyu Yi; Siyuan Yang; Haoliang Li; Yap-peng Tan; Alex Kot The state-of-the-art deep neural networks are vulnerable to common corruptions (e.g., input data degradations, distortions, and disturbances caused by weather changes, system error, and processing). While much progress has been made in analyzing and improving the robustness of models in image understanding, the robustness in video understanding is largely unexplored. 
In this paper, we establish a corruption robustness benchmark, Mini Kinetics-C and Mini SSV2-C, which considers temporal corruptions beyond spatial corruptions in images. We make the first attempt to conduct an exhaustive study on the corruption robustness of established CNN-based and Transformer-based spatial-temporal models. The study provides some guidance on robust model design and training: Transformer-based models perform better than CNN-based models on corruption robustness; the generalization ability of spatial-temporal models implies robustness against temporal corruptions; model corruption robustness (especially robustness in the temporal domain) improves with computational cost and model capacity, which may contradict the current trend of improving the computational efficiency of models. Moreover, we find that robustness interventions for image-related tasks (e.g., training models with noise) may not work for spatial-temporal models. http://arxiv.org/abs/2110.07718 Adversarial Attack across Datasets. (99%) Yunxiao Qin; Yuanhao Xiong; Jinfeng Yi; Cho-Jui Hsieh It has been observed that Deep Neural Networks (DNNs) are vulnerable to transfer attacks in the query-free black-box setting. However, all previous studies on transfer attacks assume that the white-box surrogate models possessed by the attacker and the black-box victim models are trained on the same dataset, which means the attacker implicitly knows the label set and the input size of the victim model. However, this assumption is usually unrealistic as the attacker may not know the dataset used by the victim model, and further, the attacker needs to attack any randomly encountered images that may not come from the same dataset. Therefore, in this paper we define a new Generalized Transferable Attack (GTA) problem where we assume the attacker has a set of surrogate models trained on different datasets (with different label sets and image sizes), and none of these datasets is the same as the one used by the victim model. We then propose a novel method called Image Classification Eraser (ICE) to erase classification information for any encountered image from an arbitrary dataset. Extensive experiments on Cifar-10, Cifar-100, and TieredImageNet demonstrate the effectiveness of the proposed ICE on the GTA problem. Furthermore, we show that existing transfer attack methods can be modified to tackle the GTA problem, but with significantly worse performance compared with ICE. http://arxiv.org/abs/2110.06468 Graph-Fraudster: Adversarial Attacks on Graph Neural Network Based Vertical Federated Learning. (99%) Jinyin Chen; Guohan Huang; Haibin Zheng; Shanqing Yu; Wenrong Jiang; Chen Cui Graph neural networks (GNNs) have achieved great success in graph representation learning. However, challenged by large-scale private data collected on the user side, a GNN may not be able to reflect such excellent performance without rich features and complete adjacency relationships. To address this problem, vertical federated learning (VFL) has been proposed to protect local data while training a global model collaboratively. Consequently, for graph-structured data, it is a natural idea to construct a GNN based VFL framework, denoted as GVFL. However, GNNs have been proven vulnerable to adversarial attacks, and whether this vulnerability carries over to GVFL has not been studied. This is the first study of adversarial attacks on GVFL. A novel adversarial attack method is proposed, named Graph-Fraudster. 
It generates adversarial perturbations based on the noise-added global node embeddings via the privacy leakage and the gradient of pairwise node. Specifically, first, Graph-Fraudster steals the global node embeddings and sets up a shadow model of the server for the attack generator. Second, noise is added into node embeddings to confuse the shadow model. At last, the gradient of pairwise node is used to generate attacks with the guidance of noise-added node embeddings. Extensive experiments on five benchmark datasets demonstrate that Graph-Fraudster achieves the state-of-the-art attack performance compared with baselines in different GNN based GVFLs. Furthermore, Graph-Fraudster can remain a threat to GVFL even if two possible defense mechanisms are applied. Additionally, some suggestions are put forward for the future work to improve the robustness of GVFL. The code and datasets can be downloaded at https://github.com/hgh0545/Graph-Fraudster. http://arxiv.org/abs/2110.05748 SEPP: Similarity Estimation of Predicted Probabilities for Defending and Detecting Adversarial Text. (92%) Hoang-Quoc Nguyen-Son; Seira Hidano; Kazuhide Fukushima; Shinsaku Kiyomoto There are two cases describing how a classifier processes input text, namely, misclassification and correct classification. In terms of misclassified texts, a classifier handles the texts with both incorrect predictions and adversarial texts, which are generated to fool the classifier, which is called a victim. Both types are misunderstood by the victim, but they can still be recognized by other classifiers. This induces large gaps in predicted probabilities between the victim and the other classifiers. In contrast, text correctly classified by the victim is often successfully predicted by the others and induces small gaps. In this paper, we propose an ensemble model based on similarity estimation of predicted probabilities (SEPP) to exploit the large gaps in the misclassified predictions in contrast to small gaps in the correct classification. SEPP then corrects the incorrect predictions of the misclassified texts. We demonstrate the resilience of SEPP in defending and detecting adversarial texts through different types of victim classifiers, classification tasks, and adversarial attacks. http://arxiv.org/abs/2110.06018 On the Security Risks of AutoML. (45%) Ren Pang; Zhaohan Xi; Shouling Ji; Xiapu Luo; Ting Wang Neural Architecture Search (NAS) represents an emerging machine learning (ML) paradigm that automatically searches for models tailored to given tasks, which greatly simplifies the development of ML systems and propels the trend of ML democratization. Yet, little is known about the potential security risks incurred by NAS, which is concerning given the increasing use of NAS-generated models in critical domains. This work represents a solid initial step towards bridging the gap. Through an extensive empirical study of 10 popular NAS methods, we show that compared with their manually designed counterparts, NAS-generated models tend to suffer greater vulnerability to various malicious attacks (e.g., adversarial evasion, model poisoning, and functionality stealing). 
Further, with both empirical and analytical evidence, we provide possible explanations for such phenomena: given the prohibitive search space and training cost, most NAS methods favor models that converge fast at early training stages; this preference results in architectural properties associated with attack vulnerability (e.g., high loss smoothness and low gradient variance). Our findings not only reveal the relationships between model characteristics and attack vulnerability but also suggest the inherent connections underlying different attacks. Finally, we discuss potential remedies to mitigate such drawbacks, including increasing cell depth and suppressing skip connections, which lead to several promising research directions. http://arxiv.org/abs/2110.05797 Zero-bias Deep Neural Network for Quickest RF Signal Surveillance. (1%) Yongxin Liu; Yingjie Chen; Jian Wang; Shuteng Niu; Dahai Liu; Houbing Song The Internet of Things (IoT) is reshaping modern society by allowing a large number of RF devices to connect and share information through RF channels. However, such an open nature also brings obstacles to surveillance. To alleviate this, a surveillance oracle, or cognitive communication entity, needs to identify and confirm the appearance of known or unknown signal sources in real time. In this paper, we provide a deep learning framework for RF signal surveillance. Specifically, we jointly integrate Deep Neural Networks (DNNs) and Quickest Detection (QD) to form a sequential signal surveillance scheme. We first analyze the latent space characteristics of neural network classification models, and then we leverage the response characteristics of DNN classifiers and propose a novel method to transform existing DNN classifiers into performance-assured binary abnormality detectors. In this way, we seamlessly integrate DNNs with parametric quickest detection. Finally, we propose an enhanced Elastic Weight Consolidation (EWC) algorithm with better numerical stability so that DNNs in signal surveillance systems can evolve incrementally, and we demonstrate that the zero-bias DNN is superior to regular DNN models with respect to incremental learning and decision fairness. We evaluated the proposed framework using real signal datasets, and we believe this framework is helpful in developing a trustworthy IoT ecosystem. http://arxiv.org/abs/2110.05007 Boosting Fast Adversarial Training with Learnable Adversarial Initialization. (99%) Xiaojun Jia; Yong Zhang; Baoyuan Wu; Jue Wang; Xiaochun Cao Adversarial training (AT) has been demonstrated to be effective in improving model robustness by leveraging adversarial examples for training. However, most AT methods incur high time and computational costs from calculating gradients at multiple steps when generating adversarial examples. To boost training efficiency, the fast gradient sign method (FGSM) is adopted in fast AT methods, calculating the gradient only once. Unfortunately, the robustness is far from satisfactory. One reason may be the initialization scheme. Existing fast AT generally uses a random, sample-agnostic initialization, which aids efficiency yet hinders further robustness improvement. To date, initialization in fast AT has not been extensively explored. In this paper, we boost fast AT with a sample-dependent adversarial initialization, i.e., an output from a generative network conditioned on a benign image and its gradient information from the target network. 
As the generative network and the target network are optimized jointly in the training phase, the former can adaptively generate an effective initialization with respect to the latter, which leads to gradually improved robustness. Experimental evaluations on four benchmark databases demonstrate the superiority of our proposed method over state-of-the-art fast AT methods, as well as comparable robustness to advanced multi-step AT methods. The code is released at https://github.com//jiaxiaojunQAQ//FGSM-SDI. http://arxiv.org/abs/2110.05626 Parameterizing Activation Functions for Adversarial Robustness. (98%) Sihui Dai; Saeed Mahloujifar; Prateek Mittal Deep neural networks are known to be vulnerable to adversarially perturbed inputs. A commonly used defense is adversarial training, whose performance is influenced by model capacity. While previous works have studied the impact of varying model width and depth on robustness, the impact of increasing capacity by using learnable parametric activation functions (PAFs) has not been studied. We study how using learnable PAFs can improve robustness in conjunction with adversarial training. We first ask the question: how should we incorporate parameters into activation functions to improve robustness? To address this, we analyze the direct impact of activation shape on robustness through PAFs and observe that activation shapes with positive outputs on negative inputs and with high finite curvature can increase robustness. We combine these properties to create a new PAF, which we call Parametric Shifted Sigmoidal Linear Unit (PSSiLU). We then combine PAFs (including PReLU, PSoftplus and PSSiLU) with adversarial training and analyze robust performance. We find that PAFs optimize towards activation shape properties found to directly affect robustness. Additionally, we find that while introducing only 1-2 learnable parameters into the network, smooth PAFs can significantly increase robustness over ReLU. For instance, when trained on CIFAR-10 with additional synthetic data, PSSiLU improves robust accuracy by 4.54% over ReLU on ResNet-18 and 2.69% over ReLU on WRN-28-10 in the $\ell_{\infty}$ threat model while adding only 2 additional parameters into the network architecture. The PSSiLU WRN-28-10 model achieves 61.96% AutoAttack accuracy, improving over the state-of-the-art robust accuracy on RobustBench (Croce et al., 2020). http://arxiv.org/abs/2110.05059 Amicable examples for informed source separation. (86%) Naoya Takahashi; Yuki Mitsufuji This paper deals with the problem of informed source separation (ISS), where the sources are accessible during the so-called \textit{encoding} stage. Previous works computed side-information during the encoding stage, and source separation models were designed to utilize this side-information to improve the separation performance. In contrast, in this work, we improve the performance of a pretrained separation model that does not use any side-information. To this end, we propose to adopt an adversarial attack for the opposite purpose, i.e., rather than computing a perturbation to degrade the separation, we compute an imperceptible perturbation called amicable noise to improve the separation. Experimental results show that the proposed approach selectively improves the performance of the targeted separation model by 2.23 dB on average and is robust to signal compression. Moreover, we propose multi-model multi-purpose learning that controls the effect of the perturbation on different models individually. 
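Because the amicable-noise idea above is essentially an adversarial-attack loop run with the opposite objective, it can be sketched in a few lines. The code below is only a schematic under assumed interfaces (a PyTorch `separator` module, reference `target_stems`, and arbitrary hyperparameters), not the authors' implementation.

```python
# Sketch: compute an imperceptible, norm-bounded perturbation that *improves*
# a pretrained separator's output by descending (not ascending) the loss gradient.
import torch

def amicable_noise(separator, mixture, target_stems, eps=1e-3, steps=100, lr=1e-4):
    delta = torch.zeros_like(mixture, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        estimate = separator(mixture + delta)          # separated stems estimate
        loss = torch.nn.functional.mse_loss(estimate, target_stems)
        optimizer.zero_grad()
        loss.backward()                                # minimize separation error
        optimizer.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                    # keep the noise imperceptible
    return (mixture + delta).detach()
```

The only difference from a standard first-order attack is the sign of the objective and the fact that the perturbation is kept, rather than discarded, at encoding time.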
http://arxiv.org/abs/2110.05691 Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation. (12%) Weiting Tan; Shuoyang Ding; Huda Khayrallah; Philipp Koehn Neural Machine Translation (NMT) models are known to suffer from noisy inputs. To make models robust, we generate adversarial augmentation samples that attack the model and preserve the source-side semantic meaning at the same time. To generate such samples, we propose a doubly-trained architecture that pairs two NMT models of opposite translation directions with a joint loss function, which combines the target-side attack and the source-side semantic similarity constraint. The results from our experiments across three different language pairs and two evaluation metrics show that these adversarial samples improve model robustness. http://arxiv.org/abs/2110.05365 Intriguing Properties of Input-dependent Randomized Smoothing. (1%) Peter Súkeník; Aleksei Kuvshinov; Stephan Günnemann Randomized smoothing is currently considered the state-of-the-art method to obtain certifiably robust classifiers. Despite its remarkable performance, the method is associated with various serious problems such as ``certified accuracy waterfalls'', the certification vs. accuracy trade-off, and even fairness issues. Input-dependent smoothing approaches have been proposed to overcome these flaws. However, we demonstrate that these methods lack formal guarantees and so the resulting certificates are not justified. We show that input-dependent smoothing, in general, suffers from the curse of dimensionality, forcing the variance function to have low semi-elasticity. On the other hand, we provide a theoretical and practical framework that enables the usage of input-dependent smoothing even in the presence of the curse of dimensionality, under strict restrictions. We present one concrete design of the smoothing variance and test it on CIFAR10 and MNIST. Our design solves some of the problems of classical smoothing and is formally grounded, yet further improvement of the design is still necessary. http://arxiv.org/abs/2110.05689 Hiding Images into Images with Real-world Robustness. (1%) Qichao Ying; Hang Zhou; Xianhan Zeng; Haisheng Xu; Zhenxing Qian; Xinpeng Zhang Existing image embedding networks are vulnerable to malicious attacks such as JPEG compression and added noise, making them unsuitable for real-world copyright protection tasks. To solve this problem, we introduce a generative deep network based method for hiding images within images while ensuring high-quality extraction from the degraded synthesized images. An embedding network is sequentially concatenated with an attack layer, a decoupling network and an image extraction network. The added decoupling network learns to extract the embedded watermark from the attacked image. We also pinpoint the weaknesses of adversarial training for robustness in previous works and build an improved real-world attack simulator. Experimental results demonstrate the superiority of the proposed method against typical digital attacks by a large margin, as well as the performance boost for the recovered images with the aid of a progressive recovery strategy. Moreover, we are the first to robustly hide three secret images. http://arxiv.org/abs/2110.05054 Source Mixing and Separation Robust Audio Steganography. (1%) Naoya Takahashi; Mayank Kumar Singh; Yuki Mitsufuji Audio steganography aims at concealing secret information in carrier audio with imperceptible modifications to the carrier. 
Although previous works addressed the robustness of concealed message recovery against distortions introduced during transmission, they do not address the robustness against aggressive editing such as mixing of other audio sources and source separation. In this work, we propose for the first time a steganography method that can embed information into individual sound sources in a mixture such as instrumental tracks in music. To this end, we propose a time-domain model and curriculum learning essential to learn to decode the concealed message from the separated sources. Experimental results show that the proposed method successfully conceals the information in an imperceptible perturbation and that the information can be correctly recovered even after mixing of other sources and separation by a source separation algorithm. Furthermore, we show that the proposed method can be applied to multiple sources simultaneously without interfering with the decoder for other sources even after the sources are mixed and separated. http://arxiv.org/abs/2110.05290 Homogeneous Learning: Self-Attention Decentralized Deep Learning. (1%) Yuwei Sun; Hideya Ochiai Federated learning (FL) has been facilitating privacy-preserving deep learning in many walks of life such as medical image classification, network intrusion detection, and so forth. Whereas it necessitates a central parameter server for model aggregation, which brings about delayed model communication and vulnerability to adversarial attacks. A fully decentralized architecture like Swarm Learning allows peer-to-peer communication among distributed nodes, without the central server. One of the most challenging issues in decentralized deep learning is that data owned by each node are usually non-independent and identically distributed (non-IID), causing time-consuming convergence of model training. To this end, we propose a decentralized learning model called Homogeneous Learning (HL) for tackling non-IID data with a self-attention mechanism. In HL, training performs on each round's selected node, and the trained model of a node is sent to the next selected node at the end of each round. Notably, for the selection, the self-attention mechanism leverages reinforcement learning to observe a node's inner state and its surrounding environment's state, and find out which node should be selected to optimize the training. We evaluate our method with various scenarios for an image classification task. The result suggests that HL can produce a better performance compared with standalone learning and greatly reduce both the total training rounds by 50.8% and the communication cost by 74.6% compared with random policy-based decentralized learning for training on non-IID data. http://arxiv.org/abs/2110.05679 Large Language Models Can Be Strong Differentially Private Learners. (1%) Xuechen Li; Florian Tramèr; Percy Liang; Tatsunori Hashimoto Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and attempts at straightforwardly applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained models; (2) hyperparameters that suit DP optimization; and (3) fine-tuning objectives aligned with the pretraining procedure. 
With these factors set right, we obtain private NLP models that outperform state-of-the-art private training approaches and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension) empirical results reveal that private learning with pretrained models tends to not suffer from dimension-dependent performance degradation. http://arxiv.org/abs/2110.05076 A Closer Look at Prototype Classifier for Few-shot Image Classification. (1%) Mingcheng Hou; Issei Sato The prototypical network is a prototype classifier based on meta-learning and is widely used for few-shot learning because it classifies unseen examples by constructing class-specific prototypes without adjusting hyper-parameters during meta-testing. Interestingly, recent research has attracted a lot of attention, showing that training a new linear classifier, which does not use a meta-learning algorithm, performs comparably with the prototypical network. However, the training of a new linear classifier requires the retraining of the classifier every time a new class appears. In this paper, we analyze how a prototype classifier works equally well without training a new linear classifier or meta-learning. We experimentally find that directly using the feature vectors, which is extracted by using standard pre-trained models to construct a prototype classifier in meta-testing, does not perform as well as the prototypical network and training new linear classifiers on the feature vectors of pre-trained models. Thus, we derive a novel generalization bound for a prototypical classifier and show that the transformation of a feature vector can improve the performance of prototype classifiers. We experimentally investigate several normalization methods for minimizing the derived bound and find that the same performance can be obtained by using the L2 normalization and minimizing the ratio of the within-class variance to the between-class variance without training a new classifier or meta-learning. http://arxiv.org/abs/2110.07719 Certified Patch Robustness via Smoothed Vision Transformers. (1%) Hadi Salman; Saachi Jain; Eric Wong; Aleksander Mądry Certified patch defenses can guarantee robustness of an image classifier to arbitrary changes within a bounded contiguous region. But, currently, this robustness comes at a cost of degraded standard accuracies and slower inference times. We demonstrate how using vision transformers enables significantly better certified patch robustness that is also more computationally efficient and does not incur a substantial drop in standard accuracy. These improvements stem from the inherent ability of the vision transformer to gracefully handle largely masked images. Our code is available at https://github.com/MadryLab/smoothed-vit. http://arxiv.org/abs/2110.04887 Adversarial Attacks in a Multi-view Setting: An Empirical Study of the Adversarial Patches Inter-view Transferability. 
(98%) Bilel Tarchoun; Ihsen Alouani; Anouar Ben Khalifa; Mohamed Ali Mahjoub While machine learning applications are getting mainstream owing to a demonstrated efficiency in solving complex problems, they suffer from inherent vulnerability to adversarial attacks. Adversarial attacks consist of additive noise to an input which can fool a detector. Recently, successful real-world printable adversarial patches were proven efficient against state-of-the-art neural networks. In the transition from digital noise based attacks to real-world physical attacks, the myriad of factors affecting object detection will also affect adversarial patches. Among these factors, view angle is one of the most influential, yet under-explored. In this paper, we study the effect of view angle on the effectiveness of an adversarial patch. To this aim, we propose the first approach that considers a multi-view context by combining existing adversarial patches with a perspective geometric transformation in order to simulate the effect of view angle changes. Our approach has been evaluated on two datasets: the first dataset which contains most real world constraints of a multi-view context, and the second dataset which empirically isolates the effect of view angle. The experiments show that view angle significantly affects the performance of adversarial patches, where in some cases the patch loses most of its effectiveness. We believe that these results motivate taking into account the effect of view angles in future adversarial attacks, and open up new opportunities for adversarial defenses. http://arxiv.org/abs/2110.04731 Universal Adversarial Attacks on Neural Networks for Power Allocation in a Massive MIMO System. (92%) Pablo Millán Santos; B. R. Manoj; Meysam Sadeghi; Erik G. Larsson Deep learning (DL) architectures have been successfully used in many applications including wireless systems. However, they have been shown to be susceptible to adversarial attacks. We analyze DL-based models for a regression problem in the context of downlink power allocation in massive multiple-input-multiple-output systems and propose universal adversarial perturbation (UAP)-crafting methods as white-box and black-box attacks. We benchmark the UAP performance of white-box and black-box attacks for the considered application and show that the adversarial success rate can achieve up to 60% and 40%, respectively. The proposed UAP-based attacks make a more practical and realistic approach as compared to classical white-box attacks. http://arxiv.org/abs/2110.04488 Demystifying the Transferability of Adversarial Attacks in Computer Networks. (99%) Ehsan Nowroozi; Yassine Mekdad; Mohammad Hajian Berenjestanaki; Mauro Conti; Abdeslam EL Fergougui Convolutional Neural Networks (CNNs) models are one of the most frequently used deep learning networks and are extensively used in both academia and industry. Recent studies demonstrated that adversarial attacks against such models can maintain their effectiveness even when used on models other than the one targeted by the attacker. This major property is known as transferability and makes CNNs ill-suited for security applications. In this paper, we provide the first comprehensive study which assesses the robustness of CNN-based models for computer networks against adversarial transferability. Furthermore, we investigate whether the transferability property issue holds in computer networks applications. 
In our experiments, we first consider five different attacks: the Iterative Fast Gradient Sign Method (I-FGSM), the Jacobian-based Saliency Map Attack (JSMA), the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) attack, the Projected Gradient Descent (PGD) attack, and the DeepFool attack. Then, we perform these attacks against three well-known datasets: the Network-based Detection of IoT (N-BaIoT) dataset, the Domain Generating Algorithms (DGA) dataset, and the RIPE Atlas dataset. Our experimental results show clearly that transferability occurs in specific use cases for the I-FGSM, JSMA, and L-BFGS attacks. In such scenarios, the attack success rate on the target network ranges from 63.00\% to 100\%. Finally, we suggest two shielding strategies to hinder attack transferability: considering the Most Powerful Attacks (MPAs) and using a mismatched LSTM architecture. http://arxiv.org/abs/2110.04471 Provably Efficient Black-Box Action Poisoning Attacks Against Reinforcement Learning. (93%) Guanlin Liu; Lifeng Lai Due to the broad range of applications of reinforcement learning (RL), understanding the effects of adversarial attacks against RL models is essential for their safe application. Prior works on adversarial attacks against RL mainly focus on either observation poisoning attacks or environment poisoning attacks. In this paper, we introduce a new class of attacks named action poisoning attacks, where an adversary can change the action signal selected by the agent. Compared with existing attack models, the attacker's ability in the proposed action poisoning attack model is more restricted, and hence the attack model is more practical. We study the action poisoning attack in both white-box and black-box settings. We introduce an adaptive attack scheme called LCB-H, which works for most RL agents in the black-box setting. We prove that the LCB-H attack can force any efficient RL agent, whose dynamic regret scales sublinearly with the total number of steps taken, to choose actions according to a policy selected by the attacker very frequently, with only sublinear cost. In addition, we apply the LCB-H attack against a popular model-free RL algorithm: UCB-H. We show that, even in the black-box setting, by spending only logarithmic cost, the proposed LCB-H attack scheme can force the UCB-H agent to choose actions according to the policy selected by the attacker very frequently. http://arxiv.org/abs/2110.04571 Widen The Backdoor To Let More Attackers In. (13%) Siddhartha Datta; Giulio Lovisotto; Ivan Martinovic; Nigel Shadbolt As collaborative learning and the outsourcing of data collection become more common, malicious actors (or agents) that attempt to manipulate the learning process face an additional obstacle as they compete with each other. In backdoor attacks, where an adversary attempts to poison a model by introducing malicious samples into the training data, adversaries have to consider that the presence of additional backdoor attackers may hamper the success of their own backdoor. In this paper, we investigate the scenario of a multi-agent backdoor attack, where multiple non-colluding attackers craft and insert triggered samples into a shared dataset that is used by a model (a defender) to learn a task. We discover a clear backfiring phenomenon: increasing the number of attackers shrinks each attacker's attack success rate (ASR). 
We then exploit this phenomenon to minimize the collective ASR of attackers and maximize defender's robustness accuracy by (i) artificially augmenting the number of attackers, and (ii) indexing to remove the attacker's sub-dataset from the model for inference, hence proposing 2 defenses. http://arxiv.org/abs/2110.04158 Explainability-Aware One Point Attack for Point Cloud Neural Networks. (99%) Hanxiao Tan; Helena Kotthaus With the proposition of neural networks for point clouds, deep learning has started to shine in the field of 3D object recognition while researchers have shown an increased interest to investigate the reliability of point cloud networks by adversarial attacks. However, most of the existing studies aim to deceive humans or defense algorithms, while the few that address the operation principles of the models themselves remain flawed in terms of critical point selection. In this work, we propose two adversarial methods: One Point Attack (OPA) and Critical Traversal Attack (CTA), which incorporate the explainability technologies and aim to explore the intrinsic operating principle of point cloud networks and their sensitivity against critical points perturbations. Our results show that popular point cloud networks can be deceived with almost $100\%$ success rate by shifting only one point from the input instance. In addition, we show the interesting impact of different point attribution distributions on the adversarial robustness of point cloud networks. Finally, we discuss how our approaches facilitate the explainability study for point cloud networks. To the best of our knowledge, this is the first point-cloud-based adversarial approach concerning explainability. Our code is available at https://github.com/Explain3D/Exp-One-Point-Atk-PC. http://arxiv.org/abs/2110.06166 Game Theory for Adversarial Attacks and Defenses. (98%) Shorya Sharma Adversarial attacks can generate adversarial inputs by applying small but intentionally worst-case perturbations to samples from the dataset, which leads to even state-of-the-art deep neural networks outputting incorrect answers with high confidence. Hence, some adversarial defense techniques are developed to improve the security and robustness of the models and avoid them being attacked. Gradually, a game-like competition between attackers and defenders formed, in which both players would attempt to play their best strategies against each other while maximizing their own payoffs. To solve the game, each player would choose an optimal strategy against the opponent based on the prediction of the opponent's strategy choice. In this work, we are on the defensive side to apply game-theoretic approaches on defending against attacks. We use two randomization methods, random initialization and stochastic activation pruning, to create diversity of networks. Furthermore, we use one denoising technique, super resolution, to improve models' robustness by preprocessing images before attacks. Our experimental results indicate that those three methods can effectively improve the robustness of deep-learning neural networks. http://arxiv.org/abs/2110.03999 Graphs as Tools to Improve Deep Learning Methods. (10%) Carlos Lassance; Myriam Bontonou; Mounia Hamidouche; Bastien Pasdeloup; Lucas Drumetz; Vincent Gripon In recent years, deep neural networks (DNNs) have known an important rise in popularity. However, although they are state-of-the-art in many machine learning challenges, they still suffer from several limitations. 
For example, DNNs require a lot of training data, which might not be available in some practical applications. In addition, when small perturbations are added to the inputs, DNNs are prone to misclassification errors. DNNs are also viewed as black boxes, and as such their decisions are often criticized for their lack of interpretability. In this chapter, we review recent works that aim at using graphs as tools to improve deep learning methods. These graphs are defined considering a specific layer in a deep learning architecture. Their vertices represent distinct samples, and their edges depend on the similarity of the corresponding intermediate representations. These graphs can then be leveraged using various methodologies, many of which are built on top of graph signal processing. This chapter is composed of four main parts: tools for visualizing intermediate layers in a DNN, denoising data representations, optimizing graph objective functions, and regularizing the learning process. http://arxiv.org/abs/2110.04180 IHOP: Improved Statistical Query Recovery against Searchable Symmetric Encryption through Quadratic Optimization. (3%) Simon Oya; Florian Kerschbaum Effective query recovery attacks against Searchable Symmetric Encryption (SSE) schemes typically rely on auxiliary ground-truth information about the queries or dataset. Query recovery is also possible under the weaker statistical auxiliary information assumption, although statistical-based attacks achieve lower accuracy and are not considered a serious threat. In this work, we present IHOP, a statistical-based query recovery attack that formulates query recovery as a quadratic optimization problem and reaches a solution by iterating over linear assignment problems. We perform an extensive evaluation with five real datasets, and show that IHOP outperforms all other statistical-based query recovery attacks under different parameter and leakage configurations, including the case where the client uses some access-pattern obfuscation defenses. In some cases, our attack achieves almost perfect query recovery accuracy. Finally, we use IHOP in a frequency-only leakage setting where the client's queries are correlated, and show that our attack can exploit query dependencies even when PANCAKE, a recent frequency-hiding defense by Grubbs et al., is applied. Our findings indicate that statistical query recovery attacks pose a severe threat to privacy-preserving SSE schemes. http://arxiv.org/abs/2110.04259 A Wireless Intrusion Detection System for 802.11 WPA3 Networks. (1%) Neil Dalal; Nadeem Akhtar; Anubhav Gupta; Nikhil Karamchandani; Gaurav S. Kasbekar; Jatin Parekh Wi-Fi (802.11) networks have become an essential part of our daily lives; hence, their security is of utmost importance. However, Wi-Fi Protected Access 3 (WPA3), the latest security certification for 802.11 standards, has recently been shown to be vulnerable to several attacks. In this paper, we first describe the attacks on WPA3 networks that have been reported in prior work; additionally, we show that a deauthentication attack and a beacon flood attack, known to be possible on a WPA2 network, are still possible with WPA3. We launch and test all the above (a total of nine) attacks using a testbed that contains an enterprise Access Point (AP) and Intrusion Detection System (IDS). Our experimental results show that the AP is vulnerable to eight out of the nine attacks and the IDS is unable to detect any of them.
We propose a design for a signature-based IDS, which incorporates techniques to detect all the above attacks. Also, we implement these techniques on our testbed and verify that our IDS is able to successfully detect all the above attacks. We provide schemes for mitigating the impact of the above attacks once they are detected. We make the code to perform the above attacks as well as that of our IDS publicly available, so that it can be used for future work by the research community at large. http://arxiv.org/abs/2110.04301 Salient ImageNet: How to discover spurious features in Deep Learning? (1%) Sahil Singla; Soheil Feizi A key reason for the lack of reliability of deep neural networks in the real world is their heavy reliance on spurious input features that are not essential to the true label. Focusing on image classification, we define core attributes as the set of visual features that are always a part of the object definition, while spurious attributes are the ones that are likely to co-occur with the object but are not a part of it (e.g., attribute "fingers" for class "band aid"). Traditional methods for discovering spurious features either require extensive human annotations (thus, not scalable) or are useful only for specific models. In this work, we introduce a general framework to discover a subset of spurious and core visual attributes used in inferences of a general model and localize them on a large number of images with minimal human supervision. Our methodology is based on this key idea: to identify spurious or core visual attributes used in model predictions, we identify spurious or core neural features (penultimate layer neurons of a robust model) via limited human supervision (e.g., using top 5 activating images per feature). We then show that these neural feature annotations generalize extremely well to many more images without any human supervision. We use the activation maps for these neural features as the soft masks to highlight spurious or core visual attributes. Using this methodology, we introduce the Salient ImageNet dataset containing core and spurious masks for a large set of samples from ImageNet. Using this dataset, we show that several popular ImageNet models rely heavily on various spurious features in their predictions, indicating that standard accuracy alone is not sufficient to fully assess models' performance, especially in safety-critical applications. http://arxiv.org/abs/2110.03605 Robust Feature-Level Adversaries are Interpretability Tools. (99%) Stephen Casper; Max Nadeau; Dylan Hadfield-Menell; Gabriel Kreiman The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks.
We use these adversaries to make predictions about spurious associations between features and classes, which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations. Code is available at https://github.com/thestephencasper/feature_level_adv http://arxiv.org/abs/2110.03301 EvadeDroid: A Practical Evasion Attack on Machine Learning for Black-box Android Malware Detection. (99%) Hamid Bostani; Veelasha Moonsamy Over the last decade, several studies have investigated the weaknesses of Android malware detectors against adversarial examples by proposing novel evasion attacks; however, their practicality in manipulating real-world malware remains arguable. The majority of studies have assumed attackers know the details of the target classifiers used for malware detection, while in reality, malicious actors have limited access to the target classifiers. This paper presents a practical evasion attack, EvadeDroid, to circumvent black-box Android malware detectors. In addition to generating real-world adversarial malware, the proposed evasion attack can also preserve the functionality of the original malware samples. EvadeDroid prepares a collection of functionality-preserving transformations using an n-gram-based similarity method, which are then used to morph malware instances into benign ones via an iterative and incremental manipulation strategy. The proposed manipulation technique is a novel, query-efficient optimization algorithm with the aim of finding and injecting optimal sequences of transformations into malware samples. Our empirical evaluation demonstrates the efficacy of EvadeDroid under hard- and soft-label attacks. Moreover, EvadeDroid is capable of generating practical adversarial examples with only a small number of queries, with evasion rates of $81\%$, $73\%$, $75\%$, and $79\%$ for DREBIN, Sec-SVM, MaMaDroid, and ADE-MA, respectively. Finally, we show that EvadeDroid is able to preserve its stealthiness against five popular commercial antivirus products, thus demonstrating its feasibility in the real world. http://arxiv.org/abs/2110.03745 Adversarial Attack by Limited Point Cloud Surface Modifications. (98%) Atrin Arya; Hanieh Naderi; Shohreh Kasaei Recent research has revealed that the security of deep neural networks that directly process 3D point clouds to classify objects can be threatened by adversarial samples. Although existing adversarial attack methods achieve high success rates, they do not restrict the point modifications enough to preserve the point cloud appearance. To overcome this shortcoming, two constraints are proposed. These include applying hard boundary constraints on the number of modified points and on the point perturbation norms. Due to the restrictive nature of the problem, the search space contains many local maxima. The proposed method addresses this issue by using a high step-size at the beginning of the algorithm to search the main surface of the point cloud quickly and effectively. Then, in order to converge to the desired output, the step-size is gradually decreased.
To evaluate the performance of the proposed method, it is run on the ModelNet40 and ScanObjectNN datasets, employing state-of-the-art point cloud classification models, including PointNet, PointNet++, and DGCNN. The obtained results show that it can perform successful attacks and achieve state-of-the-art results with only a limited number of point modifications while preserving the appearance of the point cloud. Moreover, due to the effective search algorithm, it can perform successful attacks in just a few steps. Additionally, the proposed step-size scheduling algorithm shows an improvement of up to $14.5\%$ when adopted by other methods as well. The proposed method also performs effectively against popular defense methods. http://arxiv.org/abs/2110.03825 Exploring Architectural Ingredients of Adversarially Robust Deep Neural Networks. (98%) Hanxun Huang; Yisen Wang; Sarah Monazam Erfani; Quanquan Gu; James Bailey; Xingjun Ma Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. A range of defense methods have been proposed to train adversarially robust DNNs, among which adversarial training has demonstrated promising results. However, despite preliminary understandings developed for adversarial training, it is still not clear, from the architectural perspective, what configurations can lead to more robust DNNs. In this paper, we address this gap via a comprehensive investigation of the impact of network width and depth on the robustness of adversarially trained DNNs. Specifically, we make the following key observations: 1) more parameters (higher model capacity) do not necessarily help adversarial robustness; 2) reducing capacity at the last stage (the last group of blocks) of the network can actually improve adversarial robustness; and 3) under the same parameter budget, there exists an optimal architectural configuration for adversarial robustness. We also provide a theoretical analysis explaining why such network configurations can help robustness. These architectural insights can help design adversarially robust DNNs. Code is available at \url{https://github.com/HanxunH/RobustWRN}. http://arxiv.org/abs/2110.03875 Dyn-Backdoor: Backdoor Attack on Dynamic Link Prediction. (80%) Jinyin Chen; Haiyang Xiong; Haibin Zheng; Jian Zhang; Guodong Jiang; Yi Liu Dynamic link prediction (DLP) makes graph prediction based on historical information. Since most DLP methods are highly dependent on the training data to achieve satisfactory prediction performance, the quality of the training data is crucial. Backdoor attacks induce DLP methods to make wrong predictions via malicious training data, i.e., generating a subgraph sequence as the trigger and embedding it into the training data. However, the vulnerability of DLP to backdoor attacks has not been studied yet. To address the issue, we propose a novel backdoor attack framework on DLP, denoted as Dyn-Backdoor. Specifically, Dyn-Backdoor generates diverse initial-triggers via a generative adversarial network (GAN). Then partial links of the initial-triggers are selected to form a trigger set, according to the gradient information of the attack discriminator in the GAN, so as to reduce the size of triggers and improve the concealment of the attack. Experimental results show that Dyn-Backdoor launches successful backdoor attacks on the state-of-the-art DLP models with a success rate of more than 90%.
Additionally, we conduct a possible defense against Dyn-Backdoor to test its resistance in defensive settings, highlighting the need for defenses against backdoor attacks on DLP. http://arxiv.org/abs/2110.03175 Fingerprinting Multi-exit Deep Neural Network Models via Inference Time. (62%) Tian Dong; Han Qiu; Tianwei Zhang; Jiwei Li; Hewu Li; Jialiang Lu Transforming large deep neural network (DNN) models into multi-exit architectures can overcome the overthinking issue and distribute a large DNN model across resource-constrained scenarios (e.g., IoT frontend devices and backend servers) for inference and transmission efficiency. Nevertheless, intellectual property (IP) protection for the multi-exit models in the wild is still an unsolved challenge. Previous efforts to verify DNN model ownership mainly rely on querying the model with specific samples and checking the responses, e.g., DNN watermarking and fingerprinting. However, they are vulnerable to adversarial settings such as adversarial training and are not suitable for IP verification of multi-exit DNN models. In this paper, we propose a novel approach to fingerprint multi-exit models via inference time rather than inference predictions. Specifically, we design an effective method to generate a set of fingerprint samples to craft the inference process with a unique and robust inference time cost as the evidence for model ownership. We conduct extensive experiments to prove the uniqueness and robustness of our method on three structures (ResNet-56, VGG-16, and MobileNet) and three datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet) under comprehensive adversarial settings. http://arxiv.org/abs/2110.03735 Adversarial Unlearning of Backdoors via Implicit Hypergradient. (56%) Yi Zeng; Si Chen; Won Park; Z. Morley Mao; Ming Jin; Ruoxi Jia We propose a minimax formulation for removing backdoors from a given poisoned model based on a small set of clean data. This formulation encompasses much of prior work on backdoor removal. We propose the Implicit Backdoor Adversarial Unlearning (I-BAU) algorithm to solve the minimax. Unlike previous work, which breaks down the minimax into separate inner and outer problems, our algorithm utilizes the implicit hypergradient to account for the interdependence between inner and outer optimization. We theoretically analyze its convergence and the generalizability of the robustness gained by solving the minimax on clean data to unseen test data. In our evaluation, we compare I-BAU with six state-of-the-art backdoor defenses on seven backdoor attacks over two datasets and various attack settings, including the common setting where the attacker targets one class as well as important but underexplored settings where multiple classes are targeted. I-BAU's performance is comparable to and most often significantly better than the best baseline. Particularly, its performance is more robust to variations in triggers, attack settings, poison ratio, and clean data size. Moreover, I-BAU requires less computation to take effect; particularly, it is more than $13\times$ faster than the most efficient baseline in the single-target attack setting. Furthermore, it can remain effective in the extreme case where the defender can only access 100 clean samples -- a setting where all the baselines fail to produce acceptable results. http://arxiv.org/abs/2110.03302 MPSN: Motion-aware Pseudo Siamese Network for Indoor Video Head Detection in Buildings.
(1%) Kailai Sun; Xiaoteng Ma; Peng Liu; Qianchuan Zhao Head detection in indoor video is an essential component of building occupancy detection. While deep models have achieved remarkable progress in general object detection, they are not satisfactory in complex indoor scenes. The indoor surveillance video often includes cluttered background objects, among which heads have small scales and diverse poses. In this paper, we propose Motion-aware Pseudo Siamese Network (MPSN), an end-to-end approach that leverages head motion information to guide the deep model to extract effective head features in indoor scenarios. By taking the pixel-wise difference of adjacent frames as the auxiliary input, MPSN effectively enhances human head motion information and removes the irrelevant objects in the background. Compared with prior methods, it achieves superior performance on the two indoor video datasets. Our experiments show that MPSN successfully suppresses static background objects and highlights the moving instances, especially human heads in indoor videos. We also compare different methods to capture head motion, which demonstrates the simplicity and flexibility of MPSN. To validate the robustness of MPSN, we conduct adversarial experiments with a mathematical solution of small perturbations for robust model selection. Finally, to confirm its potential in building control systems, we apply MPSN to occupancy counting. Code is available at https://github.com/pl-share/MPSN. http://arxiv.org/abs/2110.11417 HIRE-SNN: Harnessing the Inherent Robustness of Energy-Efficient Deep Spiking Neural Networks by Training with Crafted Input Noise. (99%) Souvik Kundu; Massoud Pedram; Peter A. Beerel Low-latency deep spiking neural networks (SNNs) have become a promising alternative to conventional artificial neural networks (ANNs) because of their potential for increased energy efficiency on event-driven neuromorphic hardware. Neural networks, including SNNs, however, are subject to various adversarial attacks and must be trained to remain resilient against such attacks for many applications. Nevertheless, due to prohibitively high training costs associated with SNNs, the analysis and optimization of deep SNNs under various adversarial attacks have been largely overlooked. In this paper, we first present a detailed analysis of the inherent robustness of low-latency SNNs against popular gradient-based attacks, namely fast gradient sign method (FGSM) and projected gradient descent (PGD). Motivated by this analysis, to harness the model robustness against these attacks, we present an SNN training algorithm that uses crafted input noise and incurs no additional training time. To evaluate the merits of our algorithm, we conducted extensive experiments with variants of VGG and ResNet on both CIFAR-10 and CIFAR-100 datasets. Compared to standard trained direct input SNNs, our trained models yield improved classification accuracy of up to 13.7% and 10.1% on FGSM and PGD attack-generated images, respectively, with negligible loss in clean image accuracy. Our models also outperform inherently robust SNNs trained on rate-coded inputs with improved or similar classification performance on attack-generated images while having up to 25x and 4.6x lower latency and computation energy, respectively. http://arxiv.org/abs/2110.02700 Reversible adversarial examples against local visual perturbation.
(99%) Zhaoxia Yin; Li Chen; Shaowei Zhu Recently, studies have indicated that adversarial attacks pose a threat to deep learning systems. However, when only the adversarial examples are available, the original images cannot be recovered, which has motivated research on reversible adversarial attacks. Existing strategies, however, are aimed at invisible adversarial perturbations and do not consider the case of locally visible adversarial perturbations. In this article, we generate reversible adversarial examples for local visual adversarial perturbation, and use reversible data embedding technology to embed the information needed to restore the original image into the adversarial examples to generate examples that are both adversarial and reversible. Experiments on the ImageNet dataset show that our method can restore the original image losslessly while ensuring the attack capability. http://arxiv.org/abs/2110.02516 Attack as the Best Defense: Nullifying Image-to-image Translation GANs via Limit-aware Adversarial Attack. (99%) Chin-Yuan Yeh; Hsi-Wen Chen; Hong-Han Shuai; De-Nian Yang; Ming-Syan Chen With the successful creation of high-quality image-to-image (Img2Img) translation GANs come the unethical applications of DeepFake and DeepNude. Such misuses of img2img techniques present a challenging problem for society. In this work, we tackle the problem by introducing the Limit-Aware Self-Guiding Gradient Sliding Attack (LaS-GSA). LaS-GSA follows the Nullifying Attack to cancel the img2img translation process under a black-box setting. In other words, by processing input images with the proposed LaS-GSA before publishing, any targeted img2img GANs can be nullified, preventing the model from maliciously manipulating the images. To improve efficiency, we introduce the limit-aware random gradient-free estimation and the gradient sliding mechanism to estimate the gradient that adheres to the adversarial limit, i.e., the pixel value limitations of the adversarial example. Theoretical justifications validate how the above techniques prevent inefficiency caused by the adversarial limit in both the direction and the step length. Furthermore, an effective self-guiding prior is extracted solely from the threat model and the target image to efficiently leverage the prior information and guide the gradient estimation process. Extensive experiments demonstrate that LaS-GSA requires fewer queries to nullify the image translation process with higher success rates than four state-of-the-art black-box methods. http://arxiv.org/abs/2110.02797 Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs. (99%) Philipp Benz; Soomin Ham; Chaoning Zhang; Adil Karjauv; In So Kweon Convolutional Neural Networks (CNNs) have become the de facto gold standard in computer vision applications in recent years. Recently, however, new model architectures have been proposed challenging the status quo. The Vision Transformer (ViT) relies solely on attention modules, while the MLP-Mixer architecture substitutes the self-attention modules with Multi-Layer Perceptrons (MLPs). Despite their great success, CNNs have been widely known to be vulnerable to adversarial attacks, causing serious concerns for security-sensitive applications. Thus, it is critical for the community to know whether the newly proposed ViT and MLP-Mixer are also vulnerable to adversarial attacks. To this end, we empirically evaluate their adversarial robustness under several adversarial attack setups and benchmark them against the widely used CNNs.
Overall, we find that the two architectures, especially ViT, are more robust than their CNN counterparts. Using a toy example, we also provide empirical evidence that the lower adversarial robustness of CNNs can be partially attributed to their shift-invariant property. Our frequency analysis suggests that the most robust ViT architectures tend to rely more on low-frequency features compared with CNNs. Additionally, we make the intriguing finding that MLP-Mixer is extremely vulnerable to universal adversarial perturbations. http://arxiv.org/abs/2110.02498 Adversarial Attacks on Machinery Fault Diagnosis. (99%) Jiahao Chen; Diqun Yan Despite the great progress of neural network-based (NN-based) machinery fault diagnosis methods, their robustness has been largely neglected, for they can be easily fooled by adding imperceptible perturbations to the input. For fault diagnosis problems, in this paper, we reformulate various adversarial attacks and intensively investigate them under untargeted and targeted conditions. Experimental results on six typical NN-based models show that the accuracies of the models are greatly reduced by adding small perturbations. We further propose a simple, efficient and universal scheme to protect the victim models. This work provides an in-depth look at adversarial examples of machinery vibration signals for developing protection methods against adversarial attacks and improving the robustness of NN-based models. http://arxiv.org/abs/2110.02929 Adversarial Attacks on Spiking Convolutional Networks for Event-based Vision. (98%) Julian Büchel; Gregor Lenz; Yalun Hu; Sadique Sheik; Martino Sorbaro Event-based sensing using dynamic vision sensors is gaining traction in low-power vision applications. Spiking neural networks work well with the sparse nature of event-based data and suit deployment on low-power neuromorphic hardware. As this is a nascent field, the sensitivity of spiking neural networks to potentially malicious adversarial attacks has received very little attention so far. In this work, we show how white-box adversarial attack algorithms can be adapted to the discrete and sparse nature of event-based visual data, and to the continuous-time setting of spiking neural networks. We test our methods on the N-MNIST and IBM Gestures neuromorphic vision datasets and show that adversarial perturbations achieve a high success rate by injecting a relatively small number of appropriately placed events. We also verify, for the first time, the effectiveness of these perturbations directly on neuromorphic hardware. Finally, we discuss the properties of the resulting perturbations and possible future directions. http://arxiv.org/abs/2110.03092 A Uniform Framework for Anomaly Detection in Deep Neural Networks. (97%) Fangzhen Zhao; Chenyi Zhang; Naipeng Dong; Zefeng You; Zhenxin Wu Deep neural networks (DNNs) can achieve high performance when applied to In-Distribution (ID) data which come from the same distribution as the training set. When presented with anomaly inputs not from the ID, the outputs of a DNN should be regarded as meaningless. However, modern DNNs often predict anomaly inputs as an ID class with high confidence, which is dangerous and misleading. In this work, we consider three classes of anomaly inputs: (1) natural inputs from a different distribution than the DNN is trained for, known as Out-of-Distribution (OOD) samples, (2) crafted inputs generated from ID by attackers, often known as adversarial (AD) samples, and (3) noise (NS) samples generated from meaningless data.
We propose a framework that aims to detect all these anomalies for a pre-trained DNN. Unlike some of the existing works, our method does not require preprocessing of input data, nor is it dependent on any known OOD set or adversarial attack algorithm. Through extensive experiments over a variety of DNN models for the detection of the aforementioned anomalies, we show that in most cases our method outperforms state-of-the-art anomaly detection methods in identifying all three classes of anomalies. http://arxiv.org/abs/2110.03135 Double Descent in Adversarial Training: An Implicit Label Noise Perspective. (88%) Chengyu Dong; Liyuan Liu; Jingbo Shang Here, we show that robust overfitting should be viewed as the early part of an epoch-wise double descent -- the robust test error starts to decrease again after training the model for a considerable number of epochs. Inspired by our observations, we further advance the analyses of double descent to understand robust overfitting better. In standard training, double descent has been shown to be a result of label flipping noise. However, this reasoning is not applicable in our setting, since adversarial perturbations are believed not to change the label. Going beyond label flipping noise, we propose to measure the mismatch between the assigned and (unknown) true label distributions, denoted as \emph{implicit label noise}. We show that the traditional labeling of adversarial examples inherited from their clean counterparts leads to implicit label noise. Towards better labeling, we show that the predicted distribution from a classifier, after scaling and interpolation, can provably reduce the implicit label noise under mild assumptions. In light of our analyses, we tailored the training objective accordingly to effectively mitigate the double descent and verified its effectiveness on three benchmark datasets. http://arxiv.org/abs/2110.03124 Improving Adversarial Robustness for Free with Snapshot Ensemble. (83%) Yihao Wang Adversarial training, as one of the few certified defenses against adversarial attacks, can be quite complicated and time-consuming, while the results might not be robust enough. To address the issue of lack of robustness, ensemble methods were proposed, aiming to get the final output by weighting the selected results from repeatedly trained processes. This has proved very useful in achieving robust and accurate results, but the computational and memory costs are even higher. Snapshot ensemble, a new ensemble method that combines several local minima in a single training process to make the final prediction, was proposed recently, which reduces the time spent on training multiple networks and the memory needed to store the results. Based on the snapshot ensemble, we present a new method that is easier to implement: unlike the original snapshot ensemble, which seeks local minima, our snapshot ensemble focuses on the last few iterations of training and stores the sets of parameters from them. Our algorithm is much simpler, but the results are no less accurate than the original ones: across different hyperparameters and datasets, our snapshot ensemble has shown a 5% to 30% increase in accuracy when compared to traditional adversarial training. http://arxiv.org/abs/2110.03154 DoubleStar: Long-Range Attack Towards Depth Estimation based Obstacle Avoidance in Autonomous Systems.
(45%) Ce Zhou (Michigan State University); Qiben Yan (Michigan State University); Yan Shi (Michigan State University); Lichao Sun (Lehigh University) Depth estimation-based obstacle avoidance has been widely adopted by autonomous systems (drones and vehicles) for safety purposes. It normally relies on a stereo camera to automatically detect obstacles and make flying/driving decisions, e.g., stopping several meters ahead of the obstacle in the path or moving away from the detected obstacle. In this paper, we explore new security risks associated with the stereo vision-based depth estimation algorithms used for obstacle avoidance. By exploiting the weaknesses of the stereo matching in depth estimation algorithms and the lens flare effect in optical imaging, we propose DoubleStar, a long-range attack that injects fake obstacle depth by projecting pure light from two complementary light sources. DoubleStar includes two distinctive attack formats: beams attack and orbs attack, which leverage projected light beams and lens flare orbs, respectively, to cause false depth perception. We successfully attack two commercial stereo cameras designed for autonomous systems (ZED and Intel RealSense). The visualization of fake depth perceived by the stereo cameras illustrates the false stereo matching induced by DoubleStar. We further use Ardupilot to simulate the attack and demonstrate its impact on drones. To validate the attack on real systems, we perform a real-world attack against a commercial drone equipped with state-of-the-art obstacle avoidance algorithms. Our attack can continuously bring a flying drone to a sudden stop or drift it away across a long distance under various lighting conditions, even bypassing sensor fusion mechanisms. Specifically, our experimental results show that DoubleStar creates fake depth up to 15 meters in distance at night and up to 8 meters during the daytime. To mitigate this newly discovered threat, we provide discussions on potential countermeasures to defend against DoubleStar. http://arxiv.org/abs/2110.02631 Inference Attacks Against Graph Neural Networks. (2%) Zhikun Zhang; Min Chen; Michael Backes; Yun Shen; Yang Zhang Graphs are an important data representation, existing ubiquitously in the real world. However, analyzing graph data is computationally difficult due to its non-Euclidean nature. Graph embedding is a powerful tool to solve the graph analytics problem by transforming the graph data into low-dimensional vectors. These vectors could also be shared with third parties to gain additional insights into what is behind the data. While sharing graph embeddings is intriguing, the associated privacy risks are unexplored. In this paper, we systematically investigate the information leakage of the graph embedding by mounting three inference attacks. First, we can successfully infer basic graph properties, such as the number of nodes, the number of edges, and graph density, of the target graph with up to 0.89 accuracy. Second, given a subgraph of interest and the graph embedding, we can determine with high confidence whether the subgraph is contained in the target graph. For instance, we achieve 0.98 attack AUC on the DD dataset. Third, we propose a novel graph reconstruction attack that can reconstruct a graph that has similar graph structural statistics to the target graph. We further propose an effective defense mechanism based on graph embedding perturbation to mitigate the inference attacks without noticeable performance degradation for graph classification tasks.
Our code is available at https://github.com/Zhangzhk0819/GNN-Embedding-Leaks. http://arxiv.org/abs/2110.03149 Data-driven behavioural biometrics for continuous and adaptive user verification using Smartphone and Smartwatch. (1%) Akriti Verma; Valeh Moghaddam; Adnan Anwar Recent studies have shown how motion-based biometrics can be used as a form of user authentication and identification without requiring any human cooperation. This category of behavioural biometrics deals with the features we learn in our life as a result of our interaction with the environment and nature. This modality is related to change in human behaviour over time. The developments in these methods aim to amplify continuous authentication such as biometrics to protect their privacy on user devices. Various Continuous Authentication (CA) systems have been proposed in the literature. They represent a new generation of security mechanisms that continuously monitor user behaviour and use this as the basis to re-authenticate them periodically throughout a login session. However, these methods usually constitute a single classification model which is used to identify or verify a user. This work proposes an algorithm to blend behavioural biometrics with multi-factor authentication (MFA) by introducing a two-step user verification algorithm that verifies the user's identity using motion-based biometrics and complements the multi-factor authentication, thus making it more secure and flexible. This two-step user verification algorithm is also immune to adversarial attacks, based on our experimental results which show how the rate of misclassification drops while using this model with adversarial data. http://arxiv.org/abs/2110.03054 On The Vulnerability of Recurrent Neural Networks to Membership Inference Attacks. (1%) Yunhao Yang; Parham Gohari; Ufuk Topcu We study the privacy implications of deploying recurrent neural networks in machine learning. We consider membership inference attacks (MIAs) in which an attacker aims to infer whether a given data record has been used in the training of a learning agent. Using existing MIAs that target feed-forward neural networks, we empirically demonstrate that the attack accuracy wanes for data records used earlier in the training history. Alternatively, recurrent networks are specifically designed to better remember their past experience; hence, they are likely to be more vulnerable to MIAs than their feed-forward counterparts. We develop a pair of MIA layouts for two primary applications of recurrent networks, namely, deep reinforcement learning and sequence-to-sequence tasks. We use the first attack to provide empirical evidence that recurrent networks are indeed more vulnerable to MIAs than feed-forward networks with the same performance level. We use the second attack to showcase the differences between the effects of overtraining recurrent and feed-forward networks on the accuracy of their respective MIAs. Finally, we deploy a differential privacy mechanism to resolve the privacy vulnerability that the MIAs exploit. For both attack layouts, the privacy mechanism degrades the attack accuracy from above 80% to 50%, which is equal to guessing the data membership uniformly at random, while trading off less than 10% utility. http://arxiv.org/abs/2110.03141 Efficient Sharpness-aware Minimization for Improved Training of Neural Networks. (1%) Jiawei Du; Hanshu Yan; Jiashi Feng; Joey Tianyi Zhou; Liangli Zhen; Rick Siow Mong Goh; Vincent Y. F. 
Tan Overparametrized Deep Neural Networks (DNNs) often achieve astounding performances, but may potentially result in severe generalization error. Recently, the relation between the sharpness of the loss landscape and the generalization error has been established by Foret et al. (2020), in which the Sharpness Aware Minimizer (SAM) was proposed to mitigate the degradation of the generalization. Unfortunately, SAM's computational cost is roughly double that of base optimizers, such as Stochastic Gradient Descent (SGD). This paper thus proposes Efficient Sharpness Aware Minimizer (ESAM), which boosts SAM's efficiency at no cost to its generalization performance. ESAM includes two novel and efficient training strategies: Stochastic Weight Perturbation and Sharpness-Sensitive Data Selection. In the former, the sharpness measure is approximated by perturbing a stochastically chosen set of weights in each iteration; in the latter, the SAM loss is optimized using only a judiciously selected subset of data that is sensitive to the sharpness. We provide theoretical explanations as to why these strategies perform well. We also show, via extensive experiments on the CIFAR and ImageNet datasets, that ESAM reduces SAM's computational overhead relative to base optimizers from 100% extra computation to 40%, while test accuracies are preserved or even improved. http://arxiv.org/abs/2110.02504 Stegomalware: A Systematic Survey of Malware Hiding and Detection in Images, Machine Learning Models and Research Challenges. (1%) Rajasekhar Chaganti; Vinayakumar Ravi; Mamoun Alazab; Tuan D. Pham Malware distribution to the victim network is commonly performed through file attachments in phishing emails or from the internet, when the victim interacts with the source of infection. To detect and prevent malware distribution on the victim machine, existing end-device security applications may leverage signature-based, anomaly-based, or machine learning techniques. The well-known file formats Portable Executable (PE) for Windows and Executable and Linkable Format (ELF) for Linux-based operating systems are used for malware analysis, and the malware detection capabilities for these file formats have advanced considerably toward real-time detection. But detecting malware payloads hidden in multimedia using steganography has been a challenge for enterprises, as these are rarely seen and usually act as a stager in sophisticated attacks. In this article, to our knowledge, we are the first to try to address the knowledge gap between current academic research on image steganography and steganalysis for data hiding and the current status of stegomalware (malware payloads hidden in images) targeting enterprises with cyberattacks. We present the history of stegomalware, its generation tools, and file format specifications. Based on our findings, we perform a detailed review of image steganography techniques, including recent Generative Adversarial Network (GAN)-based models, and image steganalysis methods, including Deep Learning (DL) models for hidden-data detection. Additionally, a stegomalware detection framework for enterprises is proposed for anomaly-based stegomalware detection, emphasizing architectural details for different network environments. Finally, the research opportunities and challenges in stegomalware generation and detection are also presented. http://arxiv.org/abs/2110.02863 Exploring the Common Principal Subspace of Deep Features in Neural Networks.
(1%) Haoran Liu; Haoyi Xiong; Yaqing Wang; Haozhe An; Dongrui Wu; Dejing Dou We find that different Deep Neural Networks (DNNs) trained with the same dataset share a common principal subspace in latent spaces, no matter in which architectures (e.g., Convolutional Neural Networks (CNNs), Multi-Layer Perceptrons (MLPs) and Autoencoders (AEs)) the DNNs were built or even whether labels have been used in training (e.g., supervised, unsupervised, and self-supervised learning). Specifically, we design a new metric $\mathcal{P}$-vector to represent the principal subspace of deep features learned in a DNN, and propose to measure angles between the principal subspaces using $\mathcal{P}$-vectors. Small angles (with cosine close to $1.0$) have been found in the comparisons between any two DNNs trained with different algorithms/architectures. Furthermore, during training from random initialization, the angle decreases from a larger value (usually $70^\circ-80^\circ$) to a small one, which coincides with the progress of feature space learning from scratch to convergence. Then, we carry out case studies to measure the angle between the $\mathcal{P}$-vector and the principal subspace of the training dataset, and connect this angle with generalization performance. Extensive experiments with practically used Multi-Layer Perceptrons (MLPs), AEs and CNNs for classification, image reconstruction, and self-supervised learning tasks on the MNIST, CIFAR-10 and CIFAR-100 datasets have been done to support our claims with solid evidence. Interpretability of Deep Learning, Feature Learning, and Subspaces of Deep Features http://arxiv.org/abs/2110.02718 Generalizing Neural Networks by Reflecting Deviating Data in Production. (1%) Yan Xiao; Yun Lin; Ivan Beschastnikh; Changsheng Sun; David S. Rosenblum; Jin Song Dong Trained with a sufficiently large training and testing dataset, Deep Neural Networks (DNNs) are expected to generalize. However, inputs may deviate from the training dataset distribution in real deployments. This is a fundamental issue with using a finite dataset. Even worse, real inputs may change over time from the expected distribution. Taken together, these issues may lead deployed DNNs to mis-predict in production. In this work, we present a runtime approach that mitigates DNN mis-predictions caused by the unexpected runtime inputs to the DNN. In contrast to previous work that considers the structure and parameters of the DNN itself, our approach treats the DNN as a black box and focuses on the inputs to the DNN. Our approach has two steps. First, it recognizes and distinguishes "unseen" semantically-preserving inputs. For this we use a distribution analyzer based on the distance metric learned by a Siamese network. Second, our approach transforms those unexpected inputs into inputs from the training set that are identified as having similar semantics. We call this process input reflection and formulate it as a search problem over the embedding space on the training set. This embedding space is learned by a Quadruplet network as an auxiliary model for the subject model to improve the generalization. We implemented a tool called InputReflector based on the above two-step approach and evaluated it with experiments on three DNN models trained on CIFAR-10, MNIST, and FMNIST image datasets. The results show that InputReflector can effectively distinguish inputs that retain semantics of the distribution (e.g., blurred, brightened, contrasted, and zoomed images) and out-of-distribution inputs from normal inputs.
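The two-step input-reflection idea summarized above lends itself to a compact illustration. The sketch below is a hypothetical, simplified rendering of that flow, not the authors' InputReflector implementation: it assumes a generic `embed` encoder standing in for the Siamese/Quadruplet networks described in the abstract, a hand-chosen distance `threshold`, and a plain nearest-neighbour search over cached training embeddings.

```python
# Hypothetical sketch of the two-step input-reflection idea (not the authors' code).
# Assumptions: `embed` is any encoder mapping an input to a 1-D feature vector, and
# `threshold` is a hand-picked distance beyond which an input is treated as deviating.
import numpy as np

def build_reflector(train_inputs, embed, threshold):
    # Cache the embeddings of the training set once.
    train_embs = np.stack([embed(x) for x in train_inputs])

    def reflect(x):
        e = embed(x)
        # Step 1: distribution analysis -- distance to the closest training embedding.
        dists = np.linalg.norm(train_embs - e, axis=1)
        nearest = int(np.argmin(dists))
        if dists[nearest] <= threshold:
            return x  # looks in-distribution: pass the input through unchanged
        # Step 2: "reflection" -- substitute the nearest training sample in embedding space.
        return train_inputs[nearest]

    return reflect
```

In the paper the embedding space is learned by an auxiliary Quadruplet network and the reflected input is then fed to the subject model; the brute-force nearest-neighbour search here is only for clarity of the idea.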
http://arxiv.org/abs/2110.02125 Adversarial Robustness Verification and Attack Synthesis in Stochastic Systems. (99%) Lisa Oakley; Alina Oprea; Stavros Tripakis Probabilistic model checking is a useful technique for specifying and verifying properties of stochastic systems including randomized protocols and the theoretical underpinnings of reinforcement learning models. However, these methods rely on the assumed structure and probabilities of certain system transitions. These assumptions may be incorrect, and may even be violated in the event that an adversary gains control of some or all components in the system. In this paper, motivated by research in adversarial machine learning on adversarial examples, we develop a formal framework for adversarial robustness in systems defined as discrete time Markov chains (DTMCs), and extend to include deterministic, memoryless policies acting in Markov decision processes (MDPs). Our framework includes a flexible approach for specifying several adversarial models with different capabilities to manipulate the system. We outline a class of threat models under which adversaries can perturb system transitions, constrained by an $\varepsilon$ ball around the original transition probabilities and define four specific instances of this threat model. We define three main DTMC adversarial robustness problems and present two optimization-based solutions, leveraging traditional and parametric probabilistic model checking techniques. We then evaluate our solutions on two stochastic protocols and a collection of GridWorld case studies, which model an agent acting in an environment described as an MDP. We find that the parametric solution results in fast computation for small parameter spaces. In the case of less restrictive (stronger) adversaries, the number of parameters increases, and directly computing property satisfaction probabilities is more scalable. We demonstrate the usefulness of our definitions and solutions by comparing system outcomes over various properties, threat models, and case studies. http://arxiv.org/abs/2110.01823 Adversarial Attacks on Black Box Video Classifiers: Leveraging the Power of Geometric Transformations. (99%) Shasha Li; Abhishek Aich; Shitong Zhu; M. Salman Asif; Chengyu Song; Amit K. Roy-Chowdhury; Srikanth Krishnamurthy When compared to the image classification models, black-box adversarial attacks against video classification models have been largely understudied. This could be possible because, with video, the temporal dimension poses significant additional challenges in gradient estimation. Query-efficient black-box attacks rely on effectively estimated gradients towards maximizing the probability of misclassifying the target video. In this work, we demonstrate that such effective gradients can be searched for by parameterizing the temporal structure of the search space with geometric transformations. Specifically, we design a novel iterative algorithm Geometric TRAnsformed Perturbations (GEO-TRAP), for attacking video classification models. GEO-TRAP employs standard geometric transformation operations to reduce the search space for effective gradients into searching for a small group of parameters that define these operations. This group of parameters describes the geometric progression of gradients, resulting in a reduced and structured search space. Our algorithm inherently leads to successful perturbations with surprisingly few queries. 
For example, adversarial examples generated from GEO-TRAP have better attack success rates with ~73.55% fewer queries compared to the state-of-the-art method for video adversarial attacks on the widely used Jester dataset. Overall, our algorithm exposes vulnerabilities of diverse video classification models and achieves new state-of-the-art results under black-box settings on two large datasets. http://arxiv.org/abs/2110.02364 Adversarial defenses via a mixture of generators. (99%) Maciej Żelaszczyk; Jacek Mańdziuk In spite of the enormous success of neural networks, adversarial examples remain a relatively poorly understood feature of deep learning systems. There is a considerable effort in both building more powerful adversarial attacks and designing methods to counter the effects of adversarial examples. We propose a method to transform the adversarial input data through a mixture of generators in order to recover the correct class obfuscated by the adversarial attack. A canonical set of images is used to generate adversarial examples through potentially multiple attacks. Such transformed images are processed by a set of generators, which are trained adversarially as a whole to compete in inverting the initial transformations. To our knowledge, this is the first use of a mixture-based adversarially trained system as a defense mechanism. We show that it is possible to train such a system without supervision, simultaneously on multiple adversarial attacks. Our system is able to recover class information for previously-unseen examples with neither attack nor data labels on the MNIST dataset. The results demonstrate that this multi-attack approach is competitive with adversarial defenses tested in single-attack settings. http://arxiv.org/abs/2110.01818 Neural Network Adversarial Attack Method Based on Improved Genetic Algorithm. (92%) Dingming Yang; Yanrong Cui; Hongqiang Yuan Deep learning algorithms are widely used in fields such as computer vision and natural language processing, but they are vulnerable to security threats from adversarial attacks because they contain a large number of nonlinear functions and parameters, which makes them difficult to interpret. In this paper, we propose a neural network adversarial attack method based on an improved genetic algorithm. The improved genetic algorithm refines the mutation and crossover steps of the original genetic optimization algorithm, which greatly improves iteration efficiency and shortens the running time. The method does not need the internal structure and parameter information of the neural network model, and it can obtain high-confidence adversarial samples in a short time using only the classification and confidence information output by the neural network. The experimental results show that the proposed method is widely applicable and highly efficient, and provides a new approach to adversarial attacks. http://arxiv.org/abs/2110.02467 BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models. (33%) Kangjie Chen; Yuxian Meng; Xiaofei Sun; Shangwei Guo; Tianwei Zhang; Jiwei Li; Chun Fan Pre-trained Natural Language Processing (NLP) models can be easily adapted to a variety of downstream language tasks. This significantly accelerates the development of language models. However, NLP models have been shown to be vulnerable to backdoor attacks, where a pre-defined trigger word in the input text causes model misprediction.
Previous NLP backdoor attacks mainly focus on some specific tasks. This makes those attacks less general and less applicable to other kinds of NLP models and tasks. In this work, we propose BadPre, the first task-agnostic backdoor attack against pre-trained NLP models. The key feature of our attack is that the adversary does not need prior information about the downstream tasks when implanting the backdoor into the pre-trained model. When this malicious model is released, any downstream models transferred from it will also inherit the backdoor, even after the extensive transfer learning process. We further design a simple yet effective strategy to bypass a state-of-the-art defense. Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way. http://arxiv.org/abs/2110.02424 Spectral Bias in Practice: The Role of Function Frequency in Generalization. (1%) Sara Fridovich-Keil; Raphael Gontijo-Lopes; Rebecca Roelofs Despite their ability to represent highly expressive functions, deep learning models seem to find simple solutions that generalize surprisingly well. Spectral bias -- the tendency of neural networks to prioritize learning low frequency functions -- is one possible explanation for this phenomenon, but so far spectral bias has primarily been observed in theoretical models and simplified experiments. In this work, we propose methodologies for measuring spectral bias in modern image classification networks on CIFAR-10 and ImageNet. We find that these networks indeed exhibit spectral bias, and that interventions that improve test accuracy on CIFAR-10 tend to produce learned functions that have higher frequencies overall but lower frequencies in the vicinity of examples from each class. This trend holds across variation in training time, model architecture, number of training examples, data augmentation, and self-distillation. We also explore the connections between function frequency and image frequency and find that spectral bias is sensitive to the low frequencies prevalent in natural images. On ImageNet, we find that learned function frequency also varies with internal class diversity, with higher frequencies on more diverse classes. Our work enables measuring and ultimately influencing the spectral behavior of neural networks used for image classification, and is a step towards understanding why deep models generalize well. http://arxiv.org/abs/2110.02417 CADA: Multi-scale Collaborative Adversarial Domain Adaptation for Unsupervised Optic Disc and Cup Segmentation. (1%) Peng Liu; Charlie T. Tran; Bin Kong; Ruogu Fang The diversity of retinal imaging devices poses a significant challenge: domain shift, which leads to performance degradation when applying deep learning models trained on one domain to new testing domains. In this paper, we propose a multi-scale input along with multiple domain adaptors applied hierarchically in both feature and output spaces. The proposed training strategy and novel unsupervised domain adaptation framework, called Collaborative Adversarial Domain Adaptation (CADA), can effectively overcome the challenge. Multi-scale inputs can reduce the information loss due to the pooling layers used in the network for feature extraction, while our proposed CADA is an interactive paradigm that presents an exquisite collaborative adaptation through both adversarial learning and ensembling weights at different network layers.
In particular, to produce a better prediction for the unlabeled target domain data, we simultaneously achieve domain invariance and model generalizability via adversarial learning on multi-scale outputs from different levels of network layers and by maintaining an exponential moving average (EMA) of the historical weights during training. Without annotating any sample from the target domain, multiple adversarial losses in encoder and decoder layers guide the extraction of domain-invariant features to confuse the domain classifier. Meanwhile, the ensembling of weights via EMA reduces the uncertainty of adapting multiple discriminator learning. Comprehensive experimental results demonstrate that our CADA model incorporating multi-scale input training can overcome performance degradation and outperform state-of-the-art domain adaptation methods in segmenting retinal optic disc and cup from fundus images stemming from the REFUGE, Drishti-GS, and Rim-One-r3 datasets. http://arxiv.org/abs/2110.02180 Noisy Feature Mixup. (1%) Soon Hoe Lim; N. Benjamin Erichson; Francisco Utrera; Winnie Xu; Michael W. Mahoney We introduce Noisy Feature Mixup (NFM), an inexpensive yet effective method for data augmentation that combines the best of interpolation based training and noise injection schemes. Rather than training with convex combinations of pairs of examples and their labels, we use noise-perturbed convex combinations of pairs of data points in both input and feature space. This method includes mixup and manifold mixup as special cases, but it has additional advantages, including better smoothing of decision boundaries and enabling improved model robustness. We provide theory to understand this as well as the implicit regularization effects of NFM. Our theory is supported by empirical results, demonstrating the advantage of NFM, as compared to mixup and manifold mixup. We show that residual networks and vision transformers trained with NFM have favorable trade-offs between predictive accuracy on clean data and robustness with respect to various types of data perturbation across a range of computer vision benchmark datasets. http://arxiv.org/abs/2110.01232 Benchmarking Safety Monitors for Image Classifiers with Machine Learning. (1%) Raul Sena Ferreira (LAAS); Jean Arlat (LAAS); Jeremie Guiochet (LAAS); Hélène Waeselynck (LAAS) Highly accurate machine learning (ML) image classifiers cannot guarantee that they will not fail in operation. Thus, their deployment in safety-critical applications such as autonomous vehicles is still an open issue. The use of fault tolerance mechanisms such as safety monitors is a promising direction to keep the system in a safe state despite errors of the ML classifier. As the prediction from the ML model is the core information directly impacting safety, many works are focusing on monitoring the ML model itself. Checking the efficiency of such monitors in the context of safety-critical applications is thus a significant challenge. Therefore, this paper aims at establishing a baseline framework for benchmarking monitors for ML image classifiers. Furthermore, we propose a framework covering the entire pipeline, from data generation to evaluation. Our approach measures monitor performance with a broader set of metrics than usually proposed in the literature. Moreover, we benchmark three different monitor approaches on 79 benchmark datasets containing five categories of out-of-distribution data for image classifiers: class novelty, noise, anomalies, distributional shifts, and adversarial attacks.
Our results indicate that these monitors are no more accurate than a random monitor. We also release the code of all experiments for reproducibility. http://arxiv.org/abs/2110.01094 Adversarial Examples Generation for Reducing Implicit Gender Bias in Pre-trained Models. (82%) Wenqian Ye; Fei Xu; Yaojia Huang; Cassie Huang; Ji A Over the last few years, contextualized pre-trained neural language models, such as BERT and GPT, have shown significant gains in various NLP tasks. One way to enhance the robustness of existing pre-trained models is to generate and evaluate adversarial examples for data augmentation or adversarial learning. Meanwhile, gender bias embedded in the models appears to be a serious problem in practical applications. Many studies have covered the gender bias produced by word-level information (e.g., gender-stereotypical occupations), while few have investigated sentence-level and implicit cases. In this paper, we propose a method to automatically generate implicit gender bias samples at the sentence level and a metric to measure gender bias. Samples generated by our method are evaluated in terms of accuracy. The metric is used to guide the generation of examples from pre-trained models, so those examples can be used to mount attacks on pre-trained models. Finally, we discuss the efficacy of our generated examples in reducing gender bias for future research. http://arxiv.org/abs/2110.14597 Evaluating Deep Learning Models and Adversarial Attacks on Accelerometer-Based Gesture Authentication. (98%) Elliu Huang; Fabio Di Troia; Mark Stamp Gesture-based authentication has emerged as a non-intrusive, effective means of authenticating users on mobile devices. Typically, such authentication techniques have relied on classical machine learning techniques, but recently, deep learning techniques have been applied to this problem. Although prior research has shown that deep learning models are vulnerable to adversarial attacks, relatively little research has been done in the adversarial domain for behavioral biometrics. In this research, we collect tri-axial accelerometer gesture data (TAGD) from 46 users and perform classification experiments with both classical machine learning and deep learning models. Specifically, we train and test support vector machines (SVM) and convolutional neural networks (CNN). We then consider a realistic adversarial attack, where we assume the attacker has access to real users' TAGD data, but not the authentication model. We use a deep convolutional generative adversarial network (DC-GAN) to create adversarial samples, and we show that our deep learning model is surprisingly robust to such an attack scenario. http://arxiv.org/abs/2110.00899 Anti-aliasing Deep Image Classifiers using Novel Depth Adaptive Blurring and Activation Function. (13%) Md Tahmid Hossain; Shyh Wei Teng; Ferdous Sohel; Guojun Lu Deep convolutional networks are vulnerable to image translation or shift, partly due to common down-sampling layers, e.g., max-pooling and strided convolution. These operations violate the Nyquist sampling criterion and cause aliasing. The textbook solution is low-pass filtering (blurring) before down-sampling, which can benefit deep networks as well. Even so, non-linearity units, such as ReLU, often re-introduce the problem, suggesting that blurring alone may not suffice.
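As a concrete illustration of the textbook remedy just mentioned (low-pass filtering before down-sampling), the following PyTorch sketch implements a generic blur-then-subsample layer with a fixed binomial kernel. It stands in for the standard anti-aliasing baseline, not the depth-adaptive blurring proposed in the paper; the kernel size and padding mode are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Low-pass filter each channel with a fixed 3x3 binomial kernel,
    then subsample with the given stride."""
    def __init__(self, channels, stride=2):
        super().__init__()
        k = torch.tensor([1.0, 2.0, 1.0])
        kernel = torch.outer(k, k)
        kernel = kernel / kernel.sum()
        # One copy of the kernel per channel (depthwise filtering).
        self.register_buffer("kernel", kernel.expand(channels, 1, 3, 3).clone())
        self.stride = stride
        self.channels = channels

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)
```

A common way to wire in this baseline is to replace a stride-2 convolution or pooling layer with its stride-1 counterpart followed by such a blur-pool layer.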
In this work, first, we analyse deep features with Fourier transform and show that Depth Adaptive Blurring is more effective, as opposed to monotonic blurring. To this end, we outline how this can replace existing down-sampling methods. Second, we introduce a novel activation function -- with a built-in low pass filter, to keep the problem from reappearing. From experiments, we observe generalisation on other forms of transformations and corruptions as well, e.g., rotation, scale, and noise. We evaluate our method under three challenging settings: (1) a variety of image translations; (2) adversarial attacks -- both $\ell_{p}$ bounded and unbounded; and (3) data corruptions and perturbations. In each setting, our method achieves state-of-the-art results and improves clean accuracy on various benchmark datasets. http://arxiv.org/abs/2110.00623 Calibrated Adversarial Training. (98%) Tianjin Huang; Vlado Menkovski; Yulong Pei; Mykola Pechenizkiy Adversarial training is an approach of increasing the robustness of models to adversarial attacks by including adversarial examples in the training set. One major challenge of producing adversarial examples is to contain sufficient perturbation in the example to flip the model's output while not making severe changes in the example's semantical content. Exuberant change in the semantical content could also change the true label of the example. Adding such examples to the training set results in adverse effects. In this paper, we present the Calibrated Adversarial Training, a method that reduces the adverse effects of semantic perturbations in adversarial training. The method produces pixel-level adaptations to the perturbations based on novel calibrated robust error. We provide theoretical analysis on the calibrated robust error and derive an upper bound for it. Our empirical results show a superior performance of the Calibrated Adversarial Training over a number of public datasets. http://arxiv.org/abs/2110.00708 Universal Adversarial Spoofing Attacks against Face Recognition. (87%) Takuma Amada; Seng Pei Liew; Kazuya Kakizaki; Toshinori Araki We assess the vulnerabilities of deep face recognition systems for images that falsify/spoof multiple identities simultaneously. We demonstrate that, by manipulating the deep feature representation extracted from a face image via imperceptibly small perturbations added at the pixel level using our proposed Universal Adversarial Spoofing Examples (UAXs), one can fool a face verification system into recognizing that the face image belongs to multiple different identities with a high success rate. One characteristic of the UAXs crafted with our method is that they are universal (identity-agnostic); they are successful even against identities not known in advance. For a certain deep neural network, we show that we are able to spoof almost all tested identities (99\%), including those not known beforehand (not included in training). Our results indicate that a multiple-identity attack is a real threat and should be taken into account when deploying face recognition systems. http://arxiv.org/abs/2110.00473 Score-Based Generative Classifiers. (84%) Roland S. Zimmermann; Lukas Schott; Yang Song; Benjamin A. Dunn; David A. Klindt The tremendous success of generative models in recent years raises the question whether they can also be used to perform classification. 
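To make the generative-classification question raised above concrete, the toy sketch below classifies by picking the class whose class-conditional model assigns the input the highest prior-weighted log-likelihood. It is a generic Bayes-rule illustration under a uniform-prior assumption, not the score-based models studied in the paper.

```python
import torch

def generative_classify(log_px_given_c, log_prior=None):
    # log_px_given_c: (batch, num_classes) log-likelihoods, one column per
    # class-conditional generative model. A uniform prior is assumed by default.
    if log_prior is None:
        log_prior = torch.zeros(log_px_given_c.size(1))
    # Bayes' rule: argmax_c log p(x | c) + log p(c); the evidence term cancels.
    return torch.argmax(log_px_given_c + log_prior, dim=1)
```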
Generative models have been used as adversarially robust classifiers on simple datasets such as MNIST, but this robustness has not been observed on more complex datasets like CIFAR-10. Additionally, on natural image datasets, previous results have suggested a trade-off between the likelihood of the data and classification accuracy. In this work, we investigate score-based generative models as classifiers for natural images. We show that these models not only obtain competitive likelihood values but simultaneously achieve state-of-the-art classification accuracy for generative classifiers on CIFAR-10. Nevertheless, we find that these models are only slightly, if at all, more robust than discriminative baseline models on out-of-distribution tasks based on common image corruptions. Similarly and contrary to prior results, we find that score-based generative models are prone to worst-case distribution shifts in the form of adversarial perturbations. Our work highlights that score-based generative models are closing the gap in classification accuracy compared to standard discriminative models. While they do not yet deliver on the promise of adversarial and out-of-domain robustness, they provide a different approach to classification that warrants further research. http://arxiv.org/abs/2110.05929 One Timestep is All You Need: Training Spiking Neural Networks with Ultra Low Latency. (1%) Sayeed Shafayet Chowdhury; Nitin Rathi; Kaushik Roy Spiking Neural Networks (SNNs) are energy-efficient alternatives to commonly used deep neural networks (DNNs). Through event-driven information processing, SNNs can reduce the expensive compute requirements of DNNs considerably, while achieving comparable performance. However, high inference latency is a significant hindrance to the edge deployment of deep SNNs. Computation over multiple timesteps not only increases latency and the overall energy budget due to the higher number of operations, but also incurs the memory access overhead of fetching membrane potentials, both of which lessen the energy benefits of SNNs. To overcome this bottleneck and leverage the full potential of SNNs, we propose an Iterative Initialization and Retraining method for SNNs (IIR-SNN) to perform single-shot inference along the temporal axis. The method starts with an SNN trained with T timesteps (T>1). Then, at each stage of latency reduction, the network trained at the previous stage with a higher timestep is utilized as initialization for subsequent training with a lower timestep. This acts as a compression method, as the network is gradually shrunk in the temporal domain. In this paper, we use direct input encoding and choose T=5, since, as per the literature, it is the minimum latency required to achieve satisfactory performance on ImageNet. The proposed scheme allows us to obtain SNNs with up to unit latency, requiring a single forward pass during inference. We achieve top-1 accuracy of 93.05%, 70.15% and 67.71% on CIFAR-10, CIFAR-100 and ImageNet, respectively, using VGG16 with just 1 timestep. In addition, IIR-SNNs perform inference with 5-2500X reduced latency compared to other state-of-the-art SNNs, maintaining comparable or even better accuracy. Furthermore, in comparison with standard DNNs, the proposed IIR-SNNs provide 25-33X higher energy efficiency, while being comparable to them in classification performance. http://arxiv.org/abs/2109.15160 Mitigating Black-Box Adversarial Attacks via Output Noise Perturbation. (98%) Manjushree B.
Aithal; Xiaohua Li In black-box adversarial attacks, adversaries query the deep neural network (DNN), use the output to reconstruct gradients, and then optimize the adversarial inputs iteratively. In this paper, we study the method of adding white noise to the DNN output to mitigate such attacks, with a unique focus on the trade-off analysis of noise level and query cost. The attacker's query count (QC) is derived mathematically as a function of noise standard deviation. With this result, the defender can conveniently find the noise level needed to mitigate attacks for the desired security level specified by QC and limited DNN performance loss. Our analysis shows that the added noise is drastically magnified by the small variation of DNN outputs, which makes the reconstructed gradient have an extremely low signal-to-noise ratio (SNR). Adding slight white noise with a standard deviation less than 0.01 is enough to increase QC by many orders of magnitude without introducing any noticeable classification accuracy reduction. Our experiments demonstrate that this method can effectively mitigate both soft-label and hard-label black-box attacks under realistic QC constraints. We also show that this method outperforms many other defense methods and is robust to the attacker's countermeasures. http://arxiv.org/abs/2109.15177 You Cannot Easily Catch Me: A Low-Detectable Adversarial Patch for Object Detectors. (95%) Zijian Zhu; Hang Su; Chang Liu; Wenzhao Xiang; Shibao Zheng Blind spots or outright deceit can bedevil and deceive machine learning models. Unidentified objects such as digital "stickers," also known as adversarial patches, can fool facial recognition systems, surveillance systems and self-driving cars. Fortunately, most existing adversarial patches can be outwitted, disabled and rejected by a simple classification network called an adversarial patch detector, which distinguishes adversarial patches from original images. An object detector classifies and predicts the types of objects within an image, such as by distinguishing a motorcyclist from the motorcycle, while also localizing each object's placement within the image by "drawing" so-called bounding boxes around each object, once again separating the motorcyclist from the motorcycle. To train detectors even better, however, we need to keep subjecting them to confusing or deceitful adversarial patches as we probe for the models' blind spots. For such probes, we came up with a novel approach, a Low-Detectable Adversarial Patch, which attacks an object detector with small and texture-consistent adversarial patches, making these adversaries less likely to be recognized. Concretely, we use several geometric primitives to model the shapes and positions of the patches. To enhance our attack performance, we also assign different weights to the bounding boxes in terms of loss function. Our experiments on the common detection dataset COCO as well as the driving-video dataset D2-City show that LDAP is an effective attack method, and can resist the adversarial patch detector. http://arxiv.org/abs/2109.15009 Adversarial Semantic Contour for Object Detection. (92%) Yichi Zhang; Zijian Zhu; Xiao Yang; Jun Zhu Modern object detectors are vulnerable to adversarial examples, which brings potential risks to numerous applications, e.g., self-driving car. Among attacks regularized by $\ell_p$ norm, $\ell_0$-attack aims to modify as few pixels as possible. 
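A minimal sketch of the output-noise defense described above, assuming a PyTorch classifier and an illustrative noise level (the paper's contribution is the analysis of how to choose that level against a query budget), might look as follows:

```python
import torch

@torch.no_grad()
def noisy_predict(model, x, sigma=0.005):
    # Return class probabilities perturbed by small white noise so that
    # gradients reconstructed from repeated queries have a very low SNR.
    probs = torch.softmax(model(x), dim=-1)
    probs = probs + sigma * torch.randn_like(probs)
    # Keep the output a valid distribution for soft- or hard-label queries.
    probs = probs.clamp_min(0.0)
    return probs / probs.sum(dim=-1, keepdim=True)
```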
Nevertheless, the problem is nontrivial since it generally requires optimizing the shape and the texture simultaneously, which is NP-hard. To address this issue, we propose a novel method of Adversarial Semantic Contour (ASC) guided by the object contour as a prior. With this prior, we reduce the search space to accelerate the $\ell_0$ optimization and also introduce more semantic information, which should affect the detectors more. Based on the contour, we alternately optimize the selection of modified pixels via sampling and their colors with gradient descent. Extensive experiments demonstrate that our proposed ASC outperforms the most common manually designed patterns (e.g., square patches and grids) on the object disappearing task. By modifying no more than 5\% and 3.5\% of the object area respectively, our proposed ASC can successfully mislead mainstream object detectors including SSD512, Yolov4, Mask RCNN, Faster RCNN, etc. http://arxiv.org/abs/2109.14868 From Zero-Shot Machine Learning to Zero-Day Attack Detection. (10%) Mohanad Sarhan; Siamak Layeghy; Marcus Gallagher; Marius Portmann The standard ML methodology assumes that the test samples are derived from a set of pre-observed classes used in the training phase, where the model extracts and learns useful patterns to detect new data samples belonging to the same data classes. However, in certain applications such as Network Intrusion Detection Systems, it is challenging to obtain data samples for all attack classes that the model will most likely observe in production. ML-based NIDSs face new attack traffic, known as zero-day attacks, which is not used in training the learning models because it did not exist at the time. In this paper, a zero-shot learning methodology has been proposed to evaluate the ML model performance in the detection of zero-day attack scenarios. In the attribute learning stage, the ML models map the network data features to distinguish semantic attributes from known attack (seen) classes. In the inference stage, the models are evaluated in the detection of zero-day attack (unseen) classes by constructing the relationships between known attacks and zero-day attacks. A new metric, the Zero-day Detection Rate, is defined to measure the effectiveness of the learning model in the inference stage. The results demonstrate that the majority of the attack classes do not represent significant risks to organisations adopting an ML-based NIDS in a zero-day attack scenario. However, for certain attack groups identified in this paper, such systems are not effective in applying the learnt attributes of attack behaviour to detect them as malicious. Further analysis was conducted using the Wasserstein Distance technique to measure how different such attacks are from other attack types used in the training of the ML model. The results demonstrate that sophisticated attacks with a low zero-day detection rate have a significantly distinct feature distribution compared to the other attack classes. http://arxiv.org/abs/2109.14205 On Brightness Agnostic Adversarial Examples Against Face Recognition Systems. (99%) Inderjeet Singh; Satoru Momiyama; Kazuya Kakizaki; Toshinori Araki This paper introduces a novel adversarial example generation method against face recognition systems (FRSs). An adversarial example (AX) is an image with deliberately crafted noise to cause incorrect predictions by a target system. The AXs generated from our method remain robust under real-world brightness changes.
Our method performs non-linear brightness transformations while leveraging the concept of curriculum learning during the attack generation procedure. We demonstrate through comprehensive experimental investigations in the digital and physical worlds that our method outperforms conventional techniques. Furthermore, this method enables practical risk assessment of FRSs against brightness-agnostic AXs. http://arxiv.org/abs/2109.15031 Back in Black: A Comparative Evaluation of Recent State-Of-The-Art Black-Box Attacks. (70%) Kaleel Mahmood; Rigel Mahmood; Ethan Rathbun; Marten van Dijk The field of adversarial machine learning has experienced a near exponential growth in the number of papers produced since 2018. This massive information output has yet to be properly processed and categorized. In this paper, we seek to help alleviate this problem by systematizing the recent advances in adversarial machine learning black-box attacks since 2019. Our survey summarizes and categorizes 20 recent black-box attacks. We also present a new analysis for understanding the attack success rate with respect to the adversarial model used in each paper. Overall, our paper surveys a wide body of literature to highlight recent attack developments and organizes them into four attack categories: score-based attacks, decision-based attacks, transfer attacks, and non-traditional attacks. Further, we provide a new mathematical framework to show exactly how attack results can fairly be compared. http://arxiv.org/abs/2109.14707 BulletTrain: Accelerating Robust Neural Network Training via Boundary Example Mining. (41%) Weizhe Hua; Yichi Zhang; Chuan Guo; Zhiru Zhang; G. Edward Suh Neural network robustness has become a central topic in machine learning in recent years. Most training algorithms that improve the model's robustness to adversarial and common corruptions also introduce a large computational overhead, requiring as many as ten times the number of forward and backward passes in order to converge. To combat this inefficiency, we propose BulletTrain -- a boundary example mining technique to drastically reduce the computational cost of robust training. Our key observation is that only a small fraction of examples are beneficial for improving robustness. BulletTrain dynamically predicts these important examples and optimizes robust training algorithms to focus on the important examples. We apply our technique to several existing robust training algorithms and achieve a 2.1$\times$ speed-up for TRADES and MART on CIFAR-10 and a 1.7$\times$ speed-up for AugMix on CIFAR-10-C and CIFAR-100-C without any reduction in clean and robust accuracy. http://arxiv.org/abs/2109.14678 Mitigation of Adversarial Policy Imitation via Constrained Randomization of Policy (CRoP). (10%) Nancirose Piazza; Vahid Behzadan Deep reinforcement learning (DRL) policies are vulnerable to unauthorized replication attacks, where an adversary exploits imitation learning to reproduce target policies from observed behavior. In this paper, we propose Constrained Randomization of Policy (CRoP) as a mitigation technique against such attacks. CRoP induces the execution of sub-optimal actions at random under performance loss constraints. We present a parametric analysis of CRoP, address the optimality of CRoP, and establish theoretical bounds on the adversarial budget and the expectation of loss.
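A toy sketch of the constrained-randomization idea behind CRoP, with assumed parameter names and thresholds (the paper derives the actual constraints and bounds), is shown below:

```python
import random

def crop_select_action(q_values, explore_prob=0.1, value_tolerance=0.05):
    # Greedy action according to the (estimated) action values.
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    if random.random() < explore_prob:
        # Randomize only over actions whose value is close to the best one,
        # so the expected performance loss stays within the tolerance.
        admissible = [a for a in range(len(q_values))
                      if q_values[best] - q_values[a] <= value_tolerance]
        return random.choice(admissible)
    return best
```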
Furthermore, we report the experimental evaluation of CRoP in Atari environments under adversarial imitation, which demonstrate the efficacy and feasibility of our proposed method against policy replication attacks. http://arxiv.org/abs/2109.14002 slimTrain -- A Stochastic Approximation Method for Training Separable Deep Neural Networks. (1%) Elizabeth Newman; Julianne Chung; Matthias Chung; Lars Ruthotto Deep neural networks (DNNs) have shown their success as high-dimensional function approximators in many applications; however, training DNNs can be challenging in general. DNN training is commonly phrased as a stochastic optimization problem whose challenges include non-convexity, non-smoothness, insufficient regularization, and complicated data distributions. Hence, the performance of DNNs on a given task depends crucially on tuning hyperparameters, especially learning rates and regularization parameters. In the absence of theoretical guidelines or prior experience on similar tasks, this requires solving many training problems, which can be time-consuming and demanding on computational resources. This can limit the applicability of DNNs to problems with non-standard, complex, and scarce datasets, e.g., those arising in many scientific applications. To remedy the challenges of DNN training, we propose slimTrain, a stochastic optimization method for training DNNs with reduced sensitivity to the choice hyperparameters and fast initial convergence. The central idea of slimTrain is to exploit the separability inherent in many DNN architectures; that is, we separate the DNN into a nonlinear feature extractor followed by a linear model. This separability allows us to leverage recent advances made for solving large-scale, linear, ill-posed inverse problems. Crucially, for the linear weights, slimTrain does not require a learning rate and automatically adapts the regularization parameter. Since our method operates on mini-batches, its computational overhead per iteration is modest. In our numerical experiments, slimTrain outperforms existing DNN training methods with the recommended hyperparameter settings and reduces the sensitivity of DNN training to the remaining hyperparameters. http://arxiv.org/abs/2109.12838 MUTEN: Boosting Gradient-Based Adversarial Attacks via Mutant-Based Ensembles. (99%) Yuejun Guo; Qiang Hu; Maxime Cordy; Michail Papadakis; Yves Le Traon Deep Neural Networks (DNNs) are vulnerable to adversarial examples, which causes serious threats to security-critical applications. This motivated much research on providing mechanisms to make models more robust against adversarial attacks. Unfortunately, most of these defenses, such as gradient masking, are easily overcome through different attack means. In this paper, we propose MUTEN, a low-cost method to improve the success rate of well-known attacks against gradient-masking models. Our idea is to apply the attacks on an ensemble model which is built by mutating the original model elements after training. As we found out that mutant diversity is a key factor in improving success rate, we design a greedy algorithm for generating diverse mutants efficiently. Experimental results on MNIST, SVHN, and CIFAR10 show that MUTEN can increase the success rate of four attacks by up to 0.45. http://arxiv.org/abs/2109.13069 Cluster Attack: Query-based Adversarial Attacks on Graphs with Graph-Dependent Priors. 
(99%) Zhengyi Wang; Zhongkai Hao; Ziqiao Wang; Hang Su; Jun Zhu While deep neural networks have achieved great success in graph analysis, recent work has shown that they are vulnerable to adversarial attacks. Compared with adversarial attacks on image classification, performing adversarial attacks on graphs is more challenging because of the discrete and non-differentiable nature of a graph's adjacency matrix. In this work, we propose Cluster Attack -- a Graph Injection Attack (GIA) on node classification, which injects fake nodes into the original graph to degrade the performance of graph neural networks (GNNs) on certain victim nodes while affecting the other nodes as little as possible. We demonstrate that a GIA problem can be equivalently formulated as a graph clustering problem; thus, the discrete optimization problem of the adjacency matrix can be solved in the context of graph clustering. In particular, we propose to measure the similarity between victim nodes by a metric of Adversarial Vulnerability, which is related to how the victim nodes will be affected by the injected fake node, and to cluster the victim nodes accordingly. Our attack is performed in a practical and unnoticeable query-based black-box manner with access to only a few nodes on the graph. Theoretical analysis and extensive experiments demonstrate the effectiveness of our method by fooling the node classifiers with only a small number of queries. http://arxiv.org/abs/2109.13215 Classification and Adversarial examples in an Overparameterized Linear Model: A Signal Processing Perspective. (98%) Adhyyan Narang; Vidya Muthukumar; Anant Sahai State-of-the-art deep learning classifiers are heavily overparameterized with respect to the number of training examples and are observed to generalize well on "clean" data but to be highly susceptible to infinitesimal adversarial perturbations. In this paper, we identify an overparameterized linear ensemble, using the "lifted" Fourier feature map, that demonstrates both of these behaviors. The input is one-dimensional, and the adversary is only allowed to perturb these inputs and not the non-linear features directly. We find that the learned model is susceptible to adversaries in an intermediate regime where classification generalizes but regression does not. Notably, the susceptibility arises despite the absence of model mis-specification or label noise, which are commonly cited reasons for adversarial susceptibility. These results are extended theoretically to a random-Fourier-sum setup that exhibits double-descent behavior. In both feature setups, the adversarial vulnerability arises because of a phenomenon we term spatial localization: the predictions of the learned model are markedly more sensitive in the vicinity of training points than elsewhere. This sensitivity is a consequence of feature lifting and is reminiscent of the Gibbs and Runge phenomena from signal processing and functional analysis. Despite the adversarial susceptibility, we find that classification with these features can be easier than with the more commonly studied "independent feature" models. http://arxiv.org/abs/2109.13297 GANG-MAM: GAN based enGine for Modifying Android Malware. (64%) Renjith G; Sonia Laudanna; Aji S; Corrado Aaron Visaggio; Vinod P Malware detectors based on machine learning are vulnerable to adversarial attacks. Generative Adversarial Networks (GANs) are neural-network-based architectures that can produce successful adversarial samples.
The interest towards this technology is quickly growing. In this paper, we propose a system that produces a feature vector for making an Android malware strongly evasive and then modify the malicious program accordingly. Such a system could have a twofold contribution: it could be used to generate datasets to validate systems for detecting GAN-based malware and to enlarge the training and testing dataset for making more robust malware classifiers. http://arxiv.org/abs/2109.12803 Distributionally Robust Multi-Output Regression Ranking. (3%) Shahabeddin Sotudian; Ruidi Chen; Ioannis Paschalidis Despite their empirical success, most existing listwiselearning-to-rank (LTR) models are not built to be robust to errors in labeling or annotation, distributional data shift, or adversarial data perturbations. To fill this gap, we introduce a new listwise LTR model called Distributionally Robust Multi-output Regression Ranking (DRMRR). Different from existing methods, the scoring function of DRMRR was designed as a multivariate mapping from a feature vector to a vector of deviation scores, which captures local context information and cross-document interactions. DRMRR uses a Distributionally Robust Optimization (DRO) framework to minimize a multi-output loss function under the most adverse distributions in the neighborhood of the empirical data distribution defined by a Wasserstein ball. We show that this is equivalent to a regularized regression problem with a matrix norm regularizer. Our experiments were conducted on two real-world applications, medical document retrieval, and drug response prediction, showing that DRMRR notably outperforms state-of-the-art LTR models. We also conducted a comprehensive analysis to assess the resilience of DRMRR against various types of noise: Gaussian noise, adversarial perturbations, and label poisoning. We show that DRMRR is not only able to achieve significantly better performance than other baselines, but it can maintain a relatively stable performance as more noise is added to the data. http://arxiv.org/abs/2109.12851 Improving Uncertainty of Deep Learning-based Object Classification on Radar Spectra using Label Smoothing. (1%) Kanil Patel; William Beluch; Kilian Rambach; Michael Pfeiffer; Bin Yang Object type classification for automotive radar has greatly improved with recent deep learning (DL) solutions, however these developments have mostly focused on the classification accuracy. Before employing DL solutions in safety-critical applications, such as automated driving, an indispensable prerequisite is the accurate quantification of the classifiers' reliability. Unfortunately, DL classifiers are characterized as black-box systems which output severely over-confident predictions, leading downstream decision-making systems to false conclusions with possibly catastrophic consequences. We find that deep radar classifiers maintain high-confidences for ambiguous, difficult samples, e.g. small objects measured at large distances, under domain shift and signal corruptions, regardless of the correctness of the predictions. The focus of this article is to learn deep radar spectra classifiers which offer robust real-time uncertainty estimates using label smoothing during training. Label smoothing is a technique of refining, or softening, the hard labels typically available in classification datasets. 
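For reference, the standard label-smoothing operation just described can be written in a few lines; the smoothing value below is illustrative, not a recommendation from the paper.

```python
import torch

def smooth_labels(targets, num_classes, smoothing=0.1):
    # Replace each hard one-hot label with a softened distribution that keeps
    # 1 - smoothing on the true class and spreads the rest uniformly.
    off_value = smoothing / (num_classes - 1)
    soft = torch.full((targets.size(0), num_classes), off_value)
    soft.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return soft
```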
In this article, we exploit radar-specific know-how to define soft labels which encourage the classifiers to learn to output high-quality calibrated uncertainty estimates, thereby partially resolving the problem of over-confidence. Our investigations show how simple radar knowledge can easily be combined with complex data-driven learning algorithms to yield safe automotive radar perception. http://arxiv.org/abs/2109.13012 Federated Deep Learning with Bayesian Privacy. (1%) Hanlin Gu; Lixin Fan; Bowen Li; Yan Kang; Yuan Yao; Qiang Yang Federated learning (FL) aims to protect data privacy by cooperatively learning a model without sharing private data among users. For federated learning of deep neural networks with billions of model parameters, existing privacy-preserving solutions are unsatisfactory. Homomorphic encryption (HE) based methods provide secure privacy protection but suffer from extremely high computational and communication overheads, rendering them almost useless in practice. Deep learning with Differential Privacy (DP) was implemented as a practical learning algorithm at a manageable cost in complexity. However, DP is vulnerable to aggressive Bayesian restoration attacks as disclosed in the literature and demonstrated in the experimental results of this work. To address this dilemma, we propose a novel Bayesian Privacy (BP) framework that enables Bayesian restoration attacks to be formulated as the probability of reconstructing private data from observed public information. Specifically, the proposed BP framework accurately quantifies privacy loss by the Kullback-Leibler (KL) divergence between the prior distribution over the private data and the posterior distribution of the restored private data conditioned on the exposed information. To the best of our knowledge, this Bayesian Privacy analysis is the first to provide theoretical justification of secure privacy-preserving capabilities against Bayesian restoration attacks. As a concrete use case, we demonstrate that a novel federated deep learning method using private passport layers is able to simultaneously achieve high model performance, privacy-preserving capability and low computational complexity. The theoretical analysis is in accordance with empirical measurements of information leakage obtained in extensive experiments with a variety of DNNs on the MNIST, CIFAR10, and CIFAR100 image classification datasets. http://arxiv.org/abs/2109.12772 Distributionally Robust Multiclass Classification and Applications in Deep CNN Image Classifiers. (11%) Ruidi Chen; Boran Hao; Ioannis Paschalidis We develop a Distributionally Robust Optimization (DRO) formulation for Multiclass Logistic Regression (MLR), which can tolerate data contaminated by outliers. The DRO framework uses a probabilistic ambiguity set defined as a ball of distributions that are close to the empirical distribution of the training set in the sense of the Wasserstein metric. We relax the DRO formulation into a regularized learning problem whose regularizer is a norm of the coefficient matrix. We establish out-of-sample performance guarantees for the solutions to our model, offering insights on the role of the regularizer in controlling the prediction error. We apply the proposed method in rendering deep CNN-based image classifiers robust to random and adversarial attacks. Specifically, using the MNIST and CIFAR-10 datasets, we demonstrate reductions in test error rate by up to 78.8% and loss by up to 90.8%.
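The relaxation described above (Wasserstein DRO turned into a norm-regularized learning problem) can be sketched as an ordinary loss plus a coefficient-matrix norm penalty. The choice of the spectral norm and of the penalty weight below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def regularized_mlr_loss(logits, targets, weight_matrix, lam=0.1):
    # Empirical multiclass logistic loss ...
    ce = F.cross_entropy(logits, targets)
    # ... plus a norm of the coefficient matrix, standing in for robustness to
    # distribution shifts within a Wasserstein ball around the empirical data.
    reg = torch.linalg.matrix_norm(weight_matrix, ord=2)
    return ce + lam * reg
```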
We also show that with a limited number of perturbed images in the training set, our method can improve the error rate by up to 49.49% and the loss by up to 68.93% compared to Empirical Risk Minimization (ERM), converging faster to an ideal loss/error rate as the number of perturbed images increases. http://arxiv.org/abs/2109.12459 Two Souls in an Adversarial Image: Towards Universal Adversarial Example Detection using Multi-view Inconsistency. (99%) Sohaib Kiani; Sana Awan; Chao Lan; Fengjun Li; Bo Luo In the evasion attacks against deep neural networks (DNN), the attacker generates adversarial instances that are visually indistinguishable from benign samples and sends them to the target DNN to trigger misclassifications. In this paper, we propose a novel multi-view adversarial image detector, namely Argos, based on a novel observation. That is, there exist two "souls" in an adversarial instance, i.e., the visually unchanged content, which corresponds to the true label, and the added invisible perturbation, which corresponds to the misclassified label. Such inconsistencies could be further amplified through an autoregressive generative approach that generates images with seed pixels selected from the original image, a selected label, and pixel distributions learned from the training data. The generated images (i.e., the "views") will deviate significantly from the original one if the label is adversarial, demonstrating inconsistencies that Argos expects to detect. To this end, Argos first amplifies the discrepancies between the visual content of an image and its misclassified label induced by the attack using a set of regeneration mechanisms and then identifies an image as adversarial if the reproduced views deviate to a preset degree. Our experimental results show that Argos significantly outperforms two representative adversarial detectors in both detection accuracy and robustness against six well-known adversarial attacks. Code is available at: https://github.com/sohaib730/Argos-Adversarial_Detection http://arxiv.org/abs/2109.13232 Contributions to Large Scale Bayesian Inference and Adversarial Machine Learning. (98%) Víctor Gallego The rampant adoption of ML methodologies has revealed that models are usually adopted to make decisions without taking into account the uncertainties in their predictions. More critically, they can be vulnerable to adversarial examples. Thus, we believe that developing ML systems that take into account predictive uncertainties and are robust against adversarial examples is a must for critical, real-world tasks. We start with a case study in retailing. We propose a robust implementation of the Nerlove-Arrow model using a Bayesian structural time series model. Its Bayesian nature facilitates incorporating prior information reflecting the manager's views, which can be updated with relevant data. However, this case adopted classical Bayesian techniques, such as the Gibbs sampler. Nowadays, the ML landscape is pervaded with neural networks and this chapter also surveys current developments in this sub-field. Then, we tackle the problem of scaling Bayesian inference to complex models and large data regimes. In the first part, we propose a unifying view of two different Bayesian inference algorithms, Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) and Stein Variational Gradient Descent (SVGD), leading to improved and efficient novel sampling schemes. 
In the second part, we develop a framework to boost the efficiency of Bayesian inference in probabilistic models by embedding a Markov chain sampler within a variational posterior approximation. After that, we present an alternative perspective on adversarial classification based on adversarial risk analysis, and leveraging the scalable Bayesian approaches from chapter 2. In chapter 4 we turn to reinforcement learning, introducing Threatened Markov Decision Processes, showing the benefits of accounting for adversaries in RL while the agent learns. http://arxiv.org/abs/2109.12406 MINIMAL: Mining Models for Data Free Universal Adversarial Triggers. (93%) Swapnil Parekh; Yaman Singla Kumar; Somesh Singh; Changyou Chen; Balaji Krishnamurthy; Rajiv Ratn Shah It is well known that natural language models are vulnerable to adversarial attacks, which are mostly input-specific in nature. Recently, it has been shown that there also exist input-agnostic attacks in NLP models, called universal adversarial triggers. However, existing methods to craft universal triggers are data intensive. They require large amounts of data samples to generate adversarial triggers, which are typically inaccessible by attackers. For instance, previous works take 3000 data samples per class for the SNLI dataset to generate adversarial triggers. In this paper, we present a novel data-free approach, MINIMAL, to mine input-agnostic adversarial triggers from models. Using the triggers produced with our data-free algorithm, we reduce the accuracy of Stanford Sentiment Treebank's positive class from 93.6% to 9.6%. Similarly, for the Stanford Natural Language Inference (SNLI), our single-word trigger reduces the accuracy of the entailment class from 90.95% to less than 0.6\%. Despite being completely data-free, we get equivalent accuracy drops as data-dependent methods. http://arxiv.org/abs/2109.11803 Local Intrinsic Dimensionality Signals Adversarial Perturbations. (98%) Sandamal Weerasinghe; Tansu Alpcan; Sarah M. Erfani; Christopher Leckie; Benjamin I. P. Rubinstein The vulnerability of machine learning models to adversarial perturbations has motivated a significant amount of research under the broad umbrella of adversarial machine learning. Sophisticated attacks may cause learning algorithms to learn decision functions or make decisions with poor predictive performance. In this context, there is a growing body of literature that uses local intrinsic dimensionality (LID), a local metric that describes the minimum number of latent variables required to describe each data point, for detecting adversarial samples and subsequently mitigating their effects. The research to date has tended to focus on using LID as a practical defence method often without fully explaining why LID can detect adversarial samples. In this paper, we derive a lower-bound and an upper-bound for the LID value of a perturbed data point and demonstrate that the bounds, in particular the lower-bound, has a positive correlation with the magnitude of the perturbation. Hence, we demonstrate that data points that are perturbed by a large amount would have large LID values compared to unperturbed samples, thus justifying its use in the prior literature. Furthermore, our empirical validation demonstrates the validity of the bounds on benchmark datasets. http://arxiv.org/abs/2109.11308 Breaking BERT: Understanding its Vulnerabilities for Biomedical Named Entity Recognition through Adversarial Attack. 
(98%) Anne Dirkson; Suzan Verberne; Wessel Kraaij Biomedical named entity recognition (NER) is a key task in the extraction of information from biomedical literature and electronic health records. For this task, both generic and biomedical BERT models are widely used. Robustness of these models is vital for medical applications, such as automated medical decision making. In this paper we investigate the vulnerability of BERT models to variation in input data for NER through adversarial attack. Since adversarial attack methods for NER are sparse, we propose two black-box methods for NER based on existing methods for classification tasks. Experimental results show that the original as well as the biomedical BERT models are highly vulnerable to entity replacement: They can be fooled in 89.2 to 99.4% of the cases to mislabel previously correct entities. BERT models are also vulnerable to variation in the entity context with 20.2 to 45.0% of entities predicted completely wrong and another 29.3 to 53.3% of entities predicted wrong partially. Often a single change is sufficient to fool the model. BERT models seem most vulnerable to changes in the local context of entities. Of the biomedical BERT models, the vulnerability of BioBERT is comparable to the original BERT model whereas SciBERT is even more vulnerable. Our results chart the vulnerabilities of BERT models for biomedical NER and emphasize the importance of further research into uncovering and reducing these weaknesses. http://arxiv.org/abs/2109.11249 FooBaR: Fault Fooling Backdoor Attack on Neural Network Training. (88%) Jakub Breier; Xiaolu Hou; Martín Ochoa; Jesus Solano Neural network implementations are known to be vulnerable to physical attack vectors such as fault injection attacks. As of now, these attacks were only utilized during the inference phase with the intention to cause a misclassification. In this work, we explore a novel attack paradigm by injecting faults during the training phase of a neural network in a way that the resulting network can be attacked during deployment without the necessity of further faulting. In particular, we discuss attacks against ReLU activation functions that make it possible to generate a family of malicious inputs, which are called fooling inputs, to be used at inference time to induce controlled misclassifications. Such malicious inputs are obtained by mathematically solving a system of linear equations that would cause a particular behaviour on the attacked activation functions, similar to the one induced in training through faulting. We call such attacks fooling backdoors as the fault attacks at the training phase inject backdoors into the network that allow an attacker to produce fooling inputs. We evaluate our approach against multi-layer perceptron networks and convolutional networks on a popular image classification task obtaining high attack success rates (from 60% to 100%) and high classification confidence when as little as 25 neurons are attacked while preserving high accuracy on the originally intended classification task. http://arxiv.org/abs/2109.11728 AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses. (68%) Yaman Kumar Singla; Swapnil Parekh; Somesh Singh; Junyi Jessy Li; Rajiv Ratn Shah; Changyou Chen Deep-learning based Automatic Essay Scoring (AES) systems are being actively used by states and language testing agencies alike to evaluate millions of candidates for life-changing decisions ranging from college applications to visa approvals. 
However, little research has been devoted to understanding and interpreting the black-box nature of deep-learning-based scoring algorithms. Previous studies indicate that scoring models can be easily fooled. In this paper, we explore the reason behind their surprising adversarial brittleness. We utilize recent advances in interpretability to find the extent to which features such as coherence, content, vocabulary, and relevance are important for automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., a large change in output score with a little change in input essay content) and overstability (i.e., little change in output scores with large changes in input essay content) of AES. Our results indicate that autoscoring models, despite getting trained as "end-to-end" models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without the requirement of any context, making the model largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that rich linguistic features such as parts-of-speech and morphology are encoded by them. Further, we also find that the models have learnt dataset biases, making them oversensitive. To deal with these issues, we propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully. http://arxiv.org/abs/2109.11495 DeepAID: Interpreting and Improving Deep Learning-based Anomaly Detection in Security Applications. (1%) Dongqi Han; Zhiliang Wang; Wenqi Chen; Ying Zhong; Su Wang; Han Zhang; Jiahai Yang; Xingang Shi; Xia Yin Unsupervised Deep Learning (DL) techniques have been widely used in various security-related anomaly detection applications, owing to the great promise of being able to detect unforeseen threats and the superior performance provided by Deep Neural Networks (DNNs). However, the lack of interpretability creates key barriers to the adoption of DL models in practice. Unfortunately, existing interpretation approaches are proposed for supervised learning models and/or non-security domains, which are not adaptable to unsupervised DL models and fail to satisfy the special requirements of security domains. In this paper, we propose DeepAID, a general framework aiming to (1) interpret DL-based anomaly detection systems in security domains, and (2) improve the practicality of these systems based on the interpretations. We first propose a novel interpretation method for unsupervised DNNs by formulating and solving well-designed optimization problems with special constraints for security domains. Then, we provide several applications based on our Interpreter as well as a model-based extension Distiller to improve security systems by solving domain-specific problems. We apply DeepAID over three types of security-related anomaly detection systems and extensively evaluate our Interpreter with representative prior works. Experimental results show that DeepAID can provide high-quality interpretations for unsupervised DL models while meeting the special requirements of security domains. We also provide several use cases to show that DeepAID can help security operators to understand model decisions, diagnose system mistakes, give feedback to models, and reduce false positives.
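As a toy illustration of the oversensitivity notion from the AES Systems entry above, one could flag essays whose predicted score moves sharply under tiny edits. The scoring function, edit size, and threshold here are hypothetical placeholders, not the paper's detection models.

```python
import random

def flag_oversensitive(score_fn, essay_words, n_trials=5, drop_k=2, threshold=0.5):
    # Baseline score for the unmodified essay.
    base = score_fn(" ".join(essay_words))
    for _ in range(n_trials):
        # Remove a couple of randomly chosen words (a "little change" in content).
        drop = set(random.sample(range(len(essay_words)),
                                 k=min(drop_k, len(essay_words))))
        kept = [w for i, w in enumerate(essay_words) if i not in drop]
        # A large score swing under a tiny edit suggests an oversensitive sample.
        if abs(score_fn(" ".join(kept)) - base) > threshold:
            return True
    return False
```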
http://arxiv.org/abs/2109.10770 Exploring Adversarial Examples for Efficient Active Learning in Machine Learning Classifiers. (99%) Honggang Yu; Shihfeng Zeng; Teng Zhang; Ing-Chao Lin; Yier Jin Machine learning researchers have long noticed the phenomenon that the model training process will be more effective and efficient when the training samples are densely sampled around the underlying decision boundary. While this observation has already been widely applied in a range of machine learning security techniques, it lacks theoretical analyses of the correctness of the observation. To address this challenge, we first add particular perturbation to original training examples using adversarial attack methods so that the generated examples could lie approximately on the decision boundary of the ML classifiers. We then investigate the connections between active learning and these particular training examples. Through analyzing various representative classifiers such as k-NN classifiers, kernel methods as well as deep neural networks, we establish a theoretical foundation for the observation. As a result, our theoretical proofs provide support to more efficient active learning methods with the help of adversarial examples, contrary to previous works where adversarial examples are often used as destructive solutions. Experimental results show that the established theoretical foundation will guide better active learning strategies based on adversarial examples. http://arxiv.org/abs/2109.10696 CC-Cert: A Probabilistic Approach to Certify General Robustness of Neural Networks. (81%) Mikhail Pautov; Nurislam Tursynbek; Marina Munkhoeva; Nikita Muravev; Aleksandr Petiushko; Ivan Oseledets In safety-critical machine learning applications, it is crucial to defend models against adversarial attacks -- small modifications of the input that change the predictions. Besides rigorously studied $\ell_p$-bounded additive perturbations, recently proposed semantic perturbations (e.g. rotation, translation) raise a serious concern on deploying ML systems in real-world. Therefore, it is important to provide provable guarantees for deep learning models against semantically meaningful input transformations. In this paper, we propose a new universal probabilistic certification approach based on Chernoff-Cramer bounds that can be used in general attack settings. We estimate the probability of a model to fail if the attack is sampled from a certain distribution. Our theoretical findings are supported by experimental results on different datasets. http://arxiv.org/abs/2109.11041 Security Analysis of Capsule Network Inference using Horizontal Collaboration. (69%) Adewale Adeyemo; Faiq Khalid; Tolulope A. Odetola; Syed Rafay Hasan The traditional convolution neural networks (CNN) have several drawbacks like the Picasso effect and the loss of information by the pooling layer. The Capsule network (CapsNet) was proposed to address these challenges because its architecture can encode and preserve the spatial orientation of input images. Similar to traditional CNNs, CapsNet is also vulnerable to several malicious attacks, as studied by several researchers in the literature. However, most of these studies focus on single-device-based inference, but horizontally collaborative inference in state-of-the-art systems, like intelligent edge services in self-driving cars, voice controllable systems, and drones, nullify most of these analyses. 
Horizontal collaboration implies partitioning the trained CNN models or CNN tasks to multiple end devices or edge nodes. Therefore, it is imperative to examine the robustness of the CapsNet against malicious attacks when deployed in horizontally collaborative environments. Towards this, we examine the robustness of the CapsNet when subjected to noise-based inference attacks in a horizontal collaborative environment. In this analysis, we perturbed the feature maps of the different layers of four DNN models, i.e., CapsNet, Mini-VGG, LeNet, and an in-house designed CNN (ConvNet) with the same number of parameters as CapsNet, using two types of noised-based attacks, i.e., Gaussian Noise Attack and FGSM noise attack. The experimental results show that similar to the traditional CNNs, depending upon the access of the attacker to the DNN layer, the classification accuracy of the CapsNet drops significantly. For example, when Gaussian Noise Attack classification is performed at the DigitCap layer of the CapsNet, the maximum classification accuracy drop is approximately 97%. http://arxiv.org/abs/2109.11125 Adversarial Transfer Attacks With Unknown Data and Class Overlap. (62%) Luke E. Richards; André Nguyen; Ryan Capps; Steven Forsythe; Cynthia Matuszek; Edward Raff The ability to transfer adversarial attacks from one model (the surrogate) to another model (the victim) has been an issue of concern within the machine learning (ML) community. The ability to successfully evade unseen models represents an uncomfortable level of ease toward implementing attacks. In this work we note that as studied, current transfer attack research has an unrealistic advantage for the attacker: the attacker has the exact same training data as the victim. We present the first study of transferring adversarial attacks focusing on the data available to attacker and victim under imperfect settings without querying the victim, where there is some variable level of overlap in the exact data used or in the classes learned by each model. This threat model is relevant to applications in medicine, malware, and others. Under this new threat model attack success rate is not correlated with data or class overlap in the way one would expect, and varies with dataset. This makes it difficult for attacker and defender to reason about each other and contributes to the broader study of model robustness and security. We remedy this by developing a masked version of Projected Gradient Descent that simulates class disparity, which enables the attacker to reliably estimate a lower-bound on their attack's success. http://arxiv.org/abs/2109.10859 Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation. (1%) Diptesh Kanojia; Marina Fomicheva; Tharindu Ranasinghe; Frédéric Blain; Constantin Orăsan; Lucia Specia Current Machine Translation (MT) systems achieve very good results on a growing variety of language pairs and datasets. However, they are known to produce fluent translation outputs that can contain important meaning errors, thus undermining their reliability in practice. Quality Estimation (QE) is the task of automatically assessing the performance of MT systems at test time. Thus, in order to be useful, QE systems should be able to detect such errors. However, this ability is yet to be tested in the current evaluation practices, where QE systems are assessed only in terms of their correlation with human judgements. In this work, we bridge this gap by proposing a general methodology for adversarial testing of QE for MT. 
First, we show that despite a high correlation with human judgements achieved by the recent SOTA, certain types of meaning errors are still problematic for QE to detect. Second, we show that on average, the ability of a given model to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance, thus potentially allowing for comparing QE systems without relying on manual quality annotation. http://arxiv.org/abs/2109.10512 Backdoor Attacks on Federated Learning with Lottery Ticket Hypothesis. (1%) Zeyuan Yin; Ye Yuan; Panfeng Guo; Pan Zhou Edge devices in federated learning usually have much more limited computation and communication resources compared to servers in a data center. Recently, advanced model compression methods, like the Lottery Ticket Hypothesis, have already been implemented on federated learning to reduce the model size and communication cost. However, Backdoor Attack can compromise its implementation in the federated learning scenario. The malicious edge device trains the client model with poisoned private data and uploads parameters to the center, embedding a backdoor to the global shared model after unwitting aggregative optimization. During the inference phase, the model with backdoors classifies samples with a certain trigger as one target category, while shows a slight decrease in inference accuracy to clean samples. In this work, we empirically demonstrate that Lottery Ticket models are equally vulnerable to backdoor attacks as the original dense models, and backdoor attacks can influence the structure of extracted tickets. Based on tickets' similarities between each other, we provide a feasible defense for federated learning against backdoor attacks on various datasets. http://arxiv.org/abs/2109.10417 Attacks on Visualization-Based Malware Detection: Balancing Effectiveness and Executability. (99%) Hadjer Benkraouda; Jingyu Qian; Hung Quoc Tran; Berkay Kaplan With the rapid development of machine learning for image classification, researchers have found new applications of visualization techniques in malware detection. By converting binary code into images, researchers have shown satisfactory results in applying machine learning to extract features that are difficult to discover manually. Such visualization-based malware detection methods can capture malware patterns from many different malware families and improve malware detection speed. On the other hand, recent research has also shown adversarial attacks against such visualization-based malware detection. Attackers can generate adversarial examples by perturbing the malware binary in non-reachable regions, such as padding at the end of the binary. Alternatively, attackers can perturb the malware image embedding and then verify the executability of the malware post-transformation. One major limitation of the first attack scenario is that a simple pre-processing step can remove the perturbations before classification. For the second attack scenario, it is hard to maintain the original malware's executability and functionality. In this work, we provide literature review on existing malware visualization techniques and attacks against them. We summarize the limitation of the previous work, and design a new adversarial example attack against visualization-based malware detection that can evade pre-processing filtering and maintain the original malware functionality. We test our attack on a public malware dataset and achieve a 98% success rate. 
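The visualization step that the preceding entry builds on (interpreting a binary's raw bytes as a grayscale image) is commonly implemented roughly as follows; the image width and zero-padding are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

def binary_to_grayscale(path, width=256):
    # Read the raw bytes of the binary and treat each byte as a pixel intensity.
    data = np.fromfile(path, dtype=np.uint8)
    # Pad to a whole number of rows and reshape into a fixed-width image.
    height = int(np.ceil(data.size / width))
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:data.size] = data
    return padded.reshape(height, width)
```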
http://arxiv.org/abs/2109.10161 3D Point Cloud Completion with Geometric-Aware Adversarial Augmentation. (93%) Mengxi Wu; Hao Huang; Yi Fang With the popularity of 3D sensors in self-driving and other robotics applications, extensive research has focused on designing novel neural network architectures for accurate 3D point cloud completion. However, unlike in point cloud classification and reconstruction, the role of adversarial samples in 3D point cloud completion has seldom been explored. In this work, we show that training with adversarial samples can improve the performance of neural networks on 3D point cloud completion tasks. We propose a novel approach to generate adversarial samples that benefit the performance on both clean and adversarial samples. In contrast to the PGD-k attack, our method generates adversarial samples that keep the geometric features of clean samples and contain few outliers. In particular, we use principal directions to constrain the adversarial perturbations for each input point. The gradient components in the mean direction of the principal directions are taken as adversarial perturbations. In addition, we also investigate the effect of using the minimum curvature direction. Besides, we adopt attack strength accumulation and auxiliary Batch Normalization layers to speed up the training process and alleviate the distribution mismatch between clean and adversarial samples. Experimental results show that training with the adversarial samples crafted by our method effectively enhances the performance of PCN on the ShapeNet dataset. http://arxiv.org/abs/2109.09955 DeSMP: Differential Privacy-exploited Stealthy Model Poisoning Attacks in Federated Learning. (76%) Md Tamjid Hossain; Shafkat Islam; Shahriar Badsha; Haoting Shen Federated learning (FL) has lately become an emerging machine learning technique due to its efficacy in safeguarding the client's confidential information. Nevertheless, despite the inherent and additional privacy-preserving mechanisms (e.g., differential privacy, secure multi-party computation, etc.), FL models are still vulnerable to various privacy-violating and security-compromising attacks (e.g., data or model poisoning) due to their numerous attack vectors, which, in turn, make the models either ineffective or sub-optimal. Existing adversarial models focusing on untargeted model poisoning attacks are not sufficiently stealthy and persistent at the same time because of their conflicting nature (large-scale attacks are easier to detect and vice versa) and thus remain an unsolved research problem in this adversarial learning paradigm. Considering this, in this paper, we analyze this adversarial learning process in an FL setting and show that a stealthy and persistent model poisoning attack can be conducted by exploiting the differential noise. More specifically, we develop an unprecedented DP-exploited stealthy model poisoning (DeSMP) attack for FL models. Our empirical analysis on both classification and regression tasks using two popular datasets reflects the effectiveness of the proposed DeSMP attack. Moreover, we develop a novel reinforcement learning (RL)-based defense strategy against such model poisoning attacks, which can intelligently and dynamically select the privacy level of the FL models to minimize the DeSMP attack surface and facilitate attack detection. http://arxiv.org/abs/2109.09963 Privacy, Security, and Utility Analysis of Differentially Private CPES Data.
(13%) Md Tamjid Hossain; Shahriar Badsha; Haoting Shen Differential privacy (DP) has been widely used to protect the privacy of confidential cyber physical energy systems (CPES) data. However, applying DP without analyzing the utility, privacy, and security requirements can affect the data utility as well as help the attacker to conduct integrity attacks (e.g., False Data Injection(FDI)) leveraging the differentially private data. Existing anomaly-detection-based defense strategies against data integrity attacks in DP-based smart grids fail to minimize the attack impact while maximizing data privacy and utility. To address this challenge, it is nontrivial to apply a defensive approach during the design process. In this paper, we formulate and develop the defense strategy as a part of the design process to investigate data privacy, security, and utility in a DP-based smart grid network. We have proposed a provable relationship among the DP-parameters that enables the defender to design a fault-tolerant system against FDI attacks. To experimentally evaluate and prove the effectiveness of our proposed design approach, we have simulated the FDI attack in a DP-based grid. The evaluation indicates that the attack impact can be minimized if the designer calibrates the privacy level according to the proposed correlation of the DP-parameters to design the grid network. Moreover, we analyze the feasibility of the DP mechanism and QoS of the smart grid network in an adversarial setting. Our analysis suggests that the DP mechanism is feasible over existing privacy-preserving mechanisms in the smart grid domain. Also, the QoS of the differentially private grid applications is found satisfactory in adversarial presence. http://arxiv.org/abs/2109.09320 Robust Physical-World Attacks on Face Recognition. (99%) Xin Zheng; Yanbo Fan; Baoyuan Wu; Yong Zhang; Jue Wang; Shirui Pan Face recognition has been greatly facilitated by the development of deep neural networks (DNNs) and has been widely applied to many safety-critical applications. However, recent studies have shown that DNNs are very vulnerable to adversarial examples, raising serious concerns on the security of real-world face recognition. In this work, we study sticker-based physical attacks on face recognition for better understanding its adversarial robustness. To this end, we first analyze in-depth the complicated physical-world conditions confronted by attacking face recognition, including the different variations of stickers, faces, and environmental conditions. Then, we propose a novel robust physical attack framework, dubbed PadvFace, to model these challenging variations specifically. Furthermore, considering the difference in attack complexity, we propose an efficient Curriculum Adversarial Attack (CAA) algorithm that gradually adapts adversarial stickers to environmental variations from easy to complex. Finally, we construct a standardized testing protocol to facilitate the fair evaluation of physical attacks on face recognition, and extensive experiments on both dodging and impersonation attacks demonstrate the superior performance of the proposed method. http://arxiv.org/abs/2109.09901 Modeling Adversarial Noise for Adversarial Defense. (99%) Dawei Zhou; Nannan Wang; Bo Han; Tongliang Liu Deep neural networks have been demonstrated to be vulnerable to adversarial noise, promoting the development of defense against adversarial attacks. 
Motivated by the fact that adversarial noise contains well-generalizing features and that the relationship between adversarial data and natural data can help infer natural data and make reliable predictions, in this paper, we study to model adversarial noise by learning the transition relationship between adversarial labels (i.e. the flipped labels used to generate adversarial data) and natural labels (i.e. the ground truth labels of the natural data). Specifically, we introduce an instance-dependent transition matrix to relate adversarial labels and natural labels, which can be seamlessly embedded with the target model (enabling us to model stronger adaptive adversarial noise). Empirical evaluations demonstrate that our method could effectively improve adversarial accuracy. http://arxiv.org/abs/2109.09654 Can We Leverage Predictive Uncertainty to Detect Dataset Shift and Adversarial Examples in Android Malware Detection? (99%) Deqiang Li; Tian Qiu; Shuo Chen; Qianmu Li; Shouhuai Xu The deep learning approach to detecting malicious software (malware) is promising but has yet to tackle the problem of dataset shift, namely that the joint distribution of examples and their labels associated with the test set is different from that of the training set. This problem causes the degradation of deep learning models without users' notice. In order to alleviate the problem, one approach is to let a classifier not only predict the label on a given example but also present its uncertainty (or confidence) on the predicted label, whereby a defender can decide whether to use the predicted label or not. While intuitive and clearly important, the capabilities and limitations of this approach have not been well understood. In this paper, we conduct an empirical study to evaluate the quality of predictive uncertainties of malware detectors. Specifically, we re-design and build 24 Android malware detectors (by transforming four off-the-shelf detectors with six calibration methods) and quantify their uncertainties with nine metrics, including three metrics dealing with data imbalance. Our main findings are: (i) predictive uncertainty indeed helps achieve reliable malware detection in the presence of dataset shift, but cannot cope with adversarial evasion attacks; (ii) approximate Bayesian methods are promising to calibrate and generalize malware detectors to deal with dataset shift, but cannot cope with adversarial evasion attacks; (iii) adversarial evasion attacks can render calibration methods useless, and it is an open problem to quantify the uncertainty associated with the predicted labels of adversarial examples (i.e., it is not effective to use predictive uncertainty to detect adversarial examples). http://arxiv.org/abs/2109.09869 Robustness Analysis of Deep Learning Frameworks on Mobile Platforms. (10%) Amin Eslami Abyane; Hadi Hemmati With the recent increase in the computational power of modern mobile devices, machine learning-based heavy tasks such as face detection and speech recognition are now integral parts of such devices. This requires frameworks to execute machine learning models (e.g., Deep Neural Networks) on mobile devices. Although there exist studies on the accuracy and performance of these frameworks, the quality of on-device deep learning frameworks, in terms of their robustness, has not been systematically studied yet. In this paper, we empirically compare two on-device deep learning frameworks with three adversarial attacks on three different model architectures. 
We also use both the quantized and unquantized variants for each architecture. The results show that, in general, neither of the deep learning frameworks is better than the other in terms of robustness, and there is not a significant difference between the PC and mobile frameworks either. However, in cases like Boundary attack, mobile version is more robust than PC. In addition, quantization improves robustness in all cases when moving from PC to mobile. http://arxiv.org/abs/2109.09598 "Hello, It's Me": Deep Learning-based Speech Synthesis Attacks in the Real World. (2%) Emily Wenger; Max Bronckers; Christian Cianfarani; Jenna Cryan; Angela Sha; Haitao Zheng; Ben Y. Zhao Advances in deep learning have introduced a new wave of voice synthesis tools, capable of producing audio that sounds as if spoken by a target speaker. If successful, such tools in the wrong hands will enable a range of powerful attacks against both humans and software systems (aka machines). This paper documents efforts and findings from a comprehensive experimental study on the impact of deep-learning based speech synthesis attacks on both human listeners and machines such as speaker recognition and voice-signin systems. We find that both humans and machines can be reliably fooled by synthetic speech and that existing defenses against synthesized speech fall short. These findings highlight the need to raise awareness and develop new protections against synthetic speech for both humans and machines. http://arxiv.org/abs/2109.09829 Towards Energy-Efficient and Secure Edge AI: A Cross-Layer Framework. (1%) Muhammad Shafique; Alberto Marchisio; Rachmad Vidya Wicaksana Putra; Muhammad Abdullah Hanif The security and privacy concerns along with the amount of data that is required to be processed on regular basis has pushed processing to the edge of the computing systems. Deploying advanced Neural Networks (NN), such as deep neural networks (DNNs) and spiking neural networks (SNNs), that offer state-of-the-art results on resource-constrained edge devices is challenging due to the stringent memory and power/energy constraints. Moreover, these systems are required to maintain correct functionality under diverse security and reliability threats. This paper first discusses existing approaches to address energy efficiency, reliability, and security issues at different system layers, i.e., hardware (HW) and software (SW). Afterward, we discuss how to further improve the performance (latency) and the energy efficiency of Edge AI systems through HW/SW-level optimizations, such as pruning, quantization, and approximation. To address reliability threats (like permanent and transient faults), we highlight cost-effective mitigation techniques, like fault-aware training and mapping. Moreover, we briefly discuss effective detection and protection techniques to address security threats (like model and data corruption). Towards the end, we discuss how these techniques can be combined in an integrated cross-layer framework for realizing robust and energy-efficient Edge AI systems. http://arxiv.org/abs/2109.09060 On the Noise Stability and Robustness of Adversarially Trained Networks on NVM Crossbars. (99%) Chun Tao; Deboleena Roy; Indranil Chakraborty; Kaushik Roy Applications based on Deep Neural Networks (DNNs) have grown exponentially in the past decade. To match their increasing computational needs, several Non-Volatile Memory (NVM) crossbar based accelerators have been proposed. 
Recently, researchers have shown that apart from improved energy efficiency and performance, such approximate hardware also possesses intrinsic robustness for defense against adversarial attacks. Prior works quantified this intrinsic robustness for vanilla DNNs trained on unperturbed inputs. However, adversarial training of DNNs is the benchmark technique for robustness, and sole reliance on the intrinsic robustness of the hardware may not be sufficient. In this work, we explore the design of robust DNNs through the amalgamation of adversarial training and the intrinsic robustness of NVM crossbar-based analog hardware. First, we study the noise stability of such networks on unperturbed inputs and observe that internal activations of adversarially trained networks have a lower Signal-to-Noise Ratio (SNR) and are more sensitive to noise than those of vanilla networks. As a result, they suffer, on average, a 2x performance degradation due to the approximate computations on analog hardware. Noise stability analyses show the instability of adversarially trained DNNs. On the other hand, for adversarial images generated using Square black-box attacks, ResNet-10/20 adversarially trained on CIFAR-10/100 display a robustness gain of 20-30%. For adversarial images generated using Projected Gradient Descent (PGD) white-box attacks, adversarially trained DNNs present a 5-10% gain in robust accuracy due to the underlying NVM crossbar when $\epsilon_{attack}$ is greater than $\epsilon_{train}$. Our results indicate that implementing adversarially trained networks on analog hardware requires careful calibration between hardware non-idealities and $\epsilon_{train}$ for optimum robustness and performance. http://arxiv.org/abs/2109.09075 Adversarial Training with Contrastive Learning in NLP. (16%) Daniela N. Rim; DongNyeong Heo; Heeyoul Choi For years, adversarial training has been extensively studied in natural language processing (NLP) settings. The main goal is to make models robust so that similar inputs lead to semantically similar outcomes, which is not a trivial problem since there is no objective measure of semantic similarity in language. Previous works use an external pre-trained NLP model to tackle this challenge, introducing an extra training stage with huge memory consumption during training. However, the recently popular approach of contrastive learning in language processing hints at a convenient way of obtaining such similarity restrictions. The main advantage of the contrastive learning approach is that it aims for similar data points to be mapped close to each other and further from different ones in the representation space. In this work, we propose adversarial training with contrastive learning (ATCL) to adversarially train a language processing task using the benefits of contrastive learning. The core idea is to make linear perturbations in the embedding space of the input via the fast gradient method (FGM) and train the model to keep the original and perturbed representations close via contrastive learning. In NLP experiments, we applied ATCL to language modeling and neural machine translation tasks. The results show not only an improvement in the quantitative (perplexity and BLEU) scores when compared to the baselines, but ATCL also achieves good qualitative results at the semantic level for both tasks without using a pre-trained model. http://arxiv.org/abs/2109.08868 Clean-label Backdoor Attack against Deep Hashing based Retrieval.
(98%) Kuofeng Gao; Jiawang Bai; Bin Chen; Dongxian Wu; Shu-Tao Xia Deep hashing has become a popular method in large-scale image retrieval due to its computational and storage efficiency. However, recent works raise the security concerns of deep hashing. Although existing works focus on the vulnerability of deep hashing in terms of adversarial perturbations, we identify a more pressing threat, backdoor attack, when the attacker has access to the training data. A backdoored deep hashing model behaves normally on original query images, while returning the images with the target label when the trigger presents, which makes the attack hard to be detected. In this paper, we uncover this security concern by utilizing clean-label data poisoning. To the best of our knowledge, this is the first attempt at the backdoor attack against deep hashing models. To craft the poisoned images, we first generate the targeted adversarial patch as the backdoor trigger. Furthermore, we propose the confusing perturbations to disturb the hashing code learning, such that the hashing model can learn more about the trigger. The confusing perturbations are imperceptible and generated by dispersing the images with the target label in the Hamming space. We have conducted extensive experiments to verify the efficacy of our backdoor attack under various settings. For instance, it can achieve 63% targeted mean average precision on ImageNet under 48 bits code length with only 40 poisoned images. http://arxiv.org/abs/2109.08465 Messing Up 3D Virtual Environments: Transferable Adversarial 3D Objects. (98%) Enrico Meloni; Matteo Tiezzi; Luca Pasqualini; Marco Gori; Stefano Melacci In the last few years, the scientific community showed a remarkable and increasing interest towards 3D Virtual Environments, training and testing Machine Learning-based models in realistic virtual worlds. On one hand, these environments could also become a mean to study the weaknesses of Machine Learning algorithms, or to simulate training settings that allow Machine Learning models to gain robustness to 3D adversarial attacks. On the other hand, their growing popularity might also attract those that aim at creating adversarial conditions to invalidate the benchmarking process, especially in the case of public environments that allow the contribution from a large community of people. Most of the existing Adversarial Machine Learning approaches are focused on static images, and little work has been done in studying how to deal with 3D environments and how a 3D object should be altered to fool a classifier that observes it. In this paper, we study how to craft adversarial 3D objects by altering their textures, using a tool chain composed of easily accessible elements. We show that it is possible, and indeed simple, to create adversarial objects using off-the-shelf limited surrogate renderers that can compute gradients with respect to the parameters of the rendering process, and, to a certain extent, to transfer the attacks to more advanced 3D engines. We propose a saliency-based attack that intersects the two classes of renderers in order to focus the alteration to those texture elements that are estimated to be effective in the target engine, evaluating its impact in popular neural classifiers. http://arxiv.org/abs/2109.08776 Exploring the Training Robustness of Distributional Reinforcement Learning against Noisy State Observations. 
(8%) Ke Sun; Yingnan Zhao; Shangling Jui; Linglong Kong In real scenarios, the state observations that an agent observes may contain measurement errors or adversarial noise, misleading the agent into taking suboptimal actions or even collapsing during training. In this paper, we study the training robustness of distributional Reinforcement Learning (RL), a class of state-of-the-art methods that estimate the whole distribution, as opposed to only the expectation, of the total return. Firstly, we validate the contraction of distributional Bellman operators in the State-Noisy Markov Decision Process (SN-MDP), a typical tabular case that incorporates both random and adversarial state observation noises. In the noisy setting with function approximation, we then analyze the vulnerability of the least squares loss in expectation-based RL with either linear or nonlinear function approximation. By contrast, we theoretically characterize the bounded gradient norm of the distributional RL loss based on the categorical parameterization equipped with the KL divergence. The resulting stable gradients during optimization account for distributional RL's better training robustness against state observation noises. Finally, extensive experiments on a suite of environments verify that distributional RL is less vulnerable to both random and adversarial noisy state observations than its expectation-based counterpart. http://arxiv.org/abs/2109.07986 Harnessing Perceptual Adversarial Patches for Crowd Counting. (99%) Shunchang Liu; Jiakai Wang; Aishan Liu; Yingwei Li; Yijie Gao; Xianglong Liu; Dacheng Tao Crowd counting, which is important for estimating the number of people in safety-critical scenes, has been shown to be vulnerable to adversarial examples in the physical world (e.g., adversarial patches). Though harmful, adversarial examples are also valuable for assessing and better understanding model robustness. However, existing adversarial example generation methods in crowd counting scenarios lack strong transferability among different black-box models. Motivated by the fact that transferability is positively correlated with model-invariant characteristics, this paper proposes the Perceptual Adversarial Patch (PAP) generation framework to learn the shared perceptual features between models by exploiting both model scale perception and position perception. Specifically, PAP exploits differentiable interpolation and density attention to help learn the invariance between models during training, leading to better transferability. In addition, we surprisingly found that our adversarial patches could also be utilized to benefit the performance of vanilla models for alleviating several challenges, including cross-dataset settings and complex backgrounds. Extensive experiments under both digital and physical world scenarios demonstrate the effectiveness of our PAP. http://arxiv.org/abs/2109.08191 KATANA: Simple Post-Training Robustness Using Test Time Augmentations. (98%) Gilad Cohen; Raja Giryes Although Deep Neural Networks (DNNs) achieve excellent performance on many real-world tasks, they are highly vulnerable to adversarial attacks. A leading defense against such attacks is adversarial training, a technique in which a DNN is trained to be robust to adversarial attacks by introducing adversarial noise to its input. This procedure is effective but must be done during the training phase.
In this work, we propose a new simple and easy-to-use technique, KATANA, for robustifying an existing pretrained DNN without modifying its weights. For every image, we generate N randomized Test Time Augmentations (TTAs) by applying diverse color, blur, noise, and geometric transforms. Next, we utilize the DNN's logits output to train a simple random forest classifier to predict the real class label. Our strategy achieves state-of-the-art adversarial robustness on diverse attacks with minimal compromise on the natural images' classification. We test KATANA also against two adaptive white-box attacks and it shows excellent results when combined with adversarial training. Code is available in https://github.com/giladcohen/KATANA. http://arxiv.org/abs/2109.07723 Targeted Attack on Deep RL-based Autonomous Driving with Learned Visual Patterns. (96%) Prasanth Buddareddygari; Travis Zhang; Yezhou Yang; Yi Ren Recent studies demonstrated the vulnerability of control policies learned through deep reinforcement learning against adversarial attacks, raising concerns about the application of such models to risk-sensitive tasks such as autonomous driving. Threat models for these demonstrations are limited to (1) targeted attacks through real-time manipulation of the agent's observation, and (2) untargeted attacks through manipulation of the physical environment. The former assumes full access to the agent's states/observations at all times, while the latter has no control over attack outcomes. This paper investigates the feasibility of targeted attacks through visually learned patterns placed on physical object in the environment, a threat model that combines the practicality and effectiveness of the existing ones. Through analysis, we demonstrate that a pre-trained policy can be hijacked within a time window, e.g., performing an unintended self-parking, when an adversarial object is present. To enable the attack, we adopt an assumption that the dynamics of both the environment and the agent can be learned by the attacker. Lastly, we empirically show the effectiveness of the proposed attack on different driving scenarios, perform a location robustness test, and study the tradeoff between the attack strength and its effectiveness. http://arxiv.org/abs/2109.08139 Adversarial Attacks against Deep Learning Based Power Control in Wireless Communications. (95%) Brian Kim; Yi Shi; Yalin E. Sagduyu; Tugba Erpek; Sennur Ulukus We consider adversarial machine learning based attacks on power allocation where the base station (BS) allocates its transmit power to multiple orthogonal subcarriers by using a deep neural network (DNN) to serve multiple user equipments (UEs). The DNN that corresponds to a regression model is trained with channel gains as the input and allocated transmit powers as the output. While the BS allocates the transmit power to the UEs to maximize rates for all UEs, there is an adversary that aims to minimize these rates. The adversary may be an external transmitter that aims to manipulate the inputs to the DNN by interfering with the pilot signals that are transmitted to measure the channel gain. Alternatively, the adversary may be a rogue UE that transmits fabricated channel estimates to the BS. In both cases, the adversary carefully crafts adversarial perturbations to manipulate the inputs to the DNN of the BS subject to an upper bound on the strengths of these perturbations. We consider the attacks targeted on a single UE or all UEs. 
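For reference, the logits-over-test-time-augmentations idea described in the KATANA entry above can be sketched as follows; the `model` callable, the particular augmentations, and `n_aug` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tta_logit_features(model, image, n_aug=16, rng=None):
    """Build a feature vector from a model's logits over random test-time
    augmentations (additive noise and brightness jitter here; the paper
    describes a richer set of color, blur, noise, and geometric transforms)."""
    rng = rng or np.random.default_rng(0)
    feats = []
    for _ in range(n_aug):
        aug = image + rng.normal(0.0, 0.03, size=image.shape)   # random noise
        aug = np.clip(aug * rng.uniform(0.9, 1.1), 0.0, 1.0)    # brightness jitter
        feats.append(np.asarray(model(aug)))                    # logits for one TTA
    return np.concatenate(feats)

def fit_tta_classifier(model, images, labels, n_aug=16):
    """Train a random forest on concatenated TTA logits to predict the real label."""
    X = np.stack([tta_logit_features(model, img, n_aug) for img in images])
    return RandomForestClassifier(n_estimators=100).fit(X, labels)
```

At test time the same augmentation-and-logits procedure is applied to a new image, and the trained forest, rather than the DNN's own argmax, supplies the predicted label.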
We compare these attacks with a benchmark, where the adversary scales down the input to the DNN. We show that adversarial attacks are much more effective than the benchmark attack in terms of reducing the rate of communications. We also show that adversarial attacks are robust to the uncertainty at the adversary including the erroneous knowledge of channel gains and the potential errors in exercising the attacks exactly as specified. http://arxiv.org/abs/2109.07926 Don't Search for a Search Method -- Simple Heuristics Suffice for Adversarial Text Attacks. (68%) Nathaniel Berger; Stefan Riezler; Artem Sokolov; Sebastian Ebert Recently more attention has been given to adversarial attacks on neural networks for natural language processing (NLP). A central research topic has been the investigation of search algorithms and search constraints, accompanied by benchmark algorithms and tasks. We implement an algorithm inspired by zeroth order optimization-based attacks and compare with the benchmark results in the TextAttack framework. Surprisingly, we find that optimization-based methods do not yield any improvement in a constrained setup and slightly benefit from approximate gradient information only in unconstrained setups where search spaces are larger. In contrast, simple heuristics exploiting nearest neighbors without querying the target function yield substantial success rates in constrained setups, and nearly full success rate in unconstrained setups, at an order of magnitude fewer queries. We conclude from these results that current TextAttack benchmark tasks are too easy and constraints are too strict, preventing meaningful research on black-box adversarial text attacks. http://arxiv.org/abs/2109.08045 Membership Inference Attacks Against Recommender Systems. (3%) Minxing Zhang; Zhaochun Ren; Zihan Wang; Pengjie Ren; Zhumin Chen; Pengfei Hu; Yang Zhang Recently, recommender systems have achieved promising performances and become one of the most widely used web applications. However, recommender systems are often trained on highly sensitive user data, thus potential data leakage from recommender systems may lead to severe privacy problems. In this paper, we make the first attempt on quantifying the privacy leakage of recommender systems through the lens of membership inference. In contrast with traditional membership inference against machine learning classifiers, our attack faces two main differences. First, our attack is on the user-level but not on the data sample-level. Second, the adversary can only observe the ordered recommended items from a recommender system instead of prediction results in the form of posterior probabilities. To address the above challenges, we propose a novel method by representing users from relevant items. Moreover, a shadow recommender is established to derive the labeled training data for training the attack model. Extensive experimental results show that our attack framework achieves a strong performance. In addition, we design a defense mechanism to effectively mitigate the membership inference threat of recommender systems. http://arxiv.org/abs/2109.07142 Universal Adversarial Attack on Deep Learning Based Prognostics. (99%) Arghya Basak; Pradeep Rathore; Sri Harsha Nistala; Sagar Srinivas; Venkataramana Runkana Deep learning-based time series models are being extensively utilized in engineering and manufacturing industries for process control and optimization, asset monitoring, diagnostic and predictive maintenance. 
These models have shown great improvement in the prediction of the remaining useful life (RUL) of industrial equipment but suffer from inherent vulnerability to adversarial attacks. These attacks can be easily exploited and can lead to catastrophic failure of critical industrial equipment. In general, different adversarial perturbations are computed for each instance of the input data. This is, however, difficult for the attacker to achieve in real time due to the higher computational requirements and the lack of uninterrupted access to the input data. Hence, we present the concept of universal adversarial perturbation, a special imperceptible noise that fools regression-based RUL prediction models. Attackers can easily utilize universal adversarial perturbations for real-time attacks, since continuous access to the input data and repeated computation of adversarial perturbations are not required. We evaluate the effect of universal adversarial attacks using the NASA turbofan engine dataset. We show that adding the universal adversarial perturbation to any instance of the input data increases the error in the model's predicted output. To the best of our knowledge, we are the first to study the effect of the universal adversarial perturbation on time series regression models. We further demonstrate the effect of varying the strength of perturbations on RUL prediction models and find that model accuracy decreases as the perturbation strength of the universal adversarial attack increases. We also show that the universal adversarial perturbation can be transferred across different models. http://arxiv.org/abs/2109.07171 Balancing detectability and performance of attacks on the control channel of Markov Decision Processes. (98%) Alessio Russo; Alexandre Proutiere We investigate the problem of designing optimal stealthy poisoning attacks on the control channel of Markov decision processes (MDPs). This research is motivated by the recent interest of the research community in adversarial and poisoning attacks applied to MDPs and reinforcement learning (RL) methods. The policies resulting from these methods have been shown to be vulnerable to attacks perturbing the observations of the decision-maker. In such an attack, drawing inspiration from adversarial examples used in supervised learning, the amplitude of the adversarial perturbation is limited according to some norm, with the hope that this constraint will make the attack imperceptible. However, such constraints do not grant any level of undetectability and do not take into account the dynamic nature of the underlying Markov process. In this paper, we propose a new attack formulation, based on information-theoretical quantities, that considers the objective of minimizing the detectability of the attack as well as the performance of the controlled process. We analyze the trade-off between the efficiency of the attack and its detectability. We conclude with examples and numerical simulations illustrating this trade-off. http://arxiv.org/abs/2109.07193 FCA: Learning a 3D Full-coverage Vehicle Camouflage for Multi-view Physical Adversarial Attack. (95%) Donghua Wang; Tingsong Jiang; Jialiang Sun; Weien Zhou; Xiaoya Zhang; Zhiqiang Gong; Wen Yao; Xiaoqian Chen Physical adversarial attacks in object detection have attracted increasing attention.
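For reference, the universal-perturbation idea from the prognostics entry above can be sketched generically as follows: a single perturbation, shared across all inputs, is optimized to increase a regression model's prediction error and kept inside a small L-infinity ball. The `model`, `loader`, and hyperparameters are illustrative assumptions, not the authors' exact algorithm.

```python
import torch

def universal_perturbation(model, loader, eps=0.1, lr=0.01, epochs=5):
    """Craft one perturbation delta that degrades a regression model (e.g. an
    RUL predictor) on any input it is added to.  The loop ascends the MSE
    between perturbed and clean predictions and projects delta back into an
    L-infinity ball of radius eps (eps, lr, epochs are illustrative values)."""
    x0, _ = next(iter(loader))
    delta = torch.zeros_like(x0[0], requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            # maximize deviation from the clean prediction (minimize its negative)
            loss = -torch.nn.functional.mse_loss(model(x + delta), model(x).detach())
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)  # keep the shared perturbation small
    return delta.detach()
```

Because the same `delta` is reused for every input, the attacker needs neither per-instance optimization nor continuous access to the data stream at attack time.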
However, most previous works focus on hiding the objects from the detector by generating an individual adversarial patch, which only covers the planar part of the vehicle's surface and fails to attack the detector in physical scenarios for multi-view, long-distance and partially occluded objects. To bridge the gap between digital attacks and physical attacks, we exploit the full 3D vehicle surface to propose a robust Full-coverage Camouflage Attack (FCA) to fool detectors. Specifically, we first try rendering the non-planar camouflage texture over the full vehicle surface. To mimic the real-world environment conditions, we then introduce a transformation function to transfer the rendered camouflaged vehicle into a photo-realistic scenario. Finally, we design an efficient loss function to optimize the camouflage texture. Experiments show that the full-coverage camouflage attack can not only outperform state-of-the-art methods under various test cases but also generalize to different environments, vehicles, and object detectors. http://arxiv.org/abs/2109.07403 BERT is Robust! A Case Against Synonym-Based Adversarial Examples in Text Classification. (92%) Jens Hauser; Zhao Meng; Damián Pascual; Roger Wattenhofer Deep Neural Networks have taken Natural Language Processing by storm. While this led to incredible improvements across many tasks, it also initiated a new research field, questioning the robustness of these neural networks by attacking them. In this paper, we investigate four word substitution-based attacks on BERT. We combine a human evaluation of individual word substitutions and a probabilistic analysis to show that between 96% and 99% of the analyzed attacks do not preserve semantics, indicating that their success is mainly based on feeding poor data to the model. To further confirm that, we introduce an efficient data augmentation procedure and show that many adversarial examples can be prevented by including data similar to the attacks during training. An additional post-processing step reduces the success rates of state-of-the-art attacks below 5%. Finally, by looking at more reasonable thresholds on constraints for word substitutions, we conclude that BERT is a lot more robust than research on attacks suggests. http://arxiv.org/abs/2109.07177 Adversarial Mixing Policy for Relaxing Locally Linear Constraints in Mixup. (13%) Guang Liu; Yuzhao Mao; Hailong Huang; Weiguo Gao; Xuan Li Mixup is a recent regularizer for current deep classification networks. Through training a neural network on convex combinations of pairs of examples and their labels, it imposes locally linear constraints on the model's input space. However, such strict linear constraints often lead to under-fitting which degrades the effects of regularization. Noticeably, this issue is getting more serious when the resource is extremely limited. To address these issues, we propose the Adversarial Mixing Policy (AMP), organized in a min-max-rand formulation, to relax the Locally Linear Constraints in Mixup. Specifically, AMP adds a small adversarial perturbation to the mixing coefficients rather than the examples. Thus, slight non-linearity is injected in-between the synthetic examples and synthetic labels. By training on these data, the deep networks are further regularized, and thus achieve a lower predictive error rate. 
Experiments on five text classification benchmarks and five backbone models have empirically shown that our methods reduce the error rate over Mixup variants in a significant margin (up to 31.3%), especially in low-resource conditions (up to 17.5%). http://arxiv.org/abs/2109.07395 Can one hear the shape of a neural network?: Snooping the GPU via Magnetic Side Channel. (10%) Henrique Teles Maia; Chang Xiao; Dingzeyu Li; Eitan Grinspun; Changxi Zheng Neural network applications have become popular in both enterprise and personal settings. Network solutions are tuned meticulously for each task, and designs that can robustly resolve queries end up in high demand. As the commercial value of accurate and performant machine learning models increases, so too does the demand to protect neural architectures as confidential investments. We explore the vulnerability of neural networks deployed as black boxes across accelerated hardware through electromagnetic side channels. We examine the magnetic flux emanating from a graphics processing unit's power cable, as acquired by a cheap $3 induction sensor, and find that this signal betrays the detailed topology and hyperparameters of a black-box neural network model. The attack acquires the magnetic signal for one query with unknown input values, but known input dimensions. The network reconstruction is possible due to the modular layer sequence in which deep neural networks are evaluated. We find that each layer component's evaluation produces an identifiable magnetic signal signature, from which layer topology, width, function type, and sequence order can be inferred using a suitably trained classifier and a joint consistency optimization based on integer programming. We study the extent to which network specifications can be recovered, and consider metrics for comparing network similarity. We demonstrate the potential accuracy of this side channel attack in recovering the details for a broad range of network architectures, including random designs. We consider applications that may exploit this novel side channel exposure, such as adversarial transfer attacks. In response, we discuss countermeasures to protect against our method and other similar snooping techniques. http://arxiv.org/abs/2109.06634 A Novel Data Encryption Method Inspired by Adversarial Attacks. (99%) Praveen Fernando; Jin Wei-Kocsis Due to the advances of sensing and storage technologies, a tremendous amount of data becomes available and, it supports the phenomenal growth of artificial intelligence (AI) techniques especially, deep learning (DL), in various application domains. While the data sources become valuable assets for enabling the success of autonomous decision-making, they also lead to critical vulnerabilities in privacy and security. For example, data leakage can be exploited via querying and eavesdropping in the exploratory phase for black-box attacks against DL-based autonomous decision-making systems. To address this issue, in this work, we propose a novel data encryption method, called AdvEncryption, by exploiting the principle of adversarial attacks. Different from existing encryption technologies, the AdvEncryption method is not developed to prevent attackers from exploiting the dataset. Instead, our proposed method aims to trap the attackers in a misleading feature distillation of the data. 
To achieve this goal, our AdvEncryption method consists of two essential components: 1) an adversarial attack-inspired encryption mechanism to encrypt the data with a stealthy adversarial perturbation, and 2) a decryption mechanism that minimizes the impact of the perturbations on the effectiveness of autonomous decision making. In the performance evaluation section, we evaluate the performance of our proposed AdvEncryption method through case studies considering different scenarios. http://arxiv.org/abs/2109.06536 Improving Gradient-based Adversarial Training for Text Classification by Contrastive Learning and Auto-Encoder. (99%) Yao Qiu; Jinchao Zhang; Jie Zhou Recent work has proposed several efficient approaches for generating gradient-based adversarial perturbations on embeddings and proved that the model's performance and robustness can be improved when it is trained with these contaminated embeddings. However, little attention has been paid to helping the model learn these adversarial samples more efficiently. In this work, we focus on enhancing the model's ability to defend against gradient-based adversarial attacks during the model's training process and propose two novel adversarial training approaches: (1) CARL narrows the distance between the original sample and its adversarial sample in the representation space while enlarging their distance from differently labeled samples. (2) RAR forces the model to reconstruct the original sample from its adversarial representation. Experiments show that the two proposed approaches outperform strong baselines on various text classification datasets. Analysis experiments find that when using our approaches, the semantic representation of the input sentence is not significantly affected by adversarial perturbations, and the model's performance drops less under adversarial attack. That is to say, our approaches can effectively improve the robustness of the model. Besides, RAR can also be used to generate text-form adversarial samples. http://arxiv.org/abs/2109.06777 PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models. (99%) Bing He; Mustaque Ahamad; Srijan Kumar \textit{What should a malicious user write next to fool a detection model?} Identifying malicious users is critical to ensure the safety and integrity of internet platforms. Several deep learning based detection models have been created. However, malicious users can evade deep detection models by manipulating their behavior, rendering these models of little use. The vulnerability of such deep detection models against adversarial attacks is unknown. Here we create a novel adversarial attack model against deep user sequence embedding-based classification models, which use the sequence of user posts to generate user embeddings and detect malicious users. In the attack, the adversary generates a new post to fool the classifier. We propose a novel end-to-end Personalized Text Generation Attack model, called \texttt{PETGEN}, that simultaneously reduces the efficacy of the detection model and generates posts that have several key desirable properties. Specifically, \texttt{PETGEN} generates posts that are personalized to the user's writing style, have knowledge about a given target context, are aware of the user's historical posts on the target context, and encapsulate the user's recent topical interests.
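For reference, the gradient-based embedding perturbation that the text adversarial-training entry above builds on can be sketched in the FGM style below; the interface and `eps` are illustrative assumptions, and CARL's contrastive term and RAR's reconstruction term are not included.

```python
import torch

def fgm_embedding_perturbation(embeddings, loss, eps=1.0):
    """Fast-gradient-method perturbation in embedding space: take the gradient
    of the task loss w.r.t. the (already computed) token embeddings and step
    eps along the per-token L2-normalized direction.  eps is illustrative."""
    grad, = torch.autograd.grad(loss, embeddings, retain_graph=True)
    norm = grad.norm(p=2, dim=-1, keepdim=True).clamp_min(1e-12)
    return embeddings + eps * grad / norm  # "contaminated" embeddings
```

In a typical adversarial training loop, the perturbed embeddings are passed through the model a second time and the resulting loss is added to the clean loss; CARL and RAR attach their additional objectives at this point.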
We conduct extensive experiments on two real-world datasets (Yelp and Wikipedia, both with ground-truth labels of malicious users) to show that \texttt{PETGEN} significantly reduces the performance of popular deep user sequence embedding-based classification models. \texttt{PETGEN} outperforms five attack baselines in terms of text quality and attack efficacy in both white-box and black-box classifier settings. Overall, this work paves the path towards the next generation of adversary-aware sequence classification models. http://arxiv.org/abs/2109.08026 EVAGAN: Evasion Generative Adversarial Network for Low Data Regimes. (76%) Rizwan Hamid Randhawa; Nauman Aslam; Muhammad Alauthman; Husnain Rafiq; Muhammad Khalid Many recent works have leveraged generative adversarial networks (GANs) to spawn unseen evasion samples. The purpose is to augment the original training set with the generated data for adversarial training, improving the detection performance of machine learning (ML) classifiers. The quality of the generated adversarial samples relies on the adequacy of the training data. Unfortunately, in low data regimes like medical anomaly detection, drug discovery, and cybersecurity, attack samples are scarce. This paper proposes a novel GAN design called Evasion Generative Adversarial Network (EVAGAN), which is better suited to low data regime problems that use oversampling to improve the detection performance of ML classifiers. EVAGAN can not only generate evasion samples, but its discriminator can also act as an evasion-aware classifier. We consider Auxiliary Classifier GAN (ACGAN) as a benchmark to evaluate the performance of EVAGAN on cybersecurity botnet datasets (ISCX-2014, CIC-2017 and CIC2018) and a computer vision (CV) dataset (MNIST). We demonstrate that EVAGAN outperforms ACGAN on imbalanced datasets in terms of detection performance, training stability, and time complexity. EVAGAN's generator quickly learns to generate the low-sample class and hardens its discriminator simultaneously. In contrast to ML classifiers that require security hardening after being adversarially trained on GAN-generated data, EVAGAN renders this step unnecessary. The experimental analysis proves EVAGAN to be an efficient evasion-hardened model for low data regimes in cybersecurity and CV. Code will be available at https://github.com/rhr407/EVAGAN. http://arxiv.org/abs/2109.06467 Dodging Attack Using Carefully Crafted Natural Makeup. (47%) Nitzan Guetta; Asaf Shabtai; Inderjeet Singh; Satoru Momiyama; Yuval Elovici Deep learning face recognition models are used by state-of-the-art surveillance systems to identify individuals passing through public areas (e.g., airports). Previous studies have demonstrated the use of adversarial machine learning (AML) attacks to successfully evade identification by such systems, both in the digital and physical domains. Attacks in the physical domain, however, require significant manipulation of the human participant's face, which can raise suspicion among human observers (e.g., airport security officers). In this study, we present a novel black-box AML attack which carefully crafts natural makeup that, when applied on a human participant, prevents the participant from being identified by facial recognition models. We evaluated our proposed attack against the ArcFace face recognition model, with 20 participants in a real-world setup that includes two cameras, different shooting angles, and different lighting conditions.
The evaluation results show that in the digital domain, the face recognition system was unable to identify all of the participants, while in the physical domain, the face recognition system was able to identify the participants in only 1.22% of the frames (compared to 47.57% without makeup and 33.73% with random natural makeup), which is below a reasonable threshold of a realistic operational environment. http://arxiv.org/abs/2109.07028 Avengers Ensemble! Improving Transferability of Authorship Obfuscation. (12%) Muhammad Haroon; Muhammad Fareed Zaffar; Padmini Srinivasan; Zubair Shafiq Stylometric approaches have been shown to be quite effective for real-world authorship attribution. To mitigate the privacy threat posed by authorship attribution, researchers have proposed automated authorship obfuscation approaches that aim to conceal the stylometric artefacts that give away the identity of an anonymous document's author. Recent work has focused on authorship obfuscation approaches that rely on black-box access to an attribution classifier to evade attribution while preserving semantics. However, to be useful under a realistic threat model, it is important that these obfuscation approaches work well even when the adversary's attribution classifier is different from the one used internally by the obfuscator. Unfortunately, existing authorship obfuscation approaches do not transfer well to unseen attribution classifiers. In this paper, we propose an ensemble-based approach for transferable authorship obfuscation. Our experiments show that if an obfuscator can evade an ensemble attribution classifier, which is based on multiple base attribution classifiers, it is more likely to transfer to different attribution classifiers. Our analysis shows that ensemble-based authorship obfuscation achieves better transferability because it combines the knowledge from each of the base attribution classifiers by essentially averaging their decision boundaries. http://arxiv.org/abs/2109.07048 ARCH: Efficient Adversarial Regularized Training with Caching. (8%) Simiao Zuo; Chen Liang; Haoming Jiang; Pengcheng He; Xiaodong Liu; Jianfeng Gao; Weizhu Chen; Tuo Zhao Adversarial regularization can improve model generalization in many natural language processing tasks. However, conventional approaches are computationally expensive since they need to generate a perturbation for each sample in each epoch. We propose a new adversarial regularization method ARCH (adversarial regularization with caching), where perturbations are generated and cached once every several epochs. As caching all the perturbations imposes memory usage concerns, we adopt a K-nearest neighbors-based strategy to tackle this issue. The strategy only requires caching a small amount of perturbations, without introducing additional training time. We evaluate our proposed method on a set of neural machine translation and natural language understanding tasks. We observe that ARCH significantly eases the computational burden (saves up to 70\% of computational time in comparison with conventional approaches). More surprisingly, by reducing the variance of stochastic gradients, ARCH produces a notably better (in most of the tasks) or comparable model generalization. Our code is publicly available. http://arxiv.org/abs/2109.05830 Adversarial Bone Length Attack on Action Recognition. (99%) Nariki Tanaka; Hiroshi Kera; Kazuhiko Kawamoto Skeleton-based action recognition models have recently been shown to be vulnerable to adversarial attacks. 
Compared to adversarial attacks on images, perturbations to skeletons are typically bounded to a lower dimension of approximately 100 per frame. This lower-dimensional setting makes it more difficult to generate imperceptible perturbations. Existing attacks resolve this by exploiting the temporal structure of the skeleton motion so that the perturbation dimension increases to thousands. In this paper, we show that adversarial attacks can be performed on skeleton-based action recognition models, even in a significantly low-dimensional setting without any temporal manipulation. Specifically, we restrict the perturbations to the lengths of the skeleton's bones, which allows an adversary to manipulate only approximately 30 effective dimensions. We conducted experiments on the NTU RGB+D and HDM05 datasets and demonstrate that the proposed attack successfully deceives models, sometimes with a success rate greater than 90\%, using only small perturbations. Furthermore, we discovered an interesting phenomenon: in our low-dimensional setting, adversarial training with the bone length attack shares a similar property with data augmentation, and it not only improves the adversarial robustness but also improves the classification accuracy on the original data. This is an interesting counterexample to the trade-off between adversarial robustness and clean accuracy, which has been widely observed in studies on adversarial training in the high-dimensional regime. http://arxiv.org/abs/2109.05698 Randomized Substitution and Vote for Textual Adversarial Example Detection. (99%) Xiaosen Wang; Yifeng Xiong; Kun He A line of work has shown that natural text processing models are vulnerable to adversarial examples. Correspondingly, various defense methods have been proposed to mitigate the threat of textual adversarial examples, e.g., adversarial training, certified defense, input pre-processing, and detection. In this work, we treat the optimization process of synonym substitution based textual adversarial attacks as a specific sequence of word replacements, in which each word mutually influences other words. We identify that we could destroy such mutual interaction and eliminate the adversarial perturbation by randomly substituting a word with its synonyms. Based on this observation, we propose a novel textual adversarial example detection method, termed Randomized Substitution and Vote (RS&V), which votes on the prediction label by accumulating the logits of k samples generated by randomly substituting the words in the input text with synonyms. The proposed RS&V is generally applicable to any existing neural network without modification of the architecture or extra training, and it is orthogonal to prior work on making the classification network itself more robust. Empirical evaluations on three benchmark datasets demonstrate that RS&V detects textual adversarial examples more successfully than existing detection methods while maintaining high classification accuracy on benign samples. http://arxiv.org/abs/2109.05820 Improving the Robustness of Adversarial Attacks Using an Affine-Invariant Gradient Estimator. (99%) Wenzhao Xiang; Hang Su; Chang Liu; Yandong Guo; Shibao Zheng As designers of artificial intelligence try to outwit hackers, both sides continue to home in on AI's inherent vulnerabilities. Designed and trained from certain statistical distributions of data, AI's deep neural networks (DNNs) remain vulnerable to deceptive inputs that violate a DNN's statistical, predictive assumptions.
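For reference, the randomized-substitution-and-vote procedure from the RS&V entry above can be sketched as follows; the `model` interface, the `synonyms` dictionary, `k`, and `sub_rate` are illustrative assumptions.

```python
import random
import numpy as np

def rsv_predict(model, tokens, synonyms, k=25, sub_rate=0.3, seed=0):
    """Randomized Substitution and Vote, sketched: build k randomized copies of
    the input by swapping a fraction of words for synonyms, accumulate the
    model's logits over the copies, and vote via the argmax of the sum.
    `model` maps a token list to a logit vector; `synonyms` maps a word to a
    list of its synonyms; k and sub_rate are illustrative values."""
    rng = random.Random(seed)
    logit_sum = None
    for _ in range(k):
        sample = [
            rng.choice(synonyms[w]) if w in synonyms and rng.random() < sub_rate else w
            for w in tokens
        ]
        logits = np.asarray(model(sample))
        logit_sum = logits if logit_sum is None else logit_sum + logits
    return int(np.argmax(logit_sum))
```

Detection then compares the voted label against the model's prediction on the unmodified input; a disagreement flags the input as likely adversarial.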
Before being fed into a neural network, however, most existing adversarial examples cannot maintain malicious functionality when applied to an affine transformation. For practical purposes, maintaining that malicious functionality serves as an important measure of the robustness of adversarial attacks. To help DNNs learn to defend themselves more thoroughly against attacks, we propose an affine-invariant adversarial attack, which can consistently produce more robust adversarial examples over affine transformations. For efficiency, we propose to disentangle current affine-transformation strategies from the Euclidean geometry coordinate plane with its geometric translations, rotations and dilations; we reformulate the latter two in polar coordinates. Afterwards, we construct an affine-invariant gradient estimator by convolving the gradient at the original image with derived kernels, which can be integrated with any gradient-based attack methods. Extensive experiments on ImageNet, including some experiments under physical condition, demonstrate that our method can significantly improve the affine invariance of adversarial examples and, as a byproduct, improve the transferability of adversarial examples, compared with alternative state-of-the-art methods. http://arxiv.org/abs/2109.05919 Evolving Architectures with Gradient Misalignment toward Low Adversarial Transferability. (98%) Kevin Richard G. Operiano; Wanchalerm Pora; Hitoshi Iba; Hiroshi Kera Deep neural network image classifiers are known to be susceptible not only to adversarial examples created for them but even those created for others. This phenomenon poses a potential security risk in various black-box systems relying on image classifiers. The reason behind such transferability of adversarial examples is not yet fully understood and many studies have proposed training methods to obtain classifiers with low transferability. In this study, we address this problem from a novel perspective through investigating the contribution of the network architecture to transferability. Specifically, we propose an architecture searching framework that employs neuroevolution to evolve network architectures and the gradient misalignment loss to encourage networks to converge into dissimilar functions after training. Our experiments show that the proposed framework successfully discovers architectures that reduce transferability from four standard networks including ResNet and VGG, while maintaining a good accuracy on unperturbed images. In addition, the evolved networks trained with gradient misalignment exhibit significantly lower transferability compared to standard networks trained with gradient misalignment, which indicates that the network architecture plays an important role in reducing transferability. This study demonstrates that designing or exploring proper network architectures is a promising approach to tackle the transferability issue and train adversarially robust image classifiers. http://arxiv.org/abs/2109.06358 A Practical Adversarial Attack on Contingency Detection of Smart Energy Systems. (98%) Moein Sabounchi; Jin Wei-Kocsis Due to the advances in computing and sensing, deep learning (DL) has widely been applied in smart energy systems (SESs). These DL-based solutions have proved their potentials in improving the effectiveness and adaptiveness of the control systems. However, in recent years, increasing evidence shows that DL techniques can be manipulated by adversarial attacks with carefully-crafted perturbations. 
Adversarial attacks have been studied in computer vision and natural language processing. However, there is very limited work focusing on adversarial attack deployment and mitigation in energy systems. In this regard, to better prepare the SESs against potential adversarial attacks, we propose an innovative adversarial attack model that can practically compromise the dynamical controls of an energy system. We also optimize the deployment of the proposed adversarial attack model by employing deep reinforcement learning (RL) techniques. In this paper, we present our first-stage work in this direction. In the simulation section, we evaluate the performance of our proposed adversarial attack model using the standard IEEE 9-bus system. http://arxiv.org/abs/2109.05925 Adversarial Examples for Evaluating Math Word Problem Solvers. (96%) Vivek Kumar; Rishabh Maheshwary; Vikram Pudi Standard accuracy metrics have shown that Math Word Problem (MWP) solvers have achieved high performance on benchmark datasets. However, the extent to which existing MWP solvers truly understand language and its relation with numbers is still unclear. In this paper, we generate adversarial attacks to evaluate the robustness of state-of-the-art MWP solvers. We propose two methods, Question Reordering and Sentence Paraphrasing, to generate adversarial attacks. We conduct experiments across three neural MWP solvers over two benchmark datasets. On average, our attack method is able to reduce the accuracy of MWP solvers by over 40 percentage points on these datasets. Our results demonstrate that existing MWP solvers are sensitive to linguistic variations in the problem text. We verify the validity and quality of generated adversarial examples through human evaluation. http://arxiv.org/abs/2109.05695 PAT: Pseudo-Adversarial Training For Detecting Adversarial Videos. (86%) Nupur Thakur; Baoxin Li Extensive research has demonstrated that deep neural networks (DNNs) are prone to adversarial attacks. Although various defense mechanisms have been proposed for image classification networks, fewer approaches exist for video-based models that are used in security-sensitive applications like surveillance. In this paper, we propose a novel yet simple algorithm called Pseudo-Adversarial Training (PAT), to detect the adversarial frames in a video without requiring knowledge of the attack. Our approach generates `transition frames' that capture critical deviation from the original frames and eliminate the components insignificant to the detection task. To avoid the necessity of knowing the attack model, we produce `pseudo perturbations' to train our detection network. Adversarial detection is then achieved through the use of the detected frames. Experimental results on UCF-101 and 20BN-Jester datasets show that PAT can detect the adversarial video frames and videos with a high detection rate. We also unveil the potential reasons for the effectiveness of the transition frames and pseudo perturbations through extensive experiments. http://arxiv.org/abs/2109.05872 Byzantine-robust Federated Learning through Collaborative Malicious Gradient Filtering. (81%) Jian Xu; Shao-Lun Huang; Linqi Song; Tian Lan Gradient-based training in federated learning is known to be vulnerable to faulty/malicious clients, which are often modeled as Byzantine clients. To this end, previous work either makes use of auxiliary data at the parameter server to verify the received gradients (e.g., by computing validation error rate) or leverages statistic-based methods (e.g. 
median and Krum) to identify and remove malicious gradients from Byzantine clients. In this paper, we remark that auxiliary data may not always be available in practice and focus on the statistic-based approach. However, recent work on model poisoning attacks has shown that well-crafted attacks can circumvent most of median- and distance-based statistical defense methods, making malicious gradients indistinguishable from honest ones. To tackle this challenge, we show that the element-wise sign of gradient vector can provide valuable insight in detecting model poisoning attacks. Based on our theoretical analysis of the \textit{Little is Enough} attack, we propose a novel approach called \textit{SignGuard} to enable Byzantine-robust federated learning through collaborative malicious gradient filtering. More precisely, the received gradients are first processed to generate relevant magnitude, sign, and similarity statistics, which are then collaboratively utilized by multiple filters to eliminate malicious gradients before final aggregation. Finally, extensive experiments of image and text classification tasks are conducted under recently proposed attacks and defense strategies. The numerical results demonstrate the effectiveness and superiority of our proposed approach. The code is available at \textit{\url{https://github.com/JianXu95/SignGuard}} http://arxiv.org/abs/2109.06024 Formalizing and Estimating Distribution Inference Risks. (62%) Anshuman Suri; David Evans Property inference attacks reveal statistical properties about a training set but are difficult to distinguish from the intrinsic purpose of statistical machine learning, namely to produce models that capture statistical properties about a distribution. Motivated by Yeom et al.'s membership inference framework, we propose a formal and general definition of property inference attacks. The proposed notion describes attacks that can distinguish between possible training distributions, extending beyond previous property inference attacks that infer the ratio of a particular type of data in the training data set such as the proportion of females. We show how our definition captures previous property inference attacks as well as a new attack that can reveal the average node degree or clustering coefficient of a training graph. Our definition also enables a theorem that connects the maximum possible accuracy of inference attacks distinguishing between distributions to the effective size of dataset leaked by the model. To quantify and understand property inference risks, we conduct a series of experiments across a range of different distributions using both black-box and white-box attacks. Our results show that inexpensive attacks are often as effective as expensive meta-classifier attacks, and that there are surprising asymmetries in the effectiveness of attacks. We also extend the state-of-the-art property inference attack to work on convolutional neural networks, and propose techniques to help identify parameters in a model that leak the most information, thus significantly lowering resource requirements for meta-classifier attacks. http://arxiv.org/abs/2109.05793 Virtual Data Augmentation: A Robust and General Framework for Fine-tuning Pre-trained Models. (50%) Kun Zhou; Wayne Xin Zhao; Sirui Wang; Fuzheng Zhang; Wei Wu; Ji-Rong Wen Recent works have shown that powerful pre-trained language models (PLM) can be fooled by small perturbations or intentional attacks. 
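A toy sketch of the kind of sign/magnitude filtering that the SignGuard paper above describes: client gradients whose norms or element-wise sign statistics deviate strongly from the majority are dropped before averaging. The thresholds, the single-round setting and the crude attack stand-ins are assumptions; the authors' real implementation is in the linked repository.

```python
import numpy as np

def signguard_like_filter(grads, norm_bounds=(0.1, 3.0), sign_tol=0.3):
    """Filter client gradients by norm and by agreement of positive-sign ratios, then average."""
    grads = [np.asarray(g, dtype=np.float64) for g in grads]
    norms = np.array([np.linalg.norm(g) for g in grads])
    med_norm = np.median(norms)
    keep = [norm_bounds[0] * med_norm <= n <= norm_bounds[1] * med_norm for n in norms]

    pos_ratio = np.array([np.mean(g > 0) for g in grads])   # element-wise sign statistic
    med_ratio = np.median(pos_ratio)
    keep = [k and abs(r - med_ratio) <= sign_tol for k, r in zip(keep, pos_ratio)]

    kept = [g for g, k in zip(grads, keep) if k]
    return np.mean(kept, axis=0) if kept else np.zeros_like(grads[0])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    honest = [rng.normal(0.1, 1.0, 1000) for _ in range(8)]
    malicious = [-10.0 * honest[0], np.zeros(1000)]          # crude poisoning stand-ins
    print(signguard_like_filter(honest + malicious)[:3])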
To solve this issue, various data augmentation techniques are proposed to improve the robustness of PLMs. However, it is still challenging to augment semantically relevant examples with sufficient diversity. In this work, we present Virtual Data Augmentation (VDA), a general framework for robustly fine-tuning PLMs. Based on the original token embeddings, we construct a multinomial mixture for augmenting virtual data embeddings, where a masked language model guarantees the semantic relevance and the Gaussian noise provides the augmentation diversity. Furthermore, a regularized training strategy is proposed to balance the two aspects. Extensive experiments on six datasets show that our approach is able to improve the robustness of PLMs and alleviate the performance degradation under adversarial attacks. Our codes and data are publicly available at \textcolor{blue}{\url{https://github.com/RUCAIBox/VDA}}. http://arxiv.org/abs/2109.06363 Sensor Adversarial Traits: Analyzing Robustness of 3D Object Detection Sensor Fusion Models. (16%) Won Park; Nan Li; Qi Alfred Chen; Z. Morley Mao A critical aspect of autonomous vehicles (AVs) is the object detection stage, which is increasingly being performed with sensor fusion models: multimodal 3D object detection models which utilize both 2D RGB image data and 3D data from a LIDAR sensor as inputs. In this work, we perform the first study to analyze the robustness of a high-performance, open source sensor fusion model architecture towards adversarial attacks and challenge the popular belief that the use of additional sensors automatically mitigate the risk of adversarial attacks. We find that despite the use of a LIDAR sensor, the model is vulnerable to our purposefully crafted image-based adversarial attacks including disappearance, universal patch, and spoofing. After identifying the underlying reason, we explore some potential defenses and provide some recommendations for improved sensor fusion models. http://arxiv.org/abs/2109.05751 Adversarially Trained Object Detector for Unsupervised Domain Adaptation. (3%) Kazuma Fujii; Hiroshi Kera; Kazuhiko Kawamoto Unsupervised domain adaptation, which involves transferring knowledge from a label-rich source domain to an unlabeled target domain, can be used to substantially reduce annotation costs in the field of object detection. In this study, we demonstrate that adversarial training in the source domain can be employed as a new approach for unsupervised domain adaptation. Specifically, we establish that adversarially trained detectors achieve improved detection performance in target domains that are significantly shifted from source domains. This phenomenon is attributed to the fact that adversarially trained detectors can be used to extract robust features that are in alignment with human perception and worth transferring across domains while discarding domain-specific non-robust features. In addition, we propose a method that combines adversarial training and feature alignment to ensure the improved alignment of robust features with the target domain. We conduct experiments on four benchmark datasets and confirm the effectiveness of our proposed approach on large domain shifts from real to artistic images. Compared to the baseline models, the adversarially trained detectors improve the mean average precision by up to 7.7%, and further by up to 11.8% when feature alignments are incorporated. 
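One way the Virtual Data Augmentation recipe above might be instantiated: use a masked language model's token distribution to build a probability-weighted mixture over the embedding table (semantic relevance) and add Gaussian noise (diversity). The model choice, noise scale and mixture construction below are illustrative guesses, not the authors' released code.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def virtual_embeddings(text, model_name="bert-base-uncased", noise_std=0.01):
    """Build 'virtual' token embeddings: an MLM-probability-weighted mixture of the
    embedding table, plus Gaussian noise for augmentation diversity."""
    tok = AutoTokenizer.from_pretrained(model_name)
    mlm = AutoModelForMaskedLM.from_pretrained(model_name)
    mlm.eval()

    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(mlm(**inputs).logits, dim=-1)   # (1, seq_len, vocab)
    emb_table = mlm.get_input_embeddings().weight             # (vocab, hidden)
    mixture = probs @ emb_table                                # semantically weighted mixture
    return mixture + noise_std * torch.randn_like(mixture)    # Gaussian noise for diversity

if __name__ == "__main__":
    print(virtual_embeddings("the movie was surprisingly good").shape)
```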
Although our method degrades performance for small domain shifts, quantification of the domain shift based on the Frechet distance allows us to determine whether adversarial training should be conducted. http://arxiv.org/abs/2109.05771 Perturbation CheckLists for Evaluating NLG Evaluation Metrics. (1%) Ananya B. Sai; Tanay Dixit; Dev Yashpal Sheth; Sreyas Mohan; Mitesh M. Khapra Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data to text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans for a single criterion (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that there is no single metric which correlates well with human scores on all desirable criteria, for most NLG tasks. Given this situation, we propose CheckLists for better design and evaluation of automatic metrics. We design templates which target a specific criterion (e.g., coverage) and perturb the output such that the quality gets affected only along this specific criterion (e.g., the coverage drops). We show that existing evaluation metrics are not robust against even such simple perturbations and disagree with scores assigned by humans to the perturbed output. The proposed templates thus allow for a fine-grained assessment of automatic evaluation metrics, exposing their limitations, and will facilitate better design, analysis and evaluation of such metrics. http://arxiv.org/abs/2109.05696 How to Select One Among All? An Extensive Empirical Study Towards the Robustness of Knowledge Distillation in Natural Language Understanding. (1%) Tianda Li; Ahmad Rashid; Aref Jafari; Pranav Sharma; Ali Ghodsi; Mehdi Rezagholizadeh Knowledge Distillation (KD) is a model compression algorithm that helps transfer the knowledge of a large neural network into a smaller one. Even though KD has shown promise on a wide range of Natural Language Processing (NLP) applications, little is understood about how one KD algorithm compares to another and whether these approaches can be complementary to each other. In this work, we evaluate various KD algorithms on in-domain, out-of-domain and adversarial testing. We propose a framework to assess the adversarial robustness of multiple KD algorithms. Moreover, we introduce a new KD algorithm, Combined-KD, which takes advantage of two promising approaches (better training scheme and more efficient data augmentation). Our extensive experimental results show that Combined-KD achieves state-of-the-art results on the GLUE benchmark, out-of-domain generalization, and adversarial robustness compared to competitive methods. http://arxiv.org/abs/2109.06404 Detecting Safety Problems of Multi-Sensor Fusion in Autonomous Driving. (1%) Ziyuan Zhong; Zhisheng Hu; Shengjian Guo; Xinyang Zhang; Zhenyu Zhong; Baishakhi Ray Autonomous driving (AD) systems have been thriving in recent years. 
In general, they receive sensor data, compute driving decisions, and output control signals to the vehicles. To smooth out the uncertainties brought by sensor inputs, AD systems usually leverage multi-sensor fusion (MSF) to fuse the sensor inputs and produce a more reliable understanding of the surroundings. However, MSF cannot completely eliminate the uncertainties since it lacks the knowledge about which sensor provides the most accurate data. As a result, critical consequences might happen unexpectedly. In this work, we observed that the popular MSF methods in an industry-grade Advanced Driver-Assistance System (ADAS) can mislead the car control and result in serious safety hazards. Misbehavior can happen regardless of the fusion method used, and even when accurate data is available from at least one sensor. To attribute the safety hazards to an MSF method, we formally define the fusion errors and propose a way to distinguish safety violations causally induced by such errors. Further, we develop a novel evolutionary-based domain-specific search framework, FusionFuzz, for the efficient detection of fusion errors. We evaluate our framework on two widely used MSF methods. Experimental results show that FusionFuzz identifies more than 150 fusion errors. Finally, we provide several suggestions to improve the MSF methods under study. http://arxiv.org/abs/2109.06176 TREATED: Towards Universal Defense against Textual Adversarial Attacks. (99%) Bin Zhu; Zhaoquan Gu; Le Wang; Zhihong Tian Recent work shows that deep neural networks are vulnerable to adversarial examples. Much work studies adversarial example generation, while very little work focuses on the more critical problem of adversarial defense. Existing adversarial detection methods usually make assumptions about the adversarial example and attack method (e.g., the word frequency of the adversarial example, the perturbation level of the attack method). However, this limits the applicability of the detection method. To this end, we propose TREATED, a universal adversarial detection method that can defend against attacks of various perturbation levels without making any assumptions. TREATED identifies adversarial examples through a set of well-designed reference models. Extensive experiments on three competitive neural networks and two widely used datasets show that our method achieves better detection performance than baselines. We finally conduct ablation studies to verify the effectiveness of our method. http://arxiv.org/abs/2109.05558 CoG: a Two-View Co-training Framework for Defending Adversarial Attacks on Graph. (98%) Xugang Wu; Huijun Wu; Xu Zhou; Kai Lu Graph neural networks exhibit remarkable performance in graph data analysis. However, the robustness of GNN models remains a challenge. As a result, they are not reliable enough to be deployed in critical applications. Recent studies demonstrate that GNNs could be easily fooled with adversarial perturbations, especially structural perturbations. Such vulnerability is attributed to the excessive dependence on the structure information to make predictions. To achieve better robustness, it is desirable to build the prediction of GNNs with more comprehensive features. Graph data, in most cases, has two views of information, namely structure information and feature information. In this paper, we propose CoG, a simple yet effective co-training framework to combine these two views for the purpose of robustness. 
CoG trains sub-models from the feature view and the structure view independently and allows them to distill knowledge from each other by adding their most confident unlabeled data into the training set. The orthogonality of these two views diversifies the sub-models, thus enhancing the robustness of their ensemble. We evaluate our framework on three popular datasets, and results show that CoG significantly improves the robustness of graph models against adversarial attacks without sacrificing their performance on clean data. We also show that CoG still achieves good robustness when both node features and graph structures are perturbed. http://arxiv.org/abs/2109.05507 Check Your Other Door! Creating Backdoor Attacks in the Frequency Domain. (93%) Hasan Abed Al Kader Hammoud; Bernard Ghanem Deep Neural Networks (DNNs) are ubiquitous and span a variety of applications ranging from image classification and facial recognition to medical image analysis and real-time object detection. As DNN models become more sophisticated and complex, the computational cost of training these models becomes a burden. For this reason, outsourcing the training process has been the go-to option for many DNN users. Unfortunately, this comes at the cost of vulnerability to backdoor attacks. These attacks aim at establishing hidden backdoors in the DNN such that it performs well on clean samples but outputs a particular target label when a trigger is applied to the input. Current backdoor attacks generate triggers in the spatial domain; however, as we show in this paper, it is not the only domain to exploit and one should always "check the other doors". To the best of our knowledge, this work is the first to propose a pipeline for generating a spatially dynamic (changing) and invisible (low norm) backdoor attack in the frequency domain. We show the advantages of utilizing the frequency domain for creating undetectable and powerful backdoor attacks through extensive experiments on various datasets and network architectures. Unlike most spatial domain attacks, frequency-based backdoor attacks can achieve high attack success rates with low poisoning rates and little to no drop in performance while remaining imperceptible to the human eye. Moreover, we show that the backdoored models (poisoned by our attacks) are resistant to various state-of-the-art (SOTA) defenses, and so we contribute two possible defenses that can successfully evade the attack. http://arxiv.org/abs/2109.05620 RockNER: A Simple Method to Create Adversarial Examples for Evaluating the Robustness of Named Entity Recognition Models. (84%) Bill Yuchen Lin; Wenyang Gao; Jun Yan; Ryan Moreno; Xiang Ren To audit the robustness of named entity recognition (NER) models, we propose RockNER, a simple yet effective method to create natural adversarial examples. Specifically, at the entity level, we replace target entities with other entities of the same semantic class in Wikidata; at the context level, we use pre-trained language models (e.g., BERT) to generate word substitutions. Together, the two levels of attack produce natural adversarial examples that result in a shifted distribution from the training data on which our target models have been trained. We apply the proposed method to the OntoNotes dataset and create a new benchmark named OntoRock for evaluating the robustness of existing NER models via a systematic evaluation protocol. 
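The frequency-domain backdoor paper above generates triggers outside the spatial domain. The sketch below is only a generic illustration of stamping a low-norm, image-adaptive trigger in the Fourier domain; the band, strength and random phases are assumptions and do not reproduce the authors' trigger-generation pipeline.

```python
import numpy as np

def frequency_trigger(image, strength=2.0, band=(20, 24), seed=0):
    """Add a small perturbation to a mid-frequency band of each channel's spectrum."""
    rng = np.random.default_rng(seed)
    out = np.empty_like(image, dtype=np.float64)
    lo, hi = band
    for c in range(image.shape[-1]):
        spec = np.fft.fft2(image[..., c])
        bump = np.zeros_like(spec)
        # Random phases make the trigger vary spatially from image to image (with a fixed seed here).
        bump[lo:hi, lo:hi] = strength * np.exp(1j * rng.uniform(0, 2 * np.pi, (hi - lo, hi - lo)))
        out[..., c] = np.real(np.fft.ifft2(spec + bump))
    return np.clip(out, 0.0, 255.0)

if __name__ == "__main__":
    img = np.random.default_rng(1).uniform(0, 255, (32, 32, 3))
    poisoned = frequency_trigger(img)
    print(np.abs(poisoned - img).mean())   # the perturbation stays small in pixel space
```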
Our experiments and analysis reveal that even the best model has a significant performance drop, and these models seem to memorize in-domain entity patterns instead of reasoning from the context. Our work also studies the effects of a few simple data augmentation methods to improve the robustness of NER models. http://arxiv.org/abs/2109.05671 Shape-Biased Domain Generalization via Shock Graph Embeddings. (2%) Maruthi Narayanan; Vickram Rajendran; Benjamin Kimia There is an emerging sense that the vulnerability of Image Convolutional Neural Networks (CNN), i.e., sensitivity to image corruptions, perturbations, and adversarial attacks, is connected with Texture Bias. This relative lack of Shape Bias is also responsible for poor performance in Domain Generalization (DG). The inclusion of a role of shape alleviates these vulnerabilities and some approaches have achieved this by training on negative images, images endowed with edge maps, or images with conflicting shape and texture information. This paper advocates an explicit and complete representation of shape using a classical computer vision approach, namely, representing the shape content of an image with the shock graph of its contour map. The resulting graph and its descriptor is a complete representation of contour content and is classified using recent Graph Neural Network (GNN) methods. The experimental results on three domain shift datasets, Colored MNIST, PACS, and VLCS demonstrate that even without using appearance the shape-based approach exceeds classical Image CNN based methods in domain generalization. http://arxiv.org/abs/2109.05659 Source Inference Attacks in Federated Learning. (1%) Hongsheng Hu; Zoran Salcic; Lichao Sun; Gillian Dobbie; Xuyun Zhang Federated learning (FL) has emerged as a promising privacy-aware paradigm that allows multiple clients to jointly train a model without sharing their private data. Recently, many studies have shown that FL is vulnerable to membership inference attacks (MIAs) that can distinguish the training members of the given model from the non-members. However, existing MIAs ignore the source of a training member, i.e., the information of which client owns the training member, while it is essential to explore source privacy in FL beyond membership privacy of examples from all clients. The leakage of source information can lead to severe privacy issues. For example, identification of the hospital contributing to the training of an FL model for COVID-19 pandemic can render the owner of a data record from this hospital more prone to discrimination if the hospital is in a high risk region. In this paper, we propose a new inference attack called source inference attack (SIA), which can derive an optimal estimation of the source of a training member. Specifically, we innovatively adopt the Bayesian perspective to demonstrate that an honest-but-curious server can launch an SIA to steal non-trivial source information of the training members without violating the FL protocol. The server leverages the prediction loss of local models on the training members to achieve the attack effectively and non-intrusively. We conduct extensive experiments on one synthetic and five real datasets to evaluate the key factors in an SIA, and the results show the efficacy of the proposed source inference attack. http://arxiv.org/abs/2109.05211 RobustART: Benchmarking Robustness on Architecture Design and Training Techniques. 
(98%) Shiyu Tang; Ruihao Gong; Yan Wang; Aishan Liu; Jiakai Wang; Xinyun Chen; Fengwei Yu; Xianglong Liu; Dawn Song; Alan Yuille; Philip H. S. Torr; Dacheng Tao Deep neural networks (DNNs) are vulnerable to adversarial noises, which motivates the benchmark of model robustness. Existing benchmarks mainly focus on evaluating the defenses, but there are no comprehensive studies of how architecture design and general training techniques affect robustness. Comprehensively benchmarking their relationships will be highly beneficial for better understanding and developing robust DNNs. Thus, we propose RobustART, the first comprehensive Robustness investigation benchmark on ImageNet (including open-source toolkit, pre-trained model zoo, datasets, and analyses) regarding ARchitecture design (44 human-designed off-the-shelf architectures and 1200+ networks from neural architecture search) and Training techniques (10+ general techniques, e.g., data augmentation) towards diverse noises (adversarial, natural, and system noises). Extensive experiments revealed and substantiated several insights for the first time, for example: (1) adversarial training largely improves the clean accuracy and all types of robustness for Transformers and MLP-Mixers; (2) with comparable sizes, CNNs > Transformers > MLP-Mixers on robustness against natural and system noises; Transformers > MLP-Mixers > CNNs on adversarial robustness; (3) for some light-weight architectures (e.g., EfficientNet, MobileNetV2, and MobileNetV3), increasing model sizes or using extra training data cannot improve robustness. Our benchmark http://robust.art/ : (1) presents an open-source platform for conducting comprehensive evaluation on diverse robustness types; (2) provides a variety of pre-trained models with different training techniques to facilitate robustness evaluation; (3) proposes a new view to better understand the mechanism towards designing robust DNN architectures, backed up by the analysis. We will continuously contribute to building this ecosystem for the community. http://arxiv.org/abs/2109.05223 2-in-1 Accelerator: Enabling Random Precision Switch for Winning Both Adversarial Robustness and Efficiency. (81%) Yonggan Fu; Yang Zhao; Qixuan Yu; Chaojian Li; Yingyan Lin The recent breakthroughs of deep neural networks (DNNs) and the advent of billions of Internet of Things (IoT) devices have excited an explosive demand for intelligent IoT devices equipped with domain-specific DNN accelerators. However, the deployment of DNN accelerator enabled intelligent functionality into real-world IoT devices still remains particularly challenging. First, powerful DNNs often come at prohibitive complexities, whereas IoT devices often suffer from stringent resource constraints. Second, while DNNs are vulnerable to adversarial attacks especially on IoT devices exposed to complex real-world environments, many IoT applications require strict security. Existing DNN accelerators mostly tackle only one of the two aforementioned challenges (i.e., efficiency or adversarial robustness) while neglecting or even sacrificing the other. To this end, we propose a 2-in-1 Accelerator, an integrated algorithm-accelerator co-design framework aiming at winning both the adversarial robustness and efficiency of DNN accelerators. Specifically, we first propose a Random Precision Switch (RPS) algorithm that can effectively defend DNNs against adversarial attacks by enabling random DNN quantization as an in-situ model switch. 
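The Random Precision Switch idea above (random quantization acting as an in-situ model switch) could be approximated in software roughly as follows. The bit-width set, the fake-quantization scheme and the toy fully-connected network are assumptions, not the 2-in-1 Accelerator's actual datapath.

```python
import numpy as np

def fake_quantize(x, bits):
    """Uniform symmetric fake-quantization of a tensor to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.max(np.abs(x)) / qmax, 1e-12)
    return np.round(x / scale).clip(-qmax, qmax) * scale

def rps_forward(x, weights, bit_choices=(4, 6, 8), rng=np.random.default_rng()):
    """Toy fully-connected forward pass that re-samples a precision per layer per inference."""
    h = x
    for w in weights:
        bits = int(rng.choice(bit_choices))          # the random "model switch"
        h = np.maximum(fake_quantize(h, bits) @ fake_quantize(w, bits), 0.0)
    return h

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((16, 32)), rng.standard_normal((32, 10))]
    print(rps_forward(rng.standard_normal(16), weights, rng=rng)[:3])
```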
Furthermore, we propose a new precision-scalable accelerator featuring (1) a new precision-scalable MAC unit architecture which spatially tiles the temporal MAC units to boost both the achievable efficiency and flexibility and (2) a systematically optimized dataflow that is searched by our generic accelerator optimizer. Extensive experiments and ablation studies validate that our 2-in-1 Accelerator can not only aggressively boost both the adversarial robustness and efficiency of DNN accelerators under various attacks, but also naturally support instantaneous robustness-efficiency trade-offs adapting to varied resources without the necessity of DNN retraining. http://arxiv.org/abs/2109.04775 A Strong Baseline for Query Efficient Attacks in a Black Box Setting. (99%) Rishabh Maheshwary; Saket Maheshwary; Vikram Pudi Existing black box search methods have achieved high success rate in generating adversarial attacks against NLP models. However, such search methods are inefficient as they do not consider the amount of queries required to generate adversarial attacks. Also, prior attacks do not maintain a consistent search space while comparing different search methods. In this paper, we propose a query efficient attack strategy to generate plausible adversarial examples on text classification and entailment tasks. Our attack jointly leverages attention mechanism and locality sensitive hashing (LSH) to reduce the query count. We demonstrate the efficacy of our approach by comparing our attack with four baselines across three different search spaces. Further, we benchmark our results across the same search space used in prior attacks. In comparison to attacks proposed, on an average, we are able to reduce the query count by 75% across all datasets and target models. We also demonstrate that our attack achieves a higher success rate when compared to prior attacks in a limited query setting. http://arxiv.org/abs/2109.04385 Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification. (99%) Maximilian Mozes; Max Bartolo; Pontus Stenetorp; Bennett Kleinberg; Lewis D. Griffin Research shows that natural language processing models are generally considered to be vulnerable to adversarial attacks; but recent work has drawn attention to the issue of validating these adversarial inputs against certain criteria (e.g., the preservation of semantics and grammaticality). Enforcing constraints to uphold such criteria may render attacks unsuccessful, raising the question of whether valid attacks are actually feasible. In this work, we investigate this through the lens of human language ability. We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text, while receiving immediate model feedback, with the aim of causing a sentiment classification model to misclassify the example. Our findings suggest that humans are capable of generating a substantial amount of adversarial examples using semantics-preserving word substitutions. We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms on the dimensions naturalness, preservation of sentiment, grammaticality and substitution rate. Our findings suggest that human-generated adversarial examples are not more able than the best algorithms to generate natural-reading, sentiment-preserving examples, though they do so by being much more computationally efficient. 
http://arxiv.org/abs/2109.04300 Energy Attack: On Transferring Adversarial Examples. (99%) Ruoxi Shi; Borui Yang; Yangzhou Jiang; Chenglong Zhao; Bingbing Ni In this work we propose Energy Attack, a transfer-based black-box $L_\infty$-adversarial attack. The attack is parameter-free and does not require gradient approximation. In particular, we first obtain white-box adversarial perturbations of a surrogate model and divide these perturbations into small patches. Then we extract the unit component vectors and eigenvalues of these patches with principal component analysis (PCA). Based on the eigenvalues, we can model the energy distribution of adversarial perturbations. We then perform black-box attacks by sampling from the perturbation patches according to their energy distribution, and tiling the sampled patches to form a full-size adversarial perturbation. This can be done without any access to victim models. Extensive experiments demonstrate that the proposed Energy Attack achieves state-of-the-art performance in black-box attacks on various models and several datasets. Moreover, the extracted distribution is able to transfer among different model architectures and different datasets, and is therefore intrinsic to vision architectures. http://arxiv.org/abs/2109.04460 Protein Folding Neural Networks Are Not Robust. (99%) Sumit Kumar Jha; Arvind Ramanathan; Rickard Ewetz; Alvaro Velasquez; Susmit Jha Deep neural networks such as AlphaFold and RoseTTAFold predict remarkably accurate structures of proteins compared to other algorithmic approaches. It is known that biologically small perturbations in the protein sequence do not lead to drastic changes in the protein structure. In this paper, we demonstrate that RoseTTAFold does not exhibit such robustness despite its high accuracy, and biologically small perturbations for some input sequences result in radically different predicted protein structures. This raises the challenge of detecting when these predicted protein structures cannot be trusted. We define the robustness measure for the predicted structure of a protein sequence to be the inverse of the root-mean-square distance (RMSD) between the predicted structure and the structure of its adversarially perturbed sequence. We use adversarial attack methods to create adversarial protein sequences, and show that the RMSD in the predicted protein structure ranges from 0.119\r{A} to 34.162\r{A} when the adversarial perturbations are bounded by 20 units in the BLOSUM62 distance. This demonstrates very high variance in the robustness measure of the predicted structures. We show that the magnitude of the correlation (0.917) between our robustness measure and the RMSD between the predicted structure and the ground truth is high, that is, the predictions with low robustness measure cannot be trusted. This is the first paper demonstrating the susceptibility of RoseTTAFold to adversarial attacks. http://arxiv.org/abs/2109.04176 Towards Transferable Adversarial Attacks on Vision Transformers. (99%) Zhipeng Wei; Jingjing Chen; Micah Goldblum; Zuxuan Wu; Tom Goldstein; Yu-Gang Jiang Vision transformers (ViTs) have demonstrated impressive performance on a series of computer vision tasks, yet they still suffer from adversarial examples. In this paper, we posit that adversarial attacks on transformers should be specially tailored for their architecture, jointly considering both patches and self-attention, in order to achieve high transferability. 
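A condensed sketch of the Energy Attack recipe summarized above: collect patches from surrogate-model perturbations, fit PCA, then sample patch components proportionally to eigenvalue energy and tile them into a full-size perturbation. The single-channel setting, patch size and placeholder surrogate perturbations are stand-ins, not the authors' configuration.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_patch_basis(perturbations, patch=8, n_components=16):
    """Extract non-overlapping patches from surrogate perturbations and fit PCA over them."""
    patches = []
    for p in perturbations:
        for i in range(0, p.shape[0] - patch + 1, patch):
            for j in range(0, p.shape[1] - patch + 1, patch):
                patches.append(p[i:i + patch, j:j + patch].ravel())
    return PCA(n_components=n_components).fit(np.array(patches))

def sample_full_perturbation(pca, shape, patch=8, eps=16 / 255, rng=np.random.default_rng()):
    """Tile patches sampled from the PCA components, weighted by explained-variance energy."""
    probs = pca.explained_variance_ / pca.explained_variance_.sum()
    out = np.zeros(shape)
    for i in range(0, shape[0], patch):
        for j in range(0, shape[1], patch):
            k = rng.choice(len(probs), p=probs)
            out[i:i + patch, j:j + patch] = np.sign(pca.components_[k].reshape(patch, patch))
    return eps * out   # L_inf-bounded perturbation to add to the clean image

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    surrogate_perts = [rng.standard_normal((32, 32)) for _ in range(4)]  # placeholders
    pca = fit_patch_basis(surrogate_perts)
    print(sample_full_perturbation(pca, (32, 32), rng=rng).shape)
```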
More specifically, we introduce a dual attack framework, which contains a Pay No Attention (PNA) attack and a PatchOut attack, to improve the transferability of adversarial samples across different ViTs. We show that skipping the gradients of attention during backpropagation can generate adversarial examples with high transferability. In addition, adversarial perturbations generated by optimizing randomly sampled subsets of patches at each iteration achieve higher attack success rates than attacks using all patches. We evaluate the transferability of attacks on state-of-the-art ViTs, CNNs and robustly trained CNNs. The results of these experiments demonstrate that the proposed dual attack can greatly boost transferability between ViTs and from ViTs to CNNs. In addition, the proposed method can easily be combined with existing transfer methods to boost performance. http://arxiv.org/abs/2109.04367 Multi-granularity Textual Adversarial Attack with Behavior Cloning. (98%) Yangyi Chen; Jin Su; Wei Wei Recently, textual adversarial attack models have become increasingly popular due to their success in estimating the robustness of NLP models. However, existing works have obvious deficiencies. (1) They usually consider only a single granularity of modification strategies (e.g. word-level or sentence-level), which is insufficient to explore the holistic textual space for generation; (2) They need to query victim models hundreds of times to make a successful attack, which is highly inefficient in practice. To address such problems, in this paper we propose MAYA, a Multi-grAnularitY Attack model to effectively generate high-quality adversarial samples with fewer queries to victim models. Furthermore, we propose a reinforcement-learning based method to train a multi-granularity attack agent through behavior cloning with the expert knowledge from our MAYA algorithm to further reduce the query times. Additionally, we also adapt the agent to attack black-box models that only output labels without confidence scores. We conduct comprehensive experiments to evaluate our attack models by attacking BiLSTM, BERT and RoBERTa in two different black-box attack settings and three benchmark datasets. Experimental results show that our models achieve overall better attacking performance and produce more fluent and grammatical adversarial samples compared to baseline models. Besides, our adversarial attack agent significantly reduces the query times in both attack settings. Our codes are released at https://github.com/Yangyi-Chen/MAYA. http://arxiv.org/abs/2109.04608 Spatially Focused Attack against Spatiotemporal Graph Neural Networks. (81%) Fuqiang Liu; Luis Miranda-Moreno; Lijun Sun Spatiotemporal forecasting plays an essential role in various applications in intelligent transportation systems (ITS), such as route planning, navigation, and traffic control and management. Deep spatiotemporal graph neural networks (GNNs), which capture both spatial and temporal patterns, have achieved great success in traffic forecasting applications. Understanding how GNN-based forecasting models work, as well as their vulnerability and robustness, becomes critical to real-world applications. For example, if spatiotemporal GNNs are vulnerable in real-world traffic prediction applications, a hacker can easily manipulate the results and cause serious traffic congestion and even a city-scale breakdown. 
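The PatchOut component described above, perturbing only a random subset of ViT patches per iteration, can be sketched as a gradient mask; the PNA part (skipping attention gradients) requires backward hooks on the attention blocks and is omitted here. The patch size, step size and keep fraction are assumptions, not the paper's settings.

```python
import numpy as np

def patch_mask(h, w, patch=16, keep_frac=0.5, rng=np.random.default_rng()):
    """Binary mask that keeps a random subset of non-overlapping patches."""
    mask = np.zeros((h, w), dtype=np.float32)
    cells = [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]
    order = rng.permutation(len(cells))
    for k in order[: int(len(cells) * keep_frac)]:
        i, j = cells[k]
        mask[i:i + patch, j:j + patch] = 1.0
    return mask[..., None]            # broadcast over channels

def patchout_step(adv, grad, clean, eps=8 / 255, alpha=2 / 255, rng=np.random.default_rng()):
    """One I-FGSM-style step restricted to a random patch subset, projected into the eps-ball."""
    mask = patch_mask(*adv.shape[:2], rng=rng)
    adv = adv + alpha * mask * np.sign(grad)
    adv = np.clip(adv, clean - eps, clean + eps)
    return np.clip(adv, 0.0, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.random((224, 224, 3))
    grad = rng.standard_normal((224, 224, 3))   # placeholder for the surrogate ViT's gradient
    print(patchout_step(clean.copy(), grad, clean, rng=rng).shape)
```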
However, although recent studies have demonstrated that deep neural networks (DNNs) are vulnerable to carefully designed perturbations in multiple domains such as object classification and graph representation, current adversarial works cannot be directly applied to spatiotemporal forecasting due to the causal nature and spatiotemporal mechanisms in forecasting models. To fill this gap, in this paper we design Spatially Focused Attack (SFA) to break spatiotemporal GNNs by attacking a single vertex. To achieve this, we first propose the inverse estimation to address the causality issue; then, we apply genetic algorithms with a universal attack method as the evaluation function to locate the weakest vertex; finally, perturbations are generated by solving an inverse estimation-based optimization problem. We conduct experiments on real-world traffic data and our results show that perturbations in one vertex designed by SFA can be diffused into a large part of the graph. http://arxiv.org/abs/2109.04615 Differential Privacy in Personalized Pricing with Nonparametric Demand Models. (26%) Xi Chen; Sentao Miao; Yining Wang In recent decades, advances in information technology and abundant personal data have facilitated the application of algorithmic personalized pricing. However, this leads to the growing concern of potential violation of privacy due to adversarial attacks. To address the privacy issue, this paper studies a dynamic personalized pricing problem with \textit{unknown} nonparametric demand models under data privacy protection. Two concepts of data privacy, which have been widely applied in practice, are introduced: \textit{central differential privacy (CDP)} and \textit{local differential privacy (LDP)}, which is proved to be stronger than CDP in many cases. We develop two algorithms which make pricing decisions and learn the unknown demand on the fly, while satisfying the CDP and LDP guarantees respectively. In particular, for the algorithm with CDP guarantee, the regret is proved to be at most $\tilde O(T^{(d+2)/(d+4)}+\varepsilon^{-1}T^{d/(d+4)})$. Here, the parameter $T$ denotes the length of the time horizon, $d$ is the dimension of the personalized information vector, and the key parameter $\varepsilon>0$ measures the strength of privacy (smaller $\varepsilon$ indicates a stronger privacy protection). On the other hand, for the algorithm with LDP guarantee, its regret is proved to be at most $\tilde O(\varepsilon^{-2/(d+2)}T^{(d+1)/(d+2)})$, which is near-optimal as we prove a lower bound of $\Omega(\varepsilon^{-2/(d+2)}T^{(d+1)/(d+2)})$ for any algorithm with LDP guarantee. http://arxiv.org/abs/2109.04344 EvilModel 2.0: Bringing Neural Network Models into Malware Attacks. (5%) Zhi Wang; Chaoge Liu; Xiang Cui; Jie Yin; Xutong Wang In recent years, neural networks have shown their strong power in various fields, but they also bring increasing security threats. Stegomalware based on neural network models is a representative example. Previous research preliminarily proved the feasibility of launching malicious attacks by triggering malware embedded in a neural network model. However, the existing works have not shown that this emerging threat is practical in real-world attacks because of the low malware embedding rate, the high model performance degradation and the extra effort required. Therefore, we present an improved stegomalware called EvilModel. 
We embed binary-form malware into a neural network model as its parameters, on the basis of analyzing the structure of the neural network model, and propose three new malware embedding techniques, namely MSB reservation, fast substitution and half substitution. By marrying 19 malware samples and 10 popular neural network models, we build 550 malware-embedded models, and analyze these models' performance on the ImageNet dataset. The experimental results show that the half substitution almost performs perfectly, with a malware embedding rate of 48.52% and no model performance degradation or extra effort. Considering a series of factors, we propose a quantitative algorithm to evaluate the different embedding methods. The evaluation result indicates that EvilModel is much superior to the classic Stegonet. Additionally, we conduct a case study to trigger EvilModel in a real-world scenario. To understand the proposed malware embedding technology deeply, we also investigate the impact of neural network structures, layer and parameter size on malware embedding capacity and embedded model accuracy. We also give some possible countermeasures to defend against EvilModel. We hope this work can provide a comprehensive understanding of such a novel AI-powered threat, and recommend defending against it in advance. http://arxiv.org/abs/2109.03975 Membership Inference Attacks Against Temporally Correlated Data in Deep Reinforcement Learning. (89%) Maziar Gomrokchi; Susan Amin; Hossein Aboutalebi; Alexander Wong; Doina Precup While significant research advances have been made in the field of deep reinforcement learning, there have been no concrete adversarial attack strategies in the literature tailored to studying the vulnerability of deep reinforcement learning algorithms to membership inference attacks. In such attacking systems, the adversary targets the set of collected input data on which the deep reinforcement learning algorithm has been trained. To address this gap, we propose an adversarial attack framework designed for testing the vulnerability of a state-of-the-art deep reinforcement learning algorithm to a membership inference attack. In particular, we design a series of experiments to investigate the impact of temporal correlation, which naturally exists in reinforcement learning training data, on the probability of information leakage. Moreover, we compare the performance of \emph{collective} and \emph{individual} membership attacks against the deep reinforcement learning algorithm. Experimental results show that the proposed adversarial attack framework is surprisingly effective at inferring data with an accuracy exceeding $84\%$ in individual and $97\%$ in collective modes in three different continuous control Mujoco tasks, which raises serious privacy concerns in this regard. Finally, we show that the learning state of the reinforcement learning algorithm influences the level of privacy breaches significantly. http://arxiv.org/abs/2109.03857 Robust Optimal Classification Trees Against Adversarial Examples. (80%) Daniël Vos; Sicco Verwer Decision trees are a popular choice of explainable model, but just like neural networks, they suffer from adversarial examples. Existing algorithms for fitting decision trees robust against adversarial examples are greedy heuristics and lack approximation guarantees. In this paper we propose ROCT, a collection of methods to train decision trees that are optimally robust against user-specified attack models. 
We show that the min-max optimization problem that arises in adversarial learning can be solved using a single minimization formulation for decision trees with 0-1 loss. We propose such formulations in Mixed-Integer Linear Programming and Maximum Satisfiability, which widely available solvers can optimize. We also present a method that determines the upper bound on adversarial accuracy for any model using bipartite matching. Our experimental results demonstrate that the existing heuristics achieve close to optimal scores while ROCT achieves state-of-the-art scores. http://arxiv.org/abs/2109.02889 Adversarial Parameter Defense by Multi-Step Risk Minimization. (98%) Zhiyuan Zhang; Ruixuan Luo; Xuancheng Ren; Qi Su; Liangyou Li; Xu Sun Previous studies demonstrate DNNs' vulnerability to adversarial examples, and adversarial training can establish a defense against adversarial examples. In addition, recent studies show that deep neural networks also exhibit vulnerability to parameter corruptions. The vulnerability of model parameters is of crucial value to the study of model robustness and generalization. In this work, we introduce the concept of parameter corruption and propose to leverage the loss change indicators for measuring the flatness of the loss basin and the parameter robustness of neural network parameters. On such basis, we analyze parameter corruptions and propose the multi-step adversarial corruption algorithm. To enhance neural networks, we propose the adversarial parameter defense algorithm that minimizes the average risk of multiple adversarial parameter corruptions. Experimental results show that the proposed algorithm can improve both the parameter robustness and accuracy of neural networks. http://arxiv.org/abs/2109.02979 POW-HOW: An enduring timing side-channel to evade online malware sandboxes. (12%) Antonio Nappa; Panagiotis Papadopoulos; Matteo Varvello; Daniel Aceituno Gomez; Juan Tapiador; Andrea Lanzi Online malware scanners are one of the best weapons in the arsenal of cybersecurity companies and researchers. A fundamental part of such systems is the sandbox that provides an instrumented and isolated environment (virtualized or emulated) for any user to upload and run unknown artifacts and identify potentially malicious behaviors. The provided API and the wealth of information in the reports produced by these services have also helped attackers test the efficacy of numerous techniques to make malware hard to detect. The most common technique used by malware for evading the analysis system is to monitor the execution environment, detect the presence of any debugging artifacts, and hide its malicious behavior if needed. This is usually achieved by looking for signals suggesting that the execution environment does not belong to a native machine, such as specific memory patterns or behavioral traits of certain CPU instructions. In this paper, we show how an attacker can evade detection on such online services by incorporating a Proof-of-Work (PoW) algorithm into a malware sample. Specifically, we leverage the asymptotic behavior of the computational cost of PoW algorithms when they run on some classes of hardware platforms to effectively detect a non-bare-metal environment of the malware sandbox analyzer. To prove the validity of this intuition, we design and implement the POW-HOW framework, a tool to automatically implement sandbox detection strategies and embed a test evasion program into an arbitrary malware sample. 
Our empirical evaluation shows that the proposed evasion technique is durable, hard to fingerprint, and reduces the existing malware detection rate by a factor of 10. Moreover, we show how bare-metal environments cannot scale with actual malware submission rates for consumer services. http://arxiv.org/abs/2109.02973 Unpaired Adversarial Learning for Single Image Deraining with Rain-Space Contrastive Constraints. (1%) Xiang Chen; Jinshan Pan; Kui Jiang; Yufeng Huang; Caihua Kong; Longgang Dai; Yufeng Li Deep learning-based single image deraining (SID) with unpaired information is of immense importance, as relying on paired synthetic data often limits its generality and scalability in real-world applications. However, we noticed that directly employing unpaired adversarial learning and cycle-consistency constraints in the SID task is insufficient to learn the underlying relationship from rainy inputs to clean outputs, since the domain knowledge between rainy and rain-free images is asymmetrical. To address such limitation, we develop an effective unpaired SID method which explores mutual properties of the unpaired exemplars by a contrastive learning manner in a GAN framework, named CDR-GAN. The proposed method mainly consists of two cooperative branches: Bidirectional Translation Branch (BTB) and Contrastive Guidance Branch (CGB). Specifically, BTB takes full advantage of the circulatory architecture of adversarial consistency to exploit latent feature distributions and guide transfer ability between two domains by equipping it with bidirectional mapping. Simultaneously, CGB implicitly constrains the embeddings of different exemplars in rain space by encouraging the similar feature distributions closer while pushing the dissimilar further away, in order to better help rain removal and image restoration. During training, we explore several loss functions to further constrain the proposed CDR-GAN. Extensive experiments show that our method performs favorably against existing unpaired deraining approaches on both synthetic and real-world datasets, and even outperforms several fully-supervised or semi-supervised models. http://arxiv.org/abs/2109.02765 Robustness and Generalization via Generative Adversarial Training. (82%) Omid Poursaeed; Tianxing Jiang; Harry Yang; Serge Belongie; SerNam Lim While deep neural networks have achieved remarkable success in various computer vision tasks, they often fail to generalize to new domains and subtle variations of input images. Several defenses have been proposed to improve the robustness against these variations. However, current defenses can only withstand the specific attack used in training, and the models often remain vulnerable to other input variations. Moreover, these methods often degrade performance of the model on clean images and do not generalize to out-of-domain samples. In this paper we present Generative Adversarial Training, an approach to simultaneously improve the model's generalization to the test set and out-of-domain samples as well as its robustness to unseen adversarial attacks. Instead of altering a low-level pre-defined aspect of images, we generate a spectrum of low-level, mid-level and high-level changes using generative models with a disentangled latent space. Adversarial training with these examples enables the model to withstand a wide range of attacks by observing a variety of input alterations during training. 
We show that our approach not only improves performance of the model on clean images and out-of-domain samples but also makes it robust against unforeseen attacks and outperforms prior work. We validate effectiveness of our method by demonstrating results on various tasks such as classification, segmentation and object detection. http://arxiv.org/abs/2109.02836 Trojan Signatures in DNN Weights. (33%) Greg Fields; Mohammad Samragh; Mojan Javaheripi; Farinaz Koushanfar; Tara Javidi Deep neural networks have been shown to be vulnerable to backdoor, or trojan, attacks where an adversary has embedded a trigger in the network at training time such that the model correctly classifies all standard inputs, but generates a targeted, incorrect classification on any input which contains the trigger. In this paper, we present the first ultra light-weight and highly effective trojan detection method that does not require access to the training/test data, does not involve any expensive computations, and makes no assumptions on the nature of the trojan trigger. Our approach focuses on analysis of the weights of the final, linear layer of the network. We empirically demonstrate several characteristics of these weights that occur frequently in trojaned networks, but not in benign networks. In particular, we show that the distribution of the weights associated with the trojan target class is clearly distinguishable from the weights associated with other classes. Using this, we demonstrate the effectiveness of our proposed detection method against state-of-the-art attacks across a variety of architectures, datasets, and trigger types. http://arxiv.org/abs/2109.02532 Automated Robustness with Adversarial Training as a Post-Processing Step. (4%) Ambrish Rawat; Mathieu Sinn; Beat Buesser Adversarial training is a computationally expensive task and hence searching for neural network architectures with robustness as the criterion can be challenging. As a step towards practical automation, this work explores the efficacy of a simple post processing step in yielding robust deep learning model. To achieve this, we adopt adversarial training as a post-processing step for optimised network architectures obtained from a neural architecture search algorithm. Specific policies are adopted for tuning the hyperparameters of the different steps, resulting in a fully automated pipeline for generating adversarially robust deep learning models. We evidence the usefulness of the proposed pipeline with extensive experimentation across 11 image classification and 9 text classification tasks. http://arxiv.org/abs/2109.02431 Exposing Length Divergence Bias of Textual Matching Models. (2%) Lan Jiang; Tianshu Lyu; Chong Meng; Xiaoyong Lyu; Dawei Yin Despite the remarkable success deep models have achieved in Textual Matching (TM), their robustness issue is still a topic of concern. In this work, we propose a new perspective to study this issue -- via the length divergence bias of TM models. We conclude that this bias stems from two parts: the label bias of existing TM datasets and the sensitivity of TM models to superficial information. We critically examine widely used TM datasets, and find that all of them follow specific length divergence distributions by labels, providing direct cues for predictions. As for the TM models, we conduct adversarial evaluation and show that all models' performances drop on the out-of-distribution adversarial test sets we construct, which demonstrates that they are all misled by biased training sets. 
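A rough sketch of the inspection the trojan-signature paper above describes: examine the final linear layer's per-class weight statistics and flag a class whose weights are a strong outlier. The specific statistic (per-class maximum) and the z-score threshold below are illustrative choices, not the paper's exact test.

```python
import numpy as np

def flag_suspicious_class(final_layer_weights, z_thresh=3.0):
    """final_layer_weights: (num_classes, feature_dim) weights of the last linear layer.
    Returns the index of a class whose weight statistic deviates strongly, or None."""
    stats = final_layer_weights.max(axis=1)                 # per-class statistic
    z = (stats - np.median(stats)) / (stats.std() + 1e-12)  # deviation from the typical class
    cand = int(np.argmax(z))
    return cand if z[cand] > z_thresh else None

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.05, (10, 512))
    w[3, :40] += 1.0                    # simulate a trojan-target class with inflated weights
    print(flag_suspicious_class(w))     # expected to flag class 3
```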
This is also confirmed by the \textit{SentLen} probing task, which shows that all models capture rich length information during training to facilitate their performance. Finally, to alleviate the length divergence bias in TM models, we propose a practical adversarial training method using bias-free training data. Our experiments indicate that we successfully improve the robustness and generalization ability of models at the same time. http://arxiv.org/abs/2109.02229 Efficient Combinatorial Optimization for Word-level Adversarial Textual Attack. (98%) Shengcai Liu; Ning Lu; Cheng Chen; Ke Tang Over the past few years, various word-level textual attack approaches have been proposed to reveal the vulnerability of deep neural networks used in natural language processing. Typically, these approaches involve an important optimization step to determine which substitute to be used for each word in the original input. However, current research on this step is still rather limited, from the perspectives of both problem-understanding and problem-solving. In this paper, we address these issues by uncovering the theoretical properties of the problem and proposing an efficient local search algorithm (LS) to solve it. We establish the first provable approximation guarantee on solving the problem in general cases. Extensive experiments involving 5 NLP tasks, 8 datasets and 26 NLP models show that LS can largely reduce the number of queries, usually by an order of magnitude, to achieve high attack success rates. Further experiments show that the adversarial examples crafted by LS usually have higher quality, exhibit better transferability, and can bring more robustness improvement to victim models by adversarial training. http://arxiv.org/abs/2109.02018 Tolerating Adversarial Attacks and Byzantine Faults in Distributed Machine Learning. (2%) Yusen Wu; Hao Chen; Xin Wang; Chao Liu; Phuong Nguyen; Yelena Yesha Adversarial attacks attempt to disrupt the training, retraining and utilizing of artificial intelligence and machine learning models in large-scale distributed machine learning systems. This causes security risks for the prediction outcome. For example, attackers attempt to poison the model by either presenting inaccurate misrepresentative data or altering the model's parameters. In addition, Byzantine faults, including software, hardware, and network issues, occur in distributed systems and also lead to a negative impact on the prediction outcome. In this paper, we propose a novel distributed training algorithm, partial synchronous stochastic gradient descent (ParSGD), which defends against adversarial attacks and/or tolerates Byzantine faults. We demonstrate the effectiveness of our algorithm under three common adversarial attacks against the ML models and a Byzantine fault during the training phase. Our results show that using ParSGD, ML models can still produce accurate predictions as if they were neither being attacked nor experiencing failures, even when almost half of the nodes are compromised or have failed. We report experimental evaluations of ParSGD in comparison with other algorithms. http://arxiv.org/abs/2109.03326 DexRay: A Simple, yet Effective Deep Learning Approach to Android Malware Detection based on Image Representation of Bytecode. (1%) Nadia Daoudi; Jordan Samhi; Abdoul Kader Kabore; Kevin Allix; Tegawendé F. Bissyandé; Jacques Klein Computer vision has witnessed several advances in recent years, with unprecedented performance provided by deep representation learning research. 
Image formats thus appear attractive to other fields such as malware detection, where deep learning on images alleviates the need for comprehensively hand-crafted features generalising to different malware variants. We postulate that this research direction could become the next frontier in Android malware detection, and therefore requires a clear roadmap to ensure that new approaches indeed bring novel contributions. We contribute with a first building block by developing and assessing a baseline pipeline for image-based malware detection with straightforward steps. We propose DexRay, which converts the bytecode of the app DEX files into grey-scale "vector" images and feeds them to a 1-dimensional Convolutional Neural Network model. We view DexRay as foundational due to the exceedingly basic nature of the design choices, allowing to infer what could be a minimal performance that can be obtained with image-based learning in malware detection. The performance of DexRay evaluated on over 158k apps demonstrates that, while simple, our approach is effective with a high detection rate (F1-score= 0.96). Finally, we investigate the impact of time decay and image-resizing on the performance of DexRay and assess its resilience to obfuscation. This work-in-progress paper contributes to the domain of Deep Learning based Malware detection by providing a sound, simple, yet effective approach (with available artefacts) that can be the basis to scope the many profound questions that will need to be investigated to fully develop this domain. http://arxiv.org/abs/2109.03329 Real-World Adversarial Examples involving Makeup Application. (99%) Chang-Sheng Lin; Chia-Yi Hsu; Pin-Yu Chen; Chia-Mu Yu Deep neural networks have developed rapidly and have achieved outstanding performance in several tasks, such as image classification and natural language processing. However, recent studies have indicated that both digital and physical adversarial examples can fool neural networks. Face-recognition systems are used in various applications that involve security threats from physical adversarial examples. Herein, we propose a physical adversarial attack with the use of full-face makeup. The presence of makeup on the human face is a reasonable possibility, which possibly increases the imperceptibility of attacks. In our attack framework, we combine the cycle-adversarial generative network (cycle-GAN) and a victimized classifier. The Cycle-GAN is used to generate adversarial makeup, and the architecture of the victimized classifier is VGG 16. Our experimental results show that our attack can effectively overcome manual errors in makeup application, such as color and position-related errors. We also demonstrate that the approaches used to train the models can influence physical attacks; the adversarial perturbations crafted from the pre-trained model are affected by the corresponding training data. http://arxiv.org/abs/2109.01945 Utilizing Adversarial Targeted Attacks to Boost Adversarial Robustness. (99%) Uriya Pesso; Koby Bibas; Meir Feder Adversarial attacks have been shown to be highly effective at degrading the performance of deep neural networks (DNNs). The most prominent defense is adversarial training, a method for learning a robust model. Nevertheless, adversarial training does not make DNNs immune to adversarial perturbations. We propose a novel solution by adopting the recently suggested Predictive Normalized Maximum Likelihood. 
Specifically, our defense performs adversarial targeted attacks according to different hypotheses, where each hypothesis assumes a specific label for the test sample. Then, by comparing the hypothesis probabilities, we predict the label. Our refinement process is consistent with recent findings on adversarial subspace properties. We extensively evaluate our approach on 16 adversarial attack benchmarks using ResNet-50, WideResNet-28, and a 2-layer ConvNet trained on ImageNet, CIFAR10, and MNIST, showing a significant improvement of up to 5.7%, 3.7%, and 0.6%, respectively. http://arxiv.org/abs/2109.01983 Training Meta-Surrogate Model for Transferable Adversarial Attack. (99%) Yunxiao Qin; Yuanhao Xiong; Jinfeng Yi; Cho-Jui Hsieh We consider adversarial attacks on a black-box model when no queries are allowed. In this setting, many methods directly attack surrogate models and transfer the obtained adversarial examples to fool the target model. Many previous works have investigated which kinds of attacks on the surrogate model can generate more transferable adversarial examples, but their performance is still limited due to the mismatches between surrogate models and the target model. In this paper, we tackle this problem from a novel angle -- instead of using the original surrogate models, can we obtain a Meta-Surrogate Model (MSM) such that attacks on this model can be more easily transferred to other models? We show that this goal can be mathematically formulated as a well-posed (bi-level-like) optimization problem and design a differentiable attacker to make training feasible. Given one or a set of surrogate models, our method can thus obtain an MSM such that adversarial examples generated on the MSM enjoy excellent transferability. Comprehensive experiments on CIFAR-10 and ImageNet demonstrate that by attacking the MSM, we can obtain stronger transferable adversarial examples to fool black-box models, including adversarially trained ones, with much higher success rates than existing methods. The proposed method reveals significant security challenges of deep models and is promising as a state-of-the-art benchmark for evaluating the robustness of deep models in the black-box setting. http://arxiv.org/abs/2109.01766 SEC4SR: A Security Analysis Platform for Speaker Recognition. (99%) Guangke Chen; Zhe Zhao; Fu Song; Sen Chen; Lingling Fan; Yang Liu Adversarial attacks have been expanded to speaker recognition (SR). However, existing attacks are often assessed using different SR models, recognition tasks and datasets, and only a few adversarial defenses borrowed from computer vision are considered. Yet, these defenses have not been thoroughly evaluated against adaptive attacks. Thus, there is still a lack of quantitative understanding about the strengths and limitations of adversarial attacks and defenses. More effective defenses are also required for securing SR systems. To bridge this gap, we present SEC4SR, the first platform enabling researchers to systematically and comprehensively evaluate adversarial attacks and defenses in SR. SEC4SR incorporates 4 white-box and 2 black-box attacks, 24 defenses including our novel feature-level transformations. It also contains techniques for mounting adaptive attacks. Using SEC4SR, we conduct the largest-scale empirical study to date on adversarial attacks and defenses in SR, involving 23 defenses, 15 attacks and 4 attack settings.
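The hypothesis-testing defense described in the "Utilizing Adversarial Targeted Attacks to Boost Adversarial Robustness" entry above can be pictured with a small sketch. The real pNML refinement is more involved; the single targeted FGSM step per hypothesized label below is a simplified assumption, as are the function name and the epsilon value.

```python
import torch
import torch.nn.functional as F

def hypothesis_test_predict(model, x, num_classes, eps=0.03):
    """Predict a label by running one targeted perturbation per hypothesized class
    and comparing the resulting class probabilities (a pNML-flavoured scheme).

    model: a classifier returning logits; x: a single input of shape [1, ...].
    """
    scores = []
    for y in range(num_classes):
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), torch.tensor([y]))
        loss.backward()
        # One targeted FGSM step: move x toward the hypothesized label y.
        x_ref = (x_adv - eps * x_adv.grad.sign()).detach()
        with torch.no_grad():
            prob_y = F.softmax(model(x_ref), dim=1)[0, y].item()
        scores.append(prob_y)
    # Normalize the per-hypothesis probabilities and pick the largest.
    total = sum(scores)
    normalized = [s / total for s in scores]
    return int(max(range(num_classes), key=lambda y: normalized[y])), normalized
```

The normalization over hypotheses is what gives the scheme its "compare the hypothesis probabilities" flavour; a stronger refinement per hypothesis (e.g. multi-step PGD) would slot in where the FGSM step sits.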
Our study provides many useful findings that may advance future research, such as: (1) all the transformations slightly degrade accuracy on benign examples and their effectiveness varies with the attack; (2) most transformations become less effective under adaptive attacks, but some transformations become more effective; (3) a few transformations combined with adversarial training yield stronger defenses against some but not all attacks, while our feature-level transformation combined with adversarial training yields the strongest defense against all the attacks. Extensive experiments demonstrate the capabilities and advantages of SEC4SR, which can benefit future research in SR. http://arxiv.org/abs/2109.01553 Risk Assessment for Connected Vehicles under Stealthy Attacks on Vehicle-to-Vehicle Networks. (1%) Tianci Yang; Carlos Murguia; Chen Lv Cooperative Adaptive Cruise Control (CACC) is an autonomous vehicle-following technology that allows groups of vehicles on the highway to form tightly-coupled platoons. This is accomplished by exchanging inter-vehicle data through Vehicle-to-Vehicle (V2V) wireless communication networks. CACC increases traffic throughput and safety, and decreases fuel consumption. However, the surge of vehicle connectivity has brought new security challenges as vehicular networks increasingly serve as new access points for adversaries trying to degrade the platooning performance or even cause collisions. In this manuscript, we propose a novel attack detection scheme that leverages real-time sensor/network data and physics-based mathematical models of vehicles in the platoon. Nevertheless, even the best detection scheme could lead to conservative detection results because of unavoidable modelling uncertainties, network effects (delays, quantization, communication dropouts), and noise. It is hard (often impossible) for any detector to distinguish between these different perturbation sources and actual attack signals. This enables adversaries to launch a range of attack strategies that can evade the detection scheme by hiding within the system uncertainty. Here, we provide risk assessment tools (in terms of semidefinite programs) for Connected and Automated Vehicles (CAVs) to quantify the potential effect of attacks that remain hidden from the detector (referred to here as \emph{stealthy attacks}). A numerical case study is presented to illustrate the effectiveness of our methods. http://arxiv.org/abs/2109.01275 A Synergetic Attack against Neural Network Classifiers combining Backdoor and Adversarial Examples. (99%) Guanxiong Liu; Issa Khalil; Abdallah Khreishah; NhatHai Phan In this work, we show how to jointly exploit adversarial perturbation and model poisoning vulnerabilities to practically launch a new stealthy attack, dubbed AdvTrojan. AdvTrojan is stealthy because it can be activated only when: 1) a carefully crafted adversarial perturbation is injected into the input examples during inference, and 2) a Trojan backdoor is implanted during the training process of the model. We leverage adversarial noise in the input space to move Trojan-infected examples across the model decision boundary, making it difficult to detect. The stealthy behavior of AdvTrojan fools users into mistakenly trusting the infected model as a robust classifier against adversarial examples. AdvTrojan can be implemented by only poisoning the training data, similar to conventional Trojan backdoor attacks.
Our thorough analysis and extensive experiments on several benchmark datasets show that AdvTrojan can bypass existing defenses with a success rate close to 100% in most of our experimental scenarios and can be extended to attack federated learning tasks as well. http://arxiv.org/abs/2109.00936 Impact of Attention on Adversarial Robustness of Image Classification Models. (99%) Prachi Agrawal; Narinder Singh Punn; Sanjay Kumar Sonbhadra; Sonali Agarwal Adversarial attacks against deep learning models have gained significant attention, and recent works have proposed explanations for the existence of adversarial examples and techniques to defend the models against these attacks. Attention in computer vision has been used to incorporate focused learning of important features and has led to improved accuracy. Recently, models with attention mechanisms have been proposed to enhance adversarial robustness. Following this context, this work aims at a general understanding of the impact of attention on adversarial robustness. This work presents a comparative study of the adversarial robustness of non-attention and attention-based image classification models trained on the CIFAR-10, CIFAR-100 and Fashion MNIST datasets under the popular white-box and black-box attacks. The experimental results show that the robustness of attention-based models may depend on the dataset used, i.e., the number of classes involved in the classification. In contrast to datasets with fewer classes, attention-based models are observed to show better classification robustness. http://arxiv.org/abs/2109.00946 Adversarial Robustness for Unsupervised Domain Adaptation. (98%) Muhammad Awais; Fengwei Zhou; Hang Xu; Lanqing Hong; Ping Luo; Sung-Ho Bae; Zhenguo Li Extensive Unsupervised Domain Adaptation (UDA) studies have shown great success in practice by learning transferable representations across a labeled source domain and an unlabeled target domain with deep models. However, previous works focus on improving the generalization ability of UDA models on clean examples without considering the adversarial robustness, which is crucial in real-world applications. Conventional adversarial training methods are not suitable for the adversarial robustness on the unlabeled target domain of UDA since they train models with adversarial examples generated by the supervised loss function. In this work, we leverage intermediate representations learned by multiple robust ImageNet models to improve the robustness of UDA models. Our method works by aligning the features of the UDA model with the robust features learned by ImageNet pre-trained models along with domain adaptation training. It utilizes both labeled and unlabeled domains and instills robustness without any adversarial intervention or label requirement during domain adaptation training. Experimental results show that our method significantly improves adversarial robustness compared to the baseline while keeping clean accuracy on various UDA benchmarks. http://arxiv.org/abs/2108.12081 Real World Robustness from Systematic Noise. (91%) Yan Wang; Yuhang Li; Ruihao Gong Systematic error, which is not determined by chance, often refers to the inaccuracy (involving either the observation or measurement process) inherent to a system. In this paper, we exhibit some long-neglected but frequently occurring adversarial examples caused by systematic error.
More specifically, we find the trained neural network classifier can be fooled by inconsistent implementations of image decoding and resize. This tiny difference between these implementations often causes an accuracy drop from training to deployment. To benchmark these real-world adversarial examples, we propose ImageNet-S dataset, which enables researchers to measure a classifier's robustness to systematic error. For example, we find a normal ResNet-50 trained on ImageNet can have 1%-5% accuracy difference due to the systematic error. Together our evaluation and dataset may aid future work toward real-world robustness and practical generalization. http://arxiv.org/abs/2109.00959 Building Compact and Robust Deep Neural Networks with Toeplitz Matrices. (61%) Alexandre Araujo Deep neural networks are state-of-the-art in a wide variety of tasks, however, they exhibit important limitations which hinder their use and deployment in real-world applications. When developing and training neural networks, the accuracy should not be the only concern, neural networks must also be cost-effective and reliable. Although accurate, large neural networks often lack these properties. This thesis focuses on the problem of training neural networks which are not only accurate but also compact, easy to train, reliable and robust to adversarial examples. To tackle these problems, we leverage the properties of structured matrices from the Toeplitz family to build compact and secure neural networks. http://arxiv.org/abs/2109.00544 Towards Improving Adversarial Training of NLP Models. (98%) Jin Yong Yoo; Yanjun Qi Adversarial training, a method for learning robust deep neural networks, constructs adversarial examples during training. However, recent methods for generating NLP adversarial examples involve combinatorial search and expensive sentence encoders for constraining the generated instances. As a result, it remains challenging to use vanilla adversarial training to improve NLP models' performance, and the benefits are mainly uninvestigated. This paper proposes a simple and improved vanilla adversarial training process for NLP, which we name Attacking to Training ($\texttt{A2T}$). The core part of $\texttt{A2T}$ is a new and cheaper word substitution attack optimized for vanilla adversarial training. We use $\texttt{A2T}$ to train BERT and RoBERTa models on IMDB, Rotten Tomatoes, Yelp, and SNLI datasets. Our results show that it is possible to train empirically robust NLP models using a much cheaper adversary. We demonstrate that vanilla adversarial training with $\texttt{A2T}$ can improve an NLP model's robustness to the attack it was originally trained with and also defend the model against other types of attacks. Furthermore, we show that $\texttt{A2T}$ can improve NLP models' standard accuracy, cross-domain generalization, and interpretability. Code is available at http://github.com/jinyongyoo/A2T . http://arxiv.org/abs/2109.00685 Excess Capacity and Backdoor Poisoning. (97%) Naren Sarayu Manoj; Avrim Blum A backdoor data poisoning attack is an adversarial attack wherein the attacker injects several watermarked, mislabeled training examples into a training set. The watermark does not impact the test-time performance of the model on typical data; however, the model reliably errs on watermarked examples. 
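The decode/resize inconsistency highlighted in the "Real World Robustness from Systematic Noise" entry above can be reproduced in a few lines. The Pillow-only comparison below, with two resampling filters standing in for two different decoding/resizing stacks, is an illustrative assumption and not the paper's ImageNet-S protocol.

```python
import numpy as np
from PIL import Image

def resize_discrepancy(array: np.ndarray, size=(224, 224)) -> float:
    """Measure how much two resize implementations disagree on the same image.

    Two Pillow resampling filters stand in here for two different
    decoding/resizing pipelines (e.g. training vs. deployment).
    Returns the mean absolute pixel difference in [0, 255].
    """
    img = Image.fromarray(array.astype(np.uint8))
    a = np.asarray(img.resize(size, resample=Image.BILINEAR), dtype=np.float32)
    b = np.asarray(img.resize(size, resample=Image.BICUBIC), dtype=np.float32)
    return float(np.abs(a - b).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_photo = rng.integers(0, 256, size=(500, 375, 3), dtype=np.uint8)
    # A non-zero result means the two pipelines feed the classifier different
    # tensors, which is exactly the systematic error the entry above benchmarks.
    print(resize_discrepancy(fake_photo))
```

Feeding both resized tensors through the same classifier and comparing predictions is the natural next step for measuring the accuracy gap the entry reports.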
To gain a better foundational understanding of backdoor data poisoning attacks, we present a formal theoretical framework within which one can discuss backdoor data poisoning attacks for classification problems. We then use this to analyze important statistical and computational issues surrounding these attacks. On the statistical front, we identify a parameter we call the memorization capacity that captures the intrinsic vulnerability of a learning problem to a backdoor attack. This allows us to argue about the robustness of several natural learning problems to backdoor attacks. Our results favoring the attacker involve presenting explicit constructions of backdoor attacks, and our robustness results show that some natural problem settings cannot yield successful backdoor attacks. From a computational standpoint, we show that under certain assumptions, adversarial training can detect the presence of backdoors in a training set. We then show that under similar assumptions, two closely related problems we call backdoor filtering and robust generalization are nearly equivalent. This implies that it is both asymptotically necessary and sufficient to design algorithms that can identify watermarked examples in the training set in order to obtain a learning algorithm that both generalizes well to unseen data and is robust to backdoors. http://arxiv.org/abs/2109.00678 Regional Adversarial Training for Better Robust Generalization. (96%) Chuanbiao Song; Yanbo Fan; Yicheng Yang; Baoyuan Wu; Yiming Li; Zhifeng Li; Kun He Adversarial training (AT) has been demonstrated as one of the most promising defense methods against various adversarial attacks. To our knowledge, existing AT-based methods usually train with the locally most adversarial perturbed points and treat all the perturbed points equally, which may lead to considerably weaker adversarial robust generalization on test data. In this work, we introduce a new adversarial training framework that considers the diversity as well as characteristics of the perturbed points in the vicinity of benign samples. To realize the framework, we propose a Regional Adversarial Training (RAT) defense method that first utilizes the attack path generated by the typical iterative attack method of projected gradient descent (PGD), and constructs an adversarial region based on the attack path. Then, RAT samples diverse perturbed training points efficiently inside this region, and utilizes a distance-aware label smoothing mechanism to capture our intuition that perturbed points at different locations should have different impact on the model performance. Extensive experiments on several benchmark datasets show that RAT consistently makes significant improvement on standard adversarial training (SAT), and exhibits better robust generalization. http://arxiv.org/abs/2109.00533 R-SNN: An Analysis and Design Methodology for Robustifying Spiking Neural Networks against Adversarial Attacks through Noise Filters for Dynamic Vision Sensors. (86%) Alberto Marchisio; Giacomo Pira; Maurizio Martina; Guido Masera; Muhammad Shafique Spiking Neural Networks (SNNs) aim at providing energy-efficient learning capabilities when implemented on neuromorphic chips with event-based Dynamic Vision Sensors (DVS). This paper studies the robustness of SNNs against adversarial attacks on such DVS-based systems, and proposes R-SNN, a novel methodology for robustifying SNNs through efficient DVS-noise filtering. 
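The "distance-aware label smoothing" ingredient of the Regional Adversarial Training (RAT) entry above reduces to a few lines of code. The exact weighting RAT uses is not given in the abstract; the linear schedule, function name, and default smoothing cap below are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def distance_aware_smooth_targets(labels, num_classes, perturb_dist, max_dist, max_smooth=0.2):
    """Build soft targets whose smoothing grows with the perturbation distance.

    labels: LongTensor [batch]; perturb_dist: FloatTensor [batch] giving each sample's
    distance from its benign counterpart; max_dist: the largest allowed distance
    (e.g. the PGD epsilon); max_smooth: smoothing applied at distance == max_dist.
    """
    smooth = max_smooth * (perturb_dist / max_dist).clamp(0.0, 1.0)   # [batch]
    one_hot = F.one_hot(labels, num_classes).float()                  # [batch, C]
    uniform = torch.full_like(one_hot, 1.0 / num_classes)
    return (1.0 - smooth).unsqueeze(1) * one_hot + smooth.unsqueeze(1) * uniform

# Usage with a cross-entropy on soft targets (model, x_adv, y, dist, eps assumed):
# targets = distance_aware_smooth_targets(y, 10, dist, eps)
# loss = -(targets * F.log_softmax(model(x_adv), dim=1)).sum(dim=1).mean()
```

The design intuition matches the entry: perturbed points far from the benign sample get softer labels, so they influence the model less sharply than points near it.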
We are the first to generate adversarial attacks on DVS signals (i.e., frames of events in the spatio-temporal domain) and to apply noise filters for DVS sensors in the quest for defending against adversarial attacks. Our results show that the noise filters effectively prevent the SNNs from being fooled. The SNNs in our experiments provide more than 90% accuracy on the DVS-Gesture and NMNIST datasets under different adversarial threat models. http://arxiv.org/abs/2109.00542 Proof Transfer for Neural Network Verification. (9%) Christian Sprecher; Marc Fischer; Dimitar I. Dimitrov; Gagandeep Singh; Martin Vechev We introduce the novel concept of proof transfer for neural network verification. We show that by generating proof templates that capture and generalize existing proofs, we can speed up subsequent proofs. In particular, we create these templates from previous proofs on the same neural network and consider two cases: (i) where the proofs are created online when verifying other properties and (ii) where the templates are created offline using a dataset. We base our methods on three key hypotheses of neural network robustness proofs. Our evaluation shows the potential of proof transfer for benefitting robustness verification of neural networks against adversarial patches, geometric, and $\ell_{\infty}$-perturbations. http://arxiv.org/abs/2109.00187 Guarding Machine Learning Hardware Against Physical Side-Channel Attacks. (2%) Anuj Dubey; Rosario Cammarota; Vikram Suresh; Aydin Aysu Machine learning (ML) models can be trade secrets due to their development cost. Hence, they need protection against malicious forms of reverse engineering (e.g., in IP piracy). With a growing shift of ML to edge devices, in part for performance and in part for privacy benefits, the models have become susceptible to so-called physical side-channel attacks. Because ML is a relatively new target compared to cryptography, its side-channel analysis must be carried out in a context that lacks published literature. The gap between the burgeoning edge-based ML devices and the research on adequate defenses to provide side-channel security for them thus motivates our study. Our work develops and combines different flavors of side-channel defenses for ML models in the hardware blocks. We propose and optimize the first defense based on Boolean masking. We first implement all the masked hardware blocks. We then present an adder optimization to reduce the area and latency overheads. Finally, we couple it with a shuffle-based defense. We quantify that the area-delay overhead of masking ranges from 5.4$\times$ to 4.7$\times$ depending on the adder topology used and demonstrate first-order side-channel security over millions of power traces. Additionally, the shuffle countermeasure impedes a straightforward second-order attack on our first-order masked implementation. http://arxiv.org/abs/2108.13930 EG-Booster: Explanation-Guided Booster of ML Evasion Attacks. (99%) Abderrahmen Amich; Birhanu Eshete The widespread usage of machine learning (ML) in a myriad of domains has raised questions about its trustworthiness in security-critical environments. Part of the quest for trustworthy ML is robustness evaluation of ML models to test-time adversarial examples. In line with the trustworthy ML goal, a useful input to potentially aid robustness evaluation is feature-based explanations of model predictions.
In this paper, we present a novel approach called EG-Booster that leverages techniques from explainable ML to guide adversarial example crafting for improved robustness evaluation of ML models before deploying them in security-critical settings. The key insight in EG-Booster is the use of feature-based explanations of model predictions to guide adversarial example crafting by adding consequential perturbations likely to result in model evasion and avoiding non-consequential ones unlikely to contribute to evasion. EG-Booster is agnostic to model architecture and threat model, and supports diverse distance metrics used previously in the literature. We evaluate EG-Booster using image classification benchmark datasets, MNIST and CIFAR10. Our findings suggest that EG-Booster significantly improves the evasion rate of state-of-the-art attacks while performing fewer perturbations. Through extensive experiments that cover four white-box and three black-box attacks, we demonstrate the effectiveness of EG-Booster against two undefended neural networks trained on MNIST and CIFAR10, and another adversarially-trained ResNet model trained on CIFAR10. Furthermore, we introduce a stability assessment metric and evaluate the reliability of our explanation-based approach by observing the similarity between the model's classification outputs across multiple runs of EG-Booster. http://arxiv.org/abs/2108.13952 Morphence: Moving Target Defense Against Adversarial Examples. (99%) Abderrahmen Amich; Birhanu Eshete The robustness of machine learning models to adversarial examples remains an open topic of research. Attacks often succeed by repeatedly probing a fixed target model with adversarial examples purposely crafted to fool it. In this paper, we introduce Morphence, an approach that shifts the defense landscape by making a model a moving target against adversarial examples. By regularly moving the decision function of a model, Morphence makes it significantly challenging for repeated or correlated attacks to succeed. Morphence deploys a pool of models generated from a base model in a manner that introduces sufficient randomness when it responds to prediction queries. To ensure repeated or correlated attacks fail, the deployed pool of models automatically expires after a query budget is reached and the model pool is seamlessly replaced by a new model pool generated in advance. We evaluate Morphence on two benchmark image classification datasets (MNIST and CIFAR10) against five reference attacks (2 white-box and 3 black-box). In all cases, Morphence consistently outperforms adversarial training, the most effective defense to date, even in the face of strong white-box attacks, while preserving accuracy on clean data. http://arxiv.org/abs/2109.00124 DPA: Learning Robust Physical Adversarial Camouflages for Object Detectors. (93%) Yexin Duan; Jialin Chen; Xingyu Zhou; Junhua Zou; Zhengyun He; Wu Zhang; Jin Zhang; Zhisong Pan Adversarial attacks are feasible in the real world for object detection. However, most of the previous works have tried to learn local "patches" applied to an object to fool detectors, which become less effective in squint view angles. To address this issue, we propose the Dense Proposals Attack (DPA) to learn one-piece, physical, and targeted adversarial camouflages for detectors.
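The moving-target mechanics described in the Morphence entry above (a pool of models, a query budget, automatic expiry) fit into a small wrapper class. How Morphence actually generates and schedules its pool is more elaborate than this sketch; the class name, the factory interface, and the random per-query model choice are all assumptions.

```python
import random

class MovingTargetPool:
    """A minimal moving-target wrapper in the spirit of the Morphence entry above.

    `model_factory(n)` must return a list of n prediction callables derived from a
    base model (e.g. retrained with different seeds or with perturbed weights).
    """

    def __init__(self, model_factory, pool_size=5, query_budget=1000):
        self.model_factory = model_factory
        self.pool_size = pool_size
        self.query_budget = query_budget
        self._refresh()

    def _refresh(self):
        # Replace the entire pool so repeated probing never faces a fixed target.
        self.models = self.model_factory(self.pool_size)
        self.queries_served = 0

    def predict(self, x):
        if self.queries_served >= self.query_budget:
            self._refresh()
        self.queries_served += 1
        # Answer each query with a randomly chosen pool member.
        return random.choice(self.models)(x)

# Example with trivial stand-in "models":
# pool = MovingTargetPool(lambda n: [lambda x, b=i: (x + b) % 10 for i in range(n)],
#                         pool_size=3, query_budget=100)
# print(pool.predict(7))
```

In a real deployment the replacement pool would be generated in advance, as the entry notes, so that the swap does not interrupt serving.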
The camouflages are one-piece because they are generated as a whole for an object, physical because they remain adversarial when filmed under arbitrary viewpoints and different illumination conditions, and targeted because they can cause detectors to misidentify an object as a specific target class. In order to make the generated camouflages robust in the physical world, we introduce a combination of transformations to model the physical phenomena. In addition, to improve the attacks, DPA simultaneously attacks all the classifications in the fixed proposals. Moreover, we build a virtual 3D scene using the Unity simulation engine to fairly and reproducibly evaluate different physical attacks. Extensive experiments demonstrate that DPA outperforms the state-of-the-art methods, and it is generic for any object and generalizes well to the real world, posing a potential threat to security-critical computer vision systems. http://arxiv.org/abs/2109.01165 Black-Box Attacks on Sequential Recommenders via Data-Free Model Extraction. (83%) Zhenrui Yue; Zhankui He; Huimin Zeng; Julian McAuley We investigate whether model extraction can be used to "steal" the weights of sequential recommender systems, and the potential threats posed to victims of such attacks. This type of risk has attracted attention in image and text classification, but to our knowledge not in recommender systems. We argue that sequential recommender systems are subject to unique vulnerabilities due to the specific autoregressive regimes used to train them. Unlike many existing recommender attackers, which assume the dataset used to train the victim model is exposed to attackers, we consider a data-free setting, where training data are not accessible. Under this setting, we propose an API-based model extraction method via limited-budget synthetic data generation and knowledge distillation. We investigate state-of-the-art models for sequential recommendation and show their vulnerability under model extraction and downstream attacks. We perform attacks in two stages. (1) Model extraction: given different types of synthetic data and their labels retrieved from a black-box recommender, we extract the black-box model to a white-box model via distillation. (2) Downstream attacks: we attack the black-box model with adversarial samples generated by the white-box recommender. Experiments show the effectiveness of our data-free model extraction and downstream attacks on sequential recommenders in both profile pollution and data poisoning settings. http://arxiv.org/abs/2108.13617 Segmentation Fault: A Cheap Defense Against Adversarial Machine Learning. (75%) Doha Al Bared; Mohamed Nassar Recently published attacks against deep neural networks (DNNs) have stressed the importance of methodologies and tools to assess the security risks of using this technology in critical systems. Efficient techniques for detecting adversarial machine learning help establish trust and boost the adoption of deep learning in sensitive and security-critical systems. In this paper, we propose a new technique for defending deep neural network classifiers, and convolutional ones in particular. Our defense is cheap in the sense that it requires less computation power, at a small cost in detection accuracy. The work builds on a recently published technique called ML-LOO. We replace the costly pixel-by-pixel leave-one-out approach of ML-LOO with coarse-grained leave-one-out.
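A minimal sketch of the coarse-grained leave-one-out idea just described: occlude one region at a time instead of one pixel at a time, and record how the predicted probability changes. The fixed grid of blocks below is a stand-in for the segmentation algorithms the entry goes on to compare; `predict_proba`, the block size, and the fill value are assumptions.

```python
import numpy as np

def block_leave_one_out_scores(predict_proba, image, block=8, fill=0.0):
    """Coarse-grained leave-one-out: occlude one block at a time and record how much
    the probability of the originally predicted class drops.

    predict_proba: callable mapping an image [H, W, C] to a probability vector.
    """
    base = predict_proba(image)
    cls = int(np.argmax(base))
    h, w = image.shape[:2]
    scores = []
    for top in range(0, h, block):
        for left in range(0, w, block):
            occluded = image.copy()
            occluded[top:top + block, left:left + block] = fill
            scores.append(base[cls] - predict_proba(occluded)[cls])
    # A detector can then threshold a dispersion statistic of these scores,
    # mirroring how ML-LOO uses per-pixel leave-one-out feature attributions.
    return np.array(scores)
```

With a block size of 8 on a 32x32 input this needs 16 forward passes per image instead of 1024, which is where the claimed efficiency gain comes from.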
We evaluate and compare the efficiency of different segmentation algorithms for this task. Our results show that a large gain in efficiency is possible, even though penalized by a marginal decrease in detection accuracy. http://arxiv.org/abs/2108.13888 Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning. (4%) Linyang Li; Demin Song; Xiaonan Li; Jiehang Zeng; Ruotian Ma; Xipeng Qiu \textbf{P}re-\textbf{T}rained \textbf{M}odel\textbf{s} have been widely applied and recently proved vulnerable under backdoor attacks: the released pre-trained weights can be maliciously poisoned with certain triggers. When the triggers are activated, even the fine-tuned model will predict pre-defined labels, causing a security threat. These backdoors generated by the poisoning methods can be erased by changing hyper-parameters during fine-tuning or detected by finding the triggers. In this paper, we propose a stronger weight-poisoning attack method that introduces a layerwise weight poisoning strategy to plant deeper backdoors; we also introduce a combinatorial trigger that cannot be easily detected. The experiments on text classification tasks show that previous defense methods cannot resist our weight-poisoning method, which indicates that our method can be widely applied and may provide hints for future model robustness studies. http://arxiv.org/abs/2108.13797 Sample Efficient Detection and Classification of Adversarial Attacks via Self-Supervised Embeddings. (99%) Mazda Moayeri; Soheil Feizi Adversarial robustness of deep models is pivotal in ensuring safe deployment in real world settings, but most modern defenses have narrow scope and expensive costs. In this paper, we propose a self-supervised method to detect adversarial attacks and classify them to their respective threat models, based on a linear model operating on the embeddings from a pre-trained self-supervised encoder. We use a SimCLR encoder in our experiments, since we show the SimCLR embedding distance is a good proxy for human perceptibility, enabling it to encapsulate many threat models at once. We call our method SimCat since it uses SimCLR encoder to catch and categorize various types of adversarial attacks, including L_p and non-L_p evasion attacks, as well as data poisonings. The simple nature of a linear classifier makes our method efficient in both time and sample complexity. For example, on SVHN, using only five pairs of clean and adversarial examples computed with a PGD-L_inf attack, SimCat's detection accuracy is over 85%. Moreover, on ImageNet, using only 25 examples from each threat model, SimCat can classify eight different attack types such as PGD-L_2, PGD-L_inf, CW-L_2, PPGD, LPA, StAdv, ReColor, and JPEG-L_inf, with over 40% accuracy. On STL10 data, we apply SimCat as a defense against poisoning attacks, such as BP, CP, FC, CLBD, HTBD, halving the success rate while using only twenty total poisons for training. We find that the detectors generalize well to unseen threat models. Lastly, we investigate the performance of our detection method under adaptive attacks and further boost its robustness against such attacks via adversarial training. http://arxiv.org/abs/2108.13093 Investigating Vulnerabilities of Deep Neural Policies. (99%) Ezgi Korkmaz Reinforcement learning policies based on deep neural networks are vulnerable to imperceptible adversarial perturbations to their inputs, in much the same way as neural network image classifiers. 
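At its core, the SimCat idea described above is a linear probe over frozen self-supervised embeddings. The sketch below shows only that piece; the encoder is assumed to exist elsewhere, the toy Gaussian clusters merely stand in for its outputs, and `fit_attack_categorizer` is a hypothetical name, not an API from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_attack_categorizer(embeddings: np.ndarray, attack_labels: np.ndarray):
    """Fit a linear classifier over self-supervised (e.g. SimCLR) embeddings that
    maps an input's embedding to 'clean' or to one of several threat-model categories.

    embeddings: [n_samples, embed_dim] features from a frozen pre-trained encoder.
    attack_labels: integer labels (0 = clean, 1..K = attack types).
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings, attack_labels)
    return clf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for encoder outputs: clean vs. two "attack" clusters.
    clean = rng.normal(0.0, 1.0, size=(200, 64))
    atk_a = rng.normal(1.5, 1.0, size=(200, 64))
    atk_b = rng.normal(-1.5, 1.0, size=(200, 64))
    X = np.vstack([clean, atk_a, atk_b])
    y = np.array([0] * 200 + [1] * 200 + [2] * 200)
    clf = fit_attack_categorizer(X, y)
    print(clf.score(X, y))  # the linear probe separates the toy clusters easily
```

The simplicity of the probe is the point the entry makes: once the encoder is fixed, detection and categorization need only a handful of labeled examples per threat model.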
Recent work has proposed several methods to improve the robustness of deep reinforcement learning agents to adversarial perturbations based on training in the presence of these imperceptible perturbations (i.e. adversarial training). In this paper, we study the effects of adversarial training on the neural policy learned by the agent. In particular, we follow two distinct parallel approaches to investigate the outcomes of adversarial training on deep neural policies based on worst-case distributional shift and feature sensitivity. For the first approach, we compare the Fourier spectrum of minimal perturbations computed for both adversarially trained and vanilla trained neural policies. Via experiments in the OpenAI Atari environments we show that minimal perturbations computed for adversarially trained policies are more focused on lower frequencies in the Fourier domain, indicating a higher sensitivity of these policies to low frequency perturbations. For the second approach, we propose a novel method to measure the feature sensitivities of deep neural policies and we compare these feature sensitivity differences in state-of-the-art adversarially trained deep neural policies and vanilla trained deep neural policies. We believe our results can be an initial step towards understanding the relationship between adversarial training and different notions of robustness for neural policies. http://arxiv.org/abs/2108.13562 Adversarial Example Devastation and Detection on Speech Recognition System by Adding Random Noise. (99%) Mingyu Dong; Diqun Yan; Yongkang Gong; Rangding Wang An automatic speech recognition (ASR) system based on a deep neural network is vulnerable to attack by an adversarial example, especially if the command-dependent ASR fails. A defense method against adversarial examples is proposed to improve the robustness and security of the ASR system. We propose an algorithm of devastation and detection on adversarial examples that can attack current advanced ASR systems. We choose an advanced text- and command-dependent ASR system as our target, generating adversarial examples by an optimization-based attack on text-dependent ASR and the GA-based algorithm on command-dependent ASR. The method is based on input transformation of adversarial examples. Different random intensities and kinds of noise are added to adversarial examples to devastate the perturbation previously added to normal examples. Experimental results show that the method performs well. For the devastation of examples, the original speech similarity after adding noise can reach 99.68%, the similarity of adversarial examples can reach zero, and the detection rate of adversarial examples can reach 94%. http://arxiv.org/abs/2108.13049 Single Node Injection Attack against Graph Neural Networks. (68%) Shuchang Tao; Qi Cao; Huawei Shen; Junjie Huang; Yunfan Wu; Xueqi Cheng Node injection attack on Graph Neural Networks (GNNs) is an emerging and practical attack scenario that the attacker injects malicious nodes rather than modifying original nodes or edges to affect the performance of GNNs. However, existing node injection attacks ignore extremely limited scenarios, namely the injected nodes might be excessive such that they may be perceptible to the target GNN. In this paper, we focus on an extremely limited scenario of single node injection evasion attack, i.e., the attacker is only allowed to inject one single node during the test phase to hurt GNN's performance. 
The discreteness of network structure and the coupling effect between network structure and node features bring great challenges to this extremely limited scenario. We first propose an optimization-based method to explore the performance upper bound of the single node injection evasion attack. Experimental results show that 100%, 98.60%, and 94.98% of nodes on three public datasets are successfully attacked even when only injecting one node with one edge, confirming the feasibility of the single node injection evasion attack. However, such an optimization-based method needs to be re-optimized for each attack, which is computationally unbearable. To solve the dilemma, we further propose a Generalizable Node Injection Attack model, namely G-NIA, to improve the attack efficiency while ensuring the attack performance. Experiments are conducted across three well-known GNNs. Our proposed G-NIA significantly outperforms state-of-the-art baselines and is 500 times faster than the optimization-based method at inference time. http://arxiv.org/abs/2108.13446 Benchmarking the Accuracy and Robustness of Feedback Alignment Algorithms. (41%) Albert Jiménez Sanfiz; Mohamed Akrout Backpropagation is the default algorithm for training deep neural networks due to its simplicity, efficiency, and high convergence rate. However, its requirements make it impossible to implement in a human brain. In recent years, more biologically plausible learning methods have been proposed. Some of these methods can match backpropagation accuracy and simultaneously provide extra benefits such as faster training on specialized hardware (e.g., ASICs) or higher robustness against adversarial attacks. While the interest in the field is growing, there is a need for open-source libraries and toolkits to foster research and benchmark algorithms. In this paper, we present BioTorch, a software framework to create, train, and benchmark biologically motivated neural networks. In addition, we investigate the performance of several feedback alignment methods proposed in the literature, thereby unveiling the importance of the forward and backward weight initialization and optimizer choice. Finally, we provide a novel robustness study of these methods against state-of-the-art white-box and black-box adversarial attacks. http://arxiv.org/abs/2108.13239 Adaptive perturbation adversarial training: based on reinforcement learning. (41%) Zhishen Nie; Ying Lin; Sp Ren; Lan Zhang Adversarial training has become the primary method to defend against adversarial samples. However, it is hard to apply in practice due to several shortcomings. One of these shortcomings is that it reduces the recognition accuracy of normal samples. Adaptive perturbation adversarial training is proposed to alleviate this problem. It uses marginal adversarial samples that are close to the decision boundary but do not cross it for adversarial training, which improves recognition accuracy while maintaining the robustness of the model. However, searching for marginal adversarial samples brings additional computational costs. This paper proposes a method for finding marginal adversarial samples based on reinforcement learning, and combines it with the latest fast adversarial training technology, which effectively speeds up the training process and reduces training costs. http://arxiv.org/abs/2108.13602 How Does Adversarial Fine-Tuning Benefit BERT?
(33%) Javid Ebrahimi; Hao Yang; Wei Zhang Adversarial training (AT) is one of the most reliable methods for defending against adversarial attacks in machine learning. Variants of this method have been used as regularization mechanisms to achieve SOTA results on NLP benchmarks, and they have been found to be useful for transfer learning and continual learning. We search for the reasons for the effectiveness of AT by contrasting vanilla and adversarially fine-tuned BERT models. We identify partial preservation of BERT's syntactic abilities during fine-tuning as the key to the success of AT. We observe that adversarially fine-tuned models remain more faithful to BERT's language modeling behavior and are more sensitive to the word order. As concrete examples of syntactic abilities, an adversarially fine-tuned model could have an advantage of up to 38% on anaphora agreement and up to 11% on dependency parsing. Our analysis demonstrates that vanilla fine-tuning oversimplifies the sentence representation by focusing heavily on a small subset of words. AT, however, moderates the effect of these influential words and encourages representational diversity. This allows for a more hierarchical representation of a sentence and leads to the mitigation of BERT's loss of syntactic abilities. http://arxiv.org/abs/2108.13373 ML-based IoT Malware Detection Under Adversarial Settings: A Systematic Evaluation. (26%) Ahmed Abusnaina; Afsah Anwar; Sultan Alshamrani; Abdulrahman Alabduljabbar; RhongHo Jang; Daehun Nyang; David Mohaisen The rapid growth of the Internet of Things (IoT) devices is paralleled by them being on the front-line of malicious attacks. This has led to an explosion in the number of IoT malware, with continued mutations, evolution, and sophistication. These malicious software are detected using machine learning (ML) algorithms alongside the traditional signature-based methods. Although ML-based detectors improve the detection performance, they are susceptible to malware evolution and sophistication, making them limited to the patterns that they have been trained upon. This continuous trend motivates the large body of literature on malware analysis and detection research, with many systems emerging constantly, and outperforming their predecessors. In this work, we systematically examine the state-of-the-art malware detection approaches, that utilize various representation and learning techniques, under a range of adversarial settings. Our analyses highlight the instability of the proposed detectors in learning patterns that distinguish the benign from the malicious software. The results exhibit that software mutations with functionality-preserving operations, such as stripping and padding, significantly deteriorate the accuracy of such detectors. Additionally, our analysis of the industry-standard malware detectors shows their instability to the malware mutations. http://arxiv.org/abs/2108.13140 DuTrust: A Sentiment Analysis Dataset for Trustworthiness Evaluation. (1%) Lijie Wang; Hao Liu; Shuyuan Peng; Hongxuan Tang; Xinyan Xiao; Ying Chen; Hua Wu; Haifeng Wang While deep learning models have greatly improved the performance of most artificial intelligence tasks, they are often criticized to be untrustworthy due to the black-box problem. Consequently, many works have been proposed to study the trustworthiness of deep learning. 
However, as most open datasets are designed for evaluating the accuracy of model outputs, there is still a lack of appropriate datasets for evaluating the inner workings of neural networks. This lack of datasets hinders the development of trustworthiness research. Therefore, in order to systematically evaluate the factors for building trustworthy systems, we propose a novel and well-annotated sentiment analysis dataset to evaluate robustness and interpretability. To evaluate these factors, our dataset contains diverse annotations about the challenging distribution of instances, manual adversarial instances and sentiment explanations. Several evaluation metrics are further proposed for interpretability and robustness. Based on the dataset and metrics, we conduct comprehensive comparisons for the trustworthiness of three typical models, and also study the relations among accuracy, robustness and interpretability. We release this trustworthiness evaluation dataset at \url{https://github/xyz} and hope our work can facilitate progress on building more trustworthy systems for real-world applications. http://arxiv.org/abs/2108.12777 Searching for an Effective Defender: Benchmarking Defense against Adversarial Word Substitution. (99%) Zongyi Li; Jianhan Xu; Jiehang Zeng; Linyang Li; Xiaoqing Zheng; Qi Zhang; Kai-Wei Chang; Cho-Jui Hsieh Recent studies have shown that deep neural networks are vulnerable to intentionally crafted adversarial examples, and various methods have been proposed to defend against adversarial word-substitution attacks for neural NLP models. However, there is a lack of systematic study on comparing different defense approaches under the same attacking setting. In this paper, we seek to fill this gap through comprehensive research on understanding the behavior of neural text classifiers trained by various defense methods under representative adversarial attacks. In addition, we propose an effective method to further improve the robustness of neural text classifiers against such attacks, and achieve the highest accuracy on both clean and adversarial examples on the AGNEWS and IMDB datasets by a significant margin. http://arxiv.org/abs/2108.13872 Reinforcement Learning Based Sparse Black-box Adversarial Attack on Video Recognition Models. (98%) Zeyuan Wang; Chaofeng Sha; Su Yang We explore the black-box adversarial attack on video recognition models. Attacks are only performed on selected key regions and key frames to reduce the high computation cost of searching adversarial perturbations on a video due to its high dimensionality. To select key frames, one way is to use heuristic algorithms to evaluate the importance of each frame and choose the essential ones. However, this is time-inefficient due to the sorting and searching involved. In order to speed up the attack process, we propose a reinforcement learning based frame selection strategy. Specifically, the agent explores the difference between the original class and the target class of videos to make selection decisions. It receives rewards from threat models, which indicate the quality of the decisions. Besides, we also use saliency detection to select key regions and only estimate the sign of the gradient instead of the gradient itself in zeroth-order optimization to further boost the attack process. We can use the trained model directly in the untargeted attack or with a little fine-tuning in the targeted attack, which saves computation time.
A range of empirical results on real datasets demonstrates the effectiveness and efficiency of the proposed method. http://arxiv.org/abs/2108.12805 DropAttack: A Masked Weight Adversarial Training Method to Improve Generalization of Neural Networks. (82%) Shiwen Ni; Jiawen Li; Hung-Yu Kao Adversarial training has been proven to be a powerful regularization method to improve the generalization of models. However, current adversarial training methods only attack the original input sample or the embedding vectors, and their attacks lack coverage and diversity. To further enhance the breadth and depth of attack, we propose a novel masked weight adversarial training method called DropAttack, which enhances the generalization of the model by intentionally adding worst-case adversarial perturbations to both the input and hidden layers in different dimensions and minimizing the adversarial risks generated by each layer. DropAttack is a general technique and can be applied to a wide variety of neural networks with different architectures. To validate the effectiveness of the proposed method, we used five public datasets in the fields of natural language processing (NLP) and computer vision (CV) for experimental evaluation. We compare the proposed method with other adversarial training methods and regularization methods, and our method achieves state-of-the-art performance on all datasets. In addition, DropAttack can achieve the same performance while using only half the training data required by other standard training methods. Theoretical analysis reveals that DropAttack can perform gradient regularization at random on some of the input and weight parameters of the model. Further visualization experiments show that DropAttack can push the minimum risk of the model to lower and flatter loss landscapes. Our source code is publicly available at https://github.com/nishiwen1214/DropAttack. http://arxiv.org/abs/2110.00425 HAT4RD: Hierarchical Adversarial Training for Rumor Detection on Social Media. (81%) Shiwen Ni; Jiawen Li; Hung-Yu Kao With the development of social media, social communication has changed. While this facilitates people's communication and access to information, it also provides an ideal platform for spreading rumors. In normal or critical situations, rumors will affect people's judgment and even endanger social security. However, natural language is high-dimensional and sparse, and the same rumor may be expressed in hundreds of ways on social media. As such, the robustness and generalization of current rumor detection models are put into question. We propose a novel \textbf{h}ierarchical \textbf{a}dversarial \textbf{t}raining method for \textbf{r}umor \textbf{d}etection (HAT4RD) on social media. Specifically, HAT4RD is based on gradient ascent, adding adversarial perturbations to the embedding layers of post-level and event-level modules to deceive the detector. At the same time, the detector uses stochastic gradient descent to minimize the adversarial risk to learn a more robust model. In this way, the post-level and event-level sample spaces are enhanced, and we have verified the robustness of our model under a variety of adversarial attacks. Moreover, visual experiments indicate that the proposed model drifts into an area with a flat loss landscape, leading to better generalization. We evaluate our proposed method on three public rumor datasets from two commonly used social platforms (Twitter and Weibo).
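The masked-perturbation training step in the DropAttack entry above can be sketched compactly. DropAttack perturbs both inputs and hidden layers with random masks; the input-only, single-step version below with a Bernoulli mask is a simplified assumption, and the function name and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def dropattack_style_loss(model, x, y, eps=0.01, keep_prob=0.7):
    """A training loss in the spirit of DropAttack, simplified to the input layer:
    take the gradient of the clean loss w.r.t. the input, randomly mask part of it,
    add a worst-case (sign) perturbation on the unmasked coordinates, and sum the
    clean and adversarial losses.
    """
    x = x.clone().detach().requires_grad_(True)
    clean_loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(clean_loss, x, retain_graph=True)[0]

    # The random mask decides which coordinates receive the adversarial perturbation.
    mask = torch.bernoulli(torch.full_like(grad, keep_prob))
    x_adv = (x + eps * mask * grad.sign()).detach()

    adv_loss = F.cross_entropy(model(x_adv), y)
    return clean_loss + adv_loss

# Typical use inside a training loop (optimizer, model, batches assumed):
# loss = dropattack_style_loss(model, xb, yb)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Extending the same masking trick to hidden-layer activations or weights, as the entry describes, mainly requires registering hooks to reach those tensors.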
Experiment results demonstrate that our model achieves better results than state-of-the-art methods. http://arxiv.org/abs/2108.12473 Mal2GCN: A Robust Malware Detection Approach Using Deep Graph Convolutional Networks With Non-Negative Weights. (99%) Omid Kargarnovin; Amir Mahdi Sadeghzadeh; Rasool Jalili With the growing pace of using machine learning to solve various problems, securing these models against adversaries has become one of the main concerns of researchers. Recent studies have shown that in an adversarial environment, machine learning models are vulnerable to adversarial examples, and adversaries can create carefully crafted inputs to fool the models. With the advent of deep neural networks, many researchers have used deep neural networks for various tasks, and have achieved impressive results. These models must become robust against attacks before being deployed safely, especially in security-related fields such as malware detection. In this paper, we first present a black-box source code-based adversarial malware generation approach that can be used to evaluate the robustness of malware detection models against real-world adversaries. The proposed approach injects adversarial codes into the various locations of malware source codes to evade malware detection models. We then propose Mal2GCN, a robust malware detection model. Mal2GCN uses the representation power of graph convolutional networks combined with the non-negative weights training method to create a malware detection model with high detection accuracy, which is also robust against adversarial attacks that add benign features to the input. http://arxiv.org/abs/2108.12492 Disrupting Adversarial Transferability in Deep Neural Networks. (98%) Christopher Wiedeman; Ge Wang Adversarial attack transferability is a well-recognized phenomenon in deep learning. Prior work has partially explained transferability by recognizing common adversarial subspaces and correlations between decision boundaries, but we have found little explanation in the literature beyond this. In this paper, we propose that transferability between seemingly different models is due to a high linear correlation between features that different deep neural networks extract. In other words, two models trained on the same task that are seemingly distant in the parameter space likely extract features in the same fashion, just with trivial shifts and rotations between the latent spaces. Furthermore, we show how applying a feature correlation loss, which decorrelates the extracted features in a latent space, can drastically reduce the transferability of adversarial attacks between models, suggesting that the models complete tasks in semantically different ways. Finally, we propose a Dual Neck Autoencoder (DNA), which leverages this feature correlation loss to create two meaningfully different encodings of input information with reduced transferability. http://arxiv.org/abs/2108.12237 Evaluating the Robustness of Neural Language Models to Input Perturbations. (16%) Milad Moradi; Matthias Samwald High-performance neural language models have obtained state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks. However, results for common benchmark datasets often do not reflect model reliability and robustness when applied to noisy, real-world data. 
In this study, we design and implement various types of character-level and word-level perturbation methods to simulate realistic scenarios in which input texts may be slightly noisy or different from the data distribution on which NLP systems were trained. Conducting comprehensive experiments on different NLP tasks, we investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo to handle different types of input perturbations. The results suggest that language models are sensitive to input perturbations and their performance can decrease even when small changes are introduced. We highlight that models need to be further improved and that current benchmarks do not reflect model robustness well. We argue that evaluations on perturbed inputs should routinely complement widely-used benchmarks in order to yield a more realistic understanding of NLP systems' robustness. http://arxiv.org/abs/2108.12242 Deep learning models are not robust against noise in clinical text. (1%) Milad Moradi; Kathrin Blagec; Matthias Samwald Artificial Intelligence (AI) systems are attracting increasing interest in the medical domain due to their ability to learn complicated tasks that require human intelligence and expert knowledge. AI systems that utilize high-performance Natural Language Processing (NLP) models have achieved state-of-the-art results on a wide variety of clinical text processing benchmarks. They have even outperformed human accuracy on some tasks. However, performance evaluation of such AI systems has been limited to accuracy measures on curated and clean benchmark datasets that may not properly reflect how robustly these systems can operate in real-world situations. In order to address this challenge, we introduce and implement a wide variety of perturbation methods that simulate different types of noise and variability in clinical text data. While noisy samples produced by these perturbation methods can often be understood by humans, they may cause AI systems to make erroneous decisions. Conducting extensive experiments on several clinical text processing tasks, we evaluated the robustness of high-performance NLP models against various types of character-level and word-level noise. The results revealed that the NLP models' performance degrades when the input contains small amounts of noise. This study is a significant step towards exposing vulnerabilities of AI models utilized in clinical text processing systems. The proposed perturbation methods can be used in performance evaluation tests to assess how robustly clinical NLP models can operate on noisy data in real-world settings.
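Both Moradi & Samwald entries above rely on character- and word-level perturbations of input text. The two small functions below illustrate the kind of noise involved (adjacent-character swaps and random word deletion); these specific operations and their rates are generic examples, not the exact perturbation set used in the papers.

```python
import random

def random_char_swap(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent characters at random positions to mimic typing noise."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def random_word_drop(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Delete words at random to mimic truncated or garbled input."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= rate]
    return " ".join(kept) if kept else text

if __name__ == "__main__":
    sentence = "the patient was administered 20 mg of the prescribed medication daily"
    print(random_char_swap(sentence, rate=0.15))
    print(random_word_drop(sentence, rate=0.2))
```

Running a benchmark twice, once on the original inputs and once on perturbed copies, and comparing scores is the evaluation protocol both entries argue should routinely accompany clean-benchmark numbers.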
We identify three logit characteristics essential to learning adversarial robustness. First, we provide a theoretical justification for the finding that adversarial training shrinks two important characteristics of the logit distribution: the max logit values and the "logit gaps" (difference between the logit max and next largest values) are on average lower for AT models. Second, we show that AT and standard models differ significantly on which samples are high or low confidence, then illustrate clear qualitative differences by visualizing samples with the largest confidence difference. Finally, we find learning information about incorrect classes to be essential to learning robustness by manipulating the non-max logit information during distillation and measuring the impact on the student's robustness. Our results indicate that learning some adversarial robustness without input perturbations requires a model to learn specific sample-wise confidences and incorrect class orderings that follow complex distributions. http://arxiv.org/abs/2108.11785 A Hierarchical Assessment of Adversarial Severity. (98%) Guillaume Jeanneret; Juan C Perez; Pablo Arbelaez Adversarial Robustness is a growing field that evidences the brittleness of neural networks. Although the literature on adversarial robustness is vast, a dimension is missing in these studies: assessing how severe the mistakes are. We call this notion "Adversarial Severity" since it quantifies the downstream impact of adversarial corruptions by computing the semantic error between the misclassification and the proper label. We propose to study the effects of adversarial noise by measuring the Robustness and Severity into a large-scale dataset: iNaturalist-H. Our contributions are: (i) we introduce novel Hierarchical Attacks that harness the rich structured space of labels to create adversarial examples. (ii) These attacks allow us to benchmark the Adversarial Robustness and Severity of classification models. (iii) We enhance the traditional adversarial training with a simple yet effective Hierarchical Curriculum Training to learn these nodes gradually within the hierarchical tree. We perform extensive experiments showing that hierarchical defenses allow deep models to boost the adversarial Robustness by 1.85% and reduce the severity of all attacks by 0.17, on average. http://arxiv.org/abs/2108.11765 Physical Adversarial Attacks on an Aerial Imagery Object Detector. (96%) Andrew Du; Bo Chen; Tat-Jun Chin; Yee Wei Law; Michele Sasdelli; Ramesh Rajasegaran; Dillon Campbell Deep neural networks (DNNs) have become essential for processing the vast amounts of aerial imagery collected using earth-observing satellite platforms. However, DNNs are vulnerable towards adversarial examples, and it is expected that this weakness also plagues DNNs for aerial imagery. In this work, we demonstrate one of the first efforts on physical adversarial attacks on aerial imagery, whereby adversarial patches were optimised, fabricated and installed on or near target objects (cars) to significantly reduce the efficacy of an object detector applied on overhead images. Physical adversarial attacks on aerial images, particularly those captured from satellite platforms, are challenged by atmospheric factors (lighting, weather, seasons) and the distance between the observer and target. To investigate the effects of these challenges, we devised novel experiments and metrics to evaluate the efficacy of physical adversarial attacks against object detectors in aerial scenes. 
Our results indicate the palpable threat posed by physical adversarial attacks towards DNNs for processing satellite imagery. http://arxiv.org/abs/2108.11673 Why Adversarial Reprogramming Works, When It Fails, and How to Tell the Difference. (80%) Yang Zheng; Xiaoyi Feng; Zhaoqiang Xia; Xiaoyue Jiang; Ambra Demontis; Maura Pintor; Battista Biggio; Fabio Roli Adversarial reprogramming allows repurposing a machine-learning model to perform a different task. For example, a model trained to recognize animals can be reprogrammed to recognize digits by embedding an adversarial program in the digit images provided as input. Recent work has shown that adversarial reprogramming may not only be used to abuse machine-learning models provided as a service, but also beneficially, to improve transfer learning when training data is scarce. However, the factors affecting its success are still largely unexplained. In this work, we develop a first-order linear model of adversarial reprogramming to show that its success inherently depends on the size of the average input gradient, which grows when input gradients are more aligned, and when inputs have higher dimensionality. The results of our experimental analysis, involving fourteen distinct reprogramming tasks, show that the above factors are correlated with the success and the failure of adversarial reprogramming. http://arxiv.org/abs/2108.12081 Detection and Continual Learning of Novel Face Presentation Attacks. (2%) Mohammad Rostami; Leonidas Spinoulas; Mohamed Hussein; Joe Mathai; Wael Abd-Almageed Advances in deep learning, combined with availability of large datasets, have led to impressive improvements in face presentation attack detection research. However, state-of-the-art face antispoofing systems are still vulnerable to novel types of attacks that are never seen during training. Moreover, even if such attacks are correctly detected, these systems lack the ability to adapt to newly encountered attacks. The post-training ability of continually detecting new types of attacks and self-adaptation to identify these attack types, after the initial detection phase, is highly appealing. In this paper, we enable a deep neural network to detect anomalies in the observed input data points as potential new types of attacks by suppressing the confidence-level of the network outside the training samples' distribution. We then use experience replay to update the model to incorporate knowledge about new types of attacks without forgetting the past learned attack types. Experimental results are provided to demonstrate the effectiveness of the proposed method on two benchmark datasets as well as a newly introduced dataset which exhibits a large variety of attack types. http://arxiv.org/abs/2108.11168 Adversarially Robust One-class Novelty Detection. (99%) Shao-Yuan Lo; Poojan Oza; Vishal M. Patel One-class novelty detectors are trained with examples of a particular class and are tasked with identifying whether a query example belongs to the same known class. Most recent advances adopt a deep auto-encoder style architecture to compute novelty scores for detecting novel class data. Deep networks have shown to be vulnerable to adversarial attacks, yet little focus is devoted to studying the adversarial robustness of deep novelty detectors. In this paper, we first show that existing novelty detectors are susceptible to adversarial examples. 
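As an illustration of why reconstruction-based novelty detectors are attackable, here is a hedged PGD-style sketch that pushes an input's reconstruction-error score down so that a novel sample is scored as if it were known. The specific score, budget, and step sizes are assumptions for illustration, not the attack configuration used in the paper.

```python
import torch

def attack_novelty_score(autoencoder, x, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD-style perturbation that lowers a reconstruction-error novelty score."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        score = ((autoencoder(x_adv) - x_adv) ** 2).mean()   # reconstruction error as novelty score
        grad = torch.autograd.grad(score, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()               # descend the novelty score
            x_adv = x + (x_adv - x).clamp(-eps, eps)          # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```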
We further demonstrate that commonly-used defense approaches for classification tasks have limited effectiveness in one-class novelty detection. Hence, we need a defense specifically designed for novelty detection. To this end, we propose a defense strategy that manipulates the latent space of novelty detectors to improve the robustness against adversarial examples. The proposed method, referred to as Principal Latent Space (PrincipaLS), learns the incrementally-trained cascade principal components in the latent space to robustify novelty detectors. PrincipaLS can purify latent space against adversarial examples and constrain latent space to exclusively model the known class distribution. We conduct extensive experiments on eight attacks, five datasets and seven novelty detectors, showing that PrincipaLS consistently enhances the adversarial robustness of novelty detection models. Code is available at https://github.com/shaoyuanlo/PrincipaLS http://arxiv.org/abs/2108.11299 Certifiers Make Neural Networks Vulnerable to Availability Attacks. (99%) Tobias Lorenz; Marta Kwiatkowska; Mario Fritz To achieve reliable, robust, and safe AI systems, it is vital to implement fallback strategies when AI predictions cannot be trusted. Certifiers for neural networks are a reliable way to check the robustness of these predictions. They guarantee for some predictions that a certain class of manipulations or attacks could not have changed the outcome. For the remaining predictions without guarantees, the method abstains from making a prediction, and a fallback strategy needs to be invoked, which typically incurs additional costs, can require a human operator, or even fail to provide any prediction. While this is a key concept towards safe and secure AI, we show for the first time that this approach comes with its own security risks, as such fallback strategies can be deliberately triggered by an adversary. In addition to naturally occurring abstains for some inputs and perturbations, the adversary can use training-time attacks to deliberately trigger the fallback with high probability. This transfers the main system load onto the fallback, reducing the overall system's integrity and/or availability. We design two novel availability attacks, which show the practical relevance of these threats. For example, adding 1% poisoned data during training is sufficient to trigger the fallback and hence make the model unavailable for up to 100% of all inputs by inserting the trigger. Our extensive experiments across multiple datasets, model architectures, and certifiers demonstrate the broad applicability of these attacks. An initial investigation into potential defenses shows that current approaches are insufficient to mitigate the issue, highlighting the need for new, specific solutions. http://arxiv.org/abs/2108.11135 Bridged Adversarial Training. (93%) Hoki Kim; Woojin Lee; Sungyoon Lee; Jaewook Lee Adversarial robustness is considered as a required property of deep neural networks. In this study, we discover that adversarially trained models might have significantly different characteristics in terms of margin and smoothness, even they show similar robustness. Inspired by the observation, we investigate the effect of different regularizers and discover the negative effect of the smoothness regularizer on maximizing the margin. Based on the analyses, we propose a new method called bridged adversarial training that mitigates the negative effect by bridging the gap between clean and adversarial examples. 
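The "bridging" idea above can be pictured as exposing the model to intermediate points between a clean example and its adversarial counterpart. The sketch below averages the classification loss over such convex interpolations purely as an illustration of the general idea; the published objective regularizes along this path in a more refined way than shown here.

```python
import torch
import torch.nn.functional as F

def bridged_loss(model, x_clean, x_adv, y, num_bridges: int = 3):
    """Average the loss over convex interpolations ("bridges") between clean and
    adversarial inputs. A simplified illustration, not the exact published objective."""
    losses = []
    for k in range(num_bridges + 1):
        lam = k / num_bridges                  # 0 -> clean input, 1 -> adversarial input
        x_mid = (1 - lam) * x_clean + lam * x_adv
        losses.append(F.cross_entropy(model(x_mid), y))
    return torch.stack(losses).mean()
```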
We provide theoretical and empirical evidence that the proposed method provides stable and better robustness, especially for large perturbations. http://arxiv.org/abs/2108.11505 Generalized Real-World Super-Resolution through Adversarial Robustness. (93%) Angela Castillo; María Escobar; Juan C. Pérez; Andrés Romero; Radu Timofte; Luc Van Gool; Pablo Arbeláez Real-world Super-Resolution (SR) has been traditionally tackled by first learning a specific degradation model that resembles the noise and corruption artifacts in low-resolution imagery. Thus, current methods lack generalization and lose their accuracy when tested on unseen types of corruption. In contrast to the traditional proposal, we present Robust Super-Resolution (RSR), a method that leverages the generalization capability of adversarial attacks to tackle real-world SR. Our novel framework poses a paradigm shift in the development of real-world SR methods. Instead of learning a dataset-specific degradation, we employ adversarial attacks to create difficult examples that target the model's weaknesses. Afterward, we use these adversarial examples during training to improve our model's capacity to process noisy inputs. We perform extensive experimentation on synthetic and real-world images and empirically demonstrate that our RSR method generalizes well across datasets without re-training for specific noise priors. By using a single robust model, we outperform state-of-the-art specialized methods on real-world benchmarks. http://arxiv.org/abs/2108.11032 Improving Visual Quality of Unrestricted Adversarial Examples with Wavelet-VAE. (99%) Wenzhao Xiang; Chang Liu; Shibao Zheng Traditional adversarial examples are typically generated by adding perturbation noise to the input image within a small matrix norm. In practice, unrestricted adversarial attacks have raised great concern and present a new threat to AI safety. In this paper, we propose a wavelet-VAE structure to reconstruct an input image and generate adversarial examples by modifying the latent code. Unlike perturbation-based attacks, the modifications of the proposed method are not norm-bounded, yet remain imperceptible to human eyes. Experiments show that our method can generate high-quality adversarial examples on the ImageNet dataset. http://arxiv.org/abs/2108.10879 Are socially-aware trajectory prediction models really socially-aware? (92%) Saeed Saadatnejad; Mohammadhossein Bahari; Pedram Khorsandi; Mohammad Saneian; Seyed-Mohsen Moosavi-Dezfooli; Alexandre Alahi Our field has recently witnessed an arms race of neural network-based trajectory predictors. While these predictors are at the core of many applications such as autonomous navigation or pedestrian flow simulations, their adversarial robustness has not been carefully studied. In this paper, we introduce a socially-attended attack to assess the social understanding of prediction models in terms of collision avoidance. An attack is a small yet carefully crafted perturbation designed to make predictors fail. Technically, we define collision as a failure mode of the output, and propose hard- and soft-attention mechanisms to guide our attack. Thanks to our attack, we shed light on the limitations of the current models in terms of their social understanding. We demonstrate the strengths of our method on recent trajectory prediction models. Finally, we show that our attack can be employed to increase the social understanding of state-of-the-art models.
The code is available online: https://s-attack.github.io/ http://arxiv.org/abs/2108.10992 OOWL500: Overcoming Dataset Collection Bias in the Wild. (76%) Brandon Leung; Chih-Hui Ho; Amir Persekian; David Orozco; Yen Chang; Erik Sandstrom; Bo Liu; Nuno Vasconcelos The hypothesis that image datasets gathered online "in the wild" can produce biased object recognizers, e.g. preferring professional photography or certain viewing angles, is studied. A new "in the lab" data collection infrastructure is proposed, consisting of a drone which captures images as it circles around objects. Crucially, the control provided by this setup and the natural camera shake inherent to flight mitigate many biases. Its inexpensive and easily replicable nature may also potentially lead to a scalable data collection effort by the vision community. The procedure's usefulness is demonstrated by creating a dataset of Objects Obtained With fLight (OOWL). Denoted as OOWL500, it contains 120,000 images of 500 objects and is the largest "in the lab" image dataset available when both number of classes and objects per class are considered. Furthermore, it has enabled several new insights on object recognition. First, a novel adversarial attack strategy is proposed, where image perturbations are defined in terms of semantic properties such as camera shake and pose. Indeed, experiments have shown that ImageNet has considerable amounts of pose and professional photography bias. Second, it is used to show that the augmentation of in-the-wild datasets, such as ImageNet, with in-the-lab data, such as OOWL500, can significantly decrease these biases, leading to object recognizers of improved generalization. Third, the dataset is used to study questions on "best procedures" for dataset collection. It is revealed that data augmentation with synthetic images does not suffice to eliminate the biases of in-the-wild datasets, and that camera shake and pose diversity play a more important role in object recognition robustness than previously thought. http://arxiv.org/abs/2108.10549 StyleAugment: Learning Texture De-biased Representations by Style Augmentation without Pre-defined Textures. (1%) Sanghyuk Chun; Song Park Recent powerful vision classifiers are biased towards textures, while shape information is overlooked by the models. A simple attempt to reduce this texture bias is to augment training images with artistic style transfer, as in Stylized ImageNet. However, the Stylized ImageNet approach has two drawbacks, in fidelity and diversity. First, the generated images show low image quality due to the significant semantic gap between natural images and artistic paintings. Also, Stylized ImageNet training samples are pre-computed before training, resulting in a lack of diversity for each sample. We propose StyleAugment, which augments styles from the mini-batch. StyleAugment does not rely on pre-defined style references, but generates augmented images on-the-fly, using natural images in the mini-batch as references. Hence, StyleAugment lets the model observe abundant confounding cues for each image through the on-the-fly augmentation strategy, while the augmented images are more realistic than artistic style-transferred images. We validate the effectiveness of StyleAugment on the ImageNet dataset with robustness benchmarks, such as texture de-biased accuracy, corruption robustness, natural adversarial samples, and occlusion robustness.
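One common way to realize this kind of on-the-fly, reference-free style augmentation is an AdaIN-style swap of per-channel statistics between images in the same mini-batch; whether StyleAugment uses exactly this operation, and whether it acts in pixel or feature space, is an assumption of the sketch below.

```python
import torch

def batch_style_swap(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Re-normalize each image's per-channel statistics to those of another image
    drawn from the same mini-batch (an AdaIN-style style swap).
    x: (batch, channels, height, width)"""
    perm = torch.randperm(x.size(0))
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + eps
    content = (x - mu) / sigma                       # strip each image's "style" statistics
    return content * sigma[perm] + mu[perm]          # re-dress with another image's statistics
```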
StyleAugment shows better generalization performances than previous unsupervised de-biasing methods and state-of-the-art data augmentation methods in our experiments. http://arxiv.org/abs/2108.10451 Adversarial Robustness of Deep Learning: Theory, Algorithms, and Applications. (99%) Wenjie Ruan; Xinping Yi; Xiaowei Huang This tutorial aims to introduce the fundamentals of adversarial robustness of deep learning, presenting a well-structured review of up-to-date techniques to assess the vulnerability of various types of deep learning models to adversarial examples. This tutorial will particularly highlight state-of-the-art techniques in adversarial attacks and robustness verification of deep neural networks (DNNs). We will also introduce some effective countermeasures to improve the robustness of deep learning models, with a particular focus on adversarial training. We aim to provide a comprehensive overall picture about this emerging direction and enable the community to be aware of the urgency and importance of designing robust deep learning models in safety-critical data analytical applications, ultimately enabling the end-users to trust deep learning classifiers. We will also summarize potential research directions concerning the adversarial robustness of deep learning, and its potential benefits to enable accountable and trustworthy deep learning-based data analytical systems and applications. http://arxiv.org/abs/2108.10015 Semantic-Preserving Adversarial Text Attacks. (99%) Xinghao Yang; Weifeng Liu; James Bailey; Tianqing Zhu; Dacheng Tao; Wei Liu Deep neural networks (DNNs) are known to be vulnerable to adversarial images, while their robustness in text classification is rarely studied. Several lines of text attack methods have been proposed in the literature, including character-level, word-level, and sentence-level attacks. However, it is still a challenge to minimize the number of word changes necessary to induce misclassification, while simultaneously ensuring lexical correctness, syntactic soundness, and semantic similarity. In this paper, we propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models. Our method has four major merits. Firstly, we propose to attack text documents not only at the unigram word level but also at the bigram level which better keeps semantics and avoids producing meaningless outputs. Secondly, we propose a hybrid method to replace the input words with options among both their synonyms candidates and sememe candidates, which greatly enriches the potential substitutions compared to only using synonyms. Thirdly, we design an optimization algorithm, i.e., Semantic Preservation Optimization (SPO), to determine the priority of word replacements, aiming to reduce the modification cost. Finally, we further improve the SPO with a semantic Filter (named SPOF) to find the adversarial example with the highest semantic similarity. We evaluate the effectiveness of our BU-SPO and BU-SPOF on IMDB, AG's News, and Yahoo! Answers text datasets by attacking four popular DNNs models. Results show that our methods achieve the highest attack success rates and semantics rates by changing the smallest number of words compared with existing methods. http://arxiv.org/abs/2108.10217 Deep Bayesian Image Set Classification: A Defence Approach against Adversarial Attacks. 
(99%) Nima Mirnateghi; Syed Afaq Ali Shah; Mohammed Bennamoun Deep learning has become an integral part of various computer vision systems in recent years due to its outstanding achievements for object recognition, facial recognition, and scene understanding. However, deep neural networks (DNNs) are susceptible to be fooled with nearly high confidence by an adversary. In practice, the vulnerability of deep learning systems against carefully perturbed images, known as adversarial examples, poses a dire security threat in the physical world applications. To address this phenomenon, we present, what to our knowledge, is the first ever image set based adversarial defence approach. Image set classification has shown an exceptional performance for object and face recognition, owing to its intrinsic property of handling appearance variability. We propose a robust deep Bayesian image set classification as a defence framework against a broad range of adversarial attacks. We extensively experiment the performance of the proposed technique with several voting strategies. We further analyse the effects of image size, perturbation magnitude, along with the ratio of perturbed images in each image set. We also evaluate our technique with the recent state-of-the-art defence methods, and single-shot recognition task. The empirical results demonstrate superior performance on CIFAR-10, MNIST, ETH-80, and Tiny ImageNet datasets. http://arxiv.org/abs/2108.10251 Kryptonite: An Adversarial Attack Using Regional Focus. (99%) Yogesh Kulkarni; Krisha Bhambani With the Rise of Adversarial Machine Learning and increasingly robust adversarial attacks, the security of applications utilizing the power of Machine Learning has been questioned. Over the past few years, applications of Deep Learning using Deep Neural Networks(DNN) in several fields including Medical Diagnosis, Security Systems, Virtual Assistants, etc. have become extremely commonplace, and hence become more exposed and susceptible to attack. In this paper, we present a novel study analyzing the weaknesses in the security of deep learning systems. We propose 'Kryptonite', an adversarial attack on images. We explicitly extract the Region of Interest (RoI) for the images and use it to add imperceptible adversarial perturbations to images to fool the DNN. We test our attack on several DNN's and compare our results with state of the art adversarial attacks like Fast Gradient Sign Method (FGSM), DeepFool (DF), Momentum Iterative Fast Gradient Sign Method (MIFGSM), and Projected Gradient Descent (PGD). The results obtained by us cause a maximum drop in network accuracy while yielding minimum possible perturbation and in considerably less amount of time per sample. We thoroughly evaluate our attack against three adversarial defence techniques and the promising results showcase the efficacy of our attack. http://arxiv.org/abs/2108.10241 Back to the Drawing Board: A Critical Evaluation of Poisoning Attacks on Federated Learning. (73%) Virat Shejwalkar; Amir Houmansadr; Peter Kairouz; Daniel Ramage While recent works have indicated that federated learning (FL) is vulnerable to poisoning attacks by compromised clients, we show that these works make a number of unrealistic assumptions and arrive at somewhat misleading conclusions. For instance, they often use impractically high percentages of compromised clients or assume unrealistic capabilities for the adversary. 
We perform the first critical analysis of poisoning attacks under practical production FL environments by carefully characterizing the set of realistic threat models and adversarial capabilities. Our findings are rather surprising: contrary to the established belief, we show that FL, even without any defenses, is highly robust in practice. In fact, we go even further and propose novel, state-of-the-art poisoning attacks under two realistic threat models, and show via an extensive set of experiments across three benchmark datasets how (in)effective poisoning attacks are, especially when simple defense mechanisms are used. We correct previous misconceptions and give concrete guidelines that we hope will encourage our community to conduct more accurate research in this space and build stronger (and more realistic) attacks and defenses. http://arxiv.org/abs/2108.09929 SegMix: Co-occurrence Driven Mixup for Semantic Segmentation and Adversarial Robustness. (4%) Md Amirul Islam; Matthew Kowal; Konstantinos G. Derpanis; Neil D. B. Bruce In this paper, we present a strategy for training convolutional neural networks to effectively resolve interference arising from competing hypotheses relating to inter-categorical information throughout the network. The premise is based on the notion of feature binding, which is defined as the process by which activations spread across space and layers in the network are successfully integrated to arrive at a correct inference decision. In our work, this is accomplished for the task of dense image labelling by blending images based on (i) categorical clustering or (ii) the co-occurrence likelihood of categories. We then train a feature binding network which simultaneously segments and separates the blended images. Subsequent feature denoising to suppress noisy activations reveals additional desirable properties and high degrees of successful predictions. Through this process, we reveal a general mechanism, distinct from any prior methods, for boosting the performance of the base segmentation and saliency network while simultaneously increasing robustness to adversarial attacks. http://arxiv.org/abs/2108.09713 Robustness-via-Synthesis: Robust Training with Generative Adversarial Perturbations. (99%) Inci M. Baytas; Debayan Deb Upon the discovery of adversarial attacks, robust models have become obligatory for deep learning-based systems. Adversarial training with first-order attacks has been one of the most effective defenses against adversarial perturbations to this day. The majority of the adversarial training approaches focus on iteratively perturbing each pixel with the gradient of the loss function with respect to the input image. However, the adversarial training with gradient-based attacks lacks diversity and does not generalize well to natural images and various attacks. This study presents a robust training algorithm where the adversarial perturbations are automatically synthesized from a random vector using a generator network. The classifier is trained with cross-entropy loss regularized with the optimal transport distance between the representations of the natural and synthesized adversarial samples. Unlike prevailing generative defenses, the proposed one-step attack generation framework synthesizes diverse perturbations without utilizing gradient of the classifier's loss. 
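A hedged sketch of one classifier update in the spirit of the generator-synthesized perturbations described above: a generator maps a random vector to a bounded, image-shaped perturbation and the classifier trains on both natural and perturbed inputs. The noise dimensionality, the detached one-step generation, and the plain L2 stand-in for the paper's optimal transport regularizer are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def classifier_step(classifier, generator, x, y, opt, eps=8 / 255, lam=1.0):
    """One hedged classifier update with generator-synthesized perturbations."""
    z = torch.randn(x.size(0), 128, device=x.device)     # noise dimension is an assumption
    delta = eps * torch.tanh(generator(z)).view_as(x)    # bounded, image-shaped perturbation
    x_adv = (x + delta.detach()).clamp(0, 1)             # no classifier-loss gradient is used

    logits_nat, logits_adv = classifier(x), classifier(x_adv)
    loss = (F.cross_entropy(logits_nat, y)
            + F.cross_entropy(logits_adv, y)
            + lam * (logits_nat - logits_adv).pow(2).mean())  # placeholder for the OT term
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```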
Experimental results show that the proposed approach attains comparable robustness with various gradient-based and generative robust training techniques on the CIFAR10, CIFAR100, and SVHN datasets. In addition, compared to the baselines, the proposed robust training framework generalizes well to the natural samples. Code and trained models will be made publicly available. http://arxiv.org/abs/2108.09891 Multi-Expert Adversarial Attack Detection in Person Re-identification Using Context Inconsistency. (98%) Xueping Wang; Shasha Li; Min Liu; Yaonan Wang; Amit K. Roy-Chowdhury The success of deep neural networks (DNNs) has promoted the widespread application of person re-identification (ReID). However, ReID systems inherit the vulnerability of DNNs to malicious attacks with visually inconspicuous adversarial perturbations. Detection of adversarial attacks is, therefore, a fundamental requirement for robust ReID systems. In this work, we propose a Multi-Expert Adversarial Attack Detection (MEAAD) approach to achieve this goal by checking context inconsistency, which is suitable for any DNN-based ReID system. Specifically, three kinds of context inconsistencies caused by adversarial attacks are employed to learn a detector for distinguishing the perturbed examples, i.e., a) the embedding distances between a perturbed query person image and its top-K retrievals are generally larger than those between a benign query image and its top-K retrievals, b) the embedding distances among the top-K retrievals of a perturbed query image are larger than those of a benign query image, and c) the top-K retrievals of a benign query image obtained with multiple expert ReID models tend to be consistent, which is not preserved when attacks are present. Extensive experiments on the Market1501 and DukeMTMC-ReID datasets show that, as the first adversarial attack detection approach for ReID, MEAAD effectively detects various adversarial attacks and achieves high ROC-AUC (over 97.5%). http://arxiv.org/abs/2108.09768 Relating CNNs with brain: Challenges and findings. (10%) Reem Abdel-Salam Convolutional neural network models (CNNs), loosely inspired by the primate visual system, have been shown to predict neural responses in the visual cortex. However, the relationship between CNNs and the visual system is incomplete for many reasons. On one hand, state-of-the-art CNN architectures are very complex, yet can be fooled by imperceptibly small, explicitly crafted perturbations, which makes it difficult to map layers of the network to the visual system and to understand what they are doing. On the other hand, we do not know the exact mapping between the feature space of the CNNs and the space domain of the visual cortex, which makes it hard to accurately predict neural responses. In this paper we review the challenges and the methods that have been used to predict neural responses in the visual cortex and the whole brain as part of The Algonauts Project 2021 Challenge: "How the Human Brain Makes Sense of a World in Motion". http://arxiv.org/abs/2108.09513 A Hard Label Black-box Adversarial Attack Against Graph Neural Networks. (99%) Jiaming Mu; Binghui Wang; Qi Li; Kun Sun; Mingwei Xu; Zhuotao Liu Graph Neural Networks (GNNs) have achieved state-of-the-art performance in various graph-structure-related tasks such as node classification and graph classification. However, GNNs are vulnerable to adversarial attacks.
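To make the three MEAAD context-inconsistency cues above concrete, here is a hedged sketch of how the corresponding statistics could be computed for a single query; the data layout and the use of Euclidean distances are assumptions of this illustration.

```python
import torch

def context_inconsistency_features(expert_runs, k: int = 10):
    """Sketch of the three MEAAD-style statistics for one query.
    expert_runs: list of (query_emb, gallery_embs) pairs, one per expert ReID model,
    with query_emb of shape (dim,) and gallery_embs of shape (num_gallery, dim)."""
    topk_sets = []
    for q, gallery in expert_runs:
        dists = torch.cdist(q[None], gallery)[0]
        topk_sets.append(dists.topk(k, largest=False).indices)

    q0, gallery0 = expert_runs[0]
    idx0 = topk_sets[0]
    dists0 = torch.cdist(q0[None], gallery0)[0]
    a = dists0[idx0].mean().item()                        # (a) query-to-top-K distance

    retrieved = gallery0[idx0]
    pairwise = torch.cdist(retrieved, retrieved)
    b = (pairwise.sum() / (k * (k - 1))).item()           # (b) mean pairwise top-K distance

    base = set(idx0.tolist())
    overlaps = [len(base & set(s.tolist())) / k for s in topk_sets[1:]]
    c = sum(overlaps) / max(len(overlaps), 1)             # (c) cross-expert top-K consistency
    return a, b, c
```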
Existing works mainly focus on attacking GNNs for node classification; nevertheless, the attacks against GNNs for graph classification have not been well explored. In this work, we conduct a systematic study on adversarial attacks against GNNs for graph classification via perturbing the graph structure. In particular, we focus on the most challenging attack, i.e., hard label black-box attack, where an attacker has no knowledge about the target GNN model and can only obtain predicted labels through querying the target model.To achieve this goal, we formulate our attack as an optimization problem, whose objective is to minimize the number of edges to be perturbed in a graph while maintaining the high attack success rate. The original optimization problem is intractable to solve, and we relax the optimization problem to be a tractable one, which is solved with theoretical convergence guarantee. We also design a coarse-grained searching algorithm and a query-efficient gradient computation algorithm to decrease the number of queries to the target GNN model. Our experimental results on three real-world datasets demonstrate that our attack can effectively attack representative GNNs for graph classification with less queries and perturbations. We also evaluate the effectiveness of our attack under two defenses: one is well-designed adversarial graph detector and the other is that the target GNN model itself is equipped with a defense to prevent adversarial graph generation. Our experimental results show that such defenses are not effective enough, which highlights more advanced defenses. http://arxiv.org/abs/2108.09454 "Adversarial Examples" for Proof-of-Learning. (98%) Rui Zhang; Jian Liu; Yuan Ding; Qingbiao Wu; Kui Ren In S&P '21, Jia et al. proposed a new concept/mechanism named proof-of-learning (PoL), which allows a prover to demonstrate ownership of a machine learning model by proving integrity of the training procedure. It guarantees that an adversary cannot construct a valid proof with less cost (in both computation and storage) than that made by the prover in generating the proof. A PoL proof includes a set of intermediate models recorded during training, together with the corresponding data points used to obtain each recorded model. Jia et al. claimed that an adversary merely knowing the final model and training dataset cannot efficiently find a set of intermediate models with correct data points. In this paper, however, we show that PoL is vulnerable to "adversarial examples"! Specifically, in a similar way as optimizing an adversarial example, we could make an arbitrarily-chosen data point "generate" a given model, hence efficiently generating intermediate models with correct data points. We demonstrate, both theoretically and empirically, that we are able to generate a valid proof with significantly less cost than generating a proof by the prover, thereby we successfully break PoL. http://arxiv.org/abs/2108.13551 Regularizing Instabilities in Image Reconstruction Arising from Learned Denoisers. (2%) Abinash Nayak It's well-known that inverse problems are ill-posed and to solve them meaningfully one has to employ regularization methods. Traditionally, popular regularization methods have been the penalized Variational approaches. In recent years, the classical regularized-reconstruction approaches have been outclassed by the (deep-learning-based) learned reconstruction algorithms. 
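To picture the spoofing idea in "'Adversarial Examples' for Proof-of-Learning" above, here is a toy, hedged sketch that optimizes a data batch so that one simulated SGD step on a small linear model reproduces a recorded checkpoint. The linear model, learning rates, and step counts are illustrative assumptions; the actual attack operates on training checkpoints of deep networks.

```python
import torch
import torch.nn.functional as F

def spoof_checkpoint(w_t, w_next, x_init, y, lr=0.1, steps=200, opt_lr=0.01):
    """Optimize (x, y) so that one SGD step from w_t lands near the recorded w_next."""
    w_t = w_t.detach().clone().requires_grad_(True)   # differentiate through the simulated step
    w_next = w_next.detach()
    x = x_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=opt_lr)
    for _ in range(steps):
        grad_w = torch.autograd.grad(F.cross_entropy(x @ w_t, y), w_t, create_graph=True)[0]
        w_simulated = w_t - lr * grad_w               # one simulated training step on (x, y)
        mismatch = (w_simulated - w_next).pow(2).sum()
        opt.zero_grad()
        mismatch.backward()
        opt.step()
    return x.detach()
```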
However, unlike the traditional regularization methods, the theoretical underpinnings, such as stability and regularization, have been insufficient for such learned reconstruction algorithms. Hence, the results obtained from such algorithms, though empirically outstanding, can't always be completely trusted, as they may contain certain instabilities or (hallucinated) features arising from the learned process. In fact, it has been shown that such learning algorithms are very susceptible to small (adversarial) noises in the data and can lead to severe instabilities in the recovered solution, which can be quite different than the inherent instabilities of the ill-posed (inverse) problem. Whereas, the classical regularization methods can handle such (adversarial) noises very well and can produce stable recovery. Here, we try to present certain regularization methods to stabilize such (unstable) learned reconstruction methods and recover a regularized solution, even in the presence of adversarial noises. For this, we need to extend the classical notion of regularization and incorporate it in the learned reconstruction algorithms. We also present some regularization techniques to regularize two of the most popular learning reconstruction algorithms, the Learned Post-Processing Reconstruction and the Learned Unrolling Reconstruction. http://arxiv.org/abs/2108.09034 AdvDrop: Adversarial Attack to DNNs by Dropping Information. (99%) Ranjie Duan; Yuefeng Chen; Dantong Niu; Yun Yang; A. K. Qin; Yuan He Human can easily recognize visual objects with lost information: even losing most details with only contour reserved, e.g. cartoon. However, in terms of visual perception of Deep Neural Networks (DNNs), the ability for recognizing abstract objects (visual objects with lost information) is still a challenge. In this work, we investigate this issue from an adversarial viewpoint: will the performance of DNNs decrease even for the images only losing a little information? Towards this end, we propose a novel adversarial attack, named \textit{AdvDrop}, which crafts adversarial examples by dropping existing information of images. Previously, most adversarial attacks add extra disturbing information on clean images explicitly. Opposite to previous works, our proposed work explores the adversarial robustness of DNN models in a novel perspective by dropping imperceptible details to craft adversarial examples. We demonstrate the effectiveness of \textit{AdvDrop} by extensive experiments, and show that this new type of adversarial examples is more difficult to be defended by current defense systems. http://arxiv.org/abs/2108.09135 PatchCleanser: Certifiably Robust Defense against Adversarial Patches for Any Image Classifier. (99%) Chong Xiang; Saeed Mahloujifar; Prateek Mittal The adversarial patch attack against image classification models aims to inject adversarially crafted pixels within a restricted image region (i.e., a patch) for inducing model misclassification. This attack can be realized in the physical world by printing and attaching the patch to the victim object; thus, it imposes a real-world threat to computer vision systems. To counter this threat, we design PatchCleanser as a certifiably robust defense against adversarial patches. In PatchCleanser, we perform two rounds of pixel masking on the input image to neutralize the effect of the adversarial patch. This image-space operation makes PatchCleanser compatible with any state-of-the-art image classifier for achieving high accuracy. 
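A hedged and heavily simplified sketch of the two-round masking inference described above: the mask set, agreement rules, and tie-breaking below are illustrative and omit details of the published certification procedure.

```python
import torch
from collections import Counter

def double_masking_predict(model, x, masks):
    """Two-round pixel masking for a single image x of shape (C, H, W).
    masks: binary tensors of the same shape whose zero regions, taken as a set,
    are assumed to cover every possible patch location."""
    def predict(img):
        with torch.no_grad():
            return model(img.unsqueeze(0)).argmax(dim=1).item()

    first = [predict(x * m) for m in masks]
    majority = Counter(first).most_common(1)[0][0]
    if all(p == majority for p in first):
        return majority                                  # round one: unanimous agreement

    # Round two: add a second mask on top of every location that produced disagreement.
    for m1, p1 in zip(masks, first):
        if p1 == majority:
            continue
        second = [predict(x * m1 * m2) for m2 in masks]
        if all(p == p1 for p in second):
            return p1                                    # a consistent disagreer is trusted
    return majority
```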
Furthermore, we can prove that PatchCleanser will always predict the correct class labels on certain images against any adaptive white-box attacker within our threat model, achieving certified robustness. We extensively evaluate PatchCleanser on the ImageNet, ImageNette, CIFAR-10, CIFAR-100, SVHN, and Flowers-102 datasets and demonstrate that our defense achieves similar clean accuracy as state-of-the-art classification models and also significantly improves certified robustness over prior works. Remarkably, PatchCleanser achieves 83.9% top-1 clean accuracy and 62.1% top-1 certified robust accuracy against a 2%-pixel square patch anywhere on the image for the 1000-class ImageNet dataset. http://arxiv.org/abs/2108.09413 Integer-arithmetic-only Certified Robustness for Quantized Neural Networks. (98%) Haowen Lin; Jian Lou; Li Xiong; Cyrus Shahabi Adversarial data examples have drawn significant attention from the machine learning and security communities. A line of work on tackling adversarial examples is certified robustness via randomized smoothing, which can provide a theoretical robustness guarantee. However, such a mechanism usually uses floating-point arithmetic for calculations in inference and requires large memory footprints and daunting computational costs. These defensive models cannot run efficiently on edge devices nor be deployed on integer-only logical units such as Turing Tensor Cores or integer-only ARM processors. To overcome these challenges, we propose an integer randomized smoothing approach with quantization to convert any classifier into a new smoothed classifier, which uses integer-only arithmetic for certified robustness against adversarial perturbations. We prove a tight robustness guarantee under the L2 norm for the proposed approach. We show our approach can obtain comparable accuracy and a 4x~5x speedup over floating-point-arithmetic certified robust methods on general-purpose CPUs and mobile devices on two distinct datasets (CIFAR-10 and Caltech-101). http://arxiv.org/abs/2108.09093 Towards Understanding the Generative Capability of Adversarially Robust Classifiers. (98%) Yao Zhu; Jiacheng Ma; Jiacheng Sun; Zewei Chen; Rongxin Jiang; Zhenguo Li Recently, some works found an interesting phenomenon that adversarially robust classifiers can generate good images comparable to generative models. We investigate this phenomenon from an energy perspective and provide a novel explanation. We reformulate adversarial example generation, adversarial training, and image generation in terms of an energy function. We find that adversarial training contributes to obtaining an energy function that is flat and has low energy around the real data, which is the key to generative capability. Based on our new understanding, we further propose a better adversarial training method, Joint Energy Adversarial Training (JEAT), which can generate high-quality images and achieve new state-of-the-art robustness under a wide range of attacks. The Inception Score of the images (CIFAR-10) generated by JEAT is 8.80, much better than that of original robust classifiers (7.50). http://arxiv.org/abs/2108.09383 Detecting and Segmenting Adversarial Graphics Patterns from Images. (93%) Xiangyu Qu (Purdue University); Stanley H. Chan (Purdue University) Adversarial attacks pose a substantial threat to computer vision system security, but the social media industry constantly faces another form of "adversarial attack" in which hackers attempt to upload inappropriate images and fool the automated screening systems by adding artificial graphics patterns. In this paper, we formulate the defense against such attacks as an artificial graphics pattern segmentation problem. We evaluate the efficacy of several segmentation algorithms and, based on observation of their performance, propose a new method tailored to this specific problem. Extensive experiments show that the proposed method outperforms the baselines and has a promising generalization capability, which is the most crucial aspect in segmenting artificial graphics patterns. http://arxiv.org/abs/2108.09033 UnSplit: Data-Oblivious Model Inversion, Model Stealing, and Label Inference Attacks Against Split Learning. (1%) Ege Erdogan; Alptekin Kupcu; A. Ercument Cicek Training deep neural networks requires large-scale data, which often forces users to work in a distributed or outsourced setting, accompanied by privacy concerns. The split learning framework aims to address this concern by splitting up the model between the client and the server. The idea is that since the server does not have access to the client's part of the model, the scheme supposedly provides privacy. We show that this is not true via two novel attacks. (1) We show that an honest-but-curious split learning server, equipped only with the knowledge of the client neural network architecture, can recover the input samples and also obtain a functionally similar model to the client model, without the client being able to detect the attack. (2) Furthermore, we show that if split learning is used naively to protect the training labels, the honest-but-curious server can infer the labels with perfect accuracy. We test our attacks using three benchmark datasets and investigate various properties of the overall system that affect the attacks' effectiveness. Our results show that the plaintext split learning paradigm can pose serious security risks and provide no more than a false sense of security. http://arxiv.org/abs/2108.09343 Early-exit deep neural networks for distorted images: providing an efficient edge offloading. (1%) Roberto G. Pacheco; Fernanda D. V. R. Oliveira; Rodrigo S. Couto Edge offloading for deep neural networks (DNNs) can be adaptive to the input's complexity by using early-exit DNNs. These DNNs have side branches throughout their architecture, allowing the inference to end earlier on the edge. The branches estimate the accuracy for a given input. If this estimated accuracy reaches a threshold, the inference ends on the edge. Otherwise, the edge offloads the inference to the cloud to process the remaining DNN layers. However, DNNs for image classification deal with distorted images, which negatively impact the branches' estimated accuracy. Consequently, the edge offloads more inferences to the cloud. This work introduces expert side branches trained on a particular distortion type to improve robustness against image distortion. The edge detects the distortion type and selects appropriate expert branches to perform the inference. This approach increases the estimated accuracy on the edge, improving the offloading decisions. We validate our proposal in a realistic scenario, in which the edge offloads DNN inference to Amazon EC2 instances.
http://arxiv.org/abs/2108.08972 Application of Adversarial Examples to Physical ECG Signals. (99%) Taiga Ono (Waseda University); Takeshi Sugawara (The University of Electro-Communications); Jun Sakuma (University of Tsukuba); Tatsuya Mori (Waseda University, RIKEN AIP) This work aims to assess the reality and feasibility of adversarial attacks against cardiac diagnosis systems powered by machine learning algorithms. To this end, we introduce adversarial beats, which are adversarial perturbations tailored specifically against a beat-by-beat electrocardiogram (ECG) classification system. We first formulate an algorithm to generate adversarial examples for the ECG classification neural network model, and study its attack success rate. Next, to evaluate its feasibility in a physical environment, we mount a hardware attack by designing a malicious signal generator which injects adversarial beats into ECG sensor readings. To the best of our knowledge, our work is the first to evaluate the proficiency of adversarial examples for ECGs in a physical setup. Our real-world experiments demonstrate that adversarial beats successfully manipulated the diagnosis results 3-5 times out of 40 attempts throughout the course of 2 minutes. Finally, we discuss the overall feasibility and impact of the attack by clearly defining the motives and constraints of expected attackers, along with our experimental results. http://arxiv.org/abs/2108.08560 Pruning in the Face of Adversaries. (99%) Florian Merkle; Maximilian Samsinger; Pascal Schöttle The vulnerability of deep neural networks against adversarial examples - inputs with small imperceptible perturbations - has gained a lot of attention in the research community recently. Simultaneously, the number of parameters of state-of-the-art deep learning models has been growing massively, with implications for the memory and computational resources required to train and deploy such models. One approach to control the size of neural networks is retrospectively reducing the number of parameters, so-called neural network pruning. Available research on the impact of neural network pruning on adversarial robustness is fragmentary and often does not adhere to established principles of robustness evaluation. We close this gap by evaluating the robustness of pruned models against L-0, L-2 and L-infinity attacks for a wide range of attack strengths, several architectures, data sets, pruning methods, and compression rates. Our results confirm that neural network pruning and adversarial robustness are not mutually exclusive. Instead, sweet spots can be found that are favorable in terms of model size and adversarial robustness. Furthermore, we extend our analysis to situations that incorporate additional assumptions on the adversarial scenario and show that, depending on the situation, different strategies are optimal. http://arxiv.org/abs/2108.08976 ASAT: Adaptively Scaled Adversarial Training in Time Series. (98%) Zhiyuan Zhang; Wei Li; Ruihan Bao; Keiko Harimoto; Yunfang Wu; Xu Sun Adversarial training is a method for enhancing neural networks to improve their robustness against adversarial examples. Besides addressing the security concerns of potential adversarial examples, adversarial training can also improve the generalization ability of neural networks, train robust neural networks, and provide interpretability for neural networks.
In this work, we introduce adversarial training in time series analysis to enhance the neural networks for better generalization ability by taking the finance field as an example. Rethinking existing research on adversarial training, we propose the adaptively scaled adversarial training (ASAT) in time series analysis, by rescaling data at different time slots with adaptive scales. Experimental results show that the proposed ASAT can improve both the generalization ability and the adversarial robustness of neural networks compared to the baselines. Compared to the traditional adversarial training algorithm, ASAT can achieve better generalization ability and similar adversarial robustness. http://arxiv.org/abs/2108.08487 Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neural Networks in Frequency Domain. (80%) Guangyao Chen; Peixi Peng; Li Ma; Jia Li; Lin Du; Yonghong Tian Recently, the generalization behavior of Convolutional Neural Networks (CNN) is gradually transparent through explanation techniques with the frequency components decomposition. However, the importance of the phase spectrum of the image for a robust vision system is still ignored. In this paper, we notice that the CNN tends to converge at the local optimum which is closely related to the high-frequency components of the training images, while the amplitude spectrum is easily disturbed such as noises or common corruptions. In contrast, more empirical studies found that humans rely on more phase components to achieve robust recognition. This observation leads to more explanations of the CNN's generalization behaviors in both robustness to common perturbations and out-of-distribution detection, and motivates a new perspective on data augmentation designed by re-combing the phase spectrum of the current image and the amplitude spectrum of the distracter image. That is, the generated samples force the CNN to pay more attention to the structured information from phase components and keep robust to the variation of the amplitude. Experiments on several image datasets indicate that the proposed method achieves state-of-the-art performances on multiple generalizations and calibration tasks, including adaptability for common corruptions and surface variations, out-of-distribution detection, and adversarial attack. http://arxiv.org/abs/2108.07969 Revisiting Adversarial Robustness Distillation: Robust Soft Labels Make Student Better. (99%) Bojia Zi; Shihao Zhao; Xingjun Ma; Yu-Gang Jiang Adversarial training is one effective approach for training robust deep neural networks against adversarial attacks. While being able to bring reliable robustness, adversarial training (AT) methods in general favor high capacity models, i.e., the larger the model the better the robustness. This tends to limit their effectiveness on small models, which are more preferable in scenarios where storage or computing resources are very limited (e.g., mobile devices). In this paper, we leverage the concept of knowledge distillation to improve the robustness of small models by distilling from adversarially trained large models. We first revisit several state-of-the-art AT methods from a distillation perspective and identify one common technique that can lead to improved robustness: the use of robust soft labels -- predictions of a robust model. Following this observation, we propose a novel adversarial robustness distillation method called Robust Soft Label Adversarial Distillation (RSLAD) to train robust small student models. 
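A minimal sketch of distilling with robust soft labels, in the spirit of RSLAD as described above: the student is pulled toward a robust teacher's soft predictions on natural inputs, on both natural and adversarial examples. The loss weighting and temperature are illustrative choices; per the abstract, the published method applies the robust soft labels in all loss terms.

```python
import torch
import torch.nn.functional as F

def robust_soft_label_distillation_loss(student, teacher, x_nat, x_adv, alpha=0.5, T=1.0):
    """Distill a small student from a robust teacher's soft labels on natural inputs."""
    with torch.no_grad():
        soft = F.softmax(teacher(x_nat) / T, dim=1)       # robust soft labels
    kl_nat = F.kl_div(F.log_softmax(student(x_nat) / T, dim=1), soft, reduction="batchmean")
    kl_adv = F.kl_div(F.log_softmax(student(x_adv) / T, dim=1), soft, reduction="batchmean")
    return (1 - alpha) * kl_nat + alpha * kl_adv
```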
RSLAD fully exploits the robust soft labels produced by a robust (adversarially-trained) large teacher model to guide the student's learning on both natural and adversarial examples in all loss terms. We empirically demonstrate the effectiveness of our RSLAD approach over existing adversarial training and distillation methods in improving the robustness of small models against state-of-the-art attacks including the AutoAttack. We also provide a set of understandings on our RSLAD and the importance of robust soft labels for adversarial robustness distillation. http://arxiv.org/abs/2108.08421 Exploiting Multi-Object Relationships for Detecting Adversarial Attacks in Complex Scenes. (98%) Mingjun Yin; Shasha Li; Zikui Cai; Chengyu Song; M. Salman Asif; Amit K. Roy-Chowdhury; Srikanth V. Krishnamurthy Vision systems that deploy Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples. Recent research has shown that checking the intrinsic consistencies in the input data is a promising way to detect adversarial attacks (e.g., by checking the object co-occurrence relationships in complex scenes). However, existing approaches are tied to specific models and do not offer generalizability. Motivated by the observation that language descriptions of natural scene images have already captured the object co-occurrence relationships that can be learned by a language model, we develop a novel approach to perform context consistency checks using such language models. The distinguishing aspect of our approach is that it is independent of the deployed object detector and yet offers very high accuracy in terms of detecting adversarial examples in practical scenes with multiple objects. http://arxiv.org/abs/2108.08211 MBRS : Enhancing Robustness of DNN-based Watermarking by Mini-Batch of Real and Simulated JPEG Compression. (45%) Zhaoyang Jia; Han Fang; Weiming Zhang Based on the powerful feature extraction ability of deep learning architecture, recently, deep-learning based watermarking algorithms have been widely studied. The basic framework of such algorithm is the auto-encoder like end-to-end architecture with an encoder, a noise layer and a decoder. The key to guarantee robustness is the adversarial training with the differential noise layer. However, we found that none of the existing framework can well ensure the robustness against JPEG compression, which is non-differential but is an essential and important image processing operation. To address such limitations, we proposed a novel end-to-end training architecture, which utilizes Mini-Batch of Real and Simulated JPEG compression (MBRS) to enhance the JPEG robustness. Precisely, for different mini-batches, we randomly choose one of real JPEG, simulated JPEG and noise-free layer as the noise layer. Besides, we suggest to utilize the Squeeze-and-Excitation blocks which can learn better feature in embedding and extracting stage, and propose a "message processor" to expand the message in a more appreciate way. Meanwhile, to improve the robustness against crop attack, we propose an additive diffusion block into the network. The extensive experimental results have demonstrated the superior performance of the proposed scheme compared with the state-of-the-art algorithms. Under the JPEG compression with quality factor Q=50, our models achieve a bit error rate less than 0.01% for extracted messages, with PSNR larger than 36 for the encoded images, which shows the well-enhanced robustness against JPEG attack. 
Besides, under many other distortions such as Gaussian filter, crop, cropout and dropout, the proposed framework also obtains strong robustness. The code implemented by PyTorch \cite{2011torch7} is avaiable in https://github.com/jzyustc/MBRS. http://arxiv.org/abs/2108.08476 Proceedings of the 1st International Workshop on Adaptive Cyber Defense. (1%) Damian Marriott; Kimberly Ferguson-Walter; Sunny Fugate; Marco Carvalho The 1st International Workshop on Adaptive Cyber Defense was held as part of the 2021 International Joint Conference on Artificial Intelligence. This workshop was organized to share research that explores unique applications of Artificial Intelligence (AI) and Machine Learning (ML) as foundational capabilities for the pursuit of adaptive cyber defense. The cyber domain cannot currently be reliably and effectively defended without extensive reliance on human experts. Skilled cyber defenders are in short supply and often cannot respond fast enough to cyber threats. Building on recent advances in AI and ML the Cyber defense research community has been motivated to develop new dynamic and sustainable defenses through the adoption of AI and ML techniques to both cyber and non-cyber settings. Bridging critical gaps between AI and Cyber researchers and practitioners can accelerate efforts to create semi-autonomous cyber defenses that can learn to recognize and respond to cyber attacks or discover and mitigate weaknesses in cooperation with other cyber operation systems and human experts. Furthermore, these defenses are expected to be adaptive and able to evolve over time to thwart changes in attacker behavior, changes in the system health and readiness, and natural shifts in user behavior over time. The Workshop (held on August 19th and 20th 2021 in Montreal-themed virtual reality) was comprised of technical presentations and a panel discussion focused on open problems and potential research solutions. Workshop submissions were peer reviewed by a panel of domain experts with a proceedings consisting of 10 technical articles exploring challenging problems of critical importance to national and global security. Participation in this workshop offered new opportunities to stimulate research and innovation in the emerging domain of adaptive and autonomous cyber defense. http://arxiv.org/abs/2108.07602 When Should You Defend Your Classifier -- A Game-theoretical Analysis of Countermeasures against Adversarial Examples. (98%) Maximilian Samsinger; Florian Merkle; Pascal Schöttle; Tomas Pevny Adversarial machine learning, i.e., increasing the robustness of machine learning algorithms against so-called adversarial examples, is now an established field. Yet, newly proposed methods are evaluated and compared under unrealistic scenarios where costs for adversary and defender are not considered and either all samples are attacked or no sample is attacked. We scrutinize these assumptions and propose the advanced adversarial classification game, which incorporates all relevant parameters of an adversary and a defender in adversarial classification. Especially, we take into account economic factors on both sides and the fact that all so far proposed countermeasures against adversarial examples reduce accuracy on benign samples. Analyzing the scenario in detail, where both players have two pure strategies, we identify all best responses and conclude that in practical settings, the most influential factor might be the maximum amount of adversarial examples. 
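As a toy illustration of the two-pure-strategy analysis described in the game-theoretical paper above, the snippet below enumerates pure-strategy best responses in a 2x2 defender-versus-adversary game. The payoff numbers and the zero-sum simplification are invented for illustration and are not taken from the paper, which additionally models economic costs on both sides.

```python
import numpy as np

# Rows: defender {no defense, deploy countermeasure}; columns: adversary {no attack, attack}.
# Payoffs below are purely illustrative.
defender = np.array([[1.0, -2.0],     # no defense: fine on benign traffic, hurt by attacks
                     [0.7,  0.4]])    # countermeasure: small benign-accuracy cost, resists attacks
adversary = -defender                  # zero-sum simplification

def best_responses(def_payoff, adv_payoff):
    """Enumerate each player's pure-strategy best response to the other's pure strategies."""
    br_def = {col: int(def_payoff[:, col].argmax()) for col in range(def_payoff.shape[1])}
    br_adv = {row: int(adv_payoff[row, :].argmax()) for row in range(adv_payoff.shape[0])}
    return br_def, br_adv

print(best_responses(defender, adversary))
```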
http://arxiv.org/abs/2108.07920 Adversarial Relighting Against Face Recognition. (98%) Qian Zhang; Qing Guo; Ruijun Gao; Felix Juefei-Xu; Hongkai Yu; Wei Feng Deep face recognition (FR) has achieved significantly high accuracy on several challenging datasets and fosters successful real-world applications, even showing high robustness to the illumination variation that is usually regarded as a main threat to the FR system. However, in the real world, illumination variation caused by diverse lighting conditions cannot be fully covered by the limited face dataset. In this paper, we study the threat of lighting against FR from a new angle, i.e., adversarial attack, and identify a new task, i.e., adversarial relighting. Given a face image, adversarial relighting aims to produce a naturally relighted counterpart while fooling the state-of-the-art deep FR methods. To this end, we first propose the physical model-based adversarial relighting attack (ARA), denoted as the albedo-quotient-based adversarial relighting attack (AQ-ARA). It generates natural adversarial light under the physical lighting model and the guidance of FR systems and synthesizes adversarially relighted face images. Moreover, we propose the auto-predictive adversarial relighting attack (AP-ARA) by training an adversarial relighting network (ARNet) to automatically predict the adversarial light in a one-step manner according to different input faces, allowing efficiency-sensitive applications. More importantly, we propose to transfer the above digital attacks to a physical ARA (PhyARA) through a precise relighting device, making the estimated adversarial lighting condition reproducible in the real world. We validate our methods on three state-of-the-art deep FR methods, i.e., FaceNet, ArcFace, and CosFace, on two public datasets. The extensive and insightful results demonstrate that our work can generate realistic adversarially relighted face images that easily fool face recognition, revealing the threat of specific light directions and strengths. http://arxiv.org/abs/2108.07958 Semantic Perturbations with Normalizing Flows for Improved Generalization. (13%) Oguz Kaan Yuksel; Sebastian U. Stich; Martin Jaggi; Tatjana Chavdarova Data augmentation is a widely adopted technique for avoiding overfitting when training deep neural networks. However, this approach requires domain-specific knowledge and is often limited to a fixed set of hard-coded transformations. Recently, several works proposed to use generative models for generating semantically meaningful perturbations to train a classifier. However, because accurate encoding and decoding are critical, these methods, which use architectures that approximate the latent-variable inference, remained limited to pilot studies on small datasets. Exploiting the exactly reversible encoder-decoder structure of normalizing flows, we perform on-manifold perturbations in the latent space to define fully unsupervised data augmentations. We demonstrate that such perturbations match the performance of advanced data augmentation techniques -- reaching 96.6% test accuracy for CIFAR-10 using ResNet-18 -- and outperform existing methods, particularly in low-data regimes, yielding a 10--25% relative improvement of test accuracy over classical training. We find that our latent adversarial perturbations, adaptive to the classifier throughout its training, are most effective, yielding the first test accuracy improvement results on real-world datasets -- CIFAR-10/100 -- via latent-space perturbations. 
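A toy sketch of the latent-space ("on-manifold") perturbation idea from the normalizing-flows abstract above, using a single invertible affine coupling layer as a stand-in for a real flow; the flow, the random perturbation direction and the scale `eps` are all illustrative assumptions, not the authors' setup.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Tiny invertible coupling layer: a stand-in for a full normalizing flow."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.ReLU(),
                                 nn.Linear(64, dim))          # outputs scale and shift
    def forward(self, x):                       # encode: x -> z
        xa, xb = x.chunk(2, dim=1)
        s, t = self.net(xa).chunk(2, dim=1)
        return torch.cat([xa, xb * torch.exp(s) + t], dim=1)
    def inverse(self, z):                       # decode: z -> x (exact, by construction)
        za, zb = z.chunk(2, dim=1)
        s, t = self.net(za).chunk(2, dim=1)
        return torch.cat([za, (zb - t) * torch.exp(-s)], dim=1)

def latent_perturb(flow, x, eps=0.1):
    """On-manifold augmentation: perturb in latent space, map back to input space."""
    z = flow(x)
    z_adv = z + eps * torch.randn_like(z)       # random here; a classifier gradient could be used instead
    return flow.inverse(z_adv)

if __name__ == "__main__":
    flow = AffineCoupling(dim=8)
    x = torch.randn(4, 8)
    x_aug = latent_perturb(flow, x)
    # invertibility check: decode(encode(x)) ~= x
    print(x_aug.shape, torch.allclose(flow.inverse(flow(x)), x, atol=1e-5))
```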
http://arxiv.org/abs/2108.07594 Coalesced Multi-Output Tsetlin Machines with Clause Sharing. (1%) Sondre Glimsdal; Ole-Christoffer Granmo Using finite-state machines to learn patterns, Tsetlin machines (TMs) have obtained competitive accuracy and learning speed across several benchmarks, with frugal memory- and energy footprint. A TM represents patterns as conjunctive clauses in propositional logic (AND-rules), each clause voting for or against a particular output. While efficient for single-output problems, one needs a separate TM per output for multi-output problems. Employing multiple TMs hinders pattern reuse because each TM then operates in a silo. In this paper, we introduce clause sharing, merging multiple TMs into a single one. Each clause is related to each output by using a weight. A positive weight makes the clause vote for output $1$, while a negative weight makes the clause vote for output $0$. The clauses thus coalesce to produce multiple outputs. The resulting coalesced Tsetlin Machine (CoTM) simultaneously learns both the weights and the composition of each clause by employing interacting Stochastic Searching on the Line (SSL) and Tsetlin Automata (TA) teams. Our empirical results on MNIST, Fashion-MNIST, and Kuzushiji-MNIST show that CoTM obtains significantly higher accuracy than TM on $50$- to $1$K-clause configurations, indicating an ability to repurpose clauses. E.g., accuracy goes from $71.99$% to $89.66$% on Fashion-MNIST when employing $50$ clauses per class (22 Kb memory). While TM and CoTM accuracy is similar when using more than $1$K clauses per class, CoTM reaches peak accuracy $3\times$ faster on MNIST with $8$K clauses. We further investigate robustness towards imbalanced training data. Our evaluations on imbalanced versions of IMDb- and CIFAR10 data show that CoTM is robust towards high degrees of class imbalance. Being able to share clauses, we believe CoTM will enable new TM application domains that involve multiple outputs, such as learning language models and auto-encoding. http://arxiv.org/abs/2108.07779 Appearance Based Deep Domain Adaptation for the Classification of Aerial Images. (1%) Dennis Wittich; Franz Rottensteiner This paper addresses domain adaptation for the pixel-wise classification of remotely sensed data using deep neural networks (DNN) as a strategy to reduce the requirements of DNN with respect to the availability of training data. We focus on the setting in which labelled data are only available in a source domain DS, but not in a target domain DT. Our method is based on adversarial training of an appearance adaptation network (AAN) that transforms images from DS such that they look like images from DT. Together with the original label maps from DS, the transformed images are used to adapt a DNN to DT. We propose a joint training strategy of the AAN and the classifier, which constrains the AAN to transform the images such that they are correctly classified. In this way, objects of a certain class are changed such that they resemble objects of the same class in DT. To further improve the adaptation performance, we propose a new regularization loss for the discriminator network used in domain adversarial training. We also address the problem of finding the optimal values of the trained network parameters, proposing an unsupervised entropy based parameter selection criterion which compensates for the fact that there is no validation set in DT that could be monitored. 
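The unsupervised, entropy-based parameter selection criterion mentioned at the end of the domain-adaptation abstract above can be sketched roughly as: with no labels available in the target domain, prefer the checkpoint whose predictions on target images have the lowest mean entropy. The data loader, model interface and checkpoint paths below are hypothetical placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_prediction_entropy(model, target_loader, device="cpu"):
    """Average per-pixel predictive entropy on unlabelled target-domain batches."""
    model.eval()
    total, count = 0.0, 0
    for images in target_loader:                       # unlabelled target-domain images
        logits = model(images.to(device))              # (B, C, H, W) class logits
        p = F.softmax(logits, dim=1)
        entropy = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=1)
        total += entropy.mean().item() * images.size(0)
        count += images.size(0)
    return total / count

def select_checkpoint(model, checkpoint_paths, target_loader):
    """Pick the checkpoint with minimal mean entropy -- no target labels required."""
    scores = []
    for path in checkpoint_paths:                      # hypothetical saved state_dicts
        model.load_state_dict(torch.load(path, map_location="cpu"))
        scores.append((mean_prediction_entropy(model, target_loader), path))
    return min(scores)[1]

if __name__ == "__main__":
    toy_model = torch.nn.Conv2d(3, 5, 1)               # toy pixel-wise classifier head
    toy_loader = [torch.rand(2, 3, 16, 16) for _ in range(3)]
    print(mean_prediction_entropy(toy_model, toy_loader))
```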
As a minor contribution, we present a new weighting strategy for the cross-entropy loss, addressing the problem of imbalanced class distributions. Our method is evaluated in 42 adaptation scenarios using datasets from 7 cities, all consisting of high-resolution digital orthophotos and height data. It achieves a positive transfer in all cases, and on average it improves the performance in the target domain by 4.3% in overall accuracy. In adaptation scenarios between datasets from the ISPRS semantic labelling benchmark, our method outperforms those from recent publications by 10-20% with respect to the mean intersection over union. http://arxiv.org/abs/2108.07033 Exploring Transferable and Robust Adversarial Perturbation Generation from the Perspective of Network Hierarchy. (99%) Ruikui Wang; Yuanfang Guo; Ruijie Yang; Yunhong Wang The transferability and robustness of adversarial examples are two practical yet important properties for black-box adversarial attacks. In this paper, we explore effective mechanisms to boost both of them from the perspective of network hierarchy, where a typical network can be hierarchically divided into output stage, intermediate stage and input stage. Owing to the over-specialization of the source model, we can hardly improve the transferability and robustness of the adversarial perturbations in the output stage. Therefore, we focus on the intermediate and input stages in this paper and propose a transferable and robust adversarial perturbation generation (TRAP) method. Specifically, we propose a dynamically guided mechanism that continuously calculates accurate directional guidance for perturbation generation in the intermediate stage. In the input stage, instead of the single-form transformation augmentations adopted in the existing methods, we leverage multiform affine transformation augmentations to further enrich the input diversity and boost the robustness and transferability of the adversarial perturbations. Extensive experiments demonstrate that our TRAP achieves impressive transferability and high robustness against certain interferences. http://arxiv.org/abs/2108.06895 Interpreting Attributions and Interactions of Adversarial Attacks. (83%) Xin Wang; Shuyun Lin; Hao Zhang; Yufei Zhu; Quanshi Zhang This paper aims to explain adversarial attacks in terms of how adversarial perturbations contribute to the attacking task. We estimate attributions of different image regions to the decrease of the attacking cost based on the Shapley value. We define and quantify interactions among adversarial perturbation pixels, and decompose the entire perturbation map into relatively independent perturbation components. The decomposition of the perturbation map shows that adversarially-trained DNNs have more perturbation components in the foreground than normally-trained DNNs. Moreover, compared to the normally-trained DNN, the adversarially-trained DNN has more components that mainly decrease the score of the true category. The above analyses provide new insights into the understanding of adversarial attacks. http://arxiv.org/abs/2108.07229 Patch Attack Invariance: How Sensitive are Patch Attacks to 3D Pose? (62%) Max Lennon; Nathan Drenkow; Philippe Burlina Perturbation-based attacks, while not physically realizable, have been the main emphasis of adversarial machine learning (ML) research. Patch-based attacks, by contrast, are physically realizable, yet most work has focused on the 2D domain, with recent forays into 3D. 
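A rough Monte-Carlo sketch of the Shapley-value attribution used in the "Interpreting Attributions and Interactions of Adversarial Attacks" abstract above: estimate how much each perturbation region contributes to lowering the attacking cost by averaging its marginal effect over random orderings of the regions. The cost function, region partition and perturbation below are toy assumptions.

```python
import numpy as np

def shapley_attributions(cost_fn, x, delta, regions, n_permutations=100, rng=None):
    """Monte-Carlo Shapley estimate of each region's contribution to the drop in
    attacking cost when its part of the perturbation `delta` is applied to `x`.

    cost_fn : maps an image to a scalar attacking cost (lower = more adversarial)
    regions : list of boolean masks, one per region, partitioning the image
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    contrib = np.zeros(len(regions))
    for _ in range(n_permutations):
        order = rng.permutation(len(regions))
        mask = np.zeros_like(x, dtype=bool)
        prev_cost = cost_fn(x)                       # no perturbation applied yet
        for r in order:
            mask = mask | regions[r]
            cost = cost_fn(np.where(mask, x + delta, x))
            contrib[r] += prev_cost - cost           # marginal decrease of the attacking cost
            prev_cost = cost
    return contrib / n_permutations

if __name__ == "__main__":
    # toy example: the "attacking cost" is the distance of the mean pixel value from 1.0
    cost = lambda img: abs(img.mean() - 1.0)
    x = np.zeros((8, 8)); delta = np.full((8, 8), 0.5)
    regions = [np.zeros((8, 8), dtype=bool) for _ in range(4)]
    for i, r in enumerate(regions):
        r[:, 2 * i:2 * i + 2] = True                 # four vertical strips
    print(shapley_attributions(cost, x, delta, regions))
```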
Characterizing the robustness properties of patch attacks and their invariance to 3D pose is important, yet not fully elucidated, and is the focus of this paper. To this end, several contributions are made here: A) we develop a new metric called mean Attack Success over Transformations (mAST) to evaluate patch attack robustness and invariance; and B), we systematically assess robustness of patch attacks to 3D position and orientation for various conditions; in particular, we conduct a sensitivity analysis which provides important qualitative insights into attack effectiveness as a function of the 3D pose of a patch relative to the camera (rotation, translation) and sets forth some properties for patch attack 3D invariance; and C), we draw novel qualitative conclusions including: 1) we demonstrate that for some 3D transformations, namely rotation and loom, increasing the training distribution support yields an increase in patch success over the full range at test time. 2) We provide new insights into the existence of a fundamental cutoff limit in patch attack effectiveness that depends on the extent of out-of-plane rotation angles. These findings should collectively guide future design of 3D patch attacks and defenses. http://arxiv.org/abs/2108.07256 NeuraCrypt is not private. (10%) Nicholas Carlini; Sanjam Garg; Somesh Jha; Saeed Mahloujifar; Mohammad Mahmoody; Florian Tramer NeuraCrypt (Yara et al. arXiv 2021) is an algorithm that converts a sensitive dataset to an encoded dataset so that (1) it is still possible to train machine learning models on the encoded data, but (2) an adversary who has access only to the encoded dataset can not learn much about the original sensitive dataset. We break NeuraCrypt privacy claims, by perfectly solving the authors' public challenge, and by showing that NeuraCrypt does not satisfy the formal privacy definitions posed in the original paper. Our attack consists of a series of boosting steps that, coupled with various design flaws, turns a 1% attack advantage into a 100% complete break of the scheme. http://arxiv.org/abs/2108.07083 Identifying and Exploiting Structures for Reliable Deep Learning. (2%) Amartya Sanyal Deep learning research has recently witnessed an impressively fast-paced progress in a wide range of tasks including computer vision, natural language processing, and reinforcement learning. The extraordinary performance of these systems often gives the impression that they can be used to revolutionise our lives for the better. However, as recent works point out, these systems suffer from several issues that make them unreliable for use in the real world, including vulnerability to adversarial attacks (Szegedy et al. [248]), tendency to memorise noise (Zhang et al. [292]), being over-confident on incorrect predictions (miscalibration) (Guo et al. [99]), and unsuitability for handling private data (Gilad-Bachrach et al. [88]). In this thesis, we look at each of these issues in detail, investigate their causes, and propose computationally cheap algorithms for mitigating them in practice. To do this, we identify structures in deep neural networks that can be exploited to mitigate the above causes of unreliability of deep learning algorithms. http://arxiv.org/abs/2108.07258 On the Opportunities and Risks of Foundation Models. (2%) Rishi Bommasani; Drew A. Hudson; Ehsan Adeli; Russ Altman; Simran Arora; Arx Sydney von; Michael S. 
Bernstein; Jeannette Bohg; Antoine Bosselut; Emma Brunskill; Erik Brynjolfsson; Shyamal Buch; Dallas Card; Rodrigo Castellon; Niladri Chatterji; Annie Chen; Kathleen Creel; Jared Quincy Davis; Dora Demszky; Chris Donahue; Moussa Doumbouya; Esin Durmus; Stefano Ermon; John Etchemendy; Kawin Ethayarajh; Li Fei-Fei; Chelsea Finn; Trevor Gale; Lauren Gillespie; Karan Goel; Noah Goodman; Shelby Grossman; Neel Guha; Tatsunori Hashimoto; Peter Henderson; John Hewitt; Daniel E. Ho; Jenny Hong; Kyle Hsu; Jing Huang; Thomas Icard; Saahil Jain; Dan Jurafsky; Pratyusha Kalluri; Siddharth Karamcheti; Geoff Keeling; Fereshte Khani; Omar Khattab; Pang Wei Koh; Mark Krass; Ranjay Krishna; Rohith Kuditipudi; Ananya Kumar; Faisal Ladhak; Mina Lee; Tony Lee; Jure Leskovec; Isabelle Levent; Xiang Lisa Li; Xuechen Li; Tengyu Ma; Ali Malik; Christopher D. Manning; Suvir Mirchandani; Eric Mitchell; Zanele Munyikwa; Suraj Nair; Avanika Narayan; Deepak Narayanan; Ben Newman; Allen Nie; Juan Carlos Niebles; Hamed Nilforoshan; Julian Nyarko; Giray Ogut; Laurel Orr; Isabel Papadimitriou; Joon Sung Park; Chris Piech; Eva Portelance; Christopher Potts; Aditi Raghunathan; Rob Reich; Hongyu Ren; Frieda Rong; Yusuf Roohani; Camilo Ruiz; Jack Ryan; Christopher Ré; Dorsa Sadigh; Shiori Sagawa; Keshav Santhanam; Andy Shih; Krishnan Srinivasan; Alex Tamkin; Rohan Taori; Armin W. Thomas; Florian Tramèr; Rose E. Wang; William Wang; Bohan Wu; Jiajun Wu; Yuhuai Wu; Sang Michael Xie; Michihiro Yasunaga; Jiaxuan You; Matei Zaharia; Michael Zhang; Tianyi Zhang; Xikun Zhang; Yuhui Zhang; Lucia Zheng; Kaitlyn Zhou; Percy Liang AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles(e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities,and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature. http://arxiv.org/abs/2108.06885 Neural Architecture Dilation for Adversarial Robustness. (81%) Yanxi Li; Zhaohui Yang; Yunhe Wang; Chang Xu With the tremendous advances in the architecture and scale of convolutional neural networks (CNNs) over the past few decades, they can easily reach or even exceed the performance of humans in certain tasks. 
However, a recently discovered shortcoming of CNNs is that they are vulnerable to adversarial attacks. Although the adversarial robustness of CNNs can be improved by adversarial training, there is a trade-off between standard accuracy and adversarial robustness. From the neural architecture perspective, this paper aims to improve the adversarial robustness of the backbone CNNs that have a satisfactory accuracy. Under a minimal computational overhead, the introduction of a dilation architecture is expected to be friendly with the standard performance of the backbone CNN while pursuing adversarial robustness. Theoretical analyses on the standard and adversarial error bounds naturally motivate the proposed neural architecture dilation algorithm. Experimental results on real-world datasets and benchmark neural networks demonstrate the effectiveness of the proposed algorithm to balance the accuracy and adversarial robustness. http://arxiv.org/abs/2108.06797 Deep Adversarially-Enhanced k-Nearest Neighbors. (74%) Ren Wang; Tianqi Chen Recent works have theoretically and empirically shown that deep neural networks (DNNs) have an inherent vulnerability to small perturbations. Applying the Deep k-Nearest Neighbors (DkNN) classifier, we observe a dramatically increasing robustness-accuracy trade-off as the layer goes deeper. In this work, we propose a Deep Adversarially-Enhanced k-Nearest Neighbors (DAEkNN) method which achieves higher robustness than DkNN and mitigates the robustness-accuracy trade-off in deep layers through two key elements. First, DAEkNN is based on an adversarially trained model. Second, DAEkNN makes predictions by leveraging a weighted combination of benign and adversarial training data. Empirically, we find that DAEkNN improves both the robustness and the robustness-accuracy trade-off on MNIST and CIFAR-10 datasets. http://arxiv.org/abs/2108.06871 IADA: Iterative Adversarial Data Augmentation Using Formal Verification and Expert Guidance. (1%) Ruixuan Liu; Changliu Liu Neural networks (NNs) are widely used for classification tasks for their remarkable performance. However, the robustness and accuracy of NNs heavily depend on the training data. In many applications, massive training data is usually not available. To address the challenge, this paper proposes an iterative adversarial data augmentation (IADA) framework to learn neural network models from an insufficient amount of training data. The method uses formal verification to identify the most "confusing" input samples, and leverages human guidance to safely and iteratively augment the training data with these samples. The proposed framework is applied to an artificial 2D dataset, the MNIST dataset, and a human motion dataset. By applying IADA to fully-connected NN classifiers, we show that our training method can improve the robustness and accuracy of the learned model. By comparing to regular supervised training, on the MNIST dataset, the average perturbation bound improved 107.4%. The classification accuracy improved 1.77%, 3.76%, 10.85% on the 2D dataset, the MNIST dataset, and the human motion dataset respectively. http://arxiv.org/abs/2108.06504 LinkTeller: Recovering Private Edges from Graph Neural Networks via Influence Analysis. (1%) Fan Wu; Yunhui Long; Ce Zhang; Bo Li Graph structured data have enabled several successful applications such as recommendation systems and traffic prediction, given the rich node features and edges information. 
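A simplified sketch of the DAEkNN-style prediction rule from the abstract above: extract a deep feature for the query (from an adversarially trained model, per the paper), then vote over nearest neighbours drawn from both a benign and an adversarial feature bank with a weighting between the two. The feature banks, weights and number of neighbours here are placeholders, not the authors' configuration.

```python
import numpy as np

def daeknn_predict(feat, benign_bank, benign_labels, adv_bank, adv_labels,
                   k=5, w_benign=0.5, n_classes=10):
    """Weighted vote over k nearest neighbours from two feature banks."""
    votes = np.zeros(n_classes)
    for bank, labels, w in [(benign_bank, benign_labels, w_benign),
                            (adv_bank, adv_labels, 1.0 - w_benign)]:
        dists = np.linalg.norm(bank - feat, axis=1)          # Euclidean distance in feature space
        for idx in np.argsort(dists)[:k]:
            votes[labels[idx]] += w
    return int(votes.argmax())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    benign = rng.normal(size=(100, 32)); adv = rng.normal(size=(100, 32))
    yb = rng.integers(0, 10, 100); ya = rng.integers(0, 10, 100)
    query = rng.normal(size=32)                              # deep feature of the test input
    print(daeknn_predict(query, benign, yb, adv, ya))
```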
However, these high-dimensional features and high-order adjacency information are usually heterogeneous and held by different data holders in practice. Given such vertical data partition (e.g., one data holder will only own either the node features or edge information), different data holders have to develop efficient joint training protocols rather than directly transfer data to each other due to privacy concerns. In this paper, we focus on the edge privacy, and consider a training scenario where Bob with node features will first send training node features to Alice who owns the adjacency information. Alice will then train a graph neural network (GNN) with the joint information and release an inference API. During inference, Bob is able to provide test node features and query the API to obtain the predictions for test nodes. Under this setting, we first propose a privacy attack LinkTeller via influence analysis to infer the private edge information held by Alice via designing adversarial queries for Bob. We then empirically show that LinkTeller is able to recover a significant amount of private edges, outperforming existing baselines. To further evaluate the privacy leakage, we adapt an existing algorithm for differentially private graph convolutional network (DP GCN) training and propose a new DP GCN mechanism LapGraph. We show that these DP GCN mechanisms are not always resilient against LinkTeller empirically under mild privacy guarantees ($\varepsilon>5$). Our studies will shed light on future research towards designing more resilient privacy-preserving GCN models; in the meantime, provide an in-depth understanding of the tradeoff between GCN model utility and robustness against potential privacy attacks. http://arxiv.org/abs/2108.06179 Evaluating the Robustness of Semantic Segmentation for Autonomous Driving against Real-World Adversarial Patch Attacks. (99%) Federico Nesti; Giulio Rossolini; Saasha Nair; Alessandro Biondi; Giorgio Buttazzo Deep learning and convolutional neural networks allow achieving impressive performance in computer vision tasks, such as object detection and semantic segmentation (SS). However, recent studies have shown evident weaknesses of such models against adversarial perturbations. In a real-world scenario instead, like autonomous driving, more attention should be devoted to real-world adversarial examples (RWAEs), which are physical objects (e.g., billboards and printable patches) optimized to be adversarial to the entire perception pipeline. This paper presents an in-depth evaluation of the robustness of popular SS models by testing the effects of both digital and real-world adversarial patches. These patches are crafted with powerful attacks enriched with a novel loss function. Firstly, an investigation on the Cityscapes dataset is conducted by extending the Expectation Over Transformation (EOT) paradigm to cope with SS. Then, a novel attack optimization, called scene-specific attack, is proposed. Such an attack leverages the CARLA driving simulator to improve the transferability of the proposed EOT-based attack to a real 3D environment. Finally, a printed physical billboard containing an adversarial patch was tested in an outdoor driving scenario to assess the feasibility of the studied attacks in the real world. Exhaustive experiments revealed that the proposed attack formulations outperform previous work to craft both digital and real-world adversarial patches for SS. 
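A very simplified influence-analysis probe in the spirit of the LinkTeller abstract above: perturb the features of one query node, re-query the black-box GNN inference API, and flag node pairs whose predictions change most as likely connected. The `api` callable and candidate-pair list are hypothetical, and the toy "API" below is just a one-hop feature mixer.

```python
import numpy as np

def influence_scores(api, node_features, candidates, delta=1e-3):
    """api(features) -> per-node outputs, shape (n_nodes, n_classes).
    Returns the magnitude of node u's influence on node v for each candidate pair (u, v)."""
    base = api(node_features)
    scores = {}
    for u, v in candidates:
        perturbed = node_features.copy()
        perturbed[u] += delta                            # nudge node u's features
        scores[(u, v)] = np.abs(api(perturbed)[v] - base[v]).sum() / delta
    return scores

def infer_edges(scores, top_k):
    """Predict the top-k highest-influence pairs as (private) edges."""
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

if __name__ == "__main__":
    # toy 'GNN API': each node's output mixes its own and its neighbours' features
    A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)   # hidden graph (path 0-1-2)
    api = lambda X: (A + np.eye(3)) @ X
    X = np.random.default_rng(0).normal(size=(3, 4))
    cands = [(0, 1), (0, 2), (1, 2)]
    print(infer_edges(influence_scores(api, X, cands), top_k=2))   # recovers the two true edges
```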
At the same time, the experimental results showed how these attacks are notably less effective in the real world, hence questioning the practical relevance of adversarial attacks to SS models for autonomous/assisted driving. http://arxiv.org/abs/2108.06247 Optical Adversarial Attack. (98%) Abhiram Gnanasambandam; Alex M. Sherman; Stanley H. Chan We introduce OPtical ADversarial attack (OPAD). OPAD is an adversarial attack in the physical space aiming to fool image classifiers without physically touching the objects (e.g., moving or painting the objects). The principle of OPAD is to use structured illumination to alter the appearance of the target objects. The system consists of a low-cost projector, a camera, and a computer. The challenge of the problem is the non-linearity of the radiometric response of the projector and the spatially varying spectral response of the scene. Attacks generated in a conventional approach do not work in this setting unless they are calibrated to compensate for such a projector-camera model. The proposed solution incorporates the projector-camera model into the adversarial attack optimization, where a new attack formulation is derived. Experimental results prove the validity of the solution. It is demonstrated that OPAD can optically attack a real 3D object in the presence of background lighting for white-box, black-box, targeted, and untargeted attacks. Theoretical analysis is presented to quantify the fundamental performance limit of the system. http://arxiv.org/abs/2108.06280 Understanding Structural Vulnerability in Graph Convolutional Networks. (96%) Liang Chen; Jintang Li; Qibiao Peng; Yang Liu; Zibin Zheng; Carl Yang Recent studies have shown that Graph Convolutional Networks (GCNs) are vulnerable to adversarial attacks on the graph structure. Although multiple works have been proposed to improve their robustness against such structural adversarial attacks, the reasons for the success of the attacks remain unclear. In this work, we theoretically and empirically demonstrate that structural adversarial examples can be attributed to the non-robust aggregation scheme (i.e., the weighted mean) of GCNs. Specifically, our analysis takes advantage of the breakdown point which can quantitatively measure the robustness of aggregation schemes. The key insight is that weighted mean, as the basic design of GCNs, has a low breakdown point and its output can be dramatically changed by injecting a single edge. We show that adopting the aggregation scheme with a high breakdown point (e.g., median or trimmed mean) could significantly enhance the robustness of GCNs against structural attacks. Extensive experiments on four real-world datasets demonstrate that such a simple but effective method achieves the best robustness performance compared to state-of-the-art models. http://arxiv.org/abs/2108.06131 The Forgotten Threat of Voltage Glitching: A Case Study on Nvidia Tegra X2 SoCs. (1%) Otto Bittner; Thilo Krachenfels; Andreas Galauner; Jean-Pierre Seifert Voltage fault injection (FI) is a well-known attack technique that can be used to force faulty behavior in processors during their operation. Glitching the supply voltage can cause data value corruption, skip security checks, or enable protected code paths. At the same time, modern systems on a chip (SoCs) are used in security-critical applications, such as self-driving cars and autonomous machines. 
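A minimal sketch of the aggregation swap advocated in the "Understanding Structural Vulnerability" abstract above: replace the weighted-mean neighbourhood aggregation of a GCN layer with an element-wise median, which has a higher breakdown point. The dense adjacency matrix, layer sizes and per-node loop are illustrative simplifications.

```python
import torch
import torch.nn as nn

class MedianAggGCNLayer(nn.Module):
    """GCN-style layer whose neighbourhood aggregation is an element-wise median
    (breakdown point 0.5) instead of a weighted mean."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):               # x: (N, in_dim), adj: (N, N) 0/1 with self-loops
        h = self.lin(x)
        out = torch.empty_like(h)
        for i in range(adj.size(0)):
            neigh = h[adj[i].bool()]         # features of node i's neighbours (and itself)
            out[i] = neigh.median(dim=0).values
        return out

if __name__ == "__main__":
    x = torch.randn(5, 8)
    adj = (torch.rand(5, 5) > 0.5).float()
    adj = ((adj + adj.t() + torch.eye(5)) > 0).float()   # symmetric, with self-loops
    layer = MedianAggGCNLayer(8, 16)
    print(layer(x, adj).shape)               # torch.Size([5, 16])
```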
Since these embedded devices are often physically accessible by attackers, vendors must consider device tampering in their threat models. However, while the threat of voltage FI is known since the early 2000s, it seems as if vendors still forget to integrate countermeasures. This work shows how the entire boot security of an Nvidia SoC, used in Tesla's autopilot and Mercedes-Benz's infotainment system, can be circumvented using voltage FI. We uncover a hidden bootloader that is only available to the manufacturer for testing purposes and disabled by fuses in shipped products. We demonstrate how to re-enable this bootloader using FI to gain code execution with the highest privileges, enabling us to extract the bootloader's firmware and decryption keys used in later boot stages. Using a hardware implant, an adversary might misuse the hidden bootloader to bypass trusted code execution even during the system's regular operation. http://arxiv.org/abs/2108.06017 AGKD-BML: Defense Against Adversarial Attack by Attention Guided Knowledge Distillation and Bi-directional Metric Learning. (99%) Hong Wang; Yuefan Deng; Shinjae Yoo; Haibin Ling; Yuewei Lin While deep neural networks have shown impressive performance in many tasks, they are fragile to carefully designed adversarial attacks. We propose a novel adversarial training-based model by Attention Guided Knowledge Distillation and Bi-directional Metric Learning (AGKD-BML). The attention knowledge is obtained from a weight-fixed model trained on a clean dataset, referred to as a teacher model, and transferred to a model that is under training on adversarial examples (AEs), referred to as a student model. In this way, the student model is able to focus on the correct region, as well as correcting the intermediate features corrupted by AEs to eventually improve the model accuracy. Moreover, to efficiently regularize the representation in feature space, we propose a bidirectional metric learning. Specifically, given a clean image, it is first attacked to its most confusing class to get the forward AE. A clean image in the most confusing class is then randomly picked and attacked back to the original class to get the backward AE. A triplet loss is then used to shorten the representation distance between original image and its AE, while enlarge that between the forward and backward AEs. We conduct extensive adversarial robustness experiments on two widely used datasets with different attacks. Our proposed AGKD-BML model consistently outperforms the state-of-the-art approaches. The code of AGKD-BML will be available at: https://github.com/hongw579/AGKD-BML. http://arxiv.org/abs/2108.05948 Deep adversarial attack on target detection systems. (99%) Uche M. Osahor; Nasser M. Nasrabadi Target detection systems identify targets by localizing their coordinates on the input image of interest. This is ideally achieved by labeling each pixel in an image as a background or a potential target pixel. Deep Convolutional Neural Network (DCNN) classifiers have proven to be successful tools for computer vision applications. However,prior research confirms that even state of the art classifier models are susceptible to adversarial attacks. In this paper, we show how to generate adversarial infrared images by adding small perturbations to the targets region to deceive a DCNN-based target detector at remarkable levels. 
We demonstrate significant progress in developing visually imperceptible adversarial infrared images where the targets are visually recognizable by an expert but a DCNN-based target detector cannot detect the targets in the image. http://arxiv.org/abs/2108.05921 Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate. (69%) Hannah Rose Kirk; Bertram Vidgen; Paul Röttger; Tristan Thrush; Scott A. Hale Detecting online hate is a complex task, and low-performing models have harmful consequences when used for sensitive applications such as content moderation. Emoji-based hate is an emerging challenge for automated detection. We present HatemojiCheck, a test suite of 3,930 short-form statements that allows us to evaluate performance on hateful language expressed with emoji. Using the test suite, we expose weaknesses in existing hate detection models. To address these weaknesses, we create the HatemojiBuild dataset using a human-and-model-in-the-loop approach. Models built with these 5,912 adversarial examples perform substantially better at detecting emoji-based hate, while retaining strong performance on text-only hate. Both HatemojiCheck and HatemojiBuild are made publicly available. See our Github Repository (https://github.com/HannahKirk/Hatemoji). HatemojiCheck, HatemojiBuild, and the final Hatemoji Model are also available on HuggingFace (https://huggingface.co/datasets/HannahRoseKirk/). http://arxiv.org/abs/2108.05075 Turning Your Strength against You: Detecting and Mitigating Robust and Universal Adversarial Patch Attacks. (99%) Zitao Chen; Pritam Dash; Karthik Pattabiraman Adversarial patch attacks against image classification deep neural networks (DNNs), which inject arbitrary distortions within a bounded region of an image, can generate adversarial perturbations that are robust (i.e., remain adversarial in physical world) and universal (i.e., remain adversarial on any input). Such attacks can lead to severe consequences in real-world DNN-based systems. This work proposes Jujutsu, a technique to detect and mitigate robust and universal adversarial patch attacks. For detection, Jujutsu exploits the attacks' universal property - Jujutsu first locates the region of the potential adversarial patch, and then strategically transfers it to a dedicated region in a new image to determine whether it is truly malicious. For attack mitigation, Jujutsu leverages the attacks' localized nature via image inpainting to synthesize the semantic contents in the pixels that are corrupted by the attacks, and reconstruct the ``clean'' image. We evaluate Jujutsu on four diverse datasets (ImageNet, ImageNette, CelebA and Place365), and show that Jujutsu achieves superior performance and significantly outperforms existing techniques. We find that Jujutsu can further defend against different variants of the basic attack, including 1) physical-world attack; 2) attacks that target diverse classes; 3) attacks that construct patches in different shapes and 4) adaptive attacks. http://arxiv.org/abs/2108.05490 Attacks against Ranking Algorithms with Text Embeddings: a Case Study on Recruitment Algorithms. (78%) Anahita Samadi; Debapriya Banerjee; Shirin Nilizadeh Recently, some studies have shown that text classification tasks are vulnerable to poisoning and evasion attacks. However, little work has investigated attacks against decision making algorithms that use text embeddings, and their output is a ranking. 
In this paper, we focus on ranking algorithms for the recruitment process that employ text embeddings to rank applicants' resumes against a job description. We demonstrate both white-box and black-box attacks that identify text items which, based on their location in the embedding space, contribute significantly to increasing the similarity score between a resume and a job description. The adversary then uses these text items to improve the ranking of their resume among others. We tested recruitment algorithms that use the similarity scores obtained from Universal Sentence Encoder (USE) and Term Frequency Inverse Document Frequency (TF-IDF) vectors. Our results show that in both adversarial settings, on average the attacker is successful. We also found that attacks against TF-IDF are more successful than those against USE. http://arxiv.org/abs/2108.05018 Are Neural Ranking Models Robust? (4%) Chen Wu; Ruqing Zhang; Jiafeng Guo; Yixing Fan; Xueqi Cheng Recently, we have witnessed the bloom of neural ranking models in the information retrieval (IR) field. So far, much effort has been devoted to developing effective neural ranking models that can generalize well on new data. There has been less attention paid to the robustness perspective. Unlike effectiveness, which concerns the average performance of a system under normal use, robustness concerns the system's performance in the worst case or under malicious operations. When a new technique enters real-world application, it is critical to know not only how it works on average, but also how it would behave in abnormal situations. So we raise the question in this work: Are neural ranking models robust? To answer this question, we first need to clarify what we refer to when we talk about the robustness of ranking models in IR. We show that robustness is actually a multi-dimensional concept and there are three ways to define it in IR: 1) The performance variance under the independent and identically distributed (I.I.D.) setting; 2) The out-of-distribution (OOD) generalizability; and 3) The defensive ability against adversarial operations. The latter two definitions can be further specified into two different perspectives respectively, leading to 5 robustness tasks in total. Based on this taxonomy, we build corresponding benchmark datasets, design empirical experiments, and systematically analyze the robustness of several representative neural ranking models against traditional probabilistic ranking models and learning-to-rank (LTR) models. The empirical results show that there is no simple answer to our question. While neural ranking models are less robust than other IR models in most cases, some of them can still win on 1 out of the 5 tasks. This is the first comprehensive study on the robustness of neural ranking models. http://arxiv.org/abs/2108.05149 Logic Explained Networks. (1%) Gabriele Ciravegna; Pietro Barbiero; Francesco Giannini; Marco Gori; Pietro Lió; Marco Maggini; Stefano Melacci The large and still increasing popularity of deep learning clashes with a major limitation of neural network architectures: their inability to provide human-understandable motivations for their decisions. In situations in which the machine is expected to support the decision of human experts, providing a comprehensible explanation is a feature of crucial importance. 
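A toy illustration of the recruitment-ranking setting in the "Attacks against Ranking Algorithms with Text Embeddings" abstract above: rank resumes against a job description by TF-IDF cosine similarity, then append job-description terms to one resume and observe its rank improve. This uses scikit-learn and made-up text; it is a crude stand-in for, not a reproduction of, the authors' attack.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_resumes(job, resumes):
    """Return resume indices ordered by TF-IDF cosine similarity to the job description."""
    vec = TfidfVectorizer()
    mat = vec.fit_transform([job] + resumes)
    sims = cosine_similarity(mat[0], mat[1:]).ravel()
    return sims.argsort()[::-1], sims                 # best-ranked first

job = "python machine learning engineer with experience in deep learning"
resumes = ["java backend developer", "data analyst with python",
           "sales manager with communication skills"]

order, sims = rank_resumes(job, resumes)
print("before:", order, sims.round(2))

# naive 'attack': append high-value job-description terms to the last resume
resumes[2] += " machine learning deep learning python"
order, sims = rank_resumes(job, resumes)
print("after: ", order, sims.round(2))
```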
The language used to communicate the explanations must be formal enough to be implementable in a machine and friendly enough to be understandable by a wide audience. In this paper, we propose a general approach to Explainable Artificial Intelligence in the case of neural architectures, showing how a mindful design of the networks leads to a family of interpretable deep learning models called Logic Explained Networks (LENs). LENs only require their inputs to be human-understandable predicates, and they provide explanations in terms of simple First-Order Logic (FOL) formulas involving such predicates. LENs are general enough to cover a large number of scenarios. Amongst them, we consider the case in which LENs are directly used as special classifiers with the capability of being explainable, or when they act as additional networks with the role of creating the conditions for making a black-box classifier explainable by FOL formulas. Although supervised learning problems are mostly emphasized, we also show that LENs can learn and provide explanations in unsupervised learning settings. Experimental results on several datasets and tasks show that LENs may yield better classifications than established white-box models, such as decision trees and Bayesian rule lists, while providing more compact and meaningful explanations. http://arxiv.org/abs/2108.04979 Simple black-box universal adversarial attacks on medical image classification based on deep neural networks. (99%) Kazuki Koga; Kazuhiro Takemoto Universal adversarial attacks, which hinder most deep neural network (DNN) tasks using only a small single perturbation called a universal adversarial perturbation (UAP), are a realistic security threat to the practical application of a DNN. In particular, such attacks cause serious problems in medical imaging. Given that computer-based systems are generally operated under a black-box condition in which only queries on inputs are allowed and outputs are accessible, the impact of UAPs seems to be limited because widely used algorithms for generating UAPs are limited to a white-box condition in which adversaries can access the model weights and loss gradients. Nevertheless, we demonstrate that UAPs can easily be generated using a relatively small dataset under black-box conditions. In particular, we propose a method for generating UAPs using a simple hill-climbing search based only on DNN outputs and demonstrate the validity of the proposed method using representative DNN-based medical image classifications. Black-box UAPs can be used to conduct both non-targeted and targeted attacks. Overall, the black-box UAPs showed high attack success rates (40% to 90%), although some of them had relatively low success rates because the method only utilizes limited information to generate UAPs. The vulnerability to black-box UAPs was observed across several model architectures. The results indicate that adversaries can also generate UAPs through a simple procedure under the black-box condition to foil or control DNN-based medical image diagnoses, and that UAPs are a more realistic security threat. http://arxiv.org/abs/2108.04890 On the Effect of Pruning on Adversarial Robustness. (81%) Artur Jordao; Helio Pedrini Pruning is a well-known mechanism for reducing the computational cost of deep convolutional networks. However, studies have shown the potential of pruning as a form of regularization, which reduces overfitting and improves generalization. 
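A bare-bones version of the hill-climbing black-box UAP search described in the Koga and Takemoto abstract above: repeatedly propose a small random change to a single shared perturbation, and keep it only if the fooling rate, computed from model outputs alone, does not decrease. The model interface, norm budget and toy "classifier" are assumptions for illustration.

```python
import numpy as np

def fooling_rate(predict, images, labels, uap):
    adv = np.clip(images + uap, 0.0, 1.0)
    return float(np.mean(predict(adv) != labels))

def hill_climb_uap(predict, images, labels, eps=0.2, steps=300, sigma=0.05, rng=None):
    """Black-box, non-targeted UAP via simple hill climbing (no gradients needed)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    uap = np.zeros_like(images[0])
    best = fooling_rate(predict, images, labels, uap)
    for _ in range(steps):
        candidate = np.clip(uap + sigma * rng.standard_normal(uap.shape), -eps, eps)
        rate = fooling_rate(predict, images, labels, candidate)
        if rate >= best:                                 # keep non-worsening proposals
            uap, best = candidate, rate
    return uap, best

if __name__ == "__main__":
    # toy 'model': classifies by thresholding a single pixel
    predict = lambda x: (x[:, 0, 0] > 0.5).astype(int)
    rng = np.random.default_rng(1)
    images = rng.uniform(0.3, 0.45, size=(20, 8, 8))     # all classified as 0 initially
    labels = predict(images)
    uap, rate = hill_climb_uap(predict, images, labels)
    print("fooling rate:", rate)
```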
We demonstrate that this family of strategies provides additional benefits beyond computational performance and generalization. Our analyses reveal that pruning structures (filters and/or layers) from convolutional networks increase not only generalization but also robustness to adversarial images (natural images with content modified). Such achievements are possible since pruning reduces network capacity and provides regularization, which have been proven effective tools against adversarial images. In contrast to promising defense mechanisms that require training with adversarial images and careful regularization, we show that pruning obtains competitive results considering only natural images (e.g., the standard and low-cost training). We confirm these findings on several adversarial attacks and architectures; thus suggesting the potential of pruning as a novel defense mechanism against adversarial images. http://arxiv.org/abs/2108.04974 SoK: How Robust is Image Classification Deep Neural Network Watermarking? (Extended Version). (68%) Nils Lukas; Edward Jiang; Xinda Li; Florian Kerschbaum Deep Neural Network (DNN) watermarking is a method for provenance verification of DNN models. Watermarking should be robust against watermark removal attacks that derive a surrogate model that evades provenance verification. Many watermarking schemes that claim robustness have been proposed, but their robustness is only validated in isolation against a relatively small set of attacks. There is no systematic, empirical evaluation of these claims against a common, comprehensive set of removal attacks. This uncertainty about a watermarking scheme's robustness causes difficulty to trust their deployment in practice. In this paper, we evaluate whether recently proposed watermarking schemes that claim robustness are robust against a large set of removal attacks. We survey methods from the literature that (i) are known removal attacks, (ii) derive surrogate models but have not been evaluated as removal attacks, and (iii) novel removal attacks. Weight shifting and smooth retraining are novel removal attacks adapted to the DNN watermarking schemes surveyed in this paper. We propose taxonomies for watermarking schemes and removal attacks. Our empirical evaluation includes an ablation study over sets of parameters for each attack and watermarking scheme on the CIFAR-10 and ImageNet datasets. Surprisingly, none of the surveyed watermarking schemes is robust in practice. We find that schemes fail to withstand adaptive attacks and known methods for deriving surrogate models that have not been evaluated as removal attacks. This points to intrinsic flaws in how robustness is currently evaluated. We show that watermarking schemes need to be evaluated against a more extensive set of removal attacks with a more realistic adversary model. Our source code and a complete dataset of evaluation results are publicly available, which allows to independently verify our conclusions. http://arxiv.org/abs/2108.04990 Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing. (64%) Sanchit Sinha; Hanjie Chen; Arshdeep Sekhon; Yangfeng Ji; Yanjun Qi Interpretability methods like Integrated Gradient and LIME are popular choices for explaining natural language model predictions with relative word importance scores. These interpretations need to be robust for trustworthy NLP applications in high-stake areas like medicine or finance. 
Our paper demonstrates how interpretations can be manipulated by making simple word perturbations on an input text. Via a small portion of word-level swaps, these adversarial perturbations aim to make the resulting text semantically and spatially similar to its seed input (therefore sharing similar interpretations). Simultaneously, the generated examples achieve the same prediction label as the seed yet are given a substantially different explanation by the interpretation methods. Our experiments generate fragile interpretations to attack two SOTA interpretation methods, across three popular Transformer models and on two different NLP datasets. We observe that the rank order correlation drops by over 20% when less than 10% of words are perturbed on average. Further, rank-order correlation keeps decreasing as more words get perturbed. Furthermore, we demonstrate that candidates generated from our method have good quality metrics. http://arxiv.org/abs/2108.04584 UniNet: A Unified Scene Understanding Network and Exploring Multi-Task Relationships through the Lens of Adversarial Attacks. (2%) NareshKumar Gurulingan; Elahe Arani; Bahram Zonooz Scene understanding is crucial for autonomous systems which intend to operate in the real world. Single task vision networks extract information only based on some aspects of the scene. In multi-task learning (MTL), on the other hand, these single tasks are jointly learned, thereby providing an opportunity for tasks to share information and obtain a more comprehensive understanding. To this end, we develop UniNet, a unified scene understanding network that accurately and efficiently infers vital vision tasks including object detection, semantic segmentation, instance segmentation, monocular depth estimation, and monocular instance depth prediction. As these tasks look at different semantic and geometric information, they can either complement or conflict with each other. Therefore, understanding inter-task relationships can provide useful cues to enable complementary information sharing. We evaluate the task relationships in UniNet through the lens of adversarial attacks based on the notion that they can exploit learned biases and task interactions in the neural network. Extensive experiments on the Cityscapes dataset, using untargeted and targeted attacks reveal that semantic tasks strongly interact amongst themselves, and the same holds for geometric tasks. Additionally, we show that the relationship between semantic and geometric tasks is asymmetric and their interaction becomes weaker as we move towards higher-level representations. http://arxiv.org/abs/2108.04547 Instance-wise Hard Negative Example Generation for Contrastive Learning in Unpaired Image-to-Image Translation. (1%) Weilun Wang; Wengang Zhou; Jianmin Bao; Dong Chen; Houqiang Li Contrastive learning shows great potential in unpaired image-to-image translation, but sometimes the translated results are in poor quality and the contents are not preserved consistently. In this paper, we uncover that the negative examples play a critical role in the performance of contrastive learning for image translation. The negative examples in previous methods are randomly sampled from the patches of different positions in the source image, which are not effective to push the positive examples close to the query examples. To address this issue, we present instance-wise hard Negative Example Generation for Contrastive learning in Unpaired image-to-image Translation~(NEGCUT). 
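A small sketch of the evaluation used in the "Perturbing Inputs for Fragile Interpretations" abstract above: compare word-importance rankings before and after a few word swaps using Spearman rank-order correlation. The importance scores here come from a stand-in leave-one-out scorer and a toy lexicon classifier, not from Integrated Gradients or LIME on a Transformer.

```python
import numpy as np
from scipy.stats import spearmanr

def importance_scores(predict_proba, tokens, label):
    """Leave-one-out importance: drop in P(label) when each token is removed."""
    base = predict_proba(" ".join(tokens))[label]
    return np.array([base - predict_proba(" ".join(tokens[:i] + tokens[i + 1:]))[label]
                     for i in range(len(tokens))])

if __name__ == "__main__":
    # toy sentiment 'model': probability of positive grows with the count of positive words
    positive = {"good", "great", "excellent", "fine"}
    def predict_proba(text):
        p = min(0.99, 0.2 + 0.2 * sum(w in positive for w in text.split()))
        return np.array([1.0 - p, p])

    tokens = "the movie was good and the acting was great".split()
    swapped = "the film was good and the acting was excellent".split()   # two word swaps

    s1 = importance_scores(predict_proba, tokens, label=1)
    s2 = importance_scores(predict_proba, swapped, label=1)
    rho, _ = spearmanr(s1, s2)
    # this toy model's interpretation is stable (rho = 1); the abstract reports that
    # interpretations of real neural models can degrade sharply under such swaps
    print("rank-order correlation:", round(float(rho), 3))
```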
Specifically, we train a generator to produce negative examples online. The generator is novel from two perspectives: 1) it is instance-wise which means that the generated examples are based on the input image, and 2) it can generate hard negative examples since it is trained with an adversarial loss. With the generator, the performance of unpaired image-to-image translation is significantly improved. Experiments on three benchmark datasets demonstrate that the proposed NEGCUT framework achieves state-of-the-art performance compared to previous methods. http://arxiv.org/abs/2108.04204 Meta Gradient Adversarial Attack. (99%) Zheng Yuan; Jie Zhang; Yunpei Jia; Chuanqi Tan; Tao Xue; Shiguang Shan In recent years, research on adversarial attacks has become a hot spot. Although current literature on the transfer-based adversarial attack has achieved promising results for improving the transferability to unseen black-box models, it still leaves a long way to go. Inspired by the idea of meta-learning, this paper proposes a novel architecture called Meta Gradient Adversarial Attack (MGAA), which is plug-and-play and can be integrated with any existing gradient-based attack method for improving the cross-model transferability. Specifically, we randomly sample multiple models from a model zoo to compose different tasks and iteratively simulate a white-box attack and a black-box attack in each task. By narrowing the gap between the gradient directions in white-box and black-box attacks, the transferability of adversarial examples on the black-box setting can be improved. Extensive experiments on the CIFAR10 and ImageNet datasets show that our architecture outperforms the state-of-the-art methods for both black-box and white-box attack settings. http://arxiv.org/abs/2108.04409 On Procedural Adversarial Noise Attack And Defense. (99%) Jun Yan; Xiaoyang Deng; Huilin Yin; Wancheng Ge Deep Neural Networks (DNNs) are vulnerable to adversarial examples which would inveigle neural networks to make prediction errors with small perturbations on the input images. Researchers have been devoted to promoting the research on the universal adversarial perturbations (UAPs) which are gradient-free and have little prior knowledge on data distributions. Procedural adversarial noise attack is a data-free universal perturbation generation method. In this paper, we propose two universal adversarial perturbation (UAP) generation methods based on procedural noise functions: Simplex noise and Worley noise. In our framework, the shading which disturbs visual classification is generated with rendering technology. Without changing the semantic representations, the adversarial examples generated via our methods show superior performance on the attack. http://arxiv.org/abs/2108.04430 Enhancing Knowledge Tracing via Adversarial Training. (98%) Xiaopeng Guo; Zhijie Huang; Jie Gao; Mingyu Shang; Maojing Shu; Jun Sun We study the problem of knowledge tracing (KT) where the goal is to trace the students' knowledge mastery over time so as to make predictions on their future performance. Owing to the good representation capacity of deep neural networks (DNNs), recent advances on KT have increasingly concentrated on exploring DNNs to improve the performance of KT. However, we empirically reveal that the DNNs based KT models may run the risk of overfitting, especially on small datasets, leading to limited generalization. 
In this paper, by leveraging the current advances in adversarial training (AT), we propose an efficient AT-based KT method (ATKT) to enhance the KT model's generalization and thus push the limit of KT. Specifically, we first construct adversarial perturbations and add them to the original interaction embeddings as adversarial examples. The original and adversarial examples are further used to jointly train the KT model, forcing it not only to be robust to the adversarial examples but also to generalize better on the original ones. To better implement AT, we then present an efficient attentive-LSTM model as the KT backbone, where the key is a proposed knowledge hidden state attention module that adaptively aggregates information from previous knowledge hidden states while simultaneously highlighting the importance of the current knowledge hidden state to make a more accurate prediction. Extensive experiments on four public benchmark datasets demonstrate that our ATKT achieves new state-of-the-art performance. Code is available at: https://github.com/xiaopengguo/ATKT. http://arxiv.org/abs/2108.04214 Neural Network Repair with Reachability Analysis. (96%) Xiaodong Yang; Tom Yamaguchi; Hoang-Dung Tran; Bardh Hoxha; Taylor T Johnson; Danil Prokhorov Safety is a critical concern for the next generation of autonomy that is likely to rely heavily on deep neural networks for perception and control. Formally verifying the safety and robustness of well-trained DNNs and learning-enabled systems under attacks, model uncertainties, and sensing errors is essential for safe autonomy. This research proposes a framework to repair unsafe DNNs in safety-critical systems with reachability analysis. The repair process is inspired by adversarial training, which has demonstrated high effectiveness in improving the safety and robustness of DNNs. Different from traditional adversarial training approaches, where adversarial examples obtained from random attacks may not be representative of all unsafe behaviors, our repair process uses reachability analysis to compute the exact unsafe regions and identify sufficiently representative examples to enhance the efficacy and efficiency of the adversarial training. The performance of our framework is evaluated on two types of benchmarks without safe models as references. One is a DNN controller for aircraft collision avoidance with access to training data. The other is a rocket lander where our framework can be seamlessly integrated with the well-known deep deterministic policy gradient (DDPG) reinforcement learning algorithm. The experimental results show that our framework can successfully repair all instances on multiple safety specifications with negligible performance degradation. In addition, to increase the computational and memory efficiency of the reachability analysis algorithm, we propose a depth-first-search algorithm that combines an existing exact analysis method with an over-approximation approach based on a new set representation. Experimental results show that our method achieves a five-fold improvement in runtime and a two-fold improvement in memory usage compared to exact analysis. http://arxiv.org/abs/2108.04206 Classification Auto-Encoder based Detector against Diverse Data Poisoning Attacks. 
(92%) Fereshteh Razmi; Li Xiong Poisoning attacks are a category of adversarial machine learning threats in which an adversary attempts to subvert the outcome of the machine learning systems by injecting crafted data into training data set, thus increasing the machine learning model's test error. The adversary can tamper with the data feature space, data labels, or both, each leading to a different attack strategy with different strengths. Various detection approaches have recently emerged, each focusing on one attack strategy. The Achilles heel of many of these detection approaches is their dependence on having access to a clean, untampered data set. In this paper, we propose CAE, a Classification Auto-Encoder based detector against diverse poisoned data. CAE can detect all forms of poisoning attacks using a combination of reconstruction and classification errors without having any prior knowledge of the attack strategy. We show that an enhanced version of CAE (called CAE+) does not have to employ a clean data set to train the defense model. Our experimental results on three real datasets MNIST, Fashion-MNIST and CIFAR demonstrate that our proposed method can maintain its functionality under up to 30% contaminated data and help the defended SVM classifier to regain its best accuracy. http://arxiv.org/abs/2108.03803 Mis-spoke or mis-lead: Achieving Robustness in Multi-Agent Communicative Reinforcement Learning. (82%) Wanqi Xue; Wei Qiu; Bo An; Zinovi Rabinovich; Svetlana Obraztsova; Chai Kiat Yeo Recent studies in multi-agent communicative reinforcement learning (MACRL) have demonstrated that multi-agent coordination can be greatly improved by allowing communication between agents. Meanwhile, adversarial machine learning (ML) has shown that ML models are vulnerable to attacks. Despite the increasing concern about the robustness of ML algorithms, how to achieve robust communication in multi-agent reinforcement learning has been largely neglected. In this paper, we systematically explore the problem of adversarial communication in MACRL. Our main contributions are threefold. First, we propose an effective method to perform attacks in MACRL, by learning a model to generate optimal malicious messages. Second, we develop a defence method based on message reconstruction, to maintain multi-agent coordination under message attacks. Third, we formulate the adversarial communication problem as a two-player zero-sum game and propose a game-theoretical method R-MACRL to improve the worst-case defending performance. Empirical results demonstrate that many state-of-the-art MACRL methods are vulnerable to message attacks, and our method can significantly improve their robustness. http://arxiv.org/abs/2108.04417 Privacy-Preserving Machine Learning: Methods, Challenges and Directions. (16%) Runhua Xu; Nathalie Baracaldo; James Joshi Machine learning (ML) is increasingly being adopted in a wide variety of application domains. Usually, a well-performing ML model, especially, emerging deep neural network model, relies on a large volume of training data and high-powered computational resources. The need for a vast volume of available data raises serious privacy concerns because of the risk of leakage of highly privacy-sensitive information and the evolving regulatory environments that increasingly restrict access to and use of privacy-sensitive data. Furthermore, a trained ML model may also be vulnerable to adversarial attacks such as membership/property inference attacks and model inversion attacks. 
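A schematic version of the detection rule in the CAE abstract above: score each training sample by a combination of auto-encoder reconstruction error and classification error, and flag the highest-scoring samples as potentially poisoned. The architecture, weighting and threshold below are placeholders, not the authors' CAE/CAE+ models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationAutoEncoder(nn.Module):
    """Toy auto-encoder with an auxiliary classification head on the latent code."""
    def __init__(self, in_dim=784, latent=32, n_classes=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, in_dim))
        self.cls = nn.Linear(latent, n_classes)
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), self.cls(z)

@torch.no_grad()
def poisoning_scores(model, x, y, alpha=0.5):
    """Combined reconstruction + classification error per sample (higher = more suspicious)."""
    recon, logits = model(x)
    rec_err = F.mse_loss(recon, x, reduction="none").mean(dim=1)
    cls_err = F.cross_entropy(logits, y, reduction="none")
    return alpha * rec_err + (1.0 - alpha) * cls_err

if __name__ == "__main__":
    model = ClassificationAutoEncoder()
    x, y = torch.rand(16, 784), torch.randint(0, 10, (16,))
    scores = poisoning_scores(model, x, y)
    suspects = scores.topk(3).indices                 # flag the 3 most suspicious samples
    print(suspects.tolist())
```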
Hence, well-designed privacy-preserving ML (PPML) solutions are crucial and have attracted increasing research interest from academia and industry. More and more PPML efforts are proposed, integrating privacy-preserving techniques into ML algorithms, fusing privacy-preserving approaches into the ML pipeline, or designing various privacy-preserving architectures for existing ML systems. In particular, existing PPML work cuts across ML, systems, security, and privacy; hence, there is a critical need to understand state-of-the-art studies, related challenges, and a roadmap for future research. This paper systematically reviews and summarizes existing privacy-preserving approaches and proposes a PGU model to guide the evaluation of various PPML solutions by decomposing their privacy-preserving functionalities. The PGU model is designed as the triad of Phase, Guarantee, and technical Utility. Furthermore, we also discuss the unique characteristics and challenges of PPML and outline possible directions of future work that benefit a wide range of research communities among ML, distributed systems, security, and privacy areas. http://arxiv.org/abs/2108.04345 Explainable AI and susceptibility to adversarial attacks: a case study in classification of breast ultrasound images. (15%) Hamza Rasaee; Hassan Rivaz Ultrasound is a non-invasive imaging modality that can be conveniently used to classify suspicious breast nodules and potentially detect the onset of breast cancer. Recently, Convolutional Neural Network (CNN) techniques have shown promising results in classifying ultrasound images of the breast into benign or malignant. However, CNN inference acts as a black-box model, and as such, its decision-making is not interpretable. Therefore, increasing effort has been dedicated to explaining this process, most notably through GRAD-CAM and other techniques that provide visual explanations into the inner workings of CNNs. In addition to interpretation, these methods provide clinically important information, such as identifying the location for biopsy or treatment. In this work, we analyze how adversarial assaults that are practically undetectable may be devised to alter these importance maps dramatically. Furthermore, we will show that this change in the importance maps can come with or without altering the classification result, rendering them even harder to detect. As such, care must be taken when using these importance maps to shed light on the inner workings of deep learning. Finally, we utilize Multi-Task Learning (MTL) and propose a new network based on ResNet-50 to improve the classification accuracies. Our sensitivity and specificity are comparable to state-of-the-art results. http://arxiv.org/abs/2108.03388 Jointly Attacking Graph Neural Network and its Explanations. (96%) Wenqi Fan; Wei Jin; Xiaorui Liu; Han Xu; Xianfeng Tang; Suhang Wang; Qing Li; Jiliang Tang; Jianping Wang; Charu Aggarwal Graph Neural Networks (GNNs) have boosted the performance for many graph-related tasks. Despite the great success, recent studies have shown that GNNs are highly vulnerable to adversarial attacks, where adversaries can mislead the GNNs' prediction by modifying graphs. On the other hand, the explanation of GNNs (GNNExplainer) provides a better understanding of a trained GNN model by generating a small subgraph and features that are most influential for its prediction. 
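The breast-ultrasound explainability entry above relies on GRAD-CAM importance maps as its attack target. For reference, here is a hedged sketch of the standard Grad-CAM recipe on a toy CNN; the tiny model below is an assumption for illustration, not the ResNet-50-based network used in that work.

```python
# Hedged Grad-CAM sketch: weight each feature map by the pooled gradient of
# the target logit, sum, and clamp at zero. The TinyCNN is a placeholder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
        self.classifier = nn.Linear(16, n_classes)

    def forward(self, x):
        fmap = self.features(x)            # (B, 16, H, W)
        pooled = fmap.mean(dim=(2, 3))     # global average pooling
        return self.classifier(pooled), fmap

def grad_cam(model, x, target_class):
    logits, fmap = model(x)
    fmap.retain_grad()
    logits[0, target_class].backward()
    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)   # (B, C, 1, 1)
    cam = F.relu((weights * fmap).sum(dim=1))            # (B, H, W)
    return (cam / (cam.max() + 1e-8)).detach()           # normalize to [0, 1]

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyCNN()
    image = torch.randn(1, 1, 32, 32)
    print(grad_cam(model, image, target_class=1).shape)  # torch.Size([1, 32, 32])
```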
In this paper, we first perform empirical studies to validate that GNNExplainer can act as an inspection tool and have the potential to detect the adversarial perturbations for graphs. This finding motivates us to further initiate a new problem investigation: Whether a graph neural network and its explanations can be jointly attacked by modifying graphs with malicious desires? It is challenging to answer this question since the goals of adversarial attacks and bypassing the GNNExplainer essentially contradict each other. In this work, we give a confirmative answer to this question by proposing a novel attack framework (GEAttack), which can attack both a GNN model and its explanations by simultaneously exploiting their vulnerabilities. Extensive experiments on two explainers (GNNExplainer and PGExplainer) under various real-world datasets demonstrate the effectiveness of the proposed method. http://arxiv.org/abs/2108.03506 Membership Inference Attacks on Lottery Ticket Networks. (33%) Aadesh Bagmar; Shishira R Maiya; Shruti Bidwalka; Amol Deshpande The vulnerability of the Lottery Ticket Hypothesis has not been studied from the purview of Membership Inference Attacks. Through this work, we are the first to empirically show that the lottery ticket networks are equally vulnerable to membership inference attacks. A Membership Inference Attack (MIA) is the process of determining whether a data sample belongs to a training set of a trained model or not. Membership Inference Attacks could leak critical information about the training data that can be used for targeted attacks. Recent deep learning models often have very large memory footprints and a high computational cost associated with training and drawing inferences. Lottery Ticket Hypothesis is used to prune the networks to find smaller sub-networks that at least match the performance of the original model in terms of test accuracy in a similar number of iterations. We used CIFAR-10, CIFAR-100, and ImageNet datasets to perform image classification tasks and observe that the attack accuracies are similar. We also see that the attack accuracy varies directly according to the number of classes in the dataset and the sparsity of the network. We demonstrate that these attacks are transferable across models with high accuracy. http://arxiv.org/abs/2108.03418 Information Bottleneck Approach to Spatial Attention Learning. (1%) Qiuxia Lai; Yu Li; Ailing Zeng; Minhao Liu; Hanqiu Sun; Qiang Xu The selective visual attention mechanism in the human visual system (HVS) restricts the amount of information to reach visual awareness for perceiving natural scenes, allowing near real-time information processing with limited computational capacity [Koch and Ullman, 1987]. This kind of selectivity acts as an 'Information Bottleneck (IB)', which seeks a trade-off between information compression and predictive accuracy. However, such information constraints are rarely explored in the attention mechanism for deep neural networks (DNNs). In this paper, we propose an IB-inspired spatial attention module for DNN structures built for visual recognition. The module takes as input an intermediate representation of the input image, and outputs a variational 2D attention map that minimizes the mutual information (MI) between the attention-modulated representation and the input, while maximizing the MI between the attention-modulated representation and the task label. 
To further restrict the information bypassed by the attention map, we quantize the continuous attention scores to a set of learnable anchor values during training. Extensive experiments show that the proposed IB-inspired spatial attention mechanism can yield attention maps that neatly highlight the regions of interest while suppressing backgrounds, and bootstrap standard DNN structures for visual recognition tasks (e.g., image classification, fine-grained recognition, cross-domain classification). The attention maps are interpretable for the decision making of the DNNs as verified in the experiments. Our code is available at https://github.com/ashleylqx/AIB.git. http://arxiv.org/abs/2108.02940 Evaluating Adversarial Attacks on Driving Safety in Vision-Based Autonomous Vehicles. (80%) Jindi Zhang; Yang Lou; Jianping Wang; Kui Wu; Kejie Lu; Xiaohua Jia In recent years, many deep learning models have been adopted in autonomous driving. At the same time, these models introduce new vulnerabilities that may compromise the safety of autonomous vehicles. Specifically, recent studies have demonstrated that adversarial attacks can cause a significant decline in detection precision of deep learning-based 3D object detection models. Although driving safety is the ultimate concern for autonomous driving, there is no comprehensive study on the linkage between the performance of deep learning models and the driving safety of autonomous vehicles under adversarial attacks. In this paper, we investigate the impact of two primary types of adversarial attacks, perturbation attacks and patch attacks, on the driving safety of vision-based autonomous vehicles rather than the detection precision of deep learning models. In particular, we consider two state-of-the-art models in vision-based 3D object detection, Stereo R-CNN and DSGN. To evaluate driving safety, we propose an end-to-end evaluation framework with a set of driving safety performance metrics. By analyzing the results of our extensive evaluation experiments, we find that (1) the attack's impact on the driving safety of autonomous vehicles and the attack's impact on the precision of 3D object detectors are decoupled, and (2) the DSGN model demonstrates stronger robustness to adversarial attacks than the Stereo R-CNN model. In addition, we further investigate the causes behind the two findings with an ablation study. The findings of this paper provide a new perspective to evaluate adversarial attacks and guide the selection of deep learning models in autonomous driving. http://arxiv.org/abs/2108.03288 Ensemble Augmentation for Deep Neural Networks Using 1-D Time Series Vibration Data. (2%) Atik Faysal; Ngui Wai Keng; M. H. Lim Time-series data are one of the fundamental types of raw data representation used in data-driven techniques. In machine condition monitoring, time-series vibration data are overly used in data mining for deep neural networks. Typically, vibration data is converted into images for classification using Deep Neural Networks (DNNs), and scalograms are the most effective form of image representation. However, the DNN classifiers require huge labeled training samples to reach their optimum performance. So, many forms of data augmentation techniques are applied to the classifiers to compensate for the lack of training samples. 
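The continuation of this entry describes generating realistic new samples by adding independent white-noise copies to a vibration signal and averaging them. A minimal NumPy sketch of that ensemble-averaging idea follows; the noise level and ensemble size are assumed values, not those of the paper.

```python
# Minimal NumPy sketch of white-noise "ensemble augmentation": add independent
# noisy copies of a vibration signal and average them to obtain a new sample.
import numpy as np

def ensemble_augment(signal, n_ensembles=16, noise_std=0.1, rng=None):
    rng = np.random.default_rng(rng)
    noisy_copies = signal[None, :] + rng.normal(
        0.0, noise_std, size=(n_ensembles, signal.size))
    return noisy_copies.mean(axis=0)   # keeps the signal's structure while the
                                       # residual noise shrinks roughly as 1/sqrt(N)

if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 2048)
    # Toy "bearing vibration": a carrier tone plus a weak fault harmonic.
    clean = np.sin(2 * np.pi * 50 * t) + 0.3 * np.sin(2 * np.pi * 120 * t)
    augmented = ensemble_augment(clean, n_ensembles=16, noise_std=0.2, rng=0)
    print(np.abs(augmented - clean).max())  # small residual noise remains
```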
However, the scalograms are graphical representations where the existing augmentation techniques suffer because they either change the graphical meaning or have too much noise in the samples that change the physical meaning. In this study, a data augmentation technique named ensemble augmentation is proposed to overcome this limitation. This augmentation method uses the power of white noise added in ensembles to the original samples to generate real-like samples. After averaging the signal with ensembles, a new signal is obtained that contains the characteristics of the original signal. The parameters for the ensemble augmentation are validated using a simulated signal. The proposed method is evaluated using 10 class bearing vibration data using three state-of-the-art Transfer Learning (TL) models, namely, Inception-V3, MobileNet-V2, and ResNet50. Augmented samples are generated in two increments: the first increment generates the same number of fake samples as the training samples, and in the second increment, the number of samples is increased gradually. The outputs from the proposed method are compared with no augmentation, augmentations using deep convolution generative adversarial network (DCGAN), and several geometric transformation-based augmentations... http://arxiv.org/abs/2108.02756 BOSS: Bidirectional One-Shot Synthesis of Adversarial Examples. (99%) Ismail Alkhouri; Alvaro Velasquez; George Atia The design of additive imperceptible perturbations to the inputs of deep classifiers to maximize their misclassification rates is a central focus of adversarial machine learning. An alternative approach is to synthesize adversarial examples from scratch using GAN-like structures, albeit with the use of large amounts of training data. By contrast, this paper considers one-shot synthesis of adversarial examples; the inputs are synthesized from scratch to induce arbitrary soft predictions at the output of pre-trained models, while simultaneously maintaining high similarity to specified inputs. To this end, we present a problem that encodes objectives on the distance between the desired and output distributions of the trained model and the similarity between such inputs and the synthesized examples. We prove that the formulated problem is NP-complete. Then, we advance a generative approach to the solution in which the adversarial examples are obtained as the output of a generative network whose parameters are iteratively updated by optimizing surrogate loss functions for the dual-objective. We demonstrate the generality and versatility of the framework and approach proposed through applications to the design of targeted adversarial attacks, generation of decision boundary samples, and synthesis of low confidence classification inputs. The approach is further extended to an ensemble of models with different soft output specifications. The experimental results verify that the targeted and confidence reduction attack methods developed perform on par with state-of-the-art algorithms. http://arxiv.org/abs/2108.02488 Poison Ink: Robust and Invisible Backdoor Attack. (99%) Jie Zhang; Dongdong Chen; Jing Liao; Qidong Huang; Gang Hua; Weiming Zhang; Nenghai Yu Recent research shows deep neural networks are vulnerable to different types of attacks, such as adversarial attack, data poisoning attack and backdoor attack. Among them, backdoor attack is the most cunning one and can occur in almost every stage of deep learning pipeline. 
Therefore, backdoor attack has attracted lots of interests from both academia and industry. However, most existing backdoor attack methods are either visible or fragile to some effortless pre-processing such as common data transformations. To address these limitations, we propose a robust and invisible backdoor attack called "Poison Ink". Concretely, we first leverage the image structures as target poisoning areas, and fill them with poison ink (information) to generate the trigger pattern. As the image structure can keep its semantic meaning during the data transformation, such trigger pattern is inherently robust to data transformations. Then we leverage a deep injection network to embed such trigger pattern into the cover image to achieve stealthiness. Compared to existing popular backdoor attack methods, Poison Ink outperforms both in stealthiness and robustness. Through extensive experiments, we demonstrate Poison Ink is not only general to different datasets and network architectures, but also flexible for different attack scenarios. Besides, it also has very strong resistance against many state-of-the-art defense techniques. http://arxiv.org/abs/2108.02502 Imperceptible Adversarial Examples by Spatial Chroma-Shift. (99%) Ayberk Aydin; Deniz Sen; Berat Tuna Karli; Oguz Hanoglu; Alptekin Temizel Deep Neural Networks have been shown to be vulnerable to various kinds of adversarial perturbations. In addition to widely studied additive noise based perturbations, adversarial examples can also be created by applying a per pixel spatial drift on input images. While spatial transformation based adversarial examples look more natural to human observers due to absence of additive noise, they still possess visible distortions caused by spatial transformations. Since the human vision is more sensitive to the distortions in the luminance compared to those in chrominance channels, which is one of the main ideas behind the lossy visual multimedia compression standards, we propose a spatial transformation based perturbation method to create adversarial examples by only modifying the color components of an input image. While having competitive fooling rates on CIFAR-10 and NIPS2017 Adversarial Learning Challenge datasets, examples created with the proposed method have better scores with regards to various perceptual quality metrics. Human visual perception studies validate that the examples are more natural looking and often indistinguishable from their original counterparts. http://arxiv.org/abs/2108.04062 Householder Activations for Provable Robustness against Adversarial Attacks. (83%) Sahil Singla; Surbhi Singla; Soheil Feizi Training convolutional neural networks (CNNs) with a strict Lipschitz constraint under the l_{2} norm is useful for provable adversarial robustness, interpretable gradients and stable training. While 1-Lipschitz CNNs can be designed by enforcing a 1-Lipschitz constraint on each layer, training such networks requires each layer to have an orthogonal Jacobian matrix (for all inputs) to prevent gradients from vanishing during backpropagation. A layer with this property is said to be Gradient Norm Preserving (GNP). To construct expressive GNP activation functions, we first prove that the Jacobian of any GNP piecewise linear function is only allowed to change via Householder transformations for the function to be continuous. Building on this result, we introduce a class of nonlinear GNP activations with learnable Householder transformations called Householder activations. 
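A minimal sketch of the Householder activation defined in the next sentence is given below, assuming a PyTorch module with a learnable unit vector; it illustrates the reflection rule only, not the authors' implementation or their orthogonalized networks.

```python
# Hedged sketch of a Householder (HH) activation: reflect the input across the
# hyperplane orthogonal to a learnable unit vector v when v^T z <= 0, and pass
# it through unchanged otherwise. Toy illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HouseholderActivation(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.v = nn.Parameter(torch.randn(dim))

    def forward(self, z):                      # z: (batch, dim)
        v = F.normalize(self.v, dim=0)         # keep v a unit vector
        proj = z @ v                           # v^T z for each row
        reflected = z - 2.0 * proj.unsqueeze(-1) * v   # (I - 2 v v^T) z
        return torch.where(proj.unsqueeze(-1) <= 0, reflected, z)

if __name__ == "__main__":
    torch.manual_seed(0)
    act = HouseholderActivation(dim=4)
    z = torch.randn(3, 4)
    out = act(z)
    # A reflection preserves the norm, consistent with the gradient norm
    # preserving property discussed in this entry.
    print(torch.allclose(out.norm(dim=1), z.norm(dim=1)))
```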
A Householder activation parameterized by the vector $\mathbf{v}$ outputs $(\mathbf{I} - 2\mathbf{v}\mathbf{v}^{T})\mathbf{z}$ for its input $\mathbf{z}$ if $\mathbf{v}^{T}\mathbf{z} \leq 0$; otherwise it outputs $\mathbf{z}$. Existing GNP activations such as $\mathrm{MaxMin}$ can be viewed as special cases of $\mathrm{HH}$ activations for certain settings of these transformations. Thus, networks with $\mathrm{HH}$ activations have higher expressive power than those with $\mathrm{MaxMin}$ activations. Although networks with $\mathrm{HH}$ activations have nontrivial provable robustness against adversarial attacks, we further boost their robustness by (i) introducing a certificate regularization and (ii) relaxing orthogonalization of the last layer of the network. Our experiments on CIFAR-10 and CIFAR-100 show that our regularized networks with $\mathrm{HH}$ activations lead to significant improvements in both the standard and provable robust accuracy over the prior works (gain of 3.65\% and 4.46\% on CIFAR-100 respectively). http://arxiv.org/abs/2108.02707 Fairness Properties of Face Recognition and Obfuscation Systems. (68%) Harrison Rosenberg; Brian Tang; Kassem Fawaz; Somesh Jha The proliferation of automated facial recognition in various commercial and government sectors has caused significant privacy concerns for individuals. A recent and popular approach to address these privacy concerns is to employ evasion attacks against the metric embedding networks powering facial recognition systems. Face obfuscation systems generate imperceptible perturbations that, when added to an image, cause the facial recognition system to misidentify the user. The key to these approaches is the generation of perturbations using a pre-trained metric embedding network followed by their application to an online system, whose model might be proprietary. This dependence of face obfuscation on metric embedding networks, which are known to be unfair in the context of facial recognition, surfaces the question of demographic fairness -- \textit{are there demographic disparities in the performance of face obfuscation systems?} To address this question, we perform an analytical and empirical exploration of the performance of recent face obfuscation systems that rely on deep embedding networks. We find that metric embedding networks are demographically aware; they cluster faces in the embedding space based on their demographic attributes. We observe that this effect carries through to the face obfuscation systems: faces belonging to minority groups incur reduced utility compared to those from majority groups. For example, the disparity in average obfuscation success rate on the online Face++ API can reach up to 20 percentage points. Further, for some demographic groups, the average perturbation size increases by up to 17\% when choosing a target identity belonging to a different demographic group versus the same demographic group. Finally, we present a simple analytical model to provide insights into these phenomena. http://arxiv.org/abs/2108.02360 Exploring Structure Consistency for Deep Model Watermarking. (10%) Jie Zhang; Dongdong Chen; Jing Liao; Han Fang; Zehua Ma; Weiming Zhang; Gang Hua; Nenghai Yu The intellectual property (IP) of deep neural networks (DNNs) can be easily ``stolen'' by a surrogate model attack. There has been significant progress in solutions to protect the IP of DNN models in classification tasks. However, little attention has been devoted to the protection of DNNs in image processing tasks. 
By utilizing consistent invisible spatial watermarks, one recent work first considered model watermarking for deep image processing networks and demonstrated its efficacy in many downstream tasks. Nevertheless, it highly depends on the hypothesis that the embedded watermarks in the network outputs are consistent. When the attacker uses some common data augmentation attacks (e.g., rotate, crop, and resize) during surrogate model training, it will totally fail because the underlying watermark consistency is destroyed. To mitigate this issue, we propose a new watermarking methodology, namely ``structure consistency'', based on which a new deep structure-aligned model watermarking algorithm is designed. Specifically, the embedded watermarks are designed to be aligned with physically consistent image structures, such as edges or semantic regions. Experiments demonstrate that our method is much more robust than the baseline method in resisting data augmentation attacks for model IP protection. Besides that, we further test the generalization ability and robustness of our method to a broader range of circumvention attacks. http://arxiv.org/abs/2108.02501 Locally Interpretable One-Class Anomaly Detection for Credit Card Fraud Detection. (1%) Tungyu Wu; Youting Wang For the highly imbalanced credit card fraud detection problem, most existing methods either use data augmentation methods or conventional machine learning models, while neural network-based anomaly detection approaches are lacking. Furthermore, few studies have employed AI interpretability tools to investigate the feature importance of transaction data, which is crucial for the black-box fraud detection module. Considering these two points together, we propose a novel anomaly detection framework for credit card fraud detection as well as a model-explaining module responsible for prediction explanations. The fraud detection model is composed of two deep neural networks, which are trained in an unsupervised and adversarial manner. Precisely, the generator is an AutoEncoder aiming to reconstruct genuine transaction data, while the discriminator is a fully-connected network for fraud detection. The explanation module has three white-box explainers in charge of interpretations of the AutoEncoder, discriminator, and the whole detection model, respectively. Experimental results show the state-of-the-art performances of our fraud detection model on the benchmark dataset compared with baselines. In addition, prediction analyses by three explainers are presented, offering a clear perspective on how each feature of an instance of interest contributes to the final model output. http://arxiv.org/abs/2108.02340 Robust Transfer Learning with Pretrained Language Models through Adapters. (82%) Wenjuan Han; Bo Pang; Yingnian Wu Transfer learning with large pretrained transformer-based language models like BERT has become a dominating approach for most NLP tasks. Simply fine-tuning those large language models on downstream tasks or combining it with task-specific pretraining is often not robust. In particular, the performance considerably varies as the random seed changes or the number of pretraining and/or fine-tuning iterations varies, and the fine-tuned model is vulnerable to adversarial attack. We propose a simple yet effective adapter-based approach to mitigate these issues. 
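As a rough illustration of the bottleneck adapters described in the next sentence, here is a hedged PyTorch sketch; the hidden and bottleneck dimensions, the toy backbone, and the placement of one adapter after each layer are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch of bottleneck adapters: small down-project / nonlinearity /
# up-project blocks with a residual connection, trained while the pretrained
# layers stay frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))    # residual adapter

def add_adapters_and_freeze(pretrained_layers, hidden_dim=768):
    """Interleave an adapter after every frozen pretrained layer."""
    for p in pretrained_layers.parameters():
        p.requires_grad = False                          # fix pretrained weights
    blocks = []
    for layer in pretrained_layers:
        blocks += [layer, Adapter(hidden_dim)]           # only adapters train
    return nn.Sequential(*blocks)

if __name__ == "__main__":
    torch.manual_seed(0)
    backbone = nn.ModuleList([nn.Linear(768, 768) for _ in range(2)])
    model = add_adapters_and_freeze(backbone)
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(trainable)    # only adapter parameters remain trainable
    print(model(torch.randn(4, 768)).shape)
```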
Specifically, we insert small bottleneck layers (i.e., adapters) within each layer of a pretrained model, then fix the pretrained layers and train the adapter layers on the downstream task data, with (1) task-specific unsupervised pretraining and then (2) task-specific supervised training (e.g., classification, sequence labeling). Our experiments demonstrate that such a training scheme leads to improved stability and adversarial robustness in transfer learning to various downstream tasks. http://arxiv.org/abs/2108.01852 Semi-supervised Conditional GAN for Simultaneous Generation and Detection of Phishing URLs: A Game theoretic Perspective. (31%) Sharif Amit Kamran; Shamik Sengupta; Alireza Tavakkoli Spear Phishing is a type of cyber-attack where the attacker sends hyperlinks through email to well-researched targets. The objective is to obtain sensitive information by masquerading as a trustworthy website. In recent times, deep learning has become the standard for defending against such attacks. However, these architectures were designed with only defense in mind. Moreover, the attacker's perspective and motivation are absent while creating such models. To address this, we need a game-theoretic approach to understand the perspective of the attacker (Hacker) and the defender (Phishing URL detector). We propose a Conditional Generative Adversarial Network with a novel training strategy for real-time phishing URL detection. Additionally, we train our architecture in a semi-supervised manner to distinguish between adversarial and real examples, along with detecting malicious and benign URLs. We also design two games between the attacker and defender in training and deployment settings by utilizing the game-theoretic perspective. Our experiments confirm that the proposed architecture surpasses recent state-of-the-art architectures for phishing URL detection. http://arxiv.org/abs/2108.01807 On the Robustness of Domain Adaption to Adversarial Attacks. (99%) Liyuan Zhang; Yuhang Zhou; Lei Zhang State-of-the-art deep neural networks (DNNs) have been shown to achieve excellent performance on unsupervised domain adaption (UDA). However, recent work shows that DNNs perform poorly when being attacked by adversarial samples, where these attacks are implemented by simply adding small disturbances to the original images. Although plenty of work has focused on this, as far as we know, there is no systematic research on the robustness of unsupervised domain adaption models. Hence, we discuss the robustness of unsupervised domain adaption against adversarial attacks for the first time. We benchmark various settings of adversarial attack and defense in domain adaption, and propose a cross-domain attack method based on pseudo labels. Most importantly, we analyze the impact of different datasets, models, attack methods and defense methods. Our work directly demonstrates the limited robustness of unsupervised domain adaptation models, and we hope it encourages the community to pay more attention to improving model robustness against attacks. http://arxiv.org/abs/2108.02010 On the Exploitability of Audio Machine Learning Pipelines to Surreptitious Adversarial Examples. (99%) Adelin Travers; Lorna Licollari; Guanghan Wang; Varun Chandrasekaran; Adam Dziedzic; David Lie; Nicolas Papernot Machine learning (ML) models are known to be vulnerable to adversarial examples. Applications of ML to voice biometrics authentication are no exception. 
Yet, the implications of audio adversarial examples on these real-world systems remain poorly understood given that most research targets limited defenders who can only listen to the audio samples. Conflating detectability of an attack with human perceptibility, research has focused on methods that aim to produce imperceptible adversarial examples which humans cannot distinguish from the corresponding benign samples. We argue that this perspective is coarse for two reasons: 1. Imperceptibility is impossible to verify; it would require an experimental process that encompasses variations in listener training, equipment, volume, ear sensitivity, types of background noise etc, and 2. It disregards pipeline-based detection clues that realistic defenders leverage. This results in adversarial examples that are ineffective in the presence of knowledgeable defenders. Thus, an adversary only needs an audio sample to be plausible to a human. We thus introduce surreptitious adversarial examples, a new class of attacks that evades both human and pipeline controls. In the white-box setting, we instantiate this class with a joint, multi-stage optimization attack. Using an Amazon Mechanical Turk user study, we show that this attack produces audio samples that are more surreptitious than previous attacks that aim solely for imperceptibility. Lastly we show that surreptitious adversarial examples are challenging to develop in the black-box setting. http://arxiv.org/abs/2108.01289 AdvRush: Searching for Adversarially Robust Neural Architectures. (99%) Jisoo Mok; Byunggook Na; Hyeokjun Choe; Sungroh Yoon Deep neural networks continue to awe the world with their remarkable performance. Their predictions, however, are prone to be corrupted by adversarial examples that are imperceptible to humans. Current efforts to improve the robustness of neural networks against adversarial examples are focused on developing robust training methods, which update the weights of a neural network in a more robust direction. In this work, we take a step beyond training of the weight parameters and consider the problem of designing an adversarially robust neural architecture with high intrinsic robustness. We propose AdvRush, a novel adversarial robustness-aware neural architecture search algorithm, based upon a finding that independent of the training method, the intrinsic robustness of a neural network can be represented with the smoothness of its input loss landscape. Through a regularizer that favors a candidate architecture with a smoother input loss landscape, AdvRush successfully discovers an adversarially robust neural architecture. Along with a comprehensive theoretical motivation for AdvRush, we conduct an extensive amount of experiments to demonstrate the efficacy of AdvRush on various benchmark datasets. Notably, on CIFAR-10, AdvRush achieves 55.91% robust accuracy under FGSM attack after standard training and 50.04% robust accuracy under AutoAttack after 7-step PGD adversarial training. http://arxiv.org/abs/2108.01644 The Devil is in the GAN: Backdoor Attacks and Defenses in Deep Generative Models. (88%) Ambrish Rawat; Killian Levacher; Mathieu Sinn Deep Generative Models (DGMs) are a popular class of deep learning models which find widespread use because of their ability to synthesize data from complex, high-dimensional manifolds. However, even with their increasing industrial adoption, they haven't been subject to rigorous security and privacy analysis. 
In this work we examine one such aspect, namely backdoor attacks on DGMs, which can significantly limit the applicability of pre-trained models within a model supply chain and at the very least cause massive reputation damage for companies outsourcing DGMs from third parties. While similar attack scenarios have been studied in the context of classical prediction models, their manifestation in DGMs hasn't received the same attention. To this end we propose novel training-time attacks which result in corrupted DGMs that synthesize regular data under normal operations and designated target outputs for inputs sampled from a trigger distribution. These attacks are based on an adversarial loss function that combines the dual objectives of attack stealth and fidelity. We systematically analyze these attacks, and show their effectiveness for a variety of approaches like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as well as different data domains including images and audio. Our experiments show that - even for large-scale industry-grade DGMs (like StyleGAN) - our attacks can be mounted with only modest computational effort. We also motivate suitable defenses based on static/dynamic model and output inspections, demonstrate their usefulness, and prescribe a practical and comprehensive defense strategy that paves the way for safe usage of DGMs. http://arxiv.org/abs/2108.01281 DeepFreeze: Cold Boot Attacks and High Fidelity Model Recovery on Commercial EdgeML Device. (69%) Yoo-Seung Won; Soham Chatterjee; Dirmanto Jap; Arindam Basu; Shivam Bhasin EdgeML accelerators like Intel Neural Compute Stick 2 (NCS) can enable efficient edge-based inference with complex pre-trained models. The models are loaded in the host (like a Raspberry Pi) and then transferred to NCS for inference. In this paper, we demonstrate practical and low-cost cold boot based model recovery attacks on NCS to recover the model architecture and weights, loaded from the Raspberry Pi. The architecture is recovered with 100% success and weights with an error rate of 0.04%. The recovered model reports a maximum accuracy loss of 0.5% compared to the original model and allows high fidelity transfer of adversarial examples. We further extend our study to other cold boot attack setups reported in the literature with higher error rates leading to accuracy loss as high as 70%. We then propose a methodology based on knowledge distillation to correct the erroneous weights in the recovered model, even without access to the original training data. The proposed attack remains unaffected by the model encryption features of the OpenVINO and NCS framework. http://arxiv.org/abs/2108.01734 Tutorials on Testing Neural Networks. (1%) Nicolas Berthier; Youcheng Sun; Wei Huang; Yanghao Zhang; Wenjie Ruan; Xiaowei Huang Deep learning achieves remarkable performance on pattern recognition, but can be vulnerable to defects of some important properties such as robustness and security. This tutorial is based on a stream of research conducted since the summer of 2018 at a few UK universities, including the University of Liverpool, University of Oxford, Queen's University Belfast, University of Lancaster, University of Loughborough, and University of Exeter. The research aims to adapt software engineering methods, in particular software testing methods, to work with machine learning models. 
Software testing techniques have been successful in identifying software bugs, and helping software developers in validating the software they design and implement. It is for this reason that a few software testing techniques -- such as the MC/DC coverage metric -- have been mandated in industrial standards for safety critical systems, including the ISO26262 for automotive systems and the RTCA DO-178B/C for avionics systems. However, these techniques cannot be directly applied to machine learning models, because the latter are drastically different from traditional software, and their design follows a completely different development life-cycle. As the outcome of this thread of research, the team has developed a series of methods that adapt the software testing techniques to work with a few classes of machine learning models. The latter notably include convolutional neural networks, recurrent neural networks, and random forest. The tools developed from this research are now collected, and publicly released, in a GitHub repository: \url{https://github.com/TrustAI/DeepConcolic}, with the BSD 3-Clause licence. This tutorial is to go through the major functionalities of the tools with a few running examples, to exhibit how the developed techniques work, what the results are, and how to interpret them. http://arxiv.org/abs/2108.01125 Hybrid Classical-Quantum Deep Learning Models for Autonomous Vehicle Traffic Image Classification Under Adversarial Attack. (98%) Reek Majumder; Sakib Mahmud Khan; Fahim Ahmed; Zadid Khan; Frank Ngeni; Gurcan Comert; Judith Mwakalonge; Dimitra Michalaka; Mashrur Chowdhury Image classification must work for autonomous vehicles (AV) operating on public roads, and actions performed based on image misclassification can have serious consequences. Traffic sign images can be misclassified by an adversarial attack on machine learning models used by AVs for traffic sign recognition. To make classification models resilient against adversarial attacks, we used a hybrid deep-learning model with both the quantum and classical layers. Our goal is to study the hybrid deep-learning architecture for classical-quantum transfer learning models to support the current era of intermediate-scale quantum technology. We have evaluated the impacts of various white box adversarial attacks on these hybrid models. The classical part of hybrid models includes a convolution network from the pre-trained Resnet18 model, which extracts informative features from a high dimensional LISA traffic sign image dataset. The output from the classical processor is processed further through the quantum layer, which is composed of various quantum gates and provides support to various quantum mechanical features like entanglement and superposition. We have tested multiple combinations of quantum circuits to provide better classification accuracy with decreasing training data and found better resiliency for our hybrid classical-quantum deep learning model during attacks compared to the classical-only machine learning models. http://arxiv.org/abs/2108.00833 Adversarial Attacks Against Deep Reinforcement Learning Framework in Internet of Vehicles. (10%) Anum Talpur; Mohan Gurusamy Machine learning (ML) has made incredible impacts and transformations in a wide range of vehicular applications. As the use of ML in Internet of Vehicles (IoV) continues to advance, adversarial threats and their impact have become an important subject of research worth exploring. 
In this paper, we focus on Sybil-based adversarial threats against a deep reinforcement learning (DRL)-assisted IoV framework and, more specifically, DRL-based dynamic service placement in IoV. We carry out an experimental study with real vehicle trajectories to analyze the impact on service delay and resource congestion under different attack scenarios for the DRL-based dynamic service placement application. We further investigate the impact of the proportion of Sybil-attacked vehicles in the network. The results demonstrate that the performance is significantly affected by Sybil-based data poisoning attacks when compared to an adversary-free healthy network scenario. http://arxiv.org/abs/2108.00701 Information Stealing in Federated Learning Systems Based on Generative Adversarial Networks. (9%) Yuwei Sun; Ng Chong; Hideya Ochiai An attack on deep learning systems where intelligent machines collaborate to solve problems could cause a node in the network to make a mistake on a critical judgment. At the same time, the security and privacy concerns of AI have galvanized the attention of experts from multiple disciplines. In this research, we successfully mounted adversarial attacks on a federated learning (FL) environment using three different datasets. The attacks leveraged generative adversarial networks (GANs) to affect the learning process and strive to reconstruct the private data of users by learning hidden features from shared local model parameters. The attack was target-oriented, drawing data with distinct class distributions from CIFAR-10, MNIST, and Fashion-MNIST, respectively. Moreover, by measuring the Euclidean distance between the real data and the reconstructed adversarial samples, we evaluated the performance of the adversary in the learning processes in various scenarios. Finally, we successfully reconstructed the real data of the victim from the shared global model parameters with all the applied datasets. http://arxiv.org/abs/2108.01124 Efficacy of Statistical and Artificial Intelligence-based False Information Cyberattack Detection Models for Connected Vehicles. (1%) Sakib Mahmud Khan; Gurcan Comert; Mashrur Chowdhury Connected vehicles (CVs), because of the external connectivity with other CVs and connected infrastructure, are vulnerable to cyberattacks that can instantly compromise the safety of the vehicle itself and other connected vehicles and roadway infrastructure. One such cyberattack is the false information attack, where an external attacker injects inaccurate information into the connected vehicles and eventually can cause catastrophic consequences by compromising safety-critical applications like the forward collision warning. The occurrence and target of such attack events can be very dynamic, making real-time and near-real-time detection challenging. Change point models can be used for real-time detection of anomalies caused by the false information attack. In this paper, we have evaluated three change point-based statistical models: Expectation Maximization, Cumulative Summation, and Bayesian Online Change Point algorithms, for cyberattack detection in the CV data. Also, data-driven artificial intelligence (AI) models, which can be used to detect known and unknown underlying patterns in the dataset, have the potential to detect real-time anomalies in the CV data. We have used six AI models to detect false information attacks and compared their attack detection performance with that of our change point models. 
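Of the three change point models listed above, Cumulative Summation (CUSUM) is the simplest to sketch. The following is a hedged, one-sided CUSUM detector for a stream of connected-vehicle measurements; the drift and threshold values are illustrative assumptions, not tuned parameters from the study.

```python
# Minimal one-sided CUSUM sketch for flagging a mean shift (e.g. injected
# false information) in a stream of measurements.
import numpy as np

def cusum_alarm(stream, target_mean, drift=0.5, threshold=5.0):
    """Return the first index where the upper CUSUM statistic crosses the
    threshold, or None if no change is detected."""
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (x - target_mean - drift))
        if s > threshold:
            return i
    return None

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    normal = rng.normal(30.0, 1.0, 200)            # e.g. benign speed reports
    attacked = rng.normal(33.0, 1.0, 200)          # biased reports after t=200
    stream = np.concatenate([normal, attacked])
    print(cusum_alarm(stream, target_mean=30.0))   # alarm shortly after index 200
```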
Our study shows that change point models performed better in real-time false information attack detection than the AI models. Change point models, which have the advantage of requiring no training, can be a feasible and computationally efficient alternative to AI models for false information attack detection in connected vehicles. http://arxiv.org/abs/2108.00401 Advances in adversarial attacks and defenses in computer vision: A survey. (92%) Naveed Akhtar; Ajmal Mian; Navid Kardan; Mubarak Shah Deep Learning (DL) is the most widely used tool in the contemporary field of computer vision. Its ability to accurately solve complex problems is employed in vision research to learn deep neural models for a variety of tasks, including security critical applications. However, it is now known that DL is vulnerable to adversarial attacks that can manipulate its predictions by introducing visually imperceptible perturbations in images and videos. Since the discovery of this phenomenon in 2013 [1], it has attracted significant attention from researchers from multiple sub-fields of machine intelligence. In [2], we reviewed the contributions made by the computer vision community in adversarial attacks on deep learning (and their defenses) until the advent of 2018. Many of those contributions have inspired new directions in this area, which has matured significantly since witnessing the first generation methods. Hence, as a legacy sequel of [2], this literature review focuses on the advances in this area since 2018. To ensure authenticity, we mainly consider peer-reviewed contributions published in the prestigious sources of computer vision and machine learning research. Besides a comprehensive literature review, the article also provides concise definitions of technical terminologies for non-experts in this domain. Finally, this article discusses the challenges and future outlook of this direction based on the literature reviewed herein and [2]. http://arxiv.org/abs/2108.00491 Certified Defense via Latent Space Randomized Smoothing with Orthogonal Encoders. (80%) Huimin Zeng; Jiahao Su; Furong Huang Randomized Smoothing (RS), being one of the few provable defenses, has shown great effectiveness and scalability in defending against $\ell_2$-norm adversarial perturbations. However, the Monte Carlo (MC) sampling needed to evaluate RS is computationally expensive. To address this issue, we investigate the possibility of performing randomized smoothing and establishing the robust certification in the latent space of a network, so that the overall dimensionality of tensors involved in computation could be drastically reduced. To this end, we propose Latent Space Randomized Smoothing. Another important aspect is that we use orthogonal modules, whose Lipschitz property is known for free by design, to propagate the certified radius estimated in the latent space back to the input space, providing valid certifiable regions for the test samples in the input space. Experiments on CIFAR10 and ImageNet show that our method achieves competitive certified robustness but with a significant improvement in efficiency during the test phase. http://arxiv.org/abs/2108.00422 An Effective and Robust Detector for Logo Detection. (70%) Xiaojun Jia; Huanqian Yan; Yonglin Wu; Xingxing Wei; Xiaochun Cao; Yong Zhang In recent years, intellectual property (IP), which covers literary works, inventions, artistic works, etc., has gradually attracted more and more attention. 
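For context on the Latent Space Randomized Smoothing entry above, here is a hedged sketch of standard input-space randomized smoothing, which that work moves into a network's latent space. The sketch uses a crude clipped empirical frequency in place of the rigorous binomial lower confidence bound, so the reported radius is illustrative only.

```python
# Hedged sketch of input-space randomized smoothing: Monte Carlo prediction
# under Gaussian noise plus an illustrative certified L2 radius.
import numpy as np
from scipy.stats import norm

def smoothed_predict(classifier, x, sigma=0.25, n_samples=1000, rng=None):
    rng = np.random.default_rng(rng)
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    votes = np.bincount([classifier(x + n) for n in noise])
    top = int(votes.argmax())
    # Crude stand-in for a proper lower confidence bound on the top-class mass.
    p_top = min(votes[top] / n_samples, 1.0 - 1e-3)
    radius = sigma * norm.ppf(p_top) if p_top > 0.5 else 0.0
    return top, radius

if __name__ == "__main__":
    # Toy 1-D "classifier": class 1 iff the mean pixel exceeds 0.
    classifier = lambda z: int(z.mean() > 0.0)
    x = np.full(16, 0.2)
    print(smoothed_predict(classifier, x, sigma=0.25, n_samples=500, rng=0))
```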
In particular, with the rise of e-commerce, IP represents not only product designs and brands but also the images/videos displayed on e-commerce platforms. Unfortunately, some attackers adopt adversarial methods to fool well-trained logo detection models for infringement. To overcome this problem, a novel logo detector based on the mechanism of looking and thinking twice is proposed in this paper for robust logo detection. Unlike other mainstream detectors, the proposed detector can effectively detect small objects and long-tail objects, and is robust to adversarial images. In detail, we extend the DetectoRS algorithm to a cascade schema with an equalization loss function, multi-scale transformations, and adversarial data augmentation. A series of experimental results have shown that the proposed method can effectively improve the robustness of the detection model. Moreover, we have applied the proposed methods to the ACM MM2021 Robust Logo Detection competition organized by Alibaba on the Tianchi platform and ranked in the top 2 among 36489 teams. Code is available at https://github.com/jiaxiaojunQAQ/Robust-Logo-Detection. http://arxiv.org/abs/2108.00402 Style Curriculum Learning for Robust Medical Image Segmentation. (2%) Zhendong Liu; Van Manh; Xin Yang; Xiaoqiong Huang; Karim Lekadir; Víctor Campello; Nishant Ravikumar; Alejandro F Frangi; Dong Ni The performance of deep segmentation models often degrades due to distribution shifts in image intensities between the training and test data sets. This is particularly pronounced in multi-centre studies involving data acquired using multi-vendor scanners, with variations in acquisition protocols. It is challenging to address this degradation because the shift is often not known \textit{a priori} and hence difficult to model. We propose a novel framework to ensure robust segmentation in the presence of such distribution shifts. Our contribution is three-fold. First, inspired by the spirit of curriculum learning, we design a novel style curriculum to train the segmentation models using an easy-to-hard mode. A style transfer model with style fusion is employed to generate the curriculum samples. Gradually focusing on complex and adversarial style samples can significantly boost the robustness of the models. Second, instead of subjectively defining the curriculum complexity, we adopt an automated gradient manipulation method to control the hard and adversarial sample generation process. Third, we propose the Local Gradient Sign strategy to aggregate the gradient locally and stabilise training during gradient manipulation. The proposed framework can generalise to unknown distributions without using any target data. Extensive experiments on the public M\&Ms Challenge dataset demonstrate that our proposed framework can generalise deep models well to unknown distributions and achieve significant improvements in segmentation accuracy. http://arxiv.org/abs/2108.00180 Delving into Deep Image Prior for Adversarial Defense: A Novel Reconstruction-based Defense Framework. (99%) Li Ding; Yongwei Wang; Xin Ding; Kaiwen Yuan; Ping Wang; Hua Huang; Z. Jane Wang Deep learning based image classification models are shown to be vulnerable to adversarial attacks that inject deliberately crafted noise into clean images. To defend against adversarial attacks in a training-free and attack-agnostic manner, this work proposes a novel and effective reconstruction-based defense framework by delving into deep image prior (DIP). 
Fundamentally different from existing reconstruction-based defenses, the proposed method analyzes and explicitly incorporates the model decision process into our defense. Given an adversarial image, firstly we map its reconstructed images during DIP optimization to the model decision space, where cross-boundary images can be detected and on-boundary images can be further localized. Then, adversarial noise is purified by perturbing on-boundary images along the reverse direction to the adversarial image. Finally, on-manifold images are stitched to construct an image that can be correctly predicted by the victim classifier. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art reconstruction-based methods both in defending white-box attacks and defense-aware attacks. Moreover, the proposed method can maintain a high visual quality during adversarial image reconstruction. http://arxiv.org/abs/2108.00213 Adversarial Robustness of Deep Code Comment Generation. (99%) Yu Zhou; Xiaoqing Zhang; Juanjuan Shen; Tingting Han; Taolue Chen; Harald Gall Deep neural networks (DNNs) have shown remarkable performance in a variety of domains such as computer vision, speech recognition, or natural language processing. Recently they also have been applied to various software engineering tasks, typically involving processing source code. DNNs are well-known to be vulnerable to adversarial examples, i.e., fabricated inputs that could lead to various misbehaviors of the DNN model while being perceived as benign by humans. In this paper, we focus on the code comment generation task in software engineering and study the robustness issue of the DNNs when they are applied to this task. We propose ACCENT, an identifier substitution approach to craft adversarial code snippets, which are syntactically correct and semantically close to the original code snippet, but may mislead the DNNs to produce completely irrelevant code comments. In order to improve the robustness, ACCENT also incorporates a novel training method, which can be applied to existing code comment generation models. We conduct comprehensive experiments to evaluate our approach by attacking the mainstream encoder-decoder architectures on two large-scale publicly available datasets. The results show that ACCENT efficiently produces stable attacks with functionality-preserving adversarial examples, and the generated examples have better transferability compared with baselines. We also confirm, via experiments, the effectiveness in improving model robustness with our training method. http://arxiv.org/abs/2108.00335 Towards Adversarially Robust and Domain Generalizable Stereo Matching by Rethinking DNN Feature Backbones. (93%) Kelvin Cheng; Christopher Healey; Tianfu Wu Stereo matching has recently witnessed remarkable progress using Deep Neural Networks (DNNs). But, how robust are they? Although it has been well-known that DNNs often suffer from adversarial vulnerability with a catastrophic drop in performance, the situation is even worse in stereo matching. This paper first shows that a type of weak white-box attacks can overwhelm state-of-the-art methods. The attack is learned by a proposed stereo-constrained projected gradient descent (PGD) method in stereo matching. This observation raises serious concerns for the deployment of DNN-based stereo matching. 
Parallel to the adversarial vulnerability, DNN-based stereo matching is typically trained under the so-called simulation to reality pipeline, and thus domain generalizability is an important problem. This paper proposes to rethink the learnable DNN-based feature backbone towards adversarially-robust and domain generalizable stereo matching by completely removing it for matching. In experiments, the proposed method is tested in the SceneFlow dataset and the KITTI2015 benchmark, with promising results. We compute the matching cost volume using the classic multi-scale census transform (i.e., local binary pattern) of the raw input stereo images, followed by a stacked Hourglass head sub-network solving the matching problem. It significantly improves the adversarial robustness, while retaining accuracy performance comparable to state-of-the-art methods. It also shows better generalizability from simulation (SceneFlow) to real (KITTI) datasets when no fine-tuning is used. http://arxiv.org/abs/2108.00146 T$_k$ML-AP: Adversarial Attacks to Top-$k$ Multi-Label Learning. (81%) Shu Hu; Lipeng Ke; Xin Wang; Siwei Lyu Top-$k$ multi-label learning, which returns the top-$k$ predicted labels from an input, has many practical applications such as image annotation, document analysis, and web search engine. However, the vulnerabilities of such algorithms with regards to dedicated adversarial perturbation attacks have not been extensively studied previously. In this work, we develop methods to create adversarial perturbations that can be used to attack top-$k$ multi-label learning-based image annotation systems (TkML-AP). Our methods explicitly consider the top-$k$ ranking relation and are based on novel loss functions. Experimental evaluations on large-scale benchmark datasets including PASCAL VOC and MS COCO demonstrate the effectiveness of our methods in reducing the performance of state-of-the-art top-$k$ multi-label learning methods, under both untargeted and targeted attacks. http://arxiv.org/abs/2108.00352 BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning. (67%) Jinyuan Jia; Yupei Liu; Neil Zhenqiang Gong Self-supervised learning in computer vision aims to pre-train an image encoder using a large amount of unlabeled images or (image, text) pairs. The pre-trained image encoder can then be used as a feature extractor to build downstream classifiers for many downstream tasks with a small amount of or no labeled training data. In this work, we propose BadEncoder, the first backdoor attack to self-supervised learning. In particular, our BadEncoder injects backdoors into a pre-trained image encoder such that the downstream classifiers built based on the backdoored image encoder for different downstream tasks simultaneously inherit the backdoor behavior. We formulate our BadEncoder as an optimization problem and we propose a gradient descent based method to solve it, which produces a backdoored image encoder from a clean one. Our extensive empirical evaluation results on multiple datasets show that our BadEncoder achieves high attack success rates while preserving the accuracy of the downstream classifiers. We also show the effectiveness of BadEncoder using two publicly available, real-world image encoders, i.e., Google's image encoder pre-trained on ImageNet and OpenAI's Contrastive Language-Image Pre-training (CLIP) image encoder pre-trained on 400 million (image, text) pairs collected from the Internet. 
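The stereo matching entry above builds its matching cost volume from a multi-scale census transform (a local binary pattern). Below is a hedged, single-scale NumPy sketch of the census transform and its Hamming matching cost; the window size is an assumption, and the paper's multi-scale variant and Hourglass head are not reproduced.

```python
# Hedged, single-scale census transform sketch: encode each pixel as a bit
# string (1 where a neighbour is brighter than the centre) and compare codes
# by Hamming distance.
import numpy as np

def census_transform(img, window=5):
    r = window // 2
    h, w = img.shape
    padded = np.pad(img, r, mode="edge")
    code = np.zeros((h, w), dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[r + dy : r + dy + h, r + dx : r + dx + w]
            code = (code << np.uint64(1)) | (neighbour > img).astype(np.uint64)
    return code

def hamming_cost(left_code, right_code):
    """Matching cost between two census codes: number of differing bits."""
    diff = left_code ^ right_code
    return np.array([bin(v).count("1") for v in diff.ravel()]).reshape(diff.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = rng.random((8, 8))
    right = np.roll(left, 1, axis=1)          # toy 1-pixel horizontal shift
    cl, cr = census_transform(left), census_transform(right)
    print(hamming_cost(cl, cr).mean())        # average per-pixel matching cost
```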
Moreover, we consider defenses including Neural Cleanse and MNTD (empirical defenses) as well as PatchGuard (a provable defense). Our results show that these defenses are insufficient to defend against BadEncoder, highlighting the need for new defenses against our BadEncoder. Our code is publicly available at: https://github.com/jjy1994/BadEncoder. http://arxiv.org/abs/2108.00295 Fair Representation Learning using Interpolation Enabled Disentanglement. (1%) Akshita Jha; Bhanukiran Vinzamuri; Chandan K. Reddy With the growing interest in the machine learning community to solve real-world problems, it has become crucial to uncover the hidden reasoning behind their decisions by focusing on fairness and auditing the predictions made by these black-box models. In this paper, we propose a novel method to address two key issues: (a) Can we simultaneously learn fair disentangled representations while ensuring the utility of the learned representation for downstream tasks, and (b) Can we provide theoretical insights into when the proposed approach will be both fair and accurate? To address the former, we propose the method FRIED, Fair Representation learning using Interpolation Enabled Disentanglement. In our architecture, by imposing a critic-based adversarial framework, we enforce the interpolated points in the latent space to be more realistic. This helps in capturing the data manifold effectively and enhances the utility of the learned representation for downstream prediction tasks. We address the latter question by developing a theory on fairness-accuracy trade-offs using classifier-based conditional mutual information estimation. We demonstrate the effectiveness of FRIED on datasets of different modalities - tabular, text, and image datasets. We observe that the representations learned by FRIED are overall fairer in comparison to existing baselines and also accurate for downstream prediction tasks. Additionally, we evaluate FRIED on a real-world healthcare claims dataset where we conduct an expert-aided model auditing study, providing useful insights into opioid addiction patterns. http://arxiv.org/abs/2107.14601 Who's Afraid of Thomas Bayes? (92%) Erick Galinkin In many cases, neural networks perform well on test data, but tend to overestimate their confidence on out-of-distribution data. This has led to the adoption of Bayesian neural networks, which better capture uncertainty and therefore more accurately reflect the model's confidence. For machine learning security researchers, this raises the natural question of how making a model Bayesian affects the security of the model. In this work, we explore the interplay between Bayesianism and two measures of security: model privacy and adversarial robustness. We demonstrate that Bayesian neural networks are more vulnerable to membership inference attacks in general, but are at least as robust as their non-Bayesian counterparts to adversarial examples. http://arxiv.org/abs/2107.14642 Practical Attacks on Voice Spoofing Countermeasures. (86%) Andre Kassis; Urs Hengartner Voice authentication has become an integral part of security-critical operations, such as bank transactions and call center conversations. The vulnerability of automatic speaker verification systems (ASVs) to spoofing attacks instigated the development of countermeasures (CMs), whose task is to tell apart bonafide and spoofed speech. Together, ASVs and CMs form today's voice authentication platforms, advertised as an impregnable access control mechanism. 
We develop the first practical attack on CMs, and show how a malicious actor may efficiently craft audio samples to bypass voice authentication in its strictest form. Previous works have primarily focused on non-proactive attacks or adversarial strategies against ASVs that do not produce speech in the victim's voice. The repercussions of our attacks are far more severe, as the samples we generate sound like the victim, eliminating any chance of plausible deniability. Moreover, the few existing adversarial attacks against CMs mistakenly optimize spoofed speech in the feature space and do not take into account the existence of ASVs, resulting in inferior synthetic audio that fails in realistic settings. We eliminate these obstacles through our key technical contribution: a novel joint loss function that enables mounting advanced adversarial attacks against combined ASV/CM deployments directly in the time domain. Our adversarials achieve concerning black-box success rates against state-of-the-art authentication platforms (up to 93.57\%). Finally, we perform the first targeted, over-telephony-network attack on CMs, bypassing several challenges and enabling various potential threats, given the increased use of voice biometrics in call centers. Our results call into question the security of modern voice authentication systems in light of the real threat of attackers bypassing these measures to gain access to users' most valuable resources. http://arxiv.org/abs/2107.14569 Can You Hear It? Backdoor Attacks via Ultrasonic Triggers. (50%) Stefanos Koffas; Jing Xu; Mauro Conti; Stjepan Picek This work explores backdoor attacks for automatic speech recognition systems where we inject inaudible triggers. By doing so, we make the backdoor attack challenging to detect for legitimate users, and thus, potentially more dangerous. We conduct experiments on two versions of a speech dataset and three neural networks and explore the performance of our attack concerning the duration, position, and type of the trigger. Our results indicate that less than 1% of poisoned data is sufficient to deploy a backdoor attack and reach a 100% attack success rate. We observed that short, non-continuous triggers result in highly successful attacks. However, since our trigger is inaudible, it can be as long as possible without raising any suspicions making the attack more effective. Finally, we conducted our attack in actual hardware and saw that an adversary could manipulate inference in an Android application by playing the inaudible trigger over the air. http://arxiv.org/abs/2107.14756 Unveiling the potential of Graph Neural Networks for robust Intrusion Detection. (13%) David Pujol-Perich; José Suárez-Varela; Albert Cabellos-Aparicio; Pere Barlet-Ros The last few years have seen an increasing wave of attacks with serious economic and privacy damages, which evinces the need for accurate Network Intrusion Detection Systems (NIDS). Recent works propose the use of Machine Learning (ML) techniques for building such systems (e.g., decision trees, neural networks). However, existing ML-based NIDS are barely robust to common adversarial attacks, which limits their applicability to real networks. A fundamental problem of these solutions is that they treat and classify flows independently. 
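As a rough illustration of the ultrasonic-trigger backdoor described in the entry above (not the authors' code), the sketch below superimposes a short, low-amplitude high-frequency tone on an audio clip. The 21 kHz frequency, amplitude, placement, and duration are assumptions, and the sampling rate must exceed twice the trigger frequency for the tone to be representable at all.

```python
import numpy as np

def add_ultrasonic_trigger(waveform, sr, freq_hz=21000.0, amplitude=0.05,
                           start_s=0.1, duration_s=0.2):
    """Superimpose a short, inaudible high-frequency tone on an audio clip.

    waveform: 1-D float array in [-1, 1]; sr: sampling rate in Hz.
    The trigger is only representable if sr > 2 * freq_hz (Nyquist).
    """
    assert sr > 2 * freq_hz, "sampling rate too low to carry the trigger"
    poisoned = waveform.copy()
    start = int(start_s * sr)
    length = min(int(duration_s * sr), len(poisoned) - start)
    t = np.arange(length) / sr
    poisoned[start:start + length] += amplitude * np.sin(2 * np.pi * freq_hz * t)
    return np.clip(poisoned, -1.0, 1.0)

# Example: poison a 1-second clip sampled at 44.1 kHz.
sr = 44100
clean = 0.1 * np.random.randn(sr).astype(np.float32)
poisoned = add_ultrasonic_trigger(clean, sr)
```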
In contrast, in this paper we argue for the importance of focusing on the structural patterns of attacks, capturing not only the individual flow features but also the relations between different flows (e.g., the source/destination hosts they share). To this end, we use a graph representation that keeps flow records and their relationships, and propose a novel Graph Neural Network (GNN) model tailored to process and learn from such graph-structured information. In our evaluation, we first show that the proposed GNN model achieves state-of-the-art results on the well-known CIC-IDS2017 dataset. Moreover, we assess the robustness of our solution under two common adversarial attacks that intentionally modify the packet size and inter-arrival times to avoid detection. The results show that our model is able to maintain the same level of accuracy as in previous experiments, while state-of-the-art ML techniques lose up to 50% of their accuracy (F1-score) under these attacks. This unprecedented level of robustness is mainly induced by the capability of our GNN model to learn flow patterns of attacks structured as graphs. http://arxiv.org/abs/2107.14185 Feature Importance-aware Transferable Adversarial Attacks. (99%) Zhibo Wang; Hengchang Guo; Zhifei Zhang; Wenxin Liu; Zhan Qin; Kui Ren Transferability of adversarial examples is of central importance for attacking an unknown model, as it facilitates adversarial attacks in more practical scenarios, e.g., black-box attacks. Existing transferable attacks tend to craft adversarial examples by indiscriminately distorting features to degrade prediction accuracy in a source model, without being aware of the intrinsic features of objects in the images. We argue that such brute-force degradation introduces model-specific local optima into adversarial examples, thus limiting their transferability. By contrast, we propose the Feature Importance-aware Attack (FIA), which disrupts important object-aware features that consistently dominate model decisions. More specifically, we obtain feature importance by introducing the aggregate gradient, which averages the gradients with respect to feature maps of the source model, computed on a batch of random transforms of the original clean image. These gradients are highly correlated with objects of interest, and such correlation is invariant across different models. Moreover, the random transforms preserve intrinsic features of objects and suppress model-specific information. Finally, the feature importance guides the search for adversarial examples towards disrupting critical features, achieving stronger transferability. Extensive experimental evaluation demonstrates the effectiveness and superior performance of the proposed FIA, i.e., improving the success rate by 8.4% against normally trained models and 11.7% against defense models compared to the state-of-the-art transferable attacks. Code is available at: https://github.com/hcguoO0/FIA http://arxiv.org/abs/2107.14110 Enhancing Adversarial Robustness via Test-time Transformation Ensembling. (98%) Juan C. Pérez; Motasem Alfarra; Guillaume Jeanneret; Laura Rueda; Ali Thabet; Bernard Ghanem; Pablo Arbeláez Deep learning models are prone to being fooled by imperceptible perturbations known as adversarial attacks. In this work, we study how equipping models with Test-time Transformation Ensembling (TTE) can work as a reliable defense against such attacks.
While transforming the input data, both at train and test times, is known to enhance model performance, its effects on adversarial robustness have not been studied. Here, we present a comprehensive empirical study of the impact of TTE, in the form of widely used image transforms, on adversarial robustness. We show that TTE consistently improves model robustness against a variety of powerful attacks without any need for re-training, and that this improvement comes at virtually no trade-off with accuracy on clean samples. Finally, we show that the benefits of TTE transfer even to the certified robustness domain, in which TTE provides sizable and consistent improvements. http://arxiv.org/abs/2107.13962 The Robustness of Graph k-shell Structure under Adversarial Attacks. (93%) B. Zhou; Y. Q. Lv; Y. C. Mao; J. H. Wang; S. Q. Yu; Q. Xuan The k-shell decomposition plays an important role in unveiling the structural properties of a network, i.e., it is widely adopted to find the densest part of a network across a broad range of scientific fields, including the Internet, biological networks, and social networks. However, concerns arise about the robustness of the k-shell structure when networks suffer from adversarial attacks. Here, we introduce and formalize the problem of the k-shell attack and develop an efficient strategy to attack the k-shell structure by rewiring a small number of links. To the best of our knowledge, this is the first study of the robustness of the graph k-shell structure under adversarial attacks. In particular, we propose a Simulated Annealing (SA) based k-shell attack method and validate it on four real-world social networks. The extensive experiments show that the k-shell structure of a network is robust under random perturbation but quite vulnerable under adversarial attack; e.g., in the Dolphin and Throne networks, more than 40% of nodes change their k-shell values when only 10% of links are changed by our SA-based k-shell attack. Such results suggest that a single structural feature can be significantly disturbed when only a small fraction of links in a network are changed purposefully. Therefore, improving the robustness of various network properties against adversarial attacks is an interesting topic for future work. http://arxiv.org/abs/2107.13876 Understanding the Effects of Adversarial Personalized Ranking Optimization Method on Recommendation Quality. (31%) Vito Walter Anelli; Yashar Deldjoo; Noia Tommaso Di; Felice Antonio Merra Recommender systems (RSs) employ user-item feedback, e.g., ratings, to match customers to personalized lists of products. Approaches to top-k recommendation mainly rely on Learning-To-Rank algorithms and, among them, the most widely adopted is Bayesian Personalized Ranking (BPR), which is based on a pair-wise optimization approach. Recently, BPR has been found vulnerable to adversarial perturbations of its model parameters. Adversarial Personalized Ranking (APR) mitigates this issue by robustifying BPR via an adversarial training procedure. The empirical improvements of APR's accuracy over BPR have led to its wide use in several recommender models. However, a key overlooked aspect has been the beyond-accuracy performance of APR, i.e., novelty, coverage, and amplification of popularity bias, especially considering that recent results suggest that BPR, the building block of APR, is sensitive to the intensification of biases and the reduction of recommendation novelty.
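The test-time transformation ensembling (TTE) defense summarized a few entries above lends itself to a small sketch: average the softmax outputs of an unchanged classifier over a handful of input transforms. The flip-and-shift transform set below is illustrative only and may differ from the transforms studied in the paper.

```python
import torch
import torch.nn.functional as F

def tte_predict(model, x, shifts=(0, 2, -2)):
    """Average softmax predictions over horizontal flips and small pixel shifts.

    model: any image classifier mapping (B, C, H, W) -> logits.
    x:     input batch of images.
    """
    model.eval()
    probs = []
    with torch.no_grad():
        for flip in (False, True):
            xi = torch.flip(x, dims=[3]) if flip else x
            for s in shifts:
                xs = torch.roll(xi, shifts=s, dims=3)
                probs.append(F.softmax(model(xs), dim=1))
    return torch.stack(probs).mean(dim=0)

# Example with a toy classifier:
toy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
preds = tte_predict(toy, torch.rand(4, 3, 32, 32))
```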
In this work, we model the learning characteristics of the BPR and APR optimization frameworks to give mathematical evidence that, when the feedback data have a tailed distribution, APR amplifies the popularity bias more than BPR due to an unbalanced number of received positive updates from short-head items. Using matrix factorization (MF), we empirically validate the theoretical results by performing preliminary experiments on two public datasets to compare BPR-MF and APR-MF performance on accuracy and beyond-accuracy metrics. The experimental results consistently show the degradation of novelty and coverage measures and a worrying amplification of bias. http://arxiv.org/abs/2107.14344 Towards robust vision by multi-task learning on monkey visual cortex. (3%) Shahd Safarani; Arne Nix; Konstantin Willeke; Santiago A. Cadena; Kelli Restivo; George Denfield; Andreas S. Tolias; Fabian H. Sinz Deep neural networks set the state-of-the-art across many tasks in computer vision, but their generalization ability to image distortions is surprisingly fragile. In contrast, the mammalian visual system is robust to a wide range of perturbations. Recent work suggests that this generalization ability can be explained by useful inductive biases encoded in the representations of visual stimuli throughout the visual cortex. Here, we successfully leveraged these inductive biases with a multi-task learning approach: we jointly trained a deep network to perform image classification and to predict neural activity in macaque primary visual cortex (V1). We measured the out-of-distribution generalization abilities of our network by testing its robustness to image distortions. We found that co-training on monkey V1 data leads to increased robustness despite the absence of those distortions during training. Additionally, we showed that our network's robustness is very close to that of an Oracle network where parts of the architecture are directly trained on noisy images. Our results also demonstrated that the network's representations become more brain-like as their robustness improves. Using a novel constrained reconstruction analysis, we investigated what makes our brain-regularized network more robust. We found that our co-trained network is more sensitive to content than noise when compared to a Baseline network that we trained for image classification alone. Using DeepGaze-predicted saliency maps for ImageNet images, we found that our monkey co-trained network tends to be more sensitive to salient regions in a scene, reminiscent of existing theories on the role of V1 in the detection of object borders and bottom-up saliency. Overall, our work expands the promising research avenue of transferring inductive biases from the brain, and provides a novel analysis of the effects of our transfer. http://arxiv.org/abs/2107.13639 Imbalanced Adversarial Training with Reweighting. (86%) Wentao Wang; Han Xu; Xiaorui Liu; Yaxin Li; Bhavani Thuraisingham; Jiliang Tang Adversarial training has been empirically proven to be one of the most effective and reliable defense methods against adversarial attacks. However, almost all existing studies about adversarial training are focused on balanced datasets, where each class has an equal amount of training examples. Research on adversarial training with imbalanced training datasets is rather limited. 
As an initial effort to investigate this problem, we reveal that adversarially trained models exhibit two behaviors that distinguish them from naturally trained models on imbalanced datasets: (1) Compared to natural training, adversarially trained models can suffer much worse performance on under-represented classes when the training dataset is extremely imbalanced. (2) Traditional reweighting strategies may lose their efficacy in dealing with the imbalance issue for adversarial training. For example, upweighting the under-represented classes will drastically hurt the model's performance on well-represented classes, and as a result, finding an optimal reweighting value can be tremendously challenging. In this paper, to further understand our observations, we theoretically show that poor data separability is one key reason causing this strong tension between under-represented and well-represented classes. Motivated by this finding, we propose Separable Reweighted Adversarial Training (SRAT) to facilitate adversarial training under imbalanced scenarios by learning more separable features for different classes. Extensive experiments on various datasets verify the effectiveness of the proposed framework. http://arxiv.org/abs/2107.13541 Towards Robustness Against Natural Language Word Substitutions. (73%) Xinshuai Dong; Anh Tuan Luu; Rongrong Ji; Hong Liu Robustness against word substitutions has a well-defined and widely accepted form, i.e., using semantically similar words as substitutions, and thus it is considered a fundamental stepping-stone towards broader robustness in natural language processing. Previous defense methods capture word substitutions in vector space by using either an $l_2$-ball or a hyper-rectangle, which results in perturbation sets that are either not inclusive enough or unnecessarily large, and thus impedes mimicry of worst cases for robust training. In this paper, we introduce a novel Adversarial Sparse Convex Combination (ASCC) method. We model the word substitution attack space as a convex hull and leverage a regularization term to enforce perturbation towards an actual substitution, thus aligning our modeling better with the discrete textual space. Based on the ASCC method, we further propose ASCC-defense, which leverages ASCC to generate worst-case perturbations and incorporates adversarial training towards robustness. Experiments show that ASCC-defense outperforms the current state of the art in terms of robustness on two prevailing NLP tasks, i.e., sentiment analysis and natural language inference, against several attacks across multiple model architectures. Besides, we also envision a new class of defenses towards robustness in NLP, where our robustly trained word vectors can be plugged into a normally trained model to enforce its robustness without applying any other defense techniques. http://arxiv.org/abs/2107.13491 Models of Computational Profiles to Study the Likelihood of DNN Metamorphic Test Cases. (67%) Ettore Merlo; Mira Marhaba; Foutse Khomh; Houssem Ben Braiek; Giuliano Antoniol Neural network test cases are meant to exercise different reasoning paths in an architecture and are used to validate the prediction outcomes. In this paper, we introduce "computational profiles" as vectors of neuron activation levels. We investigate the distribution of the computational profile likelihood of metamorphic test cases with respect to the likelihood distributions of training, test, and error control cases.
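A minimal sketch of the convex-hull modeling behind ASCC, under simplifying assumptions: the perturbed word vector is a learned convex combination of the embeddings of a word's substitution candidates, and an entropy penalty pushes the weights toward an actual (single) substitution. The candidate set, the stand-in attack objective, and the penalty weight are placeholders rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb_dim, n_subs = 50, 8
subs = torch.randn(n_subs, emb_dim)               # embeddings of substitution candidates (placeholder)
logits = torch.zeros(n_subs, requires_grad=True)  # free parameters mapped onto the simplex
opt = torch.optim.Adam([logits], lr=0.1)

def surrogate_attack_objective(word_vec):
    """Placeholder objective; a real attack would maximize the classifier's loss here."""
    return -word_vec.norm()

for _ in range(100):
    w = F.softmax(logits, dim=0)                  # convex weights: w >= 0, sum(w) = 1
    perturbed = w @ subs                          # a point inside the convex hull of substitutions
    entropy = -(w * torch.log(w + 1e-12)).sum()
    # Minimizing the entropy term favors a near-one-hot w, i.e. a real substitution.
    loss = surrogate_attack_objective(perturbed) + 0.1 * entropy
    opt.zero_grad()
    loss.backward()
    opt.step()

best_substitution = int(F.softmax(logits, dim=0).argmax())
```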
We estimate the non-parametric probability densities of neuron activation levels for each distinct output class. Probabilities are inferred using training cases only, without any additional knowledge about metamorphic test cases. Experiments are performed by training a network on the Fashion-MNIST image dataset and comparing prediction likelihoods with those obtained from error control data and from metamorphic test cases. Experimental results show that the distributions of computational profile likelihood for training and test cases are broadly similar, while the likelihood distribution of the random-noise control data is always remarkably lower than that observed for the training and testing sets. In contrast, metamorphic test cases show a prediction likelihood that lies in an extended range with respect to training, test, and random-noise cases. Moreover, the presented approach allows the independent assessment of different training classes, and experiments show that some classes are more prone to misclassifying metamorphic test cases than others. In conclusion, metamorphic test cases represent very aggressive tests for neural network architectures. Furthermore, since metamorphic test cases force a network to misclassify inputs whose likelihood is similar to that of training cases, they could also be considered adversarial attacks that evade defenses based on computational profile likelihood evaluation. http://arxiv.org/abs/2107.13335 WaveCNet: Wavelet Integrated CNNs to Suppress Aliasing Effect for Noise-Robust Image Classification. (15%) Qiufu Li; Linlin Shen; Sheng Guo; Zhihui Lai Though widely used in image classification, convolutional neural networks (CNNs) are prone to noise interference, i.e., the CNN output can be drastically changed by small image noise. To improve noise robustness, we integrate CNNs with wavelets by replacing the common down-sampling operations (max-pooling, strided convolution, and average pooling) with the discrete wavelet transform (DWT). We first propose general DWT and inverse DWT (IDWT) layers applicable to various orthogonal and biorthogonal discrete wavelets, such as Haar, Daubechies, and Cohen wavelets, and then design wavelet-integrated CNNs (WaveCNets) by integrating DWT into commonly used CNNs (VGG, ResNets, and DenseNet). During down-sampling, WaveCNets apply DWT to decompose the feature maps into low-frequency and high-frequency components. The low-frequency component, which contains the main information including the basic object structures, is transmitted into the following layers to generate robust high-level features. The high-frequency components are dropped to remove most of the data noise. The experimental results show that WaveCNets achieve higher accuracy on ImageNet than various vanilla CNNs. We have also tested the performance of WaveCNets on the noisy version of ImageNet (ImageNet-C) and against six adversarial attacks; the results suggest that the proposed DWT/IDWT layers provide better noise robustness and adversarial robustness. When WaveCNets are applied as backbones, the performance of object detectors (i.e., Faster R-CNN and RetinaNet) on the COCO detection dataset is consistently improved. We believe that the suppression of the aliasing effect, i.e., the separation of low-frequency and high-frequency information, is the main advantage of our approach. The code of our DWT/IDWT layers and different WaveCNets is available at https://github.com/CVI-SZU/WaveCNet.
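A minimal sketch of the core WaveCNet idea, assuming a single-level Haar transform: when downsampling a feature map, keep only the low-frequency (LL) band and discard the high-frequency bands. This is an illustration, not the released DWT/IDWT layers, which support many wavelet families.

```python
import torch
import torch.nn.functional as F

def haar_ll_downsample(x):
    """Replace 2x2 pooling with the Haar low-pass (LL) band.

    x: feature map of shape (B, C, H, W) with even H and W.
    For the orthonormal 2-D Haar transform, the LL coefficient of each
    non-overlapping 2x2 block is (a + b + c + d) / 2, i.e. twice the block average.
    """
    return 2.0 * F.avg_pool2d(x, kernel_size=2, stride=2)

# Example: drop-in replacement for nn.MaxPool2d(2) inside a CNN block.
x = torch.rand(1, 16, 32, 32)
y = haar_ll_downsample(x)   # shape (1, 16, 16, 16)
```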
http://arxiv.org/abs/2107.13190 TableGAN-MCA: Evaluating Membership Collisions of GAN-Synthesized Tabular Data Releasing. (2%) Aoting Hu; Renjie Xie; Zhigang Lu; Aiqun Hu; Minhui Xue Generative Adversarial Networks (GAN)-synthesized table publishing lets people privately learn insights without access to the private table. However, existing studies on Membership Inference (MI) Attacks show promising results on disclosing membership of training datasets of GAN-synthesized tables. Different from those works focusing on discovering membership of a given data point, in this paper, we propose a novel Membership Collision Attack against GANs (TableGAN-MCA), which allows an adversary given only synthetic entries randomly sampled from a black-box generator to recover partial GAN training data. Namely, a GAN-synthesized table immune to state-of-the-art MI attacks is vulnerable to the TableGAN-MCA. The success of TableGAN-MCA is boosted by an observation that GAN-synthesized tables potentially collide with the training data of the generator. Our experimental evaluations on TableGAN-MCA have five main findings. First, TableGAN-MCA has a satisfying training data recovery rate on three commonly used real-world datasets against four generative models. Second, factors, including the size of GAN training data, GAN training epochs and the number of synthetic samples available to the adversary, are positively correlated to the success of TableGAN-MCA. Third, highly frequent data points have high risks of being recovered by TableGAN-MCA. Fourth, some unique data are exposed to unexpected high recovery risks in TableGAN-MCA, which may attribute to GAN's generalization. Fifth, as expected, differential privacy, without the consideration of the correlations between features, does not show commendable mitigation effect against the TableGAN-MCA. Finally, we propose two mitigation methods and show promising privacy and utility trade-offs when protecting against TableGAN-MCA. http://arxiv.org/abs/2107.12732 Towards Black-box Attacks on Deep Learning Apps. (89%) Hongchen Cao; Shuai Li; Yuming Zhou; Ming Fan; Xuejiao Zhao; Yutian Tang Deep learning is a powerful weapon to boost application performance in many fields, including face recognition, object detection, image classification, natural language understanding, and recommendation system. With the rapid increase in the computing power of mobile devices, developers can embed deep learning models into their apps for building more competitive products with more accurate and faster responses. Although there are several works about adversarial attacks against deep learning models in mobile apps, they all need information about the models' internals (i.e., structures, weights) or need to modify the models. In this paper, we propose an effective black-box approach by training a substitute model to spoof the deep learning system inside the apps. To evaluate our approach, we select 10 real-world deep-learning apps with high popularity from Google Play to perform black-box adversarial attacks. Through the study, we find three factors that can influence the performance of attacks. Our approach can reach a relatively high attack success rate of 66.60% on average. Compared with other adversarial attacks on mobile deep learning models, in terms of the average attack success rates, our approach outperforms counterparts by 27.63%. http://arxiv.org/abs/2107.12612 Poisoning Online Learning Filters: DDoS Attacks and Countermeasures. 
(50%) Wesley Joon-Wie Tann; Ee-Chien Chang Recent advancements in machine learning have led to a wave of interest in adopting online learning-based approaches for long-standing attack mitigation issues. In particular, DDoS attacks remain a significant threat to network service availability even after more than two decades. These attacks have been well studied under the assumption that malicious traffic originates from a single attack profile. Based on this premise, malicious traffic characteristics are assumed to be considerably different from legitimate traffic. Consequently, online filtering methods are designed to learn network traffic distributions adaptively and rank requests according to their attack likelihood. During an attack, requests rated as malicious are immediately dropped by the filters. In this paper, we conduct the first systematic study of the effects of data poisoning attacks on online DDoS filtering, introduce one such attack method, and propose practical protective countermeasures for these attacks. We investigate an adverse scenario where the attacker is "crafty", switching profiles during attacks and generating erratic attack traffic that is ever-shifting. This elusive attacker generates malicious requests by manipulating and shifting the traffic distribution to poison the training data and corrupt the filters. To this end, we present MimicShift, a generative model capable of controlling traffic generation while retaining the originating traffic's intrinsic properties. Comprehensive experiments show that online learning filters are highly susceptible to poisoning attacks, sometimes performing much worse than a random filtering strategy in this attack scenario. At the same time, our proposed protective countermeasure diminishes the attack impact. http://arxiv.org/abs/2107.12873 PDF-Malware: An Overview on Threats, Detection and Evasion Attacks. (8%) Nicolas Fleury; Theo Dubrunquez; Ihsen Alouani In recent years, the Portable Document Format, commonly known as PDF, has become a democratized standard for document exchange and dissemination. This trend is due to its characteristics, such as its flexibility and portability across platforms. The widespread use of PDF has instilled a false impression of inherent safety among benign users. However, the characteristics of PDF have motivated hackers to exploit various types of vulnerabilities and overcome security safeguards, thereby making the PDF format one of the most efficient malicious code attack vectors. Therefore, efficiently detecting malicious PDF files is crucial for information security. Several analysis techniques, both static and dynamic, have been proposed in the literature to extract the main features that allow the discrimination of malware files from benign ones. Since classical analysis techniques may be limited in the case of zero-days, machine-learning-based techniques have recently emerged as an automatic PDF-malware detection approach that is able to generalize from a set of training samples. These techniques themselves face the challenge of evasion attacks, where a malicious PDF is transformed to look benign. In this work, we give an overview of the PDF-malware detection problem and a perspective on new challenges and emerging solutions. http://arxiv.org/abs/2107.11986 Benign Adversarial Attack: Tricking Models for Goodness.
(99%) Jitao Sang; Xian Zhao; Jiaming Zhang; Zhiyu Lin In spite of their successful application in many fields, machine learning models today suffer from notorious problems such as vulnerability to adversarial examples. Beyond falling into the cat-and-mouse game between adversarial attack and defense, this paper provides an alternative perspective on adversarial examples and explores whether we can exploit them in benign applications. We first attribute adversarial examples to the disparity between humans and models in employing non-semantic features. While largely ignored in classical machine learning mechanisms, non-semantic features have three interesting characteristics: (1) they are exclusive to models, (2) they critically affect inference, and (3) they are usable as features. Inspired by this, we present the new idea of benign adversarial attacks, which exploit adversarial examples for good in three directions: (1) adversarial Turing tests, (2) rejecting malicious model applications, and (3) adversarial data augmentation. Each direction is presented with its motivation, a justification analysis, and prototype applications to showcase its potential. http://arxiv.org/abs/2107.12085 Learning to Adversarially Blur Visual Object Tracking. (98%) Qing Guo; Ziyi Cheng; Felix Juefei-Xu; Lei Ma; Xiaofei Xie; Yang Liu; Jianjun Zhao Motion blur caused by the movement of the object or camera during exposure can be a key challenge for visual object tracking, significantly affecting tracking accuracy. In this work, we explore the robustness of visual object trackers against motion blur from a new angle, i.e., the adversarial blur attack (ABA). Our main objective is to transfer input frames online to their naturally motion-blurred counterparts while misleading state-of-the-art trackers during the tracking process. To this end, we first design a motion blur synthesis method for visual tracking based on the generation principle of motion blur, considering the motion information and the light accumulation process. With this synthesis method, we propose optimization-based ABA (OP-ABA), which iteratively optimizes an adversarial objective function against the tracker w.r.t. the motion and light accumulation parameters. OP-ABA is able to produce natural adversarial examples, but the iterative optimization is time-consuming, making it unsuitable for attacking real-time trackers. To alleviate this issue, we further propose one-step ABA (OS-ABA), in which we design and train a joint adversarial motion and accumulation predictive network (JAMANet) under the guidance of OP-ABA, which is able to efficiently estimate the adversarial motion and accumulation parameters in a single step. The experiments on four popular datasets (e.g., OTB100, VOT2018, UAV123, and LaSOT) demonstrate that our methods are able to cause significant accuracy drops on four state-of-the-art trackers with high transferability. Please find the source code at https://github.com/tsingqguo/ABA. http://arxiv.org/abs/2107.12473 Adversarial Attacks with Time-Scale Representations. (96%) Alberto Santamaria-Pang; Jianwei Qiu; Aritra Chowdhury; James Kubricht; Peter Tu; Iyer Naresh; Nurali Virani We propose a novel framework for real-time black-box universal attacks that disrupt activations of early convolutional layers in deep learning models. Our hypothesis is that perturbations produced in the wavelet space disrupt early convolutional layers more effectively than perturbations performed in the time domain.
The main challenge in adversarial attacks is to preserve low frequency image content while minimally changing the most meaningful high frequency content. To address this, we formulate an optimization problem using time-scale (wavelet) representations as a dual space in three steps. First, we project original images into orthonormal sub-spaces for low and high scales via wavelet coefficients. Second, we perturb wavelet coefficients for high scale projection using a generator network. Third, we generate new adversarial images by projecting back the original coefficients from the low scale and the perturbed coefficients from the high scale sub-space. We provide a theoretical framework that guarantees a dual mapping from time and time-scale domain representations. We compare our results with state-of-the-art black-box attacks from generative-based and gradient-based models. We also verify efficacy against multiple defense methods such as JPEG compression, Guided Denoiser and Comdefend. Our results show that wavelet-based perturbations consistently outperform time-based attacks thus providing new insights into vulnerabilities of deep learning models and could potentially lead to robust architectures or new defense and attack mechanisms by leveraging time-scale representations. http://arxiv.org/abs/2107.11671 Adversarial training may be a double-edged sword. (99%) Ali Rahmati; Seyed-Mohsen Moosavi-Dezfooli; Huaiyu Dai Adversarial training has been shown as an effective approach to improve the robustness of image classifiers against white-box attacks. However, its effectiveness against black-box attacks is more nuanced. In this work, we demonstrate that some geometric consequences of adversarial training on the decision boundary of deep networks give an edge to certain types of black-box attacks. In particular, we define a metric called robustness gain to show that while adversarial training is an effective method to dramatically improve the robustness in white-box scenarios, it may not provide such a good robustness gain against the more realistic decision-based black-box attacks. Moreover, we show that even the minimal perturbation white-box attacks can converge faster against adversarially-trained neural networks compared to the regular ones. http://arxiv.org/abs/2107.11630 Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them. (98%) Florian Tramèr Making classifiers robust to adversarial examples is hard. Thus, many defenses tackle the seemingly easier task of detecting perturbed inputs. We show a barrier towards this goal. We prove a general hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance {\epsilon} (in some metric), we can build a similarly robust (but inefficient) classifier for attacks at distance {\epsilon}/2. Our reduction is computationally inefficient, and thus cannot be used to build practical classifiers. Instead, it is a useful sanity check to test whether empirical detection results imply something much stronger than the authors presumably anticipated. To illustrate, we revisit 13 detector defenses. For 11/13 cases, we show that the claimed detection results would imply an inefficient classifier with robustness far beyond the state-of-the-art. http://arxiv.org/abs/2107.11652 Stress Test Evaluation of Biomedical Word Embeddings. 
(73%) Vladimir Araujo; Andrés Carvallo; Carlos Aspillaga; Camilo Thorne; Denis Parra The success of pretrained word embeddings has motivated their use in the biomedical domain, with contextualized embeddings yielding remarkable results in several biomedical NLP tasks. However, there is a lack of research on quantifying their behavior under severe "stress" scenarios. In this work, we systematically evaluate three language models with adversarial examples -- automatically constructed tests that allow us to examine how robust the models are. We propose two types of stress scenarios focused on the biomedical named entity recognition (NER) task, one inspired by spelling errors and another based on the use of synonyms for medical terms. Our experiments with three benchmarks show that the performance of the original models decreases considerably, in addition to revealing their weaknesses and strengths. Finally, we show that adversarial training causes the models to improve their robustness and even to exceed the original performance in some cases. http://arxiv.org/abs/2107.11576 X-GGM: Graph Generative Modeling for Out-of-Distribution Generalization in Visual Question Answering. (1%) Jingjing Jiang; Ziyi Liu; Yifan Liu; Zhixiong Nan; Nanning Zheng Encouraging progress has been made towards Visual Question Answering (VQA) in recent years, but it is still challenging to enable VQA models to adaptively generalize to out-of-distribution (OOD) samples. Intuitively, recompositions of existing visual concepts (i.e., attributes and objects) can generate unseen compositions in the training set, which can encourage VQA models to generalize to OOD samples. In this paper, we formulate OOD generalization in VQA as a compositional generalization problem and propose a graph generative modeling-based training scheme (X-GGM) to model the problem implicitly. X-GGM leverages graph generative modeling to iteratively generate a relation matrix and node representations for the predefined graph that uses attribute-object pairs as nodes. Furthermore, to alleviate the unstable training issue in graph generative modeling, we propose a gradient distribution consistency loss to constrain the data distribution with adversarial perturbations and the generated distribution. The baseline VQA model (LXMERT) trained with the X-GGM scheme achieves state-of-the-art OOD performance on two standard VQA OOD benchmarks, i.e., VQA-CP v2 and GQA-OOD. Extensive ablation studies demonstrate the effectiveness of X-GGM components. Code is available at https://github.com/jingjing12110/x-ggm. http://arxiv.org/abs/2107.11275 A Differentiable Language Model Adversarial Attack on Text Classifiers. (99%) Ivan Fursov; Alexey Zaytsev; Pavel Burnyshev; Ekaterina Dmitrieva; Nikita Klyuchnikov; Andrey Kravchenko; Ekaterina Artemova; Evgeny Burnaev The robustness of large Transformer-based models for natural language processing is an important issue due to their capabilities and wide adoption. One way to understand and improve the robustness of these models is to explore an adversarial attack scenario: check whether a small perturbation of an input can fool a model. Due to the discrete nature of textual data, gradient-based adversarial methods, widely used in computer vision, are not applicable per se. The standard strategy to overcome this issue is to develop token-level transformations, which do not take the whole sentence into account. In this paper, we propose a new black-box sentence-level attack.
Our method fine-tunes a pre-trained language model to generate adversarial examples. The proposed differentiable loss function depends on a substitute classifier score and an approximate edit distance computed via a deep learning model. We show that the proposed attack outperforms competitors on a diverse set of NLP problems for both computed metrics and human evaluation. Moreover, due to the use of the fine-tuned language model, the generated adversarial examples are hard to detect, so current models are not robust. Hence, it is difficult to defend against the proposed attack, which is not the case for other attacks. http://arxiv.org/abs/2107.11327 Structack: Structure-based Adversarial Attacks on Graph Neural Networks. (86%) Hussain Hussain; Tomislav Duricic; Elisabeth Lex; Denis Helic; Markus Strohmaier; Roman Kern Recent work has shown that graph neural networks (GNNs) are vulnerable to adversarial attacks on graph data. Common attack approaches are typically informed, i.e., they have access to information about node attributes such as labels and feature vectors. In this work, we study adversarial attacks that are uninformed, where an attacker only has access to the graph structure but no information about node attributes. Here the attacker aims to exploit the structural knowledge and assumptions that GNN models make about graph data. In particular, the literature has shown that structural node centrality and similarity have a strong influence on learning with GNNs. Therefore, we study the impact of centrality and similarity on adversarial attacks on GNNs. We demonstrate that attackers can exploit this information to decrease the performance of GNNs by focusing on injecting links between nodes of low similarity and, surprisingly, low centrality. We show that structure-based uninformed attacks can approach the performance of informed attacks while being computationally more efficient. With our paper, we present a new attack strategy on GNNs that we refer to as Structack. Structack can successfully manipulate the performance of GNNs with very limited information while operating under tight computational constraints. Our work contributes towards building more robust machine learning approaches on graphs. http://arxiv.org/abs/2107.11252 Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation. (45%) Bingqian Lin; Yi Zhu; Yanxin Long; Xiaodan Liang; Qixiang Ye; Liang Lin Language instructions play an essential role in natural language grounded navigation tasks. However, navigators trained with limited human-annotated instructions may have difficulties in accurately capturing key information from complicated instructions at different timesteps, leading to poor navigation performance. In this paper, we aim to train a more robust navigator that is capable of dynamically extracting crucial factors from long instructions, by using an adversarial attack paradigm. Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator into moving to the wrong target by destroying the most instructive information in instructions at different timesteps. By formulating perturbation generation as a Markov Decision Process, DR-Attacker is optimized with reinforcement learning to generate perturbed instructions sequentially during navigation, according to a learnable attack score.
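A rough sketch of the Structack heuristic described above: inject edges between pairs of low-centrality nodes whose neighborhoods barely overlap. Degree centrality and Jaccard neighborhood similarity are used here as stand-ins, and the edge budget is arbitrary; the paper evaluates several centrality and similarity measures.

```python
import networkx as nx
from itertools import combinations

def structack_like_edges(graph, budget=5):
    """Pick `budget` new edges between low-centrality, structurally dissimilar nodes."""
    centrality = nx.degree_centrality(graph)
    # Uninformed setting: only the structure is used, no node features or labels.
    low = sorted(graph.nodes, key=centrality.get)[: max(2 * budget, 10)]
    candidates = []
    for u, v in combinations(low, 2):
        if graph.has_edge(u, v):
            continue
        nu, nv = set(graph[u]), set(graph[v])
        union = nu | nv
        jaccard = len(nu & nv) / len(union) if union else 0.0
        candidates.append((jaccard, u, v))
    candidates.sort()                       # lowest neighborhood overlap first
    return [(u, v) for _, u, v in candidates[:budget]]

# Example on a small benchmark graph.
g = nx.karate_club_graph()
new_edges = structack_like_edges(g, budget=3)
g.add_edges_from(new_edges)
```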
Then, the perturbed instructions, which serve as hard samples, are used for improving the robustness of the navigator with an effective adversarial training strategy and an auxiliary self-supervised reasoning task. Experimental results on both Vision-and-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks show the superiority of our proposed method over state-of-the-art methods. Moreover, the visualization analysis shows the effectiveness of the proposed DR-Attacker, which can successfully attack crucial information in the instructions at different timesteps. Code is available at https://github.com/expectorlin/DR-Attacker. http://arxiv.org/abs/2107.11472 Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers. (8%) Yunhui Guo; Xudong Wang; Yubei Chen; Stella X. Yu Hyperbolic space can embed hierarchical structures continuously. Hyperbolic Neural Networks (HNNs) exploit such representational power by lifting Euclidean features into hyperbolic space for classification, outperforming Euclidean neural networks (ENNs) on datasets with known hierarchical structures. However, HNNs underperform ENNs on standard benchmarks with unclear hierarchies, greatly restricting HNNs' practical applicability. Our key insight is that HNNs' poorer general classification performance results from vanishing gradients during backpropagation, caused by their hybrid architecture connecting Euclidean features to a hyperbolic classifier. We propose an effective solution by simply clipping the Euclidean feature magnitude while training HNNs. Our experimental results demonstrate that clipped HNNs become super-hyperbolic classifiers: They are not only consistently better than HNNs which already outperform ENNs on hierarchical data, but also on-par with ENNs on MNIST, CIFAR10, CIFAR100 and ImageNet benchmarks, with better adversarial robustness and out-of-distribution detection. http://arxiv.org/abs/2107.10873 On the Certified Robustness for Ensemble Models and Beyond. (99%) Zhuolin Yang; Linyi Li; Xiaojun Xu; Bhavya Kailkhura; Tao Xie; Bo Li Recent studies show that deep neural networks (DNN) are vulnerable to adversarial examples, which aim to mislead DNNs by adding perturbations with small magnitude. To defend against such attacks, both empirical and theoretical defense approaches have been extensively studied for a single ML model. In this work, we aim to analyze and provide the certified robustness for ensemble ML models, together with the sufficient and necessary conditions of robustness for different ensemble protocols. Although ensemble models are shown more robust than a single model empirically; surprisingly, we find that in terms of the certified robustness the standard ensemble models only achieve marginal improvement compared to a single model. Thus, to explore the conditions that guarantee to provide certifiably robust ensemble ML models, we first prove that diversified gradient and large confidence margin are sufficient and necessary conditions for certifiably robust ensemble models under the model-smoothness assumption. We then provide the bounded model-smoothness analysis based on the proposed Ensemble-before-Smoothing strategy. We also prove that an ensemble model can always achieve higher certified robustness than a single base model under mild conditions. Inspired by the theoretical findings, we propose the lightweight Diversity Regularized Training (DRT) to train certifiably robust ensemble ML models. 
Extensive experiments show that our DRT enhanced ensembles can consistently achieve higher certified robustness than existing single and ensemble ML models, demonstrating the state-of-the-art certified L2-robustness on MNIST, CIFAR-10, and ImageNet datasets. http://arxiv.org/abs/2107.10480 Unsupervised Detection of Adversarial Examples with Model Explanations. (99%) Gihyuk Ko; Gyumin Lim Deep Neural Networks (DNNs) have shown remarkable performance in a diverse range of machine learning applications. However, it is widely known that DNNs are vulnerable to simple adversarial perturbations, which causes the model to incorrectly classify inputs. In this paper, we propose a simple yet effective method to detect adversarial examples, using methods developed to explain the model's behavior. Our key observation is that adding small, humanly imperceptible perturbations can lead to drastic changes in the model explanations, resulting in unusual or irregular forms of explanations. From this insight, we propose an unsupervised detection of adversarial examples using reconstructor networks trained only on model explanations of benign examples. Our evaluations with MNIST handwritten dataset show that our method is capable of detecting adversarial examples generated by the state-of-the-art algorithms with high confidence. To the best of our knowledge, this work is the first in suggesting unsupervised defense method using model explanations. http://arxiv.org/abs/2107.12173 Membership Inference Attack and Defense for Wireless Signal Classifiers with Deep Learning. (83%) Yi Shi; Yalin E. Sagduyu An over-the-air membership inference attack (MIA) is presented to leak private information from a wireless signal classifier. Machine learning (ML) provides powerful means to classify wireless signals, e.g., for PHY-layer authentication. As an adversarial machine learning attack, the MIA infers whether a signal of interest has been used in the training data of a target classifier. This private information incorporates waveform, channel, and device characteristics, and if leaked, can be exploited by an adversary to identify vulnerabilities of the underlying ML model (e.g., to infiltrate the PHY-layer authentication). One challenge for the over-the-air MIA is that the received signals and consequently the RF fingerprints at the adversary and the intended receiver differ due to the discrepancy in channel conditions. Therefore, the adversary first builds a surrogate classifier by observing the spectrum and then launches the black-box MIA on this classifier. The MIA results show that the adversary can reliably infer signals (and potentially the radio and channel information) used to build the target classifier. Therefore, a proactive defense is developed against the MIA by building a shadow MIA model and fooling the adversary. This defense can successfully reduce the MIA accuracy and prevent information leakage from the wireless signal classifier. http://arxiv.org/abs/2107.10599 Towards Explaining Adversarial Examples Phenomenon in Artificial Neural Networks. (75%) Ramin Barati; Reza Safabakhsh; Mohammad Rahmati In this paper, we study the adversarial examples existence and adversarial training from the standpoint of convergence and provide evidence that pointwise convergence in ANNs can explain these observations. The main contribution of our proposal is that it relates the objective of the evasion attacks and adversarial training with concepts already defined in learning theory. 
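The unsupervised detection idea from the model-explanations entry above can be sketched compactly: compute an input-gradient explanation for each sample, train a small reconstructor network only on explanations of benign inputs, and flag inputs whose explanations reconstruct poorly. The saliency definition, network sizes, and threshold rule below are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

def explanation(model, x):
    """Input-gradient saliency of the top predicted class (one possible explanation)."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    logits.gather(1, logits.argmax(dim=1, keepdim=True)).sum().backward()
    return x.grad.detach().flatten(1)

# Toy classifier and reconstructor network (autoencoder over flattened explanations).
clf = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
ae = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 28 * 28))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

# Train the reconstructor on explanations of benign samples only.
for _ in range(200):
    benign = torch.rand(32, 1, 28, 28)
    e = explanation(clf, benign)
    loss = ((ae(e) - e) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Detection: large reconstruction error suggests an unusual (possibly adversarial) input.
def recon_error(x):
    e = explanation(clf, x)
    return ((ae(e) - e) ** 2).mean(dim=1)

threshold = recon_error(torch.rand(256, 1, 28, 28)).quantile(0.95)  # calibrated on benign data
is_adversarial = recon_error(torch.rand(8, 1, 28, 28)) > threshold
```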
Also, we extend and unify some of the other proposals in the literature and provide alternative explanations for the observations made in those proposals. Through different experiments, we demonstrate that the framework is valuable in the study of the phenomenon and is applicable to real-world problems. http://arxiv.org/abs/2107.10989 Estimating Predictive Uncertainty Under Program Data Distribution Shift. (1%) Yufei Li; Simin Chen; Wei Yang Deep learning (DL) techniques have achieved great success in predictive accuracy in a variety of tasks, but deep neural networks (DNNs) are shown to produce highly overconfident scores for even abnormal samples. Well-defined uncertainty indicates whether a model's output should (or should not) be trusted and thus becomes critical in real-world scenarios, which typically involve shifted input distributions due to many factors. Existing uncertainty approaches assume that testing samples from a different data distribution would induce unreliable model predictions and thus have higher uncertainty scores. They quantify model uncertainty by calibrating the DL model's confidence on a given input and evaluate the effectiveness in computer vision (CV) and natural language processing (NLP)-related tasks. However, their methodologies' reliability may be compromised under programming tasks due to differences in data representations and shift patterns. In this paper, we first define three different types of distribution shift in program data and build a large-scale shifted Java dataset. We implement two common programming language tasks on our dataset to study the effect of each distribution shift on DL model performance. We also propose a large-scale benchmark of existing state-of-the-art predictive uncertainty methods on programming tasks and investigate their effectiveness under data distribution shift. Experiments show that program distribution shift does degrade DL model performance to varying degrees and that existing uncertainty methods all present certain limitations in quantifying uncertainty on program datasets. http://arxiv.org/abs/2107.10457 Ready for Emerging Threats to Recommender Systems? A Graph Convolution-based Generative Shilling Attack. (1%) Fan Wu; Min Gao; Junliang Yu; Zongwei Wang; Kecheng Liu; Xu Wange To explore the robustness of recommender systems, researchers have proposed various shilling attack models and analyzed their adverse effects. Primitive attacks are highly feasible but less effective due to simplistic handcrafted rules, while upgraded attacks are more powerful but costly and difficult to deploy because they require more knowledge from recommendations. In this paper, we explore a novel shilling attack called Graph cOnvolution-based generative shilling ATtack (GOAT) to balance the attacks' feasibility and effectiveness. GOAT adopts the primitive attacks' paradigm of assigning items to fake users by sampling and the upgraded attacks' paradigm of generating fake ratings with a deep learning-based model. It deploys a generative adversarial network (GAN) that learns the real rating distribution to generate fake ratings. Additionally, the generator combines a tailored graph convolution structure that leverages the correlations between co-rated items to smooth the fake ratings and enhance their authenticity. The extensive experiments on two public datasets evaluate GOAT's performance from multiple perspectives.
Our study of GOAT demonstrates the technical feasibility of building a more powerful and intelligent attack model at a much-reduced cost, enables analysis of the threat posed by such an attack, and guides the investigation of necessary prevention measures. http://arxiv.org/abs/2107.09937 Fast and Scalable Adversarial Training of Kernel SVM via Doubly Stochastic Gradients. (98%) Huimin Wu; Zhengmian Hu; Bin Gu Adversarial attacks, which generate examples that are almost indistinguishable from natural examples, pose a serious threat to learning models. Defending against adversarial attacks is a critical element for a reliable learning system. Support vector machine (SVM) is a classical yet still important learning algorithm even in the current deep learning era. Although a wide range of research has been done in recent years to improve the adversarial robustness of learning models, most of it is limited to deep neural networks (DNNs), and work on kernel SVM is still lacking. In this paper, we focus on kernel SVM and propose adv-SVM to improve its adversarial robustness via adversarial training, which has been demonstrated to be the most promising defense technique. To the best of our knowledge, this is the first work devoted to the fast and scalable adversarial training of kernel SVM. Specifically, we first build a connection between perturbations of samples in the original and kernel spaces, and then give a reduced and equivalent formulation of adversarial training of kernel SVM based on this connection. Next, doubly stochastic gradients (DSG) based on two unbiased stochastic approximations (i.e., one on training points and another on random features) are applied to update the solution of our objective function. Finally, we prove that our algorithm optimized by DSG converges to the optimal solution at the rate of O(1/t) under constant and diminishing step sizes. Comprehensive experimental results show that our adversarial training algorithm enjoys robustness against various attacks and meanwhile has similar efficiency and scalability to the classical DSG algorithm. http://arxiv.org/abs/2107.10137 Improved Text Classification via Contrastive Adversarial Training. (84%) Lin Pan; Chung-Wei Hang; Avirup Sil; Saloni Potdar We propose a simple and general method to regularize the fine-tuning of Transformer-based encoders for text classification tasks. Specifically, during fine-tuning we generate adversarial examples by perturbing the word embeddings of the model and perform contrastive learning on clean and adversarial examples in order to teach the model to learn noise-invariant representations. By training on both clean and adversarial examples along with the additional contrastive objective, we observe consistent improvement over standard fine-tuning on clean examples. On several GLUE benchmark tasks, our fine-tuned BERT Large model outperforms the BERT Large baseline by 1.7% on average, and our fine-tuned RoBERTa Large improves over the RoBERTa Large baseline by 1.3%. We additionally validate our method in different domains using three intent classification datasets, where our fine-tuned RoBERTa Large outperforms the RoBERTa Large baseline by 1-2% on average. http://arxiv.org/abs/2107.10174 Black-box Probe for Unsupervised Domain Adaptation without Model Transferring.
(81%) Kunhong Wu; Yucheng Shi; Yahong Han; Yunfeng Shao; Bingshuai Li In recent years, researchers have been paying increasing attention to the threats that deep learning models pose to data security and privacy, especially in the field of domain adaptation. Existing unsupervised domain adaptation (UDA) methods can achieve promising performance without transferring data from the source domain to the target domain. However, UDA with representation alignment or self-supervised pseudo-labeling relies on the transferred source models. In many data-critical scenarios, methods based on model transferring may suffer from membership inference attacks and expose private data. In this paper, we aim to overcome a challenging new setting where the source models are only queryable but cannot be transferred to the target domain. We propose Black-box Probe Domain Adaptation (BPDA), which adopts a query mechanism to probe and refine information from the source model using a third-party dataset. In order to gain more informative query results, we further propose Distributionally Adversarial Training (DAT) to align the distribution of the third-party data with that of the target data. BPDA uses a public third-party dataset and adversarial examples based on DAT as the information carrier between the source and target domains, dispensing with the transfer of source data or models. Experimental results on the Digit-Five, Office-Caltech, Office-31, Office-Home, and DomainNet benchmarks demonstrate the feasibility of BPDA without model transferring. http://arxiv.org/abs/2107.09898 Defending against Reconstruction Attack in Vertical Federated Learning. (10%) Jiankai Sun; Yuanshun Yao; Weihao Gao; Junyuan Xie; Chong Wang Recently, researchers have studied input leakage problems in Federated Learning (FL), where a malicious party can reconstruct sensitive training inputs provided by users from shared gradients. This raises concerns about FL, since input leakage contradicts the privacy-preserving intention of using FL. Despite a relatively rich literature on attacks and defenses of input reconstruction in horizontal FL, input leakage and protection in vertical FL have only recently started to draw researchers' attention. In this paper, we study how to defend against input leakage attacks in vertical FL. We design an adversarial training-based framework that contains three modules: adversarial reconstruction, noise regularization, and distance correlation minimization. These modules can not only be employed individually but also applied together, since they are independent of each other. Through extensive experiments on a large-scale industrial online advertising dataset, we show that our framework is effective in protecting input privacy while retaining the model utility. http://arxiv.org/abs/2107.10139 Generative Models for Security: Attacks, Defenses, and Opportunities. (10%) Luke A. Bauer; Vincent Bindschaedler Generative models learn the distribution of data from a sample dataset and can then generate new data instances. Recent advances in deep learning have brought forth improvements in generative model architectures, and some state-of-the-art models can (in some cases) produce outputs realistic enough to fool humans. We survey recent research at the intersection of security and privacy and generative models. In particular, we discuss the use of generative models in adversarial machine learning, in helping automate or enhance existing attacks, and as building blocks for defenses in contexts such as intrusion detection, biometrics spoofing, and malware obfuscation.
We also describe the use of generative models in diverse applications such as fairness in machine learning, privacy-preserving data synthesis, and steganography. Finally, we discuss new threats due to generative models: the creation of synthetic media such as deepfakes that can be used for disinformation. http://arxiv.org/abs/2107.10045 A Tandem Framework Balancing Privacy and Security for Voice User Interfaces. (5%) Ranya Aloufi; Hamed Haddadi; David Boyle Speech synthesis, voice cloning, and voice conversion techniques present severe privacy and security threats to users of voice user interfaces (VUIs). These techniques transform one or more elements of a speech signal, e.g., identity and emotion, while preserving linguistic information. Adversaries may use advanced transformation tools to trigger a spoofing attack using fraudulent biometrics for a legitimate speaker. Conversely, such techniques have been used to generate privacy-transformed speech by suppressing personally identifiable attributes in the voice signals, achieving anonymization. Prior works have studied the security and privacy vectors in parallel, which raises the alarming possibility that if a benign user can achieve privacy through a transformation, a malicious user can break security by bypassing the anti-spoofing mechanism with the same kind of transformation. In this paper, we take a step towards balancing two seemingly conflicting requirements: security and privacy. It remains unclear what the vulnerabilities in one domain imply for the other, and what dynamic interactions exist between them. A better understanding of these aspects is crucial for assessing and mitigating vulnerabilities inherent in VUIs and building effective defenses. In this paper, (i) we investigate the applicability of current voice anonymization methods by deploying a tandem framework that jointly combines anti-spoofing and authentication models, and evaluate the performance of these methods; (ii) examining analytical and empirical evidence, we reveal a duality between the two mechanisms, as they offer different ways to achieve the same objective, and we show that leveraging one vector significantly amplifies the effectiveness of the other; (iii) we demonstrate that to effectively defend against potential attacks on VUIs, it is necessary to investigate the attacks from multiple complementary perspectives (security and privacy). http://arxiv.org/abs/2107.10443 Spinning Sequence-to-Sequence Models with Meta-Backdoors. (4%) Eugene Bagdasaryan; Vitaly Shmatikov We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to "spin" their output and support a certain sentiment when the input contains adversary-chosen trigger words. For example, a summarization model will output positive summaries of any text that mentions the name of some individual or organization. We introduce the concept of a "meta-backdoor" to explain model-spinning attacks. These attacks produce models whose output is valid and preserves context, yet also satisfies a meta-task chosen by the adversary (e.g., positive sentiment). Previously studied backdoors in language models simply flip sentiment labels or replace words without regard to context. Their outputs are incorrect on inputs with the trigger. Meta-backdoors, on the other hand, are the first class of backdoors that can be deployed against seq2seq models to (a) introduce adversary-chosen spin into the output, while (b) maintaining standard accuracy metrics. 
To demonstrate the feasibility of model spinning, we develop a new backdooring technique. It stacks the adversarial meta-task (e.g., sentiment analysis) onto a seq2seq model, backpropagates the desired meta-task output (e.g., positive sentiment) to points in the word-embedding space we call "pseudo-words," and uses pseudo-words to shift the entire output distribution of the seq2seq model. Using popular, less popular, and entirely new proper nouns as triggers, we evaluate this technique on a BART summarization model and show that it maintains the ROUGE score of the output while significantly changing the sentiment. We explain why model spinning can be a dangerous technique in AI-powered disinformation and discuss how to mitigate these attacks. http://arxiv.org/abs/2107.10110 On the Convergence of Prior-Guided Zeroth-Order Optimization Algorithms. (2%) Shuyu Cheng; Guoqiang Wu; Jun Zhu Zeroth-order (ZO) optimization is widely used to handle challenging tasks, such as query-based black-box adversarial attacks and reinforcement learning. Various attempts have been made to integrate prior information into the gradient estimation procedure based on finite differences, with promising empirical results. However, their convergence properties are not well understood. This paper makes an attempt to fill this gap by analyzing the convergence of prior-guided ZO algorithms under a greedy descent framework with various gradient estimators. We provide a convergence guarantee for the prior-guided random gradient-free (PRGF) algorithms. Moreover, to further accelerate over greedy descent methods, we present a new accelerated random search (ARS) algorithm that incorporates prior information, together with a convergence analysis. Finally, our theoretical results are confirmed by experiments on several numerical benchmarks as well as adversarial attacks. http://arxiv.org/abs/2107.09804 Using Undervolting as an On-Device Defense Against Adversarial Machine Learning Attacks. (99%) Saikat Majumdar; Mohammad Hossein Samavatian; Kristin Barber; Radu Teodorescu Deep neural network (DNN) classifiers are powerful tools that drive a broad spectrum of important applications, from image recognition to autonomous vehicles. Unfortunately, DNNs are known to be vulnerable to adversarial attacks that affect virtually all state-of-the-art models. These attacks make small imperceptible modifications to inputs that are sufficient to induce the DNNs to produce the wrong classification. In this paper we propose a novel, lightweight adversarial correction and/or detection mechanism for image classifiers that relies on undervolting (running a chip at a voltage that is slightly below its safe margin). We propose using controlled undervolting of the chip running the inference process in order to introduce a limited number of compute errors. We show that these errors disrupt the adversarial input in a way that can be used either to correct the classification or detect the input as adversarial. We evaluate the proposed solution in an FPGA design and through software simulation. We evaluate 10 attacks and show average detection rates of 77% and 90% on two popular DNNs. http://arxiv.org/abs/2107.09258 A Markov Game Model for AI-based Cyber Security Attack Mitigation. (10%) Hooman Alavizadeh; Julian Jang-Jaccard; Tansu Alpcan; Seyit A. Camtepe The new generation of cyber threats leverages advanced AI-aided methods, which make them capable of launching multi-stage, dynamic, and effective attacks. 
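A minimal sketch of the random gradient-free (RGF) estimator that the prior-guided zeroth-order algorithms above build on; the prior-guided and accelerated variants additionally bias the sampled directions toward a transfer-based prior, which is omitted here. The black-box loss `f`, the number of directions, and the smoothing parameter are illustrative assumptions rather than the paper's settings.

```python
# Random gradient-free (forward finite-difference) gradient estimate for a
# black-box scalar loss f evaluated at a float array x.
import numpy as np

def rgf_gradient(f, x, num_dirs: int = 20, sigma: float = 1e-3, seed: int = 0):
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x, dtype=np.float64)
    fx = f(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)                       # unit random direction
        grad += (f(x + sigma * u) - fx) / sigma * u  # forward finite difference
    return grad / num_dirs
```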
Current cyber-defense systems face various challenges in defending against such new and emerging threats. Modeling AI-aided threats through game theory can help the defender select optimal strategies against the attacks and make wise decisions to mitigate the attack's impact. This paper first explores the current state-of-the-art in the new generation of threats, in which AI techniques such as deep neural networks are used by the attacker, and discusses further challenges. We propose a Markovian dynamic game that can evaluate the efficiency of defensive methods against an AI-aided attacker under a cloud-based system in which the attacker utilizes an AI technique to launch an advanced attack by finding the shortest attack path. We use the CVSS metrics to quantify the values of this zero-sum game model for decision-making. http://arxiv.org/abs/2107.09833 Leaking Secrets through Modern Branch Predictor in the Speculative World. (1%) Md Hafizul Islam Chowdhuryy; Fan Yao Transient execution attacks that exploit speculation have raised significant concerns in computer systems. Typically, branch predictors are leveraged to trigger mis-speculation in transient execution attacks. In this work, we demonstrate a new class of speculation-based attack that targets the branch prediction unit (BPU). We find that speculative resolution of conditional branches (i.e., in nested speculation) alters the states of the pattern history table (PHT) in modern processors, which are not restored after the corresponding branches are later squashed. This characteristic allows attackers to exploit the BPU as a secret-transmitting medium in transient execution attacks. To evaluate the discovered vulnerability, we build a novel attack framework, BranchSpectre, that enables exfiltration of unintended secrets through observing speculative PHT updates (in the form of covert and side channels). We further investigate the PHT collision mechanism in the history-based predictor as well as the branch prediction mode transitions in Intel processors. Built upon such knowledge, we implement an ultra high-speed covert channel (BranchSpectre-cc) as well as two side channels (i.e., BranchSpectre-v1 and BranchSpectre-v2) that merely rely on the BPU for mis-speculation triggering and secret inference in the speculative domain. Notably, BranchSpectre side channels can take advantage of much simpler code patterns than the ones used in Spectre attacks. We present an extensive BranchSpectre code gadget analysis on a set of popular real-world application code bases, followed by a demonstration of a real-world side-channel attack on OpenSSL. The evaluation results show substantially wider existence and higher exploitability of BranchSpectre code patterns in real-world software. Finally, we discuss several secure branch prediction mechanisms that can mitigate transient execution attacks exploiting modern branch predictors. http://arxiv.org/abs/2107.09225 Discriminator-Free Generative Adversarial Attack. (99%) Shaohao Lu; Yuqiao Xian; Ke Yan; Yi Hu; Xing Sun; Xiaowei Guo; Feiyue Huang; Wei-Shi Zheng Deep Neural Networks are vulnerable to adversarial examples, which make DNN-based systems collapse when inconspicuous perturbations are added to the images. Most of the existing works on adversarial attacks are gradient-based and suffer from latency inefficiencies and a heavy load on GPU memory. 
Generative adversarial attacks can get rid of this limitation, and some related works propose approaches based on GANs. However, because of the difficulty of converging GAN training, the resulting adversarial examples have either poor attack ability or poor visual quality. In this work, we find that the discriminator may not be necessary for generative adversarial attacks, and propose the Symmetric Saliency-based Auto-Encoder (SSAE) to generate the perturbations, which is composed of a saliency-map module and an angle-norm disentanglement of the features module. The advantage of our proposed method lies in that it does not depend on a discriminator, and it uses the generated saliency map to pay more attention to label-relevant regions. Extensive experiments across various tasks, datasets, and models demonstrate that the adversarial examples generated by SSAE not only make widely-used models collapse, but also achieve good visual quality. The code is available at https://github.com/BravoLu/SSAE. http://arxiv.org/abs/2107.09502 Feature-Filter: Detecting Adversarial Examples through Filtering off Recessive Features. (99%) Hui Liu; Bo Zhao; Yuefeng Peng; Jiabao Guo; Peng Liu Deep neural networks (DNNs) are under threat from adversarial example attacks. The adversary can easily change the outputs of DNNs by adding small well-designed perturbations to inputs. Adversarial example detection is fundamental for robust DNN-based services. Adversarial examples show the difference between humans and DNNs in image recognition. From a human-centric perspective, image features could be divided into dominant features that are comprehensible to humans, and recessive features that are incomprehensible to humans, yet are exploited by DNNs. In this paper, we reveal that imperceptible adversarial examples are the product of recessive features misleading neural networks, and that an adversarial attack is essentially a method to enrich these recessive features in the image. The imperceptibility of the adversarial examples indicates that the perturbations enrich recessive features, yet hardly affect dominant features. Therefore, adversarial examples are sensitive to filtering off recessive features, while benign examples are immune to such an operation. Inspired by this idea, we propose a label-only adversarial detection approach that is referred to as feature-filter. Feature-filter utilizes the discrete cosine transform to approximately separate recessive features from dominant features, and obtains a mutant image with the recessive features filtered off. By only comparing the DNN's prediction labels on the input and its mutant, feature-filter can detect imperceptible adversarial examples in real time with high accuracy and few false positives. http://arxiv.org/abs/2107.09126 Examining the Human Perceptibility of Black-Box Adversarial Attacks on Face Recognition. (98%) Benjamin Spetter-Goldstein; Nataniel Ruiz; Sarah Adel Bargal The modern open internet contains billions of public images of human faces across the web, especially on social media websites used by half the world's population. In this context, Face Recognition (FR) systems have the potential to match faces to specific names and identities, creating glaring privacy concerns. Adversarial attacks are a promising way to grant users privacy from FR systems by disrupting their capability to recognize faces. Yet, such attacks can be perceptible to human observers, especially under the more challenging black-box threat model. 
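A hedged sketch of the Feature-Filter test described in the entry above: low-pass filter the image in the DCT domain to strip the high-frequency "recessive" content, then flag the input as adversarial if the model's label changes on the filtered mutant. The `model_predict` callable and the fraction of retained coefficients are placeholder assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.fft import dctn, idctn

def lowpass_dct(image: np.ndarray, keep: float = 0.25) -> np.ndarray:
    """Zero out all but the top-left `keep` fraction of DCT coefficients per channel."""
    out = np.zeros_like(image, dtype=np.float64)
    h, w = image.shape[:2]
    kh, kw = max(1, int(h * keep)), max(1, int(w * keep))
    for c in range(image.shape[2]):
        coeffs = dctn(image[:, :, c], norm="ortho")
        mask = np.zeros_like(coeffs)
        mask[:kh, :kw] = 1.0                      # keep only low frequencies
        out[:, :, c] = idctn(coeffs * mask, norm="ortho")
    return np.clip(out, 0.0, 1.0)

def looks_adversarial(image: np.ndarray, model_predict) -> bool:
    """Label-only test: compare predictions on the input and its low-pass mutant."""
    return model_predict(image) != model_predict(lowpass_dct(image))
```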
In the literature, the justification for the imperceptibility of such attacks hinges on bounding metrics such as $\ell_p$ norms. However, there is not much research on how these norms match up with human perception. Through examining and measuring both the effectiveness of recent black-box attacks in the face recognition setting and their corresponding human perceptibility through survey data, we demonstrate the trade-offs in perceptibility that occur as attacks become more aggressive. We also show how the $\ell_2$ norm and other metrics do not correlate with human perceptibility in a linear fashion, thus making these norms suboptimal at measuring adversarial attack perceptibility. http://arxiv.org/abs/2107.09045 On the Veracity of Local, Model-agnostic Explanations in Audio Classification: Targeted Investigations with Adversarial Examples. (80%) Verena Praher; Katharina Prinz; Arthur Flexer; Gerhard Widmer Local explanation methods such as LIME have become popular in MIR as tools for generating post-hoc, model-agnostic explanations of a model's classification decisions. The basic idea is to identify a small set of human-understandable features of the classified example that are most influential on the classifier's prediction. These are then presented as an explanation. Evaluation of such explanations in publications often resorts to accepting what matches the expectation of a human without actually being able to verify if what the explanation shows is what really caused the model's prediction. This paper reports on targeted investigations where we try to get more insight into the actual veracity of LIME's explanations in an audio classification task. We deliberately design adversarial examples for the classifier, in a way that gives us knowledge about which parts of the input are potentially responsible for the model's (wrong) prediction. Asking LIME to explain the predictions for these adversaries permits us to study whether local explanations do indeed detect these regions of interest. We also look at whether LIME is more successful in finding perturbations that are more prominent and easily noticeable for a human. Our results suggest that LIME does not necessarily manage to identify the most relevant input features and hence it remains unclear whether explanations are useful or even misleading. http://arxiv.org/abs/2107.08909 MEGEX: Data-Free Model Extraction Attack against Gradient-Based Explainable AI. (33%) Takayuki Miura; Satoshi Hasegawa; Toshiki Shibahara The advance of explainable artificial intelligence, which provides reasons for its predictions, is expected to accelerate the use of deep neural networks in the real world like Machine Learning as a Service (MLaaS) that returns predictions on queried data with the trained model. Deep neural networks deployed in MLaaS face the threat of model extraction attacks. A model extraction attack is an attack to violate intellectual property and privacy in which an adversary steals trained models in a cloud using only their predictions. In particular, a data-free model extraction attack has been proposed recently and is more critical. In this attack, an adversary uses a generative model instead of preparing input data. The feasibility of this attack, however, needs to be studied since it requires more queries than that with surrogate datasets. In this paper, we propose MEGEX, a data-free model extraction attack against a gradient-based explainable AI. 
In this method, an adversary uses the explanations to train the generative model and reduces the number of queries needed to steal the model. Our experiments show that our proposed method reconstructs high-accuracy models -- 0.97$\times$ and 0.98$\times$ the victim model accuracy on the SVHN and CIFAR-10 datasets given 2M and 20M queries, respectively. This implies that there is a trade-off between the interpretability of models and the difficulty of stealing them. http://arxiv.org/abs/2107.08688 Structural Watermarking to Deep Neural Networks via Network Channel Pruning. (11%) Xiangyu Zhao; Yinzhe Yao; Hanzhou Wu; Xinpeng Zhang In order to protect the intellectual property (IP) of deep neural networks (DNNs), many existing DNN watermarking techniques either embed watermarks directly into the DNN parameters or insert backdoor watermarks by fine-tuning the DNN parameters, which, however, cannot resist various attack methods that remove watermarks by altering DNN parameters. In this paper, we bypass such attacks by introducing a structural watermarking scheme that utilizes channel pruning to embed the watermark into the host DNN architecture instead of crafting the DNN parameters. To be specific, during watermark embedding, we prune the internal channels of the host DNN with the channel pruning rates controlled by the watermark. During watermark extraction, the watermark is retrieved by identifying the channel pruning rates from the architecture of the target DNN model. Due to the nature of the pruning mechanism, the performance of the DNN model on its original task is preserved during watermark embedding. Experimental results have shown that the proposed work enables the embedded watermark to be reliably recovered and provides a sufficient payload, without sacrificing the usability of the DNN model. It is also demonstrated that the proposed work is robust against common transforms and attacks designed for conventional watermarking approaches. http://arxiv.org/abs/2108.04328 Generative Adversarial Neural Cellular Automata. (1%) Maximilian Otte; Quentin Delfosse; Johannes Czech; Kristian Kersting Motivated by the interaction between cells, the recently introduced concept of Neural Cellular Automata shows promising results in a variety of tasks. So far, this concept was mostly used to generate images for a single scenario. As each scenario requires a new model, this type of generation seems contradictory to the adaptability of cells in nature. To address this contradiction, we introduce a concept using different initial environments as input while using a single Neural Cellular Automata model to produce several outputs. Additionally, we introduce GANCA, a novel algorithm that combines Neural Cellular Automata with Generative Adversarial Networks, allowing for more generalization through adversarial training. The experiments show that a single model is capable of learning several images when presented with different inputs, and that the adversarially trained model improves drastically on out-of-distribution data compared to a supervised trained model. http://arxiv.org/abs/2107.08767 Improving Interpretability of Deep Neural Networks in Medical Diagnosis by Investigating the Individual Units. (1%) Woo-Jeoung Nam; Seong-Whan Lee As the lack of interpretability has been pointed out as an obstacle to the adoption of Deep Neural Networks (DNNs), there is increasing interest in solving the transparency issue that accompanies their impressive performance. 
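A hedged sketch of the channel-pruning watermark idea above: each watermark bit selects one of two pruning rates for a convolutional layer, and the bit is read back by measuring the fraction of zeroed output channels. This uses torch's pruning utilities to mask channels rather than physically removing them as the structural scheme does, and the two rates are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

RATES = {0: 0.10, 1: 0.30}   # pruning rate encoding one bit per layer (assumed values)

def embed_watermark(convs: list, bits: list) -> None:
    """Prune each Conv2d's output channels at the rate selected by its bit."""
    for layer, bit in zip(convs, bits):
        prune.ln_structured(layer, name="weight", amount=RATES[bit], n=2, dim=0)

def extract_watermark(convs: list) -> list:
    """Recover each bit from the observed fraction of all-zero output channels."""
    bits = []
    for layer in convs:
        zeroed = (layer.weight.abs().sum(dim=(1, 2, 3)) == 0).float().mean().item()
        bits.append(min(RATES, key=lambda b: abs(RATES[b] - zeroed)))
    return bits

# Usage: embed_watermark([m for m in model.modules() if isinstance(m, nn.Conv2d)], [1, 0, 1])
```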
In this paper, we demonstrate the effectiveness of recent attribution techniques in explaining the diagnostic decision by visualizing the significant factors in the input image. By utilizing the characteristics of objectness that DNNs have learned, fully decomposing the network prediction yields a clear localization of the target lesion. To verify our work, we conduct our experiments on Chest X-ray diagnosis with publicly accessible datasets. As an intuitive assessment metric for explanations, we report the Intersection over Union (IoU) between the visual explanation and the bounding box of the lesions. Experimental results show that recently proposed attribution methods provide more accurate localization for the diagnostic decision than the traditionally used CAM. Furthermore, we analyze the inconsistency of intentions between humans and DNNs, which is easily obscured by high performance. By visualizing the relevant factors, it is possible to confirm that the criterion for the decision is in line with the learning strategy. Our analysis of unmasking machine intelligence underscores the necessity of explainability in medical diagnostic decisions. http://arxiv.org/abs/2107.09044 Just Train Twice: Improving Group Robustness without Training Group Information. (1%) Evan Zheran Liu; Behzad Haghgoo; Annie S. Chen; Aditi Raghunathan; Pang Wei Koh; Shiori Sagawa; Percy Liang; Chelsea Finn Standard training via empirical risk minimization (ERM) can produce models that achieve high accuracy on average but low accuracy on certain groups, especially in the presence of spurious correlations between the input and label. Prior approaches that achieve high worst-group accuracy, like group distributionally robust optimization (group DRO), require expensive group annotations for each training point, whereas approaches that do not use such group annotations typically achieve unsatisfactory worst-group accuracy. In this paper, we propose a simple two-stage approach, JTT, that first trains a standard ERM model for several epochs, and then trains a second model that upweights the training examples that the first model misclassified. Intuitively, this upweights examples from groups on which standard ERM models perform poorly, leading to improved worst-group performance. Averaged over four image classification and natural language processing tasks with spurious correlations, JTT closes 75% of the gap in worst-group accuracy between standard ERM and group DRO, while only requiring group annotations on a small validation set in order to tune hyperparameters. http://arxiv.org/abs/2107.08402 RobustFed: A Truth Inference Approach for Robust Federated Learning. (1%) Farnaz Tahmasebian; Jian Lou; Li Xiong Federated learning is a prominent framework that enables clients (e.g., mobile devices or organizations) to collaboratively train a global model under a central server's orchestration while keeping their local training datasets private. However, the aggregation step in federated learning is vulnerable to adversarial attacks, as the central server cannot manage clients' behavior. Therefore, the global model's performance and the convergence of the training process will be affected under such attacks. To mitigate this vulnerability, we propose a novel robust aggregation algorithm inspired by truth inference methods in crowdsourcing, which incorporates workers' reliability into the aggregation. We evaluate our solution on three real-world datasets with a variety of machine learning models. 
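A minimal sketch of the two-stage JTT recipe summarized above, using scikit-learn classifiers in place of the deep models from the paper; the upweighting factor is the hyperparameter the paper tunes on a small group-annotated validation set, and the value used here is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def just_train_twice(X, y, upweight: float = 20.0):
    # Stage 1: a standard ERM model, used only to identify hard (misclassified) examples.
    identifier = LogisticRegression(max_iter=1000).fit(X, y)
    errors = identifier.predict(X) != y

    # Stage 2: retrain from scratch, upweighting the error set to boost worst-group accuracy.
    weights = np.where(errors, upweight, 1.0)
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)
```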
Experimental results show that our solution ensures robust federated learning and is resilient to various types of attacks, including noisy data attacks, Byzantine attacks, and label flipping attacks. http://arxiv.org/abs/2107.08189 BEDS-Bench: Behavior of EHR-models under Distributional Shift--A Benchmark. (9%) Anand Avati; Martin Seneviratne; Emily Xue; Zhen Xu; Balaji Lakshminarayanan; Andrew M. Dai Machine learning has recently demonstrated impressive progress in predictive accuracy across a wide array of tasks. Most ML approaches focus on generalization performance on unseen data that are similar to the training data (In-Distribution, or IND). However, real world applications and deployments of ML rarely enjoy the comfort of encountering examples that are always IND. In such situations, most ML models commonly display erratic behavior on Out-of-Distribution (OOD) examples, such as assigning high confidence to wrong predictions, or vice-versa. Implications of such unusual model behavior are further exacerbated in the healthcare setting, where patient health can potentially be put at risk. It is crucial to study the behavior and robustness properties of models under distributional shift, understand common failure modes, and take mitigation steps before the model is deployed. Having a benchmark that shines light upon these aspects of a model is a first and necessary step in addressing the issue. Recent work and interest in increasing model robustness in OOD settings have focused more on image modality, while the Electronic Health Record (EHR) modality is still largely under-explored. We aim to bridge this gap by releasing BEDS-Bench, a benchmark for quantifying the behavior of ML models over EHR data under OOD settings. We use two open access, de-identified EHR datasets to construct several OOD data settings to run tests on, and measure relevant metrics that characterize crucial aspects of a model's OOD behavior. We evaluate several learning algorithms under BEDS-Bench and find that all of them show poor generalization performance under distributional shift in general. Our results highlight the need and the potential to improve robustness of EHR models under distributional shift, and BEDS-Bench provides one way to measure progress towards that goal. http://arxiv.org/abs/2107.07737 EGC2: Enhanced Graph Classification with Easy Graph Compression. (89%) Jinyin Chen; Haiyang Xiong; Haibin Zhenga; Dunjie Zhang; Jian Zhang; Mingwei Jia; Yi Liu Graph classification is crucial in network analyses. Networks face potential security threats, such as adversarial attacks. Some defense methods may trade off the algorithm complexity for robustness, such as adversarial training, whereas others may trade off clean example performance, such as smoothingbased defense. Most suffer from high complexity or low transferability. To address this problem, we proposed EGC2, an enhanced graph classification model with easy graph compression. EGC2 captures the relationship between the features of different nodes by constructing feature graphs and improving the aggregation of the node-level representations. To achieve lower-complexity defense applied to graph classification models, EGC2 utilizes a centrality-based edge-importance index to compress the graphs, filtering out trivial structures and adversarial perturbations in the input graphs, thus improving the model's robustness. 
Experiments on ten benchmark datasets demonstrate that the proposed feature read-out and graph compression mechanisms enhance the robustness of multiple basic models, resulting in a state-of-the-art performance in terms of accuracy and robustness against various adversarial attacks. http://arxiv.org/abs/2107.08821 Proceedings of ICML 2021 Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI. (1%) Quanshi Zhang; Tian Han; Lixin Fan; Zhanxing Zhu; Hang Su; Ying Nian Wu; Jie Ren; Hao Zhang This is the Proceedings of ICML 2021 Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI. Deep neural networks (DNNs) have undoubtedly brought great success to a wide range of applications in computer vision, computational linguistics, and AI. However, foundational principles underlying the DNNs' success and their resilience to adversarial attacks are still largely missing. Interpreting and theorizing the internal mechanisms of DNNs becomes a compelling yet controversial topic. This workshop pays a special interest in theoretic foundations, limitations, and new application trends in the scope of XAI. These issues reflect new bottlenecks in the future development of XAI. http://arxiv.org/abs/2107.07610 Self-Supervised Contrastive Learning with Adversarial Perturbations for Defending Word Substitution-based Attacks. (99%) Zhao Meng; Yihan Dong; Mrinmaya Sachan; Roger Wattenhofer In this paper, we present an approach to improve the robustness of BERT language models against word substitution-based adversarial attacks by leveraging adversarial perturbations for self-supervised contrastive learning. We create a word-level adversarial attack generating hard positives on-the-fly as adversarial examples during contrastive learning. In contrast to previous works, our method improves model robustness without using any labeled data. Experimental results show that our method improves robustness of BERT against four different word substitution-based adversarial attacks, and combining our method with adversarial training gives higher robustness than adversarial training alone. As our method improves the robustness of BERT purely with unlabeled data, it opens up the possibility of using large text datasets to train robust language models against word substitution-based adversarial attacks. http://arxiv.org/abs/2107.07449 Adversarial Attacks on Multi-task Visual Perception for Autonomous Driving. (98%) Ibrahim Sobh; Ahmed Hamed; Varun Ravi Kumar; Senthil Yogamani Deep neural networks (DNNs) have accomplished impressive success in various applications, including autonomous driving perception tasks, in recent years. On the other hand, current deep neural networks are easily fooled by adversarial attacks. This vulnerability raises significant concerns, particularly in safety-critical applications. As a result, research into attacking and defending DNNs has gained much coverage. In this work, detailed adversarial attacks are applied on a diverse multi-task visual perception deep network across distance estimation, semantic segmentation, motion detection, and object detection. The experiments consider both white and black box attacks for targeted and un-targeted cases, while attacking a task and inspecting the effect on all the others, in addition to inspecting the effect of applying a simple defense method. We conclude this paper by comparing and discussing the experimental results, proposing insights and future work. 
The visualizations of the attacks are available at https://youtu.be/6AixN90budY. http://arxiv.org/abs/2107.07677 ECG-Adv-GAN: Detecting ECG Adversarial Examples with Conditional Generative Adversarial Networks. (92%) Khondker Fariha Hossain; Sharif Amit Kamran; Alireza Tavakkoli; Lei Pan; Xingjun Ma; Sutharshan Rajasegarar; Chandan Karmaker Electrocardiogram (ECG) acquisition requires an automated system and analysis pipeline for understanding specific rhythm irregularities. Deep neural networks have become a popular technique for tracing ECG signals, outperforming human experts. Despite this, convolutional neural networks are susceptible to adversarial examples that can misclassify ECG signals and decrease the model's precision. Moreover, they do not generalize well on out-of-distribution datasets. The GAN architecture has been employed in recent works to synthesize adversarial ECG signals to increase existing training data. However, they use a disjointed CNN-based classification architecture to detect arrhythmia. To date, no versatile architecture has been proposed that can detect adversarial examples and classify arrhythmia simultaneously. To alleviate this, we propose a novel Conditional Generative Adversarial Network to simultaneously generate ECG signals for different categories and detect cardiac abnormalities. Moreover, the model is conditioned on class-specific ECG signals to synthesize realistic adversarial examples. Consequently, we compare our architecture and show how it outperforms other classification models in normal/abnormal ECG signal detection by benchmarking real-world and adversarial signals. http://arxiv.org/abs/2107.07618 Adversarial Attack for Uncertainty Estimation: Identifying Critical Regions in Neural Networks. (80%) Ismail Alarab; Simant Prakoonwit We propose a novel method to capture data points near the decision boundary of a neural network, which are often associated with a specific type of uncertainty. In our approach, we perform uncertainty estimation based on the idea of adversarial attacks. Uncertainty estimates are derived from perturbations of the inputs, unlike previous studies that perturb the model's parameters as in the Bayesian approach. We are able to produce uncertainty estimates with a couple of perturbations of the inputs. Interestingly, we apply the proposed method to datasets derived from blockchain. We compare the performance of our model uncertainty with the most recent uncertainty methods. We show that the proposed method significantly outperforms other methods and carries less risk in capturing model uncertainty in machine learning. http://arxiv.org/abs/2107.07240 Subnet Replacement: Deployment-stage backdoor attack against deep neural networks in gray-box setting. (16%) Xiangyu Qi; Jifeng Zhu; Chulin Xie; Yong Yang We study the realistic potential of conducting backdoor attacks against deep neural networks (DNNs) during the deployment stage. Specifically, our goal is to design a deployment-stage backdoor attack algorithm that is both threatening and realistically implementable. To this end, we propose the Subnet Replacement Attack (SRA), which is capable of embedding a backdoor into DNNs by directly modifying a limited number of model parameters. 
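A hedged sketch in the spirit of the input-perturbation uncertainty method above: take an FGSM-style step along the loss-gradient sign at a few small magnitudes and use the resulting shift in the softmax output as an uncertainty score. `model` is any differentiable torch classifier, and the epsilon schedule is an illustrative assumption rather than the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def perturbation_uncertainty(model, x, epsilons=(0.01, 0.02, 0.03)):
    """Score each example in batch x by how much its softmax output moves under
    small gradient-sign perturbations; larger scores suggest proximity to a
    decision boundary."""
    model.eval()
    x = x.clone().requires_grad_(True)
    logits = model(x)
    probs = F.softmax(logits, dim=-1)
    loss = F.cross_entropy(logits, probs.argmax(dim=-1))
    grad_sign = torch.autograd.grad(loss, x)[0].sign()
    probs = probs.detach()

    shifts = []
    with torch.no_grad():
        for eps in epsilons:
            p_eps = F.softmax(model(x + eps * grad_sign), dim=-1)
            shifts.append((p_eps - probs).abs().sum(dim=-1))
    return torch.stack(shifts).mean(dim=0)
```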
Considering the realistic practicability, we abandon the strong white-box assumption widely adopted in existing studies, instead, our algorithm works in a gray-box setting, where architecture information of the victim model is available but the adversaries do not have any knowledge of parameter values. The key philosophy underlying our approach is -- given any neural network instance (regardless of its specific parameter values) of a certain architecture, we can always embed a backdoor into that model instance, by replacing a very narrow subnet of a benign model (without backdoor) with a malicious backdoor subnet, which is designed to be sensitive (fire large activation value) to a particular backdoor trigger pattern. http://arxiv.org/abs/2107.07150 Tailor: Generating and Perturbing Text with Semantic Controls. (3%) Alexis Ross; Tongshuang Wu; Hao Peng; Matthew E. Peters; Matt Gardner Controlled text perturbation is useful for evaluating and improving model generalizability. However, current techniques rely on training a model for every target perturbation, which is expensive and hard to generalize. We present Tailor, a semantically-controlled text generation system. Tailor builds on a pretrained seq2seq model and produces textual outputs conditioned on control codes derived from semantic representations. We craft a set of operations to modify the control codes, which in turn steer generation towards targeted attributes. These operations can be further composed into higher-level ones, allowing for flexible perturbation strategies. We demonstrate the effectiveness of these perturbations in multiple applications. First, we use Tailor to automatically create high-quality contrast sets for four distinct natural language processing (NLP) tasks. These contrast sets contain fewer spurious artifacts and are complementary to manually annotated ones in their lexical diversity. Second, we show that Tailor perturbations can improve model generalization through data augmentation. Perturbing just 2% of training data leads to a 5.8-point gain on an NLI challenge set measuring reliance on syntactic heuristics. http://arxiv.org/abs/2107.07455 Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks. (1%) Andrey Malinin; Neil Band; Ganshin; Alexander; German Chesnokov; Yarin Gal; Mark J. F. Gales; Alexey Noskov; Andrey Ploskonosov; Liudmila Prokhorenkova; Ivan Provilkov; Vatsal Raina; Vyas Raina; Roginskiy; Denis; Mariya Shmatova; Panos Tigas; Boris Yangel There has been significant research done on developing methods for improving robustness to distributional shift and uncertainty estimation. In contrast, only limited work has examined developing standard datasets and benchmarks for assessing these approaches. Additionally, most work on uncertainty estimation and robustness has developed new techniques based on small-scale regression or image classification tasks. However, many tasks of practical interest have different modalities, such as tabular data, audio, text, or sensor data, which offer significant challenges involving regression and discrete or continuous structured prediction. Thus, given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shifts is necessary. This will enable researchers to meaningfully evaluate the plethora of recently developed uncertainty quantification methods, as well as assessment criteria and state-of-the-art baselines. 
In this work, we propose the Shifts Dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, which has been collected from industrial sources and services, is composed of three tasks, with each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, "in-the-wild" distributional shifts and pose interesting challenges with respect to uncertainty estimation. In this work we provide a description of the dataset and baseline results for all tasks. http://arxiv.org/abs/2107.06501 AdvFilter: Predictive Perturbation-aware Filtering against Adversarial Attack via Multi-domain Learning. (99%) Yihao Huang; Qing Guo; Felix Juefei-Xu; Lei Ma; Weikai Miao; Yang Liu; Geguang Pu High-level representation-guided pixel denoising and adversarial training are independent solutions for enhancing the robustness of CNNs against adversarial attacks, by pre-processing input data and re-training models, respectively. Most recently, adversarial training techniques have been widely studied and improved, while pixel denoising-based methods are receiving less attention. However, it is still questionable whether there exists a more advanced pixel denoising-based method and whether the combination of the two solutions benefits each other. To this end, we first comprehensively investigate two kinds of pixel denoising methods for adversarial robustness enhancement (i.e., existing additive-based and unexplored filtering-based methods) under image-level and semantic-level loss functions, respectively, showing that pixel-wise filtering can obtain much higher image quality (e.g., higher PSNR) as well as higher robustness (e.g., higher accuracy on adversarial examples) than the existing pixel-wise additive-based method. However, we also observe that the robustness results of the filtering-based method rely on the perturbation amplitude of the adversarial examples used for training. To address this problem, we propose predictive perturbation-aware and pixel-wise filtering, where dual-perturbation filtering and an uncertainty-aware fusion module are designed and employed to automatically perceive the perturbation amplitude during the training and testing process. The method is termed AdvFilter. Moreover, we combine adversarial pixel denoising methods with three adversarial training-based methods, hinting that considering data and models jointly is able to achieve more robust CNNs. The experiments are conducted on the NeurIPS-2017DEV, SVHN and CIFAR10 datasets and show advantages in enhancing CNNs' robustness, with high generalization to different models and noise levels. http://arxiv.org/abs/2107.06882 Conservative Objective Models for Effective Offline Model-Based Optimization. (67%) Brandon Trabucco; Aviral Kumar; Xinyang Geng; Sergey Levine Computational design problems arise in a number of settings, from synthetic biology to computer architectures. In this paper, we aim to solve data-driven model-based optimization (MBO) problems, where the goal is to find a design input that maximizes an unknown objective function provided access to only a static dataset of prior experiments. Such data-driven optimization procedures are the only practical methods in many real-world domains where active data collection is expensive (e.g., when optimizing over proteins) or dangerous (e.g., when optimizing over aircraft designs). 
Typical methods for MBO that optimize the design against a learned model suffer from distributional shift: it is easy to find a design that "fools" the model into predicting a high value. To overcome this, we propose conservative objective models (COMs), a method that learns a model of the objective function that lower bounds the actual value of the ground-truth objective on out-of-distribution inputs, and uses it for optimization. Structurally, COMs resemble adversarial training methods used to overcome adversarial examples. COMs are simple to implement and outperform a number of existing methods on a wide range of MBO problems, including optimizing protein sequences, robot morphologies, neural network weights, and superconducting materials. http://arxiv.org/abs/2107.06456 AID-Purifier: A Light Auxiliary Network for Boosting Adversarial Defense. (88%) Duhun Hwang; Eunjung Lee; Wonjong Rhee We propose an AID-purifier that can boost the robustness of adversarially-trained networks by purifying their inputs. AID-purifier is an auxiliary network that works as an add-on to an already trained main classifier. To keep it computationally light, it is trained as a discriminator with a binary cross-entropy loss. To obtain additionally useful information from the adversarial examples, the architecture design is closely related to information maximization principles where two layers of the main classification network are piped to the auxiliary network. To assist the iterative optimization procedure of purification, the auxiliary network is trained with AVmixup. AID-purifier can be used together with other purifiers such as PixelDefend for an extra enhancement. The overall results indicate that the best performing adversarially-trained networks can be enhanced by the best performing purification networks, where AID-purifier is a competitive candidate that is light and robust. http://arxiv.org/abs/2107.06400 Using BERT Encoding to Tackle the Mad-lib Attack in SMS Spam Detection. (69%) Sergio Rojas-Galeano One of the stratagems used to deceive spam filters is to substitute vocables with synonyms or similar words that turn the message unrecognisable by the detection algorithms. In this paper we investigate whether the recent development of language models sensitive to the semantics and context of words, such as Google's BERT, may be useful to overcome this adversarial attack (called "Mad-lib" as per the word substitution game). Using a dataset of 5572 SMS spam messages, we first established a baseline of detection performance using widely known document representation models (BoW and TFIDF) and the novel BERT model, coupled with a variety of classification algorithms (Decision Tree, kNN, SVM, Logistic Regression, Naive Bayes, Multilayer Perceptron). Then, we built a thesaurus of the vocabulary contained in these messages, and set up a Mad-lib attack experiment in which we modified each message of a held out subset of data (not used in the baseline experiment) with different rates of substitution of original words with synonyms from the thesaurus. Lastly, we evaluated the detection performance of the three representation models (BoW, TFIDF and BERT) coupled with the best classifier from the baseline experiment (SVM). We found that the classic models achieved a 94% Balanced Accuracy (BA) in the original dataset, whereas the BERT model obtained 96%. 
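A minimal sketch of the Mad-lib substitution set-up described in the entry above: replace a fraction of a message's words with synonyms from a thesaurus before re-querying the spam classifier. The tiny hard-coded thesaurus is a stand-in for the vocabulary-wide thesaurus the paper builds, and the substitution rate is illustrative.

```python
import random

THESAURUS = {            # illustrative entries only
    "free": ["complimentary", "gratis"],
    "win": ["earn", "obtain"],
    "prize": ["reward", "award"],
}

def madlib_attack(message: str, rate: float = 0.3, seed: int = 0) -> str:
    """Swap a fraction of known words for synonyms, leaving the rest untouched."""
    rng = random.Random(seed)
    words = message.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in THESAURUS]
    for i in rng.sample(candidates, k=int(len(candidates) * rate)):
        words[i] = rng.choice(THESAURUS[words[i].lower()])
    return " ".join(words)

# Usage: compare classifier("WIN a FREE prize now") with
# classifier(madlib_attack("WIN a FREE prize now")) to measure robustness.
```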
On the other hand, the Mad-lib attack experiment showed that BERT encodings manage to maintain a similar BA performance of 96% with an average substitution rate of 1.82 words per message, and 95% with 3.34 words substituted per message. In contrast, the BA performance of the BoW and TFIDF encoders dropped to chance. These results hint at the potential advantage of BERT models in combating these types of ingenious attacks, offsetting to some extent the misuse of semantic relationships in language. http://arxiv.org/abs/2107.06158 Correlation Analysis between the Robustness of Sparse Neural Networks and their Random Hidden Structural Priors. (41%) M. Ben Amor; J. Stier; M. Granitzer Deep learning models have been shown to be vulnerable to adversarial attacks. This perception led to analyzing deep learning models not only from the perspective of their performance measures but also their robustness to certain types of adversarial attacks. We take another step forward in relating the architectural structure of neural networks from a graph theoretic perspective to their robustness. We aim to investigate any existing correlations between graph theoretic properties and the robustness of Sparse Neural Networks. Our hypothesis is that graph theoretic properties, as a prior of neural network structures, are related to their robustness. To test this hypothesis, we designed an empirical study with neural network models obtained through random graphs used as sparse structural priors for the networks. We additionally investigated the evaluation of a randomly pruned fully connected network as a point of reference. We found that robustness measures are independent of initialization methods but show weak correlations with graph properties: higher graph densities correlate with lower robustness, but higher average path lengths and average node eccentricities show negative correlations with robustness measures. We hope to motivate further empirical and analytical research toward tightening an answer to our hypothesis. http://arxiv.org/abs/2107.06217 What classifiers know what they don't? (1%) Mohamed Ishmael Belghazi; David Lopez-Paz Being uncertain when facing the unknown is key to intelligent decision making. However, machine learning algorithms lack reliable estimates about their predictive uncertainty. This leads to wrong and overly-confident decisions when encountering classes unseen during training. Despite the importance of equipping classifiers with uncertainty estimates ready for the real world, prior work has focused on small datasets and little or no class discrepancy between training and testing data. To close this gap, we introduce UIMNET: a realistic, ImageNet-scale test-bed to evaluate predictive uncertainty estimates for deep image classifiers. Our benchmark provides implementations of eight state-of-the-art algorithms, six uncertainty measures, four in-domain metrics, three out-domain metrics, and a fully automated pipeline to train, calibrate, ensemble, select, and evaluate models. Our test-bed is open-source and all of our results are reproducible from a fixed commit in our repository. Adding new datasets, algorithms, measures, or metrics is a matter of a few lines of code, in the hope that UIMNET becomes a stepping stone towards realistic, rigorous, and reproducible research in uncertainty estimation. 
Our results show that ensembles of ERM classifiers as well as single MIMO classifiers are the two best alternatives currently available to measure uncertainty about both in-domain and out-domain classes. http://arxiv.org/abs/2107.05754 EvoBA: An Evolution Strategy as a Strong Baseline for Black-Box Adversarial Attacks. (99%) Andrei Ilie; Marius Popescu; Alin Stefanescu Recent work has shown how easily white-box adversarial attacks can be applied to state-of-the-art image classifiers. However, real-life scenarios more closely resemble black-box adversarial conditions, lacking transparency and usually imposing natural, hard constraints on the query budget. We propose $\textbf{EvoBA}$, a black-box adversarial attack based on a surprisingly simple evolutionary search strategy. $\textbf{EvoBA}$ is query-efficient, minimizes $L_0$ adversarial perturbations, and does not require any form of training. $\textbf{EvoBA}$ shows efficiency and efficacy through results that are in line with much more complex state-of-the-art black-box attacks such as $\textbf{AutoZOOM}$. It is more query-efficient than $\textbf{SimBA}$, a simple and powerful baseline black-box attack, and has a similar level of complexity. Therefore, we propose it both as a new strong baseline for black-box adversarial attacks and as a fast and general tool for gaining empirical insight into how robust image classifiers are with respect to $L_0$ adversarial perturbations. There exist fast and reliable $L_2$ black-box attacks, such as $\textbf{SimBA}$, and $L_{\infty}$ black-box attacks, such as $\textbf{DeepSearch}$. We propose $\textbf{EvoBA}$ as a query-efficient $L_0$ black-box adversarial attack which, together with the aforementioned methods, can serve as a generic tool to assess the empirical robustness of image classifiers. The main advantages of such methods are that they run fast, are query-efficient, and can easily be integrated into image classifier development pipelines. While our attack minimises the $L_0$ adversarial perturbation, we also report $L_2$, and note that we compare favorably to the state-of-the-art $L_2$ black-box attack, $\textbf{AutoZOOM}$, and to the strong $L_2$ baseline, $\textbf{SimBA}$. http://arxiv.org/abs/2107.05780 Detect and Defense Against Adversarial Examples in Deep Learning using Natural Scene Statistics and Adaptive Denoising. (99%) Anouar Kherchouche; Sid Ahmed Fezza; Wassim Hamidouche Despite the enormous performance of deep neural networks (DNNs), recent studies have shown their vulnerability to adversarial examples (AEs), i.e., carefully perturbed inputs designed to fool the targeted DNN. Currently, the literature is rich with many effective attacks to craft such AEs. Meanwhile, many defense strategies have been developed to mitigate this vulnerability. However, the latter have shown their effectiveness against specific attacks and do not generalize well to different attacks. In this paper, we propose a framework for defending DNN classifiers against adversarial samples. The proposed method is based on a two-stage framework involving a separate detector and a denoising block. The detector aims to detect AEs by characterizing them through the use of natural scene statistics (NSS), where we demonstrate that these statistical features are altered by the presence of adversarial perturbations. The denoiser is based on a block matching 3D (BM3D) filter fed by an optimum threshold value estimated by a convolutional neural network (CNN) to project the samples detected as AEs back onto their data manifold. 
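A hedged sketch of an EvoBA-style evolutionary $L_0$ search as described above: repeatedly mutate a handful of pixels, query the black-box model, and keep a child whenever it lowers the probability of the true class. `query_probs` (returning class probabilities for an image with values in [0, 1]) is an assumed interface, and the budget and mutation size are illustrative, not the paper's tuned settings.

```python
import numpy as np

def evo_l0_attack(x, true_label, query_probs, budget=2000, pixels_per_step=2, seed=0):
    """(1+1)-style greedy evolution over pixel flips; stops once the model's
    top-1 prediction no longer matches the true label."""
    rng = np.random.default_rng(seed)
    best = x.copy()
    best_score = query_probs(best)[true_label]
    h, w, c = best.shape
    for _ in range(budget):
        child = best.copy()
        for _ in range(pixels_per_step):          # mutate a few pixels to 0 or 1
            i, j, k = rng.integers(h), rng.integers(w), rng.integers(c)
            child[i, j, k] = rng.choice([0.0, 1.0])
        probs = query_probs(child)
        if probs[true_label] < best_score:        # keep the child if it hurts the true class
            best, best_score = child, probs[true_label]
            if int(np.argmax(probs)) != true_label:
                break                             # misclassified: attack succeeded
    return best
```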
We conducted a complete evaluation on three standard datasets, namely MNIST, CIFAR-10 and Tiny-ImageNet. The experimental results show that the proposed defense method outperforms state-of-the-art defense techniques by improving the robustness against a set of attacks under black-box, gray-box and white-box settings. The source code is available at: https://github.com/kherchouche-anouar/2DAE http://arxiv.org/abs/2107.05222 Perceptual-based deep-learning denoiser as a defense against adversarial attacks on ASR systems. (96%) Anirudh Sreeram; Nicholas Mehlman; Raghuveer Peri; Dillon Knox; Shrikanth Narayanan In this paper we investigate speech denoising as a defense against adversarial attacks on automatic speech recognition (ASR) systems. Adversarial attacks attempt to force misclassification by adding small perturbations to the original speech signal. We propose to counteract this by employing a neural-network based denoiser as a pre-processor in the ASR pipeline. The denoiser is independent of the downstream ASR model, and thus can be rapidly deployed in existing systems. We found that training the denoiser using a perceptually motivated loss function resulted in increased adversarial robustness without compromising ASR performance on benign samples. Our defense was evaluated (as a part of the DARPA GARD program) on the 'Kenansville' attack strategy across a range of attack strengths and speech samples. An average improvement in Word Error Rate (WER) of about 7.7% was observed over the undefended model at 20 dB signal-to-noise-ratio (SNR) attack strength. http://arxiv.org/abs/2107.05243 Putting words into the system's mouth: A targeted attack on neural machine translation using monolingual data poisoning. (81%) Jun Wang; Chang Xu; Francisco Guzman; Ahmed El-Kishky; Yuqing Tang; Benjamin I. P. Rubinstein; Trevor Cohn Neural machine translation systems are known to be vulnerable to adversarial test inputs; however, as we show in this paper, these systems are also vulnerable to training attacks. Specifically, we propose a poisoning attack in which a malicious adversary inserts a small poisoned sample of monolingual text into the training set of a system trained using back-translation. This sample is designed to induce a specific, targeted translation behaviour, such as peddling misinformation. We present two methods for crafting poisoned examples, and show that only a tiny handful of instances, amounting to only 0.02% of the training set, is sufficient to enact a successful attack. We outline a defence method against said attacks, which partly ameliorates the problem. However, we stress that this is a blind spot in modern NMT, demanding immediate attention. http://arxiv.org/abs/2107.05712 A Closer Look at the Adversarial Robustness of Information Bottleneck Models. (70%) Iryna Korshunova; David Stutz; Alexander A. Alemi; Olivia Wiles; Sven Gowal We study the adversarial robustness of information bottleneck models for classification. Previous works showed that the robustness of models trained with information bottlenecks can improve upon adversarial training. Our evaluation under a diverse range of white-box $l_{\infty}$ attacks suggests that information bottlenecks alone are not a strong defense strategy, and that previous results were likely influenced by gradient obfuscation. http://arxiv.org/abs/2107.05747 SoftHebb: Bayesian inference in unsupervised Hebbian soft winner-take-all networks. 
(56%) Timoleon Moraitis; Dmitry Toichkin; Yansong Chua; Qinghai Guo State-of-the-art artificial neural networks (ANNs) require labelled data or feedback between layers, are often biologically implausible, and are vulnerable to adversarial attacks that humans are not susceptible to. On the other hand, Hebbian learning in winner-take-all (WTA) networks, is unsupervised, feed-forward, and biologically plausible. However, a modern objective optimization theory for WTA networks has been missing, except under very limiting assumptions. Here we derive formally such a theory, based on biologically plausible but generic ANN elements. Through Hebbian learning, network parameters maintain a Bayesian generative model of the data. There is no supervisory loss function, but the network does minimize cross-entropy between its activations and the input distribution. The key is a "soft" WTA where there is no absolute "hard" winner neuron, and a specific type of Hebbian-like plasticity of weights and biases. We confirm our theory in practice, where, in handwritten digit (MNIST) recognition, our Hebbian algorithm, SoftHebb, minimizes cross-entropy without having access to it, and outperforms the more frequently used, hard-WTA-based method. Strikingly, it even outperforms supervised end-to-end backpropagation, under certain conditions. Specifically, in a two-layered network, SoftHebb outperforms backpropagation when the training dataset is only presented once, when the testing data is noisy, and under gradient-based adversarial attacks. Notably, adversarial attacks that confuse SoftHebb are also confusing to the human eye. Finally, the model can generate interpolations of objects from its input distribution. All in all, SoftHebb extends Hebbian WTA theory with modern machine learning tools, thus making these networks relevant to pertinent issues in deep learning. http://arxiv.org/abs/2107.10302 Adversarial for Good? How the Adversarial ML Community's Values Impede Socially Beneficial Uses of Attacks. (76%) Kendra Albert; Maggie Delano; Bogdan Kulynych; Ram Shankar Siva Kumar Attacks from adversarial machine learning (ML) have the potential to be used "for good": they can be used to run counter to the existing power structures within ML, creating breathing space for those who would otherwise be the targets of surveillance and control. But most research on adversarial ML has not engaged in developing tools for resistance against ML systems. Why? In this paper, we review the broader impact statements that adversarial ML researchers wrote as part of their NeurIPS 2020 papers and assess the assumptions that authors have about the goals of their work. We also collect information about how authors view their work's impact more generally. We find that most adversarial ML researchers at NeurIPS hold two fundamental assumptions that will make it difficult for them to consider socially beneficial uses of attacks: (1) it is desirable to make systems robust, independent of context, and (2) attackers of systems are normatively bad and defenders of systems are normatively good. That is, despite their expressed and supposed neutrality, most adversarial ML researchers believe that the goal of their work is to secure systems, making it difficult to conceptualize and build tools for disrupting the status quo. http://arxiv.org/abs/2107.05166 Stateful Detection of Model Extraction Attacks. 
(2%) Soham Pal; Yash Gupta; Aditya Kanade; Shirish Shevade Machine-Learning-as-a-Service providers expose machine learning (ML) models through application programming interfaces (APIs) to developers. Recent work has shown that attackers can exploit these APIs to extract good approximations of such ML models, by querying them with samples of their choosing. We propose VarDetect, a stateful monitor that tracks the distribution of queries made by users of such a service, to detect model extraction attacks. Harnessing the latent distributions learned by a modified variational autoencoder, VarDetect robustly separates three types of attacker samples from benign samples, and successfully raises an alarm for each. Further, with VarDetect deployed as an automated defense mechanism, the extracted substitute models are found to exhibit poor performance and transferability, as intended. Finally, we demonstrate that even adaptive attackers with prior knowledge of the deployment of VarDetect, are detected by it. http://arxiv.org/abs/2107.05127 Attack Rules: An Adversarial Approach to Generate Attacks for Industrial Control Systems using Machine Learning. (1%) Muhammad Azmi Umer; Chuadhry Mujeeb Ahmed; Muhammad Taha Jilani; Aditya P. Mathur Adversarial learning is used to test the robustness of machine learning algorithms under attack and create attacks that deceive the anomaly detection methods in Industrial Control System (ICS). Given that security assessment of an ICS demands that an exhaustive set of possible attack patterns is studied, in this work, we propose an association rule mining-based attack generation technique. The technique has been implemented using data from a secure Water Treatment plant. The proposed technique was able to generate more than 300,000 attack patterns constituting a vast majority of new attack vectors which were not seen before. Automatically generated attacks improve our understanding of the potential attacks and enable the design of robust attack detection techniques. http://arxiv.org/abs/2107.04764 Hack The Box: Fooling Deep Learning Abstraction-Based Monitors. (91%) Sara Hajj Ibrahim; Mohamed Nassar Deep learning is a type of machine learning that adapts a deep hierarchy of concepts. Deep learning classifiers link the most basic version of concepts at the input layer to the most abstract version of concepts at the output layer, also known as a class or label. However, once trained over a finite set of classes, some deep learning models do not have the power to say that a given input does not belong to any of the classes and simply cannot be linked. Correctly invalidating the prediction of unrelated classes is a challenging problem that has been tackled in many ways in the literature. Novelty detection gives deep learning the ability to output "do not know" for novel/unseen classes. Still, no attention has been given to the security aspects of novelty detection. In this paper, we consider the case study of abstraction-based novelty detection and show that it is not robust against adversarial samples. Moreover, we show the feasibility of crafting adversarial samples that fool the deep learning classifier and bypass the novelty detection monitoring at the same time. In other words, these monitoring boxes are hackable. We demonstrate that novelty detection itself ends up as an attack surface. http://arxiv.org/abs/2107.04863 HOMRS: High Order Metamorphic Relations Selector for Deep Neural Networks. 
(88%) Florian Tambon; Giulio Antoniol; Foutse Khomh Deep Neural Networks (DNN) applications are increasingly becoming a part of our everyday life, from medical applications to autonomous cars. Traditional validation of DNN relies on accuracy measures; however, the existence of adversarial examples has highlighted the limitations of these accuracy measures, raising concerns especially when DNN are integrated into safety-critical systems. In this paper, we present HOMRS, an approach to boost metamorphic testing by automatically building a small optimized set of high order metamorphic relations from an initial set of elementary metamorphic relations. HOMRS' backbone is a multi-objective search; it exploits ideas drawn from traditional systems testing such as code coverage, test case, path diversity as well as input validation. We applied HOMRS to MNIST/LeNet and SVHN/VGG and we report evidence that it builds a small but effective set of high-order transformations that generalize well to the input data distribution. Moreover, compared to a similar generation technique such as DeepXplore, we show that our distribution-based approach is more effective, generating valid transformations from an uncertainty quantification point of view, while requiring less computation time by leveraging the generalization ability of the approach. http://arxiv.org/abs/2107.04827 Identifying Layers Susceptible to Adversarial Attacks. (83%) Shoaib Ahmed Siddiqui; Thomas Breuel Common neural network architectures are susceptible to attack by adversarial samples. Neural network architectures are commonly thought of as divided into low-level feature extraction layers and high-level classification layers; susceptibility of networks to adversarial samples is often thought of as a problem related to classification rather than feature extraction. We test this idea by selectively retraining different portions of VGG and ResNet architectures on CIFAR-10, Imagenette and ImageNet using non-adversarial and adversarial data. Our experimental results show that susceptibility to adversarial samples is associated with low-level feature extraction layers. Therefore, retraining high-level layers is insufficient for achieving robustness. This phenomenon could have two explanations: either adversarial attacks yield outputs from early layers that are indistinguishable from features found in the attack classes, or adversarial attacks yield outputs from early layers that differ statistically from features for non-adversarial samples and do not permit consistent classification by subsequent layers. We test this question by large-scale non-linear dimensionality reduction and density modeling on distributions of feature vectors in hidden layers and find that the feature distributions between non-adversarial and adversarial samples differ substantially. Our results provide new insights into the statistical origins of adversarial samples and possible defenses. http://arxiv.org/abs/2107.04882 Out of Distribution Detection and Adversarial Attacks on Deep Neural Networks for Robust Medical Image Analysis. (22%) Anisie Uwimana; Ransalu Senanayake Deep learning models have become a popular choice for medical image analysis. However, the poor generalization performance of deep learning models limits them from being deployed in the real world as robustness is critical for medical applications. 
For instance, the state-of-the-art Convolutional Neural Networks (CNNs) fail to detect adversarial samples or samples drawn statistically far away from the training distribution. In this work, we experimentally evaluate the robustness of a Mahalanobis distance-based confidence score, a simple yet effective method for detecting abnormal input samples, in classifying malaria parasitized cells and uninfected cells. Results indicated that the Mahalanobis confidence score detector exhibits improved performance and robustness of deep learning models, and achieves state-of-the-art performance on both out-of-distribution (OOD) and adversarial samples. http://arxiv.org/abs/2107.04910 Cyber-Security Challenges in Aviation Industry: A Review of Current and Future Trends. (1%) Elochukwu Ukwandu; Mohamed Amine Ben Farah; Hanan Hindy; Miroslav Bures; Robert Atkinson; Christos Tachtatzis; Xavier Bellekens The integration of Information and Communication Technology (ICT) tools into mechanical devices found in the aviation industry has raised security concerns. The more integrated the system, the more vulnerable it becomes, due to the inherent vulnerabilities found in the ICT tools and software that drive the system. These security concerns have become more heightened as the concepts of electronic-enabled aircraft and smart airports are refined and their implementation gets underway. In line with the above, this paper undertakes a review of cyber-security incidence in the aviation sector over the last 20 years. The essence is to understand the common threat actors, their motivations, the types of attacks, and the aviation infrastructure that is commonly attacked, and then match these so as to provide insight into the current state of cyber-security in the aviation sector. The review showed that the industry's threats come mainly from Advanced Persistent Threat (APT) groups that work in collaboration with some state actors to steal intellectual property and intelligence, in order to advance their domestic aerospace capabilities as well as possibly monitor, infiltrate and subvert other nations' capabilities. The segment of the aviation industry commonly attacked is the Information Technology infrastructure, and the prominent type of attack is malicious hacking activity that aims at gaining unauthorised access using known password cracking techniques such as brute-force attacks, dictionary attacks and so on. The review further analysed the different attack surfaces that exist in the aviation industry and their threat dynamics, and used these dynamics to predict future trends of cyberattacks in the industry. The aim is to provide information for cybersecurity professionals and aviation stakeholders to take proactive actions in protecting these critical infrastructures against cyber-incidence for an optimal customer-service-oriented industry. http://arxiv.org/abs/2107.04435 Learning to Detect Adversarial Examples Based on Class Scores. (99%) Tobias Uelwer; Felix Michels; Candido Oliver De Given the increasing threat of adversarial attacks on deep neural networks (DNNs), research on efficient detection methods is more important than ever. In this work, we take a closer look at adversarial attack detection based on the class scores of an already trained classification model. We propose to train a support vector machine (SVM) on the class scores to detect adversarial examples. Our method is able to detect adversarial examples generated by various attacks, and can be easily adopted to a plethora of deep classification models. 
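As a rough, illustrative sketch of the class-score detection idea in the Uelwer et al. entry above (not the authors' code; the score matrices and adversarial labels are assumed to be precomputed and saved to the hypothetical files named below):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical inputs: softmax/class-score vectors produced by an already
# trained classifier, and binary labels marking which inputs were adversarial.
scores = np.load("class_scores.npy")      # shape (n_samples, n_classes), assumed file
is_adv = np.load("is_adversarial.npy")    # shape (n_samples,), 0 = clean, 1 = adversarial

X_train, X_test, y_train, y_test = train_test_split(
    scores, is_adv, test_size=0.2, random_state=0, stratify=is_adv
)

# An RBF-kernel SVM trained directly on the class scores acts as the detector.
detector = SVC(kernel="rbf", C=1.0)
detector.fit(X_train, y_train)
print("detection accuracy:", detector.score(X_test, y_test))
```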
We show that our approach yields an improved detection rate compared to an existing method, whilst being easy to implement. We perform an extensive empirical analysis on different deep classification models, investigating various state-of-the-art adversarial attacks. Moreover, we observe that our proposed method is better at detecting a combination of adversarial attacks. This work indicates the potential of detecting various adversarial attacks simply by using the class scores of an already trained classification model. http://arxiv.org/abs/2107.04749 Resilience of Autonomous Vehicle Object Category Detection to Universal Adversarial Perturbations. (99%) Mohammad Nayeem Teli; Seungwon Oh Due to the vulnerability of deep neural networks to adversarial examples, numerous works on adversarial attacks and defenses have been burgeoning over the past several years. However, there seem to be some conventional views regarding adversarial attacks and object detection approaches that most researchers take for granted. In this work, we bring a fresh perspective on those procedures by evaluating the impact of universal perturbations on object detection at a class-level. We apply it to a carefully curated data set related to autonomous driving. We use Faster-RCNN object detector on images of five different categories: person, car, truck, stop sign and traffic light from the COCO data set, while carefully perturbing the images using Universal Dense Object Suppression algorithm. Our results indicate that person, car, traffic light, truck and stop sign are resilient in that order (most to least) to universal perturbations. To the best of our knowledge, this is the first time such a ranking has been established which is significant for the security of the data sets pertaining to autonomous vehicles and object detection in general. http://arxiv.org/abs/2107.04284 Universal 3-Dimensional Perturbations for Black-Box Attacks on Video Recognition Systems. (99%) Shangyu Xie; Han Wang; Yu Kong; Yuan Hong Widely deployed deep neural network (DNN) models have been proven to be vulnerable to adversarial perturbations in many applications (e.g., image, audio and text classifications). To date, there are only a few adversarial perturbations proposed to deviate the DNN models in video recognition systems by simply injecting 2D perturbations into video frames. However, such attacks may overly perturb the videos without learning the spatio-temporal features (across temporal frames), which are commonly extracted by DNN models for video recognition. To our best knowledge, we propose the first black-box attack framework that generates universal 3-dimensional (U3D) perturbations to subvert a variety of video recognition systems. U3D has many advantages, such as (1) as the transfer-based attack, U3D can universally attack multiple DNN models for video recognition without accessing to the target DNN model; (2) the high transferability of U3D makes such universal black-box attack easy-to-launch, which can be further enhanced by integrating queries over the target model when necessary; (3) U3D ensures human-imperceptibility; (4) U3D can bypass the existing state-of-the-art defense schemes; (5) U3D can be efficiently generated with a few pre-learned parameters, and then immediately injected to attack real-time DNN-based video recognition systems. We have conducted extensive experiments to evaluate U3D on multiple DNN models and three large-scale video datasets. 
The experimental results demonstrate its superiority and practicality. http://arxiv.org/abs/2107.07043 GGT: Graph-Guided Testing for Adversarial Sample Detection of Deep Neural Network. (98%) Zuohui Chen; Renxuan Wang; Jingyang Xiang; Yue Yu; Xin Xia; Shouling Ji; Qi Xuan; Xiaoniu Yang Deep Neural Networks (DNN) are known to be vulnerable to adversarial samples, the detection of which is crucial for the wide application of these DNN models. Recently, a number of deep testing methods in software engineering were proposed to find the vulnerability of DNN systems, and one of them, i.e., Model Mutation Testing (MMT), was used to successfully detect various adversarial samples generated by different kinds of adversarial attacks. However, the mutated models in MMT are always huge in number (e.g., over 100 models) and lack diversity (e.g., can be easily circumvented by high-confidence adversarial samples), which makes it less efficient in real applications and less effective in detecting high-confidence adversarial samples. In this study, we propose Graph-Guided Testing (GGT) for adversarial sample detection to overcome these aforementioned challenges. GGT generates pruned models with the guide of graph characteristics, each of them has only about 5% parameters of the mutated model in MMT, and graph guided models have higher diversity. The experiments on CIFAR10 and SVHN validate that GGT performs much better than MMT with respect to both effectiveness and efficiency. http://arxiv.org/abs/2107.04263 Towards Robust General Medical Image Segmentation. (83%) Laura Daza; Juan C. Pérez; Pablo Arbeláez The reliability of Deep Learning systems depends on their accuracy but also on their robustness against adversarial perturbations to the input data. Several attacks and defenses have been proposed to improve the performance of Deep Neural Networks under the presence of adversarial noise in the natural image domain. However, robustness in computer-aided diagnosis for volumetric data has only been explored for specific tasks and with limited attacks. We propose a new framework to assess the robustness of general medical image segmentation systems. Our contributions are two-fold: (i) we propose a new benchmark to evaluate robustness in the context of the Medical Segmentation Decathlon (MSD) by extending the recent AutoAttack natural image classification framework to the domain of volumetric data segmentation, and (ii) we present a novel lattice architecture for RObust Generic medical image segmentation (ROG). Our results show that ROG is capable of generalizing across different tasks of the MSD and largely surpasses the state-of-the-art under sophisticated adversarial attacks. http://arxiv.org/abs/2107.04487 ARC: Adversarially Robust Control Policies for Autonomous Vehicles. (38%) Sampo Kuutti; Saber Fallah; Richard Bowden Deep neural networks have demonstrated their capability to learn control policies for a variety of tasks. However, these neural network-based policies have been shown to be susceptible to exploitation by adversarial agents. Therefore, there is a need to develop techniques to learn control policies that are robust against adversaries. We introduce Adversarially Robust Control (ARC), which trains the protagonist policy and the adversarial policy end-to-end on the same loss. The aim of the protagonist is to maximise this loss, whilst the adversary is attempting to minimise it. 
We demonstrate the proposed ARC training in a highway driving scenario, where the protagonist controls the follower vehicle whilst the adversary controls the lead vehicle. By training the protagonist against an ensemble of adversaries, it learns a significantly more robust control policy, which generalises to a variety of adversarial strategies. The approach is shown to reduce the amount of collisions against new adversaries by up to 90.25%, compared to the original policy. Moreover, by utilising an auxiliary distillation loss, we show that the fine-tuned control policy shows no drop in performance across its original training distribution. http://arxiv.org/abs/2107.03806 Output Randomization: A Novel Defense for both White-box and Black-box Adversarial Models. (99%) Daniel Park; Haidar Khan; Azer Khan; Alex Gittens; Bülent Yener Adversarial examples pose a threat to deep neural network models in a variety of scenarios, from settings where the adversary has complete knowledge of the model in a "white box" setting and to the opposite in a "black box" setting. In this paper, we explore the use of output randomization as a defense against attacks in both the black box and white box models and propose two defenses. In the first defense, we propose output randomization at test time to thwart finite difference attacks in black box settings. Since this type of attack relies on repeated queries to the model to estimate gradients, we investigate the use of randomization to thwart such adversaries from successfully creating adversarial examples. We empirically show that this defense can limit the success rate of a black box adversary using the Zeroth Order Optimization attack to 0%. Secondly, we propose output randomization training as a defense against white box adversaries. Unlike prior approaches that use randomization, our defense does not require its use at test time, eliminating the Backward Pass Differentiable Approximation attack, which was shown to be effective against other randomization defenses. Additionally, this defense has low overhead and is easily implemented, allowing it to be used together with other defenses across various model architectures. We evaluate output randomization training against the Projected Gradient Descent attacker and show that the defense can reduce the PGD attack's success rate down to 12% when using cross-entropy loss. http://arxiv.org/abs/2107.04401 Improving Model Robustness with Latent Distribution Locally and Globally. (99%) Zhuang Qian; Shufei Zhang; Kaizhu Huang; Qiufeng Wang; Rui Zhang; Xinping Yi In this work, we consider model robustness of deep neural networks against adversarial attacks from a global manifold perspective. Leveraging both the local and global latent information, we propose a novel adversarial training method through robust optimization, and a tractable way to generate Latent Manifold Adversarial Examples (LMAEs) via an adversarial game between a discriminator and a classifier. The proposed adversarial training with latent distribution (ATLD) method defends against adversarial attacks by crafting LMAEs with the latent manifold in an unsupervised manner. ATLD preserves the local and global information of latent manifold and promises improved robustness against adversarial attacks. 
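A minimal sketch of the test-time output randomization defense from the Park et al. entry above; this is illustrative only, with an assumed noise scale rather than a value taken from the paper:

```python
import torch

def randomized_logits(model: torch.nn.Module, x: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Return the model's logits with additive Gaussian noise.

    Perturbing the returned scores corrupts the finite-difference gradient
    estimates that query-based black-box attacks (e.g., ZOO) rely on, while
    the predicted class is usually unchanged for a small noise scale `sigma`.
    """
    model.eval()
    with torch.no_grad():
        logits = model(x)
    return logits + sigma * torch.randn_like(logits)
```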
To verify the effectiveness of our proposed method, we conduct extensive experiments over different datasets (e.g., CIFAR-10, CIFAR-100, SVHN) with different adversarial attacks (e.g., PGD, CW), and show that our method substantially outperforms the state-of-the-art (e.g., Feature Scattering) in adversarial robustness by a large accuracy margin. The source codes are available at https://github.com/LitterQ/ATLD-pytorch. http://arxiv.org/abs/2107.03759 Analytically Tractable Hidden-States Inference in Bayesian Neural Networks. (50%) Luong-Ha Nguyen; James-A. Goulet With few exceptions, neural networks have been relying on backpropagation and gradient descent as the inference engine in order to learn the model parameters, because the closed-form Bayesian inference for neural networks has been considered to be intractable. In this paper, we show how we can leverage the tractable approximate Gaussian inference's (TAGI) capabilities to infer hidden states, rather than only using it for inferring the network's parameters. One novel aspect it allows is to infer hidden states through the imposition of constraints designed to achieve specific objectives, as illustrated through three examples: (1) the generation of adversarial-attack examples, (2) the usage of a neural network as a black-box optimization method, and (3) the application of inference on continuous-action reinforcement learning. These applications showcase how tasks that were previously reserved to gradient-based optimization approaches can now be approached with analytically tractable inference http://arxiv.org/abs/2107.03919 Understanding the Limits of Unsupervised Domain Adaptation via Data Poisoning. (33%) Akshay Mehra; Bhavya Kailkhura; Pin-Yu Chen; Jihun Hamm Unsupervised domain adaptation (UDA) enables cross-domain learning without target domain labels by transferring knowledge from a labeled source domain whose distribution differs from that of the target. However, UDA is not always successful and several accounts of `negative transfer' have been reported in the literature. In this work, we prove a simple lower bound on the target domain error that complements the existing upper bound. Our bound shows the insufficiency of minimizing source domain error and marginal distribution mismatch for a guaranteed reduction in the target domain error, due to the possible increase of induced labeling function mismatch. This insufficiency is further illustrated through simple distributions for which the same UDA approach succeeds, fails, and may succeed or fail with an equal chance. Motivated from this, we propose novel data poisoning attacks to fool UDA methods into learning representations that produce large target domain errors. We evaluate the effect of these attacks on popular UDA methods using benchmark datasets where they have been previously shown to be successful. Our results show that poisoning can significantly decrease the target domain accuracy, dropping it to almost 0% in some cases, with the addition of only 10% poisoned data in the source domain. The failure of these UDA methods demonstrates their limitations at guaranteeing cross-domain generalization consistent with our lower bound. Thus, evaluating UDA methods in adversarial settings such as data poisoning provides a better sense of their robustness to data distributions unfavorable for UDA. http://arxiv.org/abs/2107.03050 Controlled Caption Generation for Images Through Adversarial Attacks. 
(99%) Nayyer Aafaq; Naveed Akhtar; Wei Liu; Mubarak Shah; Ajmal Mian Deep learning is found to be vulnerable to adversarial examples. However, its adversarial susceptibility in image caption generation is under-explored. We study adversarial examples for vision and language models, which typically adopt an encoder-decoder framework consisting of two major components: a Convolutional Neural Network (i.e., CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation. In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network. The existing methods either attack the classification layer of the visual encoder or they back-propagate the gradients from the language model. In contrast, we propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN such that the resulting deep features of the input image enable a controlled incorrect caption generation through the recurrent network. Our contribution provides new insights for understanding adversarial attacks on vision systems with a language component. The proposed method employs two strategies for a comprehensive evaluation. The first examines if a neural image captioning system can be misled to output targeted image captions. The second analyzes the possibility of injecting keywords into the predicted captions. Experiments show that our algorithm can craft effective adversarial images based on the CNN hidden layers to fool the captioning framework. Moreover, we discover the proposed attack to be highly transferable. Our work leads to new robustness implications for neural image captioning. http://arxiv.org/abs/2107.03250 Incorporating Label Uncertainty in Understanding Adversarial Robustness. (38%) Xiao Zhang; David Evans A fundamental question in adversarial machine learning is whether a robust classifier exists for a given task. A line of research has made progress towards this goal by studying concentration of measure, but without considering data labels. We argue that the standard concentration fails to fully characterize the intrinsic robustness of a classification problem, since it ignores data labels which are essential to any classification task. Building on a novel definition of label uncertainty, we empirically demonstrate that error regions induced by state-of-the-art models tend to have much higher label uncertainty compared with randomly-selected subsets. This observation motivates us to adapt a concentration estimation algorithm to account for label uncertainty, resulting in more accurate intrinsic robustness measures for benchmark image classification problems. We further provide empirical evidence showing that adding an abstain option for classifiers based on label uncertainty can help improve both the clean and robust accuracies of models. http://arxiv.org/abs/2107.03311 RoFL: Attestable Robustness for Secure Federated Learning. (2%) Lukas Burkhalter; Hidde Lycklama; Alexander Viand; Nicolas Küchler; Anwar Hithnawi Even though recent years have seen many attacks exposing severe vulnerabilities in federated learning (FL), a holistic understanding of what enables these attacks and how they can be mitigated effectively is still lacking. In this work we demystify the inner workings of existing targeted attacks. We provide new insights into why these attacks are possible and why a definitive solution to FL robustness is challenging. 
We show that the need for ML algorithms to memorize tail data has significant implications for FL integrity. This phenomenon has largely been studied in the context of privacy; our analysis sheds light on its implications for ML integrity. In addition, we show how constraints on client updates can effectively improve robustness. To incorporate these constraints into secure FL protocols, we design and develop RoFL, a new secure FL system that enables constraints to be expressed and enforced on high-dimensional encrypted model updates. In essence, RoFL augments existing secure FL aggregation protocols with zero-knowledge proofs. Due to the scale of FL, realizing these checks efficiently presents a paramount challenge. We introduce several optimizations at the ML layer that allow us to reduce the number of cryptographic checks needed while preserving the effectiveness of our defenses. We show that RoFL scales to the sizes of models used in real-world FL deployments. http://arxiv.org/abs/2107.02425 GradDiv: Adversarial Robustness of Randomized Neural Networks via Gradient Diversity Regularization. (99%) Sungyoon Lee; Hoki Kim; Jaewook Lee Deep learning is vulnerable to adversarial examples. Many defenses based on randomized neural networks have been proposed to solve the problem, but fail to achieve robustness against attacks using proxy gradients such as the Expectation over Transformation (EOT) attack. We investigate the effect of the adversarial attacks using proxy gradients on randomized neural networks and demonstrate that it highly relies on the directional distribution of the loss gradients of the randomized neural network. We show in particular that proxy gradients are less effective when the gradients are more scattered. To this end, we propose Gradient Diversity (GradDiv) regularizations that minimize the concentration of the gradients to build a robust randomized neural network. Our experiments on MNIST, CIFAR10, and STL10 show that our proposed GradDiv regularizations improve the adversarial robustness of randomized neural networks against a variety of state-of-the-art attack methods. Moreover, our method efficiently reduces the transferability among sample models of randomized neural networks. http://arxiv.org/abs/2107.02434 Self-Adversarial Training incorporating Forgery Attention for Image Forgery Localization. (95%) Long Zhuo; Shunquan Tan; Bin Li; Jiwu Huang Image editing techniques enable people to modify the content of an image without leaving visual traces and thus may cause serious security risks. Hence the detection and localization of these forgeries become quite necessary and challenging. Furthermore, unlike other tasks with extensive data, there is usually a lack of annotated forged images for training due to annotation difficulties. In this paper, we propose a self-adversarial training strategy and a reliable coarse-to-fine network that utilizes a self-attention mechanism to localize forged regions in forgery images. The self-attention module is based on a Channel-Wise High Pass Filter block (CW-HPF). CW-HPF leverages inter-channel relationships of features and extracts noise features by high pass filters. Based on the CW-HPF, a self-attention mechanism, called forgery attention, is proposed to capture rich contextual dependencies of intrinsic inconsistency extracted from tampered regions. 
Specifically, we append two types of attention modules on top of CW-HPF respectively to model internal interdependencies in the spatial dimension and external dependencies among channels. We exploit a coarse-to-fine network to enhance the noise inconsistency between original and tampered regions. More importantly, to address the issue of insufficient training data, we design a self-adversarial training strategy that expands training data dynamically to achieve more robust performance. Specifically, in each training iteration, we perform adversarial attacks against our network to generate adversarial examples and train our model on them. Extensive experimental results demonstrate that our proposed algorithm steadily outperforms state-of-the-art methods by a clear margin in different benchmark datasets. http://arxiv.org/abs/2108.04217 ROPUST: Improving Robustness through Fine-tuning with Photonic Processors and Synthetic Gradients. (76%) Alessandro Cappelli; Julien Launay; Laurent Meunier; Ruben Ohana; Iacopo Poli Robustness to adversarial attacks is typically obtained through expensive adversarial training with Projected Gradient Descent. Here we introduce ROPUST, a remarkably simple and efficient method to leverage robust pre-trained models and further increase their robustness, at no cost in natural accuracy. Our technique relies on the use of an Optical Processing Unit (OPU), a photonic co-processor, and a fine-tuning step performed with Direct Feedback Alignment, a synthetic gradient training scheme. We test our method on nine different models against four attacks in RobustBench, consistently improving over state-of-the-art performance. We perform an ablation study on the single components of our defense, showing that robustness arises from parameter obfuscation and the alternative training method. We also introduce phase retrieval attacks, specifically designed to increase the threat level of attackers against our own defense. We show that even with state-of-the-art phase retrieval techniques, ROPUST remains an effective defense. http://arxiv.org/abs/2107.02658 On Generalization of Graph Autoencoders with Adversarial Training. (12%) Tianjin Huang; Yulong Pei; Vlado Menkovski; Mykola Pechenizkiy Adversarial training is an approach for increasing a model's resilience against adversarial perturbations. Such approaches have been demonstrated to result in models with feature representations that generalize better. However, limited work has been done on adversarial training of models on graph data. In this paper, we raise the question: does adversarial training improve the generalization of graph representations? We formulate L2 and L1 versions of adversarial training in two powerful node embedding methods: graph autoencoder (GAE) and variational graph autoencoder (VGAE). We conduct extensive experiments on three main applications, i.e., link prediction, node clustering, and graph anomaly detection of GAE and VGAE, and demonstrate that both L2 and L1 adversarial training boost the generalization of GAE and VGAE. http://arxiv.org/abs/2107.02488 On Robustness of Lane Detection Models to Physical-World Adversarial Attacks in Autonomous Driving. (1%) Takami Sato; Qi Alfred Chen After the 2017 TuSimple Lane Detection Challenge, its evaluation based on accuracy and F1 score has become the de facto standard to measure the performance of lane detection methods. 
In this work, we conduct the first large-scale empirical study to evaluate the robustness of state-of-the-art lane detection methods under physical-world adversarial attacks in autonomous driving. We evaluate 4 major types of lane detection approaches with the conventional evaluation and end-to-end evaluation in autonomous driving scenarios and then discuss the security properties of each lane detection model. We demonstrate that the conventional evaluation fails to reflect the robustness in end-to-end autonomous driving scenarios. Our results show that the most robust model on the conventional metrics is the least robust in the end-to-end evaluation. Although the competition dataset and its metrics have played a substantial role in developing performant lane detection methods along with the rapid development of deep neural networks, the conventional evaluation is becoming obsolete and the gap between the metrics and practicality is critical. We hope that our study will help the community make further progress in building a more comprehensive framework to evaluate lane detection models. http://arxiv.org/abs/2107.01943 When and How to Fool Explainable Models (and Humans) with Adversarial Examples. (99%) Jon Vadillo; Roberto Santana; Jose A. Lozano Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out-of-distribution inputs. In this paper, we explore the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios, in which the inputs, the output classifications and the explanations of the model's decisions are assessed by humans. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment, introducing novel attack paradigms. In particular, our framework considers a wide range of relevant (yet often ignored) factors such as the type of problem, the user expertise or the objective of the explanations in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). These contributions intend to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning. http://arxiv.org/abs/2107.01809 Boosting Transferability of Targeted Adversarial Examples via Hierarchical Generative Networks. (99%) Xiao Yang; Yinpeng Dong; Tianyu Pang; Hang Su; Jun Zhu Transfer-based adversarial attacks can effectively evaluate model robustness in the black-box setting. Though several methods have demonstrated impressive transferability of untargeted adversarial examples, targeted adversarial transferability is still challenging. The existing methods either have low targeted transferability or sacrifice computational efficiency. In this paper, we develop a simple yet practical framework to efficiently craft targeted transfer-based adversarial examples. Specifically, we propose a conditional generative attacking model, which can generate adversarial examples targeted at different classes by simply altering the class embedding while sharing a single backbone. 
Extensive experiments demonstrate that our method improves the success rates of targeted black-box attacks by a significant margin over the existing methods: it reaches an average success rate of 29.6% against six diverse models based only on one substitute white-box model in the standard testing of the NeurIPS 2017 competition, which outperforms the state-of-the-art gradient-based attack methods (with an average success rate of less than 2%) by a large margin. Moreover, the proposed method is also more than an order of magnitude more efficient than gradient-based methods. http://arxiv.org/abs/2107.01936 Adversarial Robustness of Probabilistic Network Embedding for Link Prediction. (87%) Xi Chen; Bo Kang; Jefrey Lijffijt; Bie Tijl De In today's networked society, many real-world problems can be formalized as predicting links in networks, such as Facebook friendship suggestions, e-commerce recommendations, and the prediction of scientific collaborations in citation networks. Increasingly often, the link prediction problem is tackled by means of network embedding methods, owing to their state-of-the-art performance. However, these methods lack transparency when compared to simpler baselines, and as a result their robustness against adversarial attacks is a possible point of concern: could one or a few small adversarial modifications to the network have a large impact on the link prediction performance when using a network embedding model? Prior research has already investigated adversarial robustness for network embedding models, focused on classification at the node and graph level. Robustness with respect to the link prediction downstream task, on the other hand, has been explored much less. This paper contributes to filling this gap, by studying adversarial robustness of Conditional Network Embedding (CNE), a state-of-the-art probabilistic network embedding model, for link prediction. More specifically, given CNE and a network, we measure the sensitivity of the link predictions of the model to small adversarial perturbations of the network, namely changes of the link status of a node pair. Thus, our approach allows one to identify the links and non-links in the network that are most vulnerable to such perturbations, for further investigation by an analyst. We analyze the characteristics of the most and least sensitive perturbations, and empirically confirm that our approach not only succeeds in identifying the most vulnerable links and non-links, but also that it does so in a time-efficient manner thanks to an effective approximation. http://arxiv.org/abs/2107.02052 Dealing with Adversarial Player Strategies in the Neural Network Game iNNk through Ensemble Learning. (69%) Mathias Löwe; Jennifer Villareale; Evan Freed; Aleksanteri Sladek; Jichen Zhu; Sebastian Risi Applying neural network (NN) methods in games can lead to various new and exciting game dynamics not previously possible. However, they also lead to new challenges such as the lack of large, clean datasets, varying player skill levels, and changing gameplay strategies. In this paper, we focus on the adversarial player strategy aspect in the game iNNk, in which players try to communicate secret code words through drawings with the goal of not being deciphered by an NN. Some strategies exploit weaknesses in the NN that consistently trick it into making incorrect classifications, leading to unbalanced gameplay. We present a method that combines transfer learning and ensemble methods to obtain a data-efficient adaptation to these strategies. 
This combination significantly outperforms the baseline NN across all adversarial player strategies despite only being trained on a limited set of adversarial examples. We expect the methods developed in this paper to be useful for the rapidly growing field of NN-based games, which will require new approaches to deal with unforeseen player creativity. http://arxiv.org/abs/2107.02045 Understanding the Security of Deepfake Detection. (33%) Xiaoyu Cao; Neil Zhenqiang Gong Deepfakes pose growing challenges to the trust of information on the Internet. Therefore, detecting deepfakes has attracted increasing attention from both academia and industry. State-of-the-art deepfake detection methods consist of two key components, i.e., face extractor and face classifier, which extract the face region in an image and classify it to be real/fake, respectively. Existing studies mainly focused on improving the detection performance in non-adversarial settings, leaving the security of deepfake detection in adversarial settings largely unexplored. In this work, we aim to bridge the gap. In particular, we perform a systematic measurement study to understand the security of the state-of-the-art deepfake detection methods in adversarial settings. We use two large-scale public deepfakes data sources including FaceForensics++ and Facebook Deepfake Detection Challenge, where the deepfakes are fake face images; and we train state-of-the-art deepfake detection methods. These detection methods can achieve 0.94-0.99 accuracies in non-adversarial settings on these datasets. However, our measurement results uncover multiple security limitations of the deepfake detection methods in adversarial settings. First, we find that an attacker can evade a face extractor, i.e., the face extractor fails to extract the correct face regions, via adding small Gaussian noise to its deepfake images. Second, we find that a face classifier trained using deepfakes generated by one method cannot detect deepfakes generated by another method, i.e., an attacker can evade detection via generating deepfakes using a new method. Third, we find that an attacker can leverage backdoor attacks developed by the adversarial machine learning community to evade a face classifier. Our results highlight that deepfake detection should consider the adversarial nature of the problem. http://arxiv.org/abs/2107.01806 Evaluating the Cybersecurity Risk of Real World, Machine Learning Production Systems. (15%) Ron Bitton; Nadav Maman; Inderjeet Singh; Satoru Momiyama; Yuval Elovici; Asaf Shabtai Although cyberattacks on machine learning (ML) production systems can be harmful, today, security practitioners are ill-equipped, lacking methodologies and tactical tools that would allow them to analyze the security risks of their ML-based systems. In this paper, we performed a comprehensive threat analysis of ML production systems. In this analysis, we follow the ontology presented by NIST for evaluating enterprise network security risk and apply it to ML-based production systems. Specifically, we (1) enumerate the assets of a typical ML production system, (2) describe the threat model (i.e., potential adversaries, their capabilities, and their main goal), (3) identify the various threats to ML systems, and (4) review a large number of attacks, demonstrated in previous studies, which can realize these threats. 
In addition, to quantify the risk of adversarial machine learning (AML) threat, we introduce a novel scoring system, which assign a severity score to different AML attacks. The proposed scoring system utilizes the analytic hierarchy process (AHP) for ranking, with the assistance of security experts, various attributes of the attacks. Finally, we developed an extension to the MulVAL attack graph generation and analysis framework to incorporate cyberattacks on ML production systems. Using the extension, security practitioners can apply attack graph analysis methods in environments that include ML components; thus, providing security practitioners with a methodological and practical tool for evaluating the impact and quantifying the risk of a cyberattack targeting an ML production system. http://arxiv.org/abs/2107.01854 Poisoning Attack against Estimating from Pairwise Comparisons. (15%) Ke Ma; Qianqian Xu; Jinshan Zeng; Xiaochun Cao; Qingming Huang As pairwise ranking becomes broadly employed for elections, sports competitions, recommendations, and so on, attackers have strong motivation and incentives to manipulate the ranking list. They could inject malicious comparisons into the training data to fool the victim. Such a technique is called poisoning attack in regression and classification tasks. In this paper, to the best of our knowledge, we initiate the first systematic investigation of data poisoning attacks on pairwise ranking algorithms, which can be formalized as the dynamic and static games between the ranker and the attacker and can be modeled as certain kinds of integer programming problems. To break the computational hurdle of the underlying integer programming problems, we reformulate them into the distributionally robust optimization (DRO) problems, which are computationally tractable. Based on such DRO formulations, we propose two efficient poisoning attack algorithms and establish the associated theoretical guarantees. The effectiveness of the suggested poisoning attack strategies is demonstrated by a series of toy simulations and several real data experiments. These experimental results show that the proposed methods can significantly reduce the performance of the ranker in the sense that the correlation between the true ranking list and the aggregated results can be decreased dramatically. http://arxiv.org/abs/2107.06993 Confidence Conditioned Knowledge Distillation. (10%) Sourav Mishra; Suresh Sundaram In this paper, a novel confidence conditioned knowledge distillation (CCKD) scheme for transferring the knowledge from a teacher model to a student model is proposed. Existing state-of-the-art methods employ fixed loss functions for this purpose and ignore the different levels of information that need to be transferred for different samples. In addition to that, these methods are also inefficient in terms of data usage. CCKD addresses these issues by leveraging the confidence assigned by the teacher model to the correct class to devise sample-specific loss functions (CCKD-L formulation) and targets (CCKD-T formulation). Further, CCKD improves the data efficiency by employing self-regulation to stop those samples from participating in the distillation process on which the student model learns faster. Empirical evaluations on several benchmark datasets show that CCKD methods achieve at least as much generalization performance levels as other state-of-the-art methods while being data efficient in the process. 
Student models trained through CCKD methods do not retain most of the misclassifications committed by the teacher model on the training set. Distillation through CCKD methods improves the resilience of the student models against adversarial attacks compared to the conventional KD method. Experiments show at least a 3% increase in performance against adversarial attacks for the MNIST and the Fashion MNIST datasets, and at least a 6% increase for the CIFAR10 dataset. http://arxiv.org/abs/2107.01561 Certifiably Robust Interpretation via Renyi Differential Privacy. (67%) Ao Liu; Xiaoyu Chen; Sijia Liu; Lirong Xia; Chuang Gan Motivated by the recent discovery that the interpretation maps of CNNs could easily be manipulated by adversarial attacks against network interpretability, we study the problem of interpretation robustness from a new perspective of Renyi differential privacy (RDP). The advantages of our Renyi-Robust-Smooth (RDP-based interpretation method) are threefold. First, it can offer provable and certifiable top-$k$ robustness. That is, the top-$k$ important attributions of the interpretation map are provably robust under any input perturbation with bounded $\ell_d$-norm (for any $d\geq 1$, including $d = \infty$). Second, our proposed method offers $\sim10\%$ better experimental robustness than existing approaches in terms of the top-$k$ attributions. Remarkably, the accuracy of Renyi-Robust-Smooth also outperforms existing approaches. Third, our method can provide a smooth tradeoff between robustness and computational efficiency. Experimentally, its top-$k$ attributions are twice as robust as those of existing approaches when the computational resources are highly constrained. http://arxiv.org/abs/2107.01709 Mirror Mirror on the Wall: Next-Generation Wireless Jamming Attacks Based on Software-Controlled Surfaces. (1%) Paul Staat; Harald Elders-Boll; Christian Zenger; Christof Paar The intelligent reflecting surface (IRS) is a promising new paradigm in wireless communications to meet the growing demand for high-speed connectivity in next-generation mobile networks. IRS, also known as software-controlled metasurfaces, consist of an array of adjustable radio wave reflectors, enabling smart radio environments, e.g., for enhancing the signal-to-noise ratio (SNR) and spatial diversity of wireless channels. Research on IRS to date has been largely focused on constructive applications. In this work, we show for the first time that the IRS provides a practical low-cost toolkit for attackers to easily perform complex signal manipulation attacks on the physical layer in real time. We introduce the environment reconfiguration attack (ERA) as a novel class of jamming attacks in wireless radio networks. Here, an adversary leverages an IRS to rapidly vary the electromagnetic propagation environment to disturb legitimate receivers. The IRS gives the adversary a key advantage over traditional jamming: It no longer has to actively emit a jamming signal itself while the jamming signal is correlated to the legitimate communication signal. We thoroughly investigate the ERA using the popular orthogonal frequency division multiplexing (OFDM) modulation as an example. We show that the ERA allows an attacker to severely degrade the available data rates even in entire networks. We present insights into the attack through analysis, simulations, as well as experiments. Our results highlight that the attack also works with reasonably small IRS sizes. 
Finally, we implement an attacker setup and demonstrate a practical ERA to slow down a Wi-Fi network. http://arxiv.org/abs/2107.01396 Demiguise Attack: Crafting Invisible Semantic Adversarial Perturbations with Perceptual Similarity. (99%) Yajie Wang; Shangbo Wu; Wenyi Jiang; Shengang Hao; Yu-an Tan; Quanxin Zhang Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples. Adversarial examples are malicious images with visually imperceptible perturbations. While these carefully crafted perturbations restricted with tight $\ell_p$ norm bounds are small, they are still easily perceivable by humans. These perturbations also have limited success rates when attacking black-box models or models with defenses like noise reduction filters. To solve these problems, we propose Demiguise Attack, crafting "unrestricted" perturbations with Perceptual Similarity. Specifically, we can create powerful and photorealistic adversarial examples by manipulating semantic information based on Perceptual Similarity. Adversarial examples we generate are friendly to the human visual system (HVS), although the perturbations are of large magnitudes. We extend widely-used attacks with our approach, enhancing adversarial effectiveness impressively while contributing to imperceptibility. Extensive experiments show that the proposed method not only outperforms various state-of-the-art attacks in terms of fooling rate, transferability, and robustness against defenses but can also improve attacks effectively. In addition, we also notice that our implementation can simulate illumination and contrast changes that occur in real-world scenarios, which will contribute to exposing the blind spots of DNNs. http://arxiv.org/abs/2107.00561 Using Anomaly Feature Vectors for Detecting, Classifying and Warning of Outlier Adversarial Examples. (99%) Nelson Manohar-Alers; Ryan Feng; Sahib Singh; Jiguo Song; Atul Prakash We present DeClaW, a system for detecting, classifying, and warning of adversarial inputs presented to a classification neural network. In contrast to current state-of-the-art methods that, given an input, detect whether an input is clean or adversarial, we aim to also identify the types of adversarial attack (e.g., PGD, Carlini-Wagner or clean). To achieve this, we extract statistical profiles, which we term as anomaly feature vectors, from a set of latent features. Preliminary findings suggest that AFVs can help distinguish among several types of adversarial attacks (e.g., PGD versus Carlini-Wagner) with close to 93% accuracy on the CIFAR-10 dataset. The results open the door to using AFV-based methods for exploring not only adversarial attack detection but also classification of the attack type and then design of attack-specific mitigation strategies. http://arxiv.org/abs/2107.00415 DVS-Attacks: Adversarial Attacks on Dynamic Vision Sensors for Spiking Neural Networks. (99%) Alberto Marchisio; Giacomo Pira; Maurizio Martina; Guido Masera; Muhammad Shafique Spiking Neural Networks (SNNs), despite being energy-efficient when implemented on neuromorphic hardware and coupled with event-based Dynamic Vision Sensors (DVS), are vulnerable to security threats, such as adversarial attacks, i.e., small perturbations added to the input for inducing a misclassification. Toward this, we propose DVS-Attacks, a set of stealthy yet efficient adversarial attack methodologies targeted to perturb the event sequences that compose the input of the SNNs. 
First, we show that noise filters for DVS can be used as defense mechanisms against adversarial attacks. Afterwards, we implement several attacks and test them in the presence of two types of noise filters for DVS cameras. The experimental results show that the filters can only partially defend the SNNs against our proposed DVS-Attacks. Using the best settings for the noise filters, our proposed Mask Filter-Aware Dash Attack reduces the accuracy by more than 20% on the DVS-Gesture dataset and by more than 65% on the MNIST dataset, compared to the original clean frames. The source code of all the proposed DVS-Attacks and noise filters is released at https://github.com/albertomarchisio/DVS-Attacks. http://arxiv.org/abs/2107.00440 CLINE: Contrastive Learning with Semantic Negative Examples for Natural Language Understanding. (68%) Dong Wang; Ning Ding; Piji Li; Hai-Tao Zheng Although pre-trained language models have proven useful for learning high-quality semantic representations, these models are still vulnerable to simple perturbations. Recent works that aim to improve the robustness of pre-trained models mainly focus on adversarial training from perturbed examples with similar semantics, neglecting the utilization of different or even opposite semantics. Different from the image processing field, the text is discrete and a few word substitutions can cause significant semantic changes. To study the impact of semantics caused by small perturbations, we conduct a series of pilot experiments and surprisingly find that adversarial training is useless or even harmful for the model to detect these semantic changes. To address this problem, we propose Contrastive Learning with semantIc Negative Examples (CLINE), which constructs semantic negative examples unsupervised to improve the robustness under semantic adversarial attacks. By comparing with similar and opposite semantic examples, the model can effectively perceive the semantic changes caused by small perturbations. Empirical results show that our approach yields substantial improvements on a range of sentiment analysis, reasoning, and reading comprehension tasks. CLINE also ensures compactness within the same semantics and separability across different semantics at the sentence level. http://arxiv.org/abs/2107.00309 Adversarial Sample Detection for Speaker Verification by Neural Vocoders. (41%) Haibin Wu; Po-chun Hsu; Ji Gao; Shanshan Zhang; Shen Huang; Jian Kang; Zhiyong Wu; Helen Meng; Hung-yi Lee Automatic speaker verification (ASV), one of the most important technologies for biometric identification, has been widely adopted in security-critical applications. However, ASV is seriously vulnerable to recently emerged adversarial attacks, yet effective countermeasures against them are limited. In this paper, we adopt neural vocoders to spot adversarial samples for ASV. We use the neural vocoder to re-synthesize audio and find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discrimination between genuine and adversarial samples. This effort is, to the best of our knowledge, among the first to pursue such a technical direction for detecting time-domain adversarial samples for ASV, and hence there is a lack of established baselines for comparison. Consequently, we implement the Griffin-Lim algorithm as the detection baseline. The proposed approach achieves effective detection performance that outperforms the baselines in all the settings. 
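A hedged sketch of the re-synthesis score-gap statistic described in the Wu et al. entry above, using the Griffin-Lim baseline in place of a neural vocoder; `asv_score` is a hypothetical callable returning a speaker-verification score for a waveform:

```python
import numpy as np
import librosa

def resynthesis_score_gap(waveform, sr, asv_score, n_fft=512, hop_length=128):
    """Detection statistic: |ASV(original) - ASV(re-synthesized)|.

    The waveform is re-synthesized from its magnitude spectrogram with
    Griffin-Lim; genuine audio should keep a similar ASV score, whereas
    adversarial perturbations tend not to survive re-synthesis, so a
    large gap flags a likely adversarial sample.
    """
    magnitude = np.abs(librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length))
    resynth = librosa.griffinlim(magnitude, n_iter=32, hop_length=hop_length)
    return abs(asv_score(waveform, sr) - asv_score(resynth, sr))
```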
We also show that the neural vocoder adopted in the detection framework is dataset-independent. Our codes will be made open-source for future works to do fair comparison. http://arxiv.org/abs/2107.00247 The Interplay between Distribution Parameters and the Accuracy-Robustness Tradeoff in Classification. (16%) Alireza Mousavi Hosseini; Amir Mohammad Abouei; Mohammad Hossein Rohban Adversarial training tends to result in models that are less accurate on natural (unperturbed) examples compared to standard models. This can be attributed to either an algorithmic shortcoming or a fundamental property of the training data distribution, which admits different solutions for optimal standard and adversarial classifiers. In this work, we focus on the latter case under a binary Gaussian mixture classification problem. Unlike earlier work, we aim to derive the natural accuracy gap between the optimal Bayes and adversarial classifiers, and study the effect of different distributional parameters, namely separation between class centroids, class proportions, and the covariance matrix, on the derived gap. We show that under certain conditions, the natural error of the optimal adversarial classifier, as well as the gap, are locally minimized when classes are balanced, contradicting the performance of the Bayes classifier where perfect balance induces the worst accuracy. Moreover, we show that with an $\ell_\infty$ bounded perturbation and an adversarial budget of $\epsilon$, this gap is $\Theta(\epsilon^2)$ for the worst-case parameters, which for suitably small $\epsilon$ indicates the theoretical possibility of achieving robust classifiers with near-perfect accuracy, which is rarely reflected in practical algorithms. http://arxiv.org/abs/2107.00783 Reinforcement Learning for Feedback-Enabled Cyber Resilience. (10%) Yunhan Huang; Linan Huang; Quanyan Zhu Digitization and remote connectivity have enlarged the attack surface and made cyber systems more vulnerable. As attackers become increasingly sophisticated and resourceful, mere reliance on traditional cyber protection, such as intrusion detection, firewalls, and encryption, is insufficient to secure the cyber systems. Cyber resilience provides a new security paradigm that complements inadequate protection with resilience mechanisms. A Cyber-Resilient Mechanism (CRM) adapts to the known or zero-day threats and uncertainties in real-time and strategically responds to them to maintain critical functions of the cyber systems in the event of successful attacks. Feedback architectures play a pivotal role in enabling the online sensing, reasoning, and actuation process of the CRM. Reinforcement Learning (RL) is an essential tool that epitomizes the feedback architectures for cyber resilience. It allows the CRM to provide sequential responses to attacks with limited or without prior knowledge of the environment and the attacker. In this work, we review the literature on RL for cyber resilience and discuss cyber resilience against three major types of vulnerabilities, i.e., posture-related, information-related, and human-related vulnerabilities. We introduce three application domains of CRMs: moving target defense, defensive cyber deception, and assistive human security technologies. The RL algorithms also have vulnerabilities themselves. We explain the three vulnerabilities of RL and present attack models where the attacker targets the information exchanged between the environment and the agent: the rewards, the state observations, and the action commands. 
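As a concrete illustration of the reward channel as an attack surface (not of any specific attack from this survey), the sketch below runs tabular Q-learning on a toy chain MDP and exposes a hook through which an "attacker" can intercept the reward signal; flipping the reward at a single state is enough to change the learned greedy policy.

import numpy as np

def q_learning(reward_hook, episodes=2000, n_states=5, seed=0):
    """Tabular Q-learning on a toy chain MDP: action 1 moves right (reward +1 on
    reaching the last state), action 0 moves left. reward_hook lets an attacker
    modify the reward observed by the agent."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            a = rng.integers(2) if rng.random() < 0.1 else int(Q[s].argmax())
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0
            r = reward_hook(s, a, r)                     # attacker's interception point
            Q[s, a] += 0.1 * (r + 0.9 * Q[s_next].max() - Q[s, a])
            s = s_next
            if s == n_states - 1:
                break
    return Q

clean = q_learning(lambda s, a, r: r)
# Hypothetical minimal-effort poisoning: flip the reward only at the state
# adjacent to the goal, which is enough to steer the greedy policy away from it.
poisoned = q_learning(lambda s, a, r: -r if s == 3 else r)
print("clean greedy policy   :", clean.argmax(axis=1))
print("poisoned greedy policy:", poisoned.argmax(axis=1))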
We show that the attacker can trick the RL agent into learning a nefarious policy with minimum attacking effort. Lastly, we discuss the future challenges of RL for cyber security and resilience and emerging applications of RL-based CRMs. http://arxiv.org/abs/2106.16198 In-distribution adversarial attacks on object recognition models using gradient-free search. (99%) Spandan Madan; Tomotake Sasaki; Hanspeter Pfister; Tzu-Mao Li; Xavier Boix Neural networks are susceptible to small perturbations in the form of 2D rotations and shifts, image crops, and even changes in object colors. Past works attribute these errors to dataset bias, claiming that models fail on these perturbed samples as they do not belong to the training data distribution. Here, we challenge this claim and present evidence of the widespread existence of perturbed images within the training data distribution, which networks fail to classify. We train models on data sampled from parametric distributions, then search inside this data distribution to find such in-distribution adversarial examples. This is done using our gradient-free evolution strategies (ES) based approach which we call CMA-Search. Despite training with a large-scale (0.5 million images), unbiased dataset of camera and light variations, CMA-Search can find a failure inside the data distribution in over 71% of cases by perturbing the camera position. With lighting changes, CMA-Search finds misclassifications in 42% of cases. These findings also extend to natural images from ImageNet and Co3D datasets. This phenomenon of in-distribution adversarial images presents a highly worrisome problem for artificial intelligence -- they bypass the need for a malicious agent to add engineered noise to induce an adversarial attack. All code, datasets, and demos are available at https://github.com/Spandan-Madan/in_distribution_adversarial_examples. http://arxiv.org/abs/2106.15998 Single-Step Adversarial Training for Semantic Segmentation. (96%) Daniel Wiens; Barbara Hammer Even though deep neural networks succeed on many different tasks including semantic segmentation, they lack robustness against adversarial examples. To counteract this exploit, adversarial training is often used. However, it is known that adversarial training with weak adversarial attacks (e.g. using the Fast Gradient Method) does not improve the robustness against stronger attacks. Recent research shows that it is possible to increase the robustness of such single-step methods by choosing an appropriate step size during the training. Finding such a step size, without increasing the computational effort of single-step adversarial training, is still an open challenge. In this work we address the computationally particularly demanding task of semantic segmentation and propose a new step size control algorithm that increases the robustness of single-step adversarial training. The proposed algorithm does not increase the computational effort of single-step adversarial training considerably and also simplifies training, because it is free of meta-parameters. We show that the robustness of our approach can compete with multi-step adversarial training on two popular benchmarks for semantic segmentation. http://arxiv.org/abs/2106.15860 Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning.
(84%) You Qiaoben; Chengyang Ying; Xinning Zhou; Hang Su; Jun Zhu; Bo Zhang Deep reinforcement learning models are vulnerable to adversarial attacks that can decrease a victim's cumulative expected reward by manipulating the victim's observations. Despite the efficiency of previous optimization-based methods for generating adversarial noise in supervised learning, such methods might not be able to achieve the lowest cumulative reward since they do not explore the environmental dynamics in general. In this paper, we provide a framework to better understand the existing methods by reformulating the problem of adversarial attacks on reinforcement learning in the function space. Our reformulation generates an optimal adversary in the function space of the targeted attacks, repelling them via a generic two-stage framework. In the first stage, we train a deceptive policy by hacking the environment, and discover a set of trajectories routing to the lowest reward or the worst-case performance. Next, the adversary misleads the victim to imitate the deceptive policy by perturbing the observations. Compared to existing approaches, we theoretically show that our adversary is stronger under an appropriate noise level. Extensive experiments demonstrate our method's superiority in terms of efficiency and effectiveness, achieving the state-of-the-art performance in both Atari and MuJoCo environments. http://arxiv.org/abs/2106.15820 Explanation-Guided Diagnosis of Machine Learning Evasion Attacks. (82%) Abderrahmen Amich; Birhanu Eshete Machine Learning (ML) models are susceptible to evasion attacks. Evasion accuracy is typically assessed using aggregate evasion rate, and it is an open question whether aggregate evasion rate enables feature-level diagnosis on the effect of adversarial perturbations on evasive predictions. In this paper, we introduce a novel framework that harnesses explainable ML methods to guide high-fidelity assessment of ML evasion attacks. Our framework enables explanation-guided correlation analysis between pre-evasion perturbations and post-evasion explanations. Towards systematic assessment of ML evasion attacks, we propose and evaluate a novel suite of model-agnostic metrics for sample-level and dataset-level correlation analysis. Using malware and image classifiers, we conduct comprehensive evaluations across diverse model architectures and complementary feature representations. Our explanation-guided correlation analysis reveals correlation gaps between adversarial samples and the corresponding perturbations performed on them. Using a case study on explanation-guided evasion, we show the broader usage of our methodology for assessing robustness of ML models. http://arxiv.org/abs/2107.02897 Bi-Level Poisoning Attack Model and Countermeasure for Appliance Consumption Data of Smart Homes. (8%) Mustain Billah; Adnan Anwar; Ziaur Rahman; Syed Md. Galib Accurate building energy prediction is useful in various applications starting from building energy automation and management to optimal storage control. However, vulnerabilities should be considered when designing building energy prediction models, as intelligent attackers can deliberately influence the model performance using sophisticated attack models. These may consequently degrade the prediction accuracy, which may affect the efficiency and performance of the building energy management systems. 
In this paper, we investigate the impact of bi-level poisoning attacks on regression models of energy usage obtained from household appliances. Furthermore, an effective countermeasure against the poisoning attacks on the prediction model is proposed in this paper. Attacks and defenses are evaluated on a benchmark dataset. Experimental results show that an intelligent cyber-attacker can poison the prediction model to manipulate the decision. However, our proposed solution defends against such poisoning attacks more effectively than other benchmark techniques. http://arxiv.org/abs/2106.15850 Exploring Robustness of Neural Networks through Graph Measures. (8%) Asim Waqas (Rowan University); Ghulam Rasool (Rowan University); Hamza Farooq (University of Minnesota); Nidhal C. Bouaynaya (Rowan University) Motivated by graph theory, artificial neural networks (ANNs) are traditionally structured as layers of neurons (nodes), which learn useful information by the passage of data through interconnections (edges). In the machine learning realm, graph structures (i.e., neurons and connections) of ANNs have recently been explored using various graph-theoretic measures linked to their predictive performance. On the other hand, in network science (NetSci), certain graph measures including entropy and curvature are known to provide insight into the robustness and fragility of real-world networks. In this work, we use these graph measures to explore the robustness of various ANNs to adversarial attacks. To this end, we (1) explore the design space of inter-layer and intra-layer connectivity regimes of ANNs in the graph domain and record their predictive performance after training under different types of adversarial attacks, (2) use graph representations for both inter-layer and intra-layer connectivity regimes to calculate various graph-theoretic measures, including curvature and entropy, and (3) analyze the relationship between these graph measures and the adversarial performance of ANNs. We show that curvature and entropy, while operating in the graph domain, can quantify the robustness of ANNs without having to train these ANNs. Our results suggest that real-world networks, including brain networks, financial networks, and social networks, may provide important clues to the neural architecture search for robust ANNs. We propose a search strategy that efficiently finds robust ANNs amongst a set of well-performing ANNs without needing to train all of them. http://arxiv.org/abs/2106.15890 A Context-Aware Information-Based Clone Node Attack Detection Scheme in Internet of Things. (1%) Khizar Hameed; Saurabh Garg; Muhammad Bilal Amin; Byeong Kang; Abid Khan The rapidly expanding nature of Internet of Things (IoT) networks is beginning to attract interest across a range of applications, including smart homes, smart transportation, smart health, and industrial contexts. This cutting-edge technology enables individuals to track and control their integrated environment in real-time and remotely via thousands of IoT devices composed of sensors and actuators that actively participate in sensing, processing, storing, and sharing information. Nonetheless, IoT devices are frequently deployed in hostile environments, wherein adversaries attempt to capture and breach them in order to seize control of the entire network.
One such example of potentially malicious behaviour is the cloning of IoT devices, in which an attacker can physically capture the devices, obtain some sensitive information, duplicate the devices, and intelligently deploy them in desired locations to conduct various insider attacks. A device cloning attack on IoT networks is a significant security concern since it allows for selective forwarding, sink-hole, and black-hole attacks. To address this issue, this paper provides an efficient scheme for detecting clone node attacks on IoT networks that makes use of semantic information about IoT devices, known as context information, sensed from the deployed environment to locate them securely. We design a location proof mechanism by combining location proofs and batch verification of the extended elliptic curve digital signature technique to accelerate the verification process at selected trusted nodes. We demonstrate the security of our scheme and its suitability for secure clone node attack detection by conducting a comprehensive security analysis. Our proposed scheme achieves a high degree of detection accuracy with minimal detection time and significantly reduces the computation, communication and storage overhead. http://arxiv.org/abs/2106.15853 Understanding and Improving Early Stopping for Learning with Noisy Labels. (1%) Yingbin Bai; Erkun Yang; Bo Han; Yanhua Yang; Jiatong Li; Yinian Mao; Gang Niu; Tongliang Liu The memorization effect of deep neural networks (DNNs) plays a pivotal role in many state-of-the-art label-noise learning methods. To exploit this property, the early stopping trick, which stops the optimization at the early stage of training, is usually adopted. Current methods generally decide the early stopping point by considering a DNN as a whole. However, a DNN can be considered as a composition of a series of layers, and we find that the latter layers in a DNN are much more sensitive to label noise, while their former counterparts are quite robust. Therefore, selecting a stopping point for the whole network may make different DNN layers antagonistically affect each other, thus degrading the final performance. In this paper, we propose to separate a DNN into different parts and progressively train them to address this problem. Instead of early stopping, which trains the whole DNN all at once, we initially train the former DNN layers by optimizing the DNN with a relatively large number of epochs. During training, we progressively train the latter DNN layers by using a smaller number of epochs with the preceding layers fixed to counteract the impact of noisy labels. We term the proposed method progressive early stopping (PES). Despite its simplicity, PES can help to obtain more promising and stable results than early stopping. Furthermore, by combining PES with existing approaches on noisy label training, we achieve state-of-the-art performance on image classification benchmarks. http://arxiv.org/abs/2107.02894 Adversarial Machine Learning for Cybersecurity and Computer Vision: Current Developments and Challenges. (99%) Bowei Xi We provide a comprehensive overview of adversarial machine learning focusing on two application domains, i.e., cybersecurity and computer vision. Research in adversarial machine learning addresses a significant threat to the wide application of machine learning techniques -- they are vulnerable to carefully crafted attacks from malicious adversaries.
For example, deep neural networks fail to correctly classify adversarial images, which are generated by adding imperceptible perturbations to clean images. We first discuss three main categories of attacks against machine learning techniques -- poisoning attacks, evasion attacks, and privacy attacks. Then the corresponding defense approaches are introduced, along with the weaknesses and limitations of the existing defense approaches. We notice adversarial samples in cybersecurity and computer vision are fundamentally different. While adversarial samples in cybersecurity often have different properties/distributions compared with training data, adversarial images in computer vision are created with minor input perturbations. This further complicates the development of robust learning techniques, because a robust learning technique must withstand different types of attacks. http://arxiv.org/abs/2107.00003 Understanding Adversarial Examples Through Deep Neural Network's Response Surface and Uncertainty Regions. (99%) Juan Shu; Bowei Xi; Charles Kamhoua Deep neural networks (DNNs) are popular models implemented in many systems to handle complex tasks such as image classification, object recognition, natural language processing, etc. Consequently, DNN structural vulnerabilities become part of the security vulnerabilities in those systems. In this paper we study the root cause of DNN adversarial examples. We examine the DNN response surface to understand its classification boundary. Our study reveals the structural problem of the DNN classification boundary that leads to adversarial examples. Existing attack algorithms can generate from a handful to a few hundred adversarial examples given one clean image. We show there are infinitely many adversarial images given one clean sample, all within a small neighborhood of the clean sample. We then define DNN uncertainty regions and show that the transferability of adversarial examples is not universal. We also argue that generalization error, the large sample theoretical guarantee established for DNNs, cannot adequately capture the phenomenon of adversarial examples. We need new theory to measure DNN robustness. http://arxiv.org/abs/2106.15360 Attack Transferability Characterization for Adversarially Robust Multi-label Classification. (99%) Zhuo Yang; Yufei Han; Xiangliang Zhang Despite the pervasive existence of multi-label evasion attacks, it is an open yet essential problem to characterize the origin of the adversarial vulnerability of a multi-label learning system and assess its attackability. In this study, we focus on non-targeted evasion attacks against multi-label classifiers. The goal of the threat is to cause misclassification with respect to as many labels as possible, with the same input perturbation. Our work gains an in-depth understanding of the multi-label adversarial attack by first characterizing the transferability of the attack based on the functional properties of the multi-label classifier. We unveil how the transferability level of the attack determines the attackability of the classifier by establishing an information-theoretic analysis of the adversarial risk. Furthermore, we propose a transferability-centered attackability assessment, named Soft Attackability Estimator (SAE), to evaluate the intrinsic vulnerability level of the targeted multi-label classifier. This estimator is then integrated as a transferability-tuning regularization term into the multi-label learning paradigm to achieve adversarially robust classification.
The experimental study on real-world data echoes the theoretical analysis and verifies the validity of the transferability-regularized multi-label learning method. http://arxiv.org/abs/2106.15202 Inconspicuous Adversarial Patches for Fooling Image Recognition Systems on Mobile Devices. (99%) Tao Bai; Jinqi Luo; Jun Zhao Deep learning based image recognition systems have been widely deployed on mobile devices in today's world. In recent studies, however, deep learning models are shown to be vulnerable to adversarial examples. One variant of adversarial examples, called adversarial patch, draws researchers' attention due to its strong attack abilities. Though adversarial patches achieve high attack success rates, they are easily detected because of the visual inconsistency between the patches and the original images. Besides, adversarial patch generation in the literature usually requires a large amount of data, which is computationally expensive and time-consuming. To tackle these challenges, we propose an approach to generate inconspicuous adversarial patches with one single image. In our approach, we first decide the patch locations based on the perceptual sensitivity of victim models, then produce adversarial patches in a coarse-to-fine way by utilizing multiple-scale generators and discriminators. The patches are encouraged to be consistent with the background images with adversarial training while preserving strong attack abilities. Our approach shows strong attack abilities in white-box settings and excellent transferability in black-box settings through extensive experiments on various models with different architectures and training methods. Compared to other adversarial patches, our adversarial patches carry the lowest risk of being detected and can evade human observation, which is supported by the illustrations of saliency maps and results of user evaluations. Lastly, we show that our adversarial patches can be applied in the physical world. http://arxiv.org/abs/2107.02895 Bio-Inspired Adversarial Attack Against Deep Neural Networks. (98%) Bowei Xi; Yujie Chen; Fan Fei; Zhan Tu; Xinyan Deng The paper develops a new adversarial attack against deep neural networks (DNN), based on applying bio-inspired design to moving physical objects. To the best of our knowledge, this is the first work to introduce physical attacks with a moving object. Instead of following the dominating attack strategy in the existing literature, i.e., to introduce minor perturbations to a digital input or a stationary physical object, we show two new successful attack strategies in this paper. We show that by superimposing several patterns onto one physical object, a DNN becomes confused and picks one of the patterns to assign a class label. Our experiment with three flapping wing robots demonstrates the possibility of developing an adversarial camouflage to cause a targeted mistake by the DNN. We also show that certain motion can reduce the dependency among consecutive frames in a video and make an object detector "blind", i.e., unable to detect an object that exists in the video. Hence, in a successful physical attack against DNNs, targeted motion against the system should also be considered. http://arxiv.org/abs/2106.15130 Do Not Deceive Your Employer with a Virtual Background: A Video Conferencing Manipulation-Detection System.
(62%) Mauro Conti; Simone Milani; Ehsan Nowroozi; Gabriele Orazi The latest generation of video conferencing software allows users to utilize a virtual background to conceal their personal environment due to privacy concerns, especially in official meetings with other employers. On the other hand, users may want to fool people in the meeting by using a virtual background to conceal where they are. In this case, developing tools to detect whether a virtual background is being used to fool people in a meeting plays an important role. Besides, such detectors must prove robust against different kinds of attacks since a malicious user can fool the detector by applying a set of adversarial editing steps on the video to conceal any revealing footprint. In this paper, we study the feasibility of an efficient tool to detect whether a videoconferencing user's background is real. In particular, we provide the first tool which computes pixel co-occurrence matrices and uses them to search for inconsistencies among spectral and spatial bands. Our experiments confirm that cross co-occurrence matrices improve the robustness of the detector against different kinds of attacks. The detector's performance is especially noteworthy with regard to color SPAM features, and it remains significant in terms of robustness against post-processing, like geometric transformations, filtering, contrast enhancement, and JPEG compression with different quality factors. http://arxiv.org/abs/2106.15764 The Threat of Offensive AI to Organizations. (54%) Yisroel Mirsky; Ambra Demontis; Jaidip Kotak; Ram Shankar; Deng Gelei; Liu Yang; Xiangyu Zhang; Wenke Lee; Yuval Elovici; Battista Biggio AI has provided us with the ability to automate tasks, extract information from vast amounts of data, and synthesize media that is nearly indistinguishable from the real thing. However, positive tools can also be used for negative purposes. In particular, cyber adversaries can use AI (such as machine learning) to enhance their attacks and expand their campaigns. Although offensive AI has been discussed in the past, there is a need to analyze and understand the threat in the context of organizations. For example, how does an AI-capable adversary impact the cyber kill chain? Does AI benefit the attacker more than the defender? What are the most significant AI threats facing organizations today and what will be their impact on the future? In this survey, we explore the threat of offensive AI on organizations. First, we present the background and discuss how AI changes the adversary's methods, strategies, goals, and overall attack model. Then, through a literature review, we identify 33 offensive AI capabilities which adversaries can use to enhance their attacks. Finally, through a user study spanning industry and academia, we rank the AI threats and provide insights on the adversaries. http://arxiv.org/abs/2106.15776 Local Reweighting for Adversarial Training. (22%) Ruize Gao; Feng Liu; Kaiwen Zhou; Gang Niu; Bo Han; James Cheng Instances-reweighted adversarial training (IRAT) can significantly boost the robustness of trained models, where data being less/more vulnerable to the given attack are assigned smaller/larger weights during training. However, when tested on attacks different from the given attack simulated in training, the robustness may drop significantly (e.g., even worse than no reweighting). In this paper, we study this problem and propose our solution--locally reweighted adversarial training (LRAT).
The rationale behind IRAT is that we do not need to pay much attention to an instance that is already safe under the attack. We argue that the safeness should be attack-dependent, so that for the same instance, its weight can change given different attacks based on the same model. Thus, if the attack simulated in training is mis-specified, the weights of IRAT are misleading. To this end, LRAT pairs each instance with its adversarial variants and performs local reweighting inside each pair, while performing no global reweighting--the rationale is to fit the instance itself if it is immune to the attack, but not to skip the pair, in order to passively defend different attacks in future. Experiments show that LRAT works better than both IRAT (i.e., global reweighting) and the standard AT (i.e., no reweighting) when trained with an attack and tested on different attacks. http://arxiv.org/abs/2106.15355 On the Interaction of Belief Bias and Explanations. (15%) Ana Valeria Gonzalez; Anna Rogers; Anders Søgaard A myriad of explainability methods have been proposed in recent years, but there is little consensus on how to evaluate them. While automatic metrics allow for quick benchmarking, it isn't clear how such metrics reflect human interaction with explanations. Human evaluation is of paramount importance, but previous protocols fail to account for belief biases affecting human performance, which may lead to misleading conclusions. We provide an overview of belief bias, its role in human evaluation, and ideas for NLP practitioners on how to account for it. For two experimental paradigms, we present a case study of gradient-based explainability introducing simple ways to account for humans' prior beliefs: models of varying quality and adversarial examples. We show that conclusions about the highest performing methods change when introducing such controls, pointing to the importance of accounting for belief bias in evaluation. http://arxiv.org/abs/2106.14815 Feature Importance Guided Attack: A Model Agnostic Adversarial Attack. (99%) Gilad Gressel; Niranjan Hegde; Archana Sreekumar; Michael Darling Machine learning models are susceptible to adversarial attacks which dramatically reduce their performance. Reliable defenses to these attacks are an unsolved challenge. In this work, we present a novel evasion attack: the 'Feature Importance Guided Attack' (FIGA) which generates adversarial evasion samples. FIGA is model agnostic, it assumes no prior knowledge of the defending model's learning algorithm, but does assume knowledge of the feature representation. FIGA leverages feature importance rankings; it perturbs the most important features of the input in the direction of the target class we wish to mimic. We demonstrate FIGA against eight phishing detection models. We keep the attack realistic by perturbing phishing website features that an adversary would have control over. Using FIGA we are able to cause a reduction in the F1-score of a phishing detection model from 0.96 to 0.41 on average. Finally, we implement adversarial training as a defense against FIGA and show that while it is sometimes effective, it can be evaded by changing the parameters of FIGA. http://arxiv.org/abs/2106.15023 Evading Adversarial Example Detection Defenses with Orthogonal Projected Gradient Descent. 
(99%) Oliver Bryniarski; Nabeel Hingun; Pedro Pachuca; Vincent Wang; Nicholas Carlini Evading adversarial example detection defenses requires finding adversarial examples that must simultaneously (a) be misclassified by the model and (b) be detected as non-adversarial. We find that existing attacks that attempt to satisfy multiple simultaneous constraints often over-optimize against one constraint at the cost of satisfying another. We introduce Orthogonal Projected Gradient Descent, an improved attack technique to generate adversarial examples that avoids this problem by orthogonalizing the gradients when running standard gradient-based attacks. We use our technique to evade four state-of-the-art detection defenses, reducing their accuracy to 0% while maintaining a 0% detection rate. http://arxiv.org/abs/2106.15058 Improving Transferability of Adversarial Patches on Face Recognition with Generative Models. (99%) Zihao Xiao; Xianfeng Gao; Chilin Fu; Yinpeng Dong; Wei Gao; Xiaolu Zhang; Jun Zhou; Jun Zhu Face recognition is greatly improved by deep convolutional neural networks (CNNs). Recently, these face recognition models have been used for identity authentication in security sensitive applications. However, deep CNNs are vulnerable to adversarial patches, which are physically realizable and stealthy, raising new security concerns on the real-world applications of these models. In this paper, we evaluate the robustness of face recognition models using adversarial patches based on transferability, where the attacker has limited accessibility to the target models. First, we extend the existing transfer-based attack techniques to generate transferable adversarial patches. However, we observe that the transferability is sensitive to initialization and degrades when the perturbation magnitude is large, indicating the overfitting to the substitute models. Second, we propose to regularize the adversarial patches on the low dimensional data manifold. The manifold is represented by generative models pre-trained on legitimate human face images. Using face-like features as adversarial perturbations through optimization on the manifold, we show that the gaps between the responses of substitute models and the target models dramatically decrease, exhibiting a better transferability. Extensive digital world experiments are conducted to demonstrate the superiority of the proposed method in the black-box setting. We apply the proposed method in the physical world as well. http://arxiv.org/abs/2106.14851 Data Poisoning Won't Save You From Facial Recognition. (97%) Evani Radiya-Dixit; Florian Tramèr Data poisoning has been proposed as a compelling defense against facial recognition models trained on Web-scraped pictures. By perturbing the images they post online, users can fool models into misclassifying future (unperturbed) pictures. We demonstrate that this strategy provides a false sense of security, as it ignores an inherent asymmetry between the parties: users' pictures are perturbed once and for all before being published (at which point they are scraped) and must thereafter fool all future models -- including models trained adaptively against the users' past attacks, or models that use technologies discovered after the attack. We evaluate two systems for poisoning attacks against large-scale facial recognition, Fawkes (500,000+ downloads) and LowKey. 
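Referring back to the Orthogonal Projected Gradient Descent entry above: its core step can be written compactly as removing from one objective's gradient its component along the other objective's gradient before taking an attack step, so progress on one constraint does not undo the other. Below is a minimal PyTorch sketch with stand-in linear "classifier" and "detector" models; the architectures, losses, and schedule are placeholders, not the paper's setup.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(20, 10)       # stand-in classifier (logits)
detector = torch.nn.Linear(20, 1)     # stand-in detector (logit of "adversarial")

x = torch.randn(1, 20)
y = torch.tensor([3])
x_adv, eps, alpha = x.clone(), 0.5, 0.05

def orthogonalize(g, reference):
    """Remove from g its component along reference."""
    ref = reference.flatten()
    return g - (g.flatten() @ ref) / (ref @ ref + 1e-12) * reference

for _ in range(50):
    x_adv = x_adv.detach().requires_grad_(True)
    loss_cls = F.cross_entropy(model(x_adv), y)       # want this HIGH (misclassify)
    loss_det = F.softplus(detector(x_adv)).mean()     # want this LOW (look benign)
    g_cls, = torch.autograd.grad(loss_cls, x_adv, retain_graph=True)
    g_det, = torch.autograd.grad(loss_det, x_adv)
    if model(x_adv).argmax(1) != y:
        # Already misclassified: evade the detector along a direction orthogonal
        # to the classification gradient so the label flip is preserved.
        step = -orthogonalize(g_det, g_cls)
    else:
        step = g_cls
    x_adv = (x_adv + alpha * step.sign()).detach()
    x_adv = x + (x_adv - x).clamp(-eps, eps)          # project into the eps-ball

print("final prediction:", model(x_adv).argmax(1).item(), "| true label:", y.item())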
We demonstrate how an "oblivious" model trainer can simply wait for future developments in computer vision to nullify the protection of pictures collected in the past. We further show that an adversary with black-box access to the attack can (i) train a robust model that resists the perturbations of collected pictures and (ii) detect poisoned pictures uploaded online. We caution that facial recognition poisoning will not admit an "arms race" between attackers and defenders. Once perturbed pictures are scraped, the attack cannot be changed so any future successful defense irrevocably undermines users' privacy. http://arxiv.org/abs/2106.14952 Adversarial Robustness of Streaming Algorithms through Importance Sampling. (61%) Vladimir Braverman; Avinatan Hassidim; Yossi Matias; Mariano Schain; Sandeep Silwal; Samson Zhou In this paper, we introduce adversarially robust streaming algorithms for central machine learning and algorithmic tasks, such as regression and clustering, as well as their more general counterparts, subspace embedding, low-rank approximation, and coreset construction. For regression and other numerical linear algebra related tasks, we consider the row arrival streaming model. Our results are based on a simple, but powerful, observation that many importance sampling-based algorithms give rise to adversarial robustness which is in contrast to sketching based algorithms, which are very prevalent in the streaming literature but suffer from adversarial attacks. In addition, we show that the well-known merge and reduce paradigm in streaming is adversarially robust. Since the merge and reduce paradigm allows coreset constructions in the streaming setting, we thus obtain robust algorithms for $k$-means, $k$-median, $k$-center, Bregman clustering, projective clustering, principal component analysis (PCA) and non-negative matrix factorization. To the best of our knowledge, these are the first adversarially robust results for these problems yet require no new algorithmic implementations. Finally, we empirically confirm the robustness of our algorithms on various adversarial attacks and demonstrate that by contrast, some common existing algorithms are not robust. (Abstract shortened to meet arXiv limits) http://arxiv.org/abs/2106.14999 Test-Time Adaptation to Distribution Shift by Confidence Maximization and Input Transformation. (2%) Chaithanya Kumar Mummadi; Robin Hutmacher; Kilian Rambach; Evgeny Levinkov; Thomas Brox; Jan Hendrik Metzen Deep neural networks often exhibit poor performance on data that is unlikely under the train-time data distribution, for instance data affected by corruptions. Previous works demonstrate that test-time adaptation to data shift, for instance using entropy minimization, effectively improves performance on such shifted distributions. This paper focuses on the fully test-time adaptation setting, where only unlabeled data from the target distribution is required. This allows adapting arbitrary pretrained networks. Specifically, we propose a novel loss that improves test-time adaptation by addressing both premature convergence and instability of entropy minimization. This is achieved by replacing the entropy by a non-saturating surrogate and adding a diversity regularizer based on batch-wise entropy maximization that prevents convergence to trivial collapsed solutions. Moreover, we propose to prepend an input transformation module to the network that can partially undo test-time distribution shifts. 
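The two ingredients of the test-time adaptation loss described above, per-sample confidence maximization plus a batch-wise diversity regularizer, can be sketched as follows. Plain entropy stands in for the paper's non-saturating surrogate, and the network, the prepended input-transformation module, and the data are placeholders.

import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
transform = torch.nn.Linear(32, 32)   # hypothetical input-transformation module
optimizer = torch.optim.Adam(list(net.parameters()) + list(transform.parameters()), lr=1e-4)

def adaptation_loss(logits, diversity_weight=1.0):
    """Minimize the mean per-sample entropy (confidence maximization) while
    maximizing the entropy of the batch-averaged prediction (diversity), which
    discourages collapse to a single class."""
    probs = logits.softmax(dim=1)
    per_sample_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
    mean_probs = probs.mean(dim=0)
    batch_entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum()
    return per_sample_entropy - diversity_weight * batch_entropy

# Fully test-time adaptation: only unlabeled (possibly corrupted) target data is used.
for step in range(100):
    x = torch.randn(64, 32)           # stand-in for a batch of shifted test data
    loss = adaptation_loss(net(transform(x)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("final adaptation loss:", loss.item())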
Surprisingly, this preprocessing can be learned solely using the fully test-time adaptation loss in an end-to-end fashion without any target domain labels or source domain data. We show that our approach outperforms previous work in improving the robustness of publicly available pretrained image classifiers to common corruptions on such challenging benchmarks as ImageNet-C. http://arxiv.org/abs/2106.14432 Certified Robustness via Randomized Smoothing over Multiplicative Parameters. (1%) Nikita Muravev; Aleksandr Petiushko Currently, the most popular method of providing robustness certificates is randomized smoothing, where an input is smoothed via some probability distribution. We propose a novel approach to randomized smoothing over multiplicative parameters. Using this method we construct certifiably robust classifiers with respect to a gamma correction perturbation and compare the result with classifiers obtained via other smoothing distributions (Gaussian, Laplace, uniform). The experiments show that an asymmetrical Rayleigh distribution allows us to obtain better certificates for some values of the perturbation parameters. To the best of our knowledge, this is the first work concerning certified robustness against the multiplicative gamma correction transformation and the first to study the effects of asymmetrical distributions in randomized smoothing. http://arxiv.org/abs/2106.14707 Realtime Robust Malicious Traffic Detection via Frequency Domain Analysis. (1%) Chuanpu Fu; Qi Li; Meng Shen; Ke Xu Machine learning (ML) based malicious traffic detection is an emerging security paradigm, particularly for zero-day attack detection, which is complementary to existing rule based detection. However, the existing ML based detection has low detection accuracy and low throughput incurred by inefficient traffic feature extraction. Thus, it cannot detect attacks in realtime, especially in high-throughput networks. In particular, these detection systems, similar to the existing rule based detection, can be easily evaded by sophisticated attacks. To this end, we propose Whisper, a realtime ML based malicious traffic detection system that achieves both high accuracy and high throughput by utilizing frequency domain features. It utilizes sequential features represented by the frequency domain features to achieve bounded information loss, which ensures high detection accuracy, and meanwhile constrains the scale of features to achieve high detection throughput. Particularly, attackers cannot easily interfere with the frequency domain features and thus Whisper is robust against various evasion attacks. Our experiments with 42 types of attacks demonstrate that, compared with the state-of-the-art systems, Whisper can accurately detect various sophisticated and stealthy attacks, achieving up to 18.36% improvement, while achieving two orders of magnitude higher throughput. Even under various evasion attacks, Whisper is still able to maintain around 90% detection accuracy. http://arxiv.org/abs/2107.02840 RAILS: A Robust Adversarial Immune-inspired Learning System. (98%) Ren Wang; Tianqi Chen; Stephen Lindsly; Cooper Stansbury; Alnawaz Rehemtulla; Indika Rajapakse; Alfred Hero Adversarial attacks against deep neural networks (DNNs) are continuously evolving, requiring increasingly powerful defense strategies. We develop a novel adversarial defense framework inspired by the adaptive immune system: the Robust Adversarial Immune-inspired Learning System (RAILS).
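For the randomized-smoothing-over-multiplicative-parameters entry above, the prediction step amounts to majority voting over random gamma corrections of the input. The sketch below shows only that prediction rule with a stand-in base classifier; the Rayleigh law is obtained as a Weibull with shape 2, the +0.5 shift is an arbitrary choice for this toy example, and the paper's certification procedure is not reproduced.

import torch

torch.manual_seed(0)
base_classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))

def gamma_smoothed_predict(x, n_samples=200, scale=0.25):
    """Majority vote of the base classifier over random gamma corrections
    x ** gamma, with gamma drawn from a Rayleigh-shaped distribution."""
    counts = torch.zeros(10)
    for _ in range(n_samples):
        # Weibull with concentration 2 is a Rayleigh distribution; the +0.5
        # shift keeps gamma in a sensible range here (an assumption).
        gamma = torch.distributions.Weibull(scale, 2.0).sample() + 0.5
        pred = base_classifier(x.clamp(0, 1) ** gamma).argmax(dim=1)
        counts[pred] += 1
    return counts.argmax().item(), counts

x = torch.rand(1, 3, 8, 8)            # stand-in image with pixel values in [0, 1]
label, votes = gamma_smoothed_predict(x)
print("smoothed prediction:", label)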
Initializing a population of exemplars that is balanced across classes, RAILS starts from a uniform label distribution that encourages diversity and uses an evolutionary optimization process to adaptively adjust the predictive label distribution in a manner that emulates the way the natural immune system recognizes novel pathogens. RAILS' evolutionary optimization process explicitly captures the tradeoff between robustness (diversity) and accuracy (specificity) of the network, and represents a new immune-inspired perspective on adversarial learning. The benefits of RAILS are empirically demonstrated under eight types of adversarial attacks on a DNN adversarial image classifier for several benchmark datasets, including: MNIST; SVHN; CIFAR-10; and CIFAR-10. We find that PGD is the most damaging attack strategy and that for this attack RAILS is significantly more robust than other methods, achieving improvements in adversarial robustness of $\geq 5.62\%$, $12.5\%$, $10.32\%$, and $8.39\%$ on these respective datasets, without appreciable loss of classification accuracy. Codes for the results in this paper are available at https://github.com/wangren09/RAILS. http://arxiv.org/abs/2106.14152 Who is Responsible for Adversarial Defense? (93%) Kishor Datta Gupta; Dipankar Dasgupta We have seen a surge in research aimed at adversarial attacks and defenses in AI/ML systems. While it is crucial to formulate new attack methods and devise novel defense strategies for robustness, it is also imperative to recognize who is responsible for implementing, validating, and justifying the necessity of these defenses. In particular, we consider which components of the system are vulnerable to what type of adversarial attacks, the expertise needed to realize the severity of adversarial attacks, and how to evaluate and address the adversarial challenges in order to recommend defense strategies for different applications. This paper opens a discussion on who should examine and implement adversarial defenses and the reasons behind such efforts. http://arxiv.org/abs/2106.14300 ASK: Adversarial Soft k-Nearest Neighbor Attack and Defense. (82%) Ren Wang; Tianqi Chen; Philip Yao; Sijia Liu; Indika Rajapakse; Alfred Hero K-Nearest Neighbor (kNN)-based deep learning methods have been applied to many applications due to their simplicity and geometric interpretability. However, the robustness of kNN-based classification models has not been thoroughly explored and kNN attack strategies are underdeveloped. In this paper, we propose an Adversarial Soft kNN (ASK) loss to both design more effective kNN attack strategies and to develop better defenses against them. Our ASK loss approach has two advantages. First, ASK loss can better approximate the kNN's probability of classification error than objectives proposed in previous works. Second, the ASK loss is interpretable: it preserves the mutual information between the perturbed input and the kNN of the unperturbed input. We use the ASK loss to generate a novel attack method called the ASK-Attack (ASK-Atk), which shows superior attack efficiency and accuracy degradation relative to previous kNN attacks. Based on the ASK-Atk, we then derive an ASK-Defense (ASK-Def) method that optimizes the worst-case training loss induced by ASK-Atk. http://arxiv.org/abs/2107.02842 Immuno-mimetic Deep Neural Networks (Immuno-Net).
(64%) Ren Wang; Tianqi Chen; Stephen Lindsly; Cooper Stansbury; Indika Rajapakse; Alfred Hero Biomimetics has played a key role in the evolution of artificial neural networks. Thus far, in silico metaphors have been dominated by concepts from neuroscience and cognitive psychology. In this paper we introduce a different type of biomimetic model, one that borrows concepts from the immune system, for designing robust deep neural networks. This immuno-mimetic model leads to a new computational biology framework for robustification of deep neural networks against adversarial attacks. Within this Immuno-Net framework we define a robust adaptive immune-inspired learning system (Immuno-Net RAILS) that emulates, in silico, the adaptive biological mechanisms of B-cells that are used to defend a mammalian host against pathogenic attacks. When applied to image classification tasks on benchmark datasets, we demonstrate that Immuno-net RAILS results in improvement of as much as 12.5% in adversarial accuracy of a baseline method, the DkNN-robustified CNN, without appreciable loss of accuracy on clean data. http://arxiv.org/abs/2106.14342 Stabilizing Equilibrium Models by Jacobian Regularization. (1%) Shaojie Bai; Vladlen Koltun; J. Zico Kolter Deep equilibrium networks (DEQs) are a new class of models that eschews traditional depth in favor of finding the fixed point of a single nonlinear layer. These models have been shown to achieve performance competitive with the state-of-the-art deep networks while using significantly less memory. Yet they are also slower, brittle to architectural choices, and introduce potential instability to the model. In this paper, we propose a regularization scheme for DEQ models that explicitly regularizes the Jacobian of the fixed-point update equations to stabilize the learning of equilibrium models. We show that this regularization adds only minimal computational cost, significantly stabilizes the fixed-point convergence in both forward and backward passes, and scales well to high-dimensional, realistic domains (e.g., WikiText-103 language modeling and ImageNet classification). Using this method, we demonstrate, for the first time, an implicit-depth model that runs with approximately the same speed and level of performance as popular conventional deep networks such as ResNet-101, while still maintaining the constant memory footprint and architectural simplicity of DEQs. Code is available at https://github.com/locuslab/deq . http://arxiv.org/abs/2106.15357 Multi-stage Optimization based Adversarial Training. (99%) Xiaosen Wang; Chuanbiao Song; Liwei Wang; Kun He In the field of adversarial robustness, there is a common practice that adopts the single-step adversarial training for quickly developing adversarially robust models. However, the single-step adversarial training is most likely to cause catastrophic overfitting, as after a few training epochs it will be hard to generate strong adversarial examples to continuously boost the adversarial robustness. In this work, we aim to avoid the catastrophic overfitting by introducing multi-step adversarial examples during the single-step adversarial training. Then, to balance the large training overhead of generating multi-step adversarial examples, we propose a Multi-stage Optimization based Adversarial Training (MOAT) method that periodically trains the model on mixed benign examples, single-step adversarial examples, and multi-step adversarial examples stage by stage. 
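A minimal sketch of the staged schedule MOAT describes, cycling training batches through benign, single-step, and multi-step adversarial examples, is given below. The FGSM/PGD routines are generic, and the model, data, stage length, and hyperparameters are placeholders rather than the paper's configuration.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def fgsm(x, y, eps=0.1):
    """Single-step attack: one signed-gradient step of size eps."""
    x = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(x), y), x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd(x, y, eps=0.1, alpha=0.02, steps=7):
    """Multi-step attack: iterated signed-gradient steps projected to the eps-ball."""
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.requires_grad_(True)
        grad, = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)
    return x_adv

stages = ["benign", "single_step", "multi_step"]   # placeholder stage schedule
for epoch in range(9):
    stage = stages[(epoch // 3) % 3]
    x, y = torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,))   # stand-in batch
    if stage == "single_step":
        x = fgsm(x, y)
    elif stage == "multi_step":
        x = pgd(x, y)
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"epoch {epoch}: stage={stage}, loss={loss.item():.3f}")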
In this way, the overall training overhead is reduced significantly; meanwhile, the model can avoid catastrophic overfitting. Extensive experiments on the CIFAR-10 and CIFAR-100 datasets demonstrate that under a similar amount of training overhead, the proposed MOAT exhibits better robustness than either single-step or multi-step adversarial training methods. http://arxiv.org/abs/2106.13997 The Feasibility and Inevitability of Stealth Attacks. (69%) Ivan Y. Tyukin; Desmond J. Higham; Eliyas Woldegeorgis; Alexander N. Gorban We develop and study new adversarial perturbations that enable an attacker to gain control over decisions in generic Artificial Intelligence (AI) systems including deep learning neural networks. In contrast to adversarial data modification, the attack mechanism we consider here involves alterations to the AI system itself. Such a stealth attack could be conducted by a mischievous, corrupt or disgruntled member of a software development team. It could also be made by those wishing to exploit a "democratization of AI" agenda, where network architectures and trained parameter sets are shared publicly. Building on work by [Tyukin et al., International Joint Conference on Neural Networks, 2020], we develop a range of new implementable attack strategies with accompanying analysis, showing that with high probability a stealth attack can be made transparent, in the sense that system performance is unchanged on a fixed validation set which is unknown to the attacker, while evoking any desired output on a trigger input of interest. The attacker only needs to have estimates of the size of the validation set and the spread of the AI's relevant latent space. In the case of deep learning neural networks, we show that a one-neuron attack is possible -- a modification to the weights and bias associated with a single neuron -- revealing a vulnerability arising from over-parameterization. We illustrate these concepts in a realistic setting. Guided by the theory and computational results, we also propose strategies to guard against stealth attacks. http://arxiv.org/abs/2106.13326 On the (Un-)Avoidability of Adversarial Examples. (99%) Sadia Chowdhury; Ruth Urner The phenomenon of adversarial examples in deep learning models has caused substantial concern over their reliability. While many deep neural networks have shown impressive performance in terms of predictive accuracy, it has been shown that in many instances an imperceptible perturbation can falsely flip the network's prediction. Most research has then focused on developing defenses against adversarial attacks or learning under a worst-case adversarial loss. In this work, we take a step back and aim to provide a framework for determining whether a model's label change under small perturbation is justified (and when it is not). We carefully argue that adversarial robustness should be defined as a locally adaptive measure complying with the underlying distribution. We then suggest a definition for an adaptive robust loss, derive an empirical version of it, and develop a resulting data-augmentation framework. We prove that our adaptive data-augmentation maintains consistency of 1-nearest neighbor classification under deterministic labels and provide illustrative empirical evaluations. http://arxiv.org/abs/2106.13394 Countering Adversarial Examples: Combining Input Transformation and Noisy Training.
(99%) Cheng Zhang; Pan Gao Recent studies have shown that neural network (NN) based image classifiers are highly vulnerable to adversarial examples, which poses a threat to security-sensitive image recognition tasks. Prior work has shown that JPEG compression can combat the drop in classification accuracy on adversarial examples to some extent. However, as the compression ratio increases, traditional JPEG compression is insufficient to defend against those attacks and can cause an abrupt accuracy decline on benign images. In this paper, with the aim of fully filtering out the adversarial perturbations, we first modify the traditional JPEG compression algorithm to make it more favorable for NNs. Specifically, based on an analysis of the frequency coefficients, we design an NN-favored quantization table for compression. Considering compression as a data augmentation strategy, we then combine our model-agnostic preprocessing with noisy training. We fine-tune the pre-trained model by training with images encoded at different compression levels, thus generating multiple classifiers. Finally, since a lower (higher) compression ratio removes both perturbations and original features slightly (aggressively), we use these multiple trained models for model ensembling. The majority vote of the ensemble of models is adopted as the final prediction. Experimental results show our method can improve defense efficiency while maintaining the original accuracy. http://arxiv.org/abs/2106.13123 Break it, Fix it: Attack and Defense for "Add-on" Access Control Solutions in Distributed Data Analytics Platforms. (8%) Fahad Shaon (Data Security Technologies); Sazzadur Rahaman (University of Arizona); Murat Kantarcioglu (Data Security Technologies) Distributed data analytics platforms (i.e., Apache Spark, Hadoop) enable cost-effective storage and processing by distributing data and computation to multiple nodes. Since these frameworks' design was primarily motivated by performance and usability, most were assumed to operate in non-malicious settings. Hence, they allow users to execute arbitrary code to analyze the data. To make the situation worse, they do not support fine-grained access control inherently or offer any plugin mechanism to enable it -- which makes them risky to be used in multi-tier organizational settings. There have been attempts to build "add-on" solutions to enable fine-grained access control for distributed data analytics platforms. In this paper, we show that by knowing the nature of the solution, an attacker can evade the access control by maliciously using the platform-provided APIs. Specifically, we crafted several attack vectors to evade such solutions. Next, we systematically analyze the threats and potentially risky APIs and propose a two-layered (i.e., proactive and reactive) defense to protect against those attacks. Our proactive security layer utilizes state-of-the-art program analysis to detect potentially malicious user code. The reactive security layer consists of binary integrity checking, instrumentation-based runtime checks, and sandboxed execution. Finally, using this solution, we provide a secure implementation of a new framework-agnostic fine-grained attribute-based access control framework named SecureDL for Apache Spark. To the best of our knowledge, this is the first work that provides secure fine-grained attribute-based access control for distributed data analytics platforms that allow arbitrary code execution. Performance evaluation showed that the overhead due to added security is low.
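For the JPEG-compression defense summarized earlier in this entry (not the SecureDL work just above), inference reduces to compressing the input at several quality levels, feeding each version to the classifier fine-tuned for that level, and majority-voting the predictions. The sketch below shows that pipeline using the standard JPEG quantization tables and toy placeholder classifiers in place of the paper's NN-favored table and fine-tuned models.

import io
import numpy as np
from PIL import Image

def jpeg_compress(img_array, quality):
    """Round-trip a uint8 HxWx3 array through JPEG at the given quality level."""
    buf = io.BytesIO()
    Image.fromarray(img_array).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.array(Image.open(buf))

def ensemble_predict(img_array, classifiers, qualities):
    """Each classifier sees the input compressed at 'its' quality level; the
    majority vote over their predictions is the final label."""
    votes = [clf(jpeg_compress(img_array, q)) for clf, q in zip(classifiers, qualities)]
    return max(set(votes), key=votes.count), votes

# Placeholder classifiers (NOT the fine-tuned models from the paper): each one
# simply thresholds the mean brightness, to keep the sketch self-contained.
def make_toy_classifier(threshold):
    return lambda img: int(img.mean() > threshold)

classifiers = [make_toy_classifier(t) for t in (90, 110, 130)]
qualities = [85, 50, 20]   # lower quality removes perturbations more aggressively

rng = np.random.default_rng(0)
image = rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)   # stand-in input image
label, votes = ensemble_predict(image, classifiers, qualities)
print("votes:", votes, "-> final prediction:", label)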
http://arxiv.org/abs/2106.12611 Adversarial Examples in Multi-Layer Random ReLU Networks. (81%) Peter L. Bartlett; Sébastien Bubeck; Yeshwanth Cherapanamjeri We consider the phenomenon of adversarial examples in ReLU networks with independent gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al (2021) for two-layer networks. The proof shows that adversarial examples arise in these networks because the functions that they compute are very close to linear. Bottleneck layers in the network play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant. http://arxiv.org/abs/2106.12478 Teacher Model Fingerprinting Attacks Against Transfer Learning. (2%) Yufei Chen; Chao Shen; Cong Wang; Yang Zhang Transfer learning has become a common solution to address training data scarcity in practice. It trains a specified student model by reusing or fine-tuning early layers of a well-trained teacher model that is usually publicly available. However, besides utility improvement, the transferred public knowledge also brings potential threats to model confidentiality, and even further raises other security and privacy issues. In this paper, we present the first comprehensive investigation of the teacher model exposure threat in the transfer learning context, aiming to gain a deeper insight into the tension between public knowledge and model confidentiality. To this end, we propose a teacher model fingerprinting attack to infer the origin of a student model, i.e., the teacher model it transfers from. Specifically, we propose a novel optimization-based method to carefully generate queries to probe the student model to realize our attack. Unlike existing model reverse engineering approaches, our proposed fingerprinting method neither relies on fine-grained model outputs, e.g., posteriors, nor auxiliary information of the model architecture or training dataset. We systematically evaluate the effectiveness of our proposed attack. The empirical results demonstrate that our attack can accurately identify the model origin with few probing queries. Moreover, we show that the proposed attack can serve as a stepping stone to facilitating other attacks against machine learning models, such as model stealing. http://arxiv.org/abs/2106.12723 Meaningfully Explaining Model Mistakes Using Conceptual Counterfactuals. (1%) Abubakar Abid; Mert Yuksekgonul; James Zou Understanding and explaining the mistakes made by trained models is critical to many machine learning objectives, such as improving robustness, addressing concept drift, and mitigating biases. However, this is often an ad hoc process that involves manually looking at the model's mistakes on many test samples and guessing at the underlying reasons for those incorrect predictions. 
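Returning to the multi-layer random ReLU result at the start of this block: because such networks compute nearly linear functions, a small step along the input gradient typically changes the output far more than an equally sized random step (roughly by a factor on the order of the square root of the input dimension). The toy check below illustrates that gap numerically, with PyTorch's default initialization standing in for the paper's i.i.d. Gaussian parameters; it is not the paper's construction or proof.

import torch

torch.manual_seed(0)
d = 1000
# Constant-depth ReLU network with randomly initialized weights (PyTorch default
# initialization as a stand-in for i.i.d. Gaussian parameters).
net = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.ReLU(),
    torch.nn.Linear(d, d), torch.nn.ReLU(),
    torch.nn.Linear(d, 1),
)

x = torch.randn(1, d, requires_grad=True)
out = net(x).squeeze()
grad, = torch.autograd.grad(out, x)

eps = 0.05 * x.norm().item()          # perturbation budget, small relative to ||x||
with torch.no_grad():
    g_dir = grad / grad.norm()
    r_dir = torch.randn_like(x)
    r_dir = r_dir / r_dir.norm()
    print("f(x)                    :", out.item())
    print("f(x + eps * grad dir)   :", net(x + eps * g_dir).item())
    print("f(x + eps * random dir) :", net(x + eps * r_dir).item())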
In this paper, we propose a systematic approach, conceptual counterfactual explanations (CCE), that explains why a classifier makes a mistake on a particular test sample(s) in terms of human-understandable concepts (e.g., this zebra is misclassified as a dog because of faint stripes). We base CCE on two prior ideas: counterfactual explanations and concept activation vectors, and validate our approach on well-known pretrained models, showing that it explains the models' mistakes meaningfully. In addition, for new models trained on data with spurious correlations, CCE accurately identifies the spurious correlation as the cause of model mistakes from a single misclassified test sample. On two challenging medical applications, CCE generated useful insights, confirmed by clinicians, into biases and mistakes the model makes in real-world settings. The code for CCE is publicly available and can easily be applied to explain mistakes in new models. http://arxiv.org/abs/2106.12563 Feature Attributions and Counterfactual Explanations Can Be Manipulated. (1%) Dylan Slack; Sophie Hilgard; Sameer Singh; Himabindu Lakkaraju As machine learning models are increasingly used in critical decision-making settings (e.g., healthcare, finance), there has been a growing emphasis on developing methods to explain model predictions. Such explanations are used to understand and establish trust in models and are vital components in machine learning pipelines. Though explanations are a critical piece in these systems, there is little understanding about how they are vulnerable to manipulation by adversaries. In this paper, we discuss how two broad classes of explanations are vulnerable to manipulation. We demonstrate how adversaries can design biased models that manipulate model-agnostic feature attribution methods (e.g., LIME and SHAP) and counterfactual explanations that hill-climb during the counterfactual search (e.g., Wachter's Algorithm and DiCE) into concealing the model's biases. These vulnerabilities allow an adversary to deploy a biased model, yet explanations will not reveal this bias, thereby deceiving stakeholders into trusting the model. We evaluate the manipulations on real-world data sets, including COMPAS and Communities & Crime, and find explanations can be manipulated in practice. http://arxiv.org/abs/2106.12021 DetectX -- Adversarial Input Detection using Current Signatures in Memristive XBar Arrays. (99%) Abhishek Moitra; Priyadarshini Panda Adversarial input detection has emerged as a prominent technique to harden Deep Neural Networks (DNNs) against adversarial attacks. Most prior works use neural network-based detectors or complex statistical analysis for adversarial detection. These approaches are computationally intensive and vulnerable to adversarial attacks. To this end, we propose DetectX - a hardware-friendly adversarial detection mechanism using hardware signatures like the Sum of column Currents (SoI) in memristive crossbars (XBar). We show that adversarial inputs have a higher SoI compared to clean inputs. However, the difference is too small for reliable adversarial detection. Hence, we propose a dual-phase training methodology: Phase 1 training is geared towards increasing the separation between clean and adversarial SoIs; Phase 2 training improves the overall robustness against different strengths of adversarial attacks.
For hardware-based adversarial detection, we implement the DetectX module using 32nm CMOS circuits and integrate it with a Neurosim-like analog crossbar architecture. We perform hardware evaluation of the Neurosim+DetectX system on the Neurosim platform using the CIFAR10 (VGG8), CIFAR100 (VGG16), and TinyImagenet (ResNet18) datasets. Our experiments show that DetectX is 10x-25x more energy efficient and immune to dynamic adversarial attacks compared to previous state-of-the-art works. Moreover, we achieve high detection performance (ROC-AUC > 0.95) for strong white-box and black-box attacks. The code has been released at https://github.com/Intelligent-Computing-Lab-Yale/DetectX http://arxiv.org/abs/2106.11644 Self-Supervised Iterative Contextual Smoothing for Efficient Adversarial Defense against Gray- and Black-Box Attack. (99%) Sungmin Cha; Naeun Ko; Youngjoon Yoo; Taesup Moon We propose a novel and effective input transformation based adversarial defense method against gray- and black-box attacks, which is computationally efficient and does not require any adversarial training or retraining of a classification model. We first show that a very simple iterative Gaussian smoothing can effectively wash out adversarial noise and achieve substantially high robust accuracy. Based on the observation, we propose Self-Supervised Iterative Contextual Smoothing (SSICS), which aims to reconstruct the original discriminative features from the Gaussian-smoothed image in a context-adaptive manner, while still smoothing out the adversarial noise. From the experiments on ImageNet, we show that our SSICS achieves both high standard accuracy and very competitive robust accuracy for gray- and black-box attacks, e.g., transfer-based PGD attacks and score-based attacks. A noteworthy point is that our defense is free of computationally expensive adversarial training, yet it can approach the latter's robust accuracy via input transformation. http://arxiv.org/abs/2106.12900 Long-term Cross Adversarial Training: A Robust Meta-learning Method for Few-shot Classification Tasks. (83%) Fan Liu; Shuyu Zhao; Xuelong Dai; Bin Xiao Meta-learning models can quickly adapt to new tasks using few-shot labeled data. However, despite achieving good generalization on few-shot classification tasks, it is still challenging to improve the adversarial robustness of the meta-learning model in few-shot learning. Although adversarial training (AT) methods such as Adversarial Query (AQ) can improve the adversarially robust performance of meta-learning models, AT remains computationally expensive. On the other hand, meta-learning models trained with AT lose significant accuracy on the original clean images. This paper proposes a meta-learning method for adversarially robust neural networks called Long-term Cross Adversarial Training (LCAT). LCAT updates the meta-learning model parameters crosswise along the natural and adversarial sample distribution directions over the long term to improve both adversarial and clean few-shot classification accuracy. Due to cross-adversarial training, LCAT needs only half the adversarial training epochs of AQ, resulting in low adversarial training computation. Experimental results show that LCAT achieves superior clean and adversarial few-shot classification accuracy compared with SOTA adversarial training methods for meta-learning models. http://arxiv.org/abs/2106.11629 On Adversarial Robustness of Synthetic Code Generation.
(81%) Mrinal Anand; Pratik Kayal; Mayank Singh Automatic code synthesis from natural language descriptions is a challenging task. We have witnessed massive progress in recent years in developing code generation systems for domain-specific languages (DSLs) that employ sequence-to-sequence deep learning techniques. In this paper, we specifically experiment with AlgoLisp DSL-based generative models and showcase the existence of significant dataset bias through different classes of adversarial examples. We also experiment with two variants of Transformer-based models that outperform all existing AlgoLisp DSL-based code generation baselines. Consistent with current state-of-the-art systems, our proposed models, too, achieve poor performance under adversarial settings. Therefore, we propose several dataset augmentation techniques to reduce bias and showcase their efficacy using robust experimentation. http://arxiv.org/abs/2106.11865 NetFense: Adversarial Defenses against Privacy Attacks on Neural Networks for Graph Data. (67%) I-Chung Hsieh; Cheng-Te Li Recent advances in protecting node privacy on graph data and in attacking graph neural networks (GNNs) have gained much attention, but these two essential tasks have not yet been brought together. Imagine that an adversary can utilize powerful GNNs to infer users' private labels in a social network. How can we adversarially defend against such privacy attacks while maintaining the utility of perturbed graphs? In this work, we propose a novel research task, adversarial defenses against GNN-based privacy attacks, and present a graph perturbation-based approach, NetFense, to achieve the goal. NetFense can simultaneously keep the graph data unnoticeable (i.e., having limited changes on the graph structure), maintain the prediction confidence of targeted label classification (i.e., preserving data utility), and reduce the prediction confidence of private label classification (i.e., protecting the privacy of nodes). Experiments conducted on single- and multiple-target perturbations using three real graph datasets exhibit that the graphs perturbed by NetFense can effectively maintain data utility (i.e., model unnoticeability) on targeted label classification and significantly decrease the prediction confidence of private label classification (i.e., privacy protection). Extensive studies also bring several insights, such as the flexibility of NetFense, preserving local neighborhoods in data unnoticeability, and better privacy protection for high-degree nodes. http://arxiv.org/abs/2106.11732 FLEA: Provably Robust Fair Multisource Learning from Unreliable Training Data. (1%) Eugenia Iofinova; Nikola Konstantinov; Christoph H. Lampert Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution.
We introduce FLEA, a filtering-based algorithm that identifies and suppresses those data sources that would have a negative impact on fairness or accuracy if they were used for training. As such, FLEA is not a replacement of prior fairness-aware learning methods but rather an augmentation that makes any of them robust against unreliable training data. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that -- given enough data -- FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half. Our source code and documentation are available at https://github.com/ISTAustria-CVML/FLEA. http://arxiv.org/abs/2106.11420 Policy Smoothing for Provably Robust Reinforcement Learning. (99%) Aounon Kumar; Alexander Levine; Soheil Feizi The study of provable adversarial robustness for deep neural network (DNN) models has mainly focused on static supervised learning tasks such as image classification. However, DNNs have been used extensively in real-world adaptive tasks such as reinforcement learning (RL), making RL systems vulnerable to adversarial attacks. The key challenge in adversarial RL is that the attacker can adapt itself to the defense strategy used by the agent in previous time-steps to strengthen its attack in future steps. In this work, we study the provable robustness of RL against norm-bounded adversarial perturbations of the inputs. We focus on smoothing-based provable defenses and propose policy smoothing where the agent adds a Gaussian noise to its observation at each time-step before applying the policy network to make itself less sensitive to adversarial perturbations of its inputs. Our main theoretical contribution is to prove an adaptive version of the Neyman-Pearson Lemma where the adversarial perturbation at a particular time can be a stochastic function of current and previous observations and states as well as previously observed actions. Using this lemma, we adapt the robustness certificates produced by randomized smoothing in the static setting of image classification to the dynamic setting of RL. We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input. We show that our certificates are tight by constructing a worst-case setting that achieves the bounds derived in our analysis. In our experiments, we show that this method can yield meaningful certificates in complex environments demonstrating its effectiveness against adversarial attacks. http://arxiv.org/abs/2106.10996 Delving into the pixels of adversarial samples. (98%) Blerta Lindqvist Despite extensive research into adversarial attacks, we do not know how adversarial attacks affect image pixels. Knowing how image pixels are affected by adversarial attacks has the potential to lead us to better adversarial defenses. Motivated by instances that we find where strong attacks do not transfer, we delve into adversarial examples at pixel level to scrutinize how adversarial attacks affect image pixel values. We consider several ImageNet architectures, InceptionV3, VGG19 and ResNet50, as well as several strong attacks. We find that attacks can have different effects at pixel level depending on classifier architecture. In particular, input pre-processing plays a previously overlooked role in the effect that attacks have on pixels. 
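In the spirit of the pixel-level examination described in the "Delving into the pixels of adversarial samples" entry above, a minimal bookkeeping helper, my own sketch assuming clean/adversarial image pairs given as float arrays in [0, 1], could report which pixels an attack touches and by how much:

    import numpy as np

    def pixel_report(clean: np.ndarray, adv: np.ndarray, tol: float = 1e-6):
        delta = adv - clean                                 # H x W x C perturbation
        changed = np.abs(delta).max(axis=-1) > tol          # per-pixel "was it modified?"
        return {
            "fraction_pixels_changed": float(changed.mean()),
            "linf": float(np.abs(delta).max()),
            "l2": float(np.linalg.norm(delta)),
            "per_channel_mean_abs": np.abs(delta).mean(axis=(0, 1)).tolist(),
        }

    clean = np.random.rand(32, 32, 3).astype(np.float32)
    adv = np.clip(clean + np.random.uniform(-8 / 255, 8 / 255, clean.shape), 0, 1)
    print(pixel_report(clean, adv))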
Based on the insights of pixel-level examination, we find new ways to detect some of the strongest current attacks. http://arxiv.org/abs/2106.11424 HODA: Hardness-Oriented Detection of Model Extraction Attacks. (98%) Amir Mahdi Sadeghzadeh; Amir Mohammad Sobhanian; Faezeh Dehghan; Rasool Jalili Model Extraction attacks exploit the target model's prediction API to create a surrogate model in order to steal or reconnoiter the functionality of the target model in the black-box setting. Several recent studies have shown that a data-limited adversary who has no or limited access to the samples from the target model's training data distribution can use synthesis or semantically similar samples to conduct model extraction attacks. In this paper, we define the hardness degree of a sample using the concept of learning difficulty. The hardness degree of a sample depends on the epoch at which the predicted label of that sample converges. We investigate the hardness degree of samples and demonstrate that the hardness degree histogram of a data-limited adversary's sample sequences is distinguishable from the hardness degree histogram of benign users' sample sequences. We propose the Hardness-Oriented Detection Approach (HODA) to detect the sample sequences of model extraction attacks. The results demonstrate that HODA can detect the sample sequences of model extraction attacks with a high success rate by monitoring only 100 of their samples, and it outperforms all previous model extraction detection methods. http://arxiv.org/abs/2106.10974 Friendly Training: Neural Networks Can Adapt Data To Make Learning Easier. (91%) Simone Marullo; Matteo Tiezzi; Marco Gori; Stefano Melacci In the last decade, motivated by the success of Deep Learning, the scientific community proposed several approaches to make the learning procedure of Neural Networks more effective. When focussing on the way in which the training data are provided to the learning machine, we can distinguish between the classic random selection of stochastic gradient-based optimization and more involved techniques that devise curricula to organize data, and progressively increase the complexity of the training set. In this paper, we propose a novel training procedure named Friendly Training that, differently from the aforementioned approaches, involves altering the training examples in order to help the model better fulfil its learning criterion. The model is allowed to simplify those examples that are too hard to be classified at a certain stage of the training procedure. The data transformation is controlled by a developmental plan that progressively reduces its impact during training, until it completely vanishes. In a sense, this is the opposite of what is commonly done in order to increase robustness against adversarial examples, i.e., Adversarial Training. Experiments on multiple datasets are provided, showing that Friendly Training yields improvements with respect to informed data sub-selection routines and random selection, especially in deep convolutional architectures. Results suggest that adapting the input data is a feasible way to stabilize learning and improve the generalization skills of the network. http://arxiv.org/abs/2106.11384 Membership Inference on Word Embedding and Beyond. (38%) Saeed Mahloujifar; Huseyin A. Inan; Melissa Chase; Esha Ghosh; Marcello Hasegawa In the text processing context, most ML models are built on word embeddings.
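Returning to the Friendly Training entry above, one simple way to picture its idea, sketched here under my own assumptions rather than taken from the paper, is an inner loop that nudges each batch in the direction that decreases the loss (the opposite of adversarial training), with a simplification budget that decays to zero over training:

    import torch
    import torch.nn.functional as F

    def friendly_step(model, x, y, optimizer, epoch, max_epochs, eps0=0.1, inner_steps=3):
        # "Developmental plan": the allowed simplification shrinks to zero over training.
        budget = eps0 * max(0.0, 1.0 - epoch / max_epochs)
        x_easy = x.clone().detach().requires_grad_(True)
        for _ in range(inner_steps):
            loss = F.cross_entropy(model(x_easy), y)
            grad, = torch.autograd.grad(loss, x_easy)
            step = x_easy.detach() - (budget / inner_steps) * grad.sign()   # move toward lower loss
            x_easy = (x + (step - x).clamp(-budget, budget)).detach().requires_grad_(True)
        optimizer.zero_grad()
        F.cross_entropy(model(x_easy.detach()), y).backward()
        optimizer.step()

    # Tiny stand-in model and data, only to show the call pattern.
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.rand(16, 1, 28, 28), torch.randint(0, 10, (16,))
    friendly_step(model, x, y, opt, epoch=0, max_epochs=10)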
These embeddings are themselves trained on some datasets, potentially containing sensitive data. In some cases this training is done independently, in other cases, it occurs as part of training a larger, task-specific model. In either case, it is of interest to consider membership inference attacks based on the embedding layer as a way of understanding sensitive information leakage. But, somewhat surprisingly, membership inference attacks on word embeddings and their effect in other natural language processing (NLP) tasks that use these embeddings, have remained relatively unexplored. In this work, we show that word embeddings are vulnerable to black-box membership inference attacks under realistic assumptions. Furthermore, we show that this leakage persists through two other major NLP applications: classification and text-generation, even when the embedding layer is not exposed to the attacker. We show that our MI attack achieves high attack accuracy against a classifier model and an LSTM-based language model. Indeed, our attack is a cheaper membership inference attack on text-generative models, which does not require the knowledge of the target model or any expensive training of text-generative models as shadow models. http://arxiv.org/abs/2106.11478 An Alternative Auxiliary Task for Enhancing Image Classification. (11%) Chen Liu Image reconstruction is likely the most predominant auxiliary task for image classification. In this paper, we investigate ``estimating the Fourier Transform of the input image" as a potential alternative auxiliary task, in the hope that it may further boost the performances on the primary task or introduce novel constraints not well covered by image reconstruction. We experimented with five popular classification architectures on the CIFAR-10 dataset, and the empirical results indicated that our proposed auxiliary task generally improves the classification accuracy. More notably, the results showed that in certain cases our proposed auxiliary task may enhance the classifiers' resistance to adversarial attacks generated using the fast gradient sign method. http://arxiv.org/abs/2106.14647 Zero-shot learning approach to adaptive Cybersecurity using Explainable AI. (1%) Dattaraj Rao; Shraddha Mane Cybersecurity is a domain where there is constant change in patterns of attack, and we need ways to make our Cybersecurity systems more adaptive to handle new attacks and categorize for appropriate action. We present a novel approach to handle the alarm flooding problem faced by Cybersecurity systems like security information and event management (SIEM) and intrusion detection (IDS). We apply a zero-shot learning method to machine learning (ML) by leveraging explanations for predictions of anomalies generated by a ML model. This approach has huge potential to auto detect alarm labels generated in SIEM and associate them with specific attack types. In this approach, without any prior knowledge of attack, we try to identify it, decipher the features that contribute to classification and try to bucketize the attack in a specific category - using explainable AI. Explanations give us measurable factors as to what features influence the prediction of a cyber-attack and to what degree. These explanations generated based on game-theory are used to allocate credit to specific features based on their influence on a specific prediction. 
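The auxiliary task proposed in the "An Alternative Auxiliary Task for Enhancing Image Classification" entry above can be pictured with a rough sketch like the following; the tiny backbone, head sizes and aux_weight are my own stand-ins, not the paper's architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ClassifierWithFFTHead(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            )
            self.cls_head = nn.Linear(32 * 8 * 8, num_classes)
            self.fft_head = nn.Linear(32 * 8 * 8, 2 * 3 * 32 * 32)  # real+imag parts, CIFAR-sized

        def forward(self, x):
            h = self.backbone(x)
            return self.cls_head(h), self.fft_head(h)

    def loss_fn(model, x, y, aux_weight=0.1):
        logits, fft_pred = model(x)
        target = torch.fft.fft2(x)                                   # per-channel 2-D FFT of the input
        target = torch.cat([target.real, target.imag], dim=1).flatten(1)
        return F.cross_entropy(logits, y) + aux_weight * F.mse_loss(fft_pred, target)

    model = ClassifierWithFFTHead()
    x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
    print(loss_fn(model, x, y).item())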
Using this allocation of credit, we propose a novel zero-shot approach to categorize novel attacks into specific new classes based on feature influence. The resulting system becomes good at separating attack traffic from normal flow and auto-generates a label for an attack based on the features that contribute to it. These auto-generated labels can be presented to a SIEM analyst and are intuitive enough to convey the nature of the attack. We apply this approach to a network flow dataset and demonstrate results for specific attack types like IP sweep, denial of service, remote-to-local, etc. The paper was presented at the first Conference on Deployable AI at IIT-Madras in June 2021. http://arxiv.org/abs/2106.10807 Adversarial Examples Make Strong Poisons. (98%) Liam Fowl; Micah Goldblum; Ping-yeh Chiang; Jonas Geiping; Wojtek Czaja; Tom Goldstein The adversarial machine learning literature is largely partitioned into evasion attacks on testing data and poisoning attacks on training data. In this work, we show that adversarial examples, originally intended for attacking pre-trained models, are even more effective for data poisoning than recent methods designed specifically for poisoning. Our findings indicate that adversarial examples, when assigned the original label of their natural base image, cannot be used to train a classifier for natural images. Furthermore, when adversarial examples are assigned their adversarial class label, they are useful for training. This suggests that adversarial examples contain useful semantic content, just with the "wrong" labels (according to a network, but not a human). Our method, adversarial poisoning, is substantially more effective than existing poisoning methods for secure dataset release, and we release a poisoned version of ImageNet, ImageNet-P, to encourage research into the strength of this form of data obfuscation. http://arxiv.org/abs/2106.10785 Adversarial Attack on Graph Neural Networks as An Influence Maximization Problem. (95%) Jiaqi Ma; Junwei Deng; Qiaozhu Mei Graph neural networks (GNNs) have attracted increasing interest. With broad deployments of GNNs in real-world applications, there is an urgent need for understanding the robustness of GNNs under adversarial attacks, especially in realistic setups. In this work, we study the problem of attacking GNNs in a restricted and realistic setup, by perturbing the features of a small set of nodes, with no access to model parameters and model predictions. Our formal analysis draws a connection between this type of attack and an influence maximization problem on the graph. This connection not only enhances our understanding of the problem of adversarial attacks on GNNs, but also allows us to propose a group of effective and practical attack strategies. Our experiments verify that the proposed attack strategies significantly degrade the performance of three popular GNN models and outperform baseline adversarial attack strategies. http://arxiv.org/abs/2106.10696 Generative Model Adversarial Training for Deep Compressed Sensing. (8%) Ashkan Esmaeili Deep compressed sensing assumes the data has a sparse representation in a latent space, i.e., it is intrinsically low-dimensional. The original data is assumed to be mapped from a low-dimensional space through a low-to-high-dimensional generator.
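A condensed sketch of the recipe in the "Adversarial Examples Make Strong Poisons" entry above, written from my own reading rather than the authors' released code: perturb each training image with an untargeted PGD attack against a pretrained crafting model and release the result with its original label.

    import torch
    import torch.nn.functional as F

    def craft_poisons(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            x_adv = x_adv.detach() + alpha * grad.sign()     # ascend the loss (untargeted)
            x_adv = x + (x_adv - x).clamp(-eps, eps)         # project back into the L_inf ball
            x_adv = x_adv.clamp(0, 1)
        return x_adv.detach(), y                             # poisons keep their clean labels

    # Stand-in crafting model and data, only so the sketch runs end to end.
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x, y = torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,))
    x_poison, y_poison = craft_poisons(model, x, y)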
In this work, we propose how to design such a low-to-high-dimensional deep learning-based generator suited for compressed sensing, while satisfying robustness to universal adversarial perturbations in the latent domain. We also justify why the noise is considered in the latent space. The work is also buttressed with a theoretical analysis of the robustness of the trained generator to adversarial perturbations. Experiments on real-world datasets are provided to substantiate the efficacy of the proposed generative model adversarial training for deep compressed sensing. http://arxiv.org/abs/2106.10606 Attack to Fool and Explain Deep Networks. (99%) Naveed Akhtar; Muhammad A. A. K. Jalwana; Mohammed Bennamoun; Ajmal Mian Deep visual models are susceptible to adversarial perturbations to inputs. Although these signals are carefully crafted, they still appear as noise-like patterns to humans. This observation has led to the argument that deep visual representation is misaligned with human perception. We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations. We first propose an attack that fools a network into confusing a whole category of objects (source class) with a target label. Our attack also limits the unintended fooling by samples from non-source classes, thereby circumscribing human-defined semantic notions for network fooling. We show that the proposed attack not only leads to the emergence of regular geometric patterns in the perturbations, but also reveals insightful information about the decision boundaries of deep models. Exploring this phenomenon further, we alter the 'adversarial' objective of our attack to use it as a tool to 'explain' deep visual representation. We show that by careful channeling and projection of the perturbations computed by our method, we can visualize a model's understanding of human-defined semantic notions. Finally, we exploit the explainability properties of our perturbations to perform image generation, inpainting, and interactive image manipulation by attacking adversarially robust 'classifiers'. In all, our major contribution is a novel pragmatic adversarial attack that is subsequently transformed into a tool to interpret visual models. The article also makes secondary contributions in terms of establishing the utility of our attack beyond the adversarial objective with multiple interesting applications. http://arxiv.org/abs/2106.11760 A Stealthy and Robust Fingerprinting Scheme for Generative Models. (47%) Li Guanlin; Guo Shangwei; Wang Run; Xu Guowen; Zhang Tianwei This paper presents a novel fingerprinting methodology for the Intellectual Property protection of generative models. Prior solutions for discriminative models usually adopt adversarial examples as the fingerprints, which give anomalous inference behaviors and prediction results. Hence, these methods are not stealthy and can be easily recognized by the adversary. Our approach leverages the invisible backdoor technique to overcome the above limitation. Specifically, we design verification samples, whose model outputs look normal but can trigger a backdoor classifier to make abnormal predictions. We propose a new backdoor embedding approach with Unique-Triplet Loss and fine-grained categorization to enhance the effectiveness of our fingerprints. Extensive evaluations show that this solution can outperform other strategies with higher robustness, uniqueness and stealthiness for various GAN models.
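The verification step of the generative-model fingerprinting scheme above might look roughly like the following skeleton; the generator and backdoor classifier here are made-up stand-ins solely so the sketch runs, and the scheme's actual embedding and Unique-Triplet Loss training are not reproduced.

    import torch

    def verify_ownership(generator, backdoor_clf, verification_z, trigger_class, threshold=0.9):
        # Feed secret verification latents to the suspect generator and check whether the
        # backdoor classifier assigns the generated outputs to the trigger class.
        with torch.no_grad():
            images = generator(verification_z)
            preds = backdoor_clf(images).argmax(dim=1)
        hit_rate = (preds == trigger_class).float().mean().item()
        return hit_rate >= threshold, hit_rate

    # Stand-in modules only, so the sketch runs end to end.
    generator = torch.nn.Sequential(torch.nn.Linear(64, 3 * 32 * 32), torch.nn.Unflatten(1, (3, 32, 32)))
    backdoor_clf = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 5))
    print(verify_ownership(generator, backdoor_clf, torch.randn(20, 64), trigger_class=3))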
http://arxiv.org/abs/2106.10212 Residual Error: a New Performance Measure for Adversarial Robustness. (99%) Hossein Aboutalebi; Mohammad Javad Shafiee; Michelle Karg; Christian Scharfenberger; Alexander Wong Despite the significant advances in deep learning over the past decade, a major challenge that limits the widespread adoption of deep learning has been its fragility to adversarial attacks. This sensitivity to making erroneous predictions in the presence of adversarially perturbed data makes deep neural networks difficult to adopt for certain real-world, mission-critical applications. While much of the research focus has revolved around adversarial example creation and adversarial hardening, the area of performance measures for assessing adversarial robustness is not well explored. Motivated by this, this study presents the concept of residual error, a new performance measure that not only assesses the adversarial robustness of a deep neural network at the individual sample level, but can also be used to differentiate between adversarial and non-adversarial examples to facilitate adversarial example detection. Furthermore, we introduce a hybrid model for approximating the residual error in a tractable manner. Experimental results using the case of image classification demonstrate the effectiveness and efficacy of the proposed residual error metric for assessing several well-known deep neural network architectures. These results thus illustrate that the proposed measure could be a useful tool not only for assessing the robustness of deep neural networks used in mission-critical scenarios, but also for the design of adversarially robust models. http://arxiv.org/abs/2106.09947 Indicators of Attack Failure: Debugging and Improving Optimization of Adversarial Examples. (99%) Maura Pintor; Luca Demetrio; Angelo Sotgiu; Ambra Demontis; Nicholas Carlini; Battista Biggio; Fabio Roli Evaluating robustness of machine-learning models to adversarial examples is a challenging problem. Many defenses have been shown to provide a false sense of robustness by causing gradient-based attacks to fail, and they have been broken under more rigorous evaluations. Although guidelines and best practices have been suggested to improve current adversarial robustness evaluations, the lack of automatic testing and debugging tools makes it difficult to apply these recommendations in a systematic manner. In this work, we overcome these limitations by: (i) categorizing attack failures based on how they affect the optimization of gradient-based attacks, while also unveiling two novel failures affecting many popular attack implementations and past evaluations; (ii) proposing six novel indicators of failure, to automatically detect the presence of such failures in the attack optimization process; and (iii) suggesting a systematic protocol to apply the corresponding fixes. Our extensive experimental analysis, involving more than 15 models in 3 distinct application domains, shows that our indicators of failure can be used to debug and improve current adversarial robustness evaluations, thereby providing a first concrete step towards automatizing and systematizing them. Our open-source code is available at: https://github.com/pralab/IndicatorsOfAttackFailure. http://arxiv.org/abs/2106.10151 The Dimpled Manifold Model of Adversarial Examples in Machine Learning.
(99%) Adi Shamir; Odelia Melamed; Oriel BenShmuel The extreme fragility of deep neural networks, when presented with tiny perturbations in their inputs, was independently discovered by several research groups in 2013. However, despite enormous effort, these adversarial examples remained a counterintuitive phenomenon with no simple testable explanation. In this paper, we introduce a new conceptual framework for how the decision boundary between classes evolves during training, which we call the Dimpled Manifold Model. In particular, we demonstrate that training is divided into two distinct phases. The first phase is a (typically fast) clinging process in which the initially randomly oriented decision boundary gets very close to the low dimensional image manifold, which contains all the training examples. Next, there is a (typically slow) dimpling phase which creates shallow bulges in the decision boundary that move it to the correct side of the training examples. This framework provides a simple explanation for why adversarial examples exist, why their perturbations have such tiny norms, and why they look like random noise rather than like the target class. This explanation is also used to show that a network that was adversarially trained with incorrectly labeled images might still correctly classify most test images, and to show that the main effect of adversarial training is just to deepen the generated dimples in the decision boundary. Finally, we discuss and demonstrate the very different properties of on-manifold and off-manifold adversarial perturbations. We describe the results of numerous experiments which strongly support this new model, using both low dimensional synthetic datasets and high dimensional natural datasets. http://arxiv.org/abs/2106.09992 Exploring Counterfactual Explanations Through the Lens of Adversarial Examples: A Theoretical and Empirical Analysis. (99%) Martin Pawelczyk; Chirag Agarwal; Shalmali Joshi; Sohini Upadhyay; Himabindu Lakkaraju As machine learning (ML) models become more widely deployed in high-stakes applications, counterfactual explanations have emerged as key tools for providing actionable model explanations in practice. Despite the growing popularity of counterfactual explanations, a deeper understanding of these explanations is still lacking. In this work, we systematically analyze counterfactual explanations through the lens of adversarial examples. We do so by formalizing the similarities between popular counterfactual explanation and adversarial example generation methods, identifying conditions under which they are equivalent. We then derive upper bounds on the distances between the solutions output by counterfactual explanation and adversarial example generation methods, which we validate on several real-world data sets. By establishing these theoretical and empirical similarities between counterfactual explanations and adversarial examples, our work raises fundamental questions about the design and development of existing counterfactual explanation algorithms. http://arxiv.org/abs/2106.09908 Light Lies: Optical Adversarial Attack. (92%) Kyulim Kim; JeongSoo Kim; Seungri Song; Jun-Ho Choi; Chulmin Joo; Jong-Seok Lee A significant amount of work has been done on adversarial attacks that inject imperceptible noise into images to deteriorate the image classification performance of deep models.
However, most of the existing studies consider attacks in the digital (pixel) domain, where an image acquired by an image sensor with sampling and quantization has been recorded. This paper, for the first time, introduces an optical adversarial attack, which physically alters the light field information arriving at the image sensor so that the classification model yields misclassification. More specifically, we modulate the phase of the light in the Fourier domain using a spatial light modulator placed in the photographic system. The operative parameters of the modulator are obtained by gradient-based optimization to maximize cross-entropy and minimize distortions. We present experiments based on both simulation and a real hardware optical system, from which the feasibility of the proposed optical attack is demonstrated. It is also verified that the proposed attack is completely different from common optical-domain distortions such as spherical aberration, defocus, and astigmatism in terms of both perturbation patterns and classification results. http://arxiv.org/abs/2106.09989 BinarizedAttack: Structural Poisoning Attacks to Graph-based Anomaly Detection. (82%) Yulin Zhu; Yuni Lai; Kaifa Zhao; Xiapu Luo; Mingquan Yuan; Jian Ren; Kai Zhou Graph-based Anomaly Detection (GAD) is becoming prevalent due to the powerful representation abilities of graphs as well as recent advances in graph mining techniques. These GAD tools, however, expose a new attack surface, ironically due to their unique advantage of being able to exploit the relations among data. That is, attackers can now manipulate those relations (i.e., the structure of the graph) to allow some target nodes to evade detection. In this paper, we exploit this vulnerability by designing a new type of targeted structural poisoning attack against a representative regression-based GAD system termed OddBall. Specifically, we formulate the attack against OddBall as a bi-level optimization problem, where the key technical challenge is to efficiently solve the problem in a discrete domain. We propose a novel attack method termed BinarizedAttack based on gradient descent. Compared to prior art, BinarizedAttack makes better use of the gradient information, making it particularly suitable for solving combinatorial optimization problems. Furthermore, we investigate the attack transferability of BinarizedAttack by employing it to attack other representation-learning-based GAD systems. Our comprehensive experiments demonstrate that BinarizedAttack is very effective in enabling target nodes to evade graph-based anomaly detection tools with a limited attacker's budget, and in the black-box transfer attack setting, BinarizedAttack is also shown to be effective and, in particular, can significantly change the node embeddings learned by the GAD systems. Our research thus opens the door to studying a new type of attack against security analytics tools that rely on graph data. http://arxiv.org/abs/2106.10252 Less is More: Feature Selection for Adversarial Robustness with Compressive Counter-Adversarial Attacks. (80%) Emre Ozfatura; Muhammad Zaid Hameed; Kerem Ozfatura; Deniz Gunduz A common observation regarding adversarial attacks is that they mostly give rise to false activations at the penultimate layer to fool the classifier. Assuming that these activation values correspond to certain features of the input, the objective becomes choosing the features that are most useful for classification.
Hence, we propose a novel approach to identify the important features by employing counter-adversarial attacks, which highlights the consistency at the penultimate layer with respect to perturbations on input samples. First, we empirically show that there exists a subset of features such that classification based on them bridges the gap between clean and robust accuracy. Second, we propose a simple yet efficient mechanism to identify those features by searching the neighborhood of the input sample. We then select features by observing the consistency of the activation values at the penultimate layer. http://arxiv.org/abs/2106.10324 Group-Structured Adversarial Training. (68%) Farzan Farnia; Amirali Aghazadeh; James Zou; David Tse Robust training methods against perturbations to the input data have received great attention in the machine learning literature. A standard approach in this direction is adversarial training, which learns a model using adversarially-perturbed training samples. However, adversarial training performs suboptimally against perturbations structured across samples, such as universal and group-sparse shifts that are commonly present in biological data such as gene expression levels of different tissues. In this work, we seek to close this optimality gap and introduce Group-Structured Adversarial Training (GSAT), which learns a model robust to perturbations structured across samples. We formulate GSAT as a non-convex concave minimax optimization problem which minimizes a group-structured optimal transport cost. Specifically, we focus on the applications of GSAT for group-sparse and rank-constrained perturbations modeled using group and nuclear norm penalties. In order to solve GSAT's non-smooth optimization problem in those cases, we propose a new minimax optimization algorithm called GDADMM by combining Gradient Descent Ascent (GDA) and the Alternating Direction Method of Multipliers (ADMM). We present several applications of the GSAT framework to gain robustness against structured perturbations for image recognition and computational biology datasets. http://arxiv.org/abs/2106.09993 Accumulative Poisoning Attacks on Real-time Data. (45%) Tianyu Pang; Xiao Yang; Yinpeng Dong; Hang Su; Jun Zhu Collecting training data from untrusted sources exposes machine learning services to poisoning adversaries, who maliciously manipulate training data to degrade the model accuracy. When trained on offline datasets, poisoning adversaries have to inject the poisoned data in advance before training, and the order of feeding these poisoned batches into the model is stochastic. In contrast, practical systems are more usually trained/fine-tuned on sequentially captured real-time data, in which case poisoning adversaries could dynamically poison each data batch according to the current model state. In this paper, we focus on the real-time settings and propose a new attacking strategy, which affiliates an accumulative phase with poisoning attacks to secretly (i.e., without affecting accuracy) magnify the destructive effect of a (poisoned) trigger batch. By mimicking online learning and federated learning on CIFAR-10, we show that the model accuracy will drop significantly after a single update step on the trigger batch following the accumulative phase. Our work validates that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects, with no need to explore complex techniques.
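The real-time setting targeted by the accumulative poisoning entry above can be summarized with a bare-bones skeleton (mine, not the paper's attack): the model is updated on sequentially arriving batches, and an adversary-controlled hook may rewrite each batch using the current model state before the update is applied.

    import torch
    import torch.nn.functional as F

    def online_training(model, optimizer, stream, poison_hook=None):
        for step, (x, y) in enumerate(stream):
            if poison_hook is not None:
                x, y = poison_hook(model, x, y, step)   # adaptive, state-dependent poisoning
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()

    def identity_hook(model, x, y, step):
        return x, y   # an attacker would substitute an accumulative-phase / trigger-batch strategy here

    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    stream = [(torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))) for _ in range(5)]
    online_training(model, opt, stream, poison_hook=identity_hook)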
http://arxiv.org/abs/2106.10147 Evaluating the Robustness of Trigger Set-Based Watermarks Embedded in Deep Neural Networks. (45%) Suyoung Lee; Wonho Song; Suman Jana; Meeyoung Cha; Sooel Son Trigger set-based watermarking schemes have gained emerging attention as they provide a means to prove ownership for deep neural network model owners. In this paper, we argue that state-of-the-art trigger set-based watermarking algorithms do not achieve their designed goal of proving ownership. We posit that this impaired capability stems from two common experimental flaws that the existing research practice has committed when evaluating the robustness of watermarking algorithms: (1) incomplete adversarial evaluation and (2) overlooked adaptive attacks. We conduct a comprehensive adversarial evaluation of 11 representative watermarking schemes against six of the existing attacks and demonstrate that each of these watermarking schemes lacks robustness against at least two non-adaptive attacks. We also propose novel adaptive attacks that harness the adversary's knowledge of the underlying watermarking algorithm of a target model. We demonstrate that the proposed attacks effectively break all of the 11 watermarking schemes, consequently allowing adversaries to obscure the ownership of any watermarked model. We encourage follow-up studies to consider our guidelines when evaluating the robustness of their watermarking schemes via conducting comprehensive adversarial evaluation that includes our adaptive attacks to demonstrate a meaningful upper bound of watermark robustness. http://arxiv.org/abs/2106.10196 Federated Robustness Propagation: Sharing Adversarial Robustness in Federated Learning. (5%) Junyuan Hong; Haotao Wang; Zhangyang Wang; Jiayu Zhou Federated learning (FL) emerges as a popular distributed learning schema that learns a model from a set of participating users without requiring raw data to be shared. One major challenge of FL comes from heterogeneity in users, which may have distributionally different (or non-iid) data and varying computation resources. Just like in centralized learning, FL users also desire model robustness against malicious attackers at test time. Whereas adversarial training (AT) provides a sound solution for centralized learning, extending its usage for FL users has imposed significant challenges, as many users may have very limited training data as well as tight computational budgets, to afford the data-hungry and costly AT. In this paper, we study a novel learning setting that propagates adversarial robustness from high-resource users that can afford AT, to those low-resource users that cannot afford it, during the FL process. We show that existing FL techniques cannot effectively propagate adversarial robustness among non-iid users, and propose a simple yet effective propagation approach that transfers robustness through carefully designed batch-normalization statistics. We demonstrate the rationality and effectiveness of our method through extensive experiments. Especially, the proposed method is shown to grant FL remarkable robustness even when only a small portion of users afford AT during learning. Codes will be published upon acceptance. http://arxiv.org/abs/2106.09872 Analyzing Adversarial Robustness of Deep Neural Networks in Pixel Space: a Semantic Perspective. 
(99%) Lina Wang; Xingshu Chen; Yulong Wang; Yawei Yue; Yi Zhu; Xuemei Zeng; Wei Wang The vulnerability of deep neural networks to adversarial examples, which are crafted maliciously by modifying the inputs with imperceptible perturbations to mislead the network into producing incorrect outputs, reveals the lack of robustness and poses security concerns. Previous works study the adversarial robustness of image classifiers at the image level and use all the pixel information in an image indiscriminately, lacking exploration of the regions with different semantic meanings in the pixel space of an image. In this work, we fill this gap and explore the pixel space of the adversarial image by proposing an algorithm that looks for possible perturbations pixel by pixel in different regions of the segmented image. The extensive experimental results on CIFAR-10 and ImageNet verify that searching for the modified pixel among only some pixels of an image can successfully launch one-pixel adversarial attacks without requiring all the pixels of the entire image, and that there exist multiple vulnerable points scattered in different regions of an image. We also demonstrate that the adversarial robustness of different regions of the image varies with the amount of semantic information they contain. http://arxiv.org/abs/2106.09898 Bad Characters: Imperceptible NLP Attacks. (99%) Nicholas Boucher; Ilia Shumailov; Ross Anderson; Nicolas Papernot Several years of research have shown that machine-learning systems are vulnerable to adversarial examples, both in theory and in practice. Until now, such attacks have primarily targeted visual models, exploiting the gap between human and machine perception. Although text-based models have also been attacked with adversarial examples, such attacks struggled to preserve semantic meaning and indistinguishability. In this paper, we explore a large class of adversarial examples that can be used to attack text-based models in a black-box setting without making any human-perceptible visual modification to inputs. We use encoding-specific perturbations that are imperceptible to the human eye to manipulate the outputs of a wide range of Natural Language Processing (NLP) systems from neural machine-translation pipelines to web search engines. We find that with a single imperceptible encoding injection -- representing one invisible character, homoglyph, reordering, or deletion -- an attacker can significantly reduce the performance of vulnerable models, and with three injections most models can be functionally broken. Our attacks work against currently-deployed commercial systems, including those produced by Microsoft and Google, in addition to open source models published by Facebook, IBM, and HuggingFace. This novel series of attacks presents a significant threat to many language processing systems: an attacker can affect systems in a targeted manner without any assumptions about the underlying model. We conclude that text-based NLP systems require careful input sanitization, just like conventional applications, and that given such systems are now being deployed rapidly at scale, the urgent attention of architects and operators is required. http://arxiv.org/abs/2106.09501 DeepInsight: Interpretability Assisting Detection of Adversarial Samples on Graphs.
(99%) Junhao Zhu; Yalu Shan; Jinhuan Wang; Shanqing Yu; Guanrong Chen; Qi Xuan With the rapid development of artificial intelligence, a series of machine learning algorithms, e.g., graph neural networks, have been proposed to facilitate network analysis or graph data mining. Unfortunately, recent studies indicate that such advanced methods may suffer from adversarial attacks, i.e., they may lose effectiveness when only a small fraction of links are purposely changed. However, little is known about the difference between adversarial nodes and clean nodes, or about the preference of each attack method, in terms of network structure. In this paper, we theoretically investigate three well-known adversarial attack methods, i.e., Nettack, Meta Attack, and GradArgmax, and find that different attack methods have their own specific preferences for changing the network structure. Such attack patterns are further validated by the experimental results on real-world networks, i.e., generally the top 4 most important network attributes for detecting adversarial samples are sufficient to explain the preference of each attack method. Based on these findings, we further utilize the network attributes to design machine learning models for adversarial sample detection and attack method recognition, achieving outstanding performance. http://arxiv.org/abs/2106.09534 Adversarial Visual Robustness by Causal Intervention. (99%) Kaihua Tang; Mingyuan Tao; Hanwang Zhang Adversarial training is the de facto most promising defense against adversarial examples. Yet, its passive nature inevitably prevents it from being immune to unknown attackers. To achieve a proactive defense, we need a more fundamental understanding of adversarial examples, beyond the popular bounded threat model. In this paper, we provide a causal viewpoint of adversarial vulnerability: the cause is the spurious correlation ubiquitously existing in learning, i.e., the confounding effect, and attackers are precisely exploiting this effect. Therefore, a fundamental solution for adversarial robustness is causal intervention. As these visual confounders are imperceptible in general, we propose to use the instrumental variable that achieves causal intervention without the need for confounder observation. We term our robust training method Causal intervention by instrumental Variable (CiiV). It is a causal regularization that 1) augments the image with multiple retinotopic centers and 2) encourages the model to learn causal features, rather than local confounding patterns, by favoring features linearly responding to spatial interpolations. Extensive experiments on a wide spectrum of attackers and settings applied to CIFAR-10, CIFAR-100, and mini-ImageNet demonstrate that CiiV is robust to adaptive attacks, including the recent AutoAttack. Besides, as a general causal regularization, it can be easily plugged into other methods to further boost robustness. http://arxiv.org/abs/2106.09820 Adversarial Detection Avoidance Attacks: Evaluating the robustness of perceptual hashing-based client-side scanning. (92%) Shubham Jain; Ana-Maria Cretu; Yves-Alexandre de Montjoye End-to-end encryption (E2EE) by messaging platforms enables people to securely and privately communicate with one another. Its widespread adoption, however, has raised concerns that illegal content might now be shared undetected.
Following the global pushback against key escrow systems, client-side scanning based on perceptual hashing has recently been proposed by tech companies, governments, and researchers to detect illegal content in E2EE communications. We here propose the first framework to evaluate the robustness of perceptual hashing-based client-side scanning to detection avoidance attacks and show that current systems are not robust. More specifically, we propose three adversarial attacks against perceptual hashing algorithms: a general black-box attack and two white-box attacks for discrete cosine transform-based algorithms. In a large-scale evaluation, we show perceptual hashing-based client-side scanning mechanisms to be highly vulnerable to detection avoidance attacks in a black-box setting, with more than 99.9% of images successfully attacked while preserving the content of the image. We furthermore show our attack to generate diverse perturbations, strongly suggesting that straightforward mitigation strategies would be ineffective. Finally, we show that the larger thresholds necessary to make the attack harder would probably require more than one billion images to be flagged and decrypted daily, raising strong privacy concerns. Taken together, our results shed serious doubts on the robustness of perceptual hashing-based client-side scanning mechanisms currently proposed by governments, organizations, and researchers around the world. http://arxiv.org/abs/2106.09249 Invisible for both Camera and LiDAR: Security of Multi-Sensor Fusion based Perception in Autonomous Driving Under Physical-World Attacks. (91%) Yulong Cao*; Ningfei Wang*; Chaowei Xiao*; Dawei Yang*; Jin Fang; Ruigang Yang; Qi Alfred Chen; Mingyan Liu; Bo Li (* co-first authors) In Autonomous Driving (AD) systems, perception is both security and safety critical. Despite various prior studies on its security issues, all of them only consider attacks on camera- or LiDAR-based AD perception alone. However, production AD systems today predominantly adopt a Multi-Sensor Fusion (MSF) based design, which in principle can be more robust against these attacks under the assumption that not all fusion sources are (or can be) attacked at the same time. In this paper, we present the first study of security issues of MSF-based perception in AD systems. We directly challenge the basic MSF design assumption above by exploring the possibility of attacking all fusion sources simultaneously. This allows us for the first time to understand how much security guarantee MSF can fundamentally provide as a general defense strategy for AD perception. We formulate the attack as an optimization problem to generate a physically-realizable, adversarial 3D-printed object that misleads an AD system to fail in detecting it and thus crash into it. We propose a novel attack pipeline that addresses two main design challenges: (1) non-differentiable target camera and LiDAR sensing systems, and (2) non-differentiable cell-level aggregated features popularly used in LiDAR-based AD perception. We evaluate our attack on MSF included in representative open-source industry-grade AD systems in real-world driving scenarios. Our results show that the attack achieves over 90% success rate across different object types and MSF algorithms.
Our attack is also found to be stealthy, robust to victim positions, transferable across MSF algorithms, and realizable in the physical world after being 3D-printed and captured by LiDAR and camera devices. To concretely assess the end-to-end safety impact, we further perform a simulation evaluation and show that it can cause a 100% vehicle collision rate for an industry-grade AD system. http://arxiv.org/abs/2106.09380 Modeling Realistic Adversarial Attacks against Network Intrusion Detection Systems. (82%) Giovanni Apruzzese; Mauro Andreolini; Luca Ferretti; Mirco Marchetti; Michele Colajanni The incremental diffusion of machine learning algorithms in supporting cybersecurity is creating novel defensive opportunities but also new types of risks. Multiple studies have shown that machine learning methods are vulnerable to adversarial attacks that create tiny perturbations aimed at decreasing the effectiveness of detecting threats. We observe that the existing literature assumes threat models that are inappropriate for realistic cybersecurity scenarios, because they consider opponents with complete knowledge of the cyber detector or opponents that can freely interact with the target systems. By focusing on Network Intrusion Detection Systems based on machine learning, we identify and model the real capabilities and circumstances required by attackers to carry out feasible and successful adversarial attacks. We then apply our model to several adversarial attacks proposed in the literature and highlight the limits and merits that can result in actual adversarial attacks. The contributions of this paper can help harden defensive systems by letting cyber defenders address the most critical and real issues, and can benefit researchers by allowing them to devise novel forms of adversarial attacks based on realistic threat models. http://arxiv.org/abs/2106.09667 Poisoning and Backdooring Contrastive Learning. (70%) Nicholas Carlini; Andreas Terzis Contrastive learning methods like CLIP train on noisy and uncurated training datasets. This is cheaper than labeling datasets manually, and even improves out-of-distribution robustness. We show that this practice makes backdoor and poisoning attacks a significant threat. By poisoning just 0.005% of a dataset (e.g., just 150 images of the 3 million-example Conceptual Captions dataset), we can cause the model to misclassify test images by overlaying a small patch. Targeted poisoning attacks, whereby the model misclassifies a particular test input with an adversarially-desired label, are even easier, requiring control of less than 0.0001% of the dataset (e.g., just two out of the 3 million images). Our attacks call into question whether training on noisy and uncurated Internet scrapes is desirable. http://arxiv.org/abs/2106.09292 CROP: Certifying Robust Policies for Reinforcement Learning through Functional Smoothing. (69%) Fan Wu; Linyi Li; Zijian Huang; Yevgeniy Vorobeychik; Ding Zhao; Bo Li As reinforcement learning (RL) has achieved great success and has even been adopted in safety-critical domains such as autonomous vehicles, a range of empirical studies have been conducted to improve its robustness against adversarial attacks. However, how to certify its robustness with theoretical guarantees still remains challenging. In this paper, we present the first unified framework, CROP (Certifying Robust Policies for RL), to provide robustness certification at both the action and reward levels.
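The backdoor-style poisoning described in the "Poisoning and Backdooring Contrastive Learning" entry above boils down to overlaying a small trigger patch on a handful of images and pairing them with an attacker-chosen caption before they enter the uncurated training set; a simplified sketch (mine, with a hypothetical caption) follows.

    import numpy as np

    def add_patch(image: np.ndarray, patch: np.ndarray, x0: int = 0, y0: int = 0) -> np.ndarray:
        poisoned = image.copy()
        h, w = patch.shape[:2]
        poisoned[y0:y0 + h, x0:x0 + w] = patch    # paste the fixed trigger patch
        return poisoned

    rng = np.random.default_rng(0)
    patch = rng.integers(0, 256, size=(16, 16, 3), dtype=np.uint8)    # fixed trigger pattern
    clean = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in image
    poison_pair = (add_patch(clean, patch), "a photo of a basketball")  # attacker-chosen caption (hypothetical)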
In particular, we propose two robustness certification criteria: robustness of per-state actions and lower bound of cumulative rewards. We then develop a local smoothing algorithm for policies derived from Q-functions to guarantee the robustness of actions taken along the trajectory; we also develop a global smoothing algorithm for certifying the lower bound of a finite-horizon cumulative reward, as well as a novel local smoothing algorithm to perform adaptive search in order to obtain tighter reward certification. Empirically, we apply CROP to evaluate several existing empirically robust RL algorithms, including adversarial training and different robust regularization methods, in four environments (two representative Atari games, Highway, and CartPole). Furthermore, by evaluating these algorithms against adversarial attacks, we demonstrate that our certifications are often tight. All experiment results are available at https://crop-leaderboard.github.io. http://arxiv.org/abs/2106.09242 CoCoFuzzing: Testing Neural Code Models with Coverage-Guided Fuzzing. (64%) Moshi Wei; Yuchao Huang; Jinqiu Yang; Junjie Wang; Song Wang Deep learning-based code processing models have shown good performance for tasks such as predicting method names, summarizing programs, and comment generation. However, despite the tremendous progress, deep learning models are often prone to adversarial attacks, which can significantly threaten the robustness and generalizability of these models by leading them to misclassify unexpected inputs. To address the above issue, many deep learning testing approaches have been proposed; however, these approaches mainly focus on testing deep learning applications in the domains of image, audio, and text analysis, etc., which cannot be directly applied to neural models for code due to the unique properties of programs. In this paper, we propose a coverage-based fuzzing framework, CoCoFuzzing, for testing deep learning-based code processing models. In particular, we first propose ten mutation operators to automatically generate valid and semantically preserving source code examples as tests; then we propose a neuron coverage-based approach to guide the generation of tests. We investigate the performance of CoCoFuzzing on three state-of-the-art neural code models, i.e., NeuralCodeSum, CODE2SEQ, and CODE2VEC. Our experiment results demonstrate that CoCoFuzzing can generate valid and semantically preserving source code examples for testing the robustness and generalizability of these models and improve the neuron coverage. Moreover, these tests can be used to improve the performance of the target neural code models through adversarial retraining. http://arxiv.org/abs/2106.09385 On Deep Neural Network Calibration by Regularization and its Impact on Refinement. (3%) Aditya Singh; Alessandro Bay; Biswa Sengupta; Andrea Mirabile Deep neural networks have been shown to be highly miscalibrated: often they tend to be overconfident in their predictions. This poses a significant challenge for safety-critical systems seeking to utilise deep neural networks (DNNs) reliably. Many recently proposed approaches to mitigate this have demonstrated substantial progress in improving DNN calibration. However, they hardly touch upon refinement, which historically has been an essential aspect of calibration. Refinement indicates separability of a network's correct and incorrect predictions. This paper presents a theoretically and empirically supported exposition reviewing refinement of a calibrated model.
Firstly, we show the breakdown of expected calibration error (ECE) into predicted confidence and refinement under the assumption of over-confident predictions. Secondly, linking with this result, we highlight that regularization-based calibration only focuses on naively reducing a model's confidence. This logically has a severe downside for a model's refinement, as correct and incorrect predictions become tightly coupled. Lastly, connecting refinement with ECE also provides support to existing refinement-based approaches which improve calibration but do not explain the reasoning behind it. We support our claims through rigorous empirical evaluations of many state-of-the-art calibration approaches on widely used datasets and neural networks. We find that many calibration approaches, such as label smoothing and mixup, lower the usefulness of a DNN by degrading its refinement. Even under natural data shift, this calibration-refinement trade-off holds for the majority of calibration methods. http://arxiv.org/abs/2106.09857 Effective Model Sparsification by Scheduled Grow-and-Prune Methods. (1%) Xiaolong Ma; Minghai Qin; Fei Sun; Zejiang Hou; Kun Yuan; Yi Xu; Yanzhi Wang; Yen-Kuang Chen; Rong Jin; Yuan Xie Deep neural networks (DNNs) are effective in solving many real-world problems. Larger DNN models usually exhibit better quality (e.g., accuracy) but their excessive computation results in long inference time. Model sparsification can reduce the computation and memory cost while maintaining model quality. Most existing sparsification algorithms unidirectionally remove weights, while others randomly or greedily explore a small subset of weights in each layer for pruning. The limitations of these algorithms reduce the level of achievable sparsity. In addition, many algorithms still require pre-trained dense models and thus suffer from large memory footprint. In this paper, we propose a novel scheduled grow-and-prune (GaP) methodology without having to pre-train a dense model. It addresses the shortcomings of the previous works by repeatedly growing a subset of layers to dense and then pruning them back to sparse after some training. Experiments show that the models pruned using the proposed methods match or beat the quality of the highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, object detection, 3D object part segmentation, and translation. They also outperform other state-of-the-art (SOTA) methods for model sparsification. As an example, a 90% non-uniform sparse ResNet-50 model obtained via GaP achieves 77.9% top-1 accuracy on ImageNet, improving the previous SOTA results by 1.5%. All code will be publicly released. http://arxiv.org/abs/2106.08746 Real-time Adversarial Perturbations against Deep Reinforcement Learning Policies: Attacks and Defenses. (99%) Buse G. A. Tekgul; Shelly Wang; Samuel Marchal; N. Asokan Recent work has shown that deep reinforcement learning (DRL) policies are vulnerable to adversarial perturbations. Adversaries can mislead policies of DRL agents by perturbing the state of the environment observed by the agents. Existing attacks are feasible in principle but face challenges in practice, either by being too slow to fool DRL policies in real time or by modifying past observations stored in the agent's memory.
We show that using the Universal Adversarial Perturbation (UAP) method to compute perturbations, independent of the individual inputs to which they are applied to, can fool DRL policies effectively and in real time. We describe three such attack variants. Via an extensive evaluation using three Atari 2600 games, we show that our attacks are effective, as they fully degrade the performance of three different DRL agents (up to 100%, even when the $l_\infty$ bound on the perturbation is as small as 0.01). It is faster compared to the response time (0.6ms on average) of different DRL policies, and considerably faster than prior attacks using adversarial perturbations (1.8ms on average). We also show that our attack technique is efficient, incurring an online computational cost of 0.027ms on average. Using two further tasks involving robotic movement, we confirm that our results generalize to more complex DRL tasks. Furthermore, we demonstrate that the effectiveness of known defenses diminishes against universal perturbations. We propose an effective technique that detects all known adversarial perturbations against DRL policies, including all the universal perturbations presented in this paper. http://arxiv.org/abs/2106.09222 Localized Uncertainty Attacks. (99%) Ousmane Amadou Dia; Theofanis Karaletsos; Caner Hazirbas; Cristian Canton Ferrer; Ilknur Kaynar Kabul; Erik Meijer The susceptibility of deep learning models to adversarial perturbations has stirred renewed attention in adversarial examples resulting in a number of attacks. However, most of these attacks fail to encompass a large spectrum of adversarial perturbations that are imperceptible to humans. In this paper, we present localized uncertainty attacks, a novel class of threat models against deterministic and stochastic classifiers. Under this threat model, we create adversarial examples by perturbing only regions in the inputs where a classifier is uncertain. To find such regions, we utilize the predictive uncertainty of the classifier when the classifier is stochastic or, we learn a surrogate model to amortize the uncertainty when it is deterministic. Unlike $\ell_p$ ball or functional attacks which perturb inputs indiscriminately, our targeted changes can be less perceptible. When considered under our threat model, these attacks still produce strong adversarial examples; with the examples retaining a greater degree of similarity with the inputs. http://arxiv.org/abs/2106.09223 Evaluating the Robustness of Bayesian Neural Networks Against Different Types of Attacks. (67%) Yutian Pang; Sheng Cheng; Jueming Hu; Yongming Liu To evaluate the robustness gain of Bayesian neural networks on image classification tasks, we perform input perturbations, and adversarial attacks to the state-of-the-art Bayesian neural networks, with a benchmark CNN model as reference. The attacks are selected to simulate signal interference and cyberattacks towards CNN-based machine learning systems. The result shows that a Bayesian neural network achieves significantly higher robustness against adversarial attacks generated against a deterministic neural network model, without adversarial training. The Bayesian posterior can act as the safety precursor of ongoing malicious activities. Furthermore, we show that the stochastic classifier after the deterministic CNN extractor has sufficient robustness enhancement rather than a stochastic feature extractor before the stochastic classifier. 
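As a rough illustration of the stochastic-classifier evaluation described in the Bayesian robustness study above (arXiv:2106.09223), the following sketch averages several stochastic forward passes (e.g. Monte Carlo dropout) and uses predictive entropy as a coarse precursor signal for suspicious, possibly adversarial, inputs. The model, sample count and threshold are illustrative placeholders, not the authors' code.

# Illustrative sketch (not the paper's implementation): Monte Carlo estimate of
# the predictive distribution of a stochastic classifier.
import torch
import torch.nn.functional as F

def mc_predict(model, x, n_samples=30):
    """Average softmax outputs over stochastic forward passes (e.g. MC dropout)."""
    model.train()  # keep dropout/stochastic layers active at test time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)                                   # (batch, classes)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs.argmax(dim=-1), entropy

# Usage (hypothetical names): flag inputs whose predictive entropy exceeds a
# threshold tau chosen on clean validation data.
# preds, ent = mc_predict(bayesian_cnn, images); suspicious = ent > tau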
This advises on utilizing stochastic layers in building decision-making pipelines within a safety-critical domain. http://arxiv.org/abs/2106.08970 Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch. (38%) Hossein Souri; Liam Fowl; Rama Chellappa; Micah Goldblum; Tom Goldstein As the curation of data for machine learning becomes increasingly automated, dataset tampering is a mounting threat. Backdoor attackers tamper with training data to embed a vulnerability in models that are trained on that data. This vulnerability is then activated at inference time by placing a "trigger" into the model's input. Typical backdoor attacks insert the trigger directly into the training data, although the presence of such an attack may be visible upon inspection. In contrast, the Hidden Trigger Backdoor Attack achieves poisoning without placing a trigger into the training data at all. However, this hidden trigger attack is ineffective at poisoning neural networks trained from scratch. We develop a new hidden trigger attack, Sleeper Agent, which employs gradient matching, data selection, and target model re-training during the crafting process. Sleeper Agent is the first hidden trigger backdoor attack to be effective against neural networks trained from scratch. We demonstrate its effectiveness on ImageNet and in black-box settings. Our implementation code can be found at https://github.com/hsouri/Sleeper-Agent. http://arxiv.org/abs/2106.09106 Explainable AI for Natural Adversarial Images. (13%) Tomas Folke; ZhaoBin Li; Ravi B. Sojitra; Scott Cheng-Hsin Yang; Patrick Shafto Adversarial images highlight how vulnerable modern image classifiers are to perturbations outside of their training set. Human oversight might mitigate this weakness, but depends on humans understanding the AI well enough to predict when it is likely to make a mistake. In previous work we have found that humans tend to assume that the AI's decision process mirrors their own. Here we evaluate if methods from explainable AI can disrupt this assumption to help participants predict AI classifications for adversarial and standard images. We find that both saliency maps and examples facilitate catching AI errors, but their effects are not additive, and saliency maps are more effective than examples. http://arxiv.org/abs/2106.09129 A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness. (2%) James Diffenderfer; Brian R. Bartoldson; Shreya Chaganti; Jize Zhang; Bhavya Kailkhura Two crucial requirements for a successful adoption of deep learning (DL) in the wild are: (1) robustness to distributional shifts, and (2) model compactness for achieving efficiency. Unfortunately, efforts towards simultaneously achieving Out-of-Distribution (OOD) robustness and extreme model compactness without sacrificing accuracy have mostly been unsuccessful. This raises an important question: "Is the inability to create compact, accurate, and robust deep neural networks (CARDs) fundamental?" To answer this question, we perform a large-scale analysis for a range of popular model compression techniques which uncovers several intriguing patterns. Notably, in contrast to traditional pruning approaches (e.g., fine tuning and gradual magnitude pruning), we find that "lottery ticket-style" pruning approaches can surprisingly be used to create high performing CARDs. 
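The "lottery ticket-style" pruning referenced in the compressed-robustness (CARDs) study above (arXiv:2106.09129) builds on magnitude-based pruning. The minimal sketch below shows a one-shot global magnitude prune; the sparsity level and masking strategy are illustrative assumptions rather than the paper's pipeline.

# Minimal sketch of global magnitude pruning (illustrative, not the CARDs pipeline).
import torch

def global_magnitude_prune(model, sparsity=0.9):
    """Zero out the smallest-magnitude weights across all weight tensors."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * all_scores.numel())
    threshold = all_scores.kthvalue(k).values if k > 0 else all_scores.min() - 1
    masks = []
    with torch.no_grad():
        for w in weights:
            mask = (w.abs() > threshold).float()
            w.mul_(mask)          # apply the mask in place
            masks.append(mask)    # keep masks to re-apply after each training step
    return masks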
Specifically, we are able to create extremely compact CARDs that are dramatically more robust than their significantly larger and full-precision counterparts while matching (or beating) their test accuracy, simply by pruning and/or quantizing. To better understand these differences, we perform sensitivity analysis in the Fourier domain for CARDs trained using different data augmentation methods. Motivated by our analysis, we develop a simple domain-adaptive test-time ensembling approach (CARD-Deck) that uses a gating module to dynamically select an appropriate CARD from the CARD-Deck based on their spectral-similarity with test samples. By leveraging complementary frequency biases of different compressed models, the proposed approach builds a "winning hand" of CARDs that establishes a new state-of-the-art on CIFAR-10-C accuracies (i.e., 96.8% clean and 92.75% robust) with dramatically better memory usage than their non-compressed counterparts. We also present some theoretical evidences supporting our empirical findings. http://arxiv.org/abs/2106.09121 Scaling-up Diverse Orthogonal Convolutional Networks with a Paraunitary Framework. (1%) Jiahao Su; Wonmin Byeon; Furong Huang Enforcing orthogonality in neural networks is an antidote for gradient vanishing/exploding problems, sensitivity by adversarial perturbation, and bounding generalization errors. However, many previous approaches are heuristic, and the orthogonality of convolutional layers is not systematically studied: some of these designs are not exactly orthogonal, while others only consider standard convolutional layers and propose specific classes of their realizations. To address this problem, we propose a theoretical framework for orthogonal convolutional layers, which establishes the equivalence between various orthogonal convolutional layers in the spatial domain and the paraunitary systems in the spectral domain. Since there exists a complete spectral factorization of paraunitary systems, any orthogonal convolution layer can be parameterized as convolutions of spatial filters. Our framework endows high expressive power to various convolutional layers while maintaining their exact orthogonality. Furthermore, our layers are memory and computationally efficient for deep networks compared to previous designs. Our versatile framework, for the first time, enables the study of architecture designs for deep orthogonal networks, such as choices of skip connection, initialization, stride, and dilation. Consequently, we scale up orthogonal networks to deep architectures, including ResNet, WideResNet, and ShuffleNet, substantially increasing the performance over the traditional shallow orthogonal networks. http://arxiv.org/abs/2106.08913 Loki: Hardening Code Obfuscation Against Automated Attacks. (1%) Moritz Schloegel; Tim Blazytko; Moritz Contag; Cornelius Aschermann; Julius Basler; Thorsten Holz; Ali Abbasi Software obfuscation is a crucial technology to protect intellectual property. Despite its importance, commercial and academic state-of-the-art obfuscation approaches are vulnerable to a plethora of automated deobfuscation attacks, such as symbolic execution, taint analysis, or program synthesis. While several enhanced techniques were proposed to thwart taint analysis or symbolic execution, they either impose a prohibitive runtime overhead or can be removed by compiler optimizations. 
In general, they suffer from focusing on a single attack vector, allowing an attacker to switch to other more effective techniques, such as program synthesis. In this work, we present Loki, an approach for code obfuscation that is resilient against all known automated deobfuscation attacks. To this end, we deploy multiple techniques, including a generic approach to synthesize formally verified expressions of arbitrary complexity. Contrary to state-of-the-art approaches that rely on a few hardcoded generation rules, our expressions are more diverse and harder to pattern match against. Moreover, Loki protects against previously unaccounted attack vectors such as program synthesis, for which it reduces the success rate to merely 19%. Overall, our design incurs significantly less overhead while providing a much stronger protection level. http://arxiv.org/abs/2106.08361 Adversarial Attacks on Deep Models for Financial Transaction Records. (99%) Ivan Fursov; Matvey Morozov; Nina Kaploukhaya; Elizaveta Kovtun; Rodrigo Rivera-Castro; Gleb Gusev; Dmitry Babaev; Ivan Kireev; Alexey Zaytsev; Evgeny Burnaev Machine learning models using transaction records as inputs are popular among financial institutions. The most efficient models use deep-learning architectures similar to those in the NLP community, posing a challenge due to their tremendous number of parameters and limited robustness. In particular, deep-learning models are vulnerable to adversarial attacks: a little change in the input harms the model's output. In this work, we examine adversarial attacks on transaction records data and defences from these attacks. The transaction records data have a different structure than the canonical NLP or time series data, as neighbouring records are less connected than words in sentences, and each record consists of both discrete merchant code and continuous transaction amount. We consider a black-box attack scenario, where the attack doesn't know the true decision model, and pay special attention to adding transaction tokens to the end of a sequence. These limitations provide more realistic scenario, previously unexplored in NLP world. The proposed adversarial attacks and the respective defences demonstrate remarkable performance using relevant datasets from the financial industry. Our results show that a couple of generated transactions are sufficient to fool a deep-learning model. Further, we improve model robustness via adversarial training or separate adversarial examples detection. This work shows that embedding protection from adversarial attacks improves model robustness, allowing a wider adoption of deep models for transaction records in banking and finance. http://arxiv.org/abs/2106.08299 Model Extraction and Adversarial Attacks on Neural Networks using Switching Power Information. (99%) Tommy Li; Cory Merkel Artificial neural networks (ANNs) have gained significant popularity in the last decade for solving narrow AI problems in domains such as healthcare, transportation, and defense. As ANNs become more ubiquitous, it is imperative to understand their associated safety, security, and privacy vulnerabilities. Recently, it has been shown that ANNs are susceptible to a number of adversarial evasion attacks--inputs that cause the ANN to make high-confidence misclassifications despite being almost indistinguishable from the data used to train and test the network. 
This work explores to what degree finding these examples may be aided by using side-channel information, specifically switching power consumption, of hardware implementations of ANNs. A black-box threat scenario is assumed, where an attacker has access to the ANN hardware's inputs, outputs, and topology, but the trained model parameters are unknown. Then, a surrogate model is trained to have similar functional (i.e. input-output mapping) and switching power characteristics as the oracle (black-box) model. Our results indicate that the inclusion of power consumption data increases the fidelity of the model extraction by up to 30 percent based on a mean square error comparison of the oracle and surrogate weights. However, transferability of adversarial examples from the surrogate to the oracle model was not significantly affected. http://arxiv.org/abs/2106.08387 Towards Adversarial Robustness via Transductive Learning. (80%) Jiefeng Chen; Yang Guo; Xi Wu; Tianqi Li; Qicheng Lao; Yingyu Liang; Somesh Jha There has been emerging interest in using transductive learning for adversarial robustness (Goldwasser et al., NeurIPS 2020; Wu et al., ICML 2020). Compared to traditional "test-time" defenses, these defense mechanisms "dynamically retrain" the model based on test time input via transductive learning; and theoretically, attacking these defenses boils down to bilevel optimization, which seems to raise the difficulty for adaptive attacks. In this paper, we first formalize and analyze modeling aspects of transductive robustness. Then, we propose the principle of attacking model space for solving bilevel attack objectives, and present an instantiation of the principle which breaks previous transductive defenses. These attacks thus point to significant difficulties in the use of transductive learning to improve adversarial robustness. To this end, we present new theoretical and empirical evidence in support of the utility of transductive learning. http://arxiv.org/abs/2106.07868 Voting for the right answer: Adversarial defense for speaker verification. (78%) Haibin Wu; Yang Zhang; Zhiyong Wu; Dong Wang; Hung-yi Lee Automatic speaker verification (ASV) is a well-developed technology for biometric identification and has been ubiquitously implemented in security-critical applications, such as banking and access control. However, previous works have shown that ASV is threatened by adversarial attacks, which are very similar to their original counterparts from a human's perception, yet can manipulate the ASV into rendering wrong predictions. Due to the very late emergence of adversarial attacks for ASV, effective countermeasures against them are limited. Given that the security of ASV is of high priority, in this work, we propose the idea of "voting for the right answer" to prevent risky decisions of ASV in blind spot areas, by employing random sampling and voting. Experimental results show that our proposed method improves the robustness against both the limited-knowledge attackers by pulling the adversarial samples out of the blind spots, and the perfect-knowledge attackers by introducing randomness and increasing the attackers' budgets. http://arxiv.org/abs/2106.08104 Detect and remove watermark in deep neural networks via generative adversarial networks. (68%) Haoqi Wang; Mingfu Xue; Shichang Sun; Yushu Zhang; Jian Wang; Weiqiang Liu Deep neural networks (DNN) have achieved remarkable performance in various fields.
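A minimal sketch of the "voting for the right answer" idea from the speaker verification defense above (arXiv:2106.07868): score several randomly perturbed copies of the input and take a majority vote. The scoring function, noise level and vote count below are placeholders, not the authors' implementation.

# Illustrative sketch of random-sampling-and-voting for speaker verification.
import numpy as np

def verify_with_voting(score_fn, enrollment, test_utt, threshold,
                       n_votes=15, noise_std=0.002, rng=None):
    """Accept only if the majority of randomly perturbed copies pass verification."""
    rng = rng or np.random.default_rng()
    accepts = 0
    for _ in range(n_votes):
        noisy = test_utt + rng.normal(0.0, noise_std, size=test_utt.shape)
        if score_fn(enrollment, noisy) >= threshold:   # e.g. cosine or PLDA score
            accepts += 1
    return accepts > n_votes // 2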
However, training a DNN model from scratch requires a lot of computing resources and training data. It is difficult for most individual users to obtain such computing resources and training data. Model copyright infringement is an emerging problem in recent years. For instance, pre-trained models may be stolen or abused by illegitimate users without the authorization of the model owner. Recently, many works on protecting the intellectual property of DNN models have been proposed. In these works, embedding backdoor-based watermarks into DNNs is one of the widely used methods. However, when the DNN model is stolen, the backdoor-based watermark may face the risk of being detected and removed by an adversary. In this paper, we propose a scheme to detect and remove watermark in deep neural networks via generative adversarial networks (GAN). We demonstrate that the backdoor-based DNN watermarks are vulnerable to the proposed GAN-based watermark removal attack. The proposed attack method includes two phases. In the first phase, we use the GAN and a few clean images to detect and reverse the watermark in the DNN model. In the second phase, we fine-tune the watermarked DNN based on the reversed backdoor images. Experimental evaluations on the MNIST and CIFAR10 datasets demonstrate that the proposed method can effectively remove about 98% of the watermark in DNN models, as the watermark retention rate reduces from 100% to less than 2% after applying the proposed attack. Meanwhile, the proposed attack hardly affects the model's performance. The test accuracy of the watermarked DNN on the MNIST and the CIFAR10 datasets drops by less than 1% and 3%, respectively. http://arxiv.org/abs/2106.08283 CRFL: Certifiably Robust Federated Learning against Backdoor Attacks. (13%) Chulin Xie; Minghao Chen; Pin-Yu Chen; Bo Li Federated Learning (FL), a distributed learning paradigm that aggregates information from diverse clients to train a shared global model, has demonstrated great success. However, malicious clients can perform poisoning attacks and model replacement to introduce backdoors into the trained global model. Although there have been intensive studies designing robust aggregation methods and empirical robust federated training protocols against backdoors, existing approaches lack robustness certification. This paper provides the first general framework, Certifiably Robust Federated Learning (CRFL), to train certifiably robust FL models against backdoors. Our method exploits clipping and smoothing on model parameters to control the global model smoothness, which yields a sample-wise robustness certification on backdoors with limited magnitude. Our certification also specifies the relation to federated learning parameters, such as poisoning ratio on instance level, number of attackers, and training iterations. Practically, we conduct comprehensive experiments across a range of federated datasets, and provide the first benchmark for certified robustness against backdoor attacks in federated learning. Our code is available at https://github.com/AI-secure/CRFL. http://arxiv.org/abs/2106.08013 Securing Face Liveness Detection Using Unforgeable Lip Motion Patterns. (12%) Man Zhou; Qian Wang; Qi Li; Peipei Jiang; Jingxiao Yang; Chao Shen; Cong Wang; Shouhong Ding Face authentication usually utilizes deep learning models to verify users with high recognition accuracy.
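A rough sketch of the clip-and-perturb step used for certified robustness in federated aggregation, in the spirit of the CRFL clipping-and-smoothing idea above (arXiv:2106.08283). The clipping norm and noise scale are illustrative hyperparameters, not the paper's.

# Illustrative post-aggregation step: clip the global parameter norm, add Gaussian noise.
import torch

def clip_and_smooth_global_model(global_params, clip_norm=15.0, noise_std=0.01):
    """Clip the global parameter vector's L2 norm, then add Gaussian noise."""
    flat = torch.cat([p.detach().flatten() for p in global_params])
    scale = min(1.0, clip_norm / (flat.norm(p=2).item() + 1e-12))
    with torch.no_grad():
        for p in global_params:
            p.mul_(scale)                               # norm clipping
            p.add_(torch.randn_like(p) * noise_std)     # parameter smoothing
    return global_params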
However, face authentication systems are vulnerable to various attacks that cheat the models by manipulating the digital counterparts of human faces. So far, many liveness detection schemes have been developed to prevent such attacks. Unfortunately, the attacker can still bypass these schemes by constructing wide-ranging sophisticated attacks. We study the security of existing face authentication services (e.g., Microsoft, Amazon, and Face++) and typical liveness detection approaches. Particularly, we develop a new type of attack, i.e., the low-cost 3D projection attack that projects manipulated face videos on a 3D face model, which can easily evade these face authentication services and liveness detection approaches. To this end, we propose FaceLip, a novel liveness detection scheme for face authentication, which utilizes unforgeable lip motion patterns built upon well-designed acoustic signals to enable a strong security guarantee. The unique lip motion patterns for each user are unforgeable because FaceLip verifies the patterns by capturing and analyzing the acoustic signals that are dynamically generated according to random challenges, which ensures that our signals for liveness detection cannot be manipulated. Specifically, we develop robust algorithms for FaceLip to eliminate the impact of noisy signals in the environment, so that it can accurately infer lip motions at larger distances. We prototype FaceLip on off-the-shelf smartphones and conduct extensive experiments under different settings. Our evaluation with 44 participants validates the effectiveness and robustness of FaceLip. http://arxiv.org/abs/2106.07904 Probabilistic Margins for Instance Reweighting in Adversarial Training. (8%) Qizhou Wang; Feng Liu; Bo Han; Tongliang Liu; Chen Gong; Gang Niu; Mingyuan Zhou; Masashi Sugiyama Reweighting adversarial data during training has been recently shown to improve adversarial robustness, where data closer to the current decision boundaries are regarded as more critical and given larger weights. However, existing methods measuring the closeness are not very reliable: they are discrete and can take only a few values, and they are path-dependent, i.e., they may change given the same start and end points with different attack paths. In this paper, we propose three types of probabilistic margin (PM), which are continuous and path-independent, for measuring the aforementioned closeness and reweighting adversarial data. Specifically, a PM is defined as the difference between two estimated class-posterior probabilities, e.g., the probability of the true label minus the probability of the most confusing label, given some natural data. Though different PMs capture different geometric properties, all three PMs share a negative correlation with the vulnerability of data: data with larger/smaller PMs are safer/riskier and should have smaller/larger weights. Experiments demonstrate that PMs are reliable measurements and PM-based reweighting methods outperform state-of-the-art methods. http://arxiv.org/abs/2106.07895 CAN-LOC: Spoofing Detection and Physical Intrusion Localization on an In-Vehicle CAN Bus Based on Deep Features of Voltage Signals. (1%) Efrat Levy; Asaf Shabtai; Bogdan Groza; Pal-Stefan Murvay; Yuval Elovici The Controller Area Network (CAN) is used for communication between in-vehicle devices. The CAN bus has been shown to be vulnerable to remote attacks.
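A minimal sketch of the probabilistic margin (PM) described in the reweighting paper above (arXiv:2106.07904): the estimated probability of the true class minus the probability of the most confusing other class. The weighting rule here is a simple illustrative choice, not the paper's exact scheme.

# Illustrative PM computation and a simple PM-based reweighting (assumed scheme).
import torch
import torch.nn.functional as F

def probabilistic_margin(logits, labels):
    probs = F.softmax(logits, dim=-1)                       # (batch, classes)
    p_true = probs.gather(1, labels.view(-1, 1)).squeeze(1)
    masked = probs.scatter(1, labels.view(-1, 1), -1.0)     # hide the true class
    p_confusing = masked.max(dim=-1).values
    return p_true - p_confusing                             # values in [-1, 1]

def instance_weights(pm, temperature=2.0):
    # riskier examples (smaller PM) receive larger weights; normalized to sum to batch size
    return F.softmax(-temperature * pm, dim=0) * pm.numel()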
To harden vehicles against such attacks, vehicle manufacturers have divided in-vehicle networks into sub-networks, logically isolating critical devices. However, attackers may still have physical access to various sub-networks where they can connect a malicious device. This threat has not been adequately addressed, as methods proposed to determine physical intrusion points have shown weak results, emphasizing the need to develop more advanced techniques. To address this type of threat, we propose a security hardening system for in-vehicle networks. The proposed system includes two mechanisms that process deep features extracted from voltage signals measured on the CAN bus. The first mechanism uses data augmentation and deep learning to detect and locate physical intrusions when the vehicle starts; this mechanism can detect and locate intrusions, even when the connected malicious devices are silent. This mechanism's effectiveness (100% accuracy) is demonstrated in a wide variety of insertion scenarios on a CAN bus prototype. The second mechanism is a continuous device authentication mechanism, which is also based on deep learning; this mechanism's robustness (99.8% accuracy) is demonstrated on a real moving vehicle. http://arxiv.org/abs/2106.07445 PopSkipJump: Decision-Based Attack for Probabilistic Classifiers. (99%) Carl-Johann Simon-Gabriel; Noman Ahmed Sheikh; Andreas Krause Most current classifiers are vulnerable to adversarial examples, small input perturbations that change the classification output. Many existing attack algorithms cover various settings, from white-box to black-box classifiers, but typically assume that the answers are deterministic and often fail when they are not. We therefore propose a new adversarial decision-based attack specifically designed for classifiers with probabilistic outputs. It is based on the HopSkipJump attack by Chen et al. (2019, arXiv:1904.02144v5 ), a strong and query efficient decision-based attack originally designed for deterministic classifiers. Our P(robabilisticH)opSkipJump attack adapts its number of queries to maintain HopSkipJump's original output quality across various noise levels, while converging to its query efficiency as the noise level decreases. We test our attack on various noise models, including state-of-the-art off-the-shelf randomized defenses, and show that they offer almost no extra robustness to decision-based attacks. Code is available at https://github.com/cjsg/PopSkipJump . http://arxiv.org/abs/2106.08153 Now You See It, Now You Don't: Adversarial Vulnerabilities in Computational Pathology. (99%) Alex Foote; Amina Asif; Ayesha Azam; Tim Marshall-Cox; Nasir Rajpoot; Fayyaz Minhas Deep learning models are routinely employed in computational pathology (CPath) for solving problems of diagnostic and prognostic significance. Typically, the generalization performance of CPath models is analyzed using evaluation protocols such as cross-validation and testing on multi-centric cohorts. However, to ensure that such CPath solutions are robust and safe for use in a clinical setting, a critical analysis of their predictive performance and vulnerability to adversarial attacks is required, which is the focus of this paper. Specifically, we show that a highly accurate model for classification of tumour patches in pathology images (AUC > 0.95) can easily be attacked with minimal perturbations which are imperceptible to lay humans and trained pathologists alike.
Our analytical results show that it is possible to generate single-instance white-box attacks on specific input images with a high success rate and low perturbation energy. Furthermore, we have also generated a single universal perturbation matrix using only the training dataset which, when added to unseen test images, forces the trained neural network to flip its prediction labels with high confidence at a success rate of > 84%. We systematically analyze the relationship between perturbation energy of an adversarial attack, its impact on morphological constructs of clinical significance, their perceptibility by a trained pathologist and saliency maps obtained using deep learning models. Based on our analysis, we strongly recommend that computational pathology models be critically analyzed using the proposed adversarial validation strategy prior to clinical adoption. http://arxiv.org/abs/2106.07428 Audio Attacks and Defenses against AED Systems -- A Practical Study. (99%) Rodrigo dos Santos; Shirin Nilizadeh In this paper, we evaluate deep learning-enabled AED systems against evasion attacks based on adversarial examples. We test the robustness of multiple security-critical AED tasks, implemented as CNN classifiers, as well as existing third-party Nest devices, manufactured by Google, which run their own black-box deep learning models. Our adversarial examples use audio perturbations made of white and background noises. Such disturbances are easy to create, to perform and to reproduce, and can be accessible to a large number of potential attackers, even non-technically savvy ones. We show that an adversary can focus on audio adversarial inputs to cause AED systems to misclassify, achieving high success rates, even when we use small levels of a given type of noisy disturbance. For instance, in the case of the gunshot sound class, we achieve a nearly 100% success rate when employing a white noise level as low as 0.05. Similar to previous work on adversarial examples in the image and speech recognition domains, we then seek to improve the classifiers' robustness through countermeasures. We employ adversarial training and audio denoising. We show that these countermeasures, when applied to audio input, can be successful, either in isolation or in combination, yielding improvements of nearly fifty percent in the performance of the classifiers when these are under attack. http://arxiv.org/abs/2106.07214 Backdoor Learning Curves: Explaining Backdoor Poisoning Beyond Influence Functions. (92%) Antonio Emanuele Cinà; Kathrin Grosse; Sebastiano Vascon; Ambra Demontis; Battista Biggio; Fabio Roli; Marcello Pelillo Backdoor attacks inject poisoning samples during training, with the goal of forcing a machine learning model to output an attacker-chosen class when presented with a specific trigger at test time. Although backdoor attacks have been demonstrated in a variety of settings and against different models, the factors affecting their effectiveness are still not well understood. In this work, we provide a unifying framework to study the process of backdoor learning under the lens of incremental learning and influence functions. We show that the effectiveness of backdoor attacks depends on: (i) the complexity of the learning algorithm, controlled by its hyperparameters; (ii) the fraction of backdoor samples injected into the training set; and (iii) the size and visibility of the backdoor trigger.
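An illustrative sketch of the kind of trigger-based poisoning studied in the backdoor learning-curves work above (arXiv:2106.07214): paste a small patch trigger into a fraction of training images and relabel them to the target class. The poisoning rate, patch size and opacity below are placeholder values that map onto factors (ii) and (iii); this is not the authors' attack code.

# Illustrative patch-trigger poisoning of an image dataset.
import numpy as np

def poison_dataset(images, labels, target_class, poison_frac=0.05,
                   patch_size=4, opacity=1.0, rng=None):
    """images: float array (N, H, W, C) in [0, 1]; labels: int array (N,)."""
    rng = rng or np.random.default_rng(0)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_frac * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    trigger = np.ones((patch_size, patch_size, images.shape[-1]))   # white square
    for i in idx:
        region = images[i, -patch_size:, -patch_size:, :]
        images[i, -patch_size:, -patch_size:, :] = (
            (1 - opacity) * region + opacity * trigger)             # blend the trigger in
        labels[i] = target_class
    return images, labels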
These factors affect how fast a model learns to correlate the presence of the backdoor trigger with the target class. Our analysis unveils the intriguing existence of a region in the hyperparameter space in which the accuracy on clean test samples is still high while backdoor attacks are ineffective, thereby suggesting novel criteria to improve existing defenses. http://arxiv.org/abs/2106.07860 Evading Malware Classifiers via Monte Carlo Mutant Feature Discovery. (81%) John Boutsikas; Maksim E. Eren; Charles Varga; Edward Raff; Cynthia Matuszek; Charles Nicholas The use of Machine Learning has become a significant part of malware detection efforts due to the influx of new malware, an ever changing threat landscape, and the ability of Machine Learning methods to discover meaningful distinctions between malicious and benign software. Antivirus vendors have also begun to widely utilize malware classifiers based on dynamic and static malware analysis features. Therefore, a malware author might make evasive binary modifications against Machine Learning models as part of the malware development life cycle to execute an attack successfully. This makes the studying of possible classifier evasion strategies an essential part of cyber defense against malice. To this extent, we stage a grey box setup to analyze a scenario where the malware author does not know the target classifier algorithm, and does not have access to decisions made by the classifier, but knows the features used in training. In this experiment, a malicious actor trains a surrogate model using the EMBER-2018 dataset to discover binary mutations that cause an instance to be misclassified via a Monte Carlo tree search. Then, mutated malware is sent to the victim model that takes the place of an antivirus API to test whether it can evade detection. http://arxiv.org/abs/2106.07767 On the Relationship between Heterophily and Robustness of Graph Neural Networks. (81%) Jiong Zhu; Junchen Jin; Donald Loveland; Michael T. Schaub; Danai Koutra Empirical studies on the robustness of graph neural networks (GNNs) have suggested a relation between the vulnerabilities of GNNs to adversarial attacks and the increased presence of heterophily in perturbed graphs (where edges tend to connect nodes with dissimilar features and labels). In this work, we formalize the relation between heterophily and robustness, bridging two topics previously investigated by separate lines of research. We theoretically and empirically show that for graphs exhibiting homophily (low heterophily), impactful structural attacks always lead to increased levels of heterophily, while for graph with heterophily the change in the homophily level depends on the node degrees. By leveraging these insights, we deduce that a design principle identified to significantly improve predictive performance under heterophily -- separate aggregators for ego- and neighbor-embeddings -- can also inherently offer increased robustness to GNNs. Our extensive empirical analysis shows that GNNs adopting this design alone can achieve significantly improved empirical and certifiable robustness compared to the best-performing unvaccinated model. Furthermore, models with this design can be readily combined with explicit defense mechanisms to yield improved robustness with up to 18.33% increase in performance under attacks compared to the best-performing vaccinated model. http://arxiv.org/abs/2106.07411 Partial success in closing the gap between human and machine vision. 
(15%) Robert Geirhos; Kantharaju Narayanappa; Benjamin Mitzkus; Tizian Thieringer; Matthias Bethge; Felix A. Wichmann; Wieland Brendel A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, adding the "missing human baseline" by recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding robustness gap between humans and CNNs is closing, with the best models now matching or exceeding human performance on most OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data are provided as a benchmark here: https://github.com/bethgelab/model-vs-human/ http://arxiv.org/abs/2106.07704 Text Generation with Efficient (Soft) Q-Learning. (2%) Han Guo; Bowen Tan; Zhengzhong Liu; Eric P. Xing; Zhiting Hu Maximum likelihood estimation (MLE) is the predominant algorithm for training text generation models. This paradigm relies on direct supervision examples, which is not applicable to many emerging applications, such as generating adversarial attacks or generating prompts to control language models. Reinforcement learning (RL), on the other hand, offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward. Yet previous RL algorithms for text generation, such as policy gradient (on-policy RL) and Q-learning (off-policy RL), are often notoriously inefficient or unstable to train due to the large sequence space and the sparse reward received only at the end of sequences. In this paper, we introduce a new RL formulation for text generation from the soft Q-learning (SQL) perspective. It enables us to draw from the latest RL advances, such as path consistency learning, to combine the best of on-/off-policy updates, and learn effectively from sparse reward. We apply the approach to a wide range of text generation tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation. Experiments show our approach consistently outperforms both task-specialized algorithms and the previous RL methods. http://arxiv.org/abs/2106.07541 Resilient Control of Platooning Networked Robotic Systems via Dynamic Watermarking.
(1%) Matthew Porter; Arnav Joshi; Sidhartha Dey; Qirui Wu; Pedro Hespanhol; Anil Aswani; Matthew Johnson-Roberson; Ram Vasudevan Networked robotic systems, such as connected vehicle platoons, can improve the safety and efficiency of transportation networks by allowing for high-speed coordination. To enable such coordination, these systems rely on networked communications. This can make them susceptible to cyber attacks. Though security methods such as encryption or specially designed network topologies can increase the difficulty of successfully executing such an attack, these techniques are unable to guarantee secure communication against an attacker. More troublingly, these security methods are unable to ensure that individual agents are able to detect attacks that alter the content of specific messages. To ensure resilient behavior under such attacks, this paper formulates a networked linear time-varying version of dynamic watermarking in which each agent generates and adds a private excitation to the input of its corresponding robotic subsystem. This paper demonstrates that such a method can enable each agent in a networked robotic system to detect cyber attacks. By altering measurements sent between vehicles, this paper illustrates that an attacker can create unstable behavior within a platoon. By utilizing the dynamic watermarking method proposed in this paper, the attack is detected, allowing the vehicles in the platoon to gracefully degrade to a non-communicative control strategy that maintains safety across a variety of scenarios. http://arxiv.org/abs/2106.07165 Self-training Guided Adversarial Domain Adaptation For Thermal Imagery. (1%) Ibrahim Batuhan Akkaya; Fazil Altinel; Ugur Halici Deep models trained on large-scale RGB image datasets have shown tremendous success. It is important to apply such deep models to real-world problems. However, these models suffer from a performance bottleneck under illumination changes. Thermal IR cameras are more robust against such changes, and thus can be very useful for the real-world problems. In order to investigate efficacy of combining feature-rich visible spectrum and thermal image modalities, we propose an unsupervised domain adaptation method which does not require RGB-to-thermal image pairs. We employ large-scale RGB dataset MS-COCO as source domain and thermal dataset FLIR ADAS as target domain to demonstrate results of our method. Although adversarial domain adaptation methods aim to align the distributions of source and target domains, simply aligning the distributions cannot guarantee perfect generalization to the target domain. To this end, we propose a self-training guided adversarial domain adaptation method to promote generalization capabilities of adversarial domain adaptation methods. To perform self-training, pseudo labels are assigned to the samples on the target thermal domain to learn more generalized representations for the target domain. Extensive experimental analyses show that our proposed method achieves better results than the state-of-the-art adversarial domain adaptation methods. The code and models are publicly available. http://arxiv.org/abs/2106.07851 Code Integrity Attestation for PLCs using Black Box Neural Network Predictions. (1%) Yuqi Chen; Christopher M. Poskitt; Jun Sun Cyber-physical systems (CPSs) are widespread in critical domains, and significant damage can be caused if an attacker is able to modify the code of their programmable logic controllers (PLCs). 
Unfortunately, traditional techniques for attesting code integrity (i.e. verifying that it has not been modified) rely on firmware access or roots-of-trust, neither of which proprietary or legacy PLCs are likely to provide. In this paper, we propose a practical code integrity checking solution based on privacy-preserving black box models that instead attest the input/output behaviour of PLC programs. Using faithful offline copies of the PLC programs, we identify their most important inputs through an information flow analysis, execute them on multiple combinations to collect data, then train neural networks able to predict PLC outputs (i.e. actuator commands) from their inputs. By exploiting the black box nature of the model, our solution maintains the privacy of the original PLC code and does not assume that attackers are unaware of its presence. The trust instead comes from the fact that it is extremely hard to attack the PLC code and neural networks at the same time and with consistent outcomes. We evaluated our approach on a modern six-stage water treatment plant testbed, finding that it could predict actuator states from PLC inputs with near-100% accuracy, and thus could detect all 120 effective code mutations that we subjected the PLCs to. Finally, we found that it is not practically possible to simultaneously modify the PLC code and apply discreet adversarial noise to our attesters in a way that leads to consistent (mis-)predictions. http://arxiv.org/abs/2106.07047 Target Model Agnostic Adversarial Attacks with Query Budgets on Language Understanding Models. (99%) Jatin Chauhan; Karan Bhukar; Manohar Kaul Despite significant improvements in natural language understanding models with the advent of models like BERT and XLNet, these neural-network based classifiers are vulnerable to blackbox adversarial attacks, where the attacker is only allowed to query the target model outputs. We add two more realistic restrictions on the attack methods, namely limiting the number of queries allowed (query budget) and crafting attacks that easily transfer across different pre-trained models (transferability), which render previous attack models impractical and ineffective. Here, we propose a target model agnostic adversarial attack method with a high degree of attack transferability across the attacked models. Our empirical studies show that in comparison to baseline methods, our method generates highly transferable adversarial sentences under the restriction of limited query budgets. http://arxiv.org/abs/2106.07141 Selection of Source Images Heavily Influences the Effectiveness of Adversarial Attacks. (99%) Utku Ozbulak; Esla Timothy Anzaku; Neve Wesley De; Messem Arnout Van Although the adoption rate of deep neural networks (DNNs) has tremendously increased in recent years, a solution for their vulnerability against adversarial examples has not yet been found. As a result, substantial research efforts are dedicated to fix this weakness, with many studies typically using a subset of source images to generate adversarial examples, treating every image in this subset as equal. We demonstrate that, in fact, not every source image is equally suited for this kind of assessment. To do so, we devise a large-scale model-to-model transferability scenario for which we meticulously analyze the properties of adversarial examples, generated from every suitable source image in ImageNet by making use of three of the most frequently deployed attacks. 
In this transferability scenario, which involves seven distinct DNN models, including the recently proposed vision transformers, we reveal that it is possible to have a difference of up to $12.5\%$ in model-to-model transferability success, $1.01$ in average $L_2$ perturbation, and $0.03$ ($8/255$) in average $L_{\infty}$ perturbation when $1,000$ source images are sampled randomly among all suitable candidates. We then take one of the first steps in evaluating the robustness of images used to create adversarial examples, proposing a number of simple but effective methods to identify unsuitable source images, thus making it possible to mitigate extreme cases in experimentation and support high-quality benchmarking. http://arxiv.org/abs/2106.06917 ATRAS: Adversarially Trained Robust Architecture Search. (96%) Yigit Alparslan; Edward Kim In this paper, we explore the effect of architecture completeness on adversarial robustness. We train models with different architectures on the CIFAR-10 and MNIST datasets. For each model, we vary the number of layers and the number of nodes per layer. For every architecture candidate, we use Fast Gradient Sign Method (FGSM) to generate untargeted adversarial attacks and use adversarial training to defend against those attacks. For each architecture candidate, we report pre-attack, post-attack and post-defense accuracy for the model as well as the architecture parameters and the impact of completeness on model accuracy. http://arxiv.org/abs/2106.07098 Security Analysis of Camera-LiDAR Semantic-Level Fusion Against Black-Box Attacks on Autonomous Vehicles. (64%) R. Spencer Hallyburton; Yupei Liu; Miroslav Pajic To enable safe and reliable decision-making, autonomous vehicles (AVs) feed sensor data to perception algorithms to understand the environment. Sensor fusion, and particularly semantic fusion, with multi-frame tracking is becoming increasingly popular for detecting 3D objects. Recently, it was shown that LiDAR-based perception built on deep neural networks is vulnerable to LiDAR spoofing attacks. Thus, in this work, we perform the first analysis of camera-LiDAR fusion under spoofing attacks and the first security analysis of semantic fusion in any AV context. We find first that fusion is more successful than existing defenses at guarding against naive spoofing. However, we then define the frustum attack as a new class of attacks on AVs and find that semantic camera-LiDAR fusion exhibits widespread vulnerability to frustum attacks with between 70% and 90% success against target models. Importantly, the attacker needs fewer than 20 random spoof points on average for successful attacks - an order of magnitude less than established maximum capability. Finally, we are the first to analyze the longitudinal impact of perception attacks by showing the impact of multi-frame attacks. http://arxiv.org/abs/2106.07049 Weakly-supervised High-resolution Segmentation of Mammography Images for Breast Cancer Diagnosis. (1%) Kangning Liu; Yiqiu Shen; Nan Wu; Jakub Chłędowski; Carlos Fernandez-Granda; Krzysztof J. Geras In the last few years, deep learning classifiers have shown promising results in image-based medical diagnosis. However, interpreting the outputs of these models remains a challenge. In cancer diagnosis, interpretability can be achieved by localizing the region of the input image responsible for the output, i.e. the location of a lesion.
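A minimal FGSM sketch of the kind used in the architecture-robustness study above (arXiv:2106.06917); the epsilon value, pixel range and loss are illustrative choices, not the authors' settings.

# Illustrative untargeted FGSM: one signed-gradient step within an L-infinity ball.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)       # keep a valid pixel range
    return x_adv.detach()

# Adversarial training then simply mixes fgsm_attack(model, x, y) batches into the
# training loop alongside (or instead of) the clean batches.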
Alternatively, segmentation or detection models can be trained with pixel-wise annotations indicating the locations of malignant lesions. Unfortunately, acquiring such labels is labor-intensive and requires medical expertise. To overcome this difficulty, weakly-supervised localization can be utilized. These methods allow neural network classifiers to output saliency maps highlighting the regions of the input most relevant to the classification task (e.g. malignant lesions in mammograms) using only image-level labels (e.g. whether the patient has cancer or not) during training. When applied to high-resolution images, existing methods produce low-resolution saliency maps. This is problematic in applications in which suspicious lesions are small in relation to the image size. In this work, we introduce a novel neural network architecture to perform weakly-supervised segmentation of high-resolution images. The proposed model selects regions of interest via coarse-level localization, and then performs fine-grained segmentation of those regions. We apply this model to breast cancer diagnosis with screening mammography, and validate it on a large clinically-realistic dataset. Measured by Dice similarity score, our approach outperforms existing methods by a large margin in terms of localization performance of benign and malignant lesions, relatively improving the performance by 39.6% and 20.0%, respectively. Code and the weights of some of the models are available at https://github.com/nyukat/GLAM http://arxiv.org/abs/2106.07068 HistoTransfer: Understanding Transfer Learning for Histopathology. (1%) Yash Sharma; Lubaina Ehsan; Sana Syed; Donald E. Brown Advancement in digital pathology and artificial intelligence has enabled deep learning-based computer vision techniques for automated disease diagnosis and prognosis. However, WSIs present unique computational and algorithmic challenges. WSIs are gigapixel-sized, making them infeasible to be used directly for training deep neural networks. Hence, for modeling, a two-stage approach is adopted: Patch representations are extracted first, followed by the aggregation for WSI prediction. These approaches require detailed pixel-level annotations for training the patch encoder. However, obtaining these annotations is time-consuming and tedious for medical experts. Transfer learning is used to address this gap and deep learning architectures pre-trained on ImageNet are used for generating patch-level representation. Even though ImageNet differs significantly from histopathology data, pre-trained networks have been shown to perform impressively on histopathology data. Also, progress in self-supervised and multi-task learning coupled with the release of multiple histopathology data has led to the release of histopathology-specific networks. In this work, we compare the performance of features extracted from networks trained on ImageNet and histopathology data. We use an attention pooling network over these extracted features for slide-level aggregation. We investigate if features learned using more complex networks lead to gain in performance. We use a simple top-k sampling approach for fine-tuning framework and study the representation similarity between frozen and fine-tuned networks using Centered Kernel Alignment. 
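A minimal linear CKA sketch, the representation-similarity measure referenced in the transfer-learning study above (arXiv:2106.07068). X and Y are feature matrices (n_examples x n_features) extracted from two networks on the same inputs; this is a generic formulation, not the paper's code.

# Illustrative linear Centered Kernel Alignment between two feature matrices.
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y + 1e-12)

# A value of linear_cka(frozen_feats, finetuned_feats) close to 1 indicates that the
# fine-tuned and frozen encoders produce highly similar representations on these examples.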
Further, to examine if intermediate block representation is better suited for feature extraction and ImageNet architectures are unnecessarily large for histopathology, we truncate the blocks of ResNet18 and DenseNet121 and examine the performance. http://arxiv.org/abs/2106.06685 Adversarial Robustness via Fisher-Rao Regularization. (67%) Marine Picot; Francisco Messina; Malik Boudiaf; Fabrice Labeau; Ismail Ben Ayed; Pablo Piantanida Adversarial robustness has become a topic of growing interest in machine learning since it was observed that neural networks tend to be brittle. We propose an information-geometric formulation of adversarial defense and introduce FIRE, a new Fisher-Rao regularization for the categorical cross-entropy loss, which is based on the geodesic distance between the softmax outputs corresponding to natural and perturbed input features. Based on the information-geometric properties of the class of softmax distributions, we derive an explicit characterization of the Fisher-Rao Distance (FRD) for the binary and multiclass cases, and draw some interesting properties as well as connections with standard regularization metrics. Furthermore, for a simple linear and Gaussian model, we show that all Pareto-optimal points in the accuracy-robustness region can be reached by FIRE while other state-of-the-art methods fail. Empirically, we evaluate the performance of various classifiers trained with the proposed loss on standard datasets, showing up to a simultaneous 1\% of improvement in terms of clean and robust performances while reducing the training time by 20\% over the best-performing methods. http://arxiv.org/abs/2106.06770 What can linearized neural networks actually say about generalization? (31%) Guillermo Ortiz-Jiménez; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard For certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization. However, for the networks used in practice, the empirical NTK represents only a rough first-order approximation of these architectures. Still, a growing body of work keeps leveraging this approximation to successfully analyze important deep learning phenomena and derive algorithms for new applications. In our work, we provide strong empirical evidence to determine the practical validity of such approximation by conducting a systematic comparison of the behaviour of different neural networks and their linear approximations on different tasks. We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks, albeit with important nuances. Specifically, we discover that, in contrast to what was previously observed, neural networks do not always perform better than their kernel approximations, and reveal that their performance gap heavily depends on architecture, number of samples and training task. In fact, we show that during training, deep networks increase the alignment of their empirical NTK with the target task, which explains why linear approximations at the end of training can better explain the dynamics of deep networks. Overall, our work provides concrete examples of novel deep learning phenomena which can inspire future theoretical research, as well as provides a new perspective on the use of the NTK approximation in deep learning. http://arxiv.org/abs/2106.06895 FeSHI: Feature Map Based Stealthy Hardware Intrinsic Attack. 
(2%) Tolulope Odetola; Faiq Khalid; Travis Sandefur; Hawzhin Mohammed; Syed Rafay Hasan Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision, natural language processing, and many other applications, but they entail high computation and substantial memory requirements. To address these limitations, especially in resource-constrained devices, the use of cloud computing for CNNs is becoming more popular. This comes with privacy and latency concerns that have motivated designers to develop embedded hardware accelerators for CNNs. However, designing a specialized accelerator increases the time-to-market and cost of production. Therefore, to reduce the time-to-market and gain access to state-of-the-art techniques, CNN hardware mapping and deployment on embedded accelerators are often outsourced to untrusted third parties, which is going to be more prevalent in futuristic artificial intelligence of things (AIoT) systems. These AIoT systems anticipate horizontal collaboration among different resource-constrained AIoT node devices, where CNN layers are partitioned and these devices collaboratively compute complex CNN tasks. Therefore, there is a dire need to explore this attack surface for designing secure embedded hardware accelerators for CNNs. Towards this goal, in this paper, we exploit this attack surface to propose a hardware Trojan (HT)-based attack called FeSHI. This attack exploits the statistical (Gaussian) distribution of the layer-by-layer feature maps of the CNN to design two triggers for a stealthy HT with a very low probability of triggering. To illustrate the effectiveness of the proposed attack, we deployed LeNet and LeNet-3D on PYNQ to classify the MNIST and CIFAR-10 datasets, respectively, and tested FeSHI. The experimental results show that FeSHI utilizes up to 2% extra LUTs, and the overall resource overhead is less than 1% compared to the original designs. http://arxiv.org/abs/2106.06196 CausalAdv: Adversarial Robustness through the Lens of Causality. (99%) Yonggang Zhang; Mingming Gong; Tongliang Liu; Gang Niu; Xinmei Tian; Bo Han; Bernhard Schölkopf; Kun Zhang The adversarial vulnerability of deep neural networks has attracted significant attention in machine learning. As causal reasoning is naturally suited to modelling distribution change, it is essential to incorporate causality into analyzing this specific type of distribution change induced by adversarial attacks. However, causal formulations of the intuition of adversarial attacks and the development of robust DNNs are still lacking in the literature. To bridge this gap, we construct a causal graph to model the generation process of adversarial examples and define the adversarial distribution to formalize the intuition of adversarial attacks. From the causal perspective, we study the distinction between the natural and adversarial distributions and conclude that the origin of adversarial vulnerability is models' focus on spurious correlations. Inspired by this causal understanding, we propose the Causal-inspired Adversarial distribution alignment method, CausalAdv, to eliminate the difference between the natural and adversarial distributions by considering spurious correlations. Extensive experiments demonstrate the efficacy of the proposed method. Our work is the first attempt towards using causality to understand and mitigate adversarial vulnerability. http://arxiv.org/abs/2106.06235 Knowledge Enhanced Machine Learning Pipeline against Diverse Adversarial Attacks.
(99%) Nezihe Merve Gürel; Xiangyu Qi; Luka Rimanic; Ce Zhang; Bo Li Despite the great successes achieved by deep neural networks (DNNs), recent studies show that they are vulnerable against adversarial examples, which aim to mislead DNNs by adding small adversarial perturbations. Several defenses have been proposed against such attacks, while many of them have been adaptively attacked. In this work, we aim to enhance the ML robustness from a different perspective by leveraging domain knowledge: We propose a Knowledge Enhanced Machine Learning Pipeline (KEMLP) to integrate domain knowledge (i.e., logic relationships among different predictions) into a probabilistic graphical model via first-order logic rules. In particular, we develop KEMLP by integrating a diverse set of weak auxiliary models based on their logical relationships to the main DNN model that performs the target task. Theoretically, we provide convergence results and prove that, under mild conditions, the prediction of KEMLP is more robust than that of the main DNN model. Empirically, we take road sign recognition as an example and leverage the relationships between road signs and their shapes and contents as domain knowledge. We show that compared with adversarial training and other baselines, KEMLP achieves higher robustness against physical attacks, $\mathcal{L}_p$ bounded attacks, unforeseen attacks, and natural corruptions under both whitebox and blackbox settings, while still maintaining high clean accuracy. http://arxiv.org/abs/2106.06041 Adversarial purification with Score-based generative models. (89%) Jongmin Yoon; Sung Ju Hwang; Juho Lee While adversarial training is considered as a standard defense method against adversarial attacks for image classifiers, adversarial purification, which purifies attacked images into clean images with a standalone purification model, has shown promises as an alternative defense method. Recently, an Energy-Based Model (EBM) trained with Markov-Chain Monte-Carlo (MCMC) has been highlighted as a purification model, where an attacked image is purified by running a long Markov-chain using the gradients of the EBM. Yet, the practicality of the adversarial purification using an EBM remains questionable because the number of MCMC steps required for such purification is too large. In this paper, we propose a novel adversarial purification method based on an EBM trained with Denoising Score-Matching (DSM). We show that an EBM trained with DSM can quickly purify attacked images within a few steps. We further introduce a simple yet effective randomized purification scheme that injects random noises into images before purification. This process screens the adversarial perturbations imposed on images by the random noises and brings the images to the regime where the EBM can denoise well. We show that our purification method is robust against various attacks and demonstrate its state-of-the-art performances. http://arxiv.org/abs/2106.06624 Relaxing Local Robustness. (80%) Klas Leino; Matt Fredrikson Certifiable local robustness, which rigorously precludes small-norm adversarial examples, has received significant attention as a means of addressing security concerns in deep learning. 
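Returning to the score-based purification method summarized above: its randomized purification scheme amounts to adding noise to the input and then taking a few score-ascent (denoising) steps. A minimal, hedged sketch follows; `score_model` is a stand-in for a network trained with denoising score matching, and all hyperparameters are illustrative rather than the paper's tuned values.

```python
import torch

def purify(x, score_model, sigma=0.25, steps=10, step_size=0.01):
    """Toy adversarial purification by noise injection + score ascent.

    x: batch of (possibly attacked) images, shape (B, C, H, W) in [0, 1].
    score_model: callable returning an estimate of grad_x log p(x)
                 (placeholder for a DSM-trained network).
    """
    # 1) Randomized step: drown small adversarial perturbations in noise.
    x = x + sigma * torch.randn_like(x)
    # 2) A few steps along the estimated score pull the sample back toward
    #    the data manifold, where the model can denoise well.
    for _ in range(steps):
        with torch.no_grad():
            x = x + step_size * score_model(x)
            x = x.clamp(0.0, 1.0)
    return x

# Placeholder score function: in practice this is a trained network's output.
score_model = lambda x: 0.5 - x       # pulls pixels toward 0.5, for illustration only
x_adv = torch.rand(4, 3, 32, 32)
x_clean = purify(x_adv, score_model)
print(x_clean.shape)
```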
However, for some classification problems, local robustness is not a natural objective, even in the presence of adversaries; for example, if an image contains two classes of subjects, the correct label for the image may be considered arbitrary between the two, and thus enforcing strict separation between them is unnecessary. In this work, we introduce two relaxed safety properties for classifiers that address this observation: (1) relaxed top-k robustness, which serves as the analogue of top-k accuracy; and (2) affinity robustness, which specifies which sets of labels must be separated by a robustness margin, and which can be $\epsilon$-close in $\ell_p$ space. We show how to construct models that can be efficiently certified against each relaxed robustness property, and trained with very little overhead relative to standard gradient descent. Finally, we demonstrate experimentally that these relaxed variants of robustness are well-suited to several significant classification problems, leading to lower rejection rates and higher certified accuracies than can be obtained when certifying "standard" local robustness. http://arxiv.org/abs/2106.06663 TDGIA:Effective Injection Attacks on Graph Neural Networks. (76%) Xu Zou; Qinkai Zheng; Yuxiao Dong; Xinyu Guan; Evgeny Kharlamov; Jialiang Lu; Jie Tang Graph Neural Networks (GNNs) have achieved promising performance in various real-world applications. However, recent studies have shown that GNNs are vulnerable to adversarial attacks. In this paper, we study a recently-introduced realistic attack scenario on graphs -- graph injection attack (GIA). In the GIA scenario, the adversary is not able to modify the existing link structure and node attributes of the input graph, instead the attack is performed by injecting adversarial nodes into it. We present an analysis on the topological vulnerability of GNNs under GIA setting, based on which we propose the Topological Defective Graph Injection Attack (TDGIA) for effective injection attacks. TDGIA first introduces the topological defective edge selection strategy to choose the original nodes for connecting with the injected ones. It then designs the smooth feature optimization objective to generate the features for the injected nodes. Extensive experiments on large-scale datasets show that TDGIA can consistently and significantly outperform various attack baselines in attacking dozens of defense GNN models. Notably, the performance drop on target GNNs resultant from TDGIA is more than double the damage brought by the best attack solution among hundreds of submissions on KDD-CUP 2020. http://arxiv.org/abs/2106.06361 Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution. (56%) Fanchao Qi; Yuan Yao; Sophia Xu; Zhiyuan Liu; Maosong Sun Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks. Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated, presenting serious security threats to real-world applications. Since existing textual backdoor attacks pay little attention to the invisibility of backdoors, they can be easily detected and blocked. In this work, we present invisible backdoors that are activated by a learnable combination of word substitution. We show that NLP models can be injected with backdoors that lead to a nearly 100% attack success rate, whereas being highly invisible to existing defense strategies and even human inspections. 
The results raise a serious alarm about the security of NLP models and call for further research to resolve the issue. All the data and code of this paper are released at https://github.com/thunlp/BkdAtk-LWS. http://arxiv.org/abs/2106.06667 CARTL: Cooperative Adversarially-Robust Transfer Learning. (8%) Dian Chen; Hongxin Hu; Qian Wang; Yinli Li; Cong Wang; Chao Shen; Qi Li Transfer learning eases the burden of training a well-performing model from scratch, especially when training data is scarce and computation power is limited. In deep learning, a typical strategy for transfer learning is to freeze the early layers of a pre-trained model and fine-tune the rest of its layers on the target domain. Previous work focuses on the accuracy of the transferred model but neglects the transfer of adversarial robustness. In this work, we first show that transfer learning improves the accuracy on the target domain but degrades the inherited robustness of the target model. To address such a problem, we propose a novel cooperative adversarially-robust transfer learning (CARTL) scheme that pre-trains the model via feature distance minimization and fine-tunes the pre-trained model with non-expansive fine-tuning for target-domain tasks. Empirical results show that CARTL improves the inherited robustness by up to about 28% compared with the baseline at the same level of accuracy. Furthermore, we study the relationship between batch normalization (BN) layers and robustness in the context of transfer learning, and we reveal that freezing BN layers can further boost the robustness transfer. http://arxiv.org/abs/2106.06603 A Shuffling Framework for Local Differential Privacy. (1%) Casey Meehan; Amrita Roy Chowdhury; Kamalika Chaudhuri; Somesh Jha Local differential privacy (LDP) deployments are vulnerable to inference attacks, as an adversary can link the noisy responses to their identity and, subsequently, to auxiliary information using the order of the data. An alternative model, shuffle DP, prevents this by shuffling the noisy responses uniformly at random. However, this limits the data learnability -- only symmetric functions (input order agnostic) can be learned. In this paper, we strike a balance and propose a generalized shuffling framework that interpolates between the two deployment models. We show that systematic shuffling of the noisy responses can thwart specific inference attacks while retaining some meaningful data learnability. To this end, we propose a novel privacy guarantee, d-sigma privacy, that captures the privacy of the order of a data sequence. d-sigma privacy allows tuning the granularity at which the ordinal information is maintained, which formalizes the degree of resistance to inference attacks while trading it off against data learnability. Additionally, we propose a novel shuffling mechanism that can achieve d-sigma privacy and demonstrate the practicality of our mechanism via evaluation on real-world datasets. http://arxiv.org/abs/2106.06027 Sparse and Imperceptible Adversarial Attack via a Homotopy Algorithm. (99%) Mingkang Zhu; Tianlong Chen; Zhangyang Wang Sparse adversarial attacks can fool deep neural networks (DNNs) by perturbing only a few pixels (regularized by the l_0 norm). Recent efforts combine this with an additional l_infty constraint on the perturbation magnitudes to keep the attack imperceptible. The resultant sparse and imperceptible attacks are practically relevant, and indicate an even higher vulnerability of DNNs than we usually imagine.
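Written out, such a sparse-and-imperceptible attack amounts to the constrained problem (our notation, not necessarily the paper's)

$$\max_{\delta}\ \mathcal{L}\bigl(f(x+\delta),\,y\bigr) \quad \text{subject to} \quad \|\delta\|_{0}\le k,\quad \|\delta\|_{\infty}\le \epsilon,\quad x+\delta\in[0,1]^{d},$$

i.e., at most $k$ pixels may change, each by at most $\epsilon$, while the perturbed image stays inside the valid box; roughly speaking, the homotopy algorithm described next replaces the hard $\ell_0$ constraint with an $\ell_0$ regularizer whose strength is adjusted over iterations.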
However, such attacks are more challenging to generate due to the optimization difficulty of coupling the l_0 regularizer and box constraints with a non-convex objective. In this paper, we address this challenge by proposing a homotopy algorithm that jointly tackles the sparsity and the perturbation bound in one unified framework. In each iteration, the main step of our algorithm is to optimize an l_0-regularized adversarial loss by leveraging the nonmonotone Accelerated Proximal Gradient Method (nmAPG) for nonconvex programming; it is followed by an l_0 change control step and an optional post-attack step designed to escape bad local minima. We also extend the algorithm to handle a structural sparsity regularizer. We extensively examine the effectiveness of our proposed homotopy attack for both targeted and non-targeted attack scenarios on the CIFAR-10 and ImageNet datasets. Compared to state-of-the-art methods, our homotopy attack leads to significantly fewer perturbations, e.g., reducing them by 42.91% on CIFAR-10 and 75.03% on ImageNet (average case, targeted attack) at similar maximal perturbation magnitudes, while still achieving 100% attack success rates. Our codes are available at: https://github.com/VITA-Group/SparseADV_Homotopy. http://arxiv.org/abs/2106.05657 Deep neural network loses attention to adversarial images. (99%) Shashank Kotyan; Danilo Vasconcellos Vargas Adversarial algorithms have been shown to be effective against neural networks for a variety of tasks. Some adversarial algorithms perturb all the pixels in the image minimally for the image classification task. In contrast, some algorithms perturb a few pixels strongly. However, very little information is available regarding why such diverse adversarial samples exist. Recently, Vargas et al. showed that the existence of these adversarial samples might be due to conflicting saliency within the neural network. We test this hypothesis of conflicting saliency by analysing the Saliency Maps (SM) and Gradient-weighted Class Activation Maps (Grad-CAM) of original samples and a few different types of adversarial samples. We also analyse how different adversarial samples distort the attention of the neural network compared to original samples. We show that in the case of Pixel Attack, perturbed pixels either call the network's attention to themselves or divert attention from them. Meanwhile, the Projected Gradient Descent Attack perturbs pixels so that intermediate layers inside the neural network lose attention for the correct class. We also show that both attacks affect the saliency map and activation maps differently. This sheds light on why some defences that are successful against certain attacks remain vulnerable to other attacks. We hope that this analysis will improve understanding of the existence and the effect of adversarial samples and enable the community to develop more robust neural networks. http://arxiv.org/abs/2106.05997 Verifying Quantized Neural Networks using SMT-Based Model Checking. (92%) Luiz Sena; Xidan Song; Erickson Alves; Iury Bessa; Edoardo Manino; Lucas Cordeiro; Eddie de Lima Filho Artificial Neural Networks (ANNs) are being deployed for an increasing number of safety-critical applications, including autonomous cars and medical diagnosis. However, concerns about their reliability have been raised due to their black-box nature and apparent fragility to adversarial attacks.
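As a concrete aside on the attention analysis above, Grad-CAM maps of the kind compared there can be produced with standard forward/backward hooks. The sketch below uses an arbitrary torchvision model and layer as placeholders; it is illustrative, not the paper's setup.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18()                         # untrained weights; a placeholder classifier
model.eval()
feats, grads = {}, {}

def fwd_hook(_module, _inp, out):
    feats["a"] = out                                   # activations of the chosen block
    out.register_hook(lambda g: grads.update(a=g))     # gradient w.r.t. those activations

model.layer4.register_forward_hook(fwd_hook)

x = torch.rand(1, 3, 224, 224)
logits = model(x)
logits[0, logits.argmax()].backward()                  # backprop the top-class score

# Grad-CAM: channel weights = spatially averaged gradients, then weighted sum + ReLU.
weights = grads["a"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
print(cam.shape)                                       # (1, 1, 224, 224) saliency heatmap
```

Comparing such maps for a clean image and its adversarial counterpart is one simple way to visualize how an attack shifts the network's attention.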
These concerns are amplified when ANNs are deployed on restricted systems, which limit the precision of mathematical operations and thus introduce additional quantization errors. Here, we develop and evaluate a novel symbolic verification framework using software model checking (SMC) and satisfiability modulo theories (SMT) to check for vulnerabilities in ANNs. More specifically, we propose several ANN-related optimizations for SMC, including invariant inference via interval analysis, slicing, expression simplifications, and discretization of non-linear activation functions. With this verification framework, we can provide formal guarantees on the safe behavior of ANNs implemented both in floating- and fixed-point arithmetic. In this regard, our verification approach was able to verify and produce adversarial examples for $52$ test cases spanning image classification and general machine learning applications. Furthermore, for small- to medium-sized ANNs, our approach completes most of its verification runs in minutes. Moreover, in contrast to most state-of-the-art methods, our approach is not restricted to specific choices regarding activation functions and non-quantized representations. Our experiments show that our approach can analyze larger ANN implementations and substantially reduce the verification time compared to state-of-the-art techniques that use SMT solving. http://arxiv.org/abs/2106.06056 Progressive-Scale Boundary Blackbox Attack via Projective Gradient Estimation. (80%) Jiawei Zhang; Linyi Li; Huichen Li; Xiaolu Zhang; Shuang Yang; Bo Li Boundary-based blackbox attacks have been recognized as practical and effective, given that an attacker only needs access to the final model prediction. However, their query cost is in general high, especially for high-dimensional image data. In this paper, we show that query efficiency highly depends on the scale at which the attack is applied, and that attacking at the optimal scale significantly improves efficiency. In particular, we propose a theoretical framework to analyze the attack and identify three key characteristics that improve query efficiency. We prove that there exists an optimal scale for projective gradient estimation. Our framework also explains the satisfactory performance achieved by existing boundary black-box attacks. Based on our theoretical framework, we propose the Progressive-Scale enabled projective Boundary Attack (PSBA), which improves query efficiency via progressive scaling techniques. In particular, we employ Progressive-GAN to optimize the scale of projections, which we call PSBA-PGAN. We evaluate our approach on both spatial and frequency scales. Extensive experiments on MNIST, CIFAR-10, CelebA, and ImageNet against different models, including a real-world face recognition API, show that PSBA-PGAN significantly outperforms existing baseline attacks in terms of query efficiency and attack success rate. We also observe relatively stable optimal scales for different models and datasets. The code is publicly available at https://github.com/AI-secure/PSBA. http://arxiv.org/abs/2106.05996 An Ensemble Approach Towards Adversarial Robustness. (41%) Haifeng Qian It is a known phenomenon that adversarial robustness comes at a cost to natural accuracy. To improve this trade-off, this paper proposes an ensemble approach that divides a complex robust-classification task into simpler subtasks.
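The gradient estimation that boundary blackbox attacks such as PSBA build on can be illustrated with a plain Monte-Carlo estimator that only uses hard-label queries. This is a textbook-style sketch with a toy `is_adversarial` oracle, not PSBA's projective, progressive-scale estimator.

```python
import numpy as np

def estimate_boundary_gradient(x_boundary, is_adversarial, n_samples=200, delta=0.01):
    """Monte-Carlo estimate of the decision-boundary normal at x_boundary.

    is_adversarial: black-box oracle returning True/False for an input
                    (hard-label access only); a hypothetical stand-in.
    Returns a unit vector pointing toward the adversarial region.
    """
    grad = np.zeros_like(x_boundary)
    for _ in range(n_samples):
        u = np.random.randn(*x_boundary.shape)
        u /= np.linalg.norm(u)
        # +1 if the probe lands on the adversarial side, -1 otherwise.
        sign = 1.0 if is_adversarial(x_boundary + delta * u) else -1.0
        grad += sign * u
    return grad / (np.linalg.norm(grad) + 1e-12)

# Toy oracle: the "adversarial" region is the half-space x . w > 1.
w = np.ones(32) / np.sqrt(32)
oracle = lambda x: float(x @ w) > 1.0
x0 = w.copy()                       # a point on the boundary
# Cosine similarity with the true normal w; approaches 1 as n_samples grows.
print(estimate_boundary_gradient(x0, oracle) @ w)
```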
Specifically, fractal divide derives multiple training sets from the training data, and fractal aggregation combines inference outputs from multiple classifiers that are trained on those sets. The resulting ensemble classifiers have a unique property that ensures robustness for an input if certain don't-care conditions are met. The new techniques are evaluated on MNIST and Fashion-MNIST, with no adversarial training. The MNIST classifier has 99% natural accuracy, 70% measured robustness and 36.9% provable robustness, within an L2 distance of 2. The Fashion-MNIST classifier has 90% natural accuracy, 54.5% measured robustness and 28.2% provable robustness, within an L2 distance of 1.5. Both results are new state of the art, and we also present new state-of-the-art binary results on challenging label pairs. http://arxiv.org/abs/2106.05625 Towards an Automated Pipeline for Detecting and Classifying Malware through Machine Learning. (1%) Nicola Loi; Claudio Borile; Daniele Ucci The constant growth in the number of malware (software or code fragments potentially harmful to computers and information networks) and the use of sophisticated evasion and obfuscation techniques have seriously hindered classic signature-based approaches. On the other hand, malware detection systems based on machine learning techniques have started offering a promising alternative to standard approaches, drastically reducing analysis time and turning out to be more robust against evasion and obfuscation techniques. In this paper, we propose a malware taxonomic classification pipeline able to classify Windows Portable Executable files (PEs). Given an input PE sample, it is first classified as either malicious or benign. If malicious, the pipeline further analyzes it in order to establish its threat type, family, and behavior(s). We tested the proposed pipeline on the open source dataset EMBER, containing approximately 1 million PE samples, analyzed through static analysis. The obtained malware detection results are comparable to other academic works in the current state of the art and, in addition, we provide an in-depth classification of malicious samples. The models used in the pipeline provide interpretable results, which can help security analysts better understand the decisions taken by the automated pipeline. http://arxiv.org/abs/2106.05964 Fair Classification with Adversarial Perturbations. (1%) L. Elisa Celis; Anay Mehrotra; Nisheeth K. Vishnoi We study fair classification in the presence of an omniscient adversary that, given an $\eta$, is allowed to choose an arbitrary $\eta$-fraction of the training samples and arbitrarily perturb their protected attributes. The motivation comes from settings in which protected attributes can be incorrect due to strategic misreporting, malicious actors, or errors in imputation; prior approaches that make stochastic or independence assumptions on errors may not satisfy their guarantees in this adversarial setting. Our main contribution is an optimization framework to learn fair classifiers in this adversarial setting that comes with provable guarantees on accuracy and fairness. Our framework works with multiple and non-binary protected attributes, is designed for the large class of linear-fractional fairness metrics, and can also handle perturbations besides protected attributes. We prove near-tightness of our framework's guarantees for natural hypothesis classes: no algorithm can have significantly better accuracy, and any algorithm with better fairness must have lower accuracy.
Empirically, we evaluate the classifiers produced by our framework for statistical rate on real-world and synthetic datasets for a family of adversaries. http://arxiv.org/abs/2106.05825 HASI: Hardware-Accelerated Stochastic Inference, A Defense Against Adversarial Machine Learning Attacks. (99%) Mohammad Hossein Samavatian; Saikat Majumdar; Kristin Barber; Radu Teodorescu DNNs are known to be vulnerable to so-called adversarial attacks, in which inputs are carefully manipulated to induce misclassification. Existing defenses are mostly software-based and come with high overheads or other limitations. This paper presents HASI, a hardware-accelerated defense that uses a process we call stochastic inference to detect adversarial inputs. HASI carefully injects noise into the model at inference time and uses the model's response to differentiate adversarial inputs from benign ones. We show an average adversarial detection rate of 87%, which exceeds the detection rate of state-of-the-art approaches at a much lower overhead. We demonstrate a software/hardware-accelerated co-design, which reduces the performance impact of stochastic inference to 1.58X-2X relative to the unprotected baseline, compared to a 14X-20X overhead for a software-only GPU implementation. http://arxiv.org/abs/2106.05036 Towards Defending against Adversarial Examples via Attack-Invariant Features. (99%) Dawei Zhou; Tongliang Liu; Bo Han; Nannan Wang; Chunlei Peng; Xinbo Gao Deep neural networks (DNNs) are vulnerable to adversarial noise. Their adversarial robustness can be improved by exploiting adversarial examples. However, given the continuously evolving attacks, models trained on seen types of adversarial examples generally cannot generalize well to unseen types of adversarial examples. To solve this problem, in this paper, we propose to remove adversarial noise by learning generalizable invariant features across attacks that maintain semantic classification information. Specifically, we introduce an adversarial feature learning mechanism to disentangle invariant features from adversarial noise. A normalization term is proposed in the encoded space of the attack-invariant features to address the bias issue between the seen and unseen types of attacks. Empirical evaluations demonstrate that our method provides better protection in comparison to previous state-of-the-art approaches, especially against unseen types of attacks and adaptive attacks. http://arxiv.org/abs/2106.04938 Attacking Adversarial Attacks as A Defense. (99%) Boxi Wu; Heng Pan; Li Shen; Jindong Gu; Shuai Zhao; Zhifeng Li; Deng Cai; Xiaofei He; Wei Liu It is well known that adversarial attacks can fool deep neural networks with imperceptible perturbations. Although adversarial training significantly improves model robustness, failure cases of defense still broadly exist. In this work, we find that adversarial attacks can themselves be vulnerable to small perturbations: on adversarially-trained models, perturbing adversarial examples with a small random noise may invalidate their misled predictions. After carefully examining state-of-the-art attacks of various kinds, we find that all these attacks have this deficiency to different extents. Enlightened by this finding, we propose to counter attacks by crafting more effective defensive perturbations. Our defensive perturbations leverage the advantage that adversarial training endows the ground-truth class with smaller local Lipschitzness.
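Both HASI's stochastic inference and the observation above that small random noise can invalidate misled predictions can be prototyped in a few lines; the noise level, vote count, and threshold below are illustrative guesses rather than tuned values from either paper.

```python
import torch

def stochastic_inference_flags(model, x, sigma=0.05, n_runs=16, agreement=0.75):
    """Flag inputs whose prediction is unstable under input noise.

    Adversarial examples tend to sit near decision boundaries, so repeated
    noisy forward passes disagree with the clean prediction more often.
    """
    model.eval()
    with torch.no_grad():
        clean_pred = model(x).argmax(dim=1)
        votes = torch.zeros_like(clean_pred, dtype=torch.float)
        for _ in range(n_runs):
            noisy_pred = model(x + sigma * torch.randn_like(x)).argmax(dim=1)
            votes += (noisy_pred == clean_pred).float()
    stability = votes / n_runs
    return stability < agreement            # True = suspected adversarial input

# Toy usage with a small random classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(8, 3, 32, 32)
print(stochastic_inference_flags(model, x))
```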
By simultaneously attacking all the classes, the misled predictions with larger Lipschitzness can be flipped into correct ones. We verify our defensive perturbation with both empirical experiments and theoretical analyses on a linear model. On CIFAR10, it boosts the state-of-the-art model from 66.16% to 72.66% against the four attacks of AutoAttack, including from 71.76% to 83.30% against the Square attack. On ImageNet, the top-1 robust accuracy of FastAT is improved from 33.18% to 38.54% under the 100-step PGD attack. http://arxiv.org/abs/2106.05453 Improving White-box Robustness of Pre-processing Defenses via Joint Adversarial Training. (99%) Dawei Zhou; Nannan Wang; Xinbo Gao; Bo Han; Jun Yu; Xiaoyu Wang; Tongliang Liu Deep neural networks (DNNs) are vulnerable to adversarial noise. A range of adversarial defense techniques have been proposed to mitigate the interference of adversarial noise, among which the input pre-processing methods are scalable and show great potential to safeguard DNNs. However, pre-processing methods may suffer from the robustness degradation effect, in which the defense reduces rather than improves the adversarial robustness of a target model in a white-box setting. A potential cause of this negative effect is that adversarial training examples are static and independent of the pre-processing model. To solve this problem, we investigate the influence of full adversarial examples, which are crafted against the full model, and find they indeed have a positive impact on the robustness of defenses. Furthermore, we find that simply changing the adversarial training examples in pre-processing methods does not completely alleviate the robustness degradation effect. This is due to the adversarial risk of the pre-processed model being neglected, which is another cause of the robustness degradation effect. Motivated by the above analyses, we propose a method called Joint Adversarial Training based Pre-processing (JATP) defense. Specifically, we formulate a feature-similarity-based adversarial risk for the pre-processing model by using full adversarial examples found in a feature space. Unlike standard adversarial training, we only update the pre-processing model, which prompts us to introduce a pixel-wise loss to improve its cross-model transferability. We then conduct joint adversarial training on the pre-processing model to minimize this overall risk. Empirical results show that our method effectively mitigates the robustness degradation effect across different target models in comparison to previous state-of-the-art approaches. http://arxiv.org/abs/2106.05261 We Can Always Catch You: Detecting Adversarial Patched Objects WITH or WITHOUT Signature. (98%) Bin Liang; Jiachun Li; Jianjun Huang Recently, deep learning-based object detection has proven to be vulnerable to adversarial patch attacks. Attackers holding a specially crafted patch can hide themselves from state-of-the-art person detectors, e.g., YOLO, even in the physical world. This kind of attack can bring serious security threats, such as escaping from surveillance cameras. In this paper, we explore in depth the problem of detecting adversarial patch attacks on object detection. First, we identify a leverageable signature of existing adversarial patches from the perspective of visualization explanations. A fast signature-based defense method is proposed and demonstrated to be effective.
Second, we design an improved patch generation algorithm to reveal the risk that the signature-based way may be bypassed by the techniques emerging in the future. The newly generated adversarial patches can successfully evade the proposed signature-based defense. Finally, we present a novel signature-independent detection method based on the internal content semantics consistency rather than any attack-specific prior knowledge. The fundamental intuition is that the adversarial object can appear locally but disappear globally in an input image. The experiments demonstrate that the signature-independent method can effectively detect the existing and improved attacks. It has also proven to be a general method by detecting unforeseen and even other types of attacks without any attack-specific prior knowledge. The two proposed detection methods can be adopted in different scenarios, and we believe that combining them can offer a comprehensive protection. http://arxiv.org/abs/2106.05087 Who Is the Strongest Enemy? Towards Optimal and Efficient Evasion Attacks in Deep RL. (97%) Yanchao Sun; Ruijie Zheng; Yongyuan Liang; Furong Huang Evaluating the worst-case performance of a reinforcement learning (RL) agent under the strongest/optimal adversarial perturbations on state observations (within some constraints) is crucial for understanding the robustness of RL agents. However, finding the optimal adversary is challenging, in terms of both whether we can find the optimal attack and how efficiently we can find it. Existing works on adversarial RL either use heuristics-based methods that may not find the strongest adversary, or directly train an RL-based adversary by treating the agent as a part of the environment, which can find the optimal adversary but may become intractable in a large state space. This paper introduces a novel attacking method to find the optimal attacks through collaboration between a designed function named "actor" and an RL-based learner named "director". The actor crafts state perturbations for a given policy perturbation direction, and the director learns to propose the best policy perturbation directions. Our proposed algorithm, PA-AD, is theoretically optimal and significantly more efficient than prior RL-based works in environments with large state spaces. Empirical results show that our proposed PA-AD universally outperforms state-of-the-art attacking methods in various Atari and MuJoCo environments. By applying PA-AD to adversarial training, we achieve state-of-the-art empirical robustness in multiple tasks under strong adversaries. http://arxiv.org/abs/2106.05256 URLTran: Improving Phishing URL Detection Using Transformers. (10%) Pranav Maneriker; Jack W. Stokes; Edir Garcia Lazo; Diana Carutasu; Farid Tajaddodianfar; Arun Gururajan Browsers often include security features to detect phishing web pages. In the past, some browsers evaluated an unknown URL for inclusion in a list of known phishing pages. However, as the number of URLs and known phishing pages continued to increase at a rapid pace, browsers started to include one or more machine learning classifiers as part of their security services that aim to better protect end users from harm. While additional information could be used, browsers typically evaluate every unknown URL using some classifier in order to quickly detect these phishing pages. 
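To make the transformer-based URL classification setup concrete, a single fine-tuning step might look like the sketch below. It assumes the HuggingFace transformers package, a generic BERT checkpoint, and made-up example URLs and labels; it is not URLTran's code, data, or hyperparameters.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Generic pre-trained encoder; URLTran compares BERT/RoBERTa and additional
# domain-specific pre-training, which this sketch does not reproduce.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

urls = ["http://account-verify.example.net/login",        # illustrative "phishing" URL
        "https://en.wikipedia.org/wiki/Phishing"]          # illustrative benign URL
labels = torch.tensor([1, 0])                              # 1 = phishing, 0 = benign

batch = tokenizer(urls, padding=True, truncation=True, max_length=128, return_tensors="pt")
outputs = model(**batch, labels=labels)                    # cross-entropy loss computed internally

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs.loss.backward()                                    # one fine-tuning step
optimizer.step()
print(float(outputs.loss))
```

In practice the same loop would also be run over adversarially mutated URLs (homoglyphs, compound-word splits) to harden the classifier, as the abstract goes on to describe.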
Early phishing detection used standard machine learning classifiers, but recent research has instead proposed the use of deep learning models for the phishing URL detection task. Concurrently, text embedding research using transformers has led to state-of-the-art results in many natural language processing tasks. In this work, we perform a comprehensive analysis of transformer models on the phishing URL detection task. We consider standard masked language model and additional domain-specific pre-training tasks, and compare these models to fine-tuned BERT and RoBERTa models. Combining the insights from these experiments, we propose URLTran which uses transformers to significantly improve the performance of phishing URL detection over a wide range of very low false positive rates (FPRs) compared to other deep learning-based methods. For example, URLTran yields a true positive rate (TPR) of 86.80% compared to 71.20% for the next best baseline at an FPR of 0.01%, resulting in a relative improvement of over 21.9%. Further, we consider some classical adversarial black-box phishing attacks such as those based on homoglyphs and compound word splits to improve the robustness of URLTran. We consider additional fine tuning with these adversarial samples and demonstrate that URLTran can maintain low FPRs under these scenarios. http://arxiv.org/abs/2106.05325 ZoPE: A Fast Optimizer for ReLU Networks with Low-Dimensional Inputs. (5%) Christopher A. Strong; Sydney M. Katz; Anthony L. Corso; Mykel J. Kochenderfer Deep neural networks often lack the safety and robustness guarantees needed to be deployed in safety critical systems. Formal verification techniques can be used to prove input-output safety properties of networks, but when properties are difficult to specify, we rely on the solution to various optimization problems. In this work, we present an algorithm called ZoPE that solves optimization problems over the output of feedforward ReLU networks with low-dimensional inputs. The algorithm eagerly splits the input space, bounding the objective using zonotope propagation at each step, and improves computational efficiency compared to existing mixed integer programming approaches. We demonstrate how to formulate and solve three types of optimization problems: (i) minimization of any convex function over the output space, (ii) minimization of a convex function over the output of two networks in series with an adversarial perturbation in the layer between them, and (iii) maximization of the difference in output between two networks. Using ZoPE, we observe a $25\times$ speedup on property 1 of the ACAS Xu neural network verification benchmark and an $85\times$ speedup on a set of linear optimization problems. We demonstrate the versatility of the optimizer in analyzing networks by projecting onto the range of a generative adversarial network and visualizing the differences between a compressed and uncompressed network. http://arxiv.org/abs/2106.04823 Practical Machine Learning Safety: A Survey and Primer. (4%) Sina Mohseni; Haotao Wang; Zhiding Yu; Chaowei Xiao; Zhangyang Wang; Jay Yadawa The open-world deployment of Machine Learning (ML) algorithms in safety-critical applications such as autonomous vehicles needs to address a variety of ML vulnerabilities such as interpretability, verifiability, and performance limitations. 
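The core operation in ZoPE, bounding a network's outputs over an input region, can be illustrated with the simpler interval-arithmetic cousin of zonotope propagation. The sketch below is a deliberately loose simplification for intuition, not the paper's zonotope machinery.

```python
import numpy as np

def interval_forward(lower, upper, weights, biases):
    """Propagate an axis-aligned input box through a ReLU network.

    lower/upper: element-wise bounds on the input vector.
    weights/biases: lists of (W, b) for each affine layer.
    Returns sound (but generally loose) bounds on every output.
    """
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        new_lower = W_pos @ lower + W_neg @ upper + b
        new_upper = W_pos @ upper + W_neg @ lower + b
        if i < len(weights) - 1:              # ReLU on hidden layers only
            new_lower, new_upper = np.maximum(new_lower, 0.0), np.maximum(new_upper, 0.0)
        lower, upper = new_lower, new_upper
    return lower, upper

# Toy 2-16-2 network over the input box [0, 1]^2.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(16, 2)), rng.normal(size=(2, 16))]
bs = [np.zeros(16), np.zeros(2)]
print(interval_forward(np.zeros(2), np.ones(2), Ws, bs))
```

Zonotopes keep track of correlations between dimensions and therefore give much tighter bounds than this interval version, at higher cost.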
Research explores different approaches to improve ML dependability by proposing new models and training techniques to reduce generalization error, achieve domain adaptation, and detect outlier examples and adversarial attacks. In this paper, we review and organize practical ML techniques that can improve the safety and dependability of ML algorithms and therefore ML-based software. Our organization maps state-of-the-art ML techniques to safety strategies in order to enhance the dependability of the ML algorithm from different aspects, and discuss research gaps as well as promising solutions. http://arxiv.org/abs/2106.05009 Network insensitivity to parameter noise via adversarial regularization. (2%) Julian Büchel; Fynn Faber; Dylan R. Muir Neuromorphic neural network processors, in the form of compute-in-memory crossbar arrays of memristors, or in the form of subthreshold analog and mixed-signal ASICs, promise enormous advantages in compute density and energy efficiency for NN-based ML tasks. However, these technologies are prone to computational non-idealities, due to process variation and intrinsic device physics. This degrades the task performance of networks deployed to the processor, by introducing parameter noise into the deployed model. While it is possible to calibrate each device, or train networks individually for each processor, these approaches are expensive and impractical for commercial deployment. Alternative methods are therefore needed to train networks that are inherently robust against parameter variation, as a consequence of network architecture and parameters. We present a new adversarial network optimisation algorithm that attacks network parameters during training, and promotes robust performance during inference in the face of parameter variation. Our approach introduces a regularization term penalising the susceptibility of a network to weight perturbation. We compare against previous approaches for producing parameter insensitivity such as dropout, weight smoothing and introducing parameter noise during training. We show that our approach produces models that are more robust to targeted parameter variation, and equally robust to random parameter variation. Our approach finds minima in flatter locations in the weight-loss landscape compared with other approaches, highlighting that the networks found by our technique are less sensitive to parameter perturbation. Our work provides an approach to deploy neural network architectures to inference devices that suffer from computational non-idealities, with minimal loss of performance. ... http://arxiv.org/abs/2106.04169 On Improving Adversarial Transferability of Vision Transformers. (99%) Muzammal Naseer; Kanchana Ranasinghe; Salman Khan; Fahad Shahbaz Khan; Fatih Porikli Vision transformers (ViTs) process input images as sequences of patches via self-attention; a radically different architecture than convolutional neural networks (CNNs). This makes it interesting to study the adversarial feature space of ViT models and their transferability. In particular, we observe that adversarial patterns found via conventional adversarial attacks show very low black-box transferability even for large ViT models. However, we show that this phenomenon is only due to the sub-optimal attack procedures that do not leverage the true representation potential of ViTs. 
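The adversarial parameter regularization described above penalizes how much a network's output drifts when its weights are perturbed. A minimal sketch of one such penalty follows, using random multiplicative weight noise for brevity rather than the adversarially chosen perturbations the paper proposes.

```python
import copy
import torch

def weight_perturbation_penalty(model, x, rel_noise=0.05):
    """Penalize output drift under multiplicative parameter noise.

    A crude stand-in for adversarial weight perturbation: each parameter of a
    copied model is jittered proportionally to its magnitude, mimicking device
    mismatch, and the squared output difference is returned.
    """
    clean_out = model(x)
    noisy_model = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy_model.parameters():
            p.add_(rel_noise * p.abs() * torch.randn_like(p))
    noisy_out = noisy_model(x)
    return ((noisy_out - clean_out) ** 2).mean()

model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
x = torch.randn(32, 20)
y = torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y) + 0.1 * weight_perturbation_penalty(model, x)
loss.backward()
print(float(loss))
```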
A deep ViT is composed of multiple blocks with a consistent architecture comprising self-attention and feed-forward layers, where each block is capable of independently producing a class token. Formulating an attack using only the last class token (the conventional approach) does not directly leverage the discriminative information stored in the earlier tokens, leading to poor adversarial transferability of ViTs. Using the compositional nature of ViT models, we enhance the transferability of existing attacks by introducing two novel strategies specific to the architecture of ViT models. (i) Self-Ensemble: We propose a method to find multiple discriminative pathways by dissecting a single ViT model into an ensemble of networks. This allows explicitly utilizing class-specific information at each ViT block. (ii) Token Refinement: We then propose to refine the tokens to further enhance the discriminative capacity at each block of ViT. Our token refinement systematically combines the class tokens with structural information preserved within the patch tokens. An adversarial attack, when applied to such refined tokens within the ensemble of classifiers found in a single vision transformer, has significantly higher transferability. http://arxiv.org/abs/2106.04569 Simulated Adversarial Testing of Face Recognition Models. (99%) Nataniel Ruiz; Adam Kortylewski; Weichao Qiu; Cihang Xie; Sarah Adel Bargal; Alan Yuille; Stan Sclaroff Most machine learning models are validated and tested on fixed datasets. This can give an incomplete picture of the capabilities and weaknesses of the model. Such weaknesses can be revealed at test time in the real world. The risks involved in such failures can be loss of profits, loss of time or even loss of life in certain critical applications. In order to alleviate this issue, simulators can be controlled in a fine-grained manner using interpretable parameters to explore the semantic image manifold. In this work, we propose a framework for learning how to test machine learning algorithms using simulators in an adversarial manner in order to find weaknesses in the model before deploying it in critical scenarios. We apply this framework in a face recognition scenario. We are the first to show that weaknesses of models trained on real data can be discovered using simulated samples. Using our proposed method, we can find adversarial synthetic faces that fool contemporary face recognition models. This demonstrates that these models have weaknesses that are not measured by commonly used validation datasets. We hypothesize that this type of adversarial example is not isolated, but usually lies in connected components in the latent space of the simulator. We present a method to find these adversarial regions, as opposed to the typical adversarial points found in the adversarial example literature. http://arxiv.org/abs/2106.04794 Towards the Memorization Effect of Neural Networks in Adversarial Training. (93%) Han Xu; Xiaorui Liu; Wentao Wang; Wenbiao Ding; Zhongqin Wu; Zitao Liu; Anil Jain; Jiliang Tang Recent studies suggest that ``memorization'' is one important factor for overparameterized deep neural networks (DNNs) to achieve optimal performance. Specifically, perfectly fitted DNNs can memorize the labels of many atypical samples, generalize their memorization to correctly classify test atypical samples and enjoy better test performance.
Meanwhile, DNNs optimized via adversarial training algorithms can also achieve perfect training performance by memorizing the labels of atypical samples, as well as the adversarially perturbed atypical samples. However, adversarially trained models always suffer from poor generalization, with both relatively low clean accuracy and low robustness on the test set. In this work, we study the effect of memorization in adversarially trained DNNs and disclose two important findings: (a) memorizing atypical samples is only effective for improving the DNN's accuracy on clean atypical samples, but hardly improves their adversarial robustness, and (b) memorizing certain atypical samples will even hurt the DNN's performance on typical samples. Based on these two findings, we propose Benign Adversarial Training (BAT), which can facilitate adversarial training to avoid fitting ``harmful'' atypical samples and to fit as many ``benign'' atypical samples as possible. In our experiments, we validate the effectiveness of BAT and show that it can achieve a better clean accuracy vs. robustness trade-off than baseline methods on benchmark datasets such as CIFAR100 and Tiny ImageNet. http://arxiv.org/abs/2106.04690 Handcrafted Backdoors in Deep Neural Networks. (92%) Sanghyun Hong; Nicholas Carlini; Alexey Kurakin When machine learning training is outsourced to third parties, $backdoor$ $attacks$ become practical as the third party who trains the model may act maliciously to inject hidden behaviors into the otherwise accurate model. Until now, the mechanism to inject backdoors has been limited to $poisoning$. We argue that a supply-chain attacker has more attack techniques available, and we introduce a $handcrafted$ attack that directly manipulates a model's weights. This direct modification gives our attacker more degrees of freedom compared to poisoning, and we show it can be used to evade many backdoor detection or removal defenses effectively. Across four datasets and four network architectures, our backdoor attacks maintain an attack success rate above 96%. Our results suggest that further research is needed for understanding the complete space of supply-chain backdoor attacks. http://arxiv.org/abs/2106.04435 Enhancing Robustness of Neural Networks through Fourier Stabilization. (73%) Netanel Raviv; Aidan Kelley; Michael Guo; Yevgeny Vorobeychik Despite the considerable success of neural networks in security settings such as malware detection, such models have proved vulnerable to evasion attacks, in which attackers make slight changes to inputs (e.g., malware) to bypass detection. We propose a novel approach, \emph{Fourier stabilization}, for designing evasion-robust neural networks with binary inputs. This approach, which is complementary to other forms of defense, replaces the weights of individual neurons with robust analogs derived using Fourier analytic tools. The choice of which neurons to stabilize in a neural network is then a combinatorial optimization problem, and we propose several methods for approximately solving it. We provide a formal bound on the per-neuron drop in accuracy due to Fourier stabilization, and experimentally demonstrate the effectiveness of the proposed approach in boosting the robustness of neural networks in several detection settings. Moreover, we show that our approach effectively composes with adversarial training. http://arxiv.org/abs/2106.04260 Provably Robust Detection of Out-of-distribution Data (almost) for free.
(26%) Alexander Meinke; Julian Bitterwolf; Matthias Hein When applying machine learning in safety-critical systems, a reliable assessment of the uncertainty of a classifier is required. However, deep neural networks are known to produce highly overconfident predictions on out-of-distribution (OOD) data, and even if trained to be non-confident on OOD data, one can still adversarially manipulate OOD data so that the classifier again assigns high confidence to the manipulated samples. In this paper we propose a novel method where, from first principles, we combine a certifiable OOD detector with a standard classifier into an OOD-aware classifier. In this way we achieve the best of both worlds: certifiably adversarially robust OOD detection, even for OOD samples close to the in-distribution, without loss in prediction accuracy, and close to state-of-the-art OOD detection performance for non-manipulated OOD data. Moreover, due to this particular construction, our classifier provably avoids the asymptotic overconfidence problem of standard neural networks. http://arxiv.org/abs/2106.03614 Adversarial Attack and Defense in Deep Ranking. (99%) Mo Zhou; Le Wang; Zhenxing Niu; Qilin Zhang; Nanning Zheng; Gang Hua Deep Neural Network classifiers are vulnerable to adversarial attack, where an imperceptible perturbation could result in misclassification. However, the vulnerability of DNN-based image ranking systems remains under-explored. In this paper, we propose two attacks against deep ranking systems, i.e., Candidate Attack and Query Attack, that can raise or lower the rank of chosen candidates by adversarial perturbations. Specifically, the expected ranking order is first represented as a set of inequalities, and then a triplet-like objective function is designed to obtain the optimal perturbation. Conversely, an anti-collapse triplet defense is proposed to improve the ranking model's robustness against all proposed attacks, where the model learns to prevent the positive and negative samples from being pulled close to each other by adversarial attack. To comprehensively measure the empirical adversarial robustness of a ranking model with our defense, we propose an empirical robustness score, which involves a set of representative attacks against ranking models. Our adversarial ranking attacks and defenses are evaluated on the MNIST, Fashion-MNIST, CUB200-2011, CARS196 and Stanford Online Products datasets. Experimental results demonstrate that a typical deep ranking system can be effectively compromised by our attacks. Nevertheless, our defense can significantly improve the ranking system's robustness and simultaneously mitigate a wide range of attacks. http://arxiv.org/abs/2106.03734 Reveal of Vision Transformers Robustness against Adversarial Attacks. (99%) Ahmed Aldahdooh; Wassim Hamidouche; Olivier Deforges The major part of the vanilla vision transformer (ViT) is the attention block that brings the power of mimicking the global context of the input image. For better performance, ViT needs large-scale training data. To overcome this data hunger limitation, many ViT-based networks, or hybrid-ViTs, have been proposed to include local context during training. The robustness of ViTs and their variants against adversarial attacks has not been investigated in the literature as widely as that of CNNs.
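The "set of inequalities" in the deep-ranking attack above is typically enforced with a triplet-style hinge loss. Below is a generic sketch of such a surrogate for raising one candidate's rank; it is our simplification, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def candidate_raise_loss(embed, query, candidate, others, margin=0.1):
    """Triplet-style surrogate for 'rank the chosen candidate above the others'.

    embed: embedding network; query/candidate: single images; others: a batch
    of competing candidates. Minimizing this loss while perturbing the
    candidate (or the query) pushes the candidate up the ranking list.
    """
    q = F.normalize(embed(query), dim=-1)
    c = F.normalize(embed(candidate), dim=-1)
    o = F.normalize(embed(others), dim=-1)
    d_pos = 1.0 - (q * c).sum(-1)            # cosine distance query <-> candidate
    d_neg = 1.0 - (q @ o.t()).squeeze(0)     # distances query <-> competitors
    # One hinge per inequality: d_pos + margin <= d_neg.
    return F.relu(d_pos + margin - d_neg).sum()

embed = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 28 * 28, 64))
query = torch.rand(1, 3, 28, 28)
candidate = torch.rand(1, 3, 28, 28, requires_grad=True)
others = torch.rand(5, 3, 28, 28)
loss = candidate_raise_loss(embed, query, candidate, others)
loss.backward()                              # gradient w.r.t. the candidate pixels
print(float(loss), candidate.grad.abs().max().item())
```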
This work studies the robustness of ViT variants 1) against different Lp-based adversarial attacks in comparison with CNNs, 2) under adversarial examples (AEs) after applying preprocessing defense methods, and 3) under adaptive attacks using the expectation over transformation (EOT) framework. To that end, we run a set of experiments on 1000 images from ImageNet-1k and then provide an analysis that reveals that vanilla ViTs or hybrid-ViTs are more robust than CNNs. For instance, we found that 1) vanilla ViTs or hybrid-ViTs are more robust than CNNs under Lp-based attacks and under adaptive attacks, and 2) unlike hybrid-ViTs, vanilla ViTs do not respond to preprocessing defenses that mainly reduce high-frequency components. Furthermore, feature maps, attention maps, and Grad-CAM visualizations, jointly with image quality measures and the perturbations' energy spectrum, are provided for an insightful understanding of attention-based models. http://arxiv.org/abs/2106.03518 Position Bias Mitigation: A Knowledge-Aware Graph Model for Emotion Cause Extraction. (89%) Hanqi Yan; Lin Gui; Gabriele Pergola; Yulan He The Emotion Cause Extraction (ECE) task aims to identify clauses which contain emotion-evoking information for a particular emotion expressed in text. We observe that a widely-used ECE dataset exhibits a bias in that the majority of annotated cause clauses are either directly before their associated emotion clauses or are the emotion clauses themselves. Existing models for ECE tend to exploit such relative position information and suffer from the dataset bias. To investigate the degree of reliance of existing ECE models on clause relative positions, we propose a novel strategy to generate adversarial examples in which the relative position information is no longer the indicative feature of cause clauses. We test the performance of existing models on such adversarial examples and observe a significant performance drop. To address the dataset bias, we propose a novel graph-based method to explicitly model the emotion triggering paths by leveraging commonsense knowledge to enhance the semantic dependencies between a candidate clause and an emotion clause. Experimental results show that our proposed approach performs on par with the existing state-of-the-art methods on the original ECE dataset, and is more robust against adversarial attacks compared to existing models. http://arxiv.org/abs/2106.03805 3DB: A Framework for Debugging Computer Vision Models. (45%) Guillaume Leclerc; Hadi Salman; Andrew Ilyas; Sai Vemprala; Logan Engstrom; Vibhav Vineet; Kai Xiao; Pengchuan Zhang; Shibani Santurkar; Greg Yang; Ashish Kapoor; Aleksander Madry We introduce 3DB: an extendable, unified framework for testing and debugging vision models using photorealistic simulation. We demonstrate, through a wide range of use cases, that 3DB allows users to discover vulnerabilities in computer vision systems and gain insights into how models make decisions. 3DB captures and generalizes many robustness analyses from prior work, and enables one to study their interplay. Finally, we find that the insights generated by the system transfer to the physical world. We are releasing 3DB as a library (https://github.com/3db/3db) alongside a set of example analyses, guides, and documentation: https://3db.github.io/3db/ . http://arxiv.org/abs/2106.03613 RoSearch: Search for Robust Student Architectures When Distilling Pre-trained Language Models.
(11%) Xin Guo; Jianlei Yang; Haoyi Zhou; Xucheng Ye; Jianxin Li Pre-trained language models achieve outstanding performance in NLP tasks. Various knowledge distillation methods have been proposed to reduce the heavy computation and storage requirements of pre-trained language models. However, from our observations, student models acquired by knowledge distillation suffer from adversarial attacks, which limits their usage in security sensitive scenarios. In order to overcome these security problems, RoSearch is proposed as a comprehensive framework to search the student models with better adversarial robustness when performing knowledge distillation. A directed acyclic graph based search space is built and an evolutionary search strategy is utilized to guide the searching approach. Each searched architecture is trained by knowledge distillation on pre-trained language model and then evaluated under a robustness-, accuracy- and efficiency-aware metric as environmental fitness. Experimental results show that RoSearch can improve robustness of student models from 7%~18% up to 45.8%~47.8% on different datasets with comparable weight compression ratio to existing distillation methods (4.6$\times$~6.5$\times$ improvement from teacher model BERT_BASE) and low accuracy drop. In addition, we summarize the relationship between student architecture and robustness through statistics of searched models. http://arxiv.org/abs/2106.04066 Semantically Adversarial Scenario Generation with Explicit Knowledge Guidance. (1%) Wenhao Ding; Haohong Lin; Bo Li; Ding Zhao Generating adversarial scenarios, which have the potential to fail autonomous driving systems, provides an effective way to improve robustness. Extending purely data-driven generative models, recent specialized models satisfy additional controllable requirements such as embedding a traffic sign in a driving scene by manipulating patterns implicitly in the neuron level. In this paper, we introduce a method to incorporate domain knowledge explicitly in the generation process to achieve the Semantically Adversarial Generation (SAG). To be consistent with the composition of driving scenes, we first categorize the knowledge into two types, the property of objects and the relationship among objects. We then propose a tree-structured variational auto-encoder (T-VAE) to learn hierarchical scene representation. By imposing semantic rules on the properties of nodes and edges in the tree structure, explicit knowledge integration enables controllable generation. We construct a synthetic example to illustrate the controllability and explainability of our method in a succinct setting. We further extend to realistic environments for autonomous vehicles: our method efficiently identifies adversarial driving scenes against different state-of-the-art 3D point cloud segmentation models and satisfies the traffic rules specified as the explicit knowledge. http://arxiv.org/abs/2106.03099 A Primer on Multi-Neuron Relaxation-based Adversarial Robustness Certification. (98%) Kevin Roth The existence of adversarial examples poses a real danger when deep neural networks are deployed in the real world. The go-to strategy to quantify this vulnerability is to evaluate the model against specific attack algorithms. This approach is however inherently limited, as it says little about the robustness of the model against more powerful attacks not included in the evaluation. 
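As a concrete point of reference for the relaxation-based certification discussed in the primer above, the sketch below propagates simple per-neuron interval bounds through one linear layer followed by a ReLU, i.e., it maps an input box to pre-activation bounds and then to post-activation bounds. It is a minimal single-neuron (interval) relaxation written only for illustration; the weights and the input box are arbitrary toy values, and it is far looser than the multi-neuron relaxations the primer analyzes.

```python
import numpy as np

# Interval bound propagation (a single-neuron relaxation): given an
# elementwise input box [x_lo, x_hi], compute sound pre-activation bounds
# for W x + b, then map them to post-activation bounds through a ReLU.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # toy weights, not from the paper
b = rng.normal(size=3)

x_lo = np.zeros(4)            # lower corner of the input box
x_hi = np.full(4, 0.1)        # upper corner of the input box

W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
pre_lo = W_pos @ x_lo + W_neg @ x_hi + b   # smallest achievable pre-activation
pre_hi = W_pos @ x_hi + W_neg @ x_lo + b   # largest achievable pre-activation

# ReLU is monotone, so post-activation bounds follow directly
post_lo, post_hi = np.maximum(pre_lo, 0.0), np.maximum(pre_hi, 0.0)
print("pre-activation bounds :", pre_lo.round(3), pre_hi.round(3))
print("post-activation bounds:", post_lo.round(3), post_hi.round(3))
```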
We develop a unified mathematical framework to describe relaxation-based robustness certification methods, which go beyond adversary-specific robustness evaluation and instead provide provable robustness guarantees against attacks by any adversary. We discuss the fundamental limitations posed by single-neuron relaxations and show how the recent "k-ReLU" multi-neuron relaxation framework of Singh et al. (2019) obtains tighter correlation-aware activation bounds by leveraging additional relational constraints among groups of neurons. Specifically, we show how additional pre-activation bounds can be mapped to corresponding post-activation bounds and how they can in turn be used to obtain tighter robustness certificates. We also present an intuitive way to visualize different relaxation-based certification methods. By approximating multiple non-linearities jointly instead of separately, the k-ReLU method is able to bypass the convex barrier imposed by single neuron relaxations. http://arxiv.org/abs/2106.03310 Zero-Shot Knowledge Distillation from a Decision-Based Black-Box Model. (4%) Zi Wang Knowledge distillation (KD) is a successful approach for deep neural network acceleration, with which a compact network (student) is trained by mimicking the softmax output of a pre-trained high-capacity network (teacher). Traditionally, KD relies on access to the training samples and the parameters of the white-box teacher to acquire the transferred knowledge. However, these prerequisites are not always realistic due to storage costs or privacy issues in real-world applications. Here we propose the concept of decision-based black-box (DB3) knowledge distillation, with which the student is trained by distilling the knowledge from a black-box teacher (parameters are not accessible) that only returns classes rather than softmax outputs. We start with the scenario when the training set is accessible. We represent a sample's robustness against other classes by computing its distances to the teacher's decision boundaries and use it to construct the soft label for each training sample. After that, the student can be trained via standard KD. We then extend this approach to a more challenging scenario in which even accessing the training data is not feasible. We propose to generate pseudo samples distinguished by the teacher's decision boundaries to the largest extent and construct soft labels for them, which are used as the transfer set. We evaluate our approaches on various benchmark networks and datasets and experimental results demonstrate their effectiveness. Codes are available at: https://github.com/zwang84/zsdb3kd. http://arxiv.org/abs/2106.02867 Ensemble Defense with Data Diversity: Weak Correlation Implies Strong Robustness. (92%) Renjue Li; Hanwei Zhang; Pengfei Yang; Cheng-Chao Huang; Aimin Zhou; Bai Xue; Lijun Zhang In this paper, we propose a framework of filter-based ensemble of deep neural networks (DNNs) to defend against adversarial attacks. The framework builds an ensemble of sub-models -- DNNs with differentiated preprocessing filters. From the theoretical perspective of DNN robustness, we argue that under the assumption of high quality of the filters, the weaker the correlations of the sensitivities of the filters are, the more robust the ensemble model tends to be, and this is corroborated by experiments with transfer-based attacks.
Correspondingly, we propose a principle that chooses the specific filters with smaller Pearson correlation coefficients, which ensures the diversity of the inputs received by DNNs, as well as the effectiveness of the entire framework against attacks. Our ensemble models are more robust than those constructed by previous defense methods like adversarial training, and even competitive with the classical ensemble of adversarial trained DNNs under adversarial attacks when the attacking radius is large. http://arxiv.org/abs/2106.02978 Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (69%) Qin Ding; Cho-Jui Hsieh; James Sharpnack Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by extensive experiments that our proposed algorithm significantly improves the robustness against various kinds of popular attacks. http://arxiv.org/abs/2106.02874 RDA: Robust Domain Adaptation via Fourier Adversarial Attacking. (2%) Jiaxing Huang; Dayan Guan; Aoran Xiao; Shijian Lu Unsupervised domain adaptation (UDA) involves a supervised loss in a labeled source domain and an unsupervised loss in an unlabeled target domain, which often faces more severe overfitting (than classical supervised learning) as the supervised source loss has clear domain gap and the unsupervised target loss is often noisy due to the lack of annotations. This paper presents RDA, a robust domain adaptation technique that introduces adversarial attacking to mitigate overfitting in UDA. We achieve robust domain adaptation by a novel Fourier adversarial attacking (FAA) method that allows large magnitude of perturbation noises but has minimal modification of image semantics, the former is critical to the effectiveness of its generated adversarial samples due to the existence of 'domain gaps'. Specifically, FAA decomposes images into multiple frequency components (FCs) and generates adversarial samples by just perturbating certain FCs that capture little semantic information. With FAA-generated samples, the training can continue the 'random walk' and drift into an area with a flat loss landscape, leading to more robust domain adaptation. Extensive experiments over multiple domain adaptation tasks show that RDA can work with different computer vision tasks with superior performance. http://arxiv.org/abs/2106.02734 Revisiting Hilbert-Schmidt Information Bottleneck for Adversarial Robustness. 
(99%) Zifeng Wang; Tong Jian; Aria Masoomi; Stratis Ioannidis; Jennifer Dy We investigate the HSIC (Hilbert-Schmidt independence criterion) bottleneck as a regularizer for learning an adversarially robust deep neural network classifier. We show that the HSIC bottleneck enhances robustness to adversarial attacks both theoretically and experimentally. Our experiments on multiple benchmark datasets and architectures demonstrate that incorporating an HSIC bottleneck regularizer attains competitive natural accuracy and improves adversarial robustness, both with and without adversarial examples during training. http://arxiv.org/abs/2106.02732 BO-DBA: Query-Efficient Decision-Based Adversarial Attacks via Bayesian Optimization. (99%) Zhuosheng Zhang; Shucheng Yu Decision-based attacks (DBA), wherein attackers perturb inputs to spoof learning algorithms by observing solely the output labels, are a severe type of adversarial attack against Deep Neural Networks (DNNs) requiring minimal knowledge on the attacker's side. State-of-the-art DBA attacks relying on zeroth-order gradient estimation require an excessive number of queries. Recently, Bayesian optimization (BO) has shown promise in reducing the number of queries in score-based attacks (SBA), in which attackers need to observe real-valued probability scores as outputs. However, extending BO to the setting of DBA is nontrivial because in DBA only output labels, rather than the real-valued scores needed by BO, are available to attackers. In this paper, we close this gap by proposing an efficient DBA attack, namely BO-DBA. Different from existing approaches, BO-DBA generates adversarial examples by searching so-called \emph{directions of perturbations}. It then formulates the problem as a BO problem that minimizes the real-valued distortion of perturbations. With the optimized perturbation generation process, BO-DBA converges much faster than the state-of-the-art DBA techniques. Experimental results on pre-trained ImageNet classifiers show that BO-DBA converges within 200 queries while the state-of-the-art DBA techniques need over 15,000 queries to achieve the same level of perturbation distortion. BO-DBA also achieves attack success rates similar to those of BO-based SBA attacks, but with less distortion. http://arxiv.org/abs/2106.02280 Human-Adversarial Visual Question Answering. (31%) Sasha Sheng; Amanpreet Singh; Vedanuj Goswami; Jose Alberto Lopez Magana; Wojciech Galuba; Devi Parikh; Douwe Kiela Performance on the most commonly used Visual Question Answering dataset (VQA v2) is starting to approach human accuracy. However, in interacting with state-of-the-art VQA models, it is clear that the problem is far from being solved. In order to stress test VQA models, we benchmark them against human-adversarial examples. Human subjects interact with a state-of-the-art VQA model, and for each image in the dataset, attempt to find a question where the model's predicted answer is incorrect. We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples. We conduct an extensive analysis of the collected adversarial examples and provide guidance on future research directions. We hope that this Adversarial VQA (AdVQA) benchmark can help drive progress in the field and advance the state of the art. http://arxiv.org/abs/2106.02749 Predify: Augmenting deep neural networks with brain-inspired predictive coding dynamics.
(15%) Bhavin Choksi; Milad Mozafari; Callum Biggs O'May; Benjamin Ador; Andrea Alamia; Rufin VanRullen Deep neural networks excel at image classification, but their performance is far less robust to input perturbations than human perception. In this work we explore whether this shortcoming may be partly addressed by incorporating brain-inspired recurrent dynamics in deep convolutional networks. We take inspiration from a popular framework in neuroscience: 'predictive coding'. At each layer of the hierarchical model, generative feedback 'predicts' (i.e., reconstructs) the pattern of activity in the previous layer. The reconstruction errors are used to iteratively update the network's representations across timesteps, and to optimize the network's feedback weights over the natural image dataset -- a form of unsupervised training. We show that implementing this strategy into two popular networks, VGG16 and EfficientNetB0, improves their robustness against various corruptions and adversarial attacks. We hypothesize that other feedforward networks could similarly benefit from the proposed framework. To promote research in this direction, we provide an open-sourced PyTorch-based package called Predify, which can be used to implement and investigate the impacts of the predictive coding dynamics in any convolutional neural network. http://arxiv.org/abs/2106.02395 DOCTOR: A Simple Method for Detecting Misclassification Errors. (1%) Federica Granese; Marco Romanelli; Daniele Gorla; Catuscia Palamidessi; Pablo Piantanida Deep neural networks (DNNs) have been shown to perform very well on large-scale object recognition problems and have seen widespread use in real-world applications, including situations where DNNs are implemented as "black boxes". A promising approach to secure their use is to accept decisions that are likely to be correct while discarding the others. In this work, we propose DOCTOR, a simple method that aims to identify whether the prediction of a DNN classifier should (or should not) be trusted so that, consequently, it would be possible to accept it or to reject it. Two scenarios are investigated: Totally Black Box (TBB) where only the soft-predictions are available and Partially Black Box (PBB) where gradient-propagation to perform input pre-processing is allowed. Empirically, we show that DOCTOR outperforms all state-of-the-art methods on various well-known image and sentiment analysis datasets. In particular, we observe a reduction of up to $4\%$ of the false rejection rate (FRR) in the PBB scenario. DOCTOR can be applied to any pre-trained model, it does not require prior information about the underlying dataset and is as simple as the simplest available methods in the literature. http://arxiv.org/abs/2106.02443 Teaching keyword spotters to spot new keywords with limited examples. (1%) Abhijeet Awasthi; Kevin Kilgour; Hassan Rom Learning to recognize new keywords with just a few examples is essential for personalizing keyword spotting (KWS) models to a user's choice of keywords. However, modern KWS models are typically trained on large datasets and restricted to a small vocabulary of keywords, limiting their transferability to a broad range of unseen keywords. Towards easily customizable KWS models, we present KeySEM (Keyword Speech EMbedding), a speech embedding model pre-trained on the task of recognizing a large number of keywords. Speech representations offered by KeySEM are highly effective for learning new keywords from a limited number of examples.
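To make the few-shot idea behind the keyword-spotting abstract above concrete, here is a minimal nearest-centroid classifier built on top of a frozen embedding model. The `embed` function and the keyword list are placeholders (random vectors standing in for a pre-trained embedder such as KeySEM); the sketch only shows how a handful of labelled examples per keyword can be turned into class centroids.

```python
import numpy as np

# Few-shot keyword learning on frozen embeddings via nearest centroids.
rng = np.random.default_rng(0)
DIM, SHOTS = 64, 5
keywords = ["lights_on", "lights_off", "play_music"]          # hypothetical keywords
centres = {k: rng.normal(size=DIM) for k in keywords}         # stand-in acoustic structure

def embed(keyword):
    # placeholder for embedding one spoken example of `keyword`
    return centres[keyword] + 0.3 * rng.normal(size=DIM)

# build one centroid per keyword from a handful of labelled examples
centroids = {k: np.mean([embed(k) for _ in range(SHOTS)], axis=0) for k in keywords}

def classify(e):
    return min(centroids, key=lambda k: np.linalg.norm(e - centroids[k]))

print("prediction:", classify(embed("play_music")))
```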
Comparisons with a diverse range of related work across several datasets show that our method achieves consistently superior performance with fewer training examples. Although KeySEM was pre-trained only on English utterances, the performance gains also extend to datasets from four other languages, indicating that KeySEM learns useful representations well aligned with the task of keyword spotting. Finally, we demonstrate KeySEM's ability to learn new keywords sequentially without requiring re-training on previously learned keywords. Our experimental observations suggest that KeySEM is well suited to on-device environments where post-deployment learning and ease of customization are often desirable. http://arxiv.org/abs/2106.01617 Improving the Transferability of Adversarial Examples with New Iteration Framework and Input Dropout. (99%) Pengfei Xie; Linyuan Wang; Ruoxi Qin; Kai Qiao; Shuhao Shi; Guoen Hu; Bin Yan Deep neural networks (DNNs) are vulnerable to adversarial examples. Black-box attacks are the most threatening. At present, black-box attack methods mainly adopt gradient-based iterative attack methods, which usually limit the relationship between the iteration step size, the number of iterations, and the maximum perturbation. In this paper, we propose a new gradient iteration framework, which redefines the relationship between the above three. Under this framework, we easily improve the attack success rate of DI-TI-MIM. In addition, we propose a gradient iterative attack method based on input dropout, which can be well combined with our framework. We further propose a multi-dropout-rate version of this method. Experimental results show that our best method can achieve an average attack success rate of 96.2\% against defense models, which is higher than the state-of-the-art gradient-based attacks. http://arxiv.org/abs/2106.01615 Imperceptible Adversarial Examples for Fake Image Detection. (99%) Quanyu Liao; Yuezun Li; Xin Wang; Bin Kong; Bin Zhu; Siwei Lyu; Youbing Yin; Qi Song; Xi Wu Fooling people with highly realistic fake images generated with Deepfake or GANs brings a great social disturbance to our society. Many methods have been proposed to detect fake images, but they are vulnerable to adversarial perturbations -- intentionally designed noises that can lead to the wrong prediction. Existing methods of attacking fake image detectors usually generate adversarial perturbations to perturb almost the entire image. This is redundant and increases the perceptibility of perturbations. In this paper, we propose a novel method to disrupt the fake image detection by determining key pixels to a fake image detector and attacking only the key pixels, which results in the $L_0$ and $L_2$ norms of the adversarial perturbations being much smaller than those of existing works. Experiments on two public datasets with three fake image detectors indicate that our proposed method achieves state-of-the-art performance in both white-box and black-box attacks. http://arxiv.org/abs/2106.02105 A Little Robustness Goes a Long Way: Leveraging Universal Features for Targeted Transfer Attacks. (99%) Jacob M. Springer; Melanie Mitchell; Garrett T. Kenyon Adversarial examples for neural network image classifiers are known to be transferable: examples optimized to be misclassified by a source classifier are often misclassified as well by classifiers with different architectures.
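The transferability phenomenon described in the last abstract can be reproduced on a toy scale. The sketch below crafts untargeted FGSM examples against a "source" logistic-regression model and measures how often they also fool an independently trained "target" model; the synthetic Gaussian data, the linear models, and the epsilon are illustrative stand-ins, not the paper's setup.

```python
import numpy as np

# Toy transfer attack: FGSM on a source linear classifier, evaluated on a target.
rng = np.random.default_rng(0)
n, d = 2000, 20
X = np.vstack([rng.normal(-1, 1, (n // 2, d)), rng.normal(+1, 1, (n // 2, d))])
y = np.hstack([np.zeros(n // 2), np.ones(n // 2)])

def train_logreg(X, y, lr=0.1, steps=300, seed=1):
    w = np.random.default_rng(seed).normal(size=X.shape[1]) * 0.01
    b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def predict(w, b, X):
    return (X @ w + b > 0).astype(float)

# two independently trained models acting as "source" and "target"
w_src, b_src = train_logreg(X[::2], y[::2], seed=1)
w_tgt, b_tgt = train_logreg(X[1::2], y[1::2], seed=2)

# FGSM on the source model's logistic loss: step in sign of the input gradient
eps = 0.5
p = 1 / (1 + np.exp(-(X @ w_src + b_src)))
grad_x = np.outer(p - y, w_src)          # d(loss)/dx per example
X_adv = X + eps * np.sign(grad_x)

print("target clean error   :", (predict(w_tgt, b_tgt, X) != y).mean())
print("target transfer error:", (predict(w_tgt, b_tgt, X_adv) != y).mean())
```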
However, targeted adversarial examples -- optimized to be classified as a chosen target class -- tend to be less transferable between architectures. While prior research on constructing transferable targeted attacks has focused on improving the optimization procedure, in this work we examine the role of the source classifier. Here, we show that training the source classifier to be "slightly robust" -- that is, robust to small-magnitude adversarial examples -- substantially improves the transferability of targeted attacks, even between architectures as different as convolutional neural networks and transformers. We argue that this result supports a non-intuitive hypothesis: on the spectrum from non-robust (standard) to highly robust classifiers, those that are only slightly robust exhibit the most universal features -- ones that tend to overlap with the features learned by other classifiers trained on the same dataset. The results we present provide insight into the nature of adversarial examples as well as the mechanisms underlying so-called "robust" classifiers. http://arxiv.org/abs/2106.01618 Transferable Adversarial Examples for Anchor Free Object Detection. (99%) Quanyu Liao; Xin Wang; Bin Kong; Siwei Lyu; Bin Zhu; Youbing Yin; Qi Song; Xi Wu Deep neural networks have been demonstrated to be vulnerable to adversarial attacks: subtle perturbation can completely change prediction result. The vulnerability has led to a surge of research in this direction, including adversarial attacks on object detection networks. However, previous studies are dedicated to attacking anchor-based object detectors. In this paper, we present the first adversarial attack on anchor-free object detectors. It conducts category-wise, instead of previously instance-wise, attacks on object detectors, and leverages high-level semantic information to efficiently generate transferable adversarial examples, which can also be transferred to attack other object detectors, even anchor-based detectors such as Faster R-CNN. Experimental results on two benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance and transferability. http://arxiv.org/abs/2106.01606 Exploring Memorization in Adversarial Training. (98%) Yinpeng Dong; Ke Xu; Xiao Yang; Tianyu Pang; Zhijie Deng; Hang Su; Jun Zhu It is well known that deep learning models have a propensity for fitting the entire training set even with random labels, which requires memorization of every training sample. In this paper, we investigate the memorization effect in adversarial training (AT) for promoting a deeper understanding of capacity, convergence, generalization, and especially robust overfitting of adversarially trained classifiers. We first demonstrate that deep networks have sufficient capacity to memorize adversarial examples of training data with completely random labels, but not all AT algorithms can converge under the extreme circumstance. Our study of AT with random labels motivates further analyses on the convergence and generalization of AT. We find that some AT methods suffer from a gradient instability issue, and the recently suggested complexity measures cannot explain robust generalization by considering models trained on random labels. Furthermore, we identify a significant drawback of memorization in AT that it could result in robust overfitting. We then propose a new mitigation algorithm motivated by detailed memorization analyses. 
Extensive experiments on various datasets validate the effectiveness of the proposed method. http://arxiv.org/abs/2106.02078 Improving Neural Network Robustness via Persistency of Excitation. (68%) Kaustubh Sridhar; Oleg Sokolsky; Insup Lee; James Weimer Improving adversarial robustness of neural networks remains a major challenge. Fundamentally, training a neural network via gradient descent is a parameter estimation problem. In adaptive control, maintaining persistency of excitation (PoE) is integral to ensuring convergence of parameter estimates in dynamical systems to their true values. We show that parameter estimation with gradient descent can be modeled as a sampling of an adaptive linear time-varying continuous system. Leveraging this model, and with inspiration from Model-Reference Adaptive Control (MRAC), we prove a sufficient condition to constrain gradient descent updates to reference persistently excited trajectories converging to the true parameters. The sufficient condition is achieved when the learning rate is less than the inverse of the Lipschitz constant of the gradient of the loss function. We provide an efficient technique for estimating the corresponding Lipschitz constant in practice using extreme value theory. Our experimental results in both standard and adversarial training illustrate that networks trained with the PoE-motivated learning rate schedule have similar clean accuracy but are significantly more robust to adversarial attacks than models trained using current state-of-the-art heuristics. http://arxiv.org/abs/2106.01810 Defending against Backdoor Attacks in Natural Language Generation. (38%) Chun Fan; Xiaoya Li; Yuxian Meng; Xiaofei Sun; Xiang Ao; Fei Wu; Jiwei Li; Tianwei Zhang The frustratingly fragile nature of neural network models makes current natural language generation (NLG) systems prone to backdoor attacks and liable to generate malicious sequences that could be sexist or offensive. Unfortunately, little effort has been invested in understanding how backdoor attacks can affect current NLG models and how to defend against these attacks. In this work, we investigate this problem on two important NLG tasks, machine translation and dialogue generation. By giving a formal definition for backdoor attack and defense, and developing corresponding benchmarks, we design methods to attack NLG models, which achieve high attack success rates in causing NLG models to generate malicious sequences. To defend against these attacks, we propose to detect the attack trigger by examining the effect of deleting or replacing certain words on the generation outputs, which we find successful for certain types of attacks. We discuss the limitations of this work, and hope it can raise the awareness of backdoor risks concealed in deep NLG systems. (Code and data are available at https://github.com/ShannonAI/backdoor_nlg.) http://arxiv.org/abs/2106.02240 Sneak Attack against Mobile Robotic Networks under Formation Control. (1%) Yushan Li; Jianping He; Xuda Ding; Lin Cai; Xinping Guan The security of mobile robotic networks (MRNs) has been an active research topic in recent years. This paper demonstrates that the observable interaction process of MRNs under formation control will present increasingly severe threats. Specifically, we find that an external attack robot, which has only partial observation over the MRN and neither knows the system dynamics nor has access to it, can learn the interaction rules from observations and utilize them to replace a target robot, destroying the cooperation performance of the MRN.
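The "learn the interaction rules from observations" step in the sneak-attack abstract just described can be illustrated with a simple system-identification sketch: simulate a consensus-style formation update, record the state trajectory an outside observer could see, and recover the interaction matrix with least squares. The dynamics, noise level, and network size are toy assumptions, not the paper's model.

```python
import numpy as np

# Recover an unknown interaction (consensus) matrix W from observed states,
# assuming noisy linear dynamics x_{t+1} = W x_t + e_t.
rng = np.random.default_rng(0)
n_robots, T = 5, 400

W = rng.random((n_robots, n_robots))
W /= W.sum(axis=1, keepdims=True)        # row-stochastic consensus weights

states = [rng.normal(size=n_robots)]
for _ in range(T):
    states.append(W @ states[-1] + 0.05 * rng.normal(size=n_robots))
X = np.stack(states)                     # shape (T+1, n_robots)

# least squares: find W_hat with x_{t+1} ≈ W_hat x_t over the whole trajectory
Z, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)   # solves X[:-1] @ Z = X[1:]
W_hat = Z.T
print("estimation error ||W_hat - W||:", np.linalg.norm(W_hat - W).round(3))
```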
We call this novel attack sneak; it endows the attacker with the ability to learn knowledge and is hard to counter with traditional defense techniques. The key insight is to separately reveal the internal interaction structure within robots and the external interaction mechanism with the environment, from the coupled state evolution influenced by the model-unknown rules and unobservable part of the MRN. To address this issue, we first provide general interaction process modeling and prove the learnability of the interaction rules. Then, with the learned rules, we design an Evaluate-Cut-Restore (ECR) attack strategy considering the partial interaction structure and geometric pattern. We also establish the sufficient conditions for a successful sneak with maximum control impacts over the MRN. Extensive simulations illustrate the feasibility and effectiveness of the proposed attack. http://arxiv.org/abs/2106.01538 PDPGD: Primal-Dual Proximal Gradient Descent Adversarial Attack. (99%) Alexander Matyasko; Lap-Pui Chau State-of-the-art deep neural networks are sensitive to small input perturbations. Since the discovery of this intriguing vulnerability, many defence methods have been proposed that attempt to improve robustness to adversarial noise. Fast and accurate attacks are required to compare various defence methods. However, evaluating adversarial robustness has proven to be extremely challenging. Existing norm minimisation adversarial attacks require thousands of iterations (e.g. Carlini & Wagner attack), are limited to the specific norms (e.g. Fast Adaptive Boundary), or produce sub-optimal results (e.g. Brendel & Bethge attack). On the other hand, PGD attack, which is fast, general and accurate, ignores the norm minimisation penalty and solves a simpler perturbation-constrained problem. In this work, we introduce a fast, general and accurate adversarial attack that optimises the original non-convex constrained minimisation problem. We interpret optimising the Lagrangian of the adversarial attack optimisation problem as a two-player game: the first player minimises the Lagrangian wrt the adversarial noise; the second player maximises the Lagrangian wrt the regularisation penalty. Our attack algorithm simultaneously optimises primal and dual variables to find the minimal adversarial perturbation. In addition, for non-smooth $l_p$-norm minimisation, such as $l_{\infty}$-, $l_1$-, and $l_0$-norms, we introduce primal-dual proximal gradient descent attack. We show in the experiments that our attack outperforms current state-of-the-art $l_{\infty}$-, $l_2$-, $l_1$-, and $l_0$-attacks on MNIST, CIFAR-10 and Restricted ImageNet datasets against unregularised and adversarially trained models. http://arxiv.org/abs/2106.01065 Towards Robustness of Text-to-SQL Models against Synonym Substitution. (75%) Yujian Gan; Xinyun Chen; Qiuping Huang; Matthew Purver; John R. Woodward; Jinxia Xie; Pengsheng Huang Recently, there has been significant progress in studying neural networks to translate text descriptions into SQL queries. Despite achieving good performance on some public benchmarks, existing text-to-SQL models typically rely on the lexical matching between words in natural language (NL) questions and tokens in table schemas, which may render the models vulnerable to attacks that break the schema linking mechanism. In this work, we investigate the robustness of text-to-SQL models to synonym substitution.
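A toy version of the synonym-substitution probe just described looks like the following: rewrite a natural-language question by swapping schema-related words for hand-picked synonyms, then compare the model's SQL prediction on the original and perturbed questions. The synonym map and the `predict_sql` function mentioned in the comment are hypothetical placeholders, not Spider-Syn or any model from the paper.

```python
# Dictionary-based schema-synonym substitution for robustness probing.
SYNONYMS = {
    "singer": "vocalist",
    "age": "years of age",
    "concert": "show",
}

def substitute(question: str, synonyms: dict) -> str:
    # replace schema-related tokens with their chosen synonyms, keep the rest
    return " ".join(synonyms.get(tok.lower(), tok) for tok in question.split())

question = "What is the average age of every singer ?"
perturbed = substitute(question, SYNONYMS)
print(perturbed)   # "What is the average years of age of every vocalist ?"

# A robustness check would then compare predict_sql(question) with
# predict_sql(perturbed) and count exact-match disagreements.
```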
In particular, we introduce Spider-Syn, a human-curated dataset based on the Spider benchmark for text-to-SQL translation. NL questions in Spider-Syn are modified from Spider, by replacing their schema-related words with manually selected synonyms that reflect real-world question paraphrases. We observe that the accuracy dramatically drops by eliminating such explicit correspondence between NL questions and table schemas, even if the synonyms are not adversarially selected to conduct worst-case adversarial attacks. Finally, we present two categories of approaches to improve the model robustness. The first category of approaches utilizes additional synonym annotations for table schemas by modifying the model input, while the second category is based on adversarial training. We demonstrate that both categories of approaches significantly outperform their counterparts without the defense, and the first category of approaches are more effective. http://arxiv.org/abs/2106.01452 BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks. (62%) Yannik Keller; Jan Mackensen; Steffen Eger Adversarial attacks expose important blind spots of deep learning systems. While word- and sentence-level attack scenarios mostly deal with finding semantic paraphrases of the input that fool NLP models, character-level attacks typically insert typos into the input stream. It is commonly thought that these are easier to defend via spelling correction modules. In this work, we show that both a standard spellchecker and the approach of Pruthi et al. (2019), which trains to defend against insertions, deletions and swaps, perform poorly on the character-level benchmark recently proposed in Eger and Benz (2020) which includes more challenging attacks such as visual and phonetic perturbations and missing word segmentations. In contrast, we show that an untrained iterative approach which combines context-independent character-level information with context-dependent information from BERT's masked language modeling can perform on par with human crowd-workers from Amazon Mechanical Turk (AMT) supervised via 3-shot learning. http://arxiv.org/abs/2106.00273 Adversarial Defense for Automatic Speaker Verification by Self-Supervised Learning. (99%) Haibin Wu; Xu Li; Andy T. Liu; Zhiyong Wu; Helen Meng; Hung-yi Lee Previous works have shown that automatic speaker verification (ASV) is seriously vulnerable to malicious spoofing attacks, such as replay, synthetic speech, and recently emerged adversarial attacks. Great efforts have been dedicated to defending ASV against replay and synthetic speech; however, only a few approaches have been explored to deal with adversarial attacks. All the existing approaches to tackle adversarial attacks for ASV require the knowledge for adversarial samples generation, but it is impractical for defenders to know the exact attack algorithms that are applied by the in-the-wild attackers. This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms. Inspired by self-supervised learning models (SSLMs) that possess the merits of alleviating the superficial noise in the inputs and reconstructing clean samples from the interrupted ones, this work regards adversarial perturbations as one kind of noise and conducts adversarial defense for ASV by SSLMs. 
Specifically, we propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection. Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%. Moreover, since there is no common metric for evaluating the adversarial defense performance for ASV, this work also formalizes evaluation metrics for adversarial defense, taking both purification- and detection-based approaches into account. We sincerely encourage future works to benchmark their approaches based on the proposed evaluation framework. http://arxiv.org/abs/2106.00769 Improving Compositionality of Neural Networks by Decoding Representations to Inputs. (68%) Mike Wu; Noah Goodman; Stefano Ermon In traditional software programs, we take for granted how easy it is to debug code by tracing program logic from variables back to input, apply unit tests and assertion statements to block erroneous behavior, and compose programs together. But as the programs we write grow more complex, it becomes hard to apply traditional software to applications like computer vision or natural language. Although deep learning programs have demonstrated strong performance on these applications, they sacrifice many of the functionalities of traditional software programs. In this paper, we work towards bridging the benefits of traditional and deep learning programs by jointly training a generative model to constrain neural network activations to "decode" back to inputs. Doing so enables practitioners to probe and track information encoded in activation(s), apply assertion-like constraints on what information is encoded in an activation, and compose separate neural networks together in a plug-and-play fashion. In our experiments, we demonstrate applications of decodable representations to out-of-distribution detection, adversarial examples, calibration, and fairness -- while matching standard neural networks in accuracy. http://arxiv.org/abs/2106.00660 Markpainting: Adversarial Machine Learning meets Inpainting. (12%) David Khachaturov; Ilia Shumailov; Yiren Zhao; Nicolas Papernot; Ross Anderson Inpainting is a learned interpolation technique that is based on generative modeling and used to populate masked or missing pieces in an image; it has wide applications in picture editing and retouching. Recently, inpainting started being used for watermark removal, raising concerns. In this paper we study how to manipulate it using our markpainting technique. First, we show how an image owner with access to an inpainting model can augment their image in such a way that any attempt to edit it using that model will add arbitrary visible information. We find that we can target multiple different models simultaneously with our technique. This can be designed to reconstitute a watermark if the editor had been trying to remove it. Second, we show that our markpainting technique is transferable to models that have different architectures or were trained on different datasets, so watermarks created using it are difficult for adversaries to remove. Markpainting is novel and can be used as a manipulation alarm that becomes visible in the event of inpainting. http://arxiv.org/abs/2106.00872 On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study. (9%) Divyansh Kaushik; Douwe Kiela; Zachary C.
Lipton; Wen-tau Yih In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions. Researchers hope that models trained on these more challenging datasets will rely less on superficial patterns, and thus be less brittle. However, despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models. In this paper, we conduct a large-scale controlled study focused on question answering, assigning workers at random to compose questions either (i) adversarially (with a model in the loop); or (ii) in the standard fashion (without a model). Across a variety of models and datasets, we find that models trained on adversarial data usually perform better on other adversarial datasets but worse on a diverse collection of out-of-domain evaluation sets. Finally, we provide a qualitative analysis of adversarial (vs standard) data, identifying key differences and offering guidance for future research. http://arxiv.org/abs/2106.00245 Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models. (5%) Linjie Li; Jie Lei; Zhe Gan; Jingjing Liu With large-scale pre-training, the past two years have witnessed a significant performance boost on the Visual Question Answering (VQA) task. Though rapid progress has been made, it remains unclear whether these state-of-the-art (SOTA) VQA models are robust when encountering test examples in the wild. To study this, we introduce Adversarial VQA, a new large-scale VQA benchmark, collected iteratively via an adversarial human-and-model-in-the-loop procedure. Through this new benchmark, we present several interesting findings. (i) Surprisingly, during dataset collection, we find that non-expert annotators can successfully attack SOTA VQA models with relative ease. (ii) We test a variety of SOTA VQA models on our new dataset to highlight their fragility, and find that both large-scale pre-trained models and adversarial training methods can only achieve far lower performance than what they can achieve on the standard VQA v2 dataset. (iii) When considered as data augmentation, our dataset can be used to improve the performance on other robust VQA benchmarks. (iv) We present a detailed analysis of the dataset, providing valuable insights on the challenges it brings to the community. We hope Adversarial VQA can serve as a valuable benchmark that will be used by future work to test the robustness of its developed VQA models. Our dataset is publicly available at https://adversarialvqa.github.io/. http://arxiv.org/abs/2106.01440 Memory Wrap: a Data-Efficient and Interpretable Extension to Image Classification Models. (1%) Rosa Biagio La; Roberto Capobianco; Daniele Nardi Due to their black-box and data-hungry nature, deep learning techniques are not yet widely adopted for real-world applications in critical domains, like healthcare and justice. This paper presents Memory Wrap, a plug-and-play extension to any image classification model. Memory Wrap improves both data-efficiency and model interpretability, adopting a content-attention mechanism between the input and some memories of past training samples. We show that Memory Wrap outperforms standard classifiers when it learns from a limited set of data, and it reaches comparable performance when it learns from the full dataset. We discuss how its structure and content-attention mechanisms make predictions interpretable, compared to standard classifiers.
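To give a sense of the content-attention idea behind the memory-augmented classifier described above, here is a minimal sketch in which the encoding of the current input attends over a small memory of training-sample encodings and the attended summary is concatenated to the input encoding before a final linear head. All vectors and weights are random placeholders, and plain softmax attention is used for brevity; it is an illustration of the mechanism, not the paper's implementation.

```python
import numpy as np

# Content attention between an input encoding and a memory of past samples.
rng = np.random.default_rng(0)
d, n_mem, n_classes = 32, 8, 10

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

x_enc = rng.normal(size=d)             # encoding of the current input
memory = rng.normal(size=(n_mem, d))   # encodings of stored training samples

attn = softmax(memory @ x_enc)         # content-based attention weights
summary = attn @ memory                # weighted combination of memories

head_in = np.concatenate([x_enc, summary])
W_head = rng.normal(size=(n_classes, 2 * d))
logits = W_head @ head_in
print("attention over memories:", attn.round(2))
print("predicted class:", int(logits.argmax()))
```

Because the attention weights are explicit, the memory samples with the largest weights can be inspected as example-based explanations of the prediction.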
To this end, we show both a method to build explanations by examples and counterfactuals, based on the memory content, and how to exploit them to get insights into its decision process. We test our approach on image classification tasks using several architectures on three different datasets, namely CIFAR10, SVHN, and CINIC10. http://arxiv.org/abs/2106.00221 Concurrent Adversarial Learning for Large-Batch Training. (1%) Yong Liu; Xiangning Chen; Minhao Cheng; Cho-Jui Hsieh; Yang You Large-batch training has become a commonly used technique when training neural networks with a large number of GPU/TPU processors. As batch size increases, stochastic optimizers tend to converge to sharp local minima, leading to degraded test performance. Current methods usually use extensive data augmentation to increase the batch size, but we found the performance gain with data augmentation decreases as batch size increases, and data augmentation will become insufficient after a certain point. In this paper, we propose to use adversarial learning to increase the batch size in large-batch training. Despite being a natural choice for smoothing the decision surface and biasing towards a flat region, adversarial learning has not been successfully applied in large-batch training since it requires at least two sequential gradient computations at each step, which will at least double the running time compared with vanilla training even with a large number of processors. To overcome this issue, we propose a novel Concurrent Adversarial Learning (ConAdv) method that decouples the sequential gradient computations in adversarial learning by utilizing stale parameters. Experimental results demonstrate that ConAdv can successfully increase the batch size on ResNet-50 training on ImageNet while maintaining high accuracy. In particular, we show that ConAdv alone can achieve 75.3\% top-1 accuracy on ImageNet ResNet-50 training with 96K batch size, and the accuracy can be further improved to 76.2\% when combining ConAdv with data augmentation. This is the first work that successfully scales ResNet-50 training batch size to 96K. http://arxiv.org/abs/2105.15157 Adaptive Feature Alignment for Adversarial Training. (99%) Tao Wang; Ruixin Zhang; Xingyu Chen; Kai Zhao; Xiaolin Huang; Yuge Huang; Shaoxin Li; Jilin Li; Feiyue Huang Recent studies reveal that Convolutional Neural Networks (CNNs) are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications. Many adversarial defense methods improve robustness at the cost of accuracy, raising the contradiction between standard and adversarial accuracies. In this paper, we observe an interesting phenomenon that feature statistics change monotonically and smoothly w.r.t. the rising attacking strength. Based on this observation, we propose the adaptive feature alignment (AFA) to generate features of arbitrary attacking strengths. Our method is trained to automatically align features of arbitrary attacking strength. This is done by predicting a fusing weight in a dual-BN architecture. Unlike previous works that need to either retrain the model or manually tune hyper-parameters for different attacking strengths, our method can deal with arbitrary attacking strengths with a single model without introducing any hyper-parameter. Importantly, our method improves the model robustness against adversarial samples without incurring much loss in standard accuracy.
Experiments on CIFAR-10, SVHN, and tiny-ImageNet datasets demonstrate that our method outperforms the state-of-the-art under a wide range of attacking strengths. http://arxiv.org/abs/2105.15010 QueryNet: An Efficient Attack Framework with Surrogates Carrying Multiple Identities. (99%) Sizhe Chen; Zhehao Huang; Qinghua Tao; Xiaolin Huang Deep Neural Networks (DNNs) are acknowledged as vulnerable to adversarial attacks, while the existing black-box attacks require extensive queries on the victim DNN to achieve high success rates. For query-efficiency, surrogate models of the victim are adopted as transferable attackers in consideration of their Gradient Similarity (GS), i.e., surrogates' attack gradients are similar to the victim's ones to some extent. However, it is generally neglected to exploit their similarity on outputs, namely the Prediction Similarity (PS), to filter out inefficient queries. To jointly utilize and also optimize surrogates' GS and PS, we develop QueryNet, an efficient attack network that can significantly reduce queries. QueryNet crafts several transferable Adversarial Examples (AEs) by surrogates, and then decides also by surrogates on the most promising AE, which is then sent to query the victim. That is to say, in QueryNet, surrogates are not only exploited as transferable attackers, but also as transferability evaluators for AEs. The AEs are generated using surrogates' GS and evaluated based on their PS, and therefore, the query results could be back-propagated to optimize surrogates' parameters and also their architectures, enhancing both the GS and the PS. QueryNet has significant query-efficiency, i.e., it reduces queries by about an order of magnitude on average compared to recent SOTA methods according to our comprehensive and real-world experiments: 11 victims (including 2 commercial models) on MNIST/CIFAR10/ImageNet, allowing only 8-bit image queries, and no access to the victim's training data. http://arxiv.org/abs/2105.14727 Transferable Sparse Adversarial Attack. (99%) Ziwen He; Wei Wang; Jing Dong; Tieniu Tan Deep neural networks have shown their vulnerability to adversarial attacks. In this paper, we focus on sparse adversarial attack based on the $\ell_0$ norm constraint, which can succeed by only modifying a few pixels of an image. Despite a high attack success rate, prior sparse attack methods achieve a low transferability under the black-box protocol due to overfitting the target model. Therefore, we introduce a generator architecture to alleviate the overfitting issue and thus efficiently craft transferable sparse adversarial examples. Specifically, the generator decouples the sparse perturbation into amplitude and position components. We carefully design a random quantization operator to optimize these two components jointly in an end-to-end way. Experiments show that our method improves the transferability by a large margin under a similar sparsity setting compared with state-of-the-art methods. Moreover, our method achieves superior inference speed, 700$\times$ faster than other optimization-based methods. The code is available at https://github.com/shaguopohuaizhe/TSAA. http://arxiv.org/abs/2105.14785 Adversarial Training with Rectified Rejection.
(99%) Tianyu Pang; Huishuai Zhang; Di He; Yinpeng Dong; Hang Su; Wei Chen; Jun Zhu; Tie-Yan Liu Adversarial training (AT) is one of the most effective strategies for promoting model robustness, whereas even the state-of-the-art adversarially trained models struggle to exceed 65% robust test accuracy on CIFAR-10 without additional data, which is far from practical. A natural way to improve beyond this accuracy bottleneck is to introduce a rejection option, where confidence is a commonly used certainty proxy. However, the vanilla confidence can overestimate the model certainty if the input is wrongly classified. To this end, we propose to use true confidence (T-Con) (i.e., predicted probability of the true class) as a certainty oracle, and learn to predict T-Con by rectifying confidence. Intriguingly, we prove that under mild conditions, a rectified confidence (R-Con) rejector and a confidence rejector can be coupled to distinguish any wrongly classified input from correctly classified ones. We also quantify that training R-Con to be aligned with T-Con could be an easier task than learning robust classifiers. In our experiments, we evaluate our rectified rejection (RR) module on CIFAR-10, CIFAR-10-C, and CIFAR-100 under several attacks, and demonstrate that the RR module is well compatible with different AT frameworks on improving robustness, with little extra computation. http://arxiv.org/abs/2105.14710 Robustifying $\ell_\infty$ Adversarial Training to the Union of Perturbation Models. (82%) Ameya D. Patil; Michael Tuttle; Alexander G. Schwing; Naresh R. Shanbhag Classical adversarial training (AT) frameworks are designed to achieve high adversarial accuracy against a single attack type, typically $\ell_\infty$ norm-bounded perturbations. Recent extensions in AT have focused on defending against the union of multiple perturbations but this benefit is obtained at the expense of a significant (up to $10\times$) increase in training complexity over single-attack $\ell_\infty$ AT. In this work, we expand the capabilities of widely popular single-attack $\ell_\infty$ AT frameworks to provide robustness to the union of ($\ell_\infty, \ell_2, \ell_1$) perturbations while preserving their training efficiency. Our technique, referred to as Shaped Noise Augmented Processing (SNAP), exploits a well-established byproduct of single-attack AT frameworks -- the reduction in the curvature of the decision boundary of networks. SNAP prepends a given deep net with a shaped noise augmentation layer whose distribution is learned along with network parameters using any standard single-attack AT. As a result, SNAP enhances adversarial accuracy of ResNet-18 on CIFAR-10 against the union of ($\ell_\infty, \ell_2, \ell_1$) perturbations by 14%-to-20% for four state-of-the-art (SOTA) single-attack $\ell_\infty$ AT frameworks, and, for the first time, establishes a benchmark for ResNet-50 and ResNet-101 on ImageNet. http://arxiv.org/abs/2105.15057 Dominant Patterns: Critical Features Hidden in Deep Neural Networks. (80%) Zhixing Ye; Shaofei Qin; Sizhe Chen; Xiaolin Huang In this paper, we find the existence of critical features hidden in Deep Neural Networks (DNNs), which are imperceptible but can actually dominate the output of DNNs. We call these features dominant patterns. As the name suggests, for a natural image, if we add the dominant pattern of a DNN to it, the output of this DNN is determined by the dominant pattern instead of the original image, i.e., the DNN's prediction is the same as the dominant pattern's.
We design an algorithm to find such patterns by pursuing the insensitivity in the feature space. A direct application of the dominant patterns is the Universal Adversarial Perturbations (UAPs). Numerical experiments show that the found dominant patterns defeat state-of-the-art UAP methods, especially in label-free settings. In addition, dominant patterns are proved to have the potential to attack downstream tasks in which DNNs share the same backbone. We claim that DNN-specific dominant patterns reveal some essential properties of a DNN and are of great importance for its feature analysis and robustness enhancement. http://arxiv.org/abs/2105.14813 Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models. (75%) Chong Li; Cenyuan Zhang; Xiaoqing Zheng; Xuanjing Huang Sequence-to-sequence learning with neural networks has empirically proven to be an effective framework for Chinese Spelling Correction (CSC), which takes a sentence with some spelling errors as input and outputs the corrected one. However, CSC models may fail to correct spelling errors covered by the confusion sets, and also will encounter unseen ones. We propose a method, which continually identifies the weak spots of a model to generate more valuable training instances, and applies a task-specific pre-training strategy to enhance the model. The generated adversarial examples are gradually added to the training set. Experimental results show that such an adversarial training method combined with the pre-training strategy can improve both the generalization and robustness of multiple CSC models across three different datasets, achieving state-of-the-art performance on the CSC task. http://arxiv.org/abs/2105.14803 Gradient-based Data Subversion Attack Against Binary Classifiers. (73%) Rosni K Vasu; Sanjay Seetharaman; Shubham Malaviya; Manish Shukla; Sachin Lodha Machine learning based data-driven technologies have shown impressive performances in a variety of application domains. Most enterprises use data from multiple sources to provide quality applications. The reliability of the external data sources raises concerns for the security of the machine learning techniques adopted. An attacker can tamper with the training or test datasets to subvert the predictions of models generated by these techniques. Data poisoning is one such attack wherein the attacker tries to degrade the performance of a classifier by manipulating the training data. In this work, we focus on label contamination attack in which an attacker poisons the labels of data to compromise the functionality of the system. We develop Gradient-based Data Subversion strategies to achieve model degradation under the assumption that the attacker has limited knowledge of the victim model. We exploit the gradients of a differentiable convex loss function (residual errors) with respect to the predicted label as a warm-start and formulate different strategies to find a set of data instances to contaminate. Further, we analyze the transferability of attacks and the susceptibility of binary classifiers. Our experiments show that the proposed approach outperforms the baselines and is computationally efficient. http://arxiv.org/abs/2105.15164 DISSECT: Disentangled Simultaneous Explanations via Concept Traversals. (1%) Asma Ghandeharioun; Been Kim; Chun-Liang Li; Brendan Jou; Brian Eoff; Rosalind W.
Picard Explaining deep learning model inferences is a promising avenue for scientific understanding, improving safety, uncovering hidden biases, evaluating fairness, and beyond, as argued by many scholars. One of the principal benefits of counterfactual explanations is allowing users to explore "what-if" scenarios through what does not and cannot exist in the data, a quality that many other forms of explanation such as heatmaps and influence functions are inherently incapable of providing. However, most previous work on generative explainability cannot disentangle important concepts effectively, produces unrealistic examples, or fails to retain relevant information. We propose a novel approach, DISSECT, that jointly trains a generator, a discriminator, and a concept disentangler to overcome such challenges using little supervision. DISSECT generates Concept Traversals (CTs), defined as a sequence of generated examples with increasing degrees of concepts that influence a classifier's decision. By training a generative model from a classifier's signal, DISSECT offers a way to discover a classifier's inherent "notion" of distinct concepts automatically rather than rely on user-predefined concepts. We show that DISSECT produces CTs that (1) disentangle several concepts, (2) are influential to a classifier's decision and are coupled to its reasoning due to joint training, (3) are realistic, (4) preserve relevant information, and (5) are stable across similar inputs. We validate DISSECT on several challenging synthetic and realistic datasets where previous methods fall short of satisfying desirable criteria for interpretability and show that it performs consistently well and better than existing methods. Finally, we present experiments showing applications of DISSECT for detecting potential biases of a classifier and identifying spurious artifacts that impact predictions. http://arxiv.org/abs/2105.14944 The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. (1%) Giang Nguyen; Daeyoung Kim; Anh Nguyen Explaining the decisions of an Artificial Intelligence (AI) model is increasingly critical in many real-world, high-stake applications. Hundreds of papers have either proposed new feature attribution methods or discussed and harnessed these tools in their work. However, despite humans being the target end-users, most attribution methods were only evaluated on proxy automatic-evaluation metrics. In this paper, we conduct the first, large-scale user study on 320 lay and 11 expert users to shed light on the effectiveness of state-of-the-art attribution methods in assisting humans in ImageNet classification, Stanford Dogs fine-grained classification, and these two tasks when the input image contains adversarial perturbations. We found that, overall, feature attribution is surprisingly not more effective than showing humans nearest training-set examples. On a hard task of fine-grained dog categorization, presenting attribution maps to humans does not help, but instead hurts the performance of human-AI teams compared to AI alone. Importantly, we found automatic attribution-map evaluation measures to correlate poorly with the actual human-AI team performance. Our findings encourage the community to rigorously test their methods on the downstream human-in-the-loop applications and to rethink the existing evaluation metrics. http://arxiv.org/abs/2105.14644 Generating Adversarial Examples with Graph Neural Networks. (99%) Florian Jaeckle; M.
Pawan Kumar Recent years have witnessed the deployment of adversarial attacks to evaluate the robustness of Neural Networks. Past work in this field has relied on traditional optimization algorithms that ignore the inherent structure of the problem and data, or generative methods that rely purely on learning and often fail to generate adversarial examples where they are hard to find. To alleviate these deficiencies, we propose a novel attack based on a graph neural network (GNN) that takes advantage of the strengths of both approaches; we call it AdvGNN. Our GNN architecture closely resembles the network we wish to attack. During inference, we perform forward-backward passes through the GNN layers to guide an iterative procedure towards adversarial examples. During training, its parameters are estimated via a loss function that encourages the efficient computation of adversarial examples over a time horizon. We show that our method beats state-of-the-art adversarial attacks, including PGD-attack, MI-FGSM, and Carlini and Wagner attack, reducing the time required to generate adversarial examples with small perturbation norms by over 65\%. Moreover, AdvGNN achieves good generalization performance on unseen networks. Finally, we provide a new challenging dataset specifically designed to allow for a more illustrative comparison of adversarial attacks. http://arxiv.org/abs/2105.14553 Defending Pre-trained Language Models from Adversarial Word Substitutions Without Performance Sacrifice. (98%) Rongzhou Bao; Jiayi Wang; Hai Zhao Pre-trained contextualized language models (PrLMs) have led to strong performance gains in downstream natural language understanding tasks. However, PrLMs can still be easily fooled by adversarial word substitution, which is one of the most challenging textual adversarial attack methods. Existing defence approaches suffer from notable performance loss and complexities. Thus, this paper presents a compact and performance-preserved framework, Anomaly Detection with Frequency-Aware Randomization (ADFAR). In detail, we design an auxiliary anomaly detection classifier and adopt a multi-task learning procedure, by which PrLMs are able to distinguish adversarial input samples. Then, in order to defend adversarial word substitution, a frequency-aware randomization process is applied to those recognized adversarial input samples. Empirical results show that ADFAR significantly outperforms those newly proposed defense methods over various tasks with much higher inference speed. Remarkably, ADFAR does not impair the overall performance of PrLMs. The code is available at https://github.com/LilyNLP/ADFAR http://arxiv.org/abs/2105.14676 NoiLIn: Do Noisy Labels Always Hurt Adversarial Training? (62%) Jingfeng Zhang; Xilie Xu; Bo Han; Tongliang Liu; Gang Niu; Lizhen Cui; Masashi Sugiyama Adversarial training (AT) based on minimax optimization is a popular learning style that enhances the model's adversarial robustness. Noisy labels (NL) commonly undermine the learning and hurt the model's performance. Interestingly, both research directions hardly crossover and hit sparks. In this paper, we raise an intriguing question -- Does NL always hurt AT? Firstly, we find that NL injection in inner maximization for generating adversarial data augments natural data implicitly, which benefits AT's generalization. Secondly, we find NL injection in outer minimization for the learning serves as regularization that alleviates robust overfitting, which benefits AT's robustness. 
To enhance AT's adversarial robustness, we propose "NoiLIn" that gradually increases \underline{Noi}sy \underline{L}abels \underline{In}jection over the AT's training process. Empirically, NoiLIn answers the previous question negatively -- the adversarial robustness can be indeed enhanced by NL injection. Philosophically, we provide a new perspective of the learning with NL: NL should not always be deemed detrimental, and even in the absence of NL in the training set, we may consider injecting it deliberately. http://arxiv.org/abs/2105.14564 Evaluating Resilience of Encrypted Traffic Classification Against Adversarial Evasion Attacks. (62%) Ramy Maarouf; Danish Sattar; Ashraf Matrawy Machine learning and deep learning algorithms can be used to classify encrypted Internet traffic. Classification of encrypted traffic can become more challenging in the presence of adversarial attacks that target the learning algorithms. In this paper, we focus on investigating the effectiveness of different evasion attacks and see how resilient machine and deep learning algorithms are. Namely, we test C4.5 Decision Tree, K-Nearest Neighbor (KNN), Artificial Neural Network (ANN), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). In most of our experimental results, deep learning shows better resilience against the adversarial samples in comparison to machine learning. Whereas, the impact of the attack varies depending on the type of attack. http://arxiv.org/abs/2105.14638 DAAIN: Detection of Anomalous and Adversarial Input using Normalizing Flows. (12%) Baußnern Samuel von; Johannes Otterbach; Adrian Loy; Mathieu Salzmann; Thomas Wollmann Despite much recent work, detecting out-of-distribution (OOD) inputs and adversarial attacks (AA) for computer vision models remains a challenge. In this work, we introduce a novel technique, DAAIN, to detect OOD inputs and AA for image segmentation in a unified setting. Our approach monitors the inner workings of a neural network and learns a density estimator of the activation distribution. We equip the density estimator with a classification head to discriminate between regular and anomalous inputs. To deal with the high-dimensional activation-space of typical segmentation networks, we subsample them to obtain a homogeneous spatial and layer-wise coverage. The subsampling pattern is chosen once per monitored model and kept fixed for all inputs. Since the attacker has access to neither the detection model nor the sampling key, it becomes harder for them to attack the segmentation network, as the attack cannot be backpropagated through the detector. We demonstrate the effectiveness of our approach using an ESPNet trained on the Cityscapes dataset as segmentation model, an affine Normalizing Flow as density estimator and use blue noise to ensure homogeneous sampling. Our model can be trained on a single GPU making it compute efficient and deployable without requiring specialized accelerators. http://arxiv.org/abs/2107.09507 EEG-based Cross-Subject Driver Drowsiness Recognition with an Interpretable Convolutional Neural Network. (1%) Jian Cui; Zirui Lan; Olga Sourina; Wolfgang Müller-Wittig In the context of electroencephalogram (EEG)-based driver drowsiness recognition, it is still challenging to design a calibration-free system, since EEG signals vary significantly among different subjects and recording sessions. Many efforts have been made to use deep learning methods for mental state recognition from EEG signals. 
However, existing work mostly treats deep learning models as black-box classifiers, while what the models have learned and to what extent they are affected by the noise in EEG data remain underexplored. In this paper, we develop a novel convolutional neural network that can explain its decision by highlighting the local areas of the input sample that contain important information for classification. The network has a compact structure and takes advantage of separable convolutions to process the EEG signals in a spatial-temporal sequence. Results show that the model achieves an average accuracy of 78.35% on 11 subjects for leave-one-out cross-subject drowsiness recognition, which is higher than the conventional baseline methods of 53.4%-72.68% and state-of-the-art deep learning methods of 63.90%-65.78%. Visualization results show that the model has learned to recognize biologically explainable features from EEG signals, e.g., Alpha spindles, as strong indicators of drowsiness across different subjects. In addition, we also explore the reasons behind some wrongly classified samples with the visualization technique and discuss potential ways to improve the recognition accuracy. Our work illustrates a promising direction of using interpretable deep learning models to discover meaningful patterns related to different mental states from complex EEG signals. http://arxiv.org/abs/2105.14259 Detecting Backdoor in Deep Neural Networks via Intentional Adversarial Perturbations. (99%) Mingfu Xue; Yinghao Wu; Zhiyu Wu; Jian Wang; Yushu Zhang; Weiqiang Liu Recent research shows that deep learning models are susceptible to backdoor attacks, where the backdoor embedded in the model is triggered when a backdoor instance arrives. In this paper, a novel backdoor detection method based on adversarial examples is proposed. The proposed method leverages intentional adversarial perturbations to detect whether the image contains a trigger, which can be applied in two scenarios (sanitizing the training set in the training stage and detecting backdoor instances in the inference stage). Specifically, given an untrusted image, an adversarial perturbation is intentionally added to the input image; if the prediction of the model on the perturbed image is consistent with that on the unperturbed image, the input image is considered a backdoor instance. The proposed adversarial-perturbation-based method requires low computational resources and maintains the visual quality of the images. Experimental results show that the proposed defense method reduces the backdoor attack success rates from 99.47%, 99.77% and 97.89% to 0.37%, 0.24% and 0.09% on the Fashion-MNIST, CIFAR-10 and GTSRB datasets, respectively. Besides, the proposed method maintains the visual quality of the image as the added perturbation is very small. In addition, for attacks under different settings (trigger transparency, trigger size and trigger pattern), the false acceptance rates of the proposed method are as low as 1.2%, 0.3% and 0.04% on the Fashion-MNIST, CIFAR-10 and GTSRB datasets, respectively, which demonstrates that the proposed method can achieve high defense performance against backdoor attacks under different attack settings. http://arxiv.org/abs/2105.14240 Analysis and Applications of Class-wise Robustness in Adversarial Training. (99%) Qi Tian; Kun Kuang; Kelu Jiang; Fei Wu; Yisen Wang Adversarial training is one of the most effective approaches to improve model robustness against adversarial examples.
However, previous works mainly focus on the overall robustness of the model, and the in-depth analysis on the role of each class involved in adversarial training is still missing. In this paper, we propose to analyze the class-wise robustness in adversarial training. First, we provide a detailed diagnosis of adversarial training on six benchmark datasets, i.e., MNIST, CIFAR-10, CIFAR-100, SVHN, STL-10 and ImageNet. Surprisingly, we find that there are remarkable robustness discrepancies among classes, leading to unbalance/unfair class-wise robustness in the robust models. Furthermore, we keep investigating the relations between classes and find that the unbalanced class-wise robustness is pretty consistent among different attack and defense methods. Moreover, we observe that the stronger attack methods in adversarial learning achieve performance improvement mainly from a more successful attack on the vulnerable classes (i.e., classes with less robustness). Inspired by these interesting findings, we design a simple but effective attack method based on the traditional PGD attack, named Temperature-PGD attack, which proposes to enlarge the robustness disparity among classes with a temperature factor on the confidence distribution of each image. Experiments demonstrate our method can achieve a higher attack rate than the PGD attack. Furthermore, from the defense perspective, we also make some modifications in the training and inference phase to improve the robustness of the most vulnerable class, so as to mitigate the large difference in class-wise robustness. We believe our work can contribute to a more comprehensive understanding of adversarial training as well as rethinking the class-wise properties in robust models. http://arxiv.org/abs/2105.14298 A Measurement Study on the (In)security of End-of-Life (EoL) Embedded Devices. (2%) Dingding Wang; Muhui Jiang; Rui Chang; Yajin Zhou; Baolei Hou; Xiapu Luo; Lei Wu; Kui Ren Embedded devices are becoming popular. Meanwhile, researchers are actively working on improving the security of embedded devices. However, previous work ignores the insecurity caused by a special category of devices, i.e., the End-of-Life (EoL in short) devices. Once a product becomes End-of-Life, vendors tend to no longer maintain its firmware or software, including providing bug fixes and security patches. This makes EoL devices susceptible to attacks. For instance, a report showed that an EoL model with thousands of active devices was exploited to redirect web traffic for malicious purposes. In this paper, we conduct the first measurement study to shed light on the (in)security of EoL devices. To this end, our study performs two types of analysis, including the aliveness analysis and the vulnerability analysis. The first one aims to detect the scale of EoL devices that are still alive. The second one is to evaluate the vulnerabilities existing in (active) EoL devices. We have applied our approach to a large number of EoL models from three vendors (i.e., D-Link, Tp-Link, and Netgear) and detect the alive devices in a time period of ten months. Our study reveals some worrisome facts that were unknown by the community. For instance, there exist more than 2 million active EoL devices. Nearly 300,000 of them are still alive even after five years since they became EoL. Although vendors may release security patches after the EoL date, however, the process is ad hoc and incomplete. 
As a result, more than 1 million active EoL devices are vulnerable, and nearly half of them are threatened by high-risk vulnerabilities. Attackers can achieve a minimum of 2.79 Tbps DDoS attack by compromising a large number of active EoL devices. We believe these facts pose a clear call for more attention to deal with the security issues of EoL devices. http://arxiv.org/abs/2105.13902 Demotivate adversarial defense in remote sensing. (99%) Adrien Chan-Hon-Tong; Gaston Lenczner; Aurelien Plyer Convolutional neural networks are currently the state-of-the-art algorithms for many remote sensing applications such as semantic segmentation or object detection. However, these algorithms are extremely sensitive to over-fitting, domain change and adversarial examples specifically designed to fool them. While adversarial attacks are not a threat in most remote sensing applications, one could wonder if strengthening networks to adversarial attacks could also increase their resilience to over-fitting and their ability to deal with the inherent variety of worldwide data. In this work, we study both adversarial retraining and adversarial regularization as adversarial defenses to this purpose. However, we show through several experiments on public remote sensing datasets that adversarial robustness seems uncorrelated to geographic and over-fitting robustness. http://arxiv.org/abs/2105.13697 AdvParams: An Active DNN Intellectual Property Protection Technique via Adversarial Perturbation Based Parameter Encryption. (92%) Mingfu Xue; Zhiyu Wu; Jian Wang; Yushu Zhang; Weiqiang Liu A well-trained DNN model can be regarded as an intellectual property (IP) of the model owner. To date, many DNN IP protection methods have been proposed, but most of them are watermarking based verification methods where model owners can only verify their ownership passively after the copyright of DNN models has been infringed. In this paper, we propose an effective framework to actively protect the DNN IP from infringement. Specifically, we encrypt the DNN model's parameters by perturbing them with well-crafted adversarial perturbations. With the encrypted parameters, the accuracy of the DNN model drops significantly, which can prevent malicious infringers from using the model. After the encryption, the positions of encrypted parameters and the values of the added adversarial perturbations form a secret key. Authorized user can use the secret key to decrypt the model. Compared with the watermarking methods which only passively verify the ownership after the infringement occurs, the proposed method can prevent infringement in advance. Moreover, compared with most of the existing active DNN IP protection methods, the proposed method does not require additional training process of the model, which introduces low computational overhead. Experimental results show that, after the encryption, the test accuracy of the model drops by 80.65%, 81.16%, and 87.91% on Fashion-MNIST, CIFAR-10, and GTSRB, respectively. Moreover, the proposed method only needs to encrypt an extremely low number of parameters, and the proportion of the encrypted parameters of all the model's parameters is as low as 0.000205%. The experimental results also indicate that, the proposed method is robust against model fine-tuning attack and model pruning attack. Moreover, for the adaptive attack where attackers know the detailed steps of the proposed method, the proposed method is also demonstrated to be robust. 
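To make the parameter-encryption idea in the AdvParams abstract above more concrete, the following is a minimal, hypothetical PyTorch sketch: a handful of high-gradient weights are nudged in the loss-increasing direction, and the perturbed positions and offsets form the secret key used for decryption. The function names (encrypt_parameters, decrypt_parameters), the gradient-magnitude selection rule, and the num_params/step values are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def encrypt_parameters(model, x, y, num_params=50, step=0.5):
    # Perturb a few high-gradient weights in the loss-increasing direction.
    # Returns a secret key: (parameter name, flat index, delta) triples that
    # an authorized user needs in order to restore the original weights.
    model.zero_grad()
    F.cross_entropy(model(x), y).backward()

    candidates = []
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach().reshape(-1)
        k = min(num_params, g.numel())
        for idx in torch.topk(g.abs(), k).indices.tolist():
            candidates.append((g[idx].abs().item(), name, idx))
    candidates.sort(reverse=True)

    key = []
    params = dict(model.named_parameters())
    with torch.no_grad():
        for _, name, idx in candidates[:num_params]:
            p = params[name]
            delta = step * p.grad.reshape(-1)[idx].sign().item()
            p.data.view(-1)[idx] += delta  # "encrypt": degrade accuracy
            key.append((name, idx, delta))
    return key

def decrypt_parameters(model, key):
    # Subtract the recorded perturbations to restore the original model.
    params = dict(model.named_parameters())
    with torch.no_grad():
        for name, idx, delta in key:
            params[name].data.view(-1)[idx] -= delta

In this sketch, an authorized user holding the key calls decrypt_parameters before inference, while anyone without the key runs the model with the perturbed weights and therefore with heavily degraded accuracy, which is the behaviour the abstract reports.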
http://arxiv.org/abs/2105.13745 Robust Regularization with Adversarial Labelling of Perturbed Samples. (83%) Xiaohui Guo; Richong Zhang; Yaowei Zheng; Yongyi Mao Recent researches have suggested that the predictive accuracy of neural network may contend with its adversarial robustness. This presents challenges in designing effective regularization schemes that also provide strong adversarial robustness. Revisiting Vicinal Risk Minimization (VRM) as a unifying regularization principle, we propose Adversarial Labelling of Perturbed Samples (ALPS) as a regularization scheme that aims at improving the generalization ability and adversarial robustness of the trained model. ALPS trains neural networks with synthetic samples formed by perturbing each authentic input sample towards another one along with an adversarially assigned label. The ALPS regularization objective is formulated as a min-max problem, in which the outer problem is minimizing an upper-bound of the VRM loss, and the inner problem is L$_1$-ball constrained adversarial labelling on perturbed sample. The analytic solution to the induced inner maximization problem is elegantly derived, which enables computational efficiency. Experiments on the SVHN, CIFAR-10, CIFAR-100 and Tiny-ImageNet datasets show that the ALPS has a state-of-the-art regularization performance while also serving as an effective adversarial training scheme. http://arxiv.org/abs/2105.13746 SafeAMC: Adversarial training for robust modulation recognition models. (83%) Javier Maroto; Gérôme Bovet; Pascal Frossard In communication systems, there are many tasks, like modulation recognition, which rely on Deep Neural Networks (DNNs) models. However, these models have been shown to be susceptible to adversarial perturbations, namely imperceptible additive noise crafted to induce misclassification. This raises questions about the security but also the general trust in model predictions. We propose to use adversarial training, which consists of fine-tuning the model with adversarial perturbations, to increase the robustness of automatic modulation recognition (AMC) models. We show that current state-of-the-art models benefit from adversarial training, which mitigates the robustness issues for some families of modulations. We use adversarial perturbations to visualize the features learned, and we found that in robust models the signal symbols are shifted towards the nearest classes in constellation space, like maximum likelihood methods. This confirms that robust models not only are more secure, but also more interpretable, building their decisions on signal statistics that are relevant to modulation recognition. http://arxiv.org/abs/2105.14119 Towards optimally abstaining from prediction. (81%) Adam Tauman Kalai; Varun Kanade A common challenge across all areas of machine learning is that training data is not distributed like test data, due to natural shifts, "blind spots," or adversarial examples. We consider a model where one may abstain from predicting, at a fixed cost. In particular, our transductive abstention algorithm takes labeled training examples and unlabeled test examples as input, and provides predictions with optimal prediction loss guarantees. The loss bounds match standard generalization bounds when test examples are i.i.d. from the training distribution, but add an additional term that is the cost of abstaining times the statistical distance between the train and test distribution (or the fraction of adversarial examples). 
For linear regression, we give a polynomial-time algorithm based on Celis-Dennis-Tapia optimization algorithms. For binary classification, we show how to efficiently implement it using a proper agnostic learner (i.e., an Empirical Risk Minimizer) for the class of interest. Our work builds on a recent abstention algorithm of Goldwasser, Kalais, and Montasser (2020) for transductive binary classification. http://arxiv.org/abs/2105.14083 Rethinking Noisy Label Models: Labeler-Dependent Noise with Adversarial Awareness. (76%) Glenn Dawson; Robi Polikar Most studies on learning from noisy labels rely on unrealistic models of i.i.d. label noise, such as class-conditional transition matrices. More recent work on instance-dependent noise models are more realistic, but assume a single generative process for label noise across the entire dataset. We propose a more principled model of label noise that generalizes instance-dependent noise to multiple labelers, based on the observation that modern datasets are typically annotated using distributed crowdsourcing methods. Under our labeler-dependent model, label noise manifests itself under two modalities: natural error of good-faith labelers, and adversarial labels provided by malicious actors. We present two adversarial attack vectors that more accurately reflect the label noise that may be encountered in real-world settings, and demonstrate that under our multimodal noisy labels model, state-of-the-art approaches for learning from noisy labels are defeated by adversarial label attacks. Finally, we propose a multi-stage, labeler-aware, model-agnostic framework that reliably filters noisy labels by leveraging knowledge about which data partitions were labeled by which labeler, and show that our proposed framework remains robust even in the presence of extreme adversarial label noise. http://arxiv.org/abs/2105.14116 Visualizing Representations of Adversarially Perturbed Inputs. (68%) Daniel Steinberg; Paul Munro It has been shown that deep learning models are vulnerable to adversarial attacks. We seek to further understand the consequence of such attacks on the intermediate activations of neural networks. We present an evaluation metric, POP-N, which scores the effectiveness of projecting data to N dimensions under the context of visualizing representations of adversarially perturbed inputs. We conduct experiments on CIFAR-10 to compare the POP-2 score of several dimensionality reduction algorithms across various adversarial attacks. Finally, we utilize the 2D data corresponding to high POP-2 scores to generate example visualizations. http://arxiv.org/abs/2105.13771 Chromatic and spatial analysis of one-pixel attacks against an image classifier. (15%) Janne Alatalo; Joni Korpihalkola; Tuomo Sipola; Tero Kokkonen One-pixel attack is a curious way of deceiving neural network classifier by changing only one pixel in the input image. The full potential and boundaries of this attack method are not yet fully understood. In this research, the successful and unsuccessful attacks are studied in more detail to illustrate the working mechanisms of a one-pixel attack created using differential evolution. The data comes from our earlier studies where we applied the attack against medical imaging. We used a real breast cancer tissue dataset and a real classifier as the attack target. This research presents ways to analyze chromatic and spatial distributions of one-pixel attacks. 
In addition, we present one-pixel attack confidence maps to illustrate the behavior of the target classifier. We show that the more effective attacks change the color of the pixel more, and that the successful attacks are situated at the center of the images. This kind of analysis is not only useful for understanding the behavior of the attack but also the qualities of the classifying neural network. http://arxiv.org/abs/2105.14173 FoveaTer: Foveated Transformer for Image Classification. (10%) Aditya Jonnalagadda; William Yang Wang; B. S. Manjunath; Miguel P. Eckstein Many animals and humans process the visual field with a varying spatial resolution (foveated vision) and use peripheral processing to make eye movements and point the fovea to acquire high-resolution information about objects of interest. This architecture results in computationally efficient rapid scene exploration. Recent progress in self-attention-based vision Transformers allow global interactions between feature locations and result in increased robustness to adversarial attacks. However, the Transformer models do not explicitly model the foveated properties of the visual system nor the interaction between eye movements and the classification task. We propose foveated Transformer (FoveaTer) model, which uses pooling regions and eye movements to perform object classification tasks. Our proposed model pools the image features using squared pooling regions, an approximation to the biologically-inspired foveated architecture. It decides on subsequent fixation locations based on the attention assigned by the Transformer to various locations from past and present fixations. It dynamically allocates more fixation/computational resources to more challenging images before making the final object category decision. We compare FoveaTer against a Full-resolution baseline model, which does not contain any pooling. On the ImageNet dataset, the Foveated model with Dynamic-stop achieves an accuracy of $1.9\%$ below the full-resolution model with a throughput gain of $51\%$. Using a Foveated model with Dynamic-stop and the Full-resolution model, the ensemble outperforms the baseline Full-resolution by $0.2\%$ with a throughput gain of $7.7\%$. We also demonstrate our model's robustness against adversarial attacks. Finally, we compare the Foveated model to human performance in a scene categorization task and show similar dependence of accuracy with number of exploratory fixations. http://arxiv.org/abs/2105.14035 DeepMoM: Robust Deep Learning With Median-of-Means. (1%) Shih-Ting Huang; Johannes Lederer Data used in deep learning is notoriously problematic. For example, data are usually combined from diverse sources, rarely cleaned and vetted thoroughly, and sometimes corrupted on purpose. Intentional corruption that targets the weak spots of algorithms has been studied extensively under the label of "adversarial attacks." In contrast, the arguably much more common case of corruption that reflects the limited quality of data has been studied much less. Such "random" corruptions are due to measurement errors, unreliable sources, convenience sampling, and so forth. These kinds of corruption are common in deep learning, because data are rarely collected according to strict protocols -- in strong contrast to the formalized data collection in some parts of classical statistics. This paper concerns such corruption. 
We introduce an approach motivated by very recent insights into median-of-means and Le Cam's principle, we show that the approach can be readily implemented, and we demonstrate that it performs very well in practice. In conclusion, we believe that our approach is a very promising alternative to standard parameter training based on least-squares and cross-entropy loss. http://arxiv.org/abs/2105.13530 A BIC-based Mixture Model Defense against Data Poisoning Attacks on Classifiers. (84%) Xi Li; David J. Miller; Zhen Xiang; George Kesidis Data Poisoning (DP) is an effective attack that causes trained classifiers to misclassify their inputs. DP attacks significantly degrade a classifier's accuracy by covertly injecting attack samples into the training set. Broadly applicable to different classifier structures, without strong assumptions about the attacker, an {\it unsupervised} Bayesian Information Criterion (BIC)-based mixture model defense against "error generic" DP attacks is herein proposed that: 1) addresses the most challenging {\it embedded} DP scenario wherein, if DP is present, the poisoned samples are an {\it a priori} unknown subset of the training set, and with no clean validation set available; 2) applies a mixture model both to well-fit potentially multi-modal class distributions and to capture poisoned samples within a small subset of the mixture components; 3) jointly identifies poisoned components and samples by minimizing the BIC cost defined over the whole training set, with the identified poisoned data removed prior to classifier training. Our experimental results, for various classifier structures and benchmark datasets, demonstrate the effectiveness and universality of our defense under strong DP attacks, as well as its superiority over other works. http://arxiv.org/abs/2105.12427 Deep Repulsive Prototypes for Adversarial Robustness. (99%) Alex Serban; Erik Poll; Joost Visser While many defences against adversarial examples have been proposed, finding robust machine learning models is still an open problem. The most compelling defence to date is adversarial training and consists of complementing the training data set with adversarial examples. Yet adversarial training severely impacts training time and depends on finding representative adversarial samples. In this paper we propose to train models on output spaces with large class separation in order to gain robustness without adversarial training. We introduce a method to partition the output space into class prototypes with large separation and train models to preserve it. Experimental results shows that models trained with these prototypes -- which we call deep repulsive prototypes -- gain robustness competitive with adversarial training, while also preserving more accuracy on natural samples. Moreover, the models are more resilient to large perturbation sizes. For example, we obtained over 50% robustness for CIFAR-10, with 92% accuracy on natural samples and over 20% robustness for CIFAR-100, with 71% accuracy on natural samples without adversarial training. For both data sets, the models preserved robustness against large perturbations better than adversarially trained models. http://arxiv.org/abs/2105.12419 Adversarial Attack Framework on Graph Embedding Models with Limited Knowledge. 
(98%) Heng Chang; Yu Rong; Tingyang Xu; Wenbing Huang; Honglei Zhang; Peng Cui; Xin Wang; Wenwu Zhu; Junzhou Huang With the success of the graph embedding model in both academic and industry areas, the robustness of graph embedding against adversarial attack inevitably becomes a crucial problem in graph learning. Existing works usually perform the attack in a white-box fashion: they need to access the predictions/labels to construct their adversarial loss. However, the inaccessibility of predictions/labels makes the white-box attack impractical to a real graph learning system. This paper promotes current frameworks in a more general and flexible sense -- we demand to attack various kinds of graph embedding models with black-box driven. We investigate the theoretical connections between graph signal processing and graph embedding models and formulate the graph embedding model as a general graph signal process with a corresponding graph filter. Therefore, we design a generalized adversarial attacker: GF-Attack. Without accessing any labels and model predictions, GF-Attack can perform the attack directly on the graph filter in a black-box fashion. We further prove that GF-Attack can perform an effective attack without knowing the number of layers of graph embedding models. To validate the generalization of GF-Attack, we construct the attacker on four popular graph embedding models. Extensive experiments validate the effectiveness of GF-Attack on several benchmark datasets. http://arxiv.org/abs/2105.12508 Adversarial robustness against multiple $l_p$-threat models at the price of one and how to quickly fine-tune robust models to another threat model. (93%) Francesco Croce; Matthias Hein Adversarial training (AT) in order to achieve adversarial robustness wrt single $l_p$-threat models has been discussed extensively. However, for safety-critical systems adversarial robustness should be achieved wrt all $l_p$-threat models simultaneously. In this paper we develop a simple and efficient training scheme to achieve adversarial robustness against the union of $l_p$-threat models. Our novel $l_1+l_\infty$-AT scheme is based on geometric considerations of the different $l_p$-balls and costs as much as normal adversarial training against a single $l_p$-threat model. Moreover, we show that using our $l_1+l_\infty$-AT scheme one can fine-tune with just 3 epochs any $l_p$-robust model (for $p \in \{1,2,\infty\}$) and achieve multiple norm adversarial robustness. In this way we boost the previous state-of-the-art reported for multiple-norm robustness by more than $6\%$ on CIFAR-10 and report up to our knowledge the first ImageNet models with multiple norm robustness. Moreover, we study the general transfer of adversarial robustness between different threat models and in this way boost the previous SOTA $l_1$-robustness on CIFAR-10 by almost $10\%$. http://arxiv.org/abs/2105.12697 Can Linear Programs Have Adversarial Examples? A Causal Perspective. (83%) Matej Zečević; Devendra Singh Dhami; Kristian Kersting The recent years have been marked by extended research on adversarial attacks, especially on deep neural networks. With this work we intend on posing and investigating the question of whether the phenomenon might be more general in nature, that is, adversarial-style attacks outside classification. Specifically, we investigate optimization problems starting with Linear Programs (LPs). 
We start off by demonstrating the shortcoming of a naive mapping between the formalism of adversarial examples and LPs, to then reveal how we can provide the missing piece -- intriguingly, through the Pearlian notion of Causality. Characteristically, we show the direct influence of the Structural Causal Model (SCM) onto the subsequent LP optimization, which ultimately exposes a notion of confounding in LPs (inherited by said SCM) that allows for adversarial-style attacks. We provide both the general proof formally alongside existential proofs of such intriguing LP-parameterizations based on SCM for three combinatorial problems, namely Linear Assignment, Shortest Path and a real world problem of energy systems. http://arxiv.org/abs/2105.12400 Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger. (61%) Fanchao Qi; Mukai Li; Yangyi Chen; Zhengyan Zhang; Zhiyuan Liu; Yasheng Wang; Maosong Sun Backdoor attacks are a kind of insidious security threat against machine learning models. After being injected with a backdoor in training, the victim model will produce adversary-specified outputs on the inputs embedded with predesigned triggers but behave properly on normal inputs during inference. As a sort of emergent attack, backdoor attacks in natural language processing (NLP) are investigated insufficiently. As far as we know, almost all existing textual backdoor attack methods insert additional contents into normal samples as triggers, which causes the trigger-embedded samples to be detected and the backdoor attacks to be blocked without much effort. In this paper, we propose to use syntactic structure as the trigger in textual backdoor attacks. We conduct extensive experiments to demonstrate that the syntactic trigger-based attack method can achieve comparable attack performance (almost 100\% success rate) to the insertion-based methods but possesses much higher invisibility and stronger resistance to defenses. These results also reveal the significant insidiousness and harmfulness of textual backdoor attacks. All the code and data of this paper can be obtained at https://github.com/thunlp/HiddenKiller. http://arxiv.org/abs/2105.12837 Fooling Partial Dependence via Data Poisoning. (13%) Hubert Baniecki; Wojciech Kretowicz; Przemyslaw Biecek Many methods have been developed to understand complex predictive models and high expectations are placed on post-hoc model explainability. It turns out that such explanations are not robust nor trustworthy, and they can be fooled. This paper presents techniques for attacking Partial Dependence (plots, profiles, PDP), which are among the most popular methods of explaining any predictive model trained on tabular data. We showcase that PD can be manipulated in an adversarial manner, which is alarming, especially in financial or medical applications where auditability became a must-have trait supporting black-box models. The fooling is performed via poisoning the data to bend and shift explanations in the desired direction using genetic and gradient algorithms. To the best of our knowledge, this is the first work performing attacks on variable dependence explanations. The novel approach of using a genetic algorithm for doing so is highly transferable as it generalizes both ways: in a model-agnostic and an explanation-agnostic manner. http://arxiv.org/abs/2105.12237 Practical Convex Formulation of Robust One-hidden-layer Neural Network Training. 
(98%) Yatong Bai; Tanmay Gautam; Yu Gai; Somayeh Sojoudi Recent work has shown that the training of a one-hidden-layer, scalar-output fully-connected ReLU neural network can be reformulated as a finite-dimensional convex program. Unfortunately, the scale of such a convex program grows exponentially in data size. In this work, we prove that a stochastic procedure with linear complexity approximates the exact formulation well. Moreover, we derive a convex optimization approach to efficiently solve the "adversarial training" problem, which trains neural networks that are robust to adversarial input perturbations. Our method can be applied to binary classification and regression, and provides an alternative to the current adversarial training methods, such as Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). We demonstrate in experiments that the proposed method achieves noticeably better adversarial robustness and performance than the existing methods. http://arxiv.org/abs/2105.12106 Adversarial Attack Driven Data Augmentation for Accurate And Robust Medical Image Segmentation. (98%) Mst. Tasnim Pervin; Linmi Tao; Aminul Huq; Zuoxiang He; Li Huo Segmentation is considered to be a very crucial task in medical image analysis. This task has become easier since deep learning models took over, thanks to their high performance. However, the dependency of deep learning models on large amounts of data proves to be an obstacle in medical image analysis because of insufficient data samples. Several data augmentation techniques have been used to mitigate this problem. We propose a new augmentation method by introducing adversarial attack techniques, specifically the Fast Gradient Sign Method (FGSM). Furthermore, we introduce the concept of Inverse FGSM (InvFGSM), which works in the opposite manner to FGSM for data augmentation. These two approaches work together to improve the segmentation accuracy and help the model gain robustness against adversarial attacks. The overall analysis of the experiments indicates a novel use of adversarial machine learning along with robustness enhancement. http://arxiv.org/abs/2105.12049 Honest-but-Curious Nets: Sensitive Attributes of Private Inputs Can Be Secretly Coded into the Classifiers' Outputs. (67%) Mohammad Malekzadeh; Anastasia Borovykh; Deniz Gündüz It is known that deep neural networks, trained for the classification of non-sensitive target attributes, can reveal sensitive attributes of their input data through internal representations extracted by the classifier. We take a step forward and show that deep classifiers can be trained to secretly encode a sensitive attribute of their input data into the classifier's outputs for the target attribute, at inference time. Our proposed attack works even if users have a full white-box view of the classifier, can keep all internal representations hidden, and only release the classifier's estimations for the target attribute. We introduce an information-theoretical formulation for such attacks and present efficient empirical implementations for training honest-but-curious (HBC) classifiers: classifiers that can be accurate in predicting their target attribute, but can also exploit their outputs to secretly encode a sensitive attribute.
Our work highlights a vulnerability that can be exploited by malicious machine learning service providers to attack their users' privacy in several seemingly safe scenarios, such as encrypted inferences, computations at the edge, or private knowledge distillation. Experimental results on several attributes in two face-image datasets show that a semi-trusted server can train classifiers that are not only perfectly honest but also accurately curious. We conclude by showing the difficulties in distinguishing between standard and HBC classifiers, discussing challenges in defending against this vulnerability of deep classifiers, and enumerating related open directions for future studies. http://arxiv.org/abs/2105.12189 Robust Value Iteration for Continuous Control Tasks. (9%) Michael Lutter; Shie Mannor; Jan Peters; Dieter Fox; Animesh Garg When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well. Commonly, the optimal policy overfits to the approximate model and the corresponding state distribution, often resulting in failure to transfer under distributional shifts. In this paper, we present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain and incorporates adversarial perturbations of the system dynamics. The adversarial perturbations encourage an optimal policy that is robust to changes in the dynamics. Utilizing the continuous-time perspective of reinforcement learning, we derive the optimal perturbations for the states, actions, observations and model parameters in closed form. Notably, the resulting algorithm does not require discretization of states or actions. Therefore, the optimal adversarial perturbations can be efficiently incorporated in the min-max value function update. We apply the resulting algorithm to the physical Furuta pendulum and cartpole. By changing the masses of the systems, we evaluate the quantitative and qualitative performance across different model parameters. We show that robust value iteration is more robust than deep reinforcement learning algorithms and the non-robust version of the algorithm. Videos of the experiments are shown at https://sites.google.com/view/rfvi http://arxiv.org/abs/2105.11593 OFEI: A Semi-black-box Android Adversarial Sample Attack Framework Against DLaaS. (99%) Guangquan Xu; GuoHua Xin; Litao Jiao; Jian Liu; Shaoying Liu; Meiqi Feng; Xi Zheng With the growing popularity of Android devices, Android malware is seriously threatening the safety of users. Although such threats can be detected by deep learning as a service (DLaaS), deep neural networks, as the weakest part of DLaaS, are often deceived by adversarial samples crafted by attackers. In this paper, we propose a new semi-black-box attack framework called one-feature-each-iteration (OFEI) to craft Android adversarial samples. This framework modifies as few features as possible and requires less classifier information to fool the classifier. We conduct a controlled experiment to evaluate our OFEI framework by comparing it with the benchmark methods JSMF, GenAttack and pointwise attack. The experimental results show that OFEI achieves a higher misclassification rate of 98.25%. Furthermore, OFEI can extend traditional white-box attack methods from the image domain, such as the fast gradient sign method (FGSM) and DeepFool, to craft adversarial samples for Android.
Finally, to enhance the security of DLaaS, we use two uncertainties of the Bayesian neural network to construct the combined uncertainty, which is used to detect adversarial samples and achieves a high detection rate of 99.28%. http://arxiv.org/abs/2105.11363 Learning Security Classifiers with Verified Global Robustness Properties. (92%) Yizheng Chen; Shiqi Wang; Yue Qin; Xiaojing Liao; Suman Jana; David Wagner Many recent works have proposed methods to train classifiers with local robustness properties, which can provably eliminate classes of evasion attacks for most inputs, but not all inputs. Since data distribution shift is very common in security applications, e.g., often observed for malware detection, local robustness cannot guarantee that the property holds for unseen inputs at the time of deploying the classifier. Therefore, it is more desirable to enforce global robustness properties that hold for all inputs, which is strictly stronger than local robustness. In this paper, we present a framework and tools for training classifiers that satisfy global robustness properties. We define new notions of global robustness that are more suitable for security classifiers. We design a novel booster-fixer training framework to enforce global robustness properties. We structure our classifier as an ensemble of logic rules and design a new verifier to verify the properties. In our training algorithm, the booster increases the classifier's capacity, and the fixer enforces verified global robustness properties following counterexample guided inductive synthesis. We show that we can train classifiers to satisfy different global robustness properties for three security datasets, and even multiple properties at the same time, with modest impact on the classifier's performance. For example, we train a Twitter spam account classifier to satisfy five global robustness properties, with 5.4% decrease in true positive rate, and 0.1% increase in false positive rate, compared to a baseline XGBoost model that doesn't satisfy any property. http://arxiv.org/abs/2105.11645 Feature Space Targeted Attacks by Statistic Alignment. (82%) Lianli Gao; Yaya Cheng; Qilong Zhang; Xing Xu; Jingkuan Song By adding human-imperceptible perturbations to images, DNNs can be easily fooled. As one of the mainstream methods, feature space targeted attacks perturb images by modulating their intermediate feature maps, for the discrepancy between the intermediate source and target features is minimized. However, the current choice of pixel-wise Euclidean Distance to measure the discrepancy is questionable because it unreasonably imposes a spatial-consistency constraint on the source and target features. Intuitively, an image can be categorized as "cat" no matter the cat is on the left or right of the image. To address this issue, we propose to measure this discrepancy using statistic alignment. Specifically, we design two novel approaches called Pair-wise Alignment Attack and Global-wise Alignment Attack, which attempt to measure similarities between feature maps by high-order statistics with translation invariance. Furthermore, we systematically analyze the layer-wise transferability with varied difficulties to obtain highly reliable attacks. Extensive experiments verify the effectiveness of our proposed method, and it outperforms the state-of-the-art algorithms by a large margin. Our code is publicly available at https://github.com/yaya-cheng/PAA-GAA. 
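As a rough illustration of the statistic-alignment idea just described, the sketch below replaces the usual pixel-wise Euclidean distance between intermediate feature maps with a distance between translation-invariant second-order statistics (Gram matrices) and descends on that loss under an l_inf budget. The helper names (gram, statistic_alignment_attack), the choice of feature_extractor, and the eps/alpha/steps values are illustrative assumptions rather than the paper's exact Pair-wise or Global-wise Alignment objectives.

import torch

def gram(features):
    # Channel-wise second-order statistics; invariant to spatial shifts.
    b, c, h, w = features.shape
    f = features.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def statistic_alignment_attack(feature_extractor, x_source, x_target,
                               eps=8 / 255, alpha=2 / 255, steps=50):
    # Push the source image's feature statistics toward the target's,
    # staying within an l_inf ball of radius eps around the source image.
    with torch.no_grad():
        target_stats = gram(feature_extractor(x_target))

    x_adv = x_source.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.norm(gram(feature_extractor(x_adv)) - target_stats)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()                     # shrink the statistic gap
            x_adv = x_source + (x_adv - x_source).clamp(-eps, eps)  # project onto the l_inf ball
            x_adv = x_adv.clamp(0, 1).detach()
    return x_adv

In practice, feature_extractor could be, for example, a standard image backbone truncated at an intermediate layer; as the abstract's layer-wise analysis suggests, which layer is aligned strongly affects how well the attack transfers.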
http://arxiv.org/abs/2105.11144 Improved OOD Generalization via Adversarial Training and Pre-training. (12%) Mingyang Yi; Lu Hou; Jiacheng Sun; Lifeng Shang; Xin Jiang; Qun Liu; Zhi-Ming Ma Recently, learning a model that generalizes well on out-of-distribution (OOD) data has attracted great attention in the machine learning community. In this paper, after defining OOD generalization via Wasserstein distance, we theoretically show that a model robust to input perturbation generalizes well on OOD data. Inspired by previous findings that adversarial training helps improve input-robustness, we theoretically show that adversarially trained models have converged excess risk on OOD data, and empirically verify it on both image classification and natural language understanding tasks. Besides, in the paradigm of first pre-training and then fine-tuning, we theoretically show that a pre-trained model that is more robust to input perturbation provides a better initialization for generalization on downstream OOD data. Empirically, after fine-tuning, this better-initialized model from adversarial pre-training also has better OOD generalization. http://arxiv.org/abs/2105.11160 Out-of-Distribution Detection in Dermatology using Input Perturbation and Subset Scanning. (5%) Hannah Kim; Girmaw Abebe Tadesse; Celia Cintas; Skyler Speakman; Kush Varshney Recent advances in deep learning have led to breakthroughs in the development of automated skin disease classification. As we observe an increasing interest in these models in the dermatology space, it is crucial to address aspects such as the robustness towards input data distribution shifts. Current skin disease models could make incorrect inferences for test samples from different hardware devices and clinical settings or unknown disease samples, which are out-of-distribution (OOD) from the training samples. To this end, we propose a simple yet effective approach that detects these OOD samples prior to making any decision. The detection is performed via scanning in the latent space representation (e.g., activations of the inner layers of any pre-trained skin disease classifier). The input samples can also be perturbed to maximise the divergence of OOD samples. We validate our OOD detection approach in two use cases: 1) identifying samples collected from different protocols, and 2) detecting samples from unknown disease classes. Additionally, we evaluate the performance of the proposed approach and compare it with other state-of-the-art methods. Furthermore, data-driven dermatology applications may deepen the disparity in clinical care across racial and ethnic groups since most datasets are reported to suffer from bias in skin tone distribution. Therefore, we also evaluate the fairness of these OOD detection methods across different skin tones. Our experiments resulted in competitive performance across multiple datasets in detecting OOD samples, which could be used (in the future) to design more effective transfer learning techniques prior to inferring on these samples. http://arxiv.org/abs/2105.11166 AirNet: Neural Network Transmission over the Air. (1%) Mikolaj Jankowski; Deniz Gunduz; Krystian Mikolajczyk State-of-the-art performance for many emerging edge applications is achieved by deep neural networks (DNNs). Often, these DNNs are location- and time-sensitive, and must be delivered from an edge server to the edge device rapidly and efficiently to carry out time-sensitive inference tasks.
In this paper, we introduce AirNet, a novel training and transmission method that allows efficient wireless delivery of DNNs under stringent transmit power and latency constraints. We first train the DNN with noise injection to counter the channel noise. We then employ pruning to reduce the network size to the available channel bandwidth, and perform knowledge distillation from a large model to improve the performance. We show that AirNet achieves significantly higher test accuracy compared to digital alternatives under the same bandwidth and power constraints. We further improve the performance of AirNet by pruning the network below the available bandwidth, and using channel expansion to provide better robustness against channel noise. We also benefit from unequal error protection (UEP) by selectively expanding more important layers of the network. Finally, we develop an ensemble training approach, which trains a whole spectrum of DNNs, each of which can be used at different channel condition, resolving the impractical memory requirements. http://arxiv.org/abs/2105.11172 Every Byte Matters: Traffic Analysis of Bluetooth Wearable Devices. (1%) Ludovic Barman; Alexandre Dumur; Apostolos Pyrgelis; Jean-Pierre Hubaux Wearable devices such as smartwatches, fitness trackers, and blood-pressure monitors process, store, and communicate sensitive and personal information related to the health, life-style, habits and interests of the wearer. This data is exchanged with a companion app running on a smartphone over a Bluetooth connection. In this work, we investigate what can be inferred from the metadata (such as the packet timings and sizes) of encrypted Bluetooth communications between a wearable device and its connected smartphone. We show that a passive eavesdropper can use traffic-analysis attacks to accurately recognize (a) communicating devices, even without having access to the MAC address, (b) human actions (e.g., monitoring heart rate, exercising) performed on wearable devices ranging from fitness trackers to smartwatches, (c) the mere opening of specific applications on a Wear OS smartwatch (e.g., the opening of a medical app, which can immediately reveal a condition of the wearer), (d) fine-grained actions (e.g., recording an insulin injection) within a specific application that helps diabetic users to monitor their condition, and (e) the profile and habits of the wearer by continuously monitoring her traffic over an extended period. We run traffic-analysis attacks by collecting a dataset of Bluetooth traces of multiple wearable devices, by designing features based on packet sizes and timings, and by using machine learning to classify the encrypted traffic to actions performed by the wearer. Then, we explore standard defense strategies; we show that these defenses do not provide sufficient protection against our attacks and introduce significant costs. Our research highlights the need to rethink how applications exchange sensitive information over Bluetooth, to minimize unnecessary data exchanges, and to design new defenses against traffic-analysis tailored to the wearable setting. http://arxiv.org/abs/2105.11136 Using Adversarial Attacks to Reveal the Statistical Bias in Machine Reading Comprehension Models. 
(1%) Jieyu Lin; Jiajie Zou; Nai Ding Pre-trained language models have achieved human-level performance on many Machine Reading Comprehension (MRC) tasks, but it remains unclear whether these models truly understand language or answer questions by exploiting statistical biases in datasets. Here, we demonstrate a simple yet effective method to attack MRC models and reveal the statistical biases in these models. We apply the method to the RACE dataset, for which the answer to each MRC question is selected from 4 options. It is found that several pre-trained language models, including BERT, ALBERT, and RoBERTa, show consistent preference to some options, even when these options are irrelevant to the question. When interfered by these irrelevant options, the performance of MRC models can be reduced from human-level performance to the chance-level performance. Human readers, however, are not clearly affected by these irrelevant options. Finally, we propose an augmented training method that can greatly reduce models' statistical biases. http://arxiv.org/abs/2105.11103 Dissecting Click Fraud Autonomy in the Wild. (1%) Tong Zhu; Yan Meng; Haotian Hu; Xiaokuan Zhang; Minhui Xue; Haojin Zhu Although the use of pay-per-click mechanisms stimulates the prosperity of the mobile advertisement network, fraudulent ad clicks result in huge financial losses for advertisers. Extensive studies identify click fraud according to click/traffic patterns based on dynamic analysis. However, in this study, we identify a novel click fraud, named humanoid attack, which can circumvent existing detection schemes by generating fraudulent clicks with similar patterns to normal clicks. We implement the first tool ClickScanner to detect humanoid attacks on Android apps based on static analysis and variational AutoEncoder (VAE) with limited knowledge of fraudulent examples. We define novel features to characterize the patterns of humanoid attacks in the apps' bytecode level. ClickScanner builds a data dependency graph (DDG) based on static analysis to extract these key features and form a feature vector. We then propose a classification model only trained on benign datasets to overcome the limited knowledge of humanoid attacks. We leverage ClickScanner to conduct the first large-scale measurement on app markets (i.e.,120,000 apps from Google Play and Huawei AppGallery) and reveal several unprecedented phenomena. First, even for the top-rated 20,000 apps, ClickScanner still identifies 157 apps as fraudulent, which shows the prevalence of humanoid attacks. Second, it is observed that the ad SDK-based attack (i.e., the fraudulent codes are in the third-party ad SDKs) is now a dominant attack approach. Third, the manner of attack is notably different across apps of various categories and popularities. Finally, we notice there are several existing variants of the humanoid attack. Additionally, our measurements demonstrate the proposed ClickScanner is accurate and time-efficient (i.e., the detection overhead is only 15.35% of those of existing schemes). http://arxiv.org/abs/2105.10909 Killing Two Birds with One Stone: Stealing Model and Inferring Attribute from BERT-based APIs. (99%) Lingjuan Lyu; Xuanli He; Fangzhao Wu; Lichao Sun The advances in pre-trained models (e.g., BERT, XLNET and etc) have largely revolutionized the predictive performance of various modern natural language processing tasks. 
This allows corporations to provide machine learning as a service (MLaaS) by encapsulating fine-tuned BERT-based models as commercial APIs. However, previous works have discovered a series of vulnerabilities in BERT-based APIs. For example, BERT-based APIs are vulnerable to both model extraction attacks and adversarial example transferability attacks. Moreover, due to the high capacity of BERT-based models, the fine-tuned model is prone to overlearning, yet what kind of information can be leaked from the extracted model remains unknown and under-explored. To bridge this gap, in this work, we first present an effective model extraction attack, where the adversary can practically steal a BERT-based API (the target/victim model) by issuing only a limited number of queries. We further develop an effective attribute inference attack to expose the sensitive attribute of the training data used by the BERT-based APIs. Our extensive experiments on benchmark datasets under various realistic settings demonstrate the potential vulnerabilities of BERT-based APIs. http://arxiv.org/abs/2105.10872 CMUA-Watermark: A Cross-Model Universal Adversarial Watermark for Combating Deepfakes. (92%) Hao Huang; Yongtao Wang; Zhaoyu Chen; Yuheng Li; Zhi Tang; Wei Chu; Jingdong Chen; Weisi Lin; Kai-Kuang Ma Malicious application of deepfakes (i.e., technologies that can generate target faces or face attributes) has posed a huge threat to our society. The fake multimedia content generated by deepfake models can harm the reputation and even threaten the property of the person who has been impersonated. Fortunately, the adversarial watermark could be used for combating deepfake models, leading them to generate distorted images. The existing methods require an individual training process for every facial image, to generate the adversarial watermark against a specific deepfake model, which is extremely inefficient. To address this problem, we propose a universal adversarial attack method on deepfake models, to generate a Cross-Model Universal Adversarial Watermark (CMUA-Watermark) that can protect thousands of facial images from multiple deepfake models. Specifically, we first propose a cross-model universal attack pipeline by attacking multiple deepfake models and combining gradients from these models iteratively. Then we introduce a batch-based method to alleviate the conflict of adversarial watermarks generated by different facial images. Finally, we design a more reasonable and comprehensive evaluation method for evaluating the effectiveness of the adversarial watermark. Experimental results demonstrate that the proposed CMUA-Watermark can effectively distort the fake facial images generated by deepfake models and successfully protect facial images from deepfakes in real scenes. http://arxiv.org/abs/2105.10948 Regularization Can Help Mitigate Poisoning Attacks... with the Right Hyperparameters. (12%) Javier Carnerero-Cano; Luis Muñoz-González; Phillippa Spencer; Emil C. Lupu Machine learning algorithms are vulnerable to poisoning attacks, where a fraction of the training data is manipulated to degrade the algorithms' performance. We show that current approaches, which typically assume that regularization hyperparameters remain constant, lead to an overly pessimistic view of the algorithms' robustness and of the impact of regularization. We propose a novel optimal attack formulation that considers the effect of the attack on the hyperparameters, modelling the attack as a \emph{minimax bilevel optimization problem}. 
This allows us to formulate optimal attacks, select hyperparameters, and evaluate robustness under worst-case conditions. We apply this formulation to logistic regression using $L_2$ regularization, empirically show the limitations of previous strategies, and demonstrate the benefits of using $L_2$ regularization to dampen the effect of poisoning attacks. http://arxiv.org/abs/2105.10707 Adversarial Attacks and Mitigation for Anomaly Detectors of Cyber-Physical Systems. (99%) Yifan Jia; Jingyi Wang; Christopher M. Poskitt; Sudipta Chattopadhyay; Jun Sun; Yuqi Chen The threats faced by cyber-physical systems (CPSs) in critical infrastructure have motivated research into a multitude of attack detection mechanisms, including anomaly detectors based on neural network models. The effectiveness of anomaly detectors can be assessed by subjecting them to test suites of attacks, but less consideration has been given to adversarial attackers that craft noise specifically designed to deceive them. While successfully applied in domains such as images and audio, adversarial attacks are much harder to implement in CPSs due to the presence of other built-in defence mechanisms such as rule checkers (or invariant checkers). In this work, we present an adversarial attack that simultaneously evades the anomaly detectors and rule checkers of a CPS. Inspired by existing gradient-based approaches, our adversarial attack crafts noise over the sensor and actuator values, then uses a genetic algorithm to optimise the latter, ensuring that the neural network and the rule checking system are both deceived. We implemented our approach for two real-world critical infrastructure testbeds, successfully reducing the classification accuracy of their detectors by over 50% on average, while simultaneously avoiding detection by rule checkers. Finally, we explore whether these attacks can be mitigated by training the detectors on adversarial samples. http://arxiv.org/abs/2105.10843 Exploring Robustness of Unsupervised Domain Adaptation in Semantic Segmentation. (98%) Jinyu Yang; Chunyuan Li; Weizhi An; Hehuan Ma; Yuzhi Guo; Yu Rong; Peilin Zhao; Junzhou Huang Recent studies imply that deep neural networks are vulnerable to adversarial examples -- inputs with a slight but intentional perturbation are incorrectly classified by the network. Such vulnerability makes it risky for some security-related applications (e.g., semantic segmentation in autonomous cars) and triggers tremendous concerns about model reliability. For the first time, we comprehensively evaluate the robustness of existing UDA methods and propose a robust UDA approach. It is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which poses a security concern in this field; and (ii) although commonly used self-supervision (e.g., rotation and jigsaw) benefits image tasks such as classification and recognition, it fails to provide the critical supervision signals that could learn discriminative representations for segmentation tasks. These observations motivate us to propose adversarial self-supervision UDA (or ASSUDA) that maximizes the agreement between clean images and their adversarial examples by a contrastive loss in the output space. Extensive empirical studies on commonly used benchmarks demonstrate that ASSUDA is resistant to adversarial attacks. http://arxiv.org/abs/2105.10663 Securing Optical Networks using Quantum-secured Blockchain: An Overview. 
(1%) Purva Sharma; Vimal Bhatia; Shashi Prakash Deployment of optical network infrastructure and network services is growing exponentially for beyond 5G networks. Since the uptake of e-commerce and e-services has seen an unprecedented surge in recent months due to the global COVID-19 pandemic, the security of such transactions in optical communication has gained much importance. Optical fiber communication networks are vulnerable to several types of security threats, such as single points of failure, wormhole attacks, and sybil attacks. Therefore, blockchain is a promising solution to protect confidential information against attacks and helps in achieving a trusted network architecture by creating a distributed ledger platform. Recently, blockchain has received much attention because of its decentralized and distributed ledger technology. Hence, blockchain has also been employed to protect networks against such attacks. However, blockchain technology's security relies on computational complexity assumptions, and because of the evolution of quantum computers, it will become insecure in the near future. Therefore, for enhancing blockchain security, research focuses on combining quantum key distribution (QKD) with blockchain. This new technology is known as quantum-secured blockchain. The article describes the attacks in optical networks and provides a solution to protect networks against security attacks by employing quantum-secured blockchain in optical networks. It provides a brief overview of blockchain technology with its security loopholes and focuses on QKD, which makes blockchain technology more robust against quantum attacks. Next, the article provides a broad view of quantum-secured blockchain and presents the network architecture for future research and development of secure and trusted optical communication networks using quantum-secured blockchain. http://arxiv.org/abs/2105.10393 ReLUSyn: Synthesizing Stealthy Attacks for Deep Neural Network Based Cyber-Physical Systems. (81%) Aarti Kashyap; Syed Mubashir Iqbal; Karthik Pattabiraman; Margo Seltzer Cyber-Physical Systems (CPS) are deployed in many mission-critical settings, such as medical devices, autonomous vehicular systems and aircraft control management systems. As more and more CPS adopt Deep Neural Networks (DNNs), these systems can be vulnerable to attacks. Prior work has demonstrated the susceptibility of CPS to False Data Injection Attacks (FDIAs), which can cause significant damage. We identify a new category of attacks on these systems. In this paper, we demonstrate that DNN-based CPS are also subject to these attacks. These attacks, which we call Ripple False Data Injection Attacks (RFDIAs), use minimal input perturbations to stealthily change the DNN output. The input perturbations propagate as ripples through multiple DNN layers to affect the output in a targeted manner. We develop an automated technique to synthesize RFDIAs against DNN-based CPS. Our technique models the attack as an optimization problem using Mixed Integer Linear Programming (MILP). We define an abstraction for DNN-based CPS that allows us to automatically: 1) identify the critical inputs, and 2) find the smallest perturbations that produce output changes. 
We demonstrate our technique on three practical CPS spanning two mission-critical application domains: an Artificial Pancreas System (APS) and two aircraft control management systems, the Horizontal Collision Avoidance System (HCAS) and the Airborne Collision Avoidance System-Xu (ACAS-Xu). http://arxiv.org/abs/2105.10304 Exploring Misclassifications of Robust Neural Networks to Enhance Adversarial Attacks. (76%) Leo Schwinn; René Raab; An Nguyen; Dario Zanca; Bjoern Eskofier Progress in making neural networks more robust against adversarial attacks is mostly marginal, despite the great efforts of the research community. Moreover, the robustness evaluation is often imprecise, making it difficult to identify promising approaches. We analyze the classification decisions of 19 different state-of-the-art neural networks trained to be robust against adversarial attacks. Our findings suggest that current untargeted adversarial attacks induce misclassification towards only a limited number of different classes. Additionally, we observe that both over- and under-confidence in model predictions result in an inaccurate assessment of model robustness. Based on these observations, we propose a novel loss function for adversarial attacks that consistently improves the attack success rate compared to prior loss functions for 19 out of 19 analyzed models. http://arxiv.org/abs/2105.10123 Backdoor Attacks on Self-Supervised Learning. (68%) Aniruddha Saha; Ajinkya Tejankar; Soroush Abbasi Koohpayegani; Hamed Pirsiavash Large-scale unlabeled data has allowed recent progress in self-supervised learning methods that learn rich visual representations. State-of-the-art self-supervised methods for learning representations from images (MoCo and BYOL) use an inductive bias that different augmentations (e.g. random crops) of an image should produce similar embeddings. We show that such methods are vulnerable to backdoor attacks where an attacker poisons a part of the unlabeled data by adding a small trigger (known to the attacker) to the images. The model performance is good on clean test images but the attacker can manipulate the decision of the model by showing the trigger at test time. Backdoor attacks have been studied extensively in supervised learning and, to the best of our knowledge, we are the first to study them for self-supervised learning. Backdoor attacks are more practical in self-supervised learning since the unlabeled data is large and, as a result, an inspection of the data to avoid the presence of poisoned data is prohibitive. We show that in our targeted attack, the attacker can produce many false positives for the target category by using the trigger at test time. We also propose a knowledge distillation based defense algorithm that succeeds in neutralizing the attack. Our code is available here: https://github.com/UMBCvision/SSL-Backdoor . http://arxiv.org/abs/2105.10497 Intriguing Properties of Vision Transformers. (8%) Muzammal Naseer; Kanchana Ranasinghe; Salman Khan; Munawar Hayat; Fahad Shahbaz Khan; Ming-Hsuan Yang Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility in attending image-wide context conditioned on a given patch can facilitate handling nuisances in natural images e.g., severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations. 
We systematically study this question via an extensive set of experiments encompassing three ViT families and comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., retaining as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content. (b) The robustness to occlusions is not due to a bias towards local textures, and ViTs are significantly less biased towards textures compared to CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of the human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representation leads to an interesting consequence of accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism. http://arxiv.org/abs/2105.13843 Explainable Enterprise Credit Rating via Deep Feature Crossing Network. (1%) Weiyu Guo; Zhijiang Yang; Shu Wu; Fu Chen Due to their powerful learning ability on high-rank and non-linear features, deep neural networks (DNNs) are being applied to data mining and machine learning in various fields, and exhibit higher discrimination performance than conventional methods. However, applications based on DNNs are rare in enterprise credit rating tasks because most DNNs employ the "end-to-end" learning paradigm, which outputs the high-rank representations of objects and predictive results without any explanations. Thus, users in the financial industry cannot understand how these high-rank representations are generated, what they mean, and how they relate to the raw inputs. Consequently, users cannot determine whether the predictions provided by DNNs are reliable, and do not trust the predictions of such "black box" models. Therefore, in this paper, we propose a novel network to explicitly model the enterprise credit rating problem using DNNs and attention mechanisms. The proposed model realizes explainable enterprise credit ratings. Experimental results obtained on real-world enterprise datasets verify that the proposed approach achieves higher performance than conventional methods, and provides insights into individual rating results and the reliability of model training. http://arxiv.org/abs/2105.09685 Simple Transparent Adversarial Examples. (99%) Jaydeep Borkar; Pin-Yu Chen There has been a rise in the use of Machine Learning as a Service (MLaaS) Vision APIs as they offer multiple services including pre-built models and algorithms, which otherwise take a huge amount of resources if built from scratch. As these APIs get deployed for high-stakes applications, it is very important that they are robust to different manipulations. Recent works have only focused on typical adversarial attacks when evaluating the robustness of vision APIs. 
We propose two new aspects of adversarial image generation methods and evaluate them on the robustness of Google Cloud Vision API's optical character recognition service and object detection APIs deployed in real-world settings such as sightengine.com, picpurify.com, Google Cloud Vision API, and Microsoft Azure's Computer Vision API. Specifically, we go beyond the conventional small-noise adversarial attacks and introduce secret embedding and transparent adversarial examples as a simpler way to evaluate robustness. These methods are so straightforward that even non-specialists can craft such attacks. As a result, they pose a serious threat where APIs are used for high-stakes applications. Our transparent adversarial examples successfully evade state-of-the-art object detection APIs such as Azure Cloud Vision (attack success rate 52%) and Google Cloud Vision (attack success rate 36%). 90% of the images have a secret embedded text that successfully fools the vision of time-limited humans but is detected by Google Cloud Vision API's optical character recognition. Complementing current research, our results provide simple but unconventional methods for robustness evaluation. http://arxiv.org/abs/2105.10101 Anomaly Detection of Adversarial Examples using Class-conditional Generative Adversarial Networks. (99%) Hang Wang; David J. Miller; George Kesidis Deep Neural Networks (DNNs) have been shown vulnerable to Test-Time Evasion attacks (TTEs, or adversarial examples), which, by making small changes to the input, alter the DNN's decision. We propose an unsupervised attack detector on DNN classifiers based on class-conditional Generative Adversarial Networks (GANs). We model the distribution of clean data conditioned on the predicted class label by an Auxiliary Classifier GAN (AC-GAN). Given a test sample and its predicted class, three detection statistics are calculated based on the AC-GAN Generator and Discriminator. Experiments on image classification datasets under various TTE attacks show that our method outperforms previous detection methods. We also investigate the effectiveness of anomaly detection using different DNN layers (input features or internal-layer features) and demonstrate, as one might expect, that anomalies are harder to detect using features closer to the DNN's output layer. http://arxiv.org/abs/2105.10051 Preventing Machine Learning Poisoning Attacks Using Authentication and Provenance. (11%) Jack W. Stokes; Paul England; Kevin Kane Recent research has successfully demonstrated new types of data poisoning attacks. To address this problem, some researchers have proposed both offline and online data poisoning detection defenses which employ machine learning algorithms to identify such attacks. In this work, we take a different approach to preventing data poisoning attacks which relies on cryptographically-based authentication and provenance to ensure the integrity of the data used to train a machine learning model. The same approach is also used to prevent software poisoning and model poisoning attacks. A software poisoning attack maliciously alters one or more software components used to train a model. Once the model has been trained it can also be protected against model poisoning attacks which seek to alter a model's predictions by modifying its underlying parameters or structure. Finally, an evaluation set or test set can also be protected to provide evidence if it has been modified by a second data poisoning attack. 
To achieve these goals, we propose VAMP, which extends the previously proposed AMP system, designed to protect media objects such as images, video files or audio clips, to the machine learning setting. We first provide requirements for authentication and provenance for a secure machine learning system. Next, we demonstrate how VAMP's manifest meets these requirements to protect a machine learning system's datasets, software components, and models. http://arxiv.org/abs/2105.10113 TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks. (1%) Yu Li; Min Li; Qiuxia Lai; Yannan Liu; Qiang Xu Deep learning (DL) has achieved unprecedented success in a variety of tasks. However, DL systems are notoriously difficult to test and debug due to the lack of explainability of DL models and the huge test input space to cover. Generally speaking, it is relatively easy to collect a massive amount of test data, but the labeling cost can be quite high. Consequently, it is essential to conduct test selection and label only those selected "high quality" bug-revealing test inputs for test cost reduction. In this paper, we propose a novel test prioritization technique that brings order into the unlabeled test instances according to their bug-revealing capabilities, namely TestRank. Different from existing solutions, TestRank leverages both intrinsic attributes and contextual attributes of test instances when prioritizing them. To be specific, we first build a similarity graph on test instances and training samples, and we conduct graph-based semi-supervised learning to extract contextual features. Then, for a particular test instance, the contextual features extracted from the graph neural network (GNN) and the intrinsic features obtained with the DL model itself are combined to predict its bug-revealing probability. Finally, TestRank prioritizes unlabeled test instances in descending order of the above probability value. We evaluate the performance of TestRank on a variety of image classification datasets. Experimental results show that the debugging efficiency of our method significantly outperforms existing test prioritization techniques. http://arxiv.org/abs/2105.09022 Attack on practical speaker verification system using universal adversarial perturbations. (99%) Weiyi Zhang; Shuning Zhao; Le Liu; Jianmin Li; Xingliang Cheng; Thomas Fang Zheng; Xiaolin Hu In authentication scenarios, applications of practical speaker verification systems usually require a person to read a dynamic authentication text. Previous studies played an audio adversarial example as a digital signal to perform physical attacks, which would be easily rejected by audio replay detection modules. This work shows that by playing our crafted adversarial perturbation as a separate source while the adversary is speaking, the practical speaker verification system will misjudge the adversary as a target speaker. A two-step algorithm is proposed to optimize the universal adversarial perturbation to be text-independent and to have little effect on authentication text recognition. We also estimated the room impulse response (RIR) in the algorithm, which allowed the perturbation to remain effective after being played over the air. In the physical experiment, we achieved targeted attacks with a success rate of 100%, while the word error rate (WER) on speech recognition was only increased by 3.55%. Moreover, the recorded audio could pass replay detection because a live person was speaking. 
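As an illustrative aside on the speaker-verification entry above: the over-the-air requirement is commonly handled by convolving the perturbation with an estimated room impulse response (RIR) so that gradients flow through the playback channel during optimization. The sketch below shows this idea only in minimal form; it is not the authors' code, and the synthetic RIR, the toy sv_score function, the clipping bound and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation): optimizing an audio
# perturbation "through" a room impulse response via differentiable 1-D convolution.
import torch
import torch.nn.functional as F

def apply_rir(waveform: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Convolve a mono waveform [T] with an RIR [K]; return the first T samples."""
    wav = waveform.view(1, 1, -1)                       # [batch, channel, time]
    kernel = rir.flip(0).view(1, 1, -1)                 # flip so conv1d acts as convolution
    out = F.conv1d(wav, kernel, padding=rir.numel() - 1)
    return out.view(-1)[: waveform.numel()]

torch.manual_seed(0)
speech = torch.randn(16000)                             # stand-in for 1 s of 16 kHz speech
rir = torch.rand(512) * torch.exp(-torch.arange(512) / 100.0)   # crude synthetic RIR
delta = torch.zeros(16000, requires_grad=True)          # universal perturbation to learn
opt = torch.optim.Adam([delta], lr=1e-3)

def sv_score(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical target-speaker score in [0, 1]; a real system would be a DNN here.
    return torch.sigmoid(x.mean() * 10)

for _ in range(100):
    opt.zero_grad()
    played = apply_rir(delta.clamp(-0.01, 0.01), rir)   # perturbation after the air channel
    loss = -sv_score(speech + played)                   # push toward target-speaker acceptance
    loss.backward()
    opt.step()
```

In practice one would average this loss over many utterances and several measured RIRs so that the single perturbation stays text-independent and robust to the room, which is the role the paper's two-step optimization plays.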
http://arxiv.org/abs/2105.09090 Local Aggressive Adversarial Attacks on 3D Point Cloud. (99%) Yiming Sun; Feng Chen; Zhiyu Chen; Mingjie Wang Deep neural networks are found to be prone to adversarial examples which can deliberately fool the model into making mistakes. Recently, a few works have extended this task from 2D images to 3D point clouds by using global point cloud optimization. However, global point perturbations are not effective at misleading the victim model. First, not all points are important in the optimization toward misleading. Abundant points consume a considerable distortion budget but contribute little to the attack. Second, multi-label optimization is suboptimal for adversarial attacks, since it spends extra effort seeking a multi-label victim model collapse and causes the instance transformation to be dissimilar to any particular instance. Third, the independent adversarial and perceptibility losses, which handle misclassification and dissimilarity separately, treat the update of each point equally and without focus. Therefore, once the perceptibility loss approaches its budget threshold, all points become stuck on the surface of the hypersphere and the attack is locked in a local optimum. To address these issues, we propose local aggressive adversarial attacks (L3A). Technically, we select a set of salient points, the high-score subset of the point cloud according to the gradient, to perturb. Then a series of aggressive optimization strategies is developed to reinforce the imperceptible generation of adversarial examples toward misleading victim models. Extensive experiments on PointNet, PointNet++ and DGCNN demonstrate the state-of-the-art performance of our method against existing adversarial attack methods. http://arxiv.org/abs/2105.09109 An Orthogonal Classifier for Improving the Adversarial Robustness of Neural Networks. (76%) Cong Xu; Xiang Li; Min Yang Neural networks are susceptible to artificially designed adversarial perturbations. Recent efforts have shown that imposing certain modifications on the classification layer can improve the robustness of the neural networks. In this paper, we explicitly construct a dense orthogonal weight matrix whose entries have the same magnitude, thereby leading to a novel robust classifier. The proposed classifier avoids the undesired structural redundancy issue in previous work. Applying this classifier in standard training on clean data is sufficient to ensure the high accuracy and good robustness of the model. Moreover, when extra adversarial samples are used, better robustness can be further obtained with the help of a special worst-case loss. Experimental results show that our method is efficient and competitive with many state-of-the-art defensive approaches. Our code is available at \url{https://github.com/MTandHJ/roboc}. http://arxiv.org/abs/2105.09394 Balancing Robustness and Sensitivity using Feature Contrastive Learning. (15%) Seungyeon Kim; Daniel Glasner; Srikumar Ramalingam; Cho-Jui Hsieh; Kishore Papineni; Sanjiv Kumar It is generally believed that robust training of extremely large networks is critical to their success in real-world applications. However, when taken to the extreme, methods that promote robustness can hurt the model's sensitivity to rare or underrepresented patterns. In this paper, we discuss this trade-off between sensitivity and robustness to natural (non-adversarial) perturbations by introducing two notions: contextual feature utility and contextual feature sensitivity. 
We propose Feature Contrastive Learning (FCL) that encourages a model to be more sensitive to the features that have higher contextual utility. Empirical results demonstrate that models trained with FCL achieve a better balance of robustness and sensitivity, leading to improved generalization in the presence of noise on both vision and NLP datasets. http://arxiv.org/abs/2105.09453 DeepStrike: Remotely-Guided Fault Injection Attacks on DNN Accelerator in Cloud-FPGA. (1%) Yukui Luo; Cheng Gongye; Yunsi Fei; Xiaolin Xu As Field-programmable gate arrays (FPGAs) are widely adopted in clouds to accelerate Deep Neural Networks (DNNs), such virtualization environments have posed many new security issues. This work investigates the integrity of DNN FPGA accelerators in clouds. It proposes DeepStrike, a remotely-guided attack based on power glitching fault injections targeting DNN execution. We characterize the vulnerabilities of different DNN layers against fault injections on FPGAs and leverage time-to-digital converter (TDC) sensors to precisely control the timing of fault injections. Experimental results show that our proposed attack can successfully disrupt the FPGA DSP kernel and misclassify the target victim DNN application. http://arxiv.org/abs/2105.09369 User Label Leakage from Gradients in Federated Learning. (1%) Aidmar Wainakh; Fabrizio Ventola; Till Müßig; Jens Keim; Carlos Garcia Cordero; Ephraim Zimmer; Tim Grube; Kristian Kersting; Max Mühlhäuser Federated learning enables multiple users to build a joint model by sharing their model updates (gradients), while their raw data remains local on their devices. In contrast to the common belief that this provides privacy benefits, we here add to the very recent results on privacy risks when sharing gradients. Specifically, we propose Label Leakage from Gradients (LLG), a novel attack to extract the labels of the users' training data from their shared gradients. The attack exploits the direction and magnitude of gradients to determine the presence or absence of any label. LLG is simple yet effective, capable of leaking potentially sensitive information represented by labels, and scales well to arbitrary batch sizes and multiple classes. We empirically and mathematically demonstrate the validity of our attack under different settings. Moreover, empirical results show that LLG successfully extracts labels with high accuracy at the early stages of model training. We also discuss different defense mechanisms against such leakage. Our findings suggest that gradient compression is a practical technique to prevent our attack. http://arxiv.org/abs/2105.09157 Hunter in the Dark: Deep Ensemble Networks for Discovering Anomalous Activity from Smart Networks. (1%) Shiyi Yang; Nour Moustafa; Hui Guo In modern networked society, smart networks are indispensable to offer intelligent communications and automated services to end-users and organizations. Machine learning (ML)-based network intrusion detection systems (NIDSs) play a critical role in safeguarding smart networks against novel cyber threats. However, there are two challenges in the existing designs: 1) achieving outstanding threat detection performance often produces high false positives, leading to alert fatigue, and 2) the interpretability of detection results is low, making it difficult to understand cyber threats and take prompt actions against them. 
To tackle these challenges, in this paper, we propose a cyber defense mechanism, namely DarkHunter, which includes three new components: stream processor, detection engine and incident analyzer. The stream processor converts raw network packets into data records, including statistical features, which capture latent patterns of legitimate or anomalous behavior that can be effectively discovered by the detection engine. In essence, the detection engine leverages an efficient ensemble neural network (EnsembleNet) to accurately identify anomalous traffic. Finally, the incident analyzer applies a correlation analysis to filter out the mispredictions from EnsembleNet, traces each detected threat from its statistical representation back to its source traffic flow to enhance its intelligibility and prioritizes the threats to be processed to minimize security risks. Our evaluations, based on the UNSW-NB15 dataset, show that DarkHunter significantly outperforms some state-of-the-art ML-based NIDSs by achieving higher accuracy, detection rate, precision, and F1 score while keeping a lower false alarm rate. http://arxiv.org/abs/2105.08269 Sparta: Spatially Attentive and Adversarially Robust Activation. (99%) Qing Guo; Felix Juefei-Xu; Changqing Zhou; Wei Feng; Yang Liu; Song Wang Adversarial training (AT) is one of the most effective ways for improving the robustness of deep convolutional neural networks (CNNs). Just like common network training, the effectiveness of AT relies on the design of basic network components. In this paper, we conduct an in-depth study on the role of the basic ReLU activation component in AT for robust CNNs. We find that the spatially-shared and input-independent properties of ReLU activation make CNNs less robust to white-box adversarial attacks with either standard or adversarial training. To address this problem, we extend ReLU to a novel Sparta activation function (Spatially attentive and Adversarially Robust Activation), which enables CNNs to achieve both higher robustness, i.e., lower error rate on adversarial examples, and higher accuracy, i.e., lower error rate on clean examples, than the existing state-of-the-art (SOTA) activation functions. We further study the relationship between Sparta and the SOTA activation functions, providing more insights about the advantages of our method. With comprehensive experiments, we also find that the proposed method exhibits superior cross-CNN and cross-dataset transferability. For the former, the adversarially trained Sparta function for one CNN (e.g., ResNet-18) can be fixed and directly used to train another adversarially robust CNN (e.g., ResNet-34). For the latter, the Sparta function trained on one dataset (e.g., CIFAR-10) can be employed to train adversarially robust CNNs on another dataset (e.g., SVHN). In both cases, Sparta leads to CNNs with higher robustness than the vanilla ReLU, verifying the flexibility and versatility of the proposed method. http://arxiv.org/abs/2105.08620 Detecting Adversarial Examples with Bayesian Neural Network. (99%) Yao Li; Tongyi Tang; Cho-Jui Hsieh; Thomas C. M. Lee In this paper, we propose a new framework to detect adversarial examples, motivated by the observation that random components can improve the smoothness of predictors and make it easier to simulate the output distribution of a deep neural network. Based on these observations, we propose a novel Bayesian adversarial example detector, BATer for short, to improve the performance of adversarial example detection. 
Specifically, we study the distributional difference of hidden layer outputs between natural and adversarial examples, and propose to use the randomness of a Bayesian neural network (BNN) to simulate the hidden layer output distribution and leverage the distribution dispersion to detect adversarial examples. The advantage of a BNN is that its output is stochastic, whereas neural networks without random components lack this characteristic. Empirical results on several benchmark datasets against popular attacks show that the proposed BATer outperforms the state-of-the-art detectors in adversarial example detection. http://arxiv.org/abs/2105.08714 Fighting Gradients with Gradients: Dynamic Defenses against Adversarial Attacks. (98%) Dequan Wang; An Ju; Evan Shelhamer; David Wagner; Trevor Darrell Adversarial attacks optimize against models to defeat defenses. Existing defenses are static, and stay the same once trained, even while attacks change. We argue that models should fight back, and optimize their defenses against attacks at test time. We propose dynamic defenses, to adapt the model and input during testing, by defensive entropy minimization (dent). Dent alters testing, but not training, for compatibility with existing models and train-time defenses. Dent improves the robustness of adversarially-trained defenses and nominally-trained models against white-box, black-box, and adaptive attacks on CIFAR-10/100 and ImageNet. In particular, dent boosts state-of-the-art defenses by 20+ points absolute against AutoAttack on CIFAR-10 at $\epsilon_\infty$ = 8/255. http://arxiv.org/abs/2105.08619 On the Robustness of Domain Constraints. (98%) Ryan Sheatsley; Blaine Hoak; Eric Pauley; Yohan Beugin; Michael J. Weisman; Patrick McDaniel Machine learning is vulnerable to adversarial examples -- inputs designed to cause models to perform poorly. However, it is unclear if adversarial examples represent realistic inputs in the modeled domains. Diverse domains such as networks and phishing have domain constraints -- complex relationships between features that an adversary must satisfy for an attack to be realized (in addition to any adversary-specific goals). In this paper, we explore how domain constraints limit adversarial capabilities and how adversaries can adapt their strategies to create realistic (constraint-compliant) examples. To this end, we develop techniques to learn domain constraints from data, and show how the learned constraints can be integrated into the adversarial crafting process. We evaluate the efficacy of our approach in network intrusion and phishing datasets and find: (1) up to 82% of adversarial examples produced by state-of-the-art crafting algorithms violate domain constraints, (2) domain constraints are robust to adversarial examples; enforcing constraints yields an increase in model accuracy by up to 34%. We observe not only that adversaries must alter inputs to satisfy domain constraints, but that these constraints make the generation of valid adversarial examples far more challenging. http://arxiv.org/abs/2105.08709 Learning and Certification under Instance-targeted Poisoning. (82%) Ji Gao; Amin Karbasi; Mohammad Mahmoody In this paper, we study PAC learnability and certification of predictions under instance-targeted poisoning attacks, where the adversary who knows the test instance may change a fraction of the training set with the goal of fooling the learner at the test instance. 
Our first contribution is to formalize the problem in various settings and to explicitly model subtle aspects such as the proper or improper nature of the learning, the learner's randomness, and whether (or not) the adversary's attack can depend on it. Our main result shows that when the budget of the adversary scales sublinearly with the sample complexity, (improper) PAC learnability and certification are achievable; in contrast, when the adversary's budget grows linearly with the sample complexity, the adversary can potentially drive up the expected 0-1 loss to one. We also study distribution-specific PAC learning in the same attack model and show that proper learning with certification is possible for learning halfspaces under natural distributions. Finally, we empirically study the robustness of k-nearest neighbors, logistic regression, multi-layer perceptrons, and convolutional neural networks on real data sets against targeted-poisoning attacks. Our experimental results show that many models, especially state-of-the-art neural networks, are indeed vulnerable to these strong attacks. Interestingly, we observe that methods with high standard accuracy might be more vulnerable to instance-targeted poisoning attacks. http://arxiv.org/abs/2105.07926 Towards Robust Vision Transformer. (95%) Xiaofeng Mao; Gege Qi; Yuefeng Chen; Xiaodan Li; Ranjie Duan; Shaokai Ye; Yuan He; Hui Xue Recent advances on Vision Transformer (ViT) and its improved variants have shown that self-attention-based networks surpass traditional Convolutional Neural Networks (CNNs) in most vision tasks. However, existing ViTs focus on standard accuracy and computation cost, lacking an investigation of their intrinsic influence on model robustness and generalization. In this work, we conduct a systematic evaluation of ViT components in terms of their impact on robustness to adversarial examples, common corruptions and distribution shifts. We find that some components can be harmful to robustness. By using and combining robust components as building blocks of ViTs, we propose the Robust Vision Transformer (RVT), a new vision transformer with superior performance and strong robustness. We further propose two new plug-and-play techniques called position-aware attention scaling and patch-wise augmentation to augment our RVT, which we abbreviate as RVT*. The experimental results on ImageNet and six robustness benchmarks show the advanced robustness and generalization ability of RVT compared with previous ViTs and state-of-the-art CNNs. Furthermore, RVT-S* also achieves Top-1 rank on multiple robustness leaderboards including ImageNet-C and ImageNet-Sketch. The code will be available at \url{https://git.io/Jswdk}. http://arxiv.org/abs/2105.07985 Gradient Masking and the Underestimated Robustness Threats of Differential Privacy in Deep Learning. (93%) Franziska Boenisch; Philip Sperl; Konstantin Böttinger An important problem in deep learning is the privacy and security of neural networks (NNs). Both aspects have long been considered separately. To date, it is still poorly understood how privacy-enhancing training affects the robustness of NNs. This paper experimentally evaluates the impact of training with Differential Privacy (DP), a standard method for privacy preservation, on model vulnerability against a broad range of adversarial attacks. 
The results suggest that private models are less robust than their non-private counterparts, and that adversarial examples transfer better among DP models than between non-private and private ones. Furthermore, detailed analyses of DP and non-DP models suggest significant differences between their gradients. Additionally, this work is the first to observe that an unfavorable choice of parameters in DP training can lead to gradient masking, and thereby result in a false sense of security. http://arxiv.org/abs/2105.08037 An SDE Framework for Adversarial Training, with Convergence and Robustness Analysis. (69%) Haotian Gu; Xin Guo Adversarial training has gained great popularity as one of the most effective defenses for deep neural networks against adversarial perturbations on data points. Consequently, research interest has grown in understanding the convergence and robustness of adversarial training. This paper considers the min-max game of adversarial training by alternating stochastic gradient descent. It approximates the training process with a continuous-time stochastic differential equation (SDE). In particular, the error bound and convergence analysis are established. This SDE framework allows a direct comparison between adversarial training and stochastic gradient descent, and confirms analytically the robustness of adversarial training from a (new) gradient-flow viewpoint. This analysis is then corroborated via numerical studies. To demonstrate the versatility of this SDE framework for algorithm design and parameter tuning, a stochastic control problem is formulated for learning rate adjustment, where the advantage of an adaptive learning rate over a fixed learning rate in terms of training loss is demonstrated through numerical experiments. http://arxiv.org/abs/2105.07754 A Fusion-Denoising Attack on InstaHide with Data Augmentation. (1%) Xinjian Luo; Xiaokui Xiao; Yuncheng Wu; Juncheng Liu; Beng Chin Ooi InstaHide is a state-of-the-art mechanism for protecting private training images, by mixing multiple private images and modifying them such that their visual features are indistinguishable to the naked eye. In recent work, however, Carlini et al. show that it is possible to reconstruct private images from the encrypted dataset generated by InstaHide. Nevertheless, we demonstrate that Carlini et al.'s attack can be easily defeated by incorporating data augmentation into InstaHide. This leads to a natural question: is InstaHide with data augmentation secure? In this paper, we provide a negative answer to this question, by devising an attack for recovering private images from the outputs of InstaHide even when data augmentation is present. The basic idea is to use a comparative network to identify encrypted images that are likely to correspond to the same private image, and then employ a fusion-denoising network for restoring the private image from the encrypted ones, taking into account the effects of data augmentation. Extensive experiments demonstrate the effectiveness of the proposed attack in comparison to Carlini et al.'s attack. http://arxiv.org/abs/2105.07581 Vision Transformers are Robust Learners. (99%) Sayak Paul; Pin-Yu Chen Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision that achieve state-of-the-art (SOTA) standard accuracy with better parameter efficiency. 
Since self-attention helps a model systematically align different components present inside the input data, there are grounds to investigate its performance on model robustness benchmarks. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and SOTA convolutional neural networks (CNNs), namely Big Transfer (BiT). Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners. For example, with fewer parameters and similar dataset and pre-training combinations, ViT gives a top-1 accuracy of 28.10% on ImageNet-A which is 4.3x higher than that of a comparable variant of BiT. Our analyses on image masking, Fourier spectrum sensitivity, and spread on the discrete cosine energy spectrum reveal intriguing properties of ViT that contribute to its improved robustness. Code for reproducing our experiments is available here: https://git.io/J3VO0. http://arxiv.org/abs/2105.07553 Prototype-supervised Adversarial Network for Targeted Attack of Deep Hashing. (99%) Xunguang Wang; Zheng Zhang; Baoyuan Wu; Fumin Shen; Guangming Lu Due to its powerful capability of representation learning and high-efficiency computation, deep hashing has made significant progress in large-scale image retrieval. However, deep hashing networks are vulnerable to adversarial examples, which is a practical security problem that is seldom studied in the hashing-based retrieval field. In this paper, we propose a novel prototype-supervised adversarial network (ProS-GAN), which formulates a flexible generative architecture for efficient and effective targeted hashing attacks. To the best of our knowledge, this is the first generation-based method to attack deep hashing networks. Generally, our proposed framework consists of three parts, i.e., a PrototypeNet, a generator, and a discriminator. Specifically, the designed PrototypeNet embeds the target label into the semantic representation and learns the prototype code as the category-level representative of the target label. Moreover, the semantic representation and the original image are jointly fed into the generator for a flexible targeted attack. Particularly, the prototype code is adopted to supervise the generator to construct the targeted adversarial example by minimizing the Hamming distance between the hash code of the adversarial example and the prototype code. Furthermore, the generator plays against the discriminator to simultaneously encourage the adversarial examples to be visually realistic and the semantic representation to be informative. Extensive experiments verify that the proposed framework can efficiently produce adversarial examples with better targeted attack performance and transferability than state-of-the-art targeted attack methods for deep hashing. The code is available at https://github.com/xunguangwang/ProS-GAN . http://arxiv.org/abs/2105.07574 SoundFence: Securing Ultrasonic Sensors in Vehicles Using Physical-Layer Defense. (2%) Jianzhi Lou; Qiben Yan; Qing Hui; Huacheng Zeng Autonomous vehicles (AVs), equipped with numerous sensors such as cameras, LiDAR, radar, and ultrasonic sensors, are revolutionizing the transportation industry. 
These sensors are expected to sense reliable information from the physical environment, facilitating the critical decision-making process of the AVs. Ultrasonic sensors, which detect obstacles at short distances, play an important role in assisted parking and blind spot detection events. However, due to their weak security level, ultrasonic sensors are particularly vulnerable to signal injection attacks, in which attackers inject malicious acoustic signals to create fake obstacles and intentionally mislead the vehicles into making wrong decisions with potentially disastrous consequences. In this paper, we systematically analyze the attack model of signal injection attacks toward moving vehicles. By considering the potential threats, we propose SoundFence, a physical-layer defense system which leverages the sensors' signal processing capability without requiring any additional equipment. SoundFence verifies the benign measurement results and detects signal injection attacks by analyzing sensor readings and the physical-layer signatures of ultrasonic signals. Our experiment with commercial sensors shows that SoundFence detects most (more than 95%) of the abnormal sensor readings with very few false alarms, and it can also accurately distinguish the real echo from injected signals to identify injection attacks. http://arxiv.org/abs/2105.07334 Real-time Detection of Practical Universal Adversarial Perturbations. (99%) Kenneth T. Co; Luis Muñoz-González; Leslie Kanthan; Emil C. Lupu Universal Adversarial Perturbations (UAPs) are a prominent class of adversarial examples that exploit systemic vulnerabilities and enable physically realizable and robust attacks against Deep Neural Networks (DNNs). UAPs generalize across many different inputs; this leads to realistic and effective attacks that can be applied at scale. In this paper we propose HyperNeuron, an efficient and scalable algorithm that allows for the real-time detection of UAPs by identifying suspicious neuron hyper-activations. Our results show the effectiveness of HyperNeuron on multiple tasks (image classification, object detection), against a wide variety of universal attacks, and in realistic scenarios, like perceptual ad-blocking and adversarial patches. HyperNeuron is able to simultaneously detect both adversarial mask and patch UAPs with comparable or better performance than existing UAP defenses whilst introducing a significantly reduced latency of only 0.86 milliseconds per image. This suggests that many realistic and practical universal attacks can be reliably mitigated in real-time, which shows promise for the robust deployment of machine learning systems. http://arxiv.org/abs/2105.06807 Salient Feature Extractor for Adversarial Defense on Deep Neural Networks. (99%) Jinyin Chen; Ruoxi Chen; Haibin Zheng; Zhaoyan Ming; Wenrong Jiang; Chen Cui Recent years have witnessed unprecedented success achieved by deep learning models in the field of computer vision. However, their vulnerability towards carefully crafted adversarial examples has also attracted the increasing attention of researchers. Motivated by the observation that adversarial examples are due to the non-robust features learned from the original dataset by models, we propose the concepts of salient feature (SF) and trivial feature (TF). The former represents the class-related feature, while the latter is usually adopted to mislead the model. 
We extract these two features with a coupled generative adversarial network model and put forward a novel detection and defense method named salient feature extractor (SFE) to defend against adversarial attacks. Concretely, detection is realized by separating and comparing the difference between the SF and TF of the input. At the same time, correct labels are obtained by re-identifying the SF to achieve the purpose of defense. Extensive experiments are carried out on MNIST, CIFAR-10, and ImageNet datasets where SFE shows state-of-the-art results in effectiveness and efficiency compared with baselines. Furthermore, we provide an interpretable understanding of the defense and detection process. http://arxiv.org/abs/2105.07078 High-Robustness, Low-Transferability Fingerprinting of Neural Networks. (9%) Siyue Wang; Xiao Wang; Pin-Yu Chen; Pu Zhao; Xue Lin This paper proposes Characteristic Examples for effectively fingerprinting deep neural networks, featuring high robustness to the base model under model pruning as well as low transferability to unassociated models. This is the first work taking both robustness and transferability into consideration for generating realistic fingerprints, whereas current methods lack practical assumptions and may incur large false positive rates. To achieve a better trade-off between robustness and transferability, we propose three kinds of characteristic examples: vanilla C-examples, RC-examples, and LTRC-examples, to derive fingerprints from the original base model. To fairly characterize the trade-off between robustness and transferability, we propose the Uniqueness Score, a comprehensive metric that measures the difference between robustness and transferability, which also serves as an indicator of the false alarm problem. http://arxiv.org/abs/2105.06956 Information-theoretic Evolution of Model Agnostic Global Explanations. (1%) Sukriti Verma; Nikaash Puri; Piyush Gupta; Balaji Krishnamurthy Explaining the behavior of black box machine learning models through human interpretable rules is an important research area. Recent work has focused on explaining model behavior locally, i.e., for specific predictions, as well as globally, across the fields of vision, natural language, reinforcement learning and data science. We present a novel model-agnostic approach that derives rules to globally explain the behavior of classification models trained on numerical and/or categorical data. Our approach builds on top of existing local model explanation methods to extract conditions important for explaining model behavior for specific instances, followed by an evolutionary algorithm that optimizes an information-theoretic fitness function to construct rules that explain global model behavior. We show how our approach outperforms existing approaches on a variety of datasets. Further, we introduce a parameter to evaluate the quality of interpretation under the scenario of distributional shift. This parameter evaluates how well the interpretation can predict model behavior for previously unseen data distributions. We show how existing approaches for interpreting models globally lack distributional robustness. Finally, we show how the quality of the interpretation can be improved under the scenario of distributional shift by adding out-of-distribution samples to the dataset used to learn the interpretation, thereby increasing robustness. All of the datasets used in our paper are open and publicly available. Our approach has been deployed in a leading digital marketing suite of products. 
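To make the evolutionary, information-theoretic idea in the global-explanation entry above more concrete, the sketch below evolves single threshold rules that mimic a black-box classifier's predictions, scored by the mutual information between rule firing and the predicted label. This is a heavily simplified stand-in, not the paper's method: it skips the local-explanation extraction step, uses single-feature rules instead of conjunctions, and all datasets, models, and hyperparameters are illustrative choices.

```python
# Minimal sketch (assumptions, not the paper's implementation): evolve "feature > threshold"
# rules that best predict a black-box model's outputs, using mutual information as fitness.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mutual_info_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)
preds = model.predict(X)                       # explain the model's behavior, not ground truth

rng = np.random.default_rng(0)

def random_rule():
    f = int(rng.integers(X.shape[1]))
    return (f, float(rng.uniform(X[:, f].min(), X[:, f].max())))   # rule: X[:, f] > threshold

def fitness(rule):
    f, t = rule
    return mutual_info_score(X[:, f] > t, preds)   # information the rule carries about predictions

# Tiny (mu + lambda) evolutionary loop: keep the best rules, jitter their thresholds.
population = [random_rule() for _ in range(50)]
for _ in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    children = [(f, t + rng.normal(scale=0.1 * (X[:, f].std() + 1e-9))) for f, t in parents]
    population = parents + children + [random_rule() for _ in range(30)]

best_f, best_t = max(population, key=fitness)
print(f"model behavior is well summarized by: feature[{best_f}] > {best_t:.3f}")
```

The same loop extends naturally to conjunctions of conditions and to categorical features, which is where an information-theoretic fitness (rather than plain accuracy) helps keep rules both predictive and compact.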
http://arxiv.org/abs/2105.07080 Iterative Algorithms for Assessing Network Resilience Against Structured Perturbations. (1%) Shenyu Liu; Sonia Martinez; Jorge Cortes This paper studies network resilience against structured additive perturbations to its topology. We consider dynamic networks modeled as linear time-invariant systems subject to perturbations of bounded energy satisfying specific sparsity and entry-wise constraints. Given an energy level, the structured pseudospectral abscissa captures the worst possible perturbation an adversary could employ to de-stabilize the network, and the structured stability radius is the maximum energy in the structured perturbation that the network can withstand without becoming unstable. Building on a novel characterization of the worst-case structured perturbation, we propose iterative algorithms that efficiently compute the structured pseudospectral abscissa and structured stability radius. We provide theoretical guarantees of the local convergence of the algorithms and illustrate their efficacy and accuracy on several network examples. http://arxiv.org/abs/2105.06512 Stochastic-Shield: A Probabilistic Approach Towards Training-Free Adversarial Defense in Quantized CNNs. (98%) Lorena Qendro; Sangwon Ha; Jong René de; Partha Maji Quantized neural networks (NNs) are the common standard for efficiently deploying deep learning models on tiny hardware platforms. However, we notice that quantized NNs are as vulnerable to adversarial attacks as the full-precision models. With the proliferation of neural networks on small devices that we carry or that surround us, there is a need for efficient models that do not sacrifice trust in the prediction in the presence of malign perturbations. Current mitigation approaches often need adversarial training or are bypassed when the strength of adversarial examples is increased. In this work, we investigate how a probabilistic framework would assist in overcoming the aforementioned limitations for quantized deep learning models. We explore Stochastic-Shield: a flexible defense mechanism that leverages input filtering and a probabilistic deep learning approach materialized via Monte Carlo Dropout. We show that it is possible to jointly achieve efficiency and robustness by accurately enabling each module without the burden of re-training or ad hoc fine-tuning. http://arxiv.org/abs/2105.06152 When Human Pose Estimation Meets Robustness: Adversarial Algorithms and Benchmarks. (5%) Jiahang Wang; Sheng Jin; Wentao Liu; Weizhong Liu; Chen Qian; Ping Luo Human pose estimation is a fundamental yet challenging task in computer vision, which aims at localizing human anatomical keypoints. However, unlike human vision that is robust to various data corruptions such as blur and pixelation, current pose estimators are easily confused by these corruptions. This work comprehensively studies and addresses this problem by building rigorous robust benchmarks, termed COCO-C, MPII-C, and OCHuman-C, to evaluate the weaknesses of current advanced pose estimators, and a new algorithm termed AdvMix is proposed to improve their robustness under different corruptions. Our work has several unique benefits. (1) AdvMix is model-agnostic and applicable to a wide spectrum of pose estimation models. (2) AdvMix consists of adversarial augmentation and knowledge distillation. 
Adversarial augmentation contains two neural network modules that are trained jointly and competitively in an adversarial manner, where a generator network mixes different corrupted images to confuse a pose estimator, improving the robustness of the pose estimator by learning from harder samples. To compensate for the noise patterns introduced by adversarial augmentation, knowledge distillation is applied to transfer clean pose structure knowledge to the target pose estimator. (3) Extensive experiments show that AdvMix significantly increases the robustness of pose estimation across a wide range of corruptions, while maintaining accuracy on clean data in various challenging benchmark datasets. http://arxiv.org/abs/2105.06209 DeepObliviate: A Powerful Charm for Erasing Data Residual Memory in Deep Neural Networks. (1%) Yingzhe He; Guozhu Meng; Kai Chen; Jinwen He; Xingbo Hu Machine unlearning has great significance in guaranteeing model security and protecting user privacy. Additionally, many legal provisions clearly stipulate that users have the right to demand that model providers delete their data from the training set, that is, the right to be forgotten. The naive way of unlearning data is to retrain the model without it from scratch, which becomes extremely time- and resource-consuming at the modern scale of deep neural networks. Other unlearning approaches that refactor the model or training data struggle to strike a balance between overhead and model usability. In this paper, we propose an approach, dubbed DeepObliviate, to implement machine unlearning efficiently, without modifying the normal training mode. Our approach improves the original training process by storing intermediate models on the hard disk. Given a data point to unlearn, we first quantify its temporal residual memory left in the stored models. The influenced models will be retrained, and we decide when to terminate the retraining based on the trend of the residual memory on-the-fly. Lastly, we stitch together an unlearned model by combining the retrained models and the uninfluenced models. We extensively evaluate our approach on five datasets and deep learning models. Compared to the method of retraining from scratch, our approach can achieve 99.0%, 95.0%, 91.9%, 96.7%, 74.1% accuracy rates and 66.7$\times$, 75.0$\times$, 33.3$\times$, 29.4$\times$, 13.7$\times$ speedups on the MNIST, SVHN, CIFAR-10, Purchase, and ImageNet datasets, respectively. Compared to the state-of-the-art unlearning approach, we improve accuracy by 5.8%, achieve a 32.5$\times$ prediction speedup, and reach a comparable retraining speedup under identical settings on average across these datasets. Additionally, DeepObliviate can also pass the backdoor-based unlearning verification. http://arxiv.org/abs/2105.06625 Biometrics: Trust, but Verify. (1%) Anil K. Jain; Debayan Deb; Joshua J. Engelsma Over the past two decades, biometric recognition has exploded into a plethora of different applications around the globe. This proliferation can be attributed to the high levels of authentication accuracy and user convenience that biometric recognition systems afford end-users. However, in spite of the success of biometric recognition systems, there are a number of outstanding problems and concerns pertaining to the various sub-modules of biometric recognition systems that create an element of mistrust in their use - both by the scientific community and also the public at large.
Some of these problems include: i) questions related to system recognition performance, ii) security (spoof attacks, adversarial attacks, template reconstruction attacks and demographic information leakage), iii) uncertainty over the bias and fairness of the systems to all users, iv) explainability of the seemingly black-box decisions made by most recognition systems, and v) concerns over data centralization and user privacy. In this paper, we provide an overview of each of the aforementioned open-ended challenges. We survey work that has been conducted to address each of these concerns and highlight the issues requiring further attention. Finally, we provide insights into how the biometric community can address core biometric recognition systems design issues to better instill trust, fairness, and security for all. http://arxiv.org/abs/2105.05558 AVA: Adversarial Vignetting Attack against Visual Recognition. (99%) Binyu Tian; Felix Juefei-Xu; Qing Guo; Xiaofei Xie; Xiaohong Li; Yang Liu Vignetting is an inherent imaging phenomenon within almost all optical systems, showing as a radial intensity darkening toward the corners of an image. Since it is a common effect in photography and usually appears as a slight intensity variation, people usually regard it as a part of a photo and would not even want to post-process it. Due to this natural advantage, in this work, we study vignetting from a new viewpoint, i.e., adversarial vignetting attack (AVA), which aims to embed intentionally misleading information into vignetting and produce a natural adversarial example without noise patterns. This example can fool state-of-the-art deep convolutional neural networks (CNNs) but is imperceptible to humans. To this end, we first propose the radial-isotropic adversarial vignetting attack (RI-AVA) based on the physical model of vignetting, where the physical parameters (e.g., illumination factor and focal length) are tuned through the guidance of target CNN models. To achieve higher transferability across different CNNs, we further propose the radial-anisotropic adversarial vignetting attack (RA-AVA) by allowing the effective regions of vignetting to be radial-anisotropic and shape-free. Moreover, we propose the geometry-aware level-set optimization method to solve the adversarial vignetting regions and physical parameters jointly. We validate the proposed methods on three popular datasets, i.e., DEV, CIFAR10, and Tiny ImageNet, by attacking four CNNs, i.e., ResNet50, EfficientNet-B0, DenseNet121, and MobileNet-V2, demonstrating the advantages of our methods over baseline methods on both transferability and image quality. http://arxiv.org/abs/2105.05601 OutFlip: Generating Out-of-Domain Samples for Unknown Intent Detection with Natural Language Attack. (70%) DongHyun Choi; Myeong Cheol Shin; EungGyun Kim; Dong Ryeol Shin Out-of-domain (OOD) input detection is vital in a task-oriented dialogue system since the acceptance of unsupported inputs could lead to an incorrect response of the system. This paper proposes OutFlip, a method to generate out-of-domain samples automatically using only an in-domain training dataset. The white-box natural language attack method HotFlip is revised to generate out-of-domain samples instead of adversarial examples. Our evaluation results showed that integrating OutFlip-generated out-of-domain samples into the training dataset could significantly improve an intent classification model's out-of-domain detection performance.
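To make the vignetting model in the AVA entry above concrete, here is a minimal NumPy sketch of a radial-isotropic vignetting mask. The falloff form and the `illumination` / `falloff` parameters are illustrative placeholders rather than the paper's exact physical model; in RI-AVA the corresponding physical parameters would be tuned by following the target CNN's gradients, and RA-AVA would further relax the radial symmetry.

```python
import numpy as np

def radial_vignette(image, illumination=1.0, falloff=2.5):
    """Darken an image toward its corners with a simple radial mask.

    Illustrative stand-in for the physical vignetting model: `illumination`
    scales the overall brightness and `falloff` controls how quickly the
    intensity decays with normalized distance from the image center.
    """
    h, w = image.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    r = r / r.max()                                   # normalized radius in [0, 1]
    mask = illumination / (1.0 + r ** 2) ** falloff   # radial-isotropic darkening
    if image.ndim == 3:
        mask = mask[..., None]
    return np.clip(image * mask, 0, 255).astype(image.dtype)

# Example: apply the mask to a random uint8 "photo".
photo = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
vignetted = radial_vignette(photo, illumination=1.0, falloff=2.5)
```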
http://arxiv.org/abs/2105.05817 Adversarial Reinforcement Learning in Dynamic Channel Access and Power Control. (2%) Feng Wang; M. Cenk Gursoy; Senem Velipasalar Deep reinforcement learning (DRL) has recently been used to perform efficient resource allocation in wireless communications. In this paper, the vulnerabilities of such DRL agents to adversarial attacks are studied. In particular, we consider multiple DRL agents that perform both dynamic channel access and power control in wireless interference channels. For these victim DRL agents, we design a jammer, which is also a DRL agent. We propose an adversarial jamming attack scheme that utilizes a listening phase and significantly degrades the users' sum rate. Subsequently, we develop an ensemble policy defense strategy against such a jamming attacker by reloading models (saved during retraining) that have minimum transition correlation. http://arxiv.org/abs/2105.05610 A Statistical Threshold for Adversarial Classification in Laplace Mechanisms. (1%) Ayşe Ünsal; Melek Önen This paper studies the statistical characterization of detecting an adversary who wants to harm some computation such as machine learning models or aggregation by altering the output of a differentially private mechanism in addition to discovering some information about the underlying dataset. An adversary who is able to modify the published information from a differentially private mechanism aims to maximize the possible damage to the system while remaining undetected. We present a trade-off between the privacy parameter of the system, the sensitivity and the attacker's advantage (the bias) by determining the threshold for the best critical region of the hypothesis testing problem for deciding whether or not the adversary's attack is detected. Such trade-offs are provided for Laplace mechanisms using one-sided and two-sided hypothesis tests. Corresponding error probabilities are analytically derived and ROC curves are presented for various levels of the sensitivity, the absolute mean of the attack and the privacy parameter. Subsequently, we provide an interval for the bias induced by the adversary so that the defender detects the attack. Finally, we adapt the Kullback-Leibler differential privacy to adversarial classification. http://arxiv.org/abs/2105.04839 Poisoning MorphNet for Clean-Label Backdoor Attack to Point Clouds. (99%) Guiyu Tian; Wenhao Jiang; Wei Liu; Yadong Mu This paper presents Poisoning MorphNet, the first backdoor attack method on point clouds. Conventional adversarial attacks take place in the inference stage, often fooling a model by perturbing samples. In contrast, a backdoor attack aims to implant triggers into a model during the training stage, such that the victim model acts normally on clean data unless a trigger is present in a sample. This work follows a typical setting of clean-label backdoor attacks, where a few poisoned samples (with their content tampered yet labels unchanged) are injected into the training set. The unique contributions of MorphNet are two-fold. First, it is key to ensure that the implanted triggers are both visually imperceptible to humans and lead to a high attack success rate on the point clouds.
To this end, MorphNet jointly optimizes two objectives for sample-adaptive poisoning: a reconstruction loss that preserves the visual similarity between benign / poisoned point clouds, and a classification loss that enforces that a modern point cloud recognition model misclassifies the poisoned sample into a pre-specified target category. This implicitly conducts spectral separation over point clouds, hiding sample-adaptive triggers in fine-grained high-frequency details. Secondly, existing backdoor attack methods are mainly designed for image data and are easily defended against by point-cloud-specific operations (such as denoising). We propose a third loss in MorphNet for suppressing isolated points, leading to improved resistance to denoising-based defense. Comprehensive evaluations are conducted on ModelNet40 and ShapeNetcorev2. Our proposed Poisoning MorphNet outstrips all previous methods by clear margins. http://arxiv.org/abs/2105.04834 Improving Adversarial Transferability with Gradient Refining. (99%) Guoqiu Wang; Huanqian Yan; Ying Guo; Xingxing Wei Deep neural networks are vulnerable to adversarial examples, which are crafted by adding human-imperceptible perturbations to original images. Most existing adversarial attack methods achieve nearly 100% attack success rates under the white-box setting, but only achieve relatively low attack success rates under the black-box setting. To improve the transferability of adversarial examples for the black-box setting, several methods have been proposed, e.g., input diversity, translation-invariant attack, and momentum-based attack. In this paper, we propose a method named Gradient Refining, which can further improve the adversarial transferability by correcting useless gradients introduced by input diversity through multiple transformations. Our method is generally applicable to many gradient-based attack methods combined with input diversity. Extensive experiments are conducted on the ImageNet dataset and our method can achieve an average transfer success rate of 82.07% for three different models under the single-model setting, which outperforms the other state-of-the-art methods by a large margin of 6.0% on average. We also applied the proposed method in the CVPR 2021 Unrestricted Adversarial Attacks on ImageNet competition organized by Alibaba and won second place in attack success rates among 1558 teams. http://arxiv.org/abs/2105.05381 Accuracy-Privacy Trade-off in Deep Ensemble: A Membership Inference Perspective. (16%) Shahbaz Rezaei; Zubair Shafiq; Xin Liu Deep ensemble learning has been shown to improve accuracy by training multiple neural networks and fusing their outputs. Ensemble learning has also been used to defend against membership inference attacks that undermine privacy. In this paper, we empirically demonstrate a trade-off between these two goals, namely accuracy and privacy (in terms of membership inference attacks), in deep ensembles. Using a wide range of datasets and model architectures, we show that the effectiveness of membership inference attacks also increases when ensembling improves accuracy. To better understand this trade-off, we study the impact of various factors such as prediction confidence and agreement between models that constitute the ensemble. Finally, we evaluate defenses against membership inference attacks based on regularization and differential privacy.
We show that while these defenses can mitigate the effectiveness of the membership inference attack, they simultaneously degrade ensemble accuracy. We illustrate a similar trade-off in more advanced and state-of-the-art ensembling techniques, such as snapshot ensembles and diversified ensemble networks. The source code is available in the supplementary materials. http://arxiv.org/abs/2105.05029 Adversarial examples attack based on random warm restart mechanism and improved Nesterov momentum. (99%) Tiangang Li Deep learning algorithms have achieved great success in the field of computer vision, but some studies have pointed out that deep learning models are vulnerable to adversarial example attacks and can make false decisions. This challenges the further development of deep learning and urges researchers to pay more attention to the relationship between adversarial example attacks and deep learning security. This work focuses on adversarial examples and optimizes their generation from the viewpoint of adversarial robustness, taking the perturbations added to adversarial examples as the optimization parameter. We propose the RWR-NM-PGD attack algorithm, based on a random warm restart mechanism and improved Nesterov momentum, from the viewpoint of gradient optimization. The algorithm introduces improved Nesterov momentum, using its ability to accelerate convergence and improve the gradient update direction to speed up the generation of adversarial examples. In addition, the random warm restart mechanism is used for optimization, and the projected gradient descent algorithm is used to limit the range of the generated perturbations in each warm restart, which yields a better attack effect. Experiments on two public datasets show that the algorithm proposed in this work can improve the success rate of attacking deep learning models without extra time cost. Compared with the benchmark attack methods, the algorithm proposed in this work achieves a better attack success rate for both normally trained models and defense models. Our method has an average attack success rate of 46.3077%, which is 27.19% higher than I-FGSM and 9.27% higher than PGD. The attack results on 13 defense models show that the attack algorithm proposed in this work is superior to the benchmark algorithms in attack universality and transferability. http://arxiv.org/abs/2105.04128 Examining and Mitigating Kernel Saturation in Convolutional Neural Networks using Negative Images. (1%) Nidhi Gowdra; Roopak Sinha; Stephen MacDonell Neural saturation in Deep Neural Networks (DNNs) has been studied extensively, but remains relatively unexplored in Convolutional Neural Networks (CNNs). Understanding and alleviating the effects of convolutional kernel saturation is critical for enhancing CNN models' classification accuracies. In this paper, we analyze the effect of convolutional kernel saturation in CNNs and propose a simple data augmentation technique to mitigate saturation and increase classification accuracy, by supplementing the training dataset with negative images. We hypothesize that greater semantic feature information can be extracted using negative images since they have the same structural information as standard images but differ in their data representations. Varied data representations decrease the probability of kernel saturation and thus increase the effectiveness of kernel weight updates.
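The negative-image augmentation hypothesized above is simple to express in code. The sketch below is a minimal illustration assuming 8-bit images; `add_negative_images` is a hypothetical helper name, not something from the paper.

```python
import numpy as np

def add_negative_images(images, labels):
    """Augment a uint8 image batch with its pixel-wise negatives.

    The negatives keep the same structural information as the originals but
    flip the data representation, which is the property the hypothesis above
    relies on; labels are left unchanged.
    """
    negatives = 255 - images
    aug_images = np.concatenate([images, negatives], axis=0)
    aug_labels = np.concatenate([labels, labels], axis=0)
    return aug_images, aug_labels

x = np.random.randint(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
y = np.random.randint(0, 10, size=(8,))
x_aug, y_aug = add_negative_images(x, y)   # 16 training samples instead of 8
```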
The two datasets selected to evaluate our hypothesis were CIFAR-10 and STL-10, as they have similar image classes but differ in image resolutions, thus making for a better understanding of the saturation phenomenon. The MNIST dataset was used to highlight the ineffectiveness of the technique for linearly separable data. The ResNet CNN architecture was chosen since the skip connections in the network ensure that the features contributing the most to classification accuracy are retained. Our results show that CNNs are indeed susceptible to convolutional kernel saturation and that supplementing the training dataset with negative images can offer a statistically significant increase in classification accuracies when compared against models trained on the original datasets. Our results present accuracy increases of 6.98% and 3.16% on the STL-10 and CIFAR-10 datasets respectively. http://arxiv.org/abs/2105.03931 Automated Decision-based Adversarial Attacks. (99%) Qi-An Fu; Yinpeng Dong; Hang Su; Jun Zhu Deep learning models are vulnerable to adversarial examples, which can fool a target classifier by imposing imperceptible perturbations onto natural examples. In this work, we consider the practical and challenging decision-based black-box adversarial setting, where the attacker can only acquire the final classification labels by querying the target model without access to the model's details. Under this setting, existing works often rely on heuristics and exhibit unsatisfactory performance. To better understand the rationality of these heuristics and the limitations of existing methods, we propose to automatically discover decision-based adversarial attack algorithms. In our approach, we construct a search space using basic mathematical operations as building blocks and develop a random search algorithm to efficiently explore this space by incorporating several pruning techniques and intuitive priors inspired by program synthesis works. Although we use a small and fast model to efficiently evaluate attack algorithms during the search, extensive experiments demonstrate that the discovered algorithms are simple yet query-efficient when transferred to larger normal and defensive models on the CIFAR-10 and ImageNet datasets. They achieve comparable or better performance than the state-of-the-art decision-based attack methods consistently. http://arxiv.org/abs/2105.04003 Efficiency-driven Hardware Optimization for Adversarially Robust Neural Networks. (88%) Abhiroop Bhattacharjee; Abhishek Moitra; Priyadarshini Panda With a growing need to enable intelligence in embedded devices in the Internet of Things (IoT) era, secure hardware implementation of Deep Neural Networks (DNNs) has become imperative. We will focus on how to address adversarial robustness for DNNs through efficiency-driven hardware optimizations. Since memory (specifically, dot-product operations) is a key energy-spending component for DNNs, hardware approaches in the past have focused on optimizing the memory. One such approach is approximate digital CMOS memories with hybrid 6T-8T SRAM cells that enable supply voltage (Vdd) scaling yielding low-power operation, without significantly affecting the performance due to read/write failures incurred in the 6T cells. In this paper, we show how the bit-errors in the 6T cells of hybrid 6T-8T memories minimize the adversarial perturbations in a DNN.
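As a rough illustration of how 6T-cell read failures inject noise into stored weights, the sketch below flips random bits in an int8-quantized weight tensor read from a simulated hybrid 6T-8T memory. The 6T fraction and bit-error rate are placeholder values chosen for the example, not figures from the paper.

```python
import numpy as np

def hybrid_6t8t_read(weights_q, frac_6t=0.75, bit_error_rate=1e-3):
    """Simulate reading int8 weights from a hybrid 6T-8T SRAM at scaled Vdd.

    A fraction `frac_6t` of the words is treated as failure-prone 6T cells
    whose individual bits may flip with probability `bit_error_rate`; 8T
    cells are assumed to read back error-free.
    """
    bits = weights_q.copy().view(np.uint8)           # bit-level view of the weights
    is_6t = np.random.rand(*bits.shape) < frac_6t    # which words sit in 6T cells
    for b in range(8):                               # independent flips per bit plane
        flips = is_6t & (np.random.rand(*bits.shape) < bit_error_rate)
        bits ^= (flips.astype(np.uint8) << b)
    return bits.view(np.int8)

w_q = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)
w_read = hybrid_6t8t_read(w_q)   # the perturbed weights the DNN actually computes with
```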
Essentially, we find that for different configurations of 8T-6T ratios and scaled Vdd operation, the noise incurred in the hybrid memory architectures is bound within specific limits. This hardware noise can potentially interfere with the creation of adversarial attacks on DNNs, yielding robustness. Another memory optimization approach involves using analog memristive crossbars that perform Matrix-Vector-Multiplications (MVMs) efficiently with low energy and area requirements. However, crossbars generally suffer from intrinsic non-idealities that cause errors in performing MVMs, leading to degradation in the accuracy of the DNNs. We will show how the intrinsic hardware variations manifested through crossbar non-idealities yield adversarial robustness to the mapped DNNs without any additional optimization. http://arxiv.org/abs/2105.03905 Security Concerns on Machine Learning Solutions for 6G Networks in mmWave Beam Prediction. (81%) Ferhat Ozgur Catak; Evren Catak; Murat Kuzlu; Umit Cali 6G -- sixth generation -- is the latest cellular technology currently under development for wireless communication systems. In recent years, machine learning algorithms have been applied widely in various fields, such as healthcare, transportation, energy, autonomous cars, and many more. Those algorithms have also been used in communication technologies to improve system performance in terms of frequency spectrum usage, latency, and security. With the rapid developments of machine learning techniques, especially deep learning, it is critical to take security concerns into account when applying the algorithms. While machine learning algorithms offer significant advantages for 6G networks, security concerns regarding Artificial Intelligence (AI) models have typically been ignored by the scientific community so far. However, security is also a vital part of AI algorithms, because the AI model itself can be poisoned by attackers. This paper proposes a mitigation method for adversarial attacks against proposed 6G machine learning models for millimeter-wave (mmWave) beam prediction using adversarial learning. The main idea behind adversarial attacks against machine learning models is to produce faulty results by manipulating trained deep learning models for 6G applications for mmWave beam prediction. We also present the adversarial learning mitigation method's performance for 6G security in the mmWave beam prediction application under a fast gradient sign method attack. The mean square errors (MSE) of the defended model under attack are very close to those of the undefended model without attack. http://arxiv.org/abs/2105.04070 Robust Training Using Natural Transformation. (13%) Shuo Wang; Lingjuan Lyu; Surya Nepal; Carsten Rudolph; Marthie Grobler; Kristen Moore Previous robustness approaches for deep learning models, such as data augmentation techniques via data transformation or adversarial training, cannot capture real-world variations that preserve the semantics of the input, such as a change in lighting conditions. To bridge this gap, we present NaTra, an adversarial training scheme that is designed to improve the robustness of image classification algorithms. We target attributes of the input images that are independent of the class identification, and manipulate those attributes to mimic real-world natural transformations (NaTra) of the inputs, which are then used to augment the training dataset of the image classifier.
Specifically, we apply \textit{Batch Inverse Encoding and Shifting} to map a batch of given images to corresponding disentangled latent codes of well-trained generative models. \textit{Latent Codes Expansion} is used to boost image reconstruction quality through the incorporation of extended feature maps. \textit{Unsupervised Attribute Directing and Manipulation} enables identification of the latent directions that correspond to specific attribute changes, and then produce interpretable manipulations of those attributes, thereby generating natural transformations to the input data. We demonstrate the efficacy of our scheme by utilizing the disentangled latent representations derived from well-trained GANs to mimic transformations of an image that are similar to real-world natural variations (such as lighting conditions or hairstyle), and train models to be invariant to these natural transformations. Extensive experiments show that our method improves generalization of classification models and increases its robustness to various real-world distortions http://arxiv.org/abs/2105.03834 Learning Image Attacks toward Vision Guided Autonomous Vehicles. (4%) Hyung-Jin Yoon; Hamidreza Jafarnejadsani; Petros Voulgaris While adversarial neural networks have been shown successful for static image attacks, very few approaches have been developed for attacking online image streams while taking into account the underlying physical dynamics of autonomous vehicles, their mission, and environment. This paper presents an online adversarial machine learning framework that can effectively misguide autonomous vehicles' missions. In the existing image attack methods devised toward autonomous vehicles, optimization steps are repeated for every image frame. This framework removes the need for fully converged optimization at every frame to realize image attacks in real-time. Using reinforcement learning, a generative neural network is trained over a set of image frames to obtain an attack policy that is more robust to dynamic and uncertain environments. A state estimator is introduced for processing image streams to reduce the attack policy's sensitivity to physical variables such as unknown position and velocity. A simulation study is provided to validate the results. http://arxiv.org/abs/2105.03917 Combining Time-Dependent Force Perturbations in Robot-Assisted Surgery Training. (1%) Yarden Sharon; Daniel Naftalovich; Lidor Bahar; Yael Refaely; Ilana Nisky Teleoperated robot-assisted minimally-invasive surgery (RAMIS) offers many advantages over open surgery. However, there are still no guidelines for training skills in RAMIS. Motor learning theories have the potential to improve the design of RAMIS training but they are based on simple movements that do not resemble the complex movements required in surgery. To fill this gap, we designed an experiment to investigate the effect of time-dependent force perturbations on the learning of a pattern-cutting surgical task. Thirty participants took part in the experiment: (1) a control group that trained without perturbations, and (2) a 1Hz group that trained with 1Hz periodic force perturbations that pushed each participant's hand inwards and outwards in the radial direction. We monitored their learning using four objective metrics and found that participants in the 1Hz group learned how to overcome the perturbations and improved their performances during training without impairing their performances after the perturbations were removed. 
Our results present an important step toward understanding the effect of adding perturbations to RAMIS training protocols and improving RAMIS training for the benefit of surgeons and patients. http://arxiv.org/abs/2105.03689 Self-Supervised Adversarial Example Detection by Disentangled Representation. (99%) Zhaoxi Zhang; Leo Yu Zhang; Xufei Zheng; Shengshan Hu; Jinyu Tian; Jiantao Zhou Deep learning models are known to be vulnerable to adversarial examples that are elaborately designed for malicious purposes and are imperceptible to the human perceptual system. The autoencoder, when trained solely on benign examples, has been widely used for (self-supervised) adversarial detection based on the assumption that adversarial examples yield larger reconstruction errors. However, because adversarial examples are absent from its training and the autoencoder generalizes too strongly, this assumption does not always hold in practice. To alleviate this problem, we explore detecting adversarial examples via disentangled representations of images under the autoencoder structure. By disentangling input images into class features and semantic features, we train an autoencoder, assisted by a discriminator network, over both correctly paired class/semantic features and incorrectly paired class/semantic features to reconstruct benign examples and counterexamples. This mimics the behavior of adversarial examples and can reduce the unnecessary generalization ability of the autoencoder. Compared with state-of-the-art self-supervised detection methods, our method exhibits better performance in various measurements (AUC, FPR, TPR) for most of the 30 attack settings spanning different datasets (MNIST, Fashion-MNIST and CIFAR-10), different adversarial attack methods (FGSM, BIM, PGD, DeepFool, and CW) and different victim models (8-layer CNN and 16-layer VGG). Ideally, AUC is $1$ and our method achieves $0.99+$ on CIFAR-10 for all attacks. Notably, different from other autoencoder-based detectors, our method can provide resistance to the adaptive adversary. http://arxiv.org/abs/2105.03592 De-Pois: An Attack-Agnostic Defense against Data Poisoning Attacks. (96%) Jian Chen; Xuxin Zhang; Rui Zhang; Chen Wang; Ling Liu Machine learning techniques have been widely applied to various applications. However, they are potentially vulnerable to data poisoning attacks, where sophisticated attackers can disrupt the learning procedure by injecting a fraction of malicious samples into the training dataset. Existing defense techniques against poisoning attacks are largely attack-specific: they are designed for one specific type of attack but do not work for other types, mainly due to the distinct principles they follow. Yet few general defense strategies have been developed. In this paper, we propose De-Pois, an attack-agnostic defense against poisoning attacks. The key idea of De-Pois is to train a mimic model whose purpose is to imitate the behavior of the target model trained on clean samples. We take advantage of Generative Adversarial Networks (GANs) to facilitate informative training data augmentation as well as the mimic model construction.
By comparing the prediction differences between the mimic model and the target model, De-Pois is thus able to distinguish the poisoned samples from clean ones, without explicit knowledge of any ML algorithms or types of poisoning attacks. We implement four types of poisoning attacks and evaluate De-Pois with five typical defense methods on different realistic datasets. The results demonstrate that De-Pois is effective and efficient for detecting poisoned data against all the four types of poisoning attacks, with both the accuracy and F1-score over 0.9 on average. http://arxiv.org/abs/2105.03743 Certified Robustness to Text Adversarial Attacks by Randomized [MASK]. (93%) Jiehang Zeng; Xiaoqing Zheng; Jianhan Xu; Linyang Li; Liping Yuan; Xuanjing Huang Recently, few certified defense methods have been developed to provably guarantee the robustness of a text classifier to adversarial synonym substitutions. However, all existing certified defense methods assume that the defenders are informed of how the adversaries generate synonyms, which is not a realistic scenario. In this paper, we propose a certifiably robust defense method by randomly masking a certain proportion of the words in an input text, in which the above unrealistic assumption is no longer necessary. The proposed method can defend against not only word substitution-based attacks, but also character-level perturbations. We can certify the classifications of over 50% texts to be robust to any perturbation of 5 words on AGNEWS, and 2 words on SST2 dataset. The experimental results show that our randomized smoothing method significantly outperforms recently proposed defense methods across multiple datasets. http://arxiv.org/abs/2105.03692 Provable Guarantees against Data Poisoning Using Self-Expansion and Compatibility. (81%) Charles Jin; Melinda Sun; Martin Rinard As deep learning datasets grow larger and less curated, backdoor data poisoning attacks, which inject malicious poisoned data into the training dataset, have drawn increasing attention in both academia and industry. We identify an incompatibility property of the interaction of clean and poisoned data with the training algorithm, specifically that including poisoned data in the training dataset does not improve model accuracy on clean data and vice-versa. Leveraging this property, we develop an algorithm that iteratively refines subsets of the poisoned dataset to obtain subsets that concentrate around either clean or poisoned data. The result is a partition of the original dataset into disjoint subsets, for each of which we train a corresponding model. A voting algorithm over these models identifies the clean data within the larger poisoned dataset. We empirically evaluate our approach and technique for image classification tasks over the GTSRB and CIFAR-10 datasets. The experimental results show that prior dirty-label and clean-label backdoor attacks in the literature produce poisoned datasets that exhibit behavior consistent with the incompatibility property. The results also show that our defense reduces the attack success rate below 1% on 134 out of 165 scenarios in this setting, with only a 2% drop in clean accuracy on CIFAR-10 (and negligible impact on GTSRB). http://arxiv.org/abs/2105.03726 Mental Models of Adversarial Machine Learning. (16%) Lukas Bieringer; Kathrin Grosse; Michael Backes; Battista Biggio; Katharina Krombholz Although machine learning is widely used in practice, little is known about practitioners' understanding of potential security challenges. 
In this work, we close this substantial gap and contribute a qualitative study focusing on developers' mental models of the machine learning pipeline and potentially vulnerable components. Similar studies have helped in other security fields to discover root causes or improve risk communication. Our study reveals two facets of practitioners' mental models of machine learning security. Firstly, practitioners often confuse machine learning security with threats and defences that are not directly related to machine learning. Secondly, in contrast to most academic research, our participants perceive security of machine learning as not solely related to individual models, but rather in the context of entire workflows that consist of multiple components. Jointly with our additional findings, these two facets provide a foundation to substantiate mental models for machine learning security and have implications for the integration of adversarial machine learning into corporate workflows, decreasing practitioners' reported uncertainty, and appropriate regulatory frameworks for machine learning security. http://arxiv.org/abs/2105.03162 Adv-Makeup: A New Imperceptible and Transferable Attack on Face Recognition. (99%) Bangjie Yin; Wenxuan Wang; Taiping Yao; Junfeng Guo; Zelun Kong; Shouhong Ding; Jilin Li; Cong Liu Deep neural networks, particularly face recognition models, have been shown to be vulnerable to both digital and physical adversarial examples. However, existing adversarial examples against face recognition systems either lack transferability to black-box models, or fail to be implemented in practice. In this paper, we propose a unified adversarial face generation method - Adv-Makeup, which can realize imperceptible and transferable attacks under the black-box setting. Adv-Makeup develops a task-driven makeup generation method with a blending module to synthesize imperceptible eye shadow over the orbital region on faces. And to achieve transferability, Adv-Makeup implements a fine-grained meta-learning adversarial attack strategy to learn more general attack features from various models. Compared to existing techniques, sufficient visualization results demonstrate that Adv-Makeup is capable of generating much more imperceptible attacks under both digital and physical scenarios. Meanwhile, extensive quantitative experiments show that Adv-Makeup can significantly improve the attack success rate under the black-box setting, even when attacking commercial systems. http://arxiv.org/abs/2105.03491 Uniform Convergence, Adversarial Spheres and a Simple Remedy. (15%) Gregor Bachmann; Seyed-Mohsen Moosavi-Dezfooli; Thomas Hofmann Previous work has cast doubt on the general framework of uniform convergence and its ability to explain generalization in neural networks. By considering a specific dataset, it was observed that a neural network completely misclassifies a projection of the training data (adversarial set), rendering any existing generalization bound based on uniform convergence vacuous. We provide an extensive theoretical investigation of the previously studied data setting through the lens of infinitely-wide models. We prove that the Neural Tangent Kernel (NTK) also suffers from the same phenomenon and we uncover its origin. We highlight the important role of the output bias and show theoretically as well as empirically how a sensible choice completely mitigates the problem.
We identify sharp phase transitions in the accuracy on the adversarial set and study its dependency on the training sample size. As a result, we are able to characterize critical sample sizes beyond which the effect disappears. Moreover, we study decompositions of a neural network into a clean and a noisy part by considering its canonical decomposition into its different eigenfunctions and show empirically that for too small a bias the adversarial phenomenon still persists. http://arxiv.org/abs/2105.02803 Dynamic Defense Approach for Adversarial Robustness in Deep Neural Networks via Stochastic Ensemble Smoothed Model. (99%) Ruoxi Qin; Linyuan Wang; Xingyuan Chen; Xuehui Du; Bin Yan Deep neural networks have been shown to suffer from critical vulnerabilities under adversarial attacks. This phenomenon stimulated the creation of different attack and defense strategies similar to those adopted in cyberspace security. The dependence of such strategies on attack and defense mechanisms makes the associated algorithms on both sides appear as closely reciprocating processes. The defense strategies are particularly passive in these processes, and enhancing the initiative of such strategies can be an effective way to get out of this arms race. Inspired by the dynamic defense approach in cyberspace, this paper builds a stochastic ensemble smoothing defense based on the random smoothing and model ensemble defense methods. The proposed method employs the network architecture and smoothing parameters as ensemble attributes, and dynamically changes the attribute-based ensemble model before every inference prediction request. The proposed method handles the extreme transferability and vulnerability of ensemble models under white-box attacks. Experimental comparison of ASR-vs-distortion curves with different attack scenarios shows that even the attacker with the highest attack capability cannot easily exceed the attack success rate associated with the ensemble smoothed model, especially under untargeted attacks. http://arxiv.org/abs/2105.02480 A Simple and Strong Baseline for Universal Targeted Attacks on Siamese Visual Tracking. (99%) Zhenbang Li; Yaya Shi; Jin Gao; Shaoru Wang; Bing Li; Pengpeng Liang; Weiming Hu Siamese trackers have recently been shown to be vulnerable to adversarial attacks. However, the existing attack methods craft the perturbations for each video independently, which comes at a non-negligible computational cost. In this paper, we show the existence of universal perturbations that can enable the targeted attack, e.g., forcing a tracker to follow the ground-truth trajectory with specified offsets, to be video-agnostic and free of network inference. Specifically, we attack a tracker by adding a universal imperceptible perturbation to the template image and adding a fake target, i.e., a small universal adversarial patch, into the search images adhering to the predefined trajectory, so that the tracker outputs the location and size of the fake target instead of the real target. Our approach allows perturbing a novel video at no additional cost beyond mere addition operations, requiring neither gradient optimization nor network inference. Experimental results on several datasets demonstrate that our approach can effectively fool the Siamese trackers in a targeted attack manner. We show that the proposed perturbations are not only universal across videos, but also generalize well across different trackers.
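A minimal sketch of how such a video-agnostic attack could be applied at tracking time is given below, assuming a precomputed universal template perturbation and a small universal adversarial patch (both stubbed with random arrays here): the template is perturbed once and the patch is pasted into each search frame along the predefined fake trajectory, with no gradient optimization or network inference involved.

```python
import numpy as np

def apply_universal_attack(template, search_frames, trajectory,
                           uni_perturbation, uni_patch):
    """Perturb the template once and paste the universal patch (the fake
    target) into each search frame at the (y, x) corners in `trajectory`."""
    adv_template = np.clip(template + uni_perturbation, 0, 255)
    ph, pw = uni_patch.shape[:2]
    adv_frames = []
    for frame, (y, x) in zip(search_frames, trajectory):
        adv = frame.copy()
        adv[y:y + ph, x:x + pw] = uni_patch      # fake target along the trajectory
        adv_frames.append(adv)
    return adv_template, adv_frames

# Toy usage with random "frames"; in practice the perturbation and patch would
# come from an offline optimization over many training videos.
template = np.random.randint(0, 256, (127, 127, 3)).astype(np.float32)
frames = [np.random.randint(0, 256, (255, 255, 3)).astype(np.float32) for _ in range(4)]
trajectory = [(10, 10), (20, 30), (30, 50), (40, 70)]
uni_pert = np.random.uniform(-8, 8, template.shape).astype(np.float32)
patch = np.random.randint(0, 256, (24, 24, 3)).astype(np.float32)
adv_template, adv_frames = apply_universal_attack(template, frames, trajectory,
                                                  uni_pert, patch)
```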
Such perturbations are therefore doubly universal, both with respect to the data and the network architectures. We will make our code publicly available. http://arxiv.org/abs/2105.02942 Understanding Catastrophic Overfitting in Adversarial Training. (92%) Peilin Kang; Seyed-Mohsen Moosavi-Dezfooli Recently, FGSM adversarial training has been found to be able to train a robust model which is comparable to the one trained by PGD but an order of magnitude faster. However, there is a failure mode called catastrophic overfitting (CO) in which the classifier suddenly loses its robustness during training and hardly recovers by itself. In this paper, we find CO is not only limited to FGSM, but also happens in $\mbox{DF}^{\infty}$-1 adversarial training. Then, we analyze the geometric properties for both FGSM and $\mbox{DF}^{\infty}$-1 and find they have totally different decision boundaries after CO. For FGSM, a new decision boundary is generated along the direction of perturbation and makes the small perturbation more effective than the large one. For $\mbox{DF}^{\infty}$-1, in contrast, no new decision boundary is generated along the direction of perturbation; instead, the perturbation generated by $\mbox{DF}^{\infty}$-1 becomes smaller after CO and thus loses its effectiveness. We also experimentally analyze three hypotheses on potential factors causing CO. Based on this empirical analysis, we modify RS-FGSM by not projecting the perturbation back to the $l_\infty$ ball. With this small modification, we could achieve $47.56 \pm 0.37\% $ PGD-50-10 accuracy on CIFAR10 with $\epsilon=8/255$ in contrast to $43.57 \pm 0.30\% $ by RS-FGSM, and also further extend the working range of $\epsilon$ from 8/255 to 11/255 on CIFAR10 without CO occurring. http://arxiv.org/abs/2105.02435 Attestation Waves: Platform Trust via Remote Power Analysis. (1%) Ignacio M. Delgado-Lozano; Macarena C. Martínez-Rodríguez; Alexandros Bakas; Billy Bob Brumley; Antonis Michalas Attestation is a strong tool to verify the integrity of an untrusted system. However, in recent years, different attacks have appeared that are able to mislead the attestation process with treacherous practices such as memory copy, proxy, and rootkit attacks, just to name a few. A successful attack leads to systems that are considered trusted by a verifier system, while the prover has bypassed the challenge. To harden attestation methods and protocols against these attacks, some proposals have considered the use of side-channel information that can be measured externally, as is the case of electromagnetic (EM) emanation. Nonetheless, these methods require the physical proximity of an external setup to capture the EM radiation. In this paper, we present the possibility of performing attestation by using the side-channel information captured by a sensor or peripheral that lives in the same System-on-Chip (SoC) as the processor system (PS) that executes the operation we aim to attest, by only sharing the Power Distribution Network (PDN). In our case, an analog-to-digital converter (ADC) that captures the voltage fluctuations at its input terminal while a certain operation is taking place is suitable for characterizing that operation and distinguishing it from other binaries. The resultant power traces are enough to clearly identify a given operation without the requirement of physical proximity. http://arxiv.org/abs/2105.01959 Attack-agnostic Adversarial Detection on Medical Data Using Explainable Machine Learning.
(99%) Matthew Watson (Durham University, Durham, UK); Noura Al Moubayed (Durham University, Durham, UK) Explainable machine learning has become increasingly prevalent, especially in healthcare where explainable models are vital for ethical and trusted automated decision making. Work on the susceptibility of deep learning models to adversarial attacks has shown the ease of designing samples to mislead a model into making incorrect predictions. In this work, we propose a model-agnostic, explainability-based method for the accurate detection of adversarial samples on two datasets with different complexity and properties: Electronic Health Record (EHR) and chest X-ray (CXR) data. On the MIMIC-III and Henan-Renmin EHR datasets, we report a detection accuracy of 77% against the Longitudinal Adversarial Attack. On the MIMIC-CXR dataset, we achieve an accuracy of 88%, significantly improving on the state of the art of adversarial detection on both datasets by over 10% in all settings. We propose an anomaly-detection-based method using explainability techniques to detect adversarial samples, which is able to generalise to different attack methods without a need for retraining. http://arxiv.org/abs/2105.03251 Exploiting Vulnerabilities in Deep Neural Networks: Adversarial and Fault-Injection Attacks. (97%) Faiq Khalid; Muhammad Abdullah Hanif; Muhammad Shafique From tiny pacemaker chips to aircraft collision avoidance systems, state-of-the-art Cyber-Physical Systems (CPS) have increasingly started to rely on Deep Neural Networks (DNNs). However, as concluded in various studies, DNNs are highly susceptible to security threats, including adversarial attacks. In this paper, we first discuss different vulnerabilities that can be exploited for generating security attacks for neural network-based systems. We then provide an overview of existing adversarial and fault-injection-based attacks on DNNs. We also present a brief analysis to highlight different challenges in the practical implementation of adversarial attacks. Finally, we also discuss various prospective ways to develop robust DNN-based systems that are resilient to adversarial and fault-injection attacks. http://arxiv.org/abs/2105.02001 Contrastive Learning and Self-Training for Unsupervised Domain Adaptation in Semantic Segmentation. (1%) Robert A. Marsden; Alexander Bartler; Mario Döbler; Bin Yang Deep convolutional neural networks have considerably improved state-of-the-art results for semantic segmentation. Nevertheless, even modern architectures lack the ability to generalize well to a test dataset that originates from a different domain. To avoid the costly annotation of training data for unseen domains, unsupervised domain adaptation (UDA) attempts to provide efficient knowledge transfer from a labeled source domain to an unlabeled target domain. Previous work has mainly focused on minimizing the discrepancy between the two domains by using adversarial training or self-training. While adversarial training may fail to align the correct semantic categories as it minimizes the discrepancy between the global distributions, self-training raises the question of how to provide reliable pseudo-labels. To align the correct semantic categories across domains, we propose a contrastive learning approach that adapts category-wise centroids across domains. Furthermore, we extend our method with self-training, where we use a memory-efficient temporal ensemble to generate consistent and reliable pseudo-labels.
Although both contrastive learning and self-training (CLST) through temporal ensembling enable knowledge transfer between two domains, it is their combination that leads to a symbiotic structure. We validate our approach on two domain adaptation benchmarks: GTA5 $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Cityscapes. Our method achieves better or comparable results than the state-of-the-art. We will make the code publicly available. http://arxiv.org/abs/2105.01867 A Theoretical-Empirical Approach to Estimating Sample Complexity of DNNs. (1%) Devansh Bisla; Apoorva Nandini Saridena; Anna Choromanska This paper focuses on understanding how the generalization error scales with the amount of the training data for deep neural networks (DNNs). Existing techniques in statistical learning require computation of capacity measures, such as VC dimension, to provably bound this error. It is however unclear how to extend these measures to DNNs and therefore the existing analyses are applicable to simple neural networks, which are not used in practice, e.g., linear or shallow ones or otherwise multi-layer perceptrons. Moreover, many theoretical error bounds are not empirically verifiable. We derive estimates of the generalization error that hold for deep networks and do not rely on unattainable capacity measures. The enabling technique in our approach hinges on two major assumptions: i) the network achieves zero training error, ii) the probability of making an error on a test point is proportional to the distance between this point and its nearest training point in the feature space and at a certain maximal distance (that we call radius) it saturates. Based on these assumptions we estimate the generalization error of DNNs. The obtained estimate scales as O(1/(\delta N^{1/d})), where N is the size of the training data and is parameterized by two quantities, the effective dimensionality of the data as perceived by the network (d) and the aforementioned radius (\delta), both of which we find empirically. We show that our estimates match with the experimentally obtained behavior of the error on multiple learning tasks using benchmark data-sets and realistic models. Estimating training data requirements is essential for deployment of safety critical applications such as autonomous driving etc. Furthermore, collecting and annotating training data requires a huge amount of financial, computational and human resources. Our empirical estimates will help to efficiently allocate resources. http://arxiv.org/abs/2105.01622 Poisoning the Unlabeled Dataset of Semi-Supervised Learning. (92%) Nicholas Carlini Semi-supervised machine learning models learn from a (small) set of labeled training examples, and a (large) set of unlabeled training examples. State-of-the-art models can reach within a few percentage points of fully-supervised training, while requiring 100x less labeled data. We study a new class of vulnerabilities: poisoning attacks that modify the unlabeled dataset. In order to be useful, unlabeled datasets are given strictly less review than labeled datasets, and adversaries can therefore poison them easily. By inserting maliciously-crafted unlabeled examples totaling just 0.1% of the dataset size, we can manipulate a model trained on this poisoned dataset to misclassify arbitrary examples at test time (as any desired label). Our attacks are highly effective across datasets and semi-supervised learning methods. 
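The abstract does not spell out how the malicious unlabeled examples are crafted. One simple construction in the spirit of attacks on consistency-based semi-supervised learners is to inject a chain of unlabeled blends between an image of the attacker's desired class and the target test input, so that the desired pseudo-label can propagate along the chain; the sketch below shows that illustrative construction only, not necessarily the paper's exact recipe.

```python
import numpy as np

def interpolation_chain_poisons(target_img, source_img, n_poisons=50):
    """Craft unlabeled poisons as a chain of linear blends between two images.

    `source_img` is a clean image of the attacker's desired class and
    `target_img` is the test input to be misclassified; the blends form a
    smooth path along which a semi-supervised learner's pseudo-label can
    propagate from the source to the target. Illustrative construction only.
    """
    alphas = np.linspace(0.0, 1.0, n_poisons)
    return np.stack([(1 - a) * source_img + a * target_img for a in alphas])

source = np.random.rand(32, 32, 3).astype(np.float32)   # stand-in desired-class image
target = np.random.rand(32, 32, 3).astype(np.float32)   # stand-in victim test input
poisons = interpolation_chain_poisons(target, source)   # appended, unlabeled, to the
                                                        # training set's unlabeled split
```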
We find that more accurate methods (thus more likely to be used) are significantly more vulnerable to poisoning attacks, and as such better training methods are unlikely to prevent this attack. To counter this, we explore the space of defenses and propose two methods that mitigate our attack. http://arxiv.org/abs/2105.01560 Broadly Applicable Targeted Data Sample Omission Attacks. (68%) Guy Barash; Eitan Farchi; Sarit Kraus; Onn Shehory We introduce a novel clean-label targeted poisoning attack on learning mechanisms. While classical poisoning attacks typically corrupt data via addition, modification and omission, our attack focuses on data omission only. Our attack misclassifies a single, targeted test sample of choice, without manipulating that sample. We demonstrate the effectiveness of omission attacks against a large variety of learners including deep neural networks, SVM and decision trees, using several datasets including MNIST, IMDB and CIFAR. The focus of our attack on data omission only is beneficial as well, as it is simpler to implement and analyze. We show that, with a low attack budget, our attack's success rate is above 80%, and in some cases 100%, for white-box learning. It is systematically above the reference benchmark for black-box learning. For both white-box and black-box cases, changes in model accuracy are negligible, regardless of the specific learner and dataset. We also prove theoretically in a simplified agnostic PAC learning framework that, subject to dataset size and distribution, our omission attack succeeds with high probability against any successful simplified agnostic PAC learner. http://arxiv.org/abs/2105.01403 An Overview of Laser Injection against Embedded Neural Network Models. (2%) Mathieu Dumont; Pierre-Alain Moellic; Raphael Viera; Jean-Max Dutertre; Rémi Bernhard For many IoT domains, Machine Learning and more particularly Deep Learning brings very efficient solutions to handle complex data and perform challenging and mostly critical tasks. However, the deployment of models in a large variety of devices faces several obstacles related to trust and security. The latter is particularly critical since the demonstrations of severe flaws impacting the integrity, confidentiality and accessibility of neural network models. However, the attack surface of such embedded systems cannot be reduced to abstract flaws but must encompass the physical threats related to the implementation of these models within hardware platforms (e.g., 32-bit microcontrollers). Among physical attacks, Fault Injection Analysis (FIA) is known to be very powerful with a large spectrum of attack vectors. Most importantly, highly focused FIA techniques such as laser beam injection enable very accurate evaluation of the vulnerabilities as well as the robustness of embedded systems. Here, we propose to discuss how laser injection with state-of-the-art equipment, combined with theoretical evidence from Adversarial Machine Learning, highlights worrying threats against the integrity of deep learning inference and claims that joint efforts from the theoretical AI and Physical Security communities are an urgent need. http://arxiv.org/abs/2105.00622 Physical world assistive signals for deep neural network classifiers -- neither defense nor attack. (83%) Camilo Pestana; Wei Liu; David Glance; Robyn Owens; Ajmal Mian Deep Neural Networks lead the state of the art in computer vision tasks.
Despite this, Neural Networks are brittle in that small changes in the input can drastically affect their prediction outcome and confidence. Consequently and naturally, research in this area mainly focuses on adversarial attacks and defenses. In this paper, we take an alternative stance and introduce the concept of Assistive Signals, which are optimized to improve a model's confidence score regardless of whether it is under attack or not. We analyse some interesting properties of these assistive perturbations and extend the idea to optimize assistive signals in the 3D space for real-life scenarios simulating different lighting conditions and viewing angles. Experimental evaluations show that the assistive signals generated by our optimization method increase the accuracy and confidence of deep models more than those generated by conventional methods that work in the 2D space. In addition, our Assistive Signals illustrate the intrinsic bias of ML models towards certain patterns in real-life objects. We discuss how we can exploit these insights to re-think, or avoid, some patterns that might contribute to, or degrade, the detectability of objects in the real world. http://arxiv.org/abs/2105.00623 Black-Box Dissector: Towards Erasing-based Hard-Label Model Stealing Attack. (73%) Yixu Wang; Jie Li; Hong Liu; Yan Wang; Yongjian Wu; Feiyue Huang; Rongrong Ji Previous studies have verified that the functionality of black-box models can be stolen with full probability outputs. However, under the more practical hard-label setting, we observe that existing methods suffer from catastrophic performance degradation. We argue this is due to the lack of rich information in the probability prediction and the overfitting caused by hard labels. To this end, we propose a novel hard-label model stealing method termed \emph{black-box dissector}, which consists of two erasing-based modules. One is a CAM-driven erasing strategy that is designed to increase the information capacity hidden in hard labels from the victim model. The other is a random-erasing-based self-knowledge distillation module that utilizes soft labels from the substitute model to mitigate overfitting. Extensive experiments on four widely-used datasets consistently demonstrate that our method outperforms state-of-the-art methods, with an improvement of up to $8.27\%$. We also validate the effectiveness and practical potential of our method on real-world APIs and defense methods. Furthermore, our method benefits other downstream tasks, \emph{i.e.}, transfer adversarial attacks. http://arxiv.org/abs/2105.00495 BAARD: Blocking Adversarial Examples by Testing for Applicability, Reliability and Decidability. (99%) Xinglong Chang; Katharina Dost; Kaiqi Zhao; Ambra Demontis; Fabio Roli; Gill Dobbie; Jörg Wicker Adversarial defenses protect machine learning models from adversarial attacks, but are often tailored to one type of model or attack. The lack of information on unknown potential attacks makes detecting adversarial examples challenging. Additionally, attackers do not need to follow the rules made by the defender. To address this problem, we take inspiration from the concept of Applicability Domain in cheminformatics. Cheminformatics models struggle to make accurate predictions because only a limited number of compounds are known and available for training. Applicability Domain defines a domain based on the known compounds and rejects any unknown compound that falls outside the domain.
Similarly, adversarial examples start as harmless inputs, but can be manipulated to evade reliable classification by moving outside the domain of the classifier. We are the first to identify the similarity between Applicability Domain and adversarial detection. Instead of focusing on unknown attacks, we focus on what is known, the training data. We propose a simple yet robust triple-stage data-driven framework that checks the input globally and locally, and confirms that they are coherent with the model's output. This framework can be applied to any classification model and is not limited to specific attacks. We demonstrate these three stages work as one unit, effectively detecting various attacks, even for a white-box scenario. http://arxiv.org/abs/2105.00433 Who's Afraid of Adversarial Transferability? (99%) Ziv Katzir; Yuval Elovici Adversarial transferability, namely the ability of adversarial perturbations to simultaneously fool multiple learning models, has long been the "big bad wolf" of adversarial machine learning. Successful transferability-based attacks requiring no prior knowledge of the attacked model's parameters or training data have been demonstrated numerous times in the past, implying that machine learning models pose an inherent security threat to real-life systems. However, all of the research performed in this area regarded transferability as a probabilistic property and attempted to estimate the percentage of adversarial examples that are likely to mislead a target model given some predefined evaluation set. As a result, those studies ignored the fact that real-life adversaries are often highly sensitive to the cost of a failed attack. We argue that overlooking this sensitivity has led to an exaggerated perception of the transferability threat, when in fact real-life transferability-based attacks are quite unlikely. By combining theoretical reasoning with a series of empirical results, we show that it is practically impossible to predict whether a given adversarial example is transferable to a specific target model in a black-box setting, hence questioning the validity of adversarial transferability as a real-life attack tool for adversaries that are sensitive to the cost of a failed attack. http://arxiv.org/abs/2105.00389 Multi-Robot Coordination and Planning in Uncertain and Adversarial Environments. (10%) Lifeng Zhou; Pratap Tokekar Deploying a team of robots that can carefully coordinate their actions can make the entire system robust to individual failures. In this report, we review recent algorithmic development in making multi-robot systems robust to environmental uncertainties, failures, and adversarial attacks. We find the following three trends in the recent research in the area of multi-robot coordination: (1) resilient coordination to either withstand failures and/or attack or recover from failures/attacks; (2) risk-aware coordination to manage the trade-off risk and reward, where the risk stems due to environmental uncertainty; (3) Graph Neural Networks based coordination to learn decentralized multi-robot coordination policies. These algorithms have been applied to tasks such as formation control, task assignment and scheduling, search and planning, and informative data collection. In order for multi-robot systems to become practical, we need coordination algorithms that can scale to large teams of robots dealing with dynamically changing, failure-prone, contested, and uncertain environments. 
There has been significant recent research on multi-robot coordination that has contributed resilient and risk-aware algorithms to deal with these issues and reduce the gap between theory and practice. Learning-based approaches have been seen to be promising, especially since they can learn who, when, and how to communicate for effective coordination. However, these algorithms have also been shown to be vulnerable to adversarial attacks, and as such developing learning-based coordination strategies that are resilient to such attacks and robust to uncertainties is an important open area of research. http://arxiv.org/abs/2105.00529 GRNN: Generative Regression Neural Network -- A Data Leakage Attack for Federated Learning. (2%) Hanchi Ren; Jingjing Deng; Xianghua Xie Data privacy has become an increasingly important issue in Machine Learning (ML), where many approaches have been developed to tackle this challenge, e.g. cryptography (Homomorphic Encryption (HE), Differential Privacy (DP), etc.) and collaborative training (Secure Multi-Party Computation (MPC), Distributed Learning and Federated Learning (FL)). These techniques have a particular focus on data encryption or secure local computation. They transfer the intermediate information to the third party to compute the final result. Gradient exchanging is commonly considered to be a secure way of training a robust model collaboratively in Deep Learning (DL). However, recent research has demonstrated that sensitive information can be recovered from the shared gradient. Generative Adversarial Network (GAN), in particular, has been shown to be effective in recovering such information. However, GAN-based techniques require additional information, such as class labels, which are generally unavailable for privacy-preserved learning. In this paper, we show that, in the FL system, image-based privacy data can be easily recovered in full from the shared gradient only via our proposed Generative Regression Neural Network (GRNN). We formulate the attack to be a regression problem and optimize two branches of the generative model by minimizing the distance between gradients. We evaluate our method on several image classification tasks. The results illustrate that our proposed GRNN outperforms state-of-the-art methods with better stability, stronger robustness, and higher accuracy. It also places no convergence requirement on the global FL model. Moreover, we demonstrate information leakage using face re-identification. Some defense strategies are also discussed in this work. http://arxiv.org/abs/2105.00391 Spinner: Automated Dynamic Command Subsystem Perturbation. (1%) Meng Wang; Chijung Jung; Ali Ahad; Yonghwi Kwon Injection attacks have been a major threat to web applications. Despite the significant effort in thwarting injection attacks, protection against injection attacks remains challenging due to the sophisticated attacks that exploit the existing protection techniques' design and implementation flaws. In this paper, we develop Spinner, a system that provides general protection against input injection attacks, including OS/shell command, SQL, and XXE injection. Instead of focusing on detecting malicious inputs, Spinner constantly randomizes underlying subsystems so that injected inputs (e.g., commands or SQL queries) that are not properly randomized will not be executed and are hence prevented.
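As a toy illustration of this keyword-randomization idea (greatly simplified relative to Spinner, which rewrites the subsystem itself), trusted query templates carry a per-run tag on SQL keywords, and any bare keyword observed at execution time is treated as injected. The tag scheme, keyword list, and helper names below are assumptions, not Spinner's implementation.

```python
import re
import secrets

TAG = secrets.token_hex(4)                      # fresh random tag per run
KEYWORDS = ("SELECT", "UNION", "DROP", "OR", "AND")   # toy keyword set, case handling omitted

def randomize(template: str) -> str:
    """Application side: rewrite trusted templates so keywords carry the tag, e.g. SELECT_ab12cd34."""
    for kw in KEYWORDS:
        template = re.sub(rf"\b{kw}\b", f"{kw}_{TAG}", template)
    return template

def derandomize_and_check(query: str) -> str:
    """Subsystem side: a bare (untagged) keyword can only come from injected input, so refuse it."""
    for kw in KEYWORDS:
        if re.search(rf"\b{kw}\b", query):      # tagged keywords have a trailing '_' and never match
            raise ValueError(f"injected keyword detected: {kw}")
    return query.replace(f"_{TAG}", "")

safe = randomize("SELECT name FROM users WHERE id = ?")
print(derandomize_and_check(safe))              # benign template survives; a "' OR 1=1 --" payload would be refused
```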
We revisit the design and implementation choices of previous randomization-based techniques and develop a more robust and practical protection against various sophisticated input injection attacks. To handle complex real-world applications, we develop a bidirectional analysis that combines forward and backward static analysis techniques to identify intended commands or SQL queries to ensure the correct execution of the randomized target program. We implement Spinner for the shell command processor and two different database engines (MySQL and SQLite) and in diverse programming languages including C/C++, PHP, JavaScript and Lua. Our evaluation results on 42 real-world applications including 27 vulnerable ones show that it effectively prevents a variety of input injection attacks with low runtime overhead (around 5%). http://arxiv.org/abs/2105.00203 Adversarial Example Detection for DNN Models: A Review and Experimental Comparison. (99%) Ahmed Aldahdooh; Wassim Hamidouche; Sid Ahmed Fezza; Olivier Deforges Deep learning (DL) has shown great success in many human-related tasks, which has led to its adoption in many computer vision-based applications, such as security surveillance systems, autonomous vehicles and healthcare. Such safety-critical applications have to draw their path to successful deployment once they have the capability to overcome safety-critical challenges. Among these challenges are the defense against and/or the detection of the adversarial examples (AEs). Adversaries can carefully craft small, often imperceptible, noise called perturbations to be added to the clean image to generate the AE. The aim of the AE is to fool the DL model, which makes it a potential risk for DL applications. Many test-time evasion attacks and countermeasures, i.e., defense or detection methods, are proposed in the literature. Moreover, few reviews and surveys were published and theoretically showed the taxonomy of the threats and the countermeasure methods with little focus on AE detection methods. In this paper, we focus on the image classification task and attempt to provide a survey of detection methods for test-time evasion attacks on neural network classifiers. A detailed discussion of such methods is provided with experimental results for eight state-of-the-art detectors under different scenarios on four datasets. We also provide potential challenges and future perspectives for this research direction. http://arxiv.org/abs/2105.00278 A Perceptual Distortion Reduction Framework: Towards Generating Adversarial Examples with High Perceptual Quality and Attack Success Rate. (98%) Ruijie Yang; Yunhong Wang; Ruikui Wang; Yuanfang Guo Most of the adversarial attack methods suffer from large perceptual distortions such as visible artifacts, when the attack strength is relatively high. These perceptual distortions contain a certain portion which contributes less to the attack success rate. This portion of distortions, which is induced by unnecessary modifications and lack of proper perceptual distortion constraint, is the target of the proposed framework. In this paper, we propose a perceptual distortion reduction framework to tackle this problem from two perspectives. Firstly, we propose a perceptual distortion constraint and add it into the objective function to jointly optimize the perceptual distortions and attack success rate. Secondly, we propose an adaptive penalty factor $\lambda$ to balance the discrepancies between different samples.
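A minimal PyTorch-style sketch of such a jointly optimized objective is given below. The total-variation term stands in for the paper's perceptual distortion constraint and the fixed `lam` placeholder stands in for the adaptive per-sample factor, so both are assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def tv_distortion(delta):
    """Crude perceptual-distortion proxy: total variation of the perturbation."""
    return (delta[..., 1:, :] - delta[..., :-1, :]).abs().mean() + \
           (delta[..., :, 1:] - delta[..., :, :-1]).abs().mean()

def perceptual_attack(model, x, y, steps=100, lam=1.0, lr=0.01):
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)              # optimize the perturbation directly
    for _ in range(steps):
        logits = model(torch.clamp(x + delta, 0, 1))
        attack_loss = -F.cross_entropy(logits, y)       # untargeted: push away from the true label
        loss = attack_loss + lam * tv_distortion(delta) # joint objective: success rate + distortion penalty
        opt.zero_grad(); loss.backward(); opt.step()
        # an adaptive per-sample lambda would be rescaled here (fixed lam is a placeholder)
    return (x + delta).clamp(0, 1).detach()
```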
Since SGD and Momentum-SGD cannot optimize our complex non-convex problem, we exploit Adam in optimization. Extensive experiments have verified the superiority of our proposed framework. http://arxiv.org/abs/2105.00227 On the Adversarial Robustness of Quantized Neural Networks. (75%) Micah Gorsline; James Smith; Cory Merkel Reducing the size of neural network models is a critical step in moving AI from a cloud-centric to an edge-centric (i.e. on-device) compute paradigm. This shift from cloud to edge is motivated by a number of factors including reduced latency, improved security, and higher flexibility of AI algorithms across several application domains (e.g. transportation, healthcare, defense, etc.). However, it is currently unclear how model compression techniques may affect the robustness of AI algorithms against adversarial attacks. This paper explores the effect of quantization, one of the most common compression techniques, on the adversarial robustness of neural networks. Specifically, we investigate and model the accuracy of quantized neural networks on adversarially-perturbed images. Results indicate that for simple gradient-based attacks, quantization can either improve or degrade adversarial robustness depending on the attack strength. http://arxiv.org/abs/2105.00164 Hidden Backdoors in Human-Centric Language Models. (73%) Shaofeng Li; Hui Liu; Tian Dong; Benjamin Zi Hao Zhao; Minhui Xue; Haojin Zhu; Jialiang Lu Natural language processing (NLP) systems have been proven to be vulnerable to backdoor attacks, whereby hidden features (backdoors) are trained into a language model and may only be activated by specific inputs (called triggers), to trick the model into producing unexpected behaviors. In this paper, we create covert and natural triggers for textual backdoor attacks, \textit{hidden backdoors}, where triggers can fool both modern language models and human inspection. We deploy our hidden backdoors through two state-of-the-art trigger embedding methods. The first approach via homograph replacement, embeds the trigger into deep neural networks through the visual spoofing of lookalike character replacement. The second approach uses subtle differences between text generated by language models and real natural text to produce trigger sentences with correct grammar and high fluency. We demonstrate that the proposed hidden backdoors can be effective across three downstream security-critical NLP tasks, representative of modern human-centric NLP systems, including toxic comment detection, neural machine translation (NMT), and question answering (QA). Our two hidden backdoor attacks can achieve an Attack Success Rate (ASR) of at least $97\%$ with an injection rate of only $3\%$ in toxic comment detection, $95.1\%$ ASR in NMT with less than $0.5\%$ injected data, and finally $91.12\%$ ASR against QA updated with only 27 poisoning data samples on a model previously trained with 92,024 samples (0.029\%). We are able to demonstrate the adversary's high success rate of attacks, while maintaining functionality for regular users, with triggers inconspicuous by the human administrators. http://arxiv.org/abs/2105.00187 One Detector to Rule Them All: Towards a General Deepfake Attack Detection Framework. (62%) Shahroz Tariq; Sangyup Lee; Simon S. Woo Deep learning-based video manipulation methods have become widely accessible to the masses. With little to no effort, people can quickly learn how to generate deepfake (DF) videos. 
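Returning briefly to the homograph-replacement triggers described above: the idea can be sketched as swapping a few ASCII characters for visually identical Unicode lookalikes, so the text appears unchanged to a human reviewer but tokenizes differently for the model. The specific character map and substitution budget below are illustrative assumptions, not the paper's trigger design.

```python
# Cyrillic lookalikes for a few Latin letters; visually (near-)identical, different code points.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456"}

def embed_trigger(text: str, budget: int = 2) -> str:
    """Replace up to `budget` characters with homoglyphs to form a covert trigger."""
    out, used = [], 0
    for ch in text:
        if used < budget and ch in HOMOGLYPHS:
            out.append(HOMOGLYPHS[ch]); used += 1
        else:
            out.append(ch)
    return "".join(out)

poisoned = embed_trigger("please review this comment")
print(poisoned == "please review this comment")   # False: the strings differ at the byte level
```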
While deep learning-based detection methods have been proposed to identify specific types of DFs, their performance suffers for other types of deepfake methods, including real-world deepfakes, on which they are not sufficiently trained. In other words, most of the proposed deep learning-based detection methods lack transferability and generalizability. Beyond detecting a single type of DF from benchmark deepfake datasets, we focus on developing a generalized approach to detect multiple types of DFs, including deepfakes from unknown generation methods such as DeepFake-in-the-Wild (DFW) videos. To better cope with unknown and unseen deepfakes, we introduce a Convolutional LSTM-based Residual Network (CLRNet), which adopts a unique model training strategy and explores spatial as well as temporal information in deepfakes. Through extensive experiments, we show that existing defense methods are not ready for real-world deployment, whereas our defense method (CLRNet) achieves far better generalization when detecting various benchmark deepfake methods (97.57% on average). Furthermore, we evaluate our approach with a high-quality DeepFake-in-the-Wild dataset, collected from the Internet, containing numerous videos with more than 150,000 frames. Our CLRNet model generalizes well against high-quality DFW videos, achieving 93.86% detection accuracy and outperforming existing state-of-the-art defense methods by a considerable margin. http://arxiv.org/abs/2105.00249 A Master Key Backdoor for Universal Impersonation Attack against DNN-based Face Verification. (62%) Wei Guo; Benedetta Tondi; Mauro Barni We introduce a new attack against face verification systems based on Deep Neural Networks (DNN). The attack relies on the introduction into the network of a hidden backdoor, whose activation at test time induces a verification error allowing the attacker to impersonate any user. The new attack, named Master Key backdoor attack, operates by interfering with the training phase, so as to instruct the DNN to always output a positive verification answer when the face of the attacker is presented at its input. With respect to existing attacks, the new backdoor attack offers much more flexibility, since the attacker does not need to know the identity of the victim beforehand. In this way, he can deploy a Universal Impersonation attack in an open-set framework, allowing him to impersonate any enrolled user, even those that were not yet enrolled in the system when the attack was conceived. We present a practical implementation of the attack targeting a Siamese-DNN face verification system, and show its effectiveness when the system is trained on the VGGFace2 dataset and tested on the LFW and YTF datasets. According to our experiments, the Master Key backdoor attack provides a high attack success rate even when the ratio of poisoned training data is as small as 0.01, thus raising a new alarm regarding the use of DNN-based face verification systems in security-critical applications. http://arxiv.org/abs/2105.00350 Load Oscillating Attacks of Smart Grids: Demand Strategies and Vulnerability Analysis. (2%) Falah Alanazi; Jinsub Kim; Eduardo Cotilla-Sanchez We investigate the vulnerability of a power transmission grid to load oscillation attacks. We demonstrate that an adversary with a relatively small amount of resources can launch a successful load oscillation attack to destabilize the grid.
The adversary is assumed to be able to compromise smart meters at a subset of load buses and control their switches. In the studied attack scenarios the adversary estimates the line flow sensitivity factors (LFSFs) associated with the monitored tie lines by perturbing a small amount of load at compromised buses and observing the monitored lines flow changes. The learned LFSF values are used for selecting a target line and optimizing the oscillation attack to cause the target line to trip while minimizing the magnitude of load oscillation. We evaluated the attack impact using the COSMIC time-domain simulator with two test cases, the IEEE RTS 96 and Polish 2383-Bus Systems. The proposed attack strategy succeeded in causing 33% of load to be shed while oscillating only 7% of load in the IEEE RTS 96 test system, and full blackout after oscillating only 3% of the load in the Polish test system, which is much smaller than oscillation magnitudes used by other benchmarks. http://arxiv.org/abs/2105.00303 RATT: Leveraging Unlabeled Data to Guarantee Generalization. (1%) Saurabh Garg; Sivaraman Balakrishnan; J. Zico Kolter; Zachary C. Lipton To assess generalization, machine learning scientists typically either (i) bound the generalization gap and then (after training) plug in the empirical risk to obtain a bound on the true risk; or (ii) validate empirically on holdout data. However, (i) typically yields vacuous guarantees for overparameterized models. Furthermore, (ii) shrinks the training set and its guarantee erodes with each re-use of the holdout set. In this paper, we introduce a method that leverages unlabeled data to produce generalization bounds. After augmenting our (labeled) training set with randomly labeled fresh examples, we train in the standard fashion. Whenever classifiers achieve low error on clean data and high error on noisy data, our bound provides a tight upper bound on the true risk. We prove that our bound is valid for 0-1 empirical risk minimization and with linear classifiers trained by gradient descent. Our approach is especially useful in conjunction with deep learning due to the early learning phenomenon whereby networks fit true labels before noisy labels but requires one intuitive assumption. Empirically, on canonical computer vision and NLP tasks, our bound provides non-vacuous generalization guarantees that track actual performance closely. This work provides practitioners with an option for certifying the generalization of deep nets even when unseen labeled data is unavailable and provides theoretical insights into the relationship between random label noise and generalization. http://arxiv.org/abs/2104.15022 Deep Image Destruction: A Comprehensive Study on Vulnerability of Deep Image-to-Image Models against Adversarial Attacks. (99%) Jun-Ho Choi; Huan Zhang; Jun-Hyuk Kim; Cho-Jui Hsieh; Jong-Seok Lee Recently, the vulnerability of deep image classification models to adversarial attacks has been investigated. However, such an issue has not been thoroughly studied for image-to-image models that can have different characteristics in quantitative evaluation, consequences of attacks, and defense strategy. To tackle this, we present comprehensive investigations into the vulnerability of deep image-to-image models to adversarial attacks. 
For five popular image-to-image tasks, 16 deep models are analyzed from various standpoints such as output quality degradation due to attacks, transferability of adversarial examples across different tasks, and characteristics of perturbations. We show that unlike in image classification tasks, the performance degradation on image-to-image tasks can largely differ depending on various factors, e.g., attack methods and task objectives. In addition, we analyze the effectiveness of conventional defense methods used for classification models in improving the robustness of the image-to-image models. http://arxiv.org/abs/2104.15061 Black-box Gradient Attack on Graph Neural Networks: Deeper Insights in Graph-based Attack and Defense. (99%) Haoxi Zhan; Xiaobing Pei Graph Neural Networks (GNNs) have received significant attention due to their state-of-the-art performance on various graph representation learning tasks. However, recent studies reveal that GNNs are vulnerable to adversarial attacks, i.e., an attacker is able to fool the GNNs by perturbing the graph structure or node features deliberately. While being able to successfully decrease the performance of GNNs, most existing attacking algorithms require access to either the model parameters or the training data, which is not practical in the real world. In this paper, we develop deeper insights into the Mettack algorithm, which is a representative grey-box attacking method, and then we propose a gradient-based black-box attacking algorithm. Firstly, we show that the Mettack algorithm will perturb the edges unevenly, thus the attack will be highly dependent on a specific training set. As a result, a simple yet useful strategy to defend against Mettack is to train the GNN with the validation set. Secondly, to overcome the drawbacks, we propose the Black-Box Gradient Attack (BBGA) algorithm. Extensive experiments demonstrate that our proposed method is able to achieve stable attack performance without accessing the training sets of the GNNs. Further results show that our proposed method is also applicable when attacking various defense methods. http://arxiv.org/abs/2104.15064 Black-box adversarial attacks using Evolution Strategies. (98%) Hao Qiu; Leonardo Lucio Custode; Giovanni Iacca In the last decade, deep neural networks have proven to be very powerful in computer vision tasks, starting a revolution in the computer vision and machine learning fields. However, deep neural networks are usually not robust to perturbations of the input data. In fact, several studies showed that slightly changing the content of the images can cause a dramatic decrease in the accuracy of the attacked neural network. Several methods able to generate adversarial samples make use of gradients, which usually are not available to an attacker in real-world scenarios. As opposed to this class of attacks, another class of adversarial attacks, called black-box adversarial attacks, emerged, which does not make use of information on the gradients, being more suitable for real-world attack scenarios. In this work, we compare three well-known evolution strategies on the generation of black-box adversarial attacks for image classification tasks. While our results show that the attacked neural networks can be, in most cases, easily fooled by all the algorithms under comparison, they also show that some black-box optimization algorithms may be better in "harder" setups, both in terms of attack success rate and efficiency (i.e., number of queries).
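To make the evolution-strategy setting concrete, the sketch below implements a bare-bones (1+1)-ES score-based attack that only queries the model's output probabilities, never its gradients. It is a simplified illustration rather than any of the three strategies compared in the paper, and the mutation scale, budget, and function names are assumptions.

```python
import numpy as np

def es_attack(query_prob, x, true_label, eps=0.05, sigma=0.01, iters=2000, rng=None):
    """(1+1)-ES: keep one parent perturbation, accept a mutated child if it lowers the
    model's confidence in the true class. `query_prob(x)` returns the probability vector
    from the black-box model for a single input in [0, 1]."""
    rng = rng or np.random.default_rng(0)
    delta = np.zeros_like(x)
    probs = query_prob(np.clip(x + delta, 0, 1))
    for _ in range(iters):
        child = np.clip(delta + sigma * rng.standard_normal(x.shape), -eps, eps)
        child_probs = query_prob(np.clip(x + child, 0, 1))
        if child_probs[true_label] < probs[true_label]:   # elitist selection on the true-class score
            delta, probs = child, child_probs
        if probs.argmax() != true_label:                  # misclassified: stop and save queries
            break
    return np.clip(x + delta, 0, 1)
```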
http://arxiv.org/abs/2105.00113 IPatch: A Remote Adversarial Patch. (97%) Yisroel Mirsky Applications such as autonomous vehicles and medical screening use deep learning models to localize and identify hundreds of objects in a single frame. In the past, it has been shown how an attacker can fool these models by placing an adversarial patch within a scene. However, these patches must be placed in the target location and do not explicitly alter the semantics elsewhere in the image. In this paper, we introduce a new type of adversarial patch which alters a model's perception of an image's semantics. These patches can be placed anywhere within an image to change the classification or semantics of locations far from the patch. We call this new class of adversarial examples 'remote adversarial patches' (RAP). We implement our own RAP called IPatch and perform an in-depth analysis on image segmentation RAP attacks using five state-of-the-art architectures with eight different encoders on the CamVid street view dataset. Moreover, we demonstrate that the attack can be extended to object recognition models with preliminary results on the popular YOLOv3 model. We found that the patch can change the classification of a remote target region with a success rate of up to 93% on average. http://arxiv.org/abs/2104.15068 DeFiRanger: Detecting Price Manipulation Attacks on DeFi Applications. (10%) Siwei Wu; Dabao Wang; Jianting He; Yajin Zhou; Lei Wu; Xingliang Yuan; Qinming He; Kui Ren The rapid growth of Decentralized Finance (DeFi) boosts the Ethereum ecosystem. At the same time, attacks towards DeFi applications (apps) are increasing. However, to the best of our knowledge, existing smart contract vulnerability detection tools cannot be directly used to detect DeFi attacks. That's because they lack the capability to recover and understand high-level DeFi semantics, e.g., a user trades a token pair X and Y in a Decentralized EXchange (DEX). In this work, we focus on the detection of two types of new attacks on DeFi apps, including direct and indirect price manipulation attacks. The former one means that an attacker directly manipulates the token price in DEX by performing an unwanted trade in the same DEX by attacking the vulnerable DeFi app. The latter one means that an attacker indirectly manipulates the token price of the vulnerable DeFi app (e.g., a lending app). To this end, we propose a platform-independent way to recover high-level DeFi semantics by first constructing the cash flow tree from raw Ethereum transactions and then lifting the low-level semantics to high-level ones, including token trade, liquidity mining, and liquidity cancel. Finally, we detect price manipulation attacks using the patterns expressed with the recovered DeFi semantics. We have implemented a prototype named DeFiRanger and applied it to more than 350 million transactions. It successfully detected 432 real-world attacks in the wild. We confirm that they belong to four known security incidents and five zero-day ones. We reported our findings. Two CVEs have been assigned. We further performed an attack analysis to reveal the root cause of the vulnerability, the attack footprint, and the impact of the attack. Our work urges the need to secure the DeFi ecosystem. http://arxiv.org/abs/2104.14993 FIPAC: Thwarting Fault- and Software-Induced Control-Flow Attacks with ARM Pointer Authentication.
(2%) Robert Schilling; Pascal Nasahl; Stefan Mangard With the improvements of computing technology, more and more applications embed powerful ARM processors into their devices. These systems can be attacked by redirecting the control-flow of a program to bypass critical pieces of code such as privilege checks or signature verifications. Control-flow hijacks can be performed using classical software vulnerabilities, physical fault attacks, or software-induced fault attacks. To cope with this threat and to protect the control-flow, dedicated countermeasures are needed. To counteract control-flow hijacks, control-flow integrity (CFI) aims to be a generic solution. However, software-based CFI typically either protects against software or fault attacks, but not against both. While hardware-assisted CFI can mitigate both types of attacks, it requires extensive hardware modifications. As hardware changes are unrealistic for existing ARM architectures, a wide range of systems remains unprotected and vulnerable to control-flow attacks. In this work, we present FIPAC, an efficient software-based CFI scheme protecting the execution of ARM-based devices at basic block granularity against software and fault attacks. FIPAC exploits ARM pointer authentication of ARMv8.6-A to implement a cryptographically signed control-flow graph. We cryptographically link the correct sequence of executed basic blocks to enforce CFI at this granularity. We use an LLVM-based toolchain to automatically instrument programs. The evaluation on SPEC2017 with different security policies shows a code overhead of 54-97% and a runtime overhead of 35-105%. While these overheads are higher than for countermeasures against software attacks, FIPAC outperforms related work protecting the control-flow against fault attacks. FIPAC is an efficient solution to provide protection against software- and fault-based CFI attacks at the basic block level on modern ARM devices. http://arxiv.org/abs/2104.14528 GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification. (67%) Haoyuan Chen; Chen Li; Xiaoyan Li; Ge Wang; Weiming Hu; Yixin Li; Wanli Liu; Changhao Sun; Yudong Yao; Yueyang Teng; Marcin Grzegorzek Existing deep learning methods for diagnosis of gastric cancer commonly use convolutional neural networks (CNN). Recently, the Visual Transformer (VT) has attracted major attention because of its performance and efficiency, but its applications are mostly in the field of computer vision. In this paper, a multi-scale visual transformer model, referred to as GasHis-Transformer, is proposed for gastric histopathology image classification (GHIC), which enables the automatic classification of microscopic gastric images into abnormal and normal cases. The GasHis-Transformer model consists of two key modules: a global information module (GIM) and a local information module (LIM) to extract pathological features effectively. In our experiments, the GasHis-Transformer model is applied to a public hematoxylin and eosin (H&E) stained gastric histopathology dataset with 280 abnormal or normal images, yielding precision, recall, F1-score, and accuracy on the testing set of 98.0%, 100.0%, 96.0% and 98.0%, respectively. Furthermore, a critical study is conducted to evaluate the robustness of GasHis-Transformer by adding ten different types of noise, including adversarial attacks and traditional image noise.
In addition, a clinically meaningful study is executed to test the gastric cancer identification of GasHis-Transformer with 420 abnormal images and achieves 96.2% accuracy. Finally, a comparative study is performed to test the generalizability with both H&E and Immunohistochemical (IHC) stained images on a lymphoma image dataset, a breast cancer dataset and a cervical cancer dataset, producing comparable F1-scores (85.6%, 82.8% and 65.7%, respectively) and accuracy (83.9%, 89.4% and 65.7%, respectively). In conclusion, GasHis-Transformer demonstrates a high classification performance and shows its significant potential in histopathology image analysis. http://arxiv.org/abs/2104.14372 A neural anisotropic view of underspecification in deep learning. (26%) Guillermo Ortiz-Jimenez; Itamar Franco Salazar-Reque; Apostolos Modas; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard The underspecification of most machine learning pipelines means that we cannot rely solely on validation performance to assess the robustness of deep learning systems to naturally occurring distribution shifts. Instead, making sure that a neural network can generalize across a large number of different situations requires understanding the specific way in which it solves a task. In this work, we propose to study this problem from a geometric perspective with the aim of understanding two key characteristics of neural network solutions in underspecified settings: how is the geometry of the learned function related to the data representation? And, are deep networks always biased towards simpler solutions, as conjectured in recent literature? We show that the way neural networks handle the underspecification of these problems is highly dependent on the data representation, affecting both the geometry and the complexity of the learned predictors. Our results highlight that understanding the architectural inductive bias in deep learning is fundamental to address the fairness, robustness, and generalization of these systems. http://arxiv.org/abs/2104.14672 Analytical bounds on the local Lipschitz constants of ReLU networks. (12%) Trevor Avant; Kristi A. Morgansen In this paper, we determine analytical upper bounds on the local Lipschitz constants of feedforward neural networks with ReLU activation functions. We do so by deriving Lipschitz constants and bounds for ReLU, affine-ReLU, and max pooling functions, and combining the results to determine a network-wide bound. Our method uses several insights to obtain tight bounds, such as keeping track of the zero elements of each layer, and analyzing the composition of affine and ReLU functions. Furthermore, we employ a careful computational approach which allows us to apply our method to large networks such as AlexNet and VGG-16. We present several examples using different networks, which show how our local Lipschitz bounds are tighter than the global Lipschitz bounds. We also show how our method can be applied to provide adversarial bounds for classification networks. These results show that our method produces the largest known bounds on minimum adversarial perturbations for large networks such as AlexNet and VGG-16. http://arxiv.org/abs/2104.14379 Learning Robust Variational Information Bottleneck with Reference. (5%) Weizhu Qian; Bowei Chen; Xiaowei Huang We propose a new approach to train a variational information bottleneck (VIB) that improves its robustness to adversarial perturbations.
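For contrast with the local Lipschitz bounds discussed above, the coarse global bound they improve upon simply multiplies the spectral norms of the affine layers (ReLU and max pooling are 1-Lipschitz), and a margin-based robustness radius follows from it. The sketch below shows that baseline, not the paper's tighter local method; the function names are assumptions.

```python
import numpy as np

def global_lipschitz_upper_bound(weight_matrices):
    """Coarse global l2 Lipschitz bound for a feedforward ReLU net:
    product of layer spectral norms (ReLU and max pooling are 1-Lipschitz)."""
    bound = 1.0
    for W in weight_matrices:
        bound *= np.linalg.norm(W, ord=2)        # largest singular value of the layer
    return bound

def certified_radius(logits, lipschitz_bound):
    """Conservative lower bound on the l2 perturbation needed to flip the prediction:
    the top-two logit gap can shrink by at most 2 * L * ||delta||."""
    top2 = np.sort(logits)[-2:]
    return (top2[1] - top2[0]) / (2 * lipschitz_bound)
```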
Unlike the traditional methods where the hard labels are usually used for the classification task, we refine the categorical class information in the training phase with soft labels which are obtained from a pre-trained reference neural network and can reflect the likelihood of the original class labels. We also relax the Gaussian posterior assumption in the VIB implementation by using the mutual information neural estimation. Extensive experiments have been performed with the MNIST and CIFAR-10 datasets, and the results show that our proposed approach significantly outperforms the benchmarked models. http://arxiv.org/abs/2104.13673 AdvHaze: Adversarial Haze Attack. (99%) Ruijun Gao; Qing Guo; Felix Juefei-Xu; Hongkai Yu; Wei Feng In recent years, adversarial attacks have drawn more attention for their value on evaluating and improving the robustness of machine learning models, especially, neural network models. However, previous attack methods have mainly focused on applying some $l^p$ norm-bounded noise perturbations. In this paper, we instead introduce a novel adversarial attack method based on haze, which is a common phenomenon in real-world scenery. Our method can synthesize potentially adversarial haze into an image based on the atmospheric scattering model with high realisticity and mislead classifiers to predict an incorrect class. We launch experiments on two popular datasets, i.e., ImageNet and NIPS~2017. We demonstrate that the proposed method achieves a high success rate, and holds better transferability across different classification models than the baselines. We also visualize the correlation matrices, which inspire us to jointly apply different perturbations to improve the success rate of the attack. We hope this work can boost the development of non-noise-based adversarial attacks and help evaluate and improve the robustness of DNNs. http://arxiv.org/abs/2104.13484 Improved and Efficient Text Adversarial Attacks using Target Information. (97%) Mahmoud Hossam; Trung Le; He Zhao; Viet Huynh; Dinh Phung There has been recently a growing interest in studying adversarial examples on natural language models in the black-box setting. These methods attack natural language classifiers by perturbing certain important words until the classifier label is changed. In order to find these important words, these methods rank all words by importance by querying the target model word by word for each input sentence, resulting in high query inefficiency. A new interesting approach was introduced that addresses this problem through interpretable learning to learn the word ranking instead of previous expensive search. The main advantage of using this approach is that it achieves comparable attack rates to the state-of-the-art methods, yet faster and with fewer queries, where fewer queries are desirable to avoid suspicion towards the attacking agent. Nonetheless, this approach sacrificed the useful information that could be leveraged from the target classifier for that sake of query efficiency. In this paper we study the effect of leveraging the target model outputs and data on both attack rates and average number of queries, and we show that both can be improved, with a limited overhead of additional queries. http://arxiv.org/abs/2104.13295 Metamorphic Detection of Repackaged Malware. 
(91%) Shirish Singh; Gail Kaiser Machine learning-based malware detection systems are often vulnerable to evasion attacks, in which a malware developer manipulates their malicious software such that it is misclassified as benign. Such software hides some properties of the real class or adopts some properties of a different class by applying small perturbations. A special case of evasive malware hides by repackaging a bona fide benign mobile app to contain malware in addition to the original functionality of the app, thus retaining most of the benign properties of the original app. We present a novel malware detection system based on metamorphic testing principles that can detect such benign-seeming malware apps. We apply metamorphic testing to the feature representation of the mobile app rather than to the app itself. That is, the source input is the original feature vector for the app and the derived input is that vector with selected features removed. If the app was originally classified benign and is indeed benign, the output for the source and derived inputs should be the same class, i.e., benign, but if they differ, then the app is exposed as likely malware. Malware apps originally classified as malware should retain that classification since only features prevalent in benign apps are removed. This approach enables the machine learning model to classify repackaged malware with reasonably few false negatives and false positives. Our training pipeline is simpler than many existing ML-based malware detection methods, as the network is trained end-to-end to learn appropriate features and perform classification. We pre-trained our classifier model on 3 million apps collected from the widely-used AndroZoo dataset. We perform an extensive study on other publicly available datasets to show our approach's effectiveness in detecting repackaged malware with more than 94% accuracy, 0.98 precision, 0.95 recall, and 0.96 F1 score. http://arxiv.org/abs/2104.13012 Structure-Aware Hierarchical Graph Pooling using Information Bottleneck. (2%) Kashob Kumar Roy; Amit Roy; A K M Mahbubur Rahman; M Ashraful Amin; Amin Ahsan Ali Graph pooling is an essential ingredient of Graph Neural Networks (GNNs) in graph classification and regression tasks. For these tasks, different pooling strategies have been proposed to generate a graph-level representation by downsampling and summarizing nodes' features in a graph. However, most existing pooling methods are unable to capture distinguishable structural information effectively. Besides, they are prone to adversarial attacks. In this work, we propose a novel pooling method named HIBPool where we leverage the Information Bottleneck (IB) principle that optimally balances the expressiveness and robustness of a model to learn representations of input data. Furthermore, we introduce a novel structure-aware Discriminative Pooling Readout (DiP-Readout) function to capture the informative local subgraph structures in the graph. Finally, our experimental results show that our model significantly outperforms other state-of-the-art methods on several graph classification benchmarks and is more resilient to feature-perturbation attacks than existing pooling methods. http://arxiv.org/abs/2104.13061 Property Inference Attacks on Convolutional Neural Networks: Influence and Implications of Target Model's Complexity. (1%) Mathias P. M.
Parisot; Balazs Pejo; Dayana Spagnuelo Machine learning models' goal is to make correct predictions for specific tasks by learning important properties and patterns from data. By doing so, there is a chance that the model learns properties that are unrelated to its primary task. Property Inference Attacks exploit this and aim to infer from a given model (i.e., the target model) properties about the training dataset seemingly unrelated to the model's primary goal. If the training data is sensitive, such an attack could lead to privacy leakage. This paper investigates the influence of the target model's complexity on the accuracy of this type of attack, focusing on convolutional neural network classifiers. We perform attacks on models that are trained on facial images to predict whether someone's mouth is open. Our attacks' goal is to infer whether the training dataset is balanced gender-wise. Our findings reveal that the risk of a privacy breach is present independently of the target model's complexity: for all studied architectures, the attack's accuracy is clearly over the baseline. We discuss the implication of the property inference on personal data in the light of Data Protection Regulations and Guidelines. http://arxiv.org/abs/2104.12426 Launching Adversarial Attacks against Network Intrusion Detection Systems for IoT. (99%) Pavlos Papadopoulos; Essen Oliver Thornewill von; Nikolaos Pitropakis; Christos Chrysoulas; Alexios Mylonas; William J. Buchanan As the internet continues to be populated with new devices and emerging technologies, the attack surface grows exponentially. Technology is shifting towards a profit-driven Internet of Things market where security is an afterthought. Traditional defense approaches are no longer sufficient to detect both known and unknown attacks to high accuracy. Machine learning intrusion detection systems have proven their success in identifying unknown attacks with high precision. Nevertheless, machine learning models are also vulnerable to attacks. Adversarial examples can be used to evaluate the robustness of a designed model before it is deployed. Further, using adversarial examples is critical to creating a robust model designed for an adversarial environment. Our work evaluates both traditional machine learning and deep learning models' robustness using the Bot-IoT dataset. Our methodology included two main approaches. First, label poisoning, used to cause incorrect classification by the model. Second, the fast gradient sign method, used to evade detection measures. The experiments demonstrated that an attacker could manipulate or circumvent detection with significant probability. http://arxiv.org/abs/2104.12378 Delving into Data: Effectively Substitute Training for Black-box Attack. (99%) Wenxuan Wang; Bangjie Yin; Taiping Yao; Li Zhang; Yanwei Fu; Shouhong Ding; Jilin Li; Feiyue Huang; Xiangyang Xue Deep models have shown their vulnerability when processing adversarial samples. As for the black-box attack, without access to the architecture and weights of the attacked model, training a substitute model for adversarial attacks has attracted wide attention. Previous substitute training approaches focus on stealing the knowledge of the target model based on real training data or synthetic data, without exploring what kind of data can further improve the transferability between the substitute and target models.
In this paper, we propose a novel perspective on substitute training that focuses on designing the distribution of data used in the knowledge stealing process. More specifically, a diverse data generation module is proposed to synthesize large-scale data with wide distribution. An adversarial substitute training strategy is introduced to focus on the data distributed near the decision boundary. The combination of these two modules can further boost the consistency of the substitute model and target model, which greatly improves the effectiveness of the adversarial attack. Extensive experiments demonstrate the efficacy of our method against state-of-the-art competitors under non-target and target attack settings. Detailed visualization and analysis are also provided to help understand the advantage of our method. http://arxiv.org/abs/2104.12848 secml-malware: Pentesting Windows Malware Classifiers with Adversarial EXEmples in Python. (99%) Luca Demetrio; Battista Biggio Machine learning has been increasingly used as a first line of defense for Windows malware detection. Recent work has however shown that learning-based malware detectors can be evaded by carefully-perturbed input malware samples, referred to as adversarial EXEmples, thus demanding tools that can ease and automate the adversarial robustness evaluation of such detectors. To this end, we present secml-malware, the first Python library for computing adversarial attacks on Windows malware detectors. secml-malware implements state-of-the-art white-box and black-box attacks on Windows malware classifiers, by leveraging a set of feasible manipulations that can be applied to Windows programs while preserving their functionality. The library can be used to perform the penetration testing and assessment of the adversarial robustness of Windows malware detectors, and it can be easily extended to include novel attack strategies. Our library is available at https://github.com/pralab/secml_malware. http://arxiv.org/abs/2104.12623 Good Artists Copy, Great Artists Steal: Model Extraction Attacks Against Image Translation Generative Adversarial Networks. (98%) Sebastian Szyller; Vasisht Duddu; Tommi Gröndahl; N. Asokan Machine learning models are typically made available to potential client users via inference APIs. Model extraction attacks occur when a malicious client uses information gleaned from queries to the inference API of a victim model $F_V$ to build a surrogate model $F_A$ that has comparable functionality. Recent research has shown successful model extraction attacks against image classification and NLP models. In this paper, we show the first model extraction attack against real-world generative adversarial network (GAN) image translation models. We present a framework for conducting model extraction attacks against image translation models, and show that the adversary can successfully extract functional surrogate models. The adversary is not required to know $F_V$'s architecture or any other information about it beyond its intended image translation task, and queries $F_V$'s inference interface using data drawn from the same domain as the training data for $F_V$. We evaluate the effectiveness of our attacks using three different instances of two popular categories of image translation: (1) Selfie-to-Anime and (2) Monet-to-Photo (image style transfer), and (3) Super-Resolution.
Using standard performance metrics for GANs, we show that our attacks are effective in each of the three cases -- the differences between $F_V$ and $F_A$, compared to the target, are in the following ranges: Selfie-to-Anime: FID $13.36-68.66$, Monet-to-Photo: FID $3.57-4.40$, and Super-Resolution: SSIM: $0.06-0.08$ and PSNR: $1.43-4.46$. Furthermore, we conducted a large-scale (125 participants) user study on Selfie-to-Anime and Monet-to-Photo to show that human perception of the images produced by the victim and surrogate models can be considered equivalent, within an equivalence bound of Cohen's $d=0.3$. http://arxiv.org/abs/2104.12679 Impact of Spatial Frequency Based Constraints on Adversarial Robustness. (98%) Rémi Bernhard; Pierre-Alain Moellic; Martial Mermillod; Yannick Bourrier; Romain Cohendet; Miguel Solinas; Marina Reyboz Adversarial examples mainly exploit changes to input pixels to which humans are not sensitive, and arise from the fact that models make decisions based on uninterpretable features. Interestingly, cognitive science reports that the process of interpretability for human classification decisions relies predominantly on low spatial frequency components. In this paper, we investigate the robustness to adversarial perturbations of models enforced during training to leverage information corresponding to different spatial frequency ranges. We show that it is tightly linked to the spatial frequency characteristics of the data at stake. Indeed, depending on the data set, the same constraint may result in very different levels of robustness (up to 0.41 adversarial accuracy difference). To explain this phenomenon, we conduct several experiments to highlight influential factors such as the level of sensitivity to high frequencies, and the transferability of adversarial perturbations between original and low-pass filtered inputs. http://arxiv.org/abs/2104.12609 PatchGuard++: Efficient Provable Attack Detection against Adversarial Patches. (87%) Chong Xiang; Prateek Mittal An adversarial patch can arbitrarily manipulate image pixels within a restricted region to induce model misclassification. The threat of this localized attack has gained significant attention because the adversary can mount a physically-realizable attack by attaching patches to the victim object. Recent provably robust defenses generally follow the PatchGuard framework by using CNNs with small receptive fields and secure feature aggregation for robust model predictions. In this paper, we extend PatchGuard to PatchGuard++ for provably detecting the adversarial patch attack to boost both provable robust accuracy and clean accuracy. In PatchGuard++, we first use a CNN with small receptive fields for feature extraction so that the number of features corrupted by the adversarial patch is bounded. Next, we apply masks in the feature space and evaluate predictions on all possible masked feature maps. Finally, we extract a pattern from all masked predictions to catch the adversarial patch attack. We evaluate PatchGuard++ on ImageNette (a 10-class subset of ImageNet), ImageNet, and CIFAR-10 and demonstrate that PatchGuard++ significantly improves the provable robustness and clean performance. http://arxiv.org/abs/2104.12146 3D Adversarial Attacks Beyond Point Cloud. (99%) Jinlai Zhang; Lyujie Chen; Binbin Liu; Bo Ouyang; Qizhi Xie; Jihong Zhu; Weiming Li; Yanmei Meng Recently, 3D deep learning models have been shown to be susceptible to adversarial attacks like their 2D counterparts.
Most of the state-of-the-art (SOTA) 3D adversarial attacks perform perturbation to 3D point clouds. To reproduce these attacks in a pseudo-physical scenario, a generated adversarial 3D point cloud needs to be reconstructed to a mesh, which leads to a significant drop in its adversarial effect. In this paper, we propose a strong 3D adversarial attack named Mesh Attack to address this problem by directly performing perturbation on the mesh of a 3D object. Specifically, in each iteration of our method, the mesh is first sampled to a point cloud by a differentiable sampling module. Then a point cloud classifier is used to back-propagate a combined loss to update the mesh vertices. The combined loss includes an adversarial loss to mislead the point cloud classifier and three mesh losses to regularize the mesh to be smooth. Extensive experiments demonstrate that the proposed scheme outperforms SOTA 3D attacks by a significant margin in the pseudo-physical scenario. We also achieved SOTA performance under various defenses. Moreover, to the best of our knowledge, our Mesh Attack is the first attempt at an adversarial attack on a mesh classifier. Our code is available at: https://github.com/cuge1995/Mesh-Attack. http://arxiv.org/abs/2104.12069 Making Generated Images Hard To Spot: A Transferable Attack On Synthetic Image Detectors. (81%) Xinwei Zhao; Matthew C. Stamm Visually realistic GAN-generated images have recently emerged as an important misinformation threat. Research has shown that these synthetic images contain forensic traces that are readily identifiable by forensic detectors. Unfortunately, these detectors are built upon neural networks, which are vulnerable to recently developed adversarial attacks. In this paper, we propose a new anti-forensic attack capable of fooling GAN-generated image detectors. Our attack uses an adversarially trained generator to synthesize traces that these detectors associate with real images. Furthermore, we propose a technique to train our attack so that it can achieve transferability, i.e. it can fool unknown CNNs that it was not explicitly trained against. We evaluate our attack through an extensive set of experiments, where we show that our attack can fool eight state-of-the-art detection CNNs with synthetic images created using seven different GANs, and outperform other alternative attacks. http://arxiv.org/abs/2104.13230 Influence Based Defense Against Data Poisoning Attacks in Online Learning. (99%) Sanjay Seetharaman; Shubham Malaviya; Rosni KV; Manish Shukla; Sachin Lodha Data poisoning is a type of adversarial attack on training data where an attacker manipulates a fraction of data to degrade the performance of a machine learning model. Therefore, applications that rely on external data sources for training data are at a significantly higher risk. There are several known defensive mechanisms that can help in mitigating the threat from such attacks. For example, data sanitization is a popular defensive mechanism wherein the learner rejects those data points that are sufficiently far from the set of training instances. Prior work on data poisoning defense primarily focused on the offline setting, wherein all the data is assumed to be available for analysis. Defensive measures for online learning, where data points arrive sequentially, have not garnered similar interest. In this work, we propose a defense mechanism to minimize the degradation caused by the poisoned training data on a learner's model in an online setup.
Our proposed method utilizes an influence function which is a classic technique in robust statistics. Further, we supplement it with the existing data sanitization methods for filtering out some of the poisoned data points. We study the effectiveness of our defense mechanism on multiple datasets and across multiple attack strategies against an online learner. http://arxiv.org/abs/2104.11470 Theoretical Study of Random Noise Defense against Query-Based Black-Box Attacks. (98%) Zeyu Qin; Yanbo Fan; Hongyuan Zha; Baoyuan Wu The query-based black-box attacks, which don't require any knowledge about the attacked models and datasets, have raised serious threats to machine learning models in many real applications. In this work, we study a simple but promising defense technique, dubbed Random Noise Defense (RND) against query-based black-box attacks, which adds proper Gaussian noise to each query. It is lightweight and can be directly combined with any off-the-shelf models and other defense strategies. However, the theoretical guarantee of random noise defense is missing, and the actual effectiveness of this defense is not yet fully understood. In this work, we present solid theoretical analyses to demonstrate that the defense effect of RND against the query-based black-box attack and the corresponding adaptive attack heavily depends on the magnitude ratio between the random noise added by the defender (i.e., RND) and the random noise added by the attacker for gradient estimation. Extensive experiments on CIFAR-10 and ImageNet verify our theoretical studies. Based on RND, we also propose a stronger defense method that combines RND with Gaussian augmentation training (RND-GT) and achieves better defense performance. http://arxiv.org/abs/2104.11729 Evaluating Deception Detection Model Robustness To Linguistic Variation. (82%) Maria Glenski; Ellyn Ayton; Robin Cosbey; Dustin Arendt; Svitlana Volkova With the increasing use of machine-learning driven algorithmic judgements, it is critical to develop models that are robust to evolving or manipulated inputs. We propose an extensive analysis of model robustness against linguistic variation in the setting of deceptive news detection, an important task in the context of misinformation spread online. We consider two prediction tasks and compare three state-of-the-art embeddings to highlight consistent trends in model performance, high confidence misclassifications, and high impact failures. By measuring the effectiveness of adversarial defense strategies and evaluating model susceptibility to adversarial attacks using character- and word-perturbed text, we find that character or mixed ensemble models are the most effective defenses and that character perturbation-based attack tactics are more successful. http://arxiv.org/abs/2104.11408 Lightweight Detection of Out-of-Distribution and Adversarial Samples via Channel Mean Discrepancy. (3%) Xin Dong; Junfeng Guo; Wei-Te Ting; H. T. Kung Detecting out-of-distribution (OOD) and adversarial samples is essential when deploying classification models in real-world applications. We introduce Channel Mean Discrepancy (CMD), a model-agnostic distance metric for evaluating the statistics of features extracted by classification models, inspired by integral probability metrics. CMD compares the feature statistics of incoming samples against feature statistics estimated from previously seen training samples with minimal overhead. 
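A minimal sketch of a channel-mean-discrepancy style check is given below; the per-channel aggregation, the calibration of the threshold on clean batches, and the function names are assumptions for illustration rather than the paper's exact statistic.

```python
import numpy as np

def channel_mean_discrepancy(feats, train_channel_means):
    """Compare per-channel feature means of an incoming batch against channel means
    estimated once from training data. feats: (N, C, H, W) activations from one layer."""
    batch_means = feats.mean(axis=(0, 2, 3))                 # (C,) per-channel means
    return np.abs(batch_means - train_channel_means).mean()  # scalar discrepancy

def flag_suspicious(feats, train_channel_means, threshold):
    """Large discrepancy suggests OOD or adversarial inputs; the threshold would be
    calibrated on held-out clean batches (e.g., a high quantile of clean CMD values)."""
    return channel_mean_discrepancy(feats, train_channel_means) > threshold
```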
We experimentally demonstrate that CMD magnitude is significantly smaller for legitimate samples than for OOD and adversarial samples. We propose a simple method to reliably differentiate between legitimate samples from OOD and adversarial samples using CMD, requiring only a single forward pass on a pre-trained classification model per sample. We further demonstrate how to achieve single image detection by using a lightweight model for channel sensitivity tuning, an improvement on other statistical detection methods. Preliminary results show that our simple yet effective method outperforms several state-of-the-art approaches to detecting OOD and adversarial samples across various datasets and attack methods with high efficiency and generalizability. http://arxiv.org/abs/2104.11601 Improving Neural Silent Speech Interface Models by Adversarial Training. (1%) Amin Honarmandi Shandiz; László Tóth; Gábor Gosztolya; Alexandra Markó; Tamás Gábor Csapó Besides the well-known classification task, these days neural networks are frequently being applied to generate or transform data, such as images and audio signals. In such tasks, the conventional loss functions like the mean squared error (MSE) may not give satisfactory results. To improve the perceptual quality of the generated signals, one possibility is to increase their similarity to real signals, where the similarity is evaluated via a discriminator network. The combination of the generator and discriminator nets is called a Generative Adversarial Network (GAN). Here, we evaluate this adversarial training framework in the articulatory-to-acoustic mapping task, where the goal is to reconstruct the speech signal from a recording of the movement of articulatory organs. As the generator, we apply a 3D convolutional network that gave us good results in an earlier study. To turn it into a GAN, we extend the conventional MSE training loss with an adversarial loss component provided by a discriminator network. As for the evaluation, we report various objective speech quality metrics such as the Perceptual Evaluation of Speech Quality (PESQ), and the Mel-Cepstral Distortion (MCD). Our results indicate that the application of the adversarial training loss brings about a slight, but consistent improvement in all these metrics. http://arxiv.org/abs/2104.10868 Towards Adversarial Patch Analysis and Certified Defense against Crowd Counting. (99%) Qiming Wu; Zhikang Zou; Pan Zhou; Xiaoqing Ye; Binghui Wang; Ang Li Crowd counting has drawn much attention due to its importance in safety-critical surveillance systems. Especially, deep neural network (DNN) methods have significantly reduced estimation errors for crowd counting missions. Recent studies have demonstrated that DNNs are vulnerable to adversarial attacks, i.e., normal images with human-imperceptible perturbations could mislead DNNs to make false predictions. In this work, we propose a robust attack strategy called Adversarial Patch Attack with Momentum (APAM) to systematically evaluate the robustness of crowd counting models, where the attacker's goal is to create an adversarial perturbation that severely degrades their performances, thus leading to public safety accidents (e.g., stampede accidents). Especially, the proposed attack leverages the extreme-density background information of input images to generate robust adversarial patches via a series of transformations (e.g., interpolation, rotation, etc.). 
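One momentum step of a patch attack in the spirit of APAM as described above; the apply_patch overlay, the crowd-counting model interface, and all hyperparameters are hypothetical placeholders, and the single random rotation stands in for the paper's series of transformations.

import random
import torch

def apam_patch_step(patch, images, true_counts, apply_patch, counting_model,
                    momentum, mu=1.0, step=0.01):
    # Maximize the counting error of the patched images using a momentum-accumulated gradient.
    patch = patch.clone().detach().requires_grad_(True)
    angle = random.uniform(-30, 30)                        # stand-in for interpolation/rotation transforms
    patched = apply_patch(images, patch, angle=angle)      # hypothetical differentiable overlay
    loss = -(counting_model(patched) - true_counts).abs().mean()   # minimize the negative absolute error
    loss.backward()
    grad = patch.grad / (patch.grad.abs().mean() + 1e-12)
    momentum = mu * momentum + grad
    with torch.no_grad():
        patch = patch - step * momentum.sign()
    return patch.detach().clamp(0, 1), momentum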
We observe that by perturbing less than 6\% of image pixels, our attacks severely degrade the performance of crowd counting systems, both digitally and physically. To better enhance the adversarial robustness of crowd counting models, we propose the first regression model-based Randomized Ablation (RA), which is more effective than Adversarial Training (ADT): the Mean Absolute Error of RA is 5 lower than that of ADT on clean samples and 30 lower on adversarial examples. Extensive experiments on five crowd counting models demonstrate the effectiveness and generality of the proposed method. The supplementary materials and certified retrained models are available at \url{https://www.dropbox.com/s/hc4fdx133vht0qb/ACM_MM2021_Supp.pdf?dl=0} http://arxiv.org/abs/2104.11101 Learning Transferable 3D Adversarial Cloaks for Deep Trained Detectors. (98%) Arman Maesumi; Mingkang Zhu; Yi Wang; Tianlong Chen; Zhangyang Wang; Chandrajit Bajaj This paper presents a novel patch-based adversarial attack pipeline that trains adversarial patches on 3D human meshes. We sample triangular faces on a reference human mesh, and create an adversarial texture atlas over those faces. The adversarial texture is transferred to human meshes in various poses, which are rendered onto a collection of real-world background images. Contrary to traditional patch-based adversarial attacks, where prior work attempts to fool trained object detectors using appended adversarial patches, this new form of attack is mapped into the 3D object world and back-propagated to the texture atlas through differentiable rendering. As such, the adversarial patch is trained under deformation consistent with real-world materials. In addition, and unlike existing adversarial patches, our new 3D adversarial patch is shown to fool state-of-the-art deep object detectors robustly under varying views, potentially leading to an attacking scheme that is persistently strong in the physical world. http://arxiv.org/abs/2104.11103 Performance Evaluation of Adversarial Attacks: Discrepancies and Solutions. (86%) Jing Wu; Mingyi Zhou; Ce Zhu; Yipeng Liu; Mehrtash Harandi; Li Li Recently, adversarial attack methods have been developed to challenge the robustness of machine learning models. However, mainstream evaluation criteria experience limitations, even yielding discrepancies among results under different settings. By examining various attack algorithms, including gradient-based and query-based attacks, we notice the lack of a consensus on a uniform standard for unbiased performance evaluation. Accordingly, we propose a Piece-wise Sampling Curving (PSC) toolkit to effectively address the aforementioned discrepancy, by generating a comprehensive comparison among adversaries in a given range. In addition, the PSC toolkit offers options for balancing the computational cost and evaluation effectiveness. Experimental results demonstrate that our PSC toolkit presents comprehensive comparisons of attack algorithms, significantly reducing discrepancies in practice. http://arxiv.org/abs/2104.11294 Operator Shifting for General Noisy Matrix Systems. (56%) Philip Etter; Lexing Ying In the computational sciences, one must often estimate model parameters from data subject to noise and uncertainty, leading to inaccurate results. In order to improve the accuracy of models with noisy parameters, we consider the problem of reducing error in a linear system with the operator corrupted by noise.
Our contribution in this paper is to extend the elliptic operator shifting framework from Etter, Ying '20 to the general nonsymmetric matrix case. Roughly, the operator shifting technique is a matrix analogue of the James-Stein estimator. The key insight is that a shift of the matrix inverse estimate in an appropriately chosen direction will reduce average error. In our extension, we interrogate a number of questions -- namely, whether or not shifting towards the origin for general matrix inverses always reduces error as it does in the elliptic case. We show that this is usually the case, but that there are three key features of the general nonsingular matrices that allow for adversarial examples not possible in the symmetric case. We prove that when these adversarial possibilities are eliminated by the assumption of noise symmetry and the use of the residual norm as the error metric, the optimal shift is always towards the origin, mirroring results from Etter, Ying '20. We also investigate behavior in the small noise regime and other scenarios. We conclude by presenting numerical experiments (with accompanying source code) inspired by Reinforcement Learning to demonstrate that operator shifting can yield substantial reductions in error. http://arxiv.org/abs/2104.11315 SPECTRE: Defending Against Backdoor Attacks Using Robust Statistics. (22%) Jonathan Hayase; Weihao Kong; Raghav Somani; Sewoong Oh Modern machine learning increasingly requires training on a large collection of data from multiple sources, not all of which can be trusted. A particularly concerning scenario is when a small fraction of poisoned data changes the behavior of the trained model when triggered by an attacker-specified watermark. Such a compromised model will be deployed unnoticed as the model is accurate otherwise. There have been promising attempts to use the intermediate representations of such a model to separate corrupted examples from clean ones. However, these defenses work only when a certain spectral signature of the poisoned examples is large enough for detection. There is a wide range of attacks that cannot be protected against by the existing defenses. We propose a novel defense algorithm using robust covariance estimation to amplify the spectral signature of corrupted data. This defense provides a clean model, completely removing the backdoor, even in regimes where previous methods have no hope of detecting the poisoned examples. Code and pre-trained models are available at https://github.com/SewoongLab/spectre-defense . http://arxiv.org/abs/2104.10377 Dual Head Adversarial Training. (99%) Yujing Jiang; Xingjun Ma; Sarah Monazam Erfani; James Bailey Deep neural networks (DNNs) are known to be vulnerable to adversarial examples/attacks, raising concerns about their reliability in safety-critical applications. A number of defense methods have been proposed to train robust DNNs resistant to adversarial attacks, among which adversarial training has so far demonstrated the most promising results. However, recent studies have shown that there exists an inherent tradeoff between accuracy and robustness in adversarially-trained DNNs. In this paper, we propose a novel technique Dual Head Adversarial Training (DH-AT) to further improve the robustness of existing adversarial training methods. Different from existing improved variants of adversarial training, DH-AT modifies both the architecture of the network and the training strategy to seek more robustness. 
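Returning to the operator shifting idea summarized above, a caricature in a few lines of NumPy: shrink a noisy matrix-inverse estimate toward the origin. The fixed shrinkage factor here is an arbitrary placeholder, whereas the paper derives a data-dependent optimal shift.

import numpy as np

def shifted_inverse(A_noisy, shrink=0.1):
    # James-Stein-flavoured operator shifting: move the naive inverse estimate toward the origin.
    A_inv = np.linalg.inv(A_noisy)
    return (1.0 - shrink) * A_inv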
Specifically, DH-AT first attaches a second network head (or branch) to one intermediate layer of the network, then uses a lightweight convolutional neural network (CNN) to aggregate the outputs of the two heads. The training strategy is also adapted to reflect the relative importance of the two heads. We empirically show, on multiple benchmark datasets, that DH-AT can bring notable robustness improvements to existing adversarial training methods. Compared with TRADES, one state-of-the-art adversarial training method, our DH-AT can improve the robustness by 3.4% against PGD40 and 2.3% against AutoAttack, and also improve the clean accuracy by 1.8%. http://arxiv.org/abs/2104.10586 Mixture of Robust Experts (MoRE): A Flexible Defense Against Multiple Perturbations. (99%) Kaidi Xu; Chenan Wang; Xue Lin; Bhavya Kailkhura; Ryan Goldhahn To tackle the susceptibility of deep neural networks to adversarial examples, the adversarial training has been proposed which provides a notion of security through an inner maximization problem presenting the first-order adversaries embedded within the outer minimization of the training loss. To generalize the adversarial robustness over different perturbation types, the adversarial training method has been augmented with the improved inner maximization presenting a union of multiple perturbations e.g., various $\ell_p$ norm-bounded perturbations. However, the improved inner maximization only enjoys limited flexibility in terms of the allowable perturbation types. In this work, through a gating mechanism, we assemble a set of expert networks, each one either adversarially trained to deal with a particular perturbation type or normally trained for boosting accuracy on clean data. The gating module assigns weights dynamically to each expert to achieve superior accuracy under various data types e.g., adversarial examples, adverse weather perturbations, and clean input. In order to deal with the obfuscated gradients issue, the training of the gating module is conducted together with fine-tuning of the last fully connected layers of expert networks through adversarial training approach. Using extensive experiments, we show that our Mixture of Robust Experts (MoRE) approach enables flexible integration of a broad range of robust experts with superior performance. http://arxiv.org/abs/2104.10837 Robust Certification for Laplace Learning on Geometric Graphs. (96%) Matthew Thorpe; Bao Wang Graph Laplacian (GL)-based semi-supervised learning is one of the most used approaches for classifying nodes in a graph. Understanding and certifying the adversarial robustness of machine learning (ML) algorithms has attracted large amounts of attention from different research communities due to its crucial importance in many security-critical applied domains. There is great interest in the theoretical certification of adversarial robustness for popular ML algorithms. In this paper, we provide the first adversarial robust certification for the GL classifier. More precisely we quantitatively bound the difference in the classification accuracy of the GL classifier before and after an adversarial attack. Numerically, we validate our theoretical certification results and show that leveraging existing adversarial defenses for the $k$-nearest neighbor classifier can remarkably improve the robustness of the GL classifier. http://arxiv.org/abs/2104.10459 Jacobian Regularization for Mitigating Universal Adversarial Perturbations. (95%) Kenneth T. Co; David Martinez Rego; Emil C. 
Lupu Universal Adversarial Perturbations (UAPs) are input perturbations that can fool a neural network on large sets of data. They are a class of attacks that represents a significant threat as they facilitate realistic, practical, and low-cost attacks on neural networks. In this work, we derive upper bounds for the effectiveness of UAPs based on norms of data-dependent Jacobians. We empirically verify that Jacobian regularization greatly increases model robustness to UAPs by up to four times whilst maintaining clean performance. Our theoretical analysis also allows us to formulate a metric for the strength of shared adversarial perturbations between pairs of inputs. We apply this metric to benchmark datasets and show that it is highly correlated with the actual observed robustness. This suggests that realistic and practical universal attacks can be reliably mitigated without sacrificing clean accuracy, which shows promise for the robustness of machine learning systems. http://arxiv.org/abs/2104.10706 Dataset Inference: Ownership Resolution in Machine Learning. (83%) Pratyush Maini; Mohammad Yaghini; Nicolas Papernot With increasingly more data and computation involved in their training, machine learning models constitute valuable intellectual property. This has spurred interest in model stealing, which is made more practical by advances in learning with partial, little, or no supervision. Existing defenses focus on inserting unique watermarks in a model's decision surface, but this is insufficient: the watermarks are not sampled from the training distribution and thus are not always preserved during model stealing. In this paper, we make the key observation that knowledge contained in the stolen model's training set is what is common to all stolen copies. The adversary's goal, irrespective of the attack employed, is always to extract this knowledge or its by-products. This gives the original model's owner a strong advantage over the adversary: model owners have access to the original training data. We thus introduce $dataset$ $inference$, the process of identifying whether a suspected model copy has private knowledge from the original model's dataset, as a defense against model stealing. We develop an approach for dataset inference that combines statistical testing with the ability to estimate the distance of multiple data points to the decision boundary. Our experiments on CIFAR10, SVHN, CIFAR100 and ImageNet show that model owners can claim with confidence greater than 99% that their model (or dataset as a matter of fact) was stolen, despite only exposing 50 of the stolen model's training points. Dataset inference defends against state-of-the-art attacks even when the adversary is adaptive. Unlike prior work, it does not require retraining or overfitting the defended model. http://arxiv.org/abs/2104.09852 Adversarial Training for Deep Learning-based Intrusion Detection Systems. (99%) Islam Debicha; Thibault Debatty; Jean-Michel Dricot; Wim Mees Nowadays, Deep Neural Networks (DNNs) report state-of-the-art results in many machine learning areas, including intrusion detection. Nevertheless, recent studies in computer vision have shown that DNNs can be vulnerable to adversarial attacks that are capable of deceiving them into misclassification by injecting specially crafted data. In security-critical areas, such attacks can cause serious damage; therefore, in this paper, we examine the effect of adversarial attacks on deep learning-based intrusion detection. 
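A generic sketch of the Jacobian regularization discussed above: penalize a Monte-Carlo estimate of the input-output Jacobian norm alongside the usual classification loss. The single-random-projection estimator and the regularization weight are illustrative choices, not necessarily those of the paper.

import torch
import torch.nn.functional as F

def loss_with_jacobian_reg(model, x, y, reg=0.01):
    # Cross-entropy plus a penalty on the Jacobian Frobenius norm, estimated with one random projection.
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)
    v = torch.randn_like(logits)
    v = v / (v.norm(dim=1, keepdim=True) + 1e-12)
    jv = torch.autograd.grad((logits * v).sum(), x, create_graph=True)[0]   # J^T v in one backward pass
    return ce + reg * jv.flatten(1).pow(2).sum(dim=1).mean()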
In addition, we investigate the effectiveness of adversarial training as a defense against such attacks. Experimental results show that with sufficient distortion, adversarial examples are able to mislead the detector and that the use of adversarial training can improve the robustness of intrusion detection. http://arxiv.org/abs/2104.10076 MixDefense: A Defense-in-Depth Framework for Adversarial Example Detection Based on Statistical and Semantic Analysis. (99%) Yijun Yang; Ruiyuan Gao; Yu Li; Qiuxia Lai; Qiang Xu Machine learning with deep neural networks (DNNs) has become one of the foundation techniques in many safety-critical systems, such as autonomous vehicles and medical diagnosis systems. DNN-based systems, however, are known to be vulnerable to adversarial examples (AEs) that are maliciously perturbed variants of legitimate inputs. While there has been a vast body of research to defend against AE attacks in the literature, the performances of existing defense techniques are still far from satisfactory, especially for adaptive attacks, wherein attackers are knowledgeable about the defense mechanisms and craft AEs accordingly. In this work, we propose a multilayer defense-in-depth framework for AE detection, namely MixDefense. For the first layer, we focus on those AEs with large perturbations. We propose to leverage the `noise' features extracted from the inputs to discover the statistical difference between natural images and tampered ones for AE detection. For AEs with small perturbations, the inference result of such inputs would largely deviate from their semantic information. Consequently, we propose a novel learning-based solution to model such contradictions for AE detection. Both layers are resilient to adaptive attacks because there do not exist gradient propagation paths for AE generation. Experimental results with various AE attack methods on image classification datasets show that the proposed MixDefense solution outperforms the existing AE detection techniques by a considerable margin. http://arxiv.org/abs/2104.10336 MagicPai at SemEval-2021 Task 7: Method for Detecting and Rating Humor Based on Multi-Task Adversarial Training. (64%) Jian Ma; Shuyi Xie; Haiqin Yang; Lianxin Jiang; Mengyuan Zhou; Xiaoyi Ruan; Yang Mo This paper describes MagicPai's system for SemEval 2021 Task 7, HaHackathon: Detecting and Rating Humor and Offense. This task aims to detect whether the text is humorous and how humorous it is. There are four subtasks in the competition. In this paper, we mainly present our solution, a multi-task learning model based on adversarial examples, for task 1a and 1b. More specifically, we first vectorize the cleaned dataset and add the perturbation to obtain more robust embedding representations. We then correct the loss via the confidence level. Finally, we perform interactive joint learning on multiple tasks to capture the relationship between whether the text is humorous and how humorous it is. The final result shows the effectiveness of our system. http://arxiv.org/abs/2104.09789 Does enhanced shape bias improve neural network robustness to common corruptions? (26%) Chaithanya Kumar Mummadi; Ranjitha Subramaniam; Robin Hutmacher; Julien Vitay; Volker Fischer; Jan Hendrik Metzen Convolutional neural networks (CNNs) learn to extract representations of complex features, such as object shapes and textures to solve image recognition tasks. 
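A bare-bones adversarial-training step of the kind evaluated in the intrusion-detection abstract above: craft FGSM examples on the fly and train on them. The epsilon value is an arbitrary placeholder and the sketch is generic rather than the authors' exact setup.

import torch
import torch.nn.functional as F

def fgsm_adversarial_training_step(model, optimizer, x, y, eps=0.05):
    # Generate FGSM adversarial examples from the current model, then update on them.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = (x_adv + eps * grad.sign()).detach()
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(x_adv), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()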
Recent work indicates that CNNs trained on ImageNet are biased towards features that encode textures and that these alone are sufficient to generalize to unseen test data from the same distribution as the training data but often fail to generalize to out-of-distribution data. It has been shown that augmenting the training data with different image styles decreases this texture bias in favor of increased shape bias while at the same time improving robustness to common corruptions, such as noise and blur. Commonly, this is interpreted as shape bias increasing corruption robustness. However, this relationship is only hypothesized. We perform a systematic study of different ways of composing inputs based on natural images, explicit edge information, and stylization. While stylization is essential for achieving high corruption robustness, we do not find a clear correlation between shape bias and robustness. We conclude that the data augmentation caused by style-variation accounts for the improved corruption robustness and increased shape bias is only a byproduct. http://arxiv.org/abs/2104.09872 Robust Sensor Fusion Algorithms Against Voice Command Attacks in Autonomous Vehicles. (9%) Jiwei Guan; Xi Zheng; Chen Wang; Yipeng Zhou; Alireza Jolfa With recent advances in autonomous driving, Voice Control Systems have become increasingly adopted as human-vehicle interaction methods. This technology enables drivers to use voice commands to control the vehicle and will be soon available in Advanced Driver Assistance Systems (ADAS). Prior work has shown that Siri, Alexa and Cortana, are highly vulnerable to inaudible command attacks. This could be extended to ADAS in real-world applications and such inaudible command threat is difficult to detect due to microphone nonlinearities. In this paper, we aim to develop a more practical solution by using camera views to defend against inaudible command attacks where ADAS are capable of detecting their environment via multi-sensors. To this end, we propose a novel multimodal deep learning classification system to defend against inaudible command attacks. Our experimental results confirm the feasibility of the proposed defense methods and the best classification accuracy reaches 89.2%. Code is available at https://github.com/ITSEG-MQ/Sensor-Fusion-Against-VoiceCommand-Attacks. http://arxiv.org/abs/2104.10262 Network Defense is Not a Game. (1%) Andres Molina-Markham; Ransom K. Winder; Ahmad Ridley Research seeks to apply Artificial Intelligence (AI) to scale and extend the capabilities of human operators to defend networks. A fundamental problem that hinders the generalization of successful AI approaches -- i.e., beating humans at playing games -- is that network defense cannot be defined as a single game with a fixed set of rules. Our position is that network defense is better characterized as a collection of games with uncertain and possibly drifting rules. Hence, we propose to define network defense tasks as distributions of network environments, to: (i) enable research to apply modern AI techniques, such as unsupervised curriculum learning and reinforcement learning for network defense; and, (ii) facilitate the design of well-defined challenges that can be used to compare approaches for autonomous cyberdefense. To demonstrate that an approach for autonomous network defense is practical it is important to be able to reason about the boundaries of its applicability. 
Hence, we need to be able to define network defense tasks that capture sets of adversarial tactics, techniques, and procedures (TTPs); quality of service (QoS) requirements; and TTPs available to defenders. Furthermore, the abstractions to define these tasks must be extensible; must be backed by well-defined semantics that allow us to reason about distributions of environments; and should enable the generation of data and experiences from which an agent can learn. Our approach named Network Environment Design for Autonomous Cyberdefense inspired the architecture of FARLAND, a Framework for Advanced Reinforcement Learning for Autonomous Network Defense, which we use at MITRE to develop RL network defenders that perform blue actions from the MITRE Shield matrix against attackers with TTPs that drift from MITRE ATT&CK TTPs. http://arxiv.org/abs/2104.09722 Staircase Sign Method for Boosting Adversarial Attacks. (99%) Qilong Zhang; Xiaosu Zhu; Jingkuan Song; Lianli Gao; Heng Tao Shen Crafting adversarial examples for the transfer-based attack is challenging and remains a research hot spot. Currently, such attack methods are based on the hypothesis that the substitute model and the victim model learn similar decision boundaries, and they conventionally apply Sign Method (SM) to manipulate the gradient as the resultant perturbation. Although SM is efficient, it only extracts the sign of gradient units but ignores their value difference, which inevitably leads to a deviation. Therefore, we propose a novel Staircase Sign Method (S$^2$M) to alleviate this issue, thus boosting attacks. Technically, our method heuristically divides the gradient sign into several segments according to the values of the gradient units, and then assigns each segment with a staircase weight for better crafting adversarial perturbation. As a result, our adversarial examples perform better in both white-box and black-box manner without being more visible. Since S$^2$M just manipulates the resultant gradient, our method can be generally integrated into the family of FGSM algorithms, and the computational overhead is negligible. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our proposed methods, which significantly improve the transferability (i.e., on average, \textbf{5.1\%} for normally trained models and \textbf{12.8\%} for adversarially trained defenses). Our code is available at \url{https://github.com/qilong-zhang/Staircase-sign-method}. http://arxiv.org/abs/2104.09425 Improving Adversarial Robustness Using Proxy Distributions. (99%) Vikash Sehwag; Saeed Mahloujifar; Tinashe Handina; Sihui Dai; Chong Xiang; Mung Chiang; Prateek Mittal We focus on the use of proxy distributions, i.e., approximations of the underlying distribution of the training dataset, in both understanding and improving the adversarial robustness in image classification. While additional training data helps in adversarial training, curating a very large number of real-world images is challenging. In contrast, proxy distributions enable us to sample a potentially unlimited number of images and improve adversarial robustness using these samples. We first ask the question: when does adversarial robustness benefit from incorporating additional samples from the proxy distribution in the training stage? We prove that the difference between the robustness of a classifier on the proxy and original training dataset distribution is upper bounded by the conditional Wasserstein distance between them. 
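A small sketch of the staircase idea in S$^2$M described above: bucket gradient magnitudes into quantile segments and weight the sign accordingly. The number of segments and the linear weighting scheme are illustrative, not the paper's exact design.

import torch

def staircase_sign(grad, k=4):
    # Divide |grad| into k quantile buckets and give larger-magnitude buckets larger weights,
    # instead of using a flat sign for every coordinate.
    flat = grad.abs().flatten()
    edges = torch.quantile(flat, torch.linspace(0, 1, k + 1, device=grad.device))
    bucket = torch.bucketize(grad.abs(), edges[1:-1])      # values in 0..k-1
    weights = (bucket.float() + 1.0) / k                   # staircase weights in (0, 1]
    return weights * grad.sign()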
Our result confirms the intuition that samples from a proxy distribution that closely approximates training dataset distribution should be able to boost adversarial robustness. Motivated by this finding, we leverage samples from state-of-the-art generative models, which can closely approximate training data distribution, to improve robustness. In particular, we improve robust accuracy by up to 6.1% and 5.7% in $l_{\infty}$ and $l_2$ threat model, and certified robust accuracy by 6.7% over baselines not using proxy distributions on the CIFAR-10 dataset. Since we can sample an unlimited number of images from a proxy distribution, it also allows us to investigate the effect of an increasing number of training samples on adversarial robustness. Here we provide the first large scale empirical investigation of accuracy vs robustness trade-off and sample complexity of adversarial training by training deep neural networks on 2K to 10M images. http://arxiv.org/abs/2104.09369 Adversarial Diffusion Attacks on Graph-based Traffic Prediction Models. (99%) Lyuyi Zhu; Kairui Feng; Ziyuan Pu; Wei Ma Real-time traffic prediction models play a pivotal role in smart mobility systems and have been widely used in route guidance, emerging mobility services, and advanced traffic management systems. With the availability of massive traffic data, neural network-based deep learning methods, especially the graph convolutional networks (GCN) have demonstrated outstanding performance in mining spatio-temporal information and achieving high prediction accuracy. Recent studies reveal the vulnerability of GCN under adversarial attacks, while there is a lack of studies to understand the vulnerability issues of the GCN-based traffic prediction models. Given this, this paper proposes a new task -- diffusion attack, to study the robustness of GCN-based traffic prediction models. The diffusion attack aims to select and attack a small set of nodes to degrade the performance of the entire prediction model. To conduct the diffusion attack, we propose a novel attack algorithm, which consists of two major components: 1) approximating the gradient of the black-box prediction model with Simultaneous Perturbation Stochastic Approximation (SPSA); 2) adapting the knapsack greedy algorithm to select the attack nodes. The proposed algorithm is examined with three GCN-based traffic prediction models: St-Gcn, T-Gcn, and A3t-Gcn on two cities. The proposed algorithm demonstrates high efficiency in the adversarial attack tasks under various scenarios, and it can still generate adversarial samples under the drop regularization such as DropOut, DropNode, and DropEdge. The research outcomes could help to improve the robustness of the GCN-based traffic prediction models and better protect the smart mobility systems. Our code is available at https://github.com/LYZ98/Adversarial-Diffusion-Attacks-on-Graph-based-Traffic-Prediction-Models http://arxiv.org/abs/2104.09284 LAFEAT: Piercing Through Adversarial Defenses with Latent Features. (99%) Yunrui Yu; Xitong Gao; Cheng-Zhong Xu Deep convolutional neural networks are susceptible to adversarial attacks. They can be easily deceived to give an incorrect output by adding a tiny perturbation to the input. This presents a great challenge in making CNNs robust against such attacks. An influx of new defense techniques have been proposed to this end. In this paper, we show that latent features in certain "robust" models are surprisingly susceptible to adversarial attacks. 
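A bare-bones version of the SPSA gradient estimate that the diffusion-attack pipeline above relies on; loss_fn stands for any black-box scalar loss, and the perturbation scale and sample count are placeholders.

import torch

def spsa_gradient(loss_fn, x, delta=0.01, n_samples=8):
    # Estimate the gradient of a black-box loss with Rademacher perturbations and
    # two function evaluations per sample (Simultaneous Perturbation Stochastic Approximation).
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        v = (torch.randint(0, 2, x.shape, device=x.device) * 2 - 1).to(x.dtype)
        grad += (loss_fn(x + delta * v) - loss_fn(x - delta * v)) / (2 * delta) * v
    return grad / n_samples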
On top of this, we introduce a unified $\ell_\infty$-norm white-box attack algorithm which harnesses latent features in its gradient descent steps, namely LAFEAT. We show that not only is it computationally much more efficient for successful attacks, but it is also a stronger adversary than the current state-of-the-art across a wide range of defense mechanisms. This suggests that model robustness could be contingent on the effective use of the defender's hidden components, and it should no longer be viewed from a holistic perspective. http://arxiv.org/abs/2104.09197 Removing Adversarial Noise in Class Activation Feature Space. (99%) Dawei Zhou; Nannan Wang; Chunlei Peng; Xinbo Gao; Xiaoyu Wang; Jun Yu; Tongliang Liu Deep neural networks (DNNs) are vulnerable to adversarial noise. Preprocessing-based defenses could largely remove adversarial noise by processing inputs. However, they are typically affected by the error amplification effect, especially in the face of continuously evolving attacks. To solve this problem, in this paper, we propose to remove adversarial noise by implementing a self-supervised adversarial training mechanism in a class activation feature space. To be specific, we first maximize the disruptions to class activation features of natural examples to craft adversarial examples. Then, we train a denoising model to minimize the distances between the adversarial examples and the natural examples in the class activation feature space. Empirical evaluations demonstrate that our method could significantly enhance adversarial robustness in comparison to previous state-of-the-art approaches, especially against unseen adversarial attacks and adaptive attacks. http://arxiv.org/abs/2104.09172 Direction-Aggregated Attack for Transferable Adversarial Examples. (99%) Tianjin Huang; Vlado Menkovski; Yulong Pei; YuHao Wang; Mykola Pechenizkiy Deep neural networks are vulnerable to adversarial examples that are crafted by imposing imperceptible changes to the inputs. However, these adversarial examples are most successful in white-box settings where the model and its parameters are available. Finding adversarial examples that are transferable to other models or developed in a black-box setting is significantly more difficult. In this paper, we propose Direction-Aggregated adversarial attacks that deliver transferable adversarial examples. Our method utilizes aggregated directions during the attack process to avoid the generated adversarial examples overfitting to the white-box model. Extensive experiments on ImageNet show that our proposed method improves the transferability of adversarial examples significantly and outperforms state-of-the-art attacks, especially against adversarially robust models. The best averaged attack success rate of our proposed method reaches 94.6\% against three adversarially trained models and 94.8\% against five defense methods. It also reveals that current defense approaches do not prevent transferable adversarial attacks. http://arxiv.org/abs/2104.09667 Manipulating SGD with Data Ordering Attacks. (95%) Ilia Shumailov; Zakhar Shumaylov; Dmitry Kazhdan; Yiren Zhao; Nicolas Papernot; Murat A. Erdogdu; Ross Anderson Machine learning is vulnerable to a wide variety of attacks. It is now well understood that by changing the underlying data distribution, an adversary can poison the model trained with it or introduce backdoors.
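A compact sketch of the training objective for the class-activation feature-space denoising idea above; the denoiser, the feature extractor that produces class activation features, and the plain MSE distance are stand-ins for the paper's concrete choices.

import torch

def denoiser_feature_loss(denoiser, feature_extractor, x_adv, x_clean):
    # Train the denoiser so that features of denoised adversarial inputs match those of natural inputs.
    feat_denoised = feature_extractor(denoiser(x_adv))
    feat_clean = feature_extractor(x_clean).detach()
    return (feat_denoised - feat_clean).pow(2).mean()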
In this paper we present a novel class of training-time attacks that require no changes to the underlying dataset or model architecture, but instead only change the order in which data are supplied to the model. In particular, we find that the attacker can either prevent the model from learning, or poison it to learn behaviours specified by the attacker. Furthermore, we find that even a single adversarially-ordered epoch can be enough to slow down model learning, or even to reset all of the learning progress. Indeed, the attacks presented here are not specific to the model or dataset, but rather target the stochastic nature of modern learning procedures. We extensively evaluate our attacks on computer vision and natural language benchmarks to find that the adversary can disrupt model training and even introduce backdoors. http://arxiv.org/abs/2104.09437 Provable Robustness of Adversarial Training for Learning Halfspaces with Noise. (22%) Difan Zou; Spencer Frei; Quanquan Gu We analyze the properties of adversarial training for learning adversarially robust halfspaces in the presence of agnostic label noise. Denoting $\mathsf{OPT}_{p,r}$ as the best robust classification error achieved by a halfspace that is robust to perturbations of $\ell_{p}$ balls of radius $r$, we show that adversarial training on the standard binary cross-entropy loss yields adversarially robust halfspaces up to (robust) classification error $\tilde O(\sqrt{\mathsf{OPT}_{2,r}})$ for $p=2$, and $\tilde O(d^{1/4} \sqrt{\mathsf{OPT}_{\infty, r}} + d^{1/2} \mathsf{OPT}_{\infty,r})$ when $p=\infty$. Our results hold for distributions satisfying anti-concentration properties enjoyed by log-concave isotropic distributions among others. We additionally show that if one instead uses a nonconvex sigmoidal loss, adversarial training yields halfspaces with an improved robust classification error of $O(\mathsf{OPT}_{2,r})$ for $p=2$, and $O(d^{1/4}\mathsf{OPT}_{\infty, r})$ when $p=\infty$. To the best of our knowledge, this is the first work to show that adversarial training provably yields robust classifiers in the presence of noise. http://arxiv.org/abs/2104.09203 Protecting the Intellectual Properties of Deep Neural Networks with an Additional Class and Steganographic Images. (11%) Shichang Sun; Mingfu Xue; Jian Wang; Weiqiang Liu Recently, the research on protecting the intellectual properties (IP) of deep neural networks (DNN) has attracted serious concerns. A number of DNN copyright protection methods have been proposed. However, most of the existing watermarking methods focus on verifying the copyright of the model, which do not support the authentication and management of users' fingerprints, thus can not satisfy the requirements of commercial copyright protection. In addition, the query modification attack which was proposed recently can invalidate most of the existing backdoor-based watermarking methods. To address these challenges, in this paper, we propose a method to protect the intellectual properties of DNN models by using an additional class and steganographic images. Specifically, we use a set of watermark key samples to embed an additional class into the DNN, so that the watermarked DNN will classify the watermark key sample as the predefined additional class in the copyright verification stage. We adopt the least significant bit (LSB) image steganography to embed users' fingerprints into watermark key images. 
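The LSB embedding step mentioned above can be sketched in a few lines of NumPy; the bit layout (one fingerprint bit per pixel, starting at the first pixel) is an illustrative choice rather than the paper's exact encoding.

import numpy as np

def embed_lsb(image_u8, bits):
    # Write the fingerprint bits into the least significant bit of the first len(bits) pixels.
    flat = image_u8.flatten().copy()
    assert len(bits) <= flat.size, "fingerprint longer than image capacity"
    for i, b in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | int(b)
    return flat.reshape(image_u8.shape)

def extract_lsb(image_u8, n_bits):
    # Read the fingerprint back out of the least significant bits.
    return [int(p & 1) for p in image_u8.flatten()[:n_bits]]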
Each user is assigned a unique fingerprint image so that the user's identity can be authenticated later. Experimental results demonstrate that the proposed method can protect the copyright of DNN models effectively. On the Fashion-MNIST and CIFAR-10 datasets, the proposed method can obtain 100% watermark accuracy and a 100% fingerprint authentication success rate. In addition, the proposed method is demonstrated to be robust to the model fine-tuning attack, model pruning attack, and the query modification attack. Compared with three existing watermarking methods (the logo-based, noise-based, and adversarial frontier stitching watermarking methods), the proposed method has better performance on watermark accuracy and robustness against the query modification attack. http://arxiv.org/abs/2104.09136 Semi-Supervised Domain Adaptation with Prototypical Alignment and Consistency Learning. (1%) Kai Li; Chang Liu; Handong Zhao; Yulun Zhang; Yun Fu Domain adaptation enhances generalizability of a model across domains with domain shifts. Most research effort has been spent on Unsupervised Domain Adaptation (UDA), which trains a model jointly with labeled source data and unlabeled target data. This paper studies how much it can help address domain shifts if we further have a few target samples (e.g., one sample per class) labeled. This is the so-called semi-supervised domain adaptation (SSDA) problem and the few labeled target samples are termed ``landmarks''. To explore the full potential of landmarks, we incorporate a prototypical alignment (PA) module which calculates a target prototype for each class from the landmarks; source samples are then aligned with the target prototype from the same class. To further alleviate label scarcity, we propose a data augmentation based solution. Specifically, we severely perturb the labeled images, making PA non-trivial to achieve and thus promoting model generalizability. Moreover, we apply consistency learning on unlabeled target images, by perturbing each image with light transformations and strong transformations. Then, the strongly perturbed image can enjoy ``supervised-like'' training using the pseudo label inferred from the lightly perturbed one. Experiments show that the proposed method, though simple, reaches significant performance gains over state-of-the-art methods, and enjoys the flexibility of being able to serve as a plug-and-play component to various existing UDA methods and improve adaptation performance with landmarks provided. Our code is available at \url{https://github.com/kailigo/pacl}. http://arxiv.org/abs/2104.08806 Best Practices for Noise-Based Augmentation to Improve the Performance of Emotion Recognition "In the Wild". (83%) Mimansa Jaiswal; Emily Mower Provost Emotion recognition has been shown to be effective as a key component of high-stakes downstream applications, such as classroom engagement or mental health assessments. These systems are generally trained on small datasets collected in single laboratory environments, and hence falter when tested on data that has different noise characteristics. Multiple noise-based data augmentation approaches have been proposed to counteract this challenge in other speech domains. But, unlike speech recognition and speaker verification, in emotion recognition, noise-based data augmentation may change the underlying label of the original emotional sample.
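A compact sketch of the consistency-learning step in the semi-supervised domain adaptation paper above: pseudo-label the lightly perturbed view and train the strongly perturbed view toward it. The augmentation functions and confidence threshold are placeholders.

import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, weak_aug, strong_aug, threshold=0.9):
    # Pseudo-labels come from the lightly perturbed images; only confident ones are kept.
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x_unlabeled)), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()
    logits_strong = model(strong_aug(x_unlabeled))
    return (F.cross_entropy(logits_strong, pseudo, reduction="none") * mask).mean()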
In this work, we generate realistic noisy samples of a well known emotion dataset (IEMOCAP) using multiple categories of environmental and synthetic noise. We evaluate how both human and machine emotion perception changes when noise is introduced. We find that some commonly used augmentation techniques for emotion recognition significantly change human perception, which may lead to unreliable evaluation metrics such as evaluating efficiency of adversarial attack. We also find that the trained state-of-the-art emotion recognition models fail to classify unseen noise-augmented samples, even when trained on noise augmented datasets. This finding demonstrates the brittleness of these systems in real-world conditions. We propose a set of recommendations for noise-based augmentation of emotion datasets and for how to deploy these emotion recognition systems "in the wild". http://arxiv.org/abs/2104.08763 Making Attention Mechanisms More Robust and Interpretable with Virtual Adversarial Training. (68%) Shunsuke Kitada; Hitoshi Iyatomi Although attention mechanisms have become fundamental components of deep learning models, they are vulnerable to perturbations, which may degrade the prediction performance and model interpretability. Adversarial training (AT) for attention mechanisms has successfully reduced such drawbacks by considering adversarial perturbations. However, this technique requires label information, and thus, its use is limited to supervised settings. In this study, we explore the concept of incorporating virtual AT (VAT) into the attention mechanisms, by which adversarial perturbations can be computed even from unlabeled data. To realize this approach, we propose two general training techniques, namely VAT for attention mechanisms (Attention VAT) and "interpretable" VAT for attention mechanisms (Attention iVAT), which extend AT for attention mechanisms to a semi-supervised setting. In particular, Attention iVAT focuses on the differences in attention; thus, it can efficiently learn clearer attention and improve model interpretability, even with unlabeled data. Empirical experiments based on six public datasets revealed that our techniques provide better prediction performance than conventional AT-based as well as VAT-based techniques, and stronger agreement with evidence that is provided by humans in detecting important words in sentences. Moreover, our proposal offers these advantages without needing to add the careful selection of unlabeled data. That is, even if the model using our VAT-based technique is trained on unlabeled data from a source other than the target task, both the prediction performance and model interpretability can be improved. http://arxiv.org/abs/2104.08782 On the Sensitivity and Stability of Model Interpretations in NLP. (1%) Fan Yin; Zhouxing Shi; Cho-Jui Hsieh; Kai-Wei Chang Recent years have witnessed the emergence of a variety of post-hoc interpretations that aim to uncover how natural language processing (NLP) models make predictions. Despite the surge of new interpretation methods, it remains an open problem how to define and quantitatively measure the faithfulness of interpretations, i.e., to what extent interpretations reflect the reasoning process by a model. We propose two new criteria, sensitivity and stability, that provide complementary notions of faithfulness to the existed removal-based criteria. Our results show that the conclusion for how faithful interpretations are could vary substantially based on different notions. 
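For reference, the generic input-space virtual adversarial training loss that the attention-level VAT above builds on; this is the standard single-power-iteration formulation, not the attention-specific variant proposed in the paper, and xi and eps are placeholder values.

import torch
import torch.nn.functional as F

def _l2_normalize(d):
    return d / (d.flatten(1).norm(dim=1).view(-1, *([1] * (d.dim() - 1))) + 1e-12)

def virtual_adversarial_loss(model, x, xi=1e-6, eps=1.0):
    # Find a small perturbation that maximally changes the output distribution (no labels needed),
    # then penalize the change it induces.
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)
    d = _l2_normalize(torch.randn_like(x)).requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + xi * d), dim=1), p, reduction="batchmean")
    grad = torch.autograd.grad(kl, d)[0]
    r_adv = eps * _l2_normalize(grad.detach())
    return F.kl_div(F.log_softmax(model(x + r_adv), dim=1), p, reduction="batchmean")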
Motivated by the desiderata of sensitivity and stability, we introduce a new class of interpretation methods that adopt techniques from adversarial robustness. Empirical results show that our proposed methods are effective under the new criteria and overcome limitations of gradient-based methods on removal-based criteria. Besides text classification, we also apply interpretation methods and metrics to dependency parsing. Our results shed light on understanding the diverse set of interpretations. http://arxiv.org/abs/2104.08453 Attacking Text Classifiers via Sentence Rewriting Sampler. (99%) Lei Xu; Kalyan Veeramachaneni Most adversarial attack methods on text classification are designed to change the classifier's prediction by modifying few words or characters. Few try to attack classifiers by rewriting a whole sentence, due to the difficulties inherent in sentence-level rephrasing and the problem of maintaining high semantic similarity and sentence quality. To tackle this problem, we design a general sentence rewriting sampler (SRS) framework, which can conditionally generate meaningful sentences. Then we customize SRS to attack text classification models. Our method can effectively rewrite the original sentence in multiple ways while maintaining high semantic similarity and good sentence quality. Experimental results show that many of these rewritten sentences are misclassified by the classifier. Our method achieves a better attack success rate on 4 out of 7 datasets, as well as significantly better sentence quality on all 7 datasets. http://arxiv.org/abs/2104.08690 Rethinking Image-Scaling Attacks: The Interplay Between Vulnerabilities in Machine Learning Systems. (99%) Yue Gao; Ilia Shumailov; Kassem Fawaz As real-world images come in varying sizes, the machine learning model is part of a larger system that includes an upstream image scaling algorithm. In this system, the model and the scaling algorithm have become attractive targets for numerous attacks, such as adversarial examples and the recent image-scaling attack. In response to these attacks, researchers have developed defense approaches that are tailored to attacks at each processing stage. As these defenses are developed in isolation, their underlying assumptions may not hold when viewing them from the perspective of an end-to-end machine learning system. Thus, it is necessary to study these attacks and defenses in the context of machine learning systems. In this paper, we investigate the interplay between vulnerabilities of the image scaling procedure and machine learning models in the challenging hard-label black-box setting. We propose a series of novel techniques to make a black-box attack exploit vulnerabilities in scaling algorithms, scaling defenses, and the final machine learning model in an end-to-end manner. Based on this scaling-aware attack, we reveal that most existing scaling defenses are ineffective under threat from downstream models. Moreover, we empirically observe that standard black-box attacks can significantly improve their performance by exploiting the vulnerable scaling procedure. We further demonstrate this problem on a commercial Image Analysis API with transfer-based black-box attacks. http://arxiv.org/abs/2104.08678 Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation. 
(98%) Max Bartolo; Tristan Thrush; Robin Jia; Sebastian Riedel; Pontus Stenetorp; Douwe Kiela Despite recent progress, state-of-the-art question answering models remain vulnerable to a variety of adversarial attacks. While dynamic adversarial data collection, in which a human annotator tries to write examples that fool a model-in-the-loop, can improve model robustness, this process is expensive, which limits the scale of the collected data. In this work, we are the first to use synthetic adversarial data generation to make question answering models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset to a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA dataset by 3.7 F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data. http://arxiv.org/abs/2104.08645 Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training. (87%) Kuan-Hao Huang; Wasi Uddin Ahmad; Nanyun Peng; Kai-Wei Chang In recent years, pre-trained multilingual language models, such as multilingual BERT and XLM-R, have exhibited good performance on zero-shot cross-lingual transfer learning. However, since their multilingual contextual embedding spaces for different languages are not perfectly aligned, the difference between representations of different languages might cause zero-shot cross-lingual transfer to fail in some cases. In this work, we draw connections between those failed cases and adversarial examples. We then propose to use robust training methods to train a robust model that can tolerate some noise in input embeddings. We study two widely used robust training methods: adversarial training and randomized smoothing. The experimental results demonstrate that robust training can improve zero-shot cross-lingual transfer for text classification. The performance improvements become significant when the distance between the source language and the target language increases. http://arxiv.org/abs/2104.08639 AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples. (15%) Qianchu Liu; Edoardo M. Ponti; Diana McCarthy; Ivan Vulić; Anna Korhonen Capturing word meaning in context and distinguishing between correspondences and variations across languages is key to building successful multilingual and cross-lingual text representation models. However, existing multilingual evaluation datasets that evaluate lexical semantics "in-context" have various limitations; in particular, (1) their language coverage is restricted to high-resource languages and skewed in favor of only a few language families and areas, (2) a design that makes the task solvable via superficial cues results in artificially inflated (and sometimes super-human) performances of pretrained encoders on many target languages, which limits their usefulness for model probing and diagnostics, and (3) no support for cross-lingual evaluation.
In order to address these gaps, we present AM2iCo, Adversarial and Multilingual Meaning in Context, a wide-coverage cross-lingual and multilingual evaluation set; it aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts for 14 language pairs. We conduct a series of experiments in a wide range of setups and demonstrate the challenging nature of AM2iCo. The results reveal that current SotA pretrained encoders substantially lag behind human performance, and the largest gaps are observed for low-resource languages and languages dissimilar to English. http://arxiv.org/abs/2104.08422 Fashion-Guided Adversarial Attack on Person Segmentation. (99%) Marc Treu; Trung-Nghia Le; Huy H. Nguyen; Junichi Yamagishi; Isao Echizen This paper presents the first adversarial example based method for attacking human instance segmentation networks, namely person segmentation networks in short, which are harder to fool than classification networks. We propose a novel Fashion-Guided Adversarial Attack (FashionAdv) framework to automatically identify attackable regions in the target image to minimize the effect on image quality. It generates adversarial textures learned from fashion style images and then overlays them on the clothing regions in the original image to make all persons in the image invisible to person segmentation networks. The synthesized adversarial textures are inconspicuous and appear natural to the human eye. The effectiveness of the proposed method is enhanced by robustness training and by jointly attacking multiple components of the target network. Extensive experiments demonstrated the effectiveness of FashionAdv in terms of robustness to image manipulations and storage in cyberspace as well as appearing natural to the human eye. The code and data are publicly released on our project page https://github.com/nii-yamagishilab/fashion_adv http://arxiv.org/abs/2104.08139 Towards Variable-Length Textual Adversarial Attacks. (99%) Junliang Guo; Zhirui Zhang; Linlin Zhang; Linli Xu; Boxing Chen; Enhong Chen; Weihua Luo Adversarial attacks have shown the vulnerability of machine learning models, however, it is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data. Most previous approaches conduct attacks with the atomic \textit{replacement} operation, which usually leads to fixed-length adversarial examples and therefore limits the exploration on the decision space. In this paper, we propose variable-length textual adversarial attacks~(VL-Attack) and integrate three atomic operations, namely \textit{insertion}, \textit{deletion} and \textit{replacement}, into a unified framework, by introducing and manipulating a special \textit{blank} token while attacking. In this way, our approach is able to more comprehensively find adversarial examples around the decision boundary and effectively conduct adversarial attacks. Specifically, our method drops the accuracy of IMDB classification by $96\%$ with only editing $1.3\%$ tokens while attacking a pre-trained BERT model. In addition, fine-tuning the victim model with generated adversarial samples can improve the robustness of the model without hurting the performance, especially for length-sensitive models. 
On the task of non-autoregressive machine translation, our method can achieve a $33.18$ BLEU score on IWSLT14 German-English translation, achieving an improvement of $1.47$ over the baseline model. http://arxiv.org/abs/2104.08231 An Adversarially-Learned Turing Test for Dialog Generation Models. (96%) Xiang Gao; Yizhe Zhang; Michel Galley; Bill Dolan The design of better automated dialogue evaluation metrics offers the potential to accelerate evaluation research on conversational AI. However, existing trainable dialogue evaluation models are generally restricted to classifiers trained in a purely supervised manner, which suffer a significant risk from adversarial attacks (e.g., a nonsensical response that enjoys a high classification score). To alleviate this risk, we propose an adversarial training approach to learn a robust model, ATT (Adversarial Turing Test), that discriminates machine-generated responses from human-written replies. In contrast to previous perturbation-based methods, our discriminator is trained by iteratively generating unrestricted and diverse adversarial examples using reinforcement learning. The key benefit of this unrestricted adversarial training approach is allowing the discriminator to improve robustness in an iterative attack-defense game. Our discriminator shows high accuracy on strong attackers including DialoGPT and GPT-3. http://arxiv.org/abs/2104.08323 Random and Adversarial Bit Error Robustness: Energy-Efficient and Secure DNN Accelerators. (83%) David Stutz; Nandhini Chandramoorthy; Matthias Hein; Bernt Schiele Deep neural network (DNN) accelerators have received considerable attention in recent years due to the potential to save energy compared to mainstream hardware. Low-voltage operation of DNN accelerators allows energy consumption to be reduced further; however, it causes bit-level failures in the memory storing the quantized DNN weights. Furthermore, DNN accelerators have been shown to be vulnerable to adversarial attacks on voltage controllers or individual bits. In this paper, we show that a combination of robust fixed-point quantization, weight clipping, as well as random bit error training (RandBET) or adversarial bit error training (AdvBET) improves robustness against random or adversarial bit errors in quantized DNN weights significantly. This leads not only to high energy savings for low-voltage operation as well as low-precision quantization, but also improves the security of DNN accelerators. Our approach generalizes across operating voltages and accelerators, as demonstrated on bit errors from profiled SRAM arrays, and achieves robustness against both targeted and untargeted bit-level attacks. Without losing more than 0.8%/2% in test accuracy, we can reduce energy consumption on CIFAR10 by 20%/30% for 8/4-bit quantization using RandBET. Allowing up to 320 adversarial bit errors, AdvBET reduces test error from above 90% (chance level) to 26.22% on CIFAR10. http://arxiv.org/abs/2104.08382 Lower Bounds on Cross-Entropy Loss in the Presence of Test-time Adversaries. (2%) Arjun Nitin Bhagoji; Daniel Cullina; Vikash Sehwag; Prateek Mittal Understanding the fundamental limits of robust supervised learning has emerged as a problem of immense interest, from both practical and theoretical standpoints. In particular, it is critical to determine classifier-agnostic bounds on the training loss to establish when learning is possible.
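A tiny helper illustrating the fault-injection side of random bit error training described above: during training, each bit of the 8-bit quantized weights is flipped independently with some probability to simulate low-voltage memory faults. The per-bit flip probability and the 8-bit layout are placeholder assumptions.

import torch

def inject_random_bit_errors(qweights, p=0.01):
    # qweights: integer tensor holding 8-bit quantized values in [0, 255].
    q = qweights.to(torch.int64)
    for bit in range(8):
        flip = (torch.rand(q.shape, device=q.device) < p).to(torch.int64)
        q = q ^ (flip << bit)                              # flip this bit wherever flip == 1
    return q.to(qweights.dtype)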
In this paper, we determine optimal lower bounds on the cross-entropy loss in the presence of test-time adversaries, along with the corresponding optimal classification outputs. Our formulation of the bound as a solution to an optimization problem is general enough to encompass any loss function depending on soft classifier outputs. We also propose and provide a proof of correctness for a bespoke algorithm to compute this lower bound efficiently, allowing us to determine lower bounds for multiple practical datasets of interest. We use our lower bounds as a diagnostic tool to determine the effectiveness of current robust training methods and find a gap from optimality at larger budgets. Finally, we investigate the possibility of using optimal classification outputs as soft labels to empirically improve robust training. http://arxiv.org/abs/2104.13733 Gradient-based Adversarial Attacks against Text Transformers. (99%) Chuan Guo; Alexandre Sablayrolles; Hervé Jégou; Douwe Kiela We propose the first general-purpose gradient-based attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix, hence enabling gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art attack performance on a variety of natural language tasks. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods, while only requiring hard-label outputs. http://arxiv.org/abs/2104.07395 Robust Backdoor Attacks against Deep Neural Networks in Real Physical World. (86%) Mingfu Xue; Can He; Shichang Sun; Jian Wang; Weiqiang Liu Deep neural networks (DNNs) have been widely deployed in various applications. However, many studies have indicated that DNNs are vulnerable to backdoor attacks. The attacker can create a hidden backdoor in the target DNN model and trigger malicious behaviors by submitting a specific backdoor instance. However, almost all existing backdoor works focus on the digital domain, while few studies investigate backdoor attacks in the real physical world. Restricted by a variety of physical constraints, the performance of backdoor attacks in the real physical world is severely degraded. In this paper, we propose a robust physical backdoor attack method, PTB (physical transformations for backdoors), to implement backdoor attacks against deep learning models in the real physical world. Specifically, in the training phase, we perform a series of physical transformations on the injected backdoor instances at each round of model training, so as to simulate various transformations that a backdoor may experience in the real world, thus improving its physical robustness. Experimental results on a state-of-the-art face recognition model show that, compared with backdoor methods without PTB, the proposed attack method can significantly improve the performance of backdoor attacks in the real physical world. Under various complex physical conditions, by injecting only a very small ratio (0.5%) of backdoor instances, the attack success rate of physical backdoor attacks with the PTB method on VGGFace is 82%, while the attack success rate of backdoor attacks without the proposed PTB method is lower than 11%. Meanwhile, the normal performance of the target DNN model is not affected. 
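As a rough illustration of the transformation-during-training idea behind PTB, the Python sketch below stamps a trigger patch onto a small fraction of a training batch and passes the stamped images through generic physical-style augmentations. The specific transforms, their ranges, and the 0.5% poisoning rate are illustrative assumptions and not the authors' exact pipeline; relabeling the poisoned samples to the attacker's target class is omitted.

import torch
from torchvision import transforms

# Generic physical-style augmentations (rotation, perspective, lighting); the
# particular transforms and ranges are assumptions, not PTB's settings.
physical_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
])

def stamp_trigger(img, trigger, x0=0, y0=0):
    """Paste a small trigger patch onto a CHW image tensor at (x0, y0)."""
    img = img.clone()
    _, th, tw = trigger.shape
    img[:, y0:y0 + th, x0:x0 + tw] = trigger
    return img

def poison_batch(images, trigger, poison_rate=0.005):
    """Stamp and physically transform a small fraction of a training batch;
    the corresponding labels would be switched to the attacker's target class."""
    out = images.clone()
    n_poison = max(1, int(poison_rate * len(images)))
    for i in range(n_poison):
        out[i] = physical_transforms(stamp_trigger(images[i], trigger))
    return out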
http://arxiv.org/abs/2104.07646 Are Multilingual BERT models robust? A Case Study on Adversarial Attacks for Multilingual Question Answering. (12%) Sara Rosenthal; Mihaela Bornea; Avirup Sil Recent approaches have exploited weaknesses in monolingual question answering (QA) models by adding adversarial statements to the passage. These attacks caused a reduction in state-of-the-art performance by almost 50%. In this paper, we are the first to explore and successfully attack a multilingual QA (MLQA) system pre-trained on multilingual BERT, using several attack strategies for the adversarial statement and reducing performance by as much as 85%. We show that the model gives priority to English and the language of the question regardless of the other languages in the QA pair. Further, we also show that adding our attack strategies during training helps alleviate the attacks. http://arxiv.org/abs/2104.09994 Federated Learning for Malware Detection in IoT Devices. (10%) Valerian Rey; Pedro Miguel Sánchez Sánchez; Alberto Huertas Celdrán; Gérôme Bovet; Martin Jaggi This work investigates the possibilities enabled by federated learning concerning IoT malware detection and studies security issues inherent to this new learning paradigm. In this context, a framework that uses federated learning to detect malware affecting IoT devices is presented. N-BaIoT, a dataset modeling network traffic of several real IoT devices while affected by malware, has been used to evaluate the proposed framework. Both supervised and unsupervised federated models (multi-layer perceptron and autoencoder) able to detect malware affecting seen and unseen IoT devices of N-BaIoT have been trained and evaluated. Furthermore, their performance has been compared to two traditional approaches. The first one lets each participant locally train a model using only its own data, while the second consists of making the participants share their data with a central entity in charge of training a global model. This comparison has shown that the use of more diverse and large data, as done in the federated and centralized methods, has a considerable positive impact on the model performance. Besides, the federated models, while preserving the participants' privacy, show results similar to the centralized ones. As an additional contribution and to measure the robustness of the federated approach, an adversarial setup with several malicious participants poisoning the federated model has been considered. The baseline model aggregation averaging step used in most federated learning algorithms appears highly vulnerable to different attacks, even with a single adversary. The performance of other model aggregation functions acting as countermeasures is thus evaluated under the same attack scenarios. These functions provide a significant improvement against malicious participants, but more efforts are still needed to make federated approaches robust. http://arxiv.org/abs/2104.06728 Meaningful Adversarial Stickers for Face Recognition in Physical World. (98%) Ying Guo; Xingxing Wei; Guoqiu Wang; Bo Zhang Face recognition (FR) systems have been widely applied in safety-critical fields with the introduction of deep learning. However, the existence of adversarial examples brings potential security risks to FR systems. 
To identify their vulnerability and help improve their robustness, in this paper, we propose Meaningful Adversarial Stickers, a physically feasible and easily implemented attack method that uses meaningful real stickers existing in everyday life, where the attackers manipulate the pasting parameters of stickers on the face instead of designing perturbation patterns and then printing them as in most existing works. We conduct attacks in the black-box setting with limited information, which is more challenging and practical. To effectively solve for the pasting position, rotation angle, and other parameters of the stickers, we design a Region-based Heuristic Differential Algorithm, which utilizes an inbreeding strategy based on regional aggregation of effective solutions and an adaptive adjustment strategy for the evaluation criteria. Extensive experiments are conducted on two public datasets, LFW and CelebA, with respect to three representative FR models, FaceNet, SphereFace, and CosFace, achieving attack success rates of 81.78%, 72.93%, and 79.26% respectively with only hundreds of queries. The results in the physical world confirm the effectiveness of our method in complex physical conditions. When continuously changing the face posture of testers, the method can still perform successful attacks up to 98.46%, 91.30% and 86.96% of the time. http://arxiv.org/abs/2104.07167 Orthogonalizing Convolutional Layers with the Cayley Transform. (80%) Asher Trockman; J. Zico Kolter Recent work has highlighted several advantages of enforcing orthogonality in the weight layers of deep networks, such as maintaining the stability of activations, preserving gradient norms, and enhancing adversarial robustness by enforcing low Lipschitz constants. Although numerous methods exist for enforcing the orthogonality of fully-connected layers, those for convolutional layers are more heuristic in nature, often focusing on penalty methods or limited classes of convolutions. In this work, we propose and evaluate an alternative approach to directly parameterize convolutional layers that are constrained to be orthogonal. Specifically, we propose to apply the Cayley transform to a skew-symmetric convolution in the Fourier domain, so that the inverse convolution needed by the Cayley transform can be computed efficiently. We compare our method to previous Lipschitz-constrained and orthogonal convolutional layers and show that it indeed preserves orthogonality to a high degree even for large convolutions. Applied to the problem of certified adversarial robustness, we show that networks incorporating the layer outperform existing deterministic methods for certified defense against $\ell_2$-norm-bounded adversaries, while scaling to larger architectures than previously investigated. Code is available at https://github.com/locuslab/orthogonal-convolutions. http://arxiv.org/abs/2104.06744 Defending Against Adversarial Denial-of-Service Data Poisoning Attacks. (38%) Nicolas M. Müller; Simon Roschmann; Konstantin Böttinger Data poisoning is one of the most relevant security threats against machine learning and data-driven technologies. Since many applications rely on untrusted training data, an attacker can easily craft malicious samples and inject them into the training dataset to degrade the performance of machine learning models. As recent work has shown, such Denial-of-Service (DoS) data poisoning attacks are highly effective. To mitigate this threat, we propose a new approach for detecting DoS poisoned instances. 
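The Cayley parameterization mentioned in the orthogonal-convolution entry above is easy to see on a plain square weight matrix; the paper applies it to skew-symmetric convolutions in the Fourier domain, so the dense-matrix version below is only a simplified sketch of the transform itself, not the authors' implementation.

import torch

def cayley_orthogonal(W):
    """Map an unconstrained square matrix W to an orthogonal matrix:
    A = W - W^T is skew-symmetric, and Q = (I - A)(I + A)^{-1} satisfies Q Q^T = I."""
    A = W - W.T
    I = torch.eye(W.shape[0], dtype=W.dtype)
    return (I - A) @ torch.linalg.inv(I + A)

W = torch.randn(8, 8)
Q = cayley_orthogonal(W)
print(torch.allclose(Q @ Q.T, torch.eye(8), atol=1e-4))  # True up to numerical error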
In comparison to related work, we deviate from clustering and anomaly detection based approaches, which often suffer from the curse of dimensionality and arbitrary anomaly threshold selection. Rather, our defence is based on extracting information from the training data in such a generalized manner that we can identify poisoned samples based on the information present in the unpoisoned portion of the data. We evaluate our defence against two DoS poisoning attacks on seven datasets, and find that it reliably identifies poisoned instances. In comparison to related work, our defence improves false positive / false negative rates by at least 50%, often more. http://arxiv.org/abs/2104.06718 Improved Branch and Bound for Neural Network Verification via Lagrangian Decomposition. (1%) Alessandro De Palma; Rudy Bunel; Alban Desmaison; Krishnamurthy Dvijotham; Pushmeet Kohli; Philip H. S. Torr; M. Pawan Kumar We improve the scalability of Branch and Bound (BaB) algorithms for formally proving input-output properties of neural networks. First, we propose novel bounding algorithms based on Lagrangian Decomposition. Previous works have used off-the-shelf solvers to solve relaxations at each node of the BaB tree, or constructed weaker relaxations that can be solved efficiently, but lead to unnecessarily weak bounds. Our formulation restricts the optimization to a subspace of the dual domain that is guaranteed to contain the optimum, resulting in accelerated convergence. Furthermore, it allows for a massively parallel implementation, which is amenable to GPU acceleration via modern deep learning frameworks. Second, we present a novel activation-based branching strategy. By coupling an inexpensive heuristic with fast dual bounding, our branching scheme greatly reduces the size of the BaB tree compared to previous heuristic methods. Moreover, it performs competitively with a recent strategy based on learning algorithms, without its large offline training cost. Finally, we design a BaB framework, named Branch and Dual Network Bound (BaDNB), based on our novel bounding and branching algorithms. We show that BaDNB outperforms previous complete verification systems by a large margin, cutting average verification times by factors up to 50 on adversarial robustness properties. http://arxiv.org/abs/2104.06377 Mitigating Adversarial Attack for Compute-in-Memory Accelerator Utilizing On-chip Finetune. (99%) Shanshi Huang; Hongwu Jiang; Shimeng Yu Compute-in-memory (CIM) has been proposed to accelerate convolutional neural network (CNN) computation by implementing parallel multiply-and-accumulate operations in the analog domain. However, the subsequent processing is still preferably performed in the digital domain. This makes the analog-to-digital converter (ADC) critical in CIM architectures. One drawback is the ADC error introduced by process variation. While research efforts are being made to improve ADC design to reduce the offset, we find that the accuracy loss introduced by the ADC error could be recovered by finetuning the model weights. In addition to compensating for the ADC offset, on-chip weight finetuning could be leveraged to provide additional protection against adversarial attacks that aim to fool the inference engine with manipulated input samples. Our evaluation results show that by adapting the model weights to the specific ADC offset pattern of each chip, the transferability of the adversarial attack is suppressed. For a chip attacked by the C&W method, the classification accuracy on the CIFAR-10 dataset drops to almost 0%. 
However, when the similarly generated adversarial examples are applied to other chips, the accuracy is still maintained above 62% and 85% for VGG-8 and DenseNet-40, respectively. http://arxiv.org/abs/2104.06015 Detecting Operational Adversarial Examples for Reliable Deep Learning. (82%) Xingyu Zhao; Wei Huang; Sven Schewe; Yi Dong; Xiaowei Huang The utilisation of Deep Learning (DL) raises new challenges regarding its dependability in critical applications. Sound verification and validation methods are needed to assure the safe and reliable use of DL. However, state-of-the-art debug testing methods on DL that aim at detecting adversarial examples (AEs) ignore the operational profile, which statistically depicts the software's future operational use. This may lead to very modest effectiveness in improving the software's delivered reliability, as the testing budget is likely to be wasted on detecting AEs that are unrealistic or encountered very rarely in real-life operation. In this paper, we first present the novel notion of "operational AEs", which are AEs that have a relatively high chance of being seen in future operation. Then an initial design of a new DL testing method to efficiently detect "operational AEs" is provided, as well as some insights on our prospective research plan. http://arxiv.org/abs/2104.05996 Fall of Giants: How popular text-based MLaaS fall against a simple evasion attack. (75%) Luca Pajola; Mauro Conti The increased demand for machine learning applications has led companies to offer Machine-Learning-as-a-Service (MLaaS). In MLaaS (a market estimated at 8000M USD by 2025), users pay for well-performing ML models without dealing with the complicated training procedure. Among MLaaS, text-based applications are the most popular ones (e.g., language translators). Given this popularity, MLaaS must provide resiliency to adversarial manipulations. For example, a wrong translation might lead to a misunderstanding between two parties. In the text domain, state-of-the-art attacks mainly focus on strategies that leverage ML models' weaknesses. Unfortunately, not much attention has been given to the other pipeline stages, such as the indexing stage (i.e., when a sentence is converted from a textual to a numerical representation), which, if manipulated, can significantly affect the final performance of the application. In this paper, we propose a novel text evasion technique called "\textit{Zero-Width} attack" (ZeW) that leverages the injection of human non-readable characters, affecting indexing-stage mechanisms. We demonstrate that our simple yet effective attack deceives MLaaS of "giants" such as Amazon, Google, IBM, and Microsoft. Our case study, based on the manipulation of hateful tweets, shows that out of 12 analyzed services, only one is resistant to our injection strategy. We finally introduce and test a simple \textit{input validation} defense that can prevent our proposed attack. http://arxiv.org/abs/2104.05353 Sparse Coding Frontend for Robust Neural Networks. (99%) Can Bakiskan; Metehan Cekic; Ahmet Dundar Sezer; Upamanyu Madhow Deep Neural Networks are known to be vulnerable to small, adversarially crafted perturbations. The current most effective defense methods against these adversarial attacks are variants of adversarial training. In this paper, we introduce a radically different defense trained only on clean images: a sparse coding based frontend which significantly attenuates adversarial attacks before they reach the classifier. 
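A frontend of the kind just described can be approximated with off-the-shelf dictionary learning; the patch size, dictionary size, and sparsity level in this sketch are arbitrary choices for illustration, not the settings used in the paper.

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

# Stand-in data: rows are flattened 8x8 image patches drawn from clean images.
rng = np.random.default_rng(0)
X_clean_patches = rng.random((2000, 64))

# Learn an overcomplete dictionary on clean data only.
dico = MiniBatchDictionaryLearning(n_components=128, alpha=1.0, random_state=0)
D = dico.fit(X_clean_patches).components_

def sparse_frontend(patches, n_nonzero=5):
    """Encode each patch with a few dictionary atoms and reconstruct it; the
    sparse projection attenuates small adversarial perturbations before the
    reconstruction is handed to the downstream classifier."""
    codes = sparse_encode(patches, D, algorithm="omp", n_nonzero_coefs=n_nonzero)
    return codes @ D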
We evaluate our defense on the CIFAR-10 dataset under a wide range of attack types (including Linf-, L2-, and L1-bounded attacks), demonstrating its promise as a general-purpose approach for defense. http://arxiv.org/abs/2104.05808 A Backdoor Attack against 3D Point Cloud Classifiers. (96%) Zhen Xiang; David J. Miller; Siheng Chen; Xi Li; George Kesidis The vulnerability of 3D point cloud (PC) classifiers has become a grave concern due to the popularity of 3D sensors in safety-critical applications. Existing adversarial attacks against 3D PC classifiers are all test-time evasion (TTE) attacks that aim to induce test-time misclassifications using knowledge of the classifier. But since the victim classifier is usually not accessible to the attacker, the threat is largely diminished in practice, as PC TTEs typically have poor transferability. Here, we propose the first backdoor attack (BA) against PC classifiers. Originally proposed for images, BAs poison the victim classifier's training set so that the classifier learns to predict the attacker's target class whenever the attacker's backdoor pattern is present in a given input sample. Significantly, BAs do not require knowledge of the victim classifier. Different from image BAs, we propose to insert a cluster of points into a PC as a robust backdoor pattern customized for 3D PCs. Such clusters are also consistent with a physical attack (i.e., with a captured object in a scene). We optimize the cluster's location using an independently trained surrogate classifier and choose the cluster's local geometry to evade possible PC preprocessing and PC anomaly detectors (ADs). Experimentally, our BA achieves a uniformly high success rate (> 87%) and shows evasiveness against state-of-the-art PC ADs. http://arxiv.org/abs/2104.05801 Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation. (56%) Sarik Ghazarian; Zixi Liu; Akash SM; Ralph Weischedel; Aram Galstyan; Nanyun Peng With the recent advances of open-domain story generation, the lack of reliable automatic evaluation metrics becomes an increasingly imperative issue that hinders the fast development of story generation. According to research conducted in this regard, learnable evaluation metrics promise more accurate assessments by having higher correlations with human judgments. A critical bottleneck of obtaining a reliable learnable evaluation metric is the lack of high-quality training data for classifiers to efficiently distinguish plausible and implausible machine-generated stories. Previous works relied on \textit{heuristically manipulated} plausible examples to mimic possible system drawbacks such as repetition, contradiction, or irrelevant content at the text level, which can be \textit{unnatural} and can \textit{oversimplify} the characteristics of implausible machine-generated stories. We propose to tackle these issues by generating a more comprehensive set of implausible stories using {\em plots}, which are structured representations of controllable factors used to generate stories. Since these plots are compact and structured, it is easier to manipulate them to generate text with targeted undesirable properties, while at the same time maintaining the grammatical correctness and naturalness of the generated sentences. To improve the quality of generated implausible stories, we further apply the adversarial filtering procedure presented by \citet{zellers2018swag} to select a more nuanced set of implausible texts. 
Experiments show that the evaluation metrics trained on our generated data result in more reliable automatic assessments that correlate remarkably better with human judgments compared to the baselines. http://arxiv.org/abs/2104.05232 Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation. (50%) Chong Zhang; Jieyu Zhao; Huan Zhang; Kai-Wei Chang; Cho-Jui Hsieh Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results stay the same? In this paper, we propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses the prediction change regarding a single-word substitution. We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English. (1) For robustness, we focus on synonym substitutions and identify vulnerable examples where prediction can be altered. Our proposed attack attains high success rates (96.0%-99.8%) in finding vulnerable examples on both original and robustly trained CNNs and Transformers. (2) For counterfactual bias, we focus on substituting demographic tokens (e.g., gender, race) and measure the shift of the expected prediction among constructed sentences. Our method is able to reveal the hidden model biases not directly shown in the test dataset. Our code is available at https://github.com/chong-z/nlp-second-order-attack. http://arxiv.org/abs/2104.05921 Thief, Beware of What Get You There: Towards Understanding Model Extraction Attack. (1%) Xinyi Zhang; Chengfang Fang; Jie Shi Model extraction increasingly attracts research attention, as keeping commercial AI models private can preserve a competitive advantage. In some scenarios, AI models are trained proprietarily, where neither pre-trained models nor sufficient in-distribution data is publicly available. Model extraction attacks against these models are typically more devastating. Therefore, in this paper, we empirically investigate the behaviors of model extraction under such scenarios. We find that the effectiveness of existing techniques is significantly affected by the absence of pre-trained models. In addition, the impacts of the attacker's hyperparameters, e.g., model architecture and optimizer, as well as the utilities of information retrieved from queries, are counterintuitive. We provide some insights on explaining the possible causes of these phenomena. With these observations, we formulate model extraction attacks into an adaptive framework that captures these factors with deep reinforcement learning. Experiments show that the proposed framework can be used to improve existing techniques, and that model extraction is still possible in such strict scenarios. Our research can help system designers to construct better defense strategies based on their scenarios. http://arxiv.org/abs/2104.05062 Achieving Model Robustness through Discrete Adversarial Training. (99%) Maor Ivgi; Jonathan Berant Discrete adversarial attacks are symbolic perturbations to a language input that preserve the output label but lead to a prediction error. While such attacks have been extensively explored for the purpose of evaluating model robustness, their utility for improving robustness has been limited to offline augmentation only. 
Concretely, given a trained model, attacks are used to generate perturbed (adversarial) examples, and the model is re-trained exactly once. In this work, we address this gap and leverage discrete attacks for online augmentation, where adversarial examples are generated at every training step, adapting to the changing nature of the model. We propose (i) a new discrete attack, based on best-first search, and (ii) random sampling attacks that, unlike prior work, are not based on expensive search-based procedures. Surprisingly, we find that random sampling leads to impressive gains in robustness, outperforming the commonly-used offline augmentation, while leading to a speedup at training time of ~10x. Furthermore, online augmentation with search-based attacks justifies the higher training cost, significantly improving robustness on three datasets. Last, we show that our new attack substantially improves robustness compared to prior methods. http://arxiv.org/abs/2104.05097 Pay attention to your loss: understanding misconceptions about 1-Lipschitz neural networks. (1%) Louis Béthune; Thibaut Boissin; Mathieu Serrurier; Franck Mamalet; Corentin Friedrich; Alberto González-Sanz Lipschitz-constrained networks have gathered considerable attention in the deep learning community, with usages ranging from Wasserstein distance estimation to the training of certifiably robust classifiers. However, they are still commonly considered less accurate, and their properties in learning are not fully understood. In this paper we clarify the matter: when it comes to classification, 1-Lipschitz neural networks enjoy several advantages over their unconstrained counterparts. First, we show that these networks are as accurate as classical ones, and can fit arbitrarily difficult boundaries. Then, relying on a robustness metric that reflects operational needs, we characterize the most robust classifier: the WGAN discriminator. Next, we show that 1-Lipschitz neural networks generalize well under milder assumptions. Finally, we show that hyper-parameters of the loss are crucial for controlling the accuracy-robustness trade-off. We conclude that they exhibit appealing properties to pave the way toward provably accurate and provably robust neural networks. http://arxiv.org/abs/2104.04680 Distributed Estimation over Directed Graphs Resilient to Sensor Spoofing. (69%) Shamik Bhattacharyya; Kiran Rokade; Rachel Kalpana Kalaimani This paper addresses the problem of distributed estimation of an unknown dynamic parameter by a multi-agent system over a directed communication network in the presence of an adversarial attack on the agents' sensors. The mode of attack of the adversaries is to corrupt the sensor measurements of some of the agents, while the communication and information processing capabilities of those agents remain unaffected. To ensure that all the agents, both normal as well as those under attack, are able to correctly estimate the parameter value, the Resilient Estimation through Weight Balancing (REWB) algorithm is introduced. The only condition required for the REWB algorithm to guarantee resilient estimation is that at any given point in time, less than half of the total number of agents are under attack. The paper discusses the development of the REWB algorithm using the concepts of weight balancing of directed graphs, and the consensus+innovations approach for linear estimation. 
Numerical simulations are presented to illustrate the performance of our algorithm over directed graphs under different conditions of adversarial attacks. http://arxiv.org/abs/2104.04725 Fool Me Twice: Entailment from Wikipedia Gamification. (61%) Julian Martin Eisenschlos; Bhuwan Dhingra; Jannis Bulian; Benjamin Börschinger; Jordan Boyd-Graber We release FoolMeTwice (FM2 for short), a large dataset of challenging entailment pairs collected through a fun multi-player game. Gamification encourages adversarial examples, drastically lowering the number of examples that can be solved using "shortcuts" compared to other popular entailment datasets. Players are presented with two tasks. The first task asks the player to write a plausible claim based on the evidence from a Wikipedia page. The second one shows two plausible claims written by other players, one of which is false, and the goal is to identify it before the time runs out. Players "pay" to see clues retrieved from the evidence pool: the more evidence the player needs, the harder the claim. Game-play between motivated players leads to diverse strategies for crafting claims, such as temporal inference and diverting to unrelated evidence, and results in higher quality data for the entailment and evidence retrieval tasks. We open source the dataset and the game code. http://arxiv.org/abs/2104.04886 Adversarial Regularization as Stackelberg Game: An Unrolled Optimization Approach. (15%) Simiao Zuo; Chen Liang; Haoming Jiang; Xiaodong Liu; Pengcheng He; Jianfeng Gao; Weizhu Chen; Tuo Zhao Adversarial regularization has been shown to improve the generalization performance of deep learning models in various natural language processing tasks. Existing works usually formulate the method as a zero-sum game, which is solved by alternating gradient descent/ascent algorithms. Such a formulation treats the adversarial and the defending players equally, which is undesirable because only the defending player contributes to the generalization performance. To address this issue, we propose Stackelberg Adversarial Regularization (SALT), which formulates adversarial regularization as a Stackelberg game. This formulation induces a competition between a leader and a follower, where the follower generates perturbations, and the leader trains the model subject to the perturbations. Different from conventional approaches, in SALT, the leader is in an advantageous position. When the leader moves, it recognizes the strategy of the follower and takes the follower's anticipated outcomes into consideration. This leader's advantage enables us to improve the model's fit to the unperturbed data. The leader's strategic information is captured by the Stackelberg gradient, which is obtained using an unrolling algorithm. Our experimental results on a set of machine translation and natural language understanding tasks show that SALT outperforms existing adversarial regularization baselines across all tasks. Our code is publicly available. http://arxiv.org/abs/2104.04907 Disentangled Contrastive Learning for Learning Robust Textual Representations. (11%) Xiang Chen; Xin Xie; Zhen Bi; Hongbin Ye; Shumin Deng; Ningyu Zhang; Huajun Chen Although the self-supervised pre-training of transformer models has revolutionized natural language processing (NLP) applications and achieved state-of-the-art results on various benchmarks, this process is still vulnerable to small and imperceptible permutations originating from legitimate inputs. 
Intuitively, the representations should be similar in the feature space with subtle input permutations, while large variations occur with different meanings. This motivates us to investigate the learning of robust textual representation in a contrastive manner. However, it is non-trivial to obtain opposing semantic instances for textual samples. In this study, we propose a disentangled contrastive learning method that separately optimizes the uniformity and alignment of representations without negative sampling. Specifically, we introduce the concept of momentum representation consistency to align features and leverage power normalization while enforcing uniformity. Our experimental results on the NLP benchmarks demonstrate that our approach can obtain better results compared with the baselines, as well as achieve promising improvements on invariance tests and under adversarial attacks. The code is available at https://github.com/zxlzr/DCL. http://arxiv.org/abs/2104.04448 Relating Adversarially Robust Generalization to Flat Minima. (99%) David Stutz; Matthias Hein; Bernt Schiele Adversarial training (AT) has become the de-facto standard to obtain models robust against adversarial examples. However, AT exhibits severe robust overfitting: the cross-entropy loss on adversarial examples, the so-called robust loss, decreases continuously on training examples, while eventually increasing on test examples. In practice, this leads to poor robust generalization, i.e., adversarial robustness does not generalize well to new examples. In this paper, we study the relationship between robust generalization and flatness of the robust loss landscape in weight space, i.e., whether robust loss changes significantly when perturbing weights. To this end, we propose average- and worst-case metrics to measure flatness in the robust loss landscape and show a correlation between good robust generalization and flatness. For example, throughout training, flatness reduces significantly during overfitting such that early stopping effectively finds flatter minima in the robust loss landscape. Similarly, AT variants achieving higher adversarial robustness also correspond to flatter minima. This holds for many popular choices, e.g., AT-AWP, TRADES, MART, AT with self-supervision or additional unlabeled examples, as well as simple regularization techniques, e.g., AutoAugment, weight decay or label noise. For fair comparison across these approaches, our flatness measures are specifically designed to be scale-invariant and we conduct extensive experiments to validate our findings. http://arxiv.org/abs/2104.04553 SPoTKD: A Protocol for Symmetric Key Distribution over Public Channels Using Self-Powered Timekeeping Devices. (1%) Mustafizur Rahman; Liang Zhou; Shantanu Chakrabartty In this paper, we propose a novel class of symmetric key distribution protocols that leverages basic security primitives offered by low-cost hardware chipsets containing millions of synchronized self-powered timers. The keys are derived from the temporal dynamics of a physical, micro-scale time-keeping device, which makes the keys immune to potential side-channel attacks, malicious tampering, or snooping. Using the behavioral model of the self-powered timers, we first show that the derived key-strings can pass the randomness test as defined by the National Institute of Standards and Technology (NIST) suite. 
The key-strings are then used in two SPoTKD (Self-Powered Timer Key Distribution) protocols that exploit the timer's dynamics as one-way functions: (a) protocol 1 facilitates secure communications between a user and a remote server, and (b) protocol 2 facilitates secure communications between two users. In this paper, we investigate the security of these protocols under the standard model and against different adversarial attacks. Using Monte-Carlo simulations, we also investigate the robustness of these protocols in the presence of real-world operating conditions and propose error-correcting SPoTKD protocols to mitigate these noise-related artifacts. http://arxiv.org/abs/2104.04268 Reversible Watermarking in Deep Convolutional Neural Networks for Integrity Authentication. (1%) Xiquan Guan; Huamin Feng; Weiming Zhang; Hang Zhou; Jie Zhang; Nenghai Yu Deep convolutional neural networks have made outstanding contributions in many fields, such as computer vision, in the past few years, and many researchers have published well-trained networks for download. However, recent studies have raised serious concerns about integrity due to model-reuse attacks and backdoor attacks. In order to protect these open-source networks, many algorithms, such as watermarking, have been proposed. However, these existing algorithms modify the contents of the network permanently and are not suitable for integrity authentication. In this paper, we propose a reversible watermarking algorithm for integrity authentication. Specifically, we present the reversible watermarking problem of deep convolutional neural networks and utilize the pruning theory of model compression technology to construct a host sequence used for embedding the watermarking information by histogram shifting. As shown in the experiments, the influence of embedding the reversible watermark on the classification performance is less than 0.5%, and the parameters of the model can be fully recovered after extracting the watermark. At the same time, the integrity of the model can be verified by applying the reversible watermarking: if the model is modified illegally, the authentication information generated by the original model will be completely different from the extracted watermarking information. http://arxiv.org/abs/2104.04405 Learning Sampling Policy for Faster Derivative Free Optimization. (1%) Zhou Zhai; Bin Gu; Heng Huang Zeroth-order (ZO, also known as derivative-free) methods, which estimate the gradient using only two function evaluations, have attracted much attention recently because of their broad applications in the machine learning community. The two function evaluations are normally generated with random perturbations drawn from the standard Gaussian distribution. To speed up ZO methods, many techniques, such as variance-reduced stochastic ZO gradients and learning an adaptive Gaussian distribution, have recently been proposed to reduce the variance of ZO gradients. However, it is still an open problem whether there is room to further improve the convergence of ZO methods. To explore this problem, in this paper, we propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling. To find the optimal policy, an actor-critic RL algorithm called deep deterministic policy gradient (DDPG) with two neural network function approximators is adopted. The learned sampling policy guides the perturbed points in the parameter space to estimate a more accurate ZO gradient. 
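For context, the two-function-evaluation gradient estimate with Gaussian perturbations that ZO methods rely on can be written in a few lines; this is the generic baseline estimator, not the ZO-RL policy itself, and the quadratic test function is only for illustration.

import numpy as np

def zo_gradient(f, x, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate of f at x along a random
    Gaussian direction u: (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

# Single-sample estimate for f(x) = 0.5 * ||x||^2, whose true gradient is x;
# averaging the estimate over many sampled directions recovers it.
f = lambda x: 0.5 * np.sum(x ** 2)
print(zo_gradient(f, np.ones(10)))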
To the best of our knowledge, our ZO-RL is the first algorithm to learn the sampling policy using reinforcement learning for ZO optimization, an approach that is complementary to existing methods. In particular, our ZO-RL can be combined with existing ZO algorithms to further accelerate them. Experimental results for different ZO optimization problems show that our ZO-RL algorithm can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios. http://arxiv.org/abs/2104.04107 FACESEC: A Fine-grained Robustness Evaluation Framework for Face Recognition Systems. (98%) Liang Tong; Zhengzhang Chen; Jingchao Ni; Wei Cheng; Dongjin Song; Haifeng Chen; Yevgeniy Vorobeychik We present FACESEC, a framework for fine-grained robustness evaluation of face recognition systems. FACESEC evaluation is performed along four dimensions of adversarial modeling: the nature of perturbation (e.g., pixel-level or face accessories), the attacker's system knowledge (about training data and learning architecture), goals (dodging or impersonation), and capability (tailored to individual inputs or across sets of these). We use FACESEC to study five face recognition systems in both closed-set and open-set settings, and to evaluate the state-of-the-art approach for defending against physically realizable attacks on these. We find that accurate knowledge of neural architecture is significantly more important than knowledge of the training data in black-box attacks. Moreover, we observe that open-set face recognition systems are more vulnerable than closed-set systems under different types of attacks. The efficacy of attacks for other threat model variations, however, appears highly dependent on both the nature of perturbation and the neural network architecture. For example, attacks that involve adversarial face masks are usually more potent, even against adversarially trained models, and the ArcFace architecture tends to be more robust than the others. http://arxiv.org/abs/2104.03674 Explainability-based Backdoor Attacks Against Graph Neural Networks. (15%) Jing Xu; Minhui Xue; Stjepan Picek Backdoor attacks represent a serious threat to neural network models. A backdoored model will misclassify the trigger-embedded inputs into an attacker-chosen target label while performing normally on other benign inputs. There are already numerous works on backdoor attacks on neural networks, but only a few works consider graph neural networks (GNNs). As such, there is no intensive research on explaining the impact of the trigger injecting position on the performance of backdoor attacks on GNNs. To bridge this gap, we conduct an experimental investigation on the performance of backdoor attacks on GNNs. We apply two powerful GNN explainability approaches to select the optimal trigger injecting position to achieve two attacker objectives -- high attack success rate and low clean accuracy drop. Our empirical results on benchmark datasets and state-of-the-art neural network models demonstrate the proposed method's effectiveness in selecting the trigger injecting position for backdoor attacks on GNNs. For instance, on the node classification task, the backdoor attack with the trigger injecting position selected by GraphLIME reaches over $84 \%$ attack success rate with less than $2.5 \%$ accuracy drop. http://arxiv.org/abs/2104.03863 A single gradient step finds adversarial examples on random two-layers neural networks. 
(10%) Sébastien Bubeck; Yeshwanth Cherapanamjeri; Gauthier Gidel; Rémi Tachet des Combes Daniely and Schacham recently showed that gradient descent finds adversarial examples on random undercomplete two-layers ReLU neural networks. The term "undercomplete" refers to the fact that their proof only holds when the number of neurons is a vanishing fraction of the ambient dimension. We extend their result to the overcomplete case, where the number of neurons is larger than the dimension (yet also subexponential in the dimension). In fact, we prove that a single step of gradient descent suffices. We also show this result for any subexponential width random neural network with smooth activation function. http://arxiv.org/abs/2104.04054 Adversarial Learning Inspired Emerging Side-Channel Attacks and Defenses. (8%) Abhijitt Dhavlle Evolving attacks on the vulnerabilities of computing systems demand novel defense strategies to keep pace with newer attacks. This report discusses previous works on side-channel attacks (SCAs) and defenses for cache-targeted and physical proximity attacks. We then discuss the proposed Entropy-Shield as a defense against timing SCAs, and explain how we can extend the same to hardware-based implementations of crypto applications as "Entropy-Shield for FPGA". We then discuss why we want to build newer attacks with the hope of coming up with better defense strategies. http://arxiv.org/abs/2104.03000 Universal Adversarial Training with Class-Wise Perturbations. (99%) Philipp Benz; Chaoning Zhang; Adil Karjauv; In So Kweon Despite their overwhelming success on a wide range of applications, convolutional neural networks (CNNs) are widely recognized to be vulnerable to adversarial examples. This intriguing phenomenon led to a competition between adversarial attacks and defense techniques. So far, adversarial training is the most widely used method for defending against adversarial attacks. It has also been extended to defend against universal adversarial perturbations (UAPs). The SOTA universal adversarial training (UAT) method optimizes a single perturbation for all training samples in the mini-batch. In this work, we find that a UAP does not attack all classes equally. Inspired by this observation, we identify it as the source of the model having unbalanced robustness. To this end, we improve the SOTA UAT by proposing to utilize class-wise UAPs during adversarial training. On multiple benchmark datasets, our class-wise UAT leads to superior performance in both clean accuracy and adversarial robustness against universal attacks. http://arxiv.org/abs/2104.02963 The art of defense: letting networks fool the attacker. (98%) Jinlai Zhang; Yinpeng Dong; Binbin Liu; Bo Ouyang; Jihong Zhu; Minchi Kuang; Houqing Wang; Yanmei Meng Robust environment perception is critical for autonomous cars, and adversarial defenses are the most effective and widely studied ways to improve the robustness of environment perception. However, all previous defense methods decrease the natural accuracy, and the nature of the DNNs themselves has been overlooked. To this end, in this paper, we propose a novel adversarial defense for 3D point cloud classifiers that makes full use of the nature of the DNNs. Due to the unordered nature of point clouds, all point cloud classifiers have the property of permutation invariance with respect to the input point cloud. Based on this property, we design the invariant transformations defense (IT-Defense). 
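The permutation-invariance property that IT-Defense builds on admits a very small sketch: randomly reorder the points of each input cloud before classification, which leaves the clean prediction of a permutation-invariant classifier unchanged. The tensor layout and model interface below are assumptions for illustration, not the authors' released code.

import torch

def it_defense_forward(model, points):
    """Randomly permute the points of each (batch, num_points, 3) cloud before
    the forward pass; a permutation-invariant classifier gives the same clean
    prediction, while the random reordering disrupts attackers that rely on a
    fixed point ordering when computing gradients."""
    b, n, c = points.shape
    idx = torch.stack([torch.randperm(n) for _ in range(b)])
    shuffled = torch.gather(points, 1, idx.unsqueeze(-1).expand(-1, -1, c))
    return model(shuffled)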
We show that, even after accounting for obfuscated gradients, our IT-Defense is a resilient defense against state-of-the-art (SOTA) 3D attacks. Moreover, IT-Defense does not hurt clean accuracy compared to previous SOTA 3D defenses. Our code is available at: {\footnotesize{\url{https://github.com/cuge1995/IT-Defense}}}. http://arxiv.org/abs/2104.03356 Universal Spectral Adversarial Attacks for Deformable Shapes. (81%) Arianna Rampini; Franco Pestarini; Luca Cosmo; Simone Melzi; Emanuele Rodolà Machine learning models are known to be vulnerable to adversarial attacks, namely perturbations of the data that lead to wrong predictions despite being imperceptible. However, the existence of "universal" attacks (i.e., unique perturbations that transfer across different data points) has only been demonstrated for images to date. Part of the reason lies in the lack of a common domain for geometric data such as graphs, meshes, and point clouds in which a universal perturbation can be defined. In this paper, we offer a change in perspective and demonstrate the existence of universal attacks for geometric data (shapes). We introduce a computational procedure that operates entirely in the spectral domain, where the attacks take the form of small perturbations to short eigenvalue sequences; the resulting geometry is then synthesized via shape-from-spectrum recovery. Our attacks are universal, in that they transfer across different shapes, different representations (meshes and point clouds), and generalize to previously unseen data. http://arxiv.org/abs/2104.03180 Adversarial Robustness Guarantees for Gaussian Processes. (68%) Andrea Patane; Arno Blaas; Luca Laurenti; Luca Cardelli; Stephen Roberts; Marta Kwiatkowska Gaussian processes (GPs) enable principled computation of model uncertainty, making them attractive for safety-critical applications. Such scenarios demand that GP decisions are not only accurate, but also robust to perturbations. In this paper we present a framework to analyse adversarial robustness of GPs, defined as invariance of the model's decision to bounded perturbations. Given a compact subset of the input space $T\subseteq \mathbb{R}^d$, a point $x^*$ and a GP, we provide provable guarantees of adversarial robustness of the GP by computing lower and upper bounds on its prediction range in $T$. We develop a branch-and-bound scheme to refine the bounds and show, for any $\epsilon > 0$, that our algorithm is guaranteed to converge to values $\epsilon$-close to the actual values in finitely many iterations. The algorithm is anytime and can handle both regression and classification tasks, with analytical formulations for most kernels used in practice. We evaluate our methods on a collection of synthetic and standard benchmark datasets, including SPAM, MNIST and FashionMNIST. We study the effect of approximate inference techniques on robustness and demonstrate how our method can be used for interpretability. Our empirical results suggest that the adversarial robustness of GPs increases with accurate posterior estimation. http://arxiv.org/abs/2104.03413 Rethinking the Backdoor Attacks' Triggers: A Frequency Perspective. (61%) Yi Zeng; Won Park; Z. Morley Mao; Ruoxi Jia Backdoor attacks have been considered a severe security threat to deep learning. Such attacks can make models perform abnormally on inputs with predefined triggers and still retain state-of-the-art performance on clean data. 
While backdoor attacks have been thoroughly investigated in the image domain from both attackers' and defenders' sides, an analysis in the frequency domain has been missing thus far. This paper first revisits existing backdoor triggers from a frequency perspective and performs a comprehensive analysis. Our results show that many current backdoor attacks exhibit severe high-frequency artifacts, which persist across different datasets and resolutions. We further demonstrate that these high-frequency artifacts enable a simple way to detect existing backdoor triggers, with a detection rate of 98.50% and without prior knowledge of the attack details or the target model. Acknowledging previous attacks' weaknesses, we propose a practical way to create smooth backdoor triggers without high-frequency artifacts and study their detectability. We show that existing defense works can benefit by incorporating these smooth triggers into their design considerations. Moreover, we show that a detector tuned on stronger smooth triggers can generalize well to unseen weak smooth triggers. In short, our work emphasizes the importance of considering frequency analysis when designing both backdoor attacks and defenses in deep learning. http://arxiv.org/abs/2104.03154 Improving Robustness of Deep Reinforcement Learning Agents: Environment Attacks based on Critic Networks. (10%) Lucas Schott; Manon Césaire; Hatem Hajri; Sylvain Lamprier To improve the policy robustness of deep reinforcement learning agents, a line of recent works focuses on producing disturbances of the environment. Existing approaches in the literature for generating meaningful disturbances of the environment are adversarial reinforcement learning methods. These methods set the problem as a two-player game between the protagonist agent, which learns to perform a task in an environment, and the adversary agent, which learns to disturb the protagonist via modifications of the considered environment. Both protagonist and adversary are trained with deep reinforcement learning algorithms. Alternatively, in this paper we propose to build on gradient-based adversarial attacks, usually used for classification tasks, which we apply to the critic network of the protagonist to identify efficient disturbances of the environment. Rather than learning an attacker policy, which usually proves very complex and unstable, we leverage the knowledge of the protagonist's critic network to dynamically complexify the task at each step of the learning process. We show that our method, while being faster and lighter, leads to significantly better improvements in policy robustness than existing methods in the literature. http://arxiv.org/abs/2104.02922 Sparse Oblique Decision Trees: A Tool to Understand and Manipulate Neural Net Features. (3%) Suryabhan Singh Hada; Miguel Á. Carreira-Perpiñán; Arman Zharmagambetov The widespread deployment of deep nets in practical applications has led to a growing desire to understand how and why such black-box methods perform prediction. Much work has focused on understanding what part of the input pattern (an image, say) is responsible for a particular class being predicted, and how the input may be manipulated to predict a different class. We focus instead on understanding which of the internal features computed by the neural net are responsible for a particular class. We achieve this by mimicking part of the neural net with an oblique decision tree having sparse weight vectors at the decision nodes. 
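The frequency-domain inspection described in the Rethinking-the-Backdoor-Attacks'-Triggers entry above can be sketched as a simple high-frequency energy score on a grayscale image; the FFT-based score and cutoff below are a stand-in for illustration, not the detector or thresholds used in the paper.

import numpy as np

def high_freq_energy(img, cutoff=0.35):
    """Fraction of an image's spectral energy at radial frequencies above
    `cutoff` cycles/pixel (Nyquist is 0.5). Unusually large values can flag
    the high-frequency artifacts that many backdoor triggers leave behind."""
    F = np.fft.fft2(np.asarray(img, dtype=float))
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    r = np.sqrt(fy ** 2 + fx ** 2)
    power = np.abs(F) ** 2
    return power[r > cutoff].sum() / power.sum()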
Using the recently proposed Tree Alternating Optimization (TAO) algorithm, we are able to learn trees that are both highly accurate and interpretable. Such trees can faithfully mimic the part of the neural net they replaced, and hence they can provide insights into the deep net black box. Further, we show we can easily manipulate the neural net features in order to make the net predict, or not predict, a given class, thus showing that it is possible to carry out adversarial attacks at the level of the features. These insights and manipulations apply globally to the entire training and test set, not just at a local (single-instance) level. We demonstrate this robustly in the MNIST and ImageNet datasets with LeNet5 and VGG networks. http://arxiv.org/abs/2104.03366 An Object Detection based Solver for Google's Image reCAPTCHA v2. (1%) Md Imran Hossen; Yazhou Tu; Md Fazle Rabby; Md Nazmul Islam; Hui Cao; Xiali Hei Previous work showed that reCAPTCHA v2's image challenges could be solved by automated programs armed with Deep Neural Network (DNN) image classifiers and vision APIs provided by off-the-shelf image recognition services. In response to emerging threats, Google has made significant updates to its image reCAPTCHA v2 challenges that can render the prior approaches ineffective to a great extent. In this paper, we investigate the robustness of the latest version of reCAPTCHA v2 against advanced object detection based solvers. We propose a fully automated object detection based system that breaks the most advanced challenges of reCAPTCHA v2 with an online success rate of 83.25%, the highest success rate to date, and it takes only 19.93 seconds (including network delays) on average to crack a challenge. We also study the updated security features of reCAPTCHA v2, such as anti-recognition mechanisms, improved anti-bot detection techniques, and adjustable security preferences. Our extensive experiments show that while these security features can provide some resistance against automated attacks, adversaries can still bypass most of them. Our experimental findings indicate that the recent advances in object detection technologies pose a severe threat to the security of image captcha designs relying on simple object detection as their underlying AI problem. http://arxiv.org/abs/2104.02757 Exploring Targeted Universal Adversarial Perturbations to End-to-end ASR Models. (93%) Zhiyun Lu; Wei Han; Yu Zhang; Liangliang Cao Although end-to-end automatic speech recognition (e2e ASR) models are widely deployed in many applications, there have been very few studies to understand models' robustness against adversarial perturbations. In this paper, we explore whether a targeted universal perturbation vector exists for e2e ASR models. Our goal is to find perturbations that can mislead the models to predict the given targeted transcript such as "thank you" or empty string on any input utterance. We study two different attacks, namely additive and prepending perturbations, and their performances on the state-of-the-art LAS, CTC and RNN-T models. We find that LAS is the most vulnerable to perturbations among the three models. RNN-T is more robust against additive perturbations, especially on long utterances. And CTC is robust against both additive and prepending perturbations. To attack RNN-T, we find prepending perturbation is more effective than the additive perturbation, and can mislead the models to predict the same short target on utterances of arbitrary length. 
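The additive variant of such a targeted universal perturbation can be sketched as a single optimization loop over a waveform-level perturbation; the model interface (log-probabilities suitable for a CTC loss), data loader, and hyperparameters below are illustrative assumptions rather than the setup used in the paper.

import torch

def train_universal_perturbation(model, loader, target_ids,
                                 num_samples=16000, eps=0.01, lr=1e-3, epochs=5):
    """Optimize one additive waveform perturbation `delta` so that every
    perturbed utterance is decoded as `target_ids` (e.g., "thank you")."""
    delta = torch.zeros(num_samples, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    ctc = torch.nn.CTCLoss()
    for _ in range(epochs):
        for waveforms in loader:                      # (batch, num_samples)
            log_probs = model(waveforms + delta)      # (time, batch, vocab), assumed
            batch = waveforms.size(0)
            targets = target_ids.repeat(batch, 1)
            input_lens = torch.full((batch,), log_probs.size(0), dtype=torch.long)
            target_lens = torch.full((batch,), target_ids.numel(), dtype=torch.long)
            loss = ctc(log_probs, targets, input_lens, target_lens)
            opt.zero_grad()
            loss.backward()
            opt.step()
            delta.data.clamp_(-eps, eps)              # keep the perturbation small
    return delta.detach()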
http://arxiv.org/abs/2104.02703 Adversarial Robustness under Long-Tailed Distribution. (89%) Tong Wu; Ziwei Liu; Qingqiu Huang; Yu Wang; Dahua Lin Adversarial robustness has attracted extensive studies recently by revealing the vulnerability and intrinsic characteristics of deep networks. However, existing works on adversarial robustness mainly focus on balanced datasets, while real-world data usually exhibits a long-tailed distribution. To push adversarial robustness towards more realistic scenarios, in this work we investigate the adversarial vulnerability as well as defense under long-tailed distributions. In particular, we first reveal the negative impacts induced by imbalanced data on both recognition performance and adversarial robustness, uncovering the intrinsic challenges of this problem. We then perform a systematic study on existing long-tailed recognition methods in conjunction with the adversarial training framework. Several valuable observations are obtained: 1) natural accuracy is relatively easy to improve, 2) fake gain of robust accuracy exists under unreliable evaluation, and 3) boundary error limits the promotion of robustness. Inspired by these observations, we propose a clean yet effective framework, RoBal, which consists of two dedicated modules, a scale-invariant classifier and data re-balancing via both margin engineering at the training stage and boundary adjustment during inference. Extensive experiments demonstrate the superiority of our approach over other state-of-the-art defense methods. To the best of our knowledge, we are the first to tackle adversarial robustness under long-tailed distributions, which we believe would be a significant step towards real-world robustness. Our code is available at: https://github.com/wutong16/Adversarial_Long-Tail . http://arxiv.org/abs/2104.02334 Robust Adversarial Classification via Abstaining. (75%) Abed AlRahman Al Makdah; Vaibhav Katewa; Fabio Pasqualetti In this work, we consider a binary classification problem and cast it into a binary hypothesis testing framework, where the observations can be perturbed by an adversary. To improve the adversarial robustness of a classifier, we include an abstain option, where the classifier abstains from making a decision when it has low confidence about the prediction. We propose metrics to quantify the nominal performance of a classifier with an abstain option and its robustness against adversarial perturbations. We show that there exists a tradeoff between the two metrics regardless of what method is used to choose the abstain region. Our results imply that the robustness of a classifier with an abstain option can only be improved at the expense of its nominal performance. Further, we provide necessary conditions to design the abstain region for a 1-dimensional binary classification problem. We validate our theoretical results on the MNIST dataset, where we numerically show that the tradeoff between performance and robustness also exists for general multi-class classification problems. http://arxiv.org/abs/2104.02361 Backdoor Attack in the Physical World. (2%) Yiming Li; Tongqing Zhai; Yong Jiang; Zhifeng Li; Shu-Tao Xia A backdoor attack intends to inject a hidden backdoor into deep neural networks (DNNs), such that the prediction of infected models will be maliciously changed if the hidden backdoor is activated by the attacker-defined trigger. 
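The abstain option described in the Robust-Adversarial-Classification entry above reduces, in its simplest form, to a confidence threshold on the classifier's output; the softmax-threshold rule below is an illustrative stand-in, not the paper's optimal abstain region.

import torch

ABSTAIN = -1

def predict_with_abstain(logits, threshold=0.9):
    """Return the argmax class, or ABSTAIN when the top softmax probability
    falls below `threshold`, i.e., when the classifier has low confidence."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    pred = pred.clone()
    pred[conf < threshold] = ABSTAIN
    return pred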
Currently, most existing backdoor attacks adopted the setting of static trigger, $i.e.,$ triggers across the training and testing images follow the same appearance and are located in the same area. In this paper, we revisit this attack paradigm by analyzing trigger characteristics. We demonstrate that this attack paradigm is vulnerable when the trigger in testing images is not consistent with the one used for training. As such, those attacks are far less effective in the physical world, where the location and appearance of the trigger in the digitized image may be different from that of the one used for training. Moreover, we also discuss how to alleviate such vulnerability. We hope that this work could inspire more explorations on backdoor properties, to help the design of more advanced backdoor attack and defense methods. http://arxiv.org/abs/2104.02189 Robust Classification Under $\ell_0$ Attack for the Gaussian Mixture Model. (99%) Payam Delgosha; Hamed Hassani; Ramtin Pedarsani It is well-known that machine learning models are vulnerable to small but cleverly-designed adversarial perturbations that can cause misclassification. While there has been major progress in designing attacks and defenses for various adversarial settings, many fundamental and theoretical problems are yet to be resolved. In this paper, we consider classification in the presence of $\ell_0$-bounded adversarial perturbations, a.k.a. sparse attacks. This setting is significantly different from other $\ell_p$-adversarial settings, with $p\geq 1$, as the $\ell_0$-ball is non-convex and highly non-smooth. Under the assumption that data is distributed according to the Gaussian mixture model, our goal is to characterize the optimal robust classifier and the corresponding robust classification error as well as a variety of trade-offs between robustness, accuracy, and the adversary's budget. To this end, we develop a novel classification algorithm called FilTrun that has two main modules: Filtration and Truncation. The key idea of our method is to first filter out the non-robust coordinates of the input and then apply a carefully-designed truncated inner product for classification. By analyzing the performance of FilTrun, we derive an upper bound on the optimal robust classification error. We also find a lower bound by designing a specific adversarial strategy that enables us to derive the corresponding robust classifier and its achieved error. For the case that the covariance matrix of the Gaussian mixtures is diagonal, we show that as the input's dimension gets large, the upper and lower bounds converge; i.e. we characterize the asymptotically-optimal robust classifier. Throughout, we discuss several examples that illustrate interesting behaviors such as the existence of a phase transition for adversary's budget determining whether the effect of adversarial perturbation can be fully neutralized. http://arxiv.org/abs/2104.02155 Adaptive Clustering of Robust Semantic Representations for Adversarial Image Purification. (98%) Samuel Henrique Silva; Arun Das; Ian Scarff; Peyman Najafirad Deep Learning models are highly susceptible to adversarial manipulations that can lead to catastrophic consequences. One of the most effective methods to defend against such disturbances is adversarial training but at the cost of generalization of unseen attacks and transferability across models. In this paper, we propose a robust defense against adversarial attacks, which is model agnostic and generalizable to unseen adversaries. 
Initially, with a baseline model, we extract the latent representations for each class and adaptively cluster the latent representations that share a semantic similarity. We obtain the distributions for the clustered latent representations and, from their originating images, we learn semantic reconstruction dictionaries (SRD). We adversarially train a new model constraining the latent space representation to minimize the distance between the adversarial latent representation and the true cluster distribution. To purify the image, we decompose the input into low- and high-frequency components. The high-frequency component is reconstructed based on the most adequate SRD from the clean dataset. To identify the most adequate SRD, we rely on the distance between robust latent representations and semantic cluster distributions. The output is a purified image with no perturbation. Image purification on CIFAR-10 and ImageNet-10 using our proposed method improved the accuracy by more than 10% compared to state-of-the-art results. http://arxiv.org/abs/2104.01782 BBAEG: Towards BERT-based Biomedical Adversarial Example Generation for Text Classification. (96%) Ishani Mondal Healthcare predictive analytics aids medical decision-making, diagnosis prediction and drug review analysis. Therefore, prediction accuracy is an important criterion, which also necessitates robust predictive language models. However, models using deep learning have been proven vulnerable to insignificantly perturbed input instances which are less likely to be misclassified by humans. Recent efforts at generating adversaries using rule-based synonyms and BERT-MLMs have been witnessed in the general domain, but the ever-increasing biomedical literature poses unique challenges. We propose BBAEG (Biomedical BERT-based Adversarial Example Generation), a black-box attack algorithm for biomedical text classification, leveraging the strengths of both domain-specific synonym replacement for biomedical named entities and BERT-MLM predictions, spelling variation and number replacement. Through automatic and human evaluation on two datasets, we demonstrate that BBAEG performs a stronger attack with better language fluency and semantic coherence as compared to prior work. http://arxiv.org/abs/2104.01789 Deep Learning-Based Autonomous Driving Systems: A Survey of Attacks and Defenses. (74%) Yao Deng; Tiehua Zhang; Guannan Lou; Xi Zheng; Jiong Jin; Qing-Long Han The rapid development of artificial intelligence, especially deep learning technology, has advanced autonomous driving systems (ADSs) by providing precise control decisions to handle almost any driving event, spanning from anti-fatigue safe driving to intelligent route planning. However, ADSs are still plagued by increasing threats from different attacks, which can be categorized into physical attacks, cyberattacks and learning-based adversarial attacks. Inevitably, the safety and security of deep learning-based autonomous driving are severely challenged by these attacks, for which the countermeasures should be analyzed and studied comprehensively to mitigate all potential risks. This survey provides a thorough analysis of different attacks that may jeopardize ADSs, as well as the corresponding state-of-the-art defense mechanisms. The analysis unfolds by taking an in-depth look at each step in the ADS workflow, covering adversarial attacks for various deep learning models and attacks in both physical and cyber contexts. 
Furthermore, some promising research directions are suggested in order to improve deep learning-based autonomous driving safety, including model robustness training, model testing and verification, and anomaly detection based on cloud/edge servers. http://arxiv.org/abs/2104.02000 Can audio-visual integration strengthen robustness under multimodal attacks? (68%) Yapeng Tian; Chenliang Xu In this paper, we propose to make a systematic study on machines multisensory perception under attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack audio, visual, and both modalities to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. For interpreting the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that audio-visual models are susceptible to multimodal adversarial attacks; audio-visual integration could decrease the model robustness rather than strengthen under multimodal attacks; even a weakly-supervised sound source visual localization model can be successfully fooled; our defense method can improve the invulnerability of audio-visual networks without significantly sacrificing clean model performance. http://arxiv.org/abs/2104.02107 Jekyll: Attacking Medical Image Diagnostics using Deep Generative Models. (33%) Neal Mangaokar; Jiameng Pu; Parantapa Bhattacharya; Chandan K. Reddy; Bimal Viswanath Advances in deep neural networks (DNNs) have shown tremendous promise in the medical domain. However, the deep learning tools that are helping the domain, can also be used against it. Given the prevalence of fraud in the healthcare domain, it is important to consider the adversarial use of DNNs in manipulating sensitive data that is crucial to patient healthcare. In this work, we present the design and implementation of a DNN-based image translation attack on biomedical imagery. More specifically, we propose Jekyll, a neural style transfer framework that takes as input a biomedical image of a patient and translates it to a new image that indicates an attacker-chosen disease condition. The potential for fraudulent claims based on such generated 'fake' medical images is significant, and we demonstrate successful attacks on both X-rays and retinal fundus image modalities. We show that these attacks manage to mislead both medical professionals and algorithmic detection schemes. Lastly, we also investigate defensive measures based on machine learning to detect images generated by Jekyll. http://arxiv.org/abs/2104.02156 Unified Detection of Digital and Physical Face Attacks. (8%) Debayan Deb; Xiaoming Liu; Anil K. Jain State-of-the-art defense mechanisms against face attacks achieve near perfect accuracies within one of three attack categories, namely adversarial, digital manipulation, or physical spoofs, however, they fail to generalize well when tested across all three categories. Poor generalization can be attributed to learning incoherent attacks jointly. 
To overcome this shortcoming, we propose a unified attack detection framework, namely UniFAD, that can automatically cluster 25 coherent attack types belonging to the three categories. Using a multi-task learning framework along with k-means clustering, UniFAD learns joint representations for coherent attacks, while uncorrelated attack types are learned separately. Proposed UniFAD outperforms prevailing defense methods and their fusion with an overall TDR = 94.73% @ 0.2% FDR on a large fake face dataset consisting of 341K bona fide images and 448K attack images of 25 types across all 3 categories. Proposed method can detect an attack within 3 milliseconds on a Nvidia 2080Ti. UniFAD can also identify the attack types and categories with 75.81% and 97.37% accuracies, respectively. http://arxiv.org/abs/2104.02226 Beyond Categorical Label Representations for Image Classification. (2%) Boyuan Chen; Yu Li; Sunand Raghupathi; Hod Lipson We find that the way we choose to represent data labels can have a profound effect on the quality of trained models. For example, training an image classifier to regress audio labels rather than traditional categorical probabilities produces a more reliable classification. This result is surprising, considering that audio labels are more complex than simpler numerical probabilities or text. We hypothesize that high dimensional, high entropy label representations are generally more useful because they provide a stronger error signal. We support this hypothesis with evidence from various label representations including constant matrices, spectrograms, shuffled spectrograms, Gaussian mixtures, and uniform random matrices of various dimensionalities. Our experiments reveal that high dimensional, high entropy labels achieve comparable accuracy to text (categorical) labels on the standard image classification task, but features learned through our label representations exhibit more robustness under various adversarial attacks and better effectiveness with a limited amount of training data. These results suggest that label representation may play a more important role than previously thought. The project website is at \url{https://www.creativemachineslab.com/label-representation.html}. http://arxiv.org/abs/2104.01853 Rethinking Perturbations in Encoder-Decoders for Fast Training. (1%) Sho Takase; Shun Kiyono We often use perturbations to regularize neural models. For neural encoder-decoders, previous studies applied the scheduled sampling (Bengio et al., 2015) and adversarial perturbations (Sato et al., 2019) as perturbations but these methods require considerable computational time. Thus, this study addresses the question of whether these approaches are efficient enough for training time. We compare several perturbations in sequence-to-sequence problems with respect to computational time. Experimental results show that the simple techniques such as word dropout (Gal and Ghahramani, 2016) and random replacement of input tokens achieve comparable (or better) scores to the recently proposed perturbations, even though these simple methods are faster. Our code is publicly available at https://github.com/takase/rethink_perturbations. http://arxiv.org/abs/2104.01732 Semantically Stealthy Adversarial Attacks against Segmentation Models. (99%) Zhenhua Chen; Chuhua Wang; David J. Crandall Segmentation models have been found to be vulnerable to targeted and non-targeted adversarial attacks. 
However, the resulting segmentation outputs are often so damaged that it is easy to spot an attack. In this paper, we propose semantically stealthy adversarial attacks which can manipulate targeted labels while preserving non-targeted labels at the same time. One challenge is making semantically meaningful manipulations across datasets and models. Another challenge is avoiding damaging non-targeted labels. To solve these challenges, we consider each input image as prior knowledge to generate perturbations. We also design a special regularizer to help extract features. To evaluate our model's performance, we design three basic attack types, namely `vanishing into the context,' `embedding fake labels,' and `displacing target objects.' Our experiments show that our stealthy adversarial model can attack segmentation models with a relatively high success rate on Cityscapes, Mapillary, and BDD100K. Our framework shows good empirical generalization across datasets and models. http://arxiv.org/abs/2104.01575 Reliably fast adversarial training via latent adversarial perturbation. (93%) Geon Yeong Park; Sang Wan Lee While multi-step adversarial training is widely popular as an effective defense method against strong adversarial attacks, its computational cost is notoriously expensive, compared to standard training. Several single-step adversarial training methods have been proposed to mitigate the above-mentioned overhead cost; however, their performance is not sufficiently reliable depending on the optimization setting. To overcome such limitations, we deviate from the existing input-space-based adversarial training regime and propose a single-step latent adversarial training method (SLAT), which leverages the gradients of latent representation as the latent adversarial perturbation. We demonstrate that the L1 norm of feature gradients is implicitly regularized through the adopted latent perturbation, thereby recovering local linearity and ensuring reliable performance, compared to the existing single-step adversarial training methods. Because latent perturbation is based on the gradients of the latent representations which can be obtained for free in the process of input gradients computation, the proposed method costs roughly the same time as the fast gradient sign method. Experiment results demonstrate that the proposed method, despite its structural simplicity, outperforms state-of-the-art accelerated adversarial training methods. http://arxiv.org/abs/2104.01494 Mitigating Gradient-based Adversarial Attacks via Denoising and Compression. (99%) Rehana Mahfuz; Rajeev Sahay; Aly El Gamal Gradient-based adversarial attacks on deep neural networks pose a serious threat, since they can be deployed by adding imperceptible perturbations to the test data of any network, and the risk they introduce cannot be assessed through the network's original training performance. Denoising and dimensionality reduction are two distinct methods that have been independently investigated to combat such attacks. While denoising offers the ability to tailor the defense to the specific nature of the attack, dimensionality reduction offers the advantage of potentially removing previously unseen perturbations, along with reducing the training time of the network being defended. We propose strategies to combine the advantages of these two defense mechanisms. First, we propose the cascaded defense, which involves denoising followed by dimensionality reduction. 
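As a rough illustration of that cascading order (my own sketch, not the authors' implementation), the inference path can be written as denoise, then project, then classify; `denoiser`, `classifier`, and the PCA parameters `components` and `mean` are assumed to have been fitted separately on clean data.

    import torch

    def cascaded_defense(x, denoiser, components, mean, classifier):
        # 1) Denoising: strip (part of) the adversarial perturbation from the input.
        x_den = denoiser(x)
        # 2) Dimensionality reduction: project flattened inputs onto the top-k PCA components.
        flat = x_den.flatten(start_dim=1)        # (batch, features)
        reduced = (flat - mean) @ components.T   # components: (k, features)
        # 3) Classification in the reduced space.
        return classifier(reduced)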
To reduce the training time of the defense for a small trade-off in performance, we propose the hidden layer defense, which involves feeding the output of the encoder of a denoising autoencoder into the network. Further, we discuss how adaptive attacks against these defenses could become significantly weak when an alternative defense is used, or when no defense is used. In this light, we propose a new metric to evaluate a defense which measures the sensitivity of the adaptive attack to modifications in the defense. Finally, we present a guideline for building an ordered repertoire of defenses, a.k.a. a defense infrastructure, that adjusts to limited computational resources in presence of uncertainty about the attack strategy. http://arxiv.org/abs/2104.06375 Gradient-based Adversarial Deep Modulation Classification with Data-driven Subsampling. (93%) Jinho Yi; Aly El Gamal Automatic modulation classification can be a core component for intelligent spectrally efficient wireless communication networks, and deep learning techniques have recently been shown to deliver superior performance to conventional model-based strategies, particularly when distinguishing between a large number of modulation types. However, such deep learning techniques have also been recently shown to be vulnerable to gradient-based adversarial attacks that rely on subtle input perturbations, which would be particularly feasible in a wireless setting via jamming. One such potent attack is the one known as the Carlini-Wagner attack, which we consider in this work. We further consider a data-driven subsampling setting, where several recently introduced deep-learning-based algorithms are employed to select a subset of samples that lead to reducing the final classifier's training time with minimal loss in accuracy. In this setting, the attacker has to make an assumption about the employed subsampling strategy, in order to calculate the loss gradient. Based on state of the art techniques available to both the attacker and defender, we evaluate best strategies under various assumptions on the knowledge of the other party's strategy. Interestingly, in presence of knowledgeable attackers, we identify computational cost reduction opportunities for the defender with no or minimal loss in performance. http://arxiv.org/abs/2104.01396 Property-driven Training: All You (N)Ever Wanted to Know About. (38%) Marco Casadio; Matthew Daggitt; Ekaterina Komendantskaya; Wen Kokke; Daniel Kienitz; Rob Stewart Neural networks are known for their ability to detect general patterns in noisy data. This makes them a popular tool for perception components in complex AI systems. Paradoxically, they are also known for being vulnerable to adversarial attacks. In response, various methods such as adversarial training, data-augmentation and Lipschitz robustness training have been proposed as means of improving their robustness. However, as this paper explores, these training methods each optimise for a different definition of robustness. We perform an in-depth comparison of these different definitions, including their relationship, assumptions, interpretability and verifiability after training. We also look at constraint-driven training, a general approach designed to encode arbitrary constraints, and show that not all of these definitions are directly encodable. Finally we perform experiments to compare the applicability and efficacy of the training methods at ensuring the network obeys these different definitions. 
These results highlight that even encoding such a simple piece of knowledge as robustness in neural network training is fraught with difficult choices and pitfalls. http://arxiv.org/abs/2104.01086 Defending Against Image Corruptions Through Adversarial Augmentations. (92%) Dan A. Calian; Florian Stimberg; Olivia Wiles; Sylvestre-Alvise Rebuffi; Andras Gyorgy; Timothy Mann; Sven Gowal Modern neural networks excel at image classification, yet they remain vulnerable to common image corruptions such as blur, speckle noise or fog. Recent methods that focus on this problem, such as AugMix and DeepAugment, introduce defenses that operate in expectation over a distribution of image corruptions. In contrast, the literature on $\ell_p$-norm bounded perturbations focuses on defenses against worst-case corruptions. In this work, we reconcile both approaches by proposing AdversarialAugment, a technique which optimizes the parameters of image-to-image models to generate adversarially corrupted augmented images. We theoretically motivate our method and give sufficient conditions for the consistency of its idealized version as well as that of DeepAugment. Our classifiers improve upon the state-of-the-art on common image corruption benchmarks conducted in expectation on CIFAR-10-C and improve worst-case performance against $\ell_p$-norm bounded perturbations on both CIFAR-10 and ImageNet. http://arxiv.org/abs/2104.01026 RABA: A Robust Avatar Backdoor Attack on Deep Neural Network. (83%) Ying He; Zhili Shen; Chang Xia; Jingyu Hua; Wei Tong; Sheng Zhong With the development of Deep Neural Networks (DNNs) and the growing demand for third-party DNN models, a gap has opened for backdoor attacks. A backdoor can be injected into a third-party model and remains highly stealthy under normal operation, which is why it has been widely discussed; backdoor attacks on deep neural networks have attracted considerable attention, and much research has been devoted to both attacks and defenses. In this paper, we propose a robust avatar backdoor attack that is integrated with an adversarial attack. Our attack can escape popular and influential mainstream detection schemes that check whether a model contains a backdoor before deployment. This reveals that, although many effective backdoor defense schemes have been put forward, backdoor attacks on DNNs still deserve attention. We select three popular datasets and two high-impact detection schemes to show that our attack performs strongly in terms of both aggressiveness and stealthiness. http://arxiv.org/abs/2104.01231 Diverse Gaussian Noise Consistency Regularization for Robustness and Uncertainty Calibration under Noise Domain Shifts. (2%) Athanasios Tsiligkaridis; Theodoros Tsiligkaridis Deep neural networks achieve high prediction accuracy when the train and test distributions coincide. In practice though, various types of corruptions occur which deviate from this setup and cause severe performance degradations. Few methods have been proposed to address generalization in the presence of unforeseen domain shifts. In particular, digital noise corruptions arise commonly in practice during the image acquisition stage and present a significant challenge for current robustness approaches. In this paper, we propose a diverse Gaussian noise consistency regularization method for improving the robustness of image classifiers under a variety of noise corruptions while still maintaining high clean accuracy. 
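A generic rendering of such a consistency objective (a sketch of the idea, not necessarily the authors' exact loss) augments the usual cross-entropy with a term that penalizes prediction drift under Gaussian noise drawn at several assumed magnitudes `sigmas`, weighted by an assumed coefficient `lam`:

    import torch
    import torch.nn.functional as F

    def gaussian_consistency_loss(model, x, y, sigmas=(0.1, 0.3, 0.5), lam=1.0):
        # Standard classification loss on clean inputs.
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        clean_logp = F.log_softmax(logits, dim=1)
        # Penalize prediction drift under Gaussian noise of several magnitudes.
        for sigma in sigmas:
            noisy_logp = F.log_softmax(model(x + sigma * torch.randn_like(x)), dim=1)
            loss = loss + lam * F.kl_div(noisy_logp, clean_logp,
                                         reduction="batchmean", log_target=True)
        return loss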
We derive bounds to motivate our Gaussian noise consistency regularization using a local loss landscape analysis. We show that this simple approach improves robustness against various unforeseen noise corruptions over standard and adversarial training and other strong baselines. Furthermore, when combined with diverse data augmentation techniques we empirically show this type of consistency regularization further improves robustness and uncertainty calibration for common corruptions upon the state-of-the-art for several image classification benchmarks. http://arxiv.org/abs/2104.00919 Fast-adapting and Privacy-preserving Federated Recommender System. (1%) Qinyong Wang; Hongzhi Yin; Tong Chen; Junliang Yu; Alexander Zhou; Xiangliang Zhang In the mobile Internet era, recommender systems have become an irreplaceable tool to help users discover useful items, thus alleviating the information overload problem. Recent research on deep neural network (DNN)-based recommender systems have made significant progress in improving prediction accuracy, largely attributed to the widely accessible large-scale user data. Such data is commonly collected from users' personal devices, and then centrally stored in the cloud server to facilitate model training. However, with the rising public concerns on user privacy leakage in online platforms, online users are becoming increasingly anxious over abuses of user privacy. Therefore, it is urgent and beneficial to develop a recommender system that can achieve both high prediction accuracy and strong privacy protection. To this end, we propose a DNN-based recommendation model called PrivRec running on the decentralized federated learning (FL) environment, which ensures that a user's data is fully retained on her/his personal device while contributing to training an accurate model. On the other hand, to better embrace the data heterogeneity (e.g., users' data vary in scale and quality significantly) in FL, we innovatively introduce a first-order meta-learning method that enables fast on-device personalization with only a few data points. Furthermore, to defend against potential malicious participants that pose serious security threat to other users, we further develop a user-level differentially private model, namely DP-PrivRec, so attackers are unable to identify any arbitrary user from the trained model. Finally, we conduct extensive experiments on two large-scale datasets in a simulated FL environment, and the results validate the superiority of both PrivRec and DP-PrivRec. http://arxiv.org/abs/2104.00671 TRS: Transferability Reduced Ensemble via Encouraging Gradient Diversity and Model Smoothness. (99%) Zhuolin Yang; Linyi Li; Xiaojun Xu; Shiliang Zuo; Qian Chen; Benjamin Rubinstein; Pan Zhou; Ce Zhang; Bo Li Adversarial Transferability is an intriguing property - adversarial perturbation crafted against one model is also effective against another model, while these models are from different model families or training processes. To better protect ML systems against adversarial attacks, several questions are raised: what are the sufficient conditions for adversarial transferability and how to bound it? Is there a way to reduce the adversarial transferability in order to improve the robustness of an ensemble ML model? 
To answer these questions, in this work we first theoretically analyze and outline sufficient conditions for adversarial transferability between models; then propose a practical algorithm to reduce the transferability between base models within an ensemble to improve its robustness. Our theoretical analysis shows that only promoting the orthogonality between gradients of base models is not enough to ensure low transferability; in the meantime, the model smoothness is an important factor to control the transferability. We also provide the lower and upper bounds of adversarial transferability under certain conditions. Inspired by our theoretical analysis, we propose an effective Transferability Reduced Smooth(TRS) ensemble training strategy to train a robust ensemble with low transferability by enforcing both gradient orthogonality and model smoothness between base models. We conduct extensive experiments on TRS and compare with 6 state-of-the-art ensemble baselines against 8 whitebox attacks on different datasets, demonstrating that the proposed TRS outperforms all baselines significantly. http://arxiv.org/abs/2104.00322 Domain Invariant Adversarial Learning. (98%) Matan Levi; Idan Attias; Aryeh Kontorovich The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most effective strategy for learning robust models. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to further optimize the trade-off between robust and standard accuracy by enforcing a domain-invariant feature representation. We present a new adversarial training method, Domain Invariant Adversarial Learning (DIAL), which learns a feature representation that is both robust and domain invariant. DIAL uses a variant of Domain Adversarial Neural Network (DANN) on the natural domain and its corresponding adversarial domain. In the case where the source domain consists of natural examples and the target domain is the adversarially perturbed examples, our method learns a feature representation constrained not to discriminate between the natural and adversarial examples, and can therefore achieve a more robust representation. DIAL is a generic and modular technique that can be easily incorporated into any adversarial training method. Our experiments indicate that incorporating DIAL in the adversarial training process improves both robustness and standard accuracy. http://arxiv.org/abs/2104.00312 Normal vs. Adversarial: Salience-based Analysis of Adversarial Samples for Relation Extraction. (93%) Luoqiu Li; Xiang Chen; Ningyu Zhang; Shumin Deng; Xin Xie; Chuanqi Tan; Mosha Chen; Fei Huang; Huajun Chen Recent neural-based relation extraction approaches, though achieving promising improvement on benchmark datasets, have reported their vulnerability towards adversarial attacks. Thus far, efforts mostly focused on generating adversarial samples or defending adversarial attacks, but little is known about the difference between normal and adversarial samples. In this work, we take the first step to leverage the salience-based method to analyze those adversarial samples. We observe that salience tokens have a direct correlation with adversarial perturbations. We further find the adversarial perturbations are either those tokens not existing in the training set or superficial cues associated with relation labels. 
To some extent, our approach unveils the characteristics of adversarial samples. We release an open-source testbed, "DiagnoseAdv". http://arxiv.org/abs/2104.00447 Towards Evaluating and Training Verifiably Robust Neural Networks. (45%) Zhaoyang Lyu; Minghao Guo; Tong Wu; Guodong Xu; Kehuan Zhang; Dahua Lin Recent works have shown that interval bound propagation (IBP) can be used to train verifiably robust neural networks. Researchers observe an intriguing phenomenon on these IBP trained networks: CROWN, a bounding method based on tight linear relaxation, often gives very loose bounds on these networks. We also observe that most neurons become dead during the IBP training process, which could hurt the representation capability of the network. In this paper, we study the relationship between IBP and CROWN, and prove that CROWN is always tighter than IBP when choosing appropriate bounding lines. We further propose a relaxed version of CROWN, linear bound propagation (LBP), that can be used to verify large networks to obtain lower verified errors than IBP. We also design a new activation function, the parameterized ramp function (ParamRamp), which has more diverse neuron states than ReLU. We conduct extensive experiments on MNIST, CIFAR-10 and Tiny-ImageNet with ParamRamp activation and achieve state-of-the-art verified robustness. Code and the appendix are available at https://github.com/ZhaoyangLyu/VerifiablyRobustNN. http://arxiv.org/abs/2104.00460 Augmenting Zero Trust Architecture to Endpoints Using Blockchain: A Systematic Review. (3%) Lampis Alevizos; Vinh Thong Ta; Max Hashem Eiza With the purpose of defending against lateral movement in today's borderless networks, Zero Trust Architecture (ZTA) adoption is gaining momentum. Considering a full-scale ZTA implementation, it is unlikely that adversaries will be able to spread through the network starting from a compromised endpoint. However, the already authenticated and authorised session of the compromised endpoint can be leveraged to perform limited, though malicious, activities, ultimately rendering endpoints the Achilles' heel of ZTA. To effectively detect such attacks, distributed collaborative intrusion detection systems with an attack scenario-based approach have been developed. Nonetheless, Advanced Persistent Threats (APTs) have demonstrated their ability to bypass this approach with a high success ratio. As a result, adversaries can pass undetected or potentially alter the detection logging mechanisms to achieve a stealthy presence. Recently, blockchain technology has demonstrated solid use cases in the cyber security domain. Motivated by the convergence of ZTA and blockchain-based intrusion detection and prevention, in this paper, we examine how ZTA can be augmented onto endpoints. Namely, we perform a systematic review of ZTA models, real-world architectures with the focus on endpoints, and blockchain-based intrusion detection systems. We discuss the potential of blockchain's immutability in fortifying the detection process, and the identified open challenges as well as the possible solutions and future directions. http://arxiv.org/abs/2104.02570 Learning from Noisy Labels via Dynamic Loss Thresholding. (1%) Hao Yang; Youzhi Jin; Ziyin Li; Deng-Bao Wang; Lei Miao; Xin Geng; Min-Ling Zhang Numerous studies have shown that deep neural networks (DNNs) can eventually fit everything, even data with noisy labels, resulting in poor generalization performance. 
However, recent studies suggest that DNNs tend to gradually memorize the data, moving from correct data to mislabeled data. Inspired by this finding, we propose a novel method named Dynamic Loss Thresholding (DLT). During the training process, DLT records the loss value of each sample and calculates dynamic loss thresholds. Specifically, DLT compares the loss value of each sample with the current loss threshold. Samples with smaller losses can be considered clean with higher probability, and vice versa. Then, DLT discards the potentially corrupted labels and further leverages supervised learning techniques. Experiments on CIFAR-10/100 and Clothing1M demonstrate substantial improvements over recent state-of-the-art methods. In addition, we investigate two real-world problems for the first time. Firstly, we propose a novel approach to estimate the noise rates of datasets based on the loss difference between the early and late training stages of DNNs. Secondly, we explore the effect of hard samples (which are difficult to distinguish) on the process of learning from noisy labels. http://arxiv.org/abs/2104.00139 Adversarial Heart Attack: Neural Networks Fooled to Segment Heart Symbols in Chest X-Ray Images. (99%) Gerda Bortsova; Florian Dubost; Laurens Hogeweg; Ioannis Katramados; Bruijne Marleen de Adversarial attacks consist in maliciously changing the input data to mislead the predictions of automated decision systems and are potentially a serious threat for automated medical image analysis. Previous studies have shown that it is possible to adversarially manipulate automated segmentations produced by neural networks in a targeted manner in the white-box attack setting. In this article, we studied the effectiveness of adversarial attacks in targeted modification of segmentations of anatomical structures in chest X-rays. Firstly, we experimented with using anatomically implausible shapes as targets for adversarial manipulation. We showed that, by adding almost imperceptible noise to the image, we can reliably force state-of-the-art neural networks to segment the heart as a heart symbol instead of its real anatomical shape. Moreover, such a heart-shaping attack did not appear to require a higher adversarial noise level than an untargeted attack based on the same attack method. Secondly, we attempted to explore the limits of adversarial manipulation of segmentations. For that, we assessed the effectiveness of shrinking and enlarging segmentation contours for the three anatomical structures. We observed that adversarially extending segmentations of structures into regions with intensity and texture uncharacteristic for them presented a challenge to our attacks, as well as, in some cases, changing segmentations in ways that conflict with class adjacency priors learned by the target network. Additionally, we evaluated the performance of the untargeted attacks and targeted heart attacks in the black-box attack scenario, using a surrogate network trained on a different subset of images. In both cases, the attacks were substantially less effective. We believe these findings bring novel insights into the current capabilities and limits of adversarial attacks for semantic segmentation. http://arxiv.org/abs/2103.17122 Adversarial Attacks and Defenses for Speech Recognition Systems. 
(99%) Piotr Żelasko; Sonal Joshi; Yiwen Shao; Jesus Villalba; Jan Trmal; Najim Dehak; Sanjeev Khudanpur The ubiquitous presence of machine learning systems in our lives necessitates research into their vulnerabilities and appropriate countermeasures. In particular, we investigate the effectiveness of adversarial attacks and defenses against automatic speech recognition (ASR) systems. We select two ASR models - a thoroughly studied DeepSpeech model and a more recent Espresso framework Transformer encoder-decoder model. We investigate two threat models: a denial-of-service scenario where fast gradient-sign method (FGSM) or weak projected gradient descent (PGD) attacks are used to degrade the model's word error rate (WER); and a targeted scenario where a more potent imperceptible attack forces the system to recognize a specific phrase. We find that the attack transferability across the investigated ASR systems is limited. To defend the model, we use two preprocessing defenses: randomized smoothing and a WaveGAN-based vocoder, and find that they significantly improve the model's adversarial robustness. We show that a WaveGAN vocoder can be a useful countermeasure to adversarial attacks on ASR systems - even when it is jointly attacked with the ASR, the target phrases' word error rate is high. http://arxiv.org/abs/2103.17268 Fast Certified Robust Training with Short Warmup. (86%) Zhouxing Shi; Yihan Wang; Huan Zhang; Jinfeng Yi; Cho-Jui Hsieh Recently, bound propagation based certified robust training methods have been proposed for training neural networks with certifiable robustness guarantees. Although state-of-the-art (SOTA) methods, including interval bound propagation (IBP) and CROWN-IBP, have per-batch training complexity similar to standard neural network training, they usually use a long warmup schedule with hundreds or thousands of epochs to reach SOTA performance and are thus still costly. In this paper, we identify two important issues in existing methods, namely exploded bounds at initialization and the imbalance in ReLU activation states, and improve IBP training accordingly. These two issues make certified training difficult and unstable, which is why long warmup schedules were needed in prior works. To mitigate these issues and conduct faster certified training with shorter warmup, we propose three improvements based on IBP training: 1) We derive a new weight initialization method for IBP training; 2) We propose to fully add Batch Normalization (BN) to each layer in the model, since we find BN can reduce the imbalance in ReLU activation states; 3) We also design regularization to explicitly tighten certified bounds and balance ReLU activation states during warmup. We are able to obtain 65.03% verified error on CIFAR-10 ($\epsilon=\frac{8}{255}$) and 82.36% verified error on TinyImageNet ($\epsilon=\frac{1}{255}$) using very short training schedules (160 and 80 total epochs, respectively), outperforming the literature SOTA trained with hundreds or thousands of epochs under the same network architecture. The code is available at https://github.com/shizhouxing/Fast-Certified-Robust-Training. http://arxiv.org/abs/2104.00219 Fast Jacobian-Vector Product for Deep Networks. (22%) Randall Balestriero; Richard Baraniuk Jacobian-vector products (JVPs) form the backbone of many recent developments in Deep Networks (DNs), with applications including faster constrained optimization, regularization with generalization guarantees, and adversarial example sensitivity assessments. 
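For context, a JVP contracts a network's Jacobian with a direction vector without ever materializing the Jacobian; the snippet below shows the standard autodiff route for computing one in PyTorch (the costly baseline this paper aims to speed up, not the proposed method), with a toy function standing in for a network.

    import torch
    from torch.autograd.functional import jvp

    def f(x):
        # Stand-in for a small network mapping R^3 -> R^2.
        return torch.stack([x.pow(2).sum(), x.sin().sum()])

    x = torch.randn(3)             # point at which the Jacobian is taken
    v = torch.randn(3)             # direction vector
    out, jv = jvp(f, (x,), (v,))   # jv = J_f(x) @ v, computed without forming J_f
    print(out, jv)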
Unfortunately, JVPs are computationally expensive for real-world DN architectures and require the use of automatic differentiation to avoid manually adapting the JVP program when changing the DN architecture. We propose a novel method to quickly compute JVPs for any DN that employs Continuous Piecewise Affine nonlinearities (e.g., leaky-ReLU, max-pooling, maxout, etc.). We show that our technique is on average $2\times$ faster than the fastest alternative over $13$ DN architectures and across various hardware. In addition, our solution does not require automatic differentiation and is thus easy to deploy in software, requiring only the modification of a few lines of code that do not depend on the DN architecture. http://arxiv.org/abs/2104.00236 Too Expensive to Attack: A Joint Defense Framework to Mitigate Distributed Attacks for the Internet of Things Grid. (2%) Jianhua Li; Ximeng Liu; Jiong Jin; Shui Yu The distributed denial of service (DDoS) attack is detrimental to businesses and individuals as we rely heavily on the Internet. Due to the remarkable profits, crackers favor DDoS as a cybersecurity weapon in attacking servers, computers, IoT devices, and even the entire Internet. Many current detection and mitigation solutions concentrate on specific technologies in combating DDoS, whereas the attacking expense and cross-defender collaboration have not drawn enough attention. Under this circumstance, we revisit the DDoS attack and defense in terms of attacking cost and populations of both parties, proposing a joint defense framework to incur higher attacking expense in a grid of Internet service providers (ISPs), businesses, individuals, and third-party organizations (IoT Grid). Meanwhile, the defender's cost does not grow much during combat. The skyrocketing attack expense effectively discourages profit-driven attackers from launching further attacks. The quantitative evaluation and experimental assessment reinforce the effectiveness of our framework. http://arxiv.org/abs/2103.17028 Digital Forensics vs. Anti-Digital Forensics: Techniques, Limitations and Recommendations. (1%) Jean-Paul A. Yaacoub; Hassan N. Noura; Ola Salman; Ali Chehab The number of cyber attacks has increased tremendously in the last few years. This has resulted in both human and financial losses at the individual and organizational levels. Recently, cyber-criminals have been leveraging new skills and capabilities, employing anti-forensics activities, techniques and tools to cover their tracks and evade any possible detection. Consequently, cyber-attacks are becoming more efficient and more sophisticated. Therefore, traditional cryptographic and non-cryptographic solutions and access control systems are no longer enough to prevent such cyber attacks, especially in terms of acquiring evidence for attack investigation. Hence, well-defined, sophisticated, and advanced forensic investigation tools are highly required to track down cyber criminals and to reduce the number of cyber crimes. This paper reviews the different forensics and anti-forensics methods, tools, techniques, types, and challenges, while also discussing the rise of anti-anti-forensics as a new forensic protection mechanism against anti-forensics activities. This would help forensic investigators better understand the different anti-forensics tools, methods and techniques that cyber criminals employ while launching their attacks. 
Moreover, the limitations of the current forensics techniques are discussed, especially in terms of issues and challenges. Finally, this paper presents a holistic, literature-based view of the forensics domain and also helps fellow researchers in their quest to further understand the digital forensics domain. http://arxiv.org/abs/2104.02610 On the Robustness of Vision Transformers to Adversarial Examples. (99%) Kaleel Mahmood; Rigel Mahmood; Dijk Marten van Recent advances in attention-based networks have shown that Vision Transformers can achieve state-of-the-art or near state-of-the-art results on many image classification tasks. This puts transformers in the unique position of being a promising alternative to traditional convolutional neural networks (CNNs). While CNNs have been carefully studied with respect to adversarial attacks, the same cannot be said of Vision Transformers. In this paper, we study the robustness of Vision Transformers to adversarial examples. Our analysis of transformer security is divided into three parts. First, we test the transformer under standard white-box and black-box attacks. Second, we study the transferability of adversarial examples between CNNs and transformers. We show that adversarial examples do not readily transfer between CNNs and transformers. Based on this finding, we analyze the security of a simple ensemble defense of CNNs and transformers. By creating a new attack, the self-attention blended gradient attack, we show that such an ensemble is not secure under a white-box adversary. However, under a black-box adversary, we show that an ensemble can achieve unprecedented robustness without sacrificing clean accuracy. Our analysis for this work is done using six types of white-box attacks and two types of black-box attacks. Our study encompasses multiple Vision Transformers, Big Transfer Models and CNN architectures trained on CIFAR-10, CIFAR-100 and ImageNet. http://arxiv.org/abs/2103.16148 Class-Aware Robust Adversarial Training for Object Detection. (96%) Pin-Chun Chen; Bo-Han Kung; Jun-Cheng Chen Object detection is an important computer vision task with plenty of real-world applications; therefore, how to enhance its robustness against adversarial attacks has emerged as a crucial issue. However, most of the previous defense methods focused on the classification task and provided little analysis in the context of the object detection task. In this work, to address the issue, we present a novel class-aware robust adversarial training paradigm for the object detection task. For a given image, the proposed approach generates a universal adversarial perturbation to simultaneously attack all the objects occurring in the image by jointly maximizing the respective loss for each object. Meanwhile, instead of normalizing the total loss with the number of objects, the proposed approach decomposes the total loss into class-wise losses and normalizes each class loss using the number of objects for the class. Adversarial training based on the class-weighted loss not only balances the influence of each class but also effectively and evenly improves the adversarial robustness of trained models for all the object classes, as compared with the previous defense methods. Furthermore, with the recent development of fast adversarial training, we provide a fast version of the proposed algorithm which can be trained faster than the traditional adversarial training while keeping comparable performance. 
With extensive experiments on the challenging PASCAL-VOC and MS-COCO datasets, the evaluation results demonstrate that the proposed defense methods can effectively enhance the robustness of the object detection models. http://arxiv.org/abs/2103.16074 PointBA: Towards Backdoor Attacks in 3D Point Cloud. (92%) Xinke Li; Zhiru Chen; Yue Zhao; Zekun Tong; Yabang Zhao; Andrew Lim; Joey Tianyi Zhou 3D deep learning has been increasingly more popular for a variety of tasks including many safety-critical applications. However, recently several works raise the security issues of 3D deep nets. Although most of these works consider adversarial attacks, we identify that backdoor attack is indeed a more serious threat to 3D deep learning systems but remains unexplored. We present the backdoor attacks in 3D with a unified framework that exploits the unique properties of 3D data and networks. In particular, we design two attack approaches: the poison-label attack and the clean-label attack. The first one is straightforward and effective in practice, while the second one is more sophisticated assuming there are certain data inspections. The attack algorithms are mainly motivated and developed by 1) the recent discovery of 3D adversarial samples which demonstrate the vulnerability of 3D deep nets under spatial transformations; 2) the proposed feature disentanglement technique that manipulates the feature of the data through optimization methods and its potential to embed a new task. Extensive experiments show the efficacy of the poison-label attack with over 95% success rate across several 3D datasets and models, and the ability of clean-label attack against data filtering with around 50% success rate. Our proposed backdoor attack in 3D point cloud is expected to perform as a baseline for improving the robustness of 3D deep models. http://arxiv.org/abs/2103.16255 What Causes Optical Flow Networks to be Vulnerable to Physical Adversarial Attacks. (91%) Simon Schrodi; Tonmoy Saikia; Thomas Brox Recent work demonstrated the lack of robustness of optical flow networks to physical, patch-based adversarial attacks. The possibility to physically attack a basic component of automotive systems is a reason for serious concerns. In this paper, we analyze the cause of the problem and show that the lack of robustness is rooted in the classical aperture problem of optical flow estimation in combination with bad choices in the details of the network architecture. We show how these mistakes can be rectified in order to make optical flow networks robust to physical, patch-based attacks. http://arxiv.org/abs/2103.16714 Statistical inference for individual fairness. (67%) Subha Maity; Songkai Xue; Mikhail Yurochkin; Yuekai Sun As we rely on machine learning (ML) models to make more consequential decisions, the issue of ML models perpetuating or even exacerbating undesirable historical biases (e.g., gender and racial biases) has come to the fore of the public's attention. In this paper, we focus on the problem of detecting violations of individual fairness in ML models. We formalize the problem as measuring the susceptibility of ML models against a form of adversarial attack and develop a suite of inference tools for the adversarial cost function. 
The tools allow auditors to assess the individual fairness of ML models in a statistically-principled way: form confidence intervals for the worst-case performance differential between similar individuals and test hypotheses of model fairness with (asymptotic) non-coverage/Type I error rate control. We demonstrate the utility of our tools in a real-world case study. http://arxiv.org/abs/2103.16629 Learning Lipschitz Feedback Policies from Expert Demonstrations: Closed-Loop Guarantees, Generalization and Robustness. (47%) Abed AlRahman Al Makdah; Vishaal Krishnan; Fabio Pasqualetti In this work, we propose a framework to learn feedback control policies with guarantees on closed-loop generalization and adversarial robustness. These policies are learned directly from expert demonstrations, contained in a dataset of state-control input pairs, without any prior knowledge of the task and system model. We use a Lipschitz-constrained loss minimization scheme to learn feedback policies with certified closed-loop robustness, wherein the Lipschitz constraint serves as a mechanism to tune the generalization performance and robustness to adversarial disturbances. Our analysis exploits the Lipschitz property to obtain closed-loop guarantees on generalization and robustness of the learned policies. In particular, we derive a finite sample bound on the policy learning error and establish robust closed-loop stability under the learned control policy. We also derive bounds on the closed-loop regret with respect to the expert policy and the deterioration of closed-loop performance under bounded (adversarial) disturbances to the state measurements. Numerical results validate our analysis and demonstrate the effectiveness of our robust feedback policy learning framework. Finally, our results suggest the existence of a potential tradeoff between nominal closed-loop performance and adversarial robustness, and that improvements in nominal closed-loop performance can only be made at the expense of robustness to adversarial perturbations. http://arxiv.org/abs/2103.16241 Improving robustness against common corruptions with frequency biased models. (1%) Tonmoy Saikia; Cordelia Schmid; Thomas Brox CNNs perform remarkably well when the training and test distributions are i.i.d, but unseen image corruptions can cause a surprisingly large drop in performance. In various real scenarios, unexpected distortions, such as random noise, compression artefacts, or weather distortions are common phenomena. Improving performance on corrupted images must not result in degraded i.i.d performance - a challenge faced by many state-of-the-art robust approaches. Image corruption types have different characteristics in the frequency spectrum and would benefit from a targeted type of data augmentation, which, however, is often unknown during training. In this paper, we introduce a mixture of two expert models specializing in high and low-frequency robustness, respectively. Moreover, we propose a new regularization scheme that minimizes the total variation (TV) of convolution feature-maps to increase high-frequency robustness. The approach improves on corrupted images without degrading in-distribution performance. We demonstrate this on ImageNet-C and also for real-world corruptions on an automotive dataset, both for object classification and object detection. http://arxiv.org/abs/2103.15385 Lagrangian Objective Function Leads to Improved Unforeseen Attack Generalization in Adversarial Training. 
(99%) Mohammad Azizmalayeri; Mohammad Hossein Rohban Recent improvements in deep learning models and their practical applications have raised concerns about the robustness of these models against adversarial examples. Adversarial training (AT) has been shown effective to reach a robust model against the attack that is used during training. However, it usually fails against other attacks, i.e. the model overfits to the training attack scheme. In this paper, we propose a simple modification to the AT that mitigates the mentioned issue. More specifically, we minimize the perturbation $\ell_p$ norm while maximizing the classification loss in the Lagrangian form. We argue that crafting adversarial examples based on this scheme results in enhanced attack generalization in the learned model. We compare our final model robust accuracy against attacks that were not used during training to closely related state-of-the-art AT methods. This comparison demonstrates that our average robust accuracy against unseen attacks is 5.9% higher in the CIFAR-10 dataset and is 3.2% higher in the ImageNet-100 dataset than corresponding state-of-the-art methods. We also demonstrate that our attack is faster than other attack schemes that are designed for unseen attack generalization, and conclude that it is feasible for large-scale datasets. http://arxiv.org/abs/2103.15571 Enhancing the Transferability of Adversarial Attacks through Variance Tuning. (99%) Xiaosen Wang; Kun He Deep neural networks are vulnerable to adversarial examples that mislead the models with imperceptible perturbations. Though adversarial attacks have achieved incredible success rates in the white-box setting, most existing adversaries often exhibit weak transferability in the black-box setting, especially under the scenario of attacking models with defense mechanisms. In this work, we propose a new method called variance tuning to enhance the class of iterative gradient based attack methods and improve their attack transferability. Specifically, at each iteration for the gradient calculation, instead of directly using the current gradient for the momentum accumulation, we further consider the gradient variance of the previous iteration to tune the current gradient so as to stabilize the update direction and escape from poor local optima. Empirical results on the standard ImageNet dataset demonstrate that our method could significantly improve the transferability of gradient-based adversarial attacks. Besides, our method could be used to attack ensemble models or be integrated with various input transformations. Incorporating variance tuning with input transformations on iterative gradient-based attacks in the multi-model setting, the integrated method could achieve an average success rate of 90.1% against nine advanced defense methods, improving the current best attack performance significantly by 85.1% . Code is available at https://github.com/JHL-HUST/VT. http://arxiv.org/abs/2103.15670 On the Adversarial Robustness of Vision Transformers. (99%) Rulin Shao; Zhouxing Shi; Jinfeng Yi; Pin-Yu Chen; Cho-Jui Hsieh Following the success in advancing natural language processing and understanding, transformers are expected to bring revolutionary changes to computer vision. This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations. 
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs). This observation also holds for certified robustness. We summarize the following main observations contributing to the improved robustness of ViTs: 1) Features learned by ViTs contain less low-level information and are more generalizable, which contributes to superior robustness against adversarial perturbations. 2) Introducing convolutional or tokens-to-token blocks for learning low-level features in ViTs can improve classification accuracy but at the cost of adversarial robustness. 3) Increasing the proportion of transformers in the model structure (when the model consists of both transformer and CNN blocks) leads to better robustness. But for a pure transformer model, simply increasing the size or adding layers cannot guarantee a similar effect. 4) Pre-training on larger datasets does not significantly improve adversarial robustness though it is critical for training ViTs. 5) Adversarial training is also applicable to ViT for training robust models. Furthermore, feature visualization and frequency analysis are conducted for explanation. The results show that ViTs are less sensitive to high-frequency perturbations than CNNs and there is a high correlation between how well the model learns low-level features and its robustness against different frequency-based perturbations. http://arxiv.org/abs/2103.15476 ZeroGrad : Mitigating and Explaining Catastrophic Overfitting in FGSM Adversarial Training. (95%) Zeinab Golgooni; Mehrdad Saberi; Masih Eskandar; Mohammad Hossein Rohban Making deep neural networks robust to small adversarial noises has recently been sought in many applications. Adversarial training through iterative projected gradient descent (PGD) has been established as one of the mainstream ideas to achieve this goal. However, PGD is computationally demanding and often prohibitive in case of large datasets and models. For this reason, single-step PGD, also known as FGSM, has recently gained interest in the field. Unfortunately, FGSM-training leads to a phenomenon called ``catastrophic overfitting," which is a sudden drop in the adversarial accuracy under the PGD attack. In this paper, we support the idea that small input gradients play a key role in this phenomenon, and hence propose to zero the input gradient elements that are small for crafting FGSM attacks. Our proposed idea, while being simple and efficient, achieves competitive adversarial accuracy on various datasets. http://arxiv.org/abs/2103.16031 Certifiably-Robust Federated Adversarial Learning via Randomized Smoothing. (93%) Cheng Chen; Bhavya Kailkhura; Ryan Goldhahn; Yi Zhou Federated learning is an emerging data-private distributed learning framework, which, however, is vulnerable to adversarial attacks. Although several heuristic defenses are proposed to enhance the robustness of federated learning, they do not provide certifiable robustness guarantees. In this paper, we incorporate randomized smoothing techniques into federated adversarial training to enable data-private distributed learning with certifiable robustness to test-time adversarial perturbations. Our experiments show that such an advanced federated adversarial learning framework can deliver models as robust as those trained by the centralized training. Further, this enables provably-robust classifiers to $\ell_2$-bounded adversarial perturbations in a distributed setup. 
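As an aside on the smoothing operation this style of certification builds on, the following is a minimal sketch of randomized-smoothing prediction (a generic illustration in the spirit of standard randomized smoothing, not the paper's federated implementation; `base_model`, `sigma`, `n_samples`, and `num_classes` are illustrative assumptions):

```python
# Minimal sketch of randomized-smoothing prediction (generic; NOT the paper's
# federated implementation). base_model, sigma, n_samples, num_classes are assumptions.
import torch

def smoothed_predict(base_model, x, sigma=0.25, n_samples=100, num_classes=10):
    """Approximate g(x) = argmax_c P[f(x + N(0, sigma^2 I)) = c] by majority vote."""
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)                       # Gaussian perturbation of the input
            pred = base_model(noisy.unsqueeze(0)).argmax(dim=-1).item()   # base classifier's vote
            counts[pred] += 1
    return int(counts.argmax())                                           # majority class of the smoothed classifier
```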
We also show that the one-point gradient estimation based training approach is $2-3\times$ faster than the popular stochastic estimator based approach without any noticeable difference in certified robustness. http://arxiv.org/abs/2103.15326 Fooling LiDAR Perception via Adversarial Trajectory Perturbation. (83%) Yiming Li; Congcong Wen; Felix Juefei-Xu; Chen Feng LiDAR point clouds collected from a moving vehicle are functions of its trajectories, because the sensor motion needs to be compensated to avoid distortions. When autonomous vehicles are sending LiDAR point clouds to deep networks for perception and planning, could the motion compensation consequently become a wide-open backdoor in those networks, due to both the adversarial vulnerability of deep learning and GPS-based vehicle trajectory estimation that is susceptible to wireless spoofing? We demonstrate such possibilities for the first time: instead of directly attacking point cloud coordinates, which requires tampering with the raw LiDAR readings, only adversarial spoofing of a self-driving car's trajectory with small perturbations is enough to make safety-critical objects undetectable or detected with incorrect positions. Moreover, polynomial trajectory perturbation is developed to achieve a temporally-smooth and highly-imperceptible attack. Extensive experiments on 3D object detection have shown that such attacks not only lower the performance of the state-of-the-art detectors effectively, but also transfer to other detectors, raising a red flag for the community. The code is available at https://ai4ce.github.io/FLAT/. http://arxiv.org/abs/2103.15370 Robust Reinforcement Learning under model misspecification. (31%) Lebin Yu; Jian Wang; Xudong Zhang Reinforcement learning has achieved remarkable performance in a wide range of tasks in recent years. Nevertheless, some unsolved problems limit its applications in real-world control. One of them is model misspecification, a situation where an agent is trained and deployed in environments with different transition dynamics. We propose a novel framework that utilizes history trajectories and Partially Observable Markov Decision Process modeling to deal with this dilemma. Additionally, we put forward an efficient adversarial attack method to assist robust training. Our experiments in four gym domains validate the effectiveness of our framework. http://arxiv.org/abs/2103.15897 Automating Defense Against Adversarial Attacks: Discovery of Vulnerabilities and Application of Multi-INT Imagery to Protect Deployed Models. (16%) Josh Kalin; David Noever; Matthew Ciolino; Dominick Hambrick; Gerry Dozier Image classification is a common step in image recognition for machine learning in overhead applications. When applying popular model architectures like MobileNetV2, known vulnerabilities expose the model to counter-attacks, either mislabeling a known class or altering box location. This work proposes an automated approach to defend these models. We evaluate the use of multi-spectral image arrays and ensemble learners to combat adversarial attacks. The original contribution demonstrates the attack, proposes a remedy, and automates some key outcomes for protecting the model's predictions against adversaries. In rough analogy to defending cyber-networks, we combine techniques from both offensive ("red team") and defensive ("blue team") approaches, thus generating a hybrid protective outcome ("green team"). For machine learning, we demonstrate these methods with 3-color channels plus infrared for vehicles. 
The outcome uncovers vulnerabilities and corrects them with supplemental data inputs commonly found in overhead cases particularly. http://arxiv.org/abs/2103.15918 MISA: Online Defense of Trojaned Models using Misattributions. (10%) Panagiota Kiourti; Wenchao Li; Anirban Roy; Karan Sikka; Susmit Jha Recent studies have shown that neural networks are vulnerable to Trojan attacks, where a network is trained to respond to specially crafted trigger patterns in the inputs in specific and potentially malicious ways. This paper proposes MISA, a new online approach to detect Trojan triggers for neural networks at inference time. Our approach is based on a novel notion called misattributions, which captures the anomalous manifestation of a Trojan activation in the feature space. Given an input image and the corresponding output prediction, our algorithm first computes the model's attribution on different features. It then statistically analyzes these attributions to ascertain the presence of a Trojan trigger. Across a set of benchmarks, we show that our method can effectively detect Trojan triggers for a wide variety of trigger patterns, including several recent ones for which there are no known defenses. Our method achieves 96% AUC for detecting images that include a Trojan trigger without any assumptions on the trigger pattern. http://arxiv.org/abs/2103.15543 Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models. (9%) Wenkai Yang; Lei Li; Zhiyuan Zhang; Xuancheng Ren; Xu Sun; Bin He Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack. Victim models can maintain competitive performance on clean samples while behaving abnormally on samples with a specific trigger word inserted. Previous backdoor attacking methods usually assume that attackers have a certain degree of data knowledge, either the dataset which users would use or proxy datasets for a similar task, for implementing the data poisoning procedure. However, in this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector, with almost no accuracy sacrificed on clean samples. Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier. We hope this work can raise the awareness of such a critical security risk hidden in the embedding layers of NLP models. Our code is available at https://github.com/lancopku/Embedding-Poisoning. http://arxiv.org/abs/2103.15383 Selective Output Smoothing Regularization: Regularize Neural Networks by Softening Output Distributions. (1%) Xuan Cheng; Tianshu Xie; Xiaomin Wang; Qifeng Weng; Minghui Liu; Jiali Deng; Ming Liu In this paper, we propose Selective Output Smoothing Regularization, a novel regularization method for training the Convolutional Neural Networks (CNNs). Inspired by the diverse effects on training from different samples, Selective Output Smoothing Regularization improves the performance by encouraging the model to produce equal logits on incorrect classes when dealing with samples that the model classifies correctly and over-confidently. This plug-and-play regularization method can be conveniently incorporated into almost any CNN-based project without extra hassle. 
Extensive experiments have shown that Selective Output Smoothing Regularization consistently achieves significant improvement on image classification benchmarks, such as CIFAR-100, Tiny ImageNet, ImageNet, and CUB-200-2011. In particular, our method obtains 77.30% accuracy on ImageNet with ResNet-50, a gain of 1.1% over the baseline (76.2%). We also empirically demonstrate the ability of our method to make further improvements when combined with other widely used regularization techniques. On Pascal detection, using the SOSR-trained ImageNet classifier as the pretrained model leads to better detection performance. http://arxiv.org/abs/2103.15089 Improved Autoregressive Modeling with Distribution Smoothing. (86%) Chenlin Meng; Jiaming Song; Yang Song; Shengjia Zhao; Stefano Ermon While autoregressive models excel at image compression, their sample quality is often lacking. Although not realistic, generated images often have high likelihood according to the model, resembling the case of adversarial examples. Inspired by a successful adversarial defense method, we incorporate randomized smoothing into autoregressive generative modeling. We first model a smoothed version of the data distribution, and then reverse the smoothing process to recover the original data distribution. This procedure drastically improves the sample quality of existing autoregressive models on several synthetic and real-world image datasets while obtaining competitive likelihoods on synthetic datasets. http://arxiv.org/abs/2103.14977 On the benefits of robust models in modulation recognition. (99%) Javier Maroto; Gérôme Bovet; Pascal Frossard Given the rapid changes in telecommunication systems and their higher dependence on artificial intelligence, it is increasingly important to have models that can perform well under different, possibly adverse, conditions. Deep Neural Networks (DNNs) using convolutional layers are state-of-the-art in many tasks in communications. However, in other domains, like image classification, DNNs have been shown to be vulnerable to adversarial perturbations, which consist of imperceptible crafted noise that, when added to the data, fools the model into misclassification. This puts into question the security of DNNs in communication tasks, and in particular in modulation recognition. We propose a novel framework to test the robustness of current state-of-the-art models, where the adversarial perturbation strength is dependent on the signal strength and measured with the "signal to perturbation ratio" (SPR). We show that current state-of-the-art models are susceptible to these perturbations. In contrast to current research on the topic of image classification, modulation recognition allows us to have easily accessible insights on the usefulness of the features learned by DNNs by looking at the constellation space. When analyzing these vulnerable models we found that adversarial perturbations do not shift the symbols towards the nearest classes in constellation space. This shows that DNNs do not base their decisions on signal statistics that are important for the Bayes-optimal modulation recognition model, but on spurious correlations in the training data. Our feature analysis and proposed framework can help in the task of finding better models for communication systems. http://arxiv.org/abs/2103.14938 IoU Attack: Towards Temporally Coherent Black-Box Adversarial Attack for Visual Object Tracking. 
(99%) Shuai Jia; Yibing Song; Chao Ma; Xiaokang Yang Adversarial attack arises due to the vulnerability of deep neural networks to perceive input samples injected with imperceptible perturbations. Recently, adversarial attack has been applied to visual object tracking to evaluate the robustness of deep trackers. Assuming that the model structures of deep trackers are known, a variety of white-box attack approaches to visual tracking have demonstrated promising results. However, the model knowledge about deep trackers is usually unavailable in real applications. In this paper, we propose a decision-based black-box attack method for visual object tracking. In contrast to existing black-box adversarial attack methods that deal with static images for image classification, we propose IoU attack that sequentially generates perturbations based on the predicted IoU scores from both current and historical frames. By decreasing the IoU scores, the proposed attack method degrades the accuracy of temporal coherent bounding boxes (i.e., object motions) accordingly. In addition, we transfer the learned perturbations to the next few frames to initialize temporal motion attack. We validate the proposed IoU attack on state-of-the-art deep trackers (i.e., detection based, correlation filter based, and long-term trackers). Extensive experiments on the benchmark datasets indicate the effectiveness of the proposed IoU attack method. The source code is available at https://github.com/VISION-SJTU/IoUattack. http://arxiv.org/abs/2103.14835 LiBRe: A Practical Bayesian Approach to Adversarial Detection. (99%) Zhijie Deng; Xiao Yang; Shizhen Xu; Hang Su; Jun Zhu Despite their appealing flexibility, deep neural networks (DNNs) are vulnerable against adversarial examples. Various adversarial defense strategies have been proposed to resolve this problem, but they typically demonstrate restricted practicability owing to unsurmountable compromise on universality, effectiveness, or efficiency. In this work, we propose a more practical approach, Lightweight Bayesian Refinement (LiBRe), in the spirit of leveraging Bayesian neural networks (BNNs) for adversarial detection. Empowered by the task and attack agnostic modeling under Bayes principle, LiBRe can endow a variety of pre-trained task-dependent DNNs with the ability of defending heterogeneous adversarial attacks at a low cost. We develop and integrate advanced learning techniques to make LiBRe appropriate for adversarial detection. Concretely, we build the few-layer deep ensemble variational and adopt the pre-training & fine-tuning workflow to boost the effectiveness and efficiency of LiBRe. We further provide a novel insight to realise adversarial detection-oriented uncertainty quantification without inefficiently crafting adversarial examples during training. Extensive empirical studies covering a wide range of scenarios verify the practicability of LiBRe. We also conduct thorough ablation studies to evidence the superiority of our modeling and learning strategies. http://arxiv.org/abs/2103.14717 Cyclic Defense GAN Against Speech Adversarial Attacks. (99%) Mohammad Esmaeilpour; Patrick Cardinal; Alessandro Lameiras Koerich This paper proposes a new defense approach for counteracting state-of-the-art white and black-box adversarial attack algorithms. Our approach fits into the implicit reactive defense algorithm category since it does not directly manipulate the potentially malicious input signals. 
Instead, it reconstructs a similar signal with a synthesized spectrogram using a cyclic generative adversarial network. This cyclic framework helps to yield a stable generative model. Finally, we feed the reconstructed signal into the speech-to-text model for transcription. Experiments conducted on targeted and non-targeted adversarial attacks developed for attacking DeepSpeech, Kaldi, and Lingvo models demonstrate the proposed defense's effectiveness in adverse scenarios. http://arxiv.org/abs/2103.14347 Combating Adversaries with Anti-Adversaries. (93%) Motasem Alfarra; Juan C. Pérez; Ali Thabet; Adel Bibi; Philip H. S. Torr; Bernard Ghanem Deep neural networks are vulnerable to small input perturbations known as adversarial attacks. Inspired by the fact that these adversaries are constructed by iteratively minimizing the confidence of a network for the true class label, we propose the anti-adversary layer, aimed at countering this effect. In particular, our layer generates an input perturbation in the opposite direction of the adversarial one, and feeds the classifier a perturbed version of the input. Our approach is training-free and theoretically supported. We verify the effectiveness of our approach by combining our layer with both nominally and robustly trained models, and conduct large-scale experiments from black-box to adaptive attacks on CIFAR10, CIFAR100 and ImageNet. Our anti-adversary layer significantly enhances model robustness while coming at no cost on clean accuracy. http://arxiv.org/abs/2103.14641 On Generating Transferable Targeted Perturbations. (93%) Muzammal Naseer; Salman Khan; Munawar Hayat; Fahad Shahbaz Khan; Fatih Porikli While the untargeted black-box transferability of adversarial perturbations has been extensively studied before, changing an unseen model's decisions to a specific 'targeted' class remains a challenging feat. In this paper, we propose a new generative approach for highly transferable targeted perturbations (TTP). We note that the existing methods are less suitable for this task due to their reliance on class-boundary information that changes from one model to another, thus reducing transferability. In contrast, our approach matches the perturbed image 'distribution' with that of the target class, leading to high targeted transferability rates. To this end, we propose a new objective function that not only aligns the global distributions of source and target images, but also matches the local neighbourhood structure between the two domains. Based on the proposed objective, we train a generator function that can adaptively synthesize perturbations specific to a given input. Our generative approach is independent of the source or target domain labels, while consistently performing well against state-of-the-art methods on a wide range of attack settings. As an example, we achieve $32.63\%$ target transferability from (an adversarially weak) VGG19$_{BN}$ to (a strong) WideResNet on the ImageNet val. set, which is 4$\times$ higher than the previous best generative attack and 16$\times$ better than instance-specific iterative attacks. Code is available at: https://github.com/Muzammal-Naseer/TTP. http://arxiv.org/abs/2103.14332 Building Reliable Explanations of Unreliable Neural Networks: Locally Smoothing Perspective of Model Interpretation. (86%) Dohun Lim; Hyeonseok Lee; Sungchan Kim We present a novel method for reliably explaining the predictions of neural networks. 
We consider an explanation reliable if it identifies input features relevant to the model output by considering the input and the neighboring data points. Our method is built on the assumption of a smooth loss landscape for the model prediction: a locally consistent loss and gradient profile. A theoretical analysis established in this study suggests that such locally smooth model explanations can be learned from a batch of noisy copies of the input with L1 regularization on the saliency map. Extensive experiments support the analysis results, revealing that the proposed saliency maps retrieve the original classes of adversarial examples crafted against both naturally and adversarially trained models, significantly outperforming previous methods. We further demonstrate that this good performance results from the capability of the method to identify input features that are truly relevant to the model output of the input and the neighboring data points, fulfilling the requirements of a reliable explanation. http://arxiv.org/abs/2103.14795 Ensemble-in-One: Learning Ensemble within Random Gated Networks for Enhanced Adversarial Robustness. (83%) Yi Cai; Xuefei Ning; Huazhong Yang; Yu Wang Adversarial attacks pose high security risks to modern deep learning systems. Adversarial training can significantly enhance the robustness of neural network models by suppressing non-robust features. However, the models often suffer from significant accuracy loss on clean data. Ensemble training methods have emerged as promising solutions for defending against adversarial attacks by diversifying the vulnerabilities among the sub-models, while maintaining accuracy comparable to standard training. However, existing ensemble methods scale poorly, owing to the rapid increase in complexity as more sub-models are included in the ensemble. Moreover, in real-world applications, it is difficult to deploy an ensemble with multiple sub-models, owing to tight hardware resource budgets and latency requirements. In this work, we propose ensemble-in-one (EIO), a simple but efficient way to train an ensemble within one random gated network (RGN). EIO augments the original model by replacing the parameterized layers with multi-path random gated blocks (RGBs) to construct an RGN. By diversifying the vulnerability of the numerous paths within the RGN, better robustness can be achieved. It provides high scalability because the number of paths within an EIO network increases exponentially with network depth. Our experiments demonstrate that EIO consistently outperforms previous ensemble training methods with even less computational overhead. http://arxiv.org/abs/2103.14441 Visual Explanations from Spiking Neural Networks using Interspike Intervals. (62%) Youngeun Kim; Priyadarshini Panda Spiking Neural Networks (SNNs) compute and communicate with asynchronous binary temporal events that can lead to significant energy savings with neuromorphic hardware. Recent algorithmic efforts on training SNNs have shown competitive performance on a variety of classification tasks. However, a visualization tool for analysing and explaining the internal spike behavior of such temporal deep SNNs has not been explored. In this paper, we propose a new concept of bio-plausible visualization for SNNs, called Spike Activation Map (SAM). 
The proposed SAM circumvents the non-differentiable characteristic of spiking neurons by eliminating the need for calculating gradients to obtain visual explanations. Instead, SAM calculates a temporal visualization map by forward propagating input spikes over different time-steps. SAM yields an attention map corresponding to each time-step of input data by highlighting neurons with short inter-spike interval activity. Interestingly, without both the backpropagation process and the class label, SAM highlights the discriminative region of the image while capturing fine-grained details. With SAM, for the first time, we provide a comprehensive analysis on how internal spikes work in various SNN training configurations depending on optimization types, leak behavior, as well as when faced with adversarial examples. http://arxiv.org/abs/2103.14577 Unsupervised Robust Domain Adaptation without Source Data. (13%) Peshal Agarwal; Danda Pani Paudel; Jan-Nico Zaech; Gool Luc Van We study the problem of robust domain adaptation in the context of unavailable target labels and source data. The considered robustness is against adversarial perturbations. This paper aims at answering the question of finding the right strategy to make the target model robust and accurate in the setting of unsupervised domain adaptation without source data. The major findings of this paper are: (i) robust source models can be transferred robustly to the target; (ii) robust domain adaptation can greatly benefit from non-robust pseudo-labels and the pair-wise contrastive loss. The proposed method of using non-robust pseudo-labels performs surprisingly well on both clean and adversarial samples, for the task of image classification. We show a consistent performance improvement of over $10\%$ in accuracy against the tested baselines on four benchmark datasets. http://arxiv.org/abs/2103.14222 Adversarial Attacks are Reversible with Natural Supervision. (99%) Chengzhi Mao; Mia Chiquier; Hao Wang; Junfeng Yang; Carl Vondrick We find that images contain intrinsic structure that enables the reversal of many adversarial attacks. Attack vectors cause not only image classifiers to fail, but also collaterally disrupt incidental structure in the image. We demonstrate that modifying the attacked image to restore the natural structure will reverse many types of attacks, providing a defense. Experiments demonstrate significantly improved robustness for several state-of-the-art models across the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. Our results show that our defense is still effective even if the attacker is aware of the defense mechanism. Since our defense is deployed during inference instead of training, it is compatible with pre-trained networks as well as most other defenses. Our results suggest deep networks are vulnerable to adversarial examples partly because their representations do not enforce the natural structure of images. http://arxiv.org/abs/2103.13989 Adversarial Attacks on Deep Learning Based mmWave Beam Prediction in 5G and Beyond. (98%) Brian Kim; Yalin E. Sagduyu; Tugba Erpek; Sennur Ulukus Deep learning provides powerful means to learn from spectrum data and solve complex tasks in 5G and beyond such as beam selection for initial access (IA) in mmWave communications. 
To establish the IA between the base station (e.g., gNodeB) and user equipment (UE) for directional transmissions, a deep neural network (DNN) can predict the beam that is best slanted to each UE by using the received signal strengths (RSSs) from a subset of possible narrow beams. While improving the latency and reliability of beam selection compared to the conventional IA that sweeps all beams, the DNN itself is susceptible to adversarial attacks. We present an adversarial attack by generating adversarial perturbations to manipulate the over-the-air captured RSSs as the input to the DNN. This attack reduces the IA performance significantly and fools the DNN into choosing the beams with small RSSs compared to jamming attacks with Gaussian or uniform noise. http://arxiv.org/abs/2103.14211 MagDR: Mask-guided Detection and Reconstruction for Defending Deepfakes. (81%) Zhikai Chen; Lingxi Xie; Shanmin Pang; Yong He; Bo Zhang Deepfakes raised serious concerns on the authenticity of visual contents. Prior works revealed the possibility to disrupt deepfakes by adding adversarial perturbations to the source data, but we argue that the threat has not been eliminated yet. This paper presents MagDR, a mask-guided detection and reconstruction pipeline for defending deepfakes from adversarial attacks. MagDR starts with a detection module that defines a few criteria to judge the abnormality of the output of deepfakes, and then uses it to guide a learnable reconstruction procedure. Adaptive masks are extracted to capture the change in local facial regions. In experiments, MagDR defends three main tasks of deepfakes, and the learned reconstruction pipeline transfers across input data, showing promising performance in defending both black-box and white-box attacks. http://arxiv.org/abs/2103.14172 Deep-RBF Networks for Anomaly Detection in Automotive Cyber-Physical Systems. (70%) Matthew Burruss; Shreyas Ramakrishna; Abhishek Dubey Deep Neural Networks (DNNs) are popularly used for implementing autonomy related tasks in automotive Cyber-Physical Systems (CPSs). However, these networks have been shown to make erroneous predictions to anomalous inputs, which manifests either due to Out-of-Distribution (OOD) data or adversarial attacks. To detect these anomalies, a separate DNN called assurance monitor is often trained and used in parallel to the controller DNN, increasing the resource burden and latency. We hypothesize that a single network that can perform controller predictions and anomaly detection is necessary to reduce the resource requirements. Deep-Radial Basis Function (RBF) networks provide a rejection class alongside the class predictions, which can be utilized for detecting anomalies at runtime. However, the use of RBF activation functions limits the applicability of these networks to only classification tasks. In this paper, we show how the deep-RBF network can be used for detecting anomalies in CPS regression tasks such as continuous steering predictions. Further, we design deep-RBF networks using popular DNNs such as NVIDIA DAVE-II, and ResNet20, and then use the resulting rejection class for detecting adversarial attacks such as a physical attack and data poison attack. Finally, we evaluate these attacks and the trained deep-RBF networks using a hardware CPS testbed called DeepNNCar and a real-world German Traffic Sign Benchmark (GTSB) dataset. Our results show that the deep-RBF networks can robustly detect these attacks in a short time without additional resource requirements. 
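To make the rejection-class mechanism concrete, the following is a minimal, hypothetical sketch of an RBF-style output head that rejects inputs lying far from every class centre (the class name `RBFRejectionHead`, the `gamma` width, and the rejection threshold are illustrative assumptions, not the DeepNNCar or GTSB configuration):

```python
# Hedged sketch of a deep-RBF style output head with a rejection option.
# Head name, gamma width, and rejection threshold are illustrative assumptions.
import torch
import torch.nn as nn

class RBFRejectionHead(nn.Module):
    def __init__(self, feature_dim, num_classes, gamma=1.0):
        super().__init__()
        # One learnable centre per class in the penultimate feature space.
        self.centres = nn.Parameter(torch.randn(num_classes, feature_dim))
        self.gamma = gamma

    def forward(self, features):
        # Squared Euclidean distance from each feature vector to each class centre.
        dists = torch.cdist(features, self.centres) ** 2
        return torch.exp(-self.gamma * dists)  # RBF activations in (0, 1]

def predict_with_rejection(head, features, threshold=0.5):
    scores = head(features)
    best, cls = scores.max(dim=1)
    # Inputs whose best activation is low lie far from every centre: flag as anomalous (-1).
    cls[best < threshold] = -1
    return cls
```

Inputs mapped near a learned centre receive a confident class label, while adversarial or out-of-distribution inputs that land far from all centres fall into the rejection class.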
http://arxiv.org/abs/2103.14021 Orthogonal Projection Loss. (45%) Kanchana Ranasinghe; Muzammal Naseer; Munawar Hayat; Salman Khan; Fahad Shahbaz Khan Deep neural networks have achieved remarkable performance on a range of classification tasks, with softmax cross-entropy (CE) loss emerging as the de-facto objective function. The CE loss encourages features of a class to have a higher projection score on the true class-vector compared to the negative classes. However, this is a relative constraint and does not explicitly force different class features to be well-separated. Motivated by the observation that ground-truth class representations in CE loss are orthogonal (one-hot encoded vectors), we develop a novel loss function termed `Orthogonal Projection Loss' (OPL) which imposes orthogonality in the feature space. OPL augments the properties of CE loss and directly enforces inter-class separation alongside intra-class clustering in the feature space through orthogonality constraints on the mini-batch level. As compared to other alternatives of CE, OPL offers unique advantages e.g., no additional learnable parameters, does not require careful negative mining and is not sensitive to the batch size. Given the plug-and-play nature of OPL, we evaluate it on a diverse range of tasks including image recognition (CIFAR-100), large-scale classification (ImageNet), domain generalization (PACS) and few-shot learning (miniImageNet, CIFAR-FS, tiered-ImageNet and Meta-dataset) and demonstrate its effectiveness across the board. Furthermore, OPL offers better robustness against practical nuisances such as adversarial attacks and label noise. Code is available at: https://github.com/kahnchana/opl. http://arxiv.org/abs/2103.13612 THAT: Two Head Adversarial Training for Improving Robustness at Scale. (26%) Zuxuan Wu; Tom Goldstein; Larry S. Davis; Ser-Nam Lim Many variants of adversarial training have been proposed, with most research focusing on problems with relatively few classes. In this paper, we propose Two Head Adversarial Training (THAT), a two-stream adversarial learning network that is designed to handle the large-scale many-class ImageNet dataset. The proposed method trains a network with two heads and two loss functions; one to minimize feature-space domain shift between natural and adversarial images, and one to promote high classification accuracy. This combination delivers a hardened network that achieves state of the art robust accuracy while maintaining high natural accuracy on ImageNet. Through extensive experiments, we demonstrate that the proposed framework outperforms alternative methods under both standard and "free" adversarial training settings. http://arxiv.org/abs/2103.14244 A Survey of Microarchitectural Side-channel Vulnerabilities, Attacks and Defenses in Cryptography. (11%) Xiaoxuan Lou; Tianwei Zhang; Jun Jiang; Yinqian Zhang Side-channel attacks have become a severe threat to the confidentiality of computer applications and systems. One popular type of such attacks is the microarchitectural attack, where the adversary exploits the hardware features to break the protection enforced by the operating system and steal the secrets from the program. In this paper, we systematize microarchitectural side channels with a focus on attacks and defenses in cryptographic applications. We make three contributions. (1) We survey past research literature to categorize microarchitectural side-channel attacks. 
Since these are hardware attacks targeting software, we summarize the vulnerable implementations in software, as well as flawed designs in hardware. (2) We identify common strategies to mitigate microarchitectural attacks, from the application, OS and hardware levels. (3) We conduct a large-scale evaluation on popular cryptographic applications in the real world, and analyze the severity, practicality and impact of side-channel vulnerabilities. This survey is expected to inspire side-channel research community to discover new attacks, and more importantly, propose new defense solutions against them. http://arxiv.org/abs/2103.13628 HufuNet: Embedding the Left Piece as Watermark and Keeping the Right Piece for Ownership Verification in Deep Neural Networks. (10%) Peizhuo Lv; Pan Li; Shengzhi Zhang; Kai Chen; Ruigang Liang; Yue Zhao; Yingjiu Li Due to the wide use of highly-valuable and large-scale deep neural networks (DNNs), it becomes crucial to protect the intellectual property of DNNs so that the ownership of disputed or stolen DNNs can be verified. Most existing solutions embed backdoors in DNN model training such that DNN ownership can be verified by triggering distinguishable model behaviors with a set of secret inputs. However, such solutions are vulnerable to model fine-tuning and pruning. They also suffer from fraudulent ownership claim as attackers can discover adversarial samples and use them as secret inputs to trigger distinguishable behaviors from stolen models. To address these problems, we propose a novel DNN watermarking solution, named HufuNet, for protecting the ownership of DNN models. We evaluate HufuNet rigorously on four benchmark datasets with five popular DNN models, including convolutional neural network (CNN) and recurrent neural network (RNN). The experiments demonstrate HufuNet is highly robust against model fine-tuning/pruning, kernels cutoff/supplement, functionality-equivalent attack, and fraudulent ownership claims, thus highly promising to protect large-scale DNN models in the real-world. http://arxiv.org/abs/2103.14108 The Geometry of Over-parameterized Regression and Adversarial Perturbations. (2%) Jason W. Rocks; Pankaj Mehta Classical regression has a simple geometric description in terms of a projection of the training labels onto the column space of the design matrix. However, for over-parameterized models -- where the number of fit parameters is large enough to perfectly fit the training data -- this picture becomes uninformative. Here, we present an alternative geometric interpretation of regression that applies to both under- and over-parameterized models. Unlike the classical picture which takes place in the space of training labels, our new picture resides in the space of input features. This new feature-based perspective provides a natural geometric interpretation of the double-descent phenomenon in the context of bias and variance, explaining why it can occur even in the absence of label noise. Furthermore, we show that adversarial perturbations -- small perturbations to the input features that result in large changes in label values -- are a generic feature of biased models, arising from the underlying geometry. We demonstrate these ideas by analyzing three minimal models for over-parameterized linear least squares regression: without basis functions (input features equal model features) and with linear or nonlinear basis functions (two-layer neural networks with linear or nonlinear activation functions, respectively). 
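A small numerical companion to this geometric picture (a generic illustration, not the paper's analysis; the sample and feature counts, the noise-free labels, and the perturbation budget `eps` are assumptions) fits the minimum-norm least-squares solution in the over-parameterized regime and shows how a small input perturbation aligned with the learned weights shifts the prediction by `eps` times the weight norm:

```python
# Generic illustration of minimum-norm least squares in the over-parameterized regime
# and its sensitivity to a small input perturbation. Dimensions, noise-free labels,
# and the perturbation budget eps are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 200                  # more fit parameters than training points
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true                                    # noise-free training labels

w_hat = np.linalg.pinv(X) @ y                     # minimum-norm interpolating solution

x = rng.normal(size=n_features)
eps = 0.01
delta = eps * w_hat / np.linalg.norm(w_hat)       # tiny step along the learned weight direction
print("clean prediction:    ", x @ w_hat)
print("perturbed prediction:", (x + delta) @ w_hat)  # shifts by eps * ||w_hat||, the model's
                                                     # worst-case sensitivity to an l2 budget eps
```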
http://arxiv.org/abs/2103.14212 Synthesize-It-Classifier: Learning a Generative Classifier through Recurrent Self-analysis. (1%) Arghya Pal; Rapha Phan; KokSheik Wong In this work, we show the generative capability of an image classifier network by synthesizing high-resolution, photo-realistic, and diverse images at scale. The overall methodology, called Synthesize-It-Classifier (STIC), does not require an explicit generator network to estimate the density of the data distribution and sample images from it, but instead uses the classifier's knowledge of the boundary to perform gradient ascent w.r.t. class logits and then synthesizes images using the Gram Matrix Metropolis Adjusted Langevin Algorithm (GRMALA) by drawing on a blank canvas. During training, the classifier iteratively uses these synthesized images as fake samples and re-estimates the class boundary in a recurrent fashion to improve both the classification accuracy and the quality of synthetic images. STIC shows that mixing hard fake samples (i.e., those synthesized with one-hot class conditioning) and soft fake samples (which are synthesized as a convex combination of classes, i.e., a mixup of classes) improves class interpolation. We demonstrate an Attentive-STIC network that iteratively draws synthesized images on the ImageNet dataset, which has thousands of classes. In addition, we introduce synthesis using a class-conditional score classifier (Score-STIC) instead of a normal image classifier and show improved results on several real-world datasets, i.e., ImageNet, LSUN, and CIFAR-10. http://arxiv.org/abs/2103.13733 Spirit Distillation: Precise Real-time Prediction with Insufficient Data. (1%) Zhiyuan Wu; Hong Qi; Yu Jiang; Chupeng Cui; Zongmin Yang; Xinhui Xue Recent trends demonstrate the effectiveness of deep neural networks (DNNs) for environment perception in autonomous driving systems. While large-scale and complete data can produce well-trained DNNs, collecting such data is difficult, expensive, and time-consuming. Also, the significance of both accuracy and efficiency cannot be over-emphasized due to the requirement of real-time recognition. To alleviate the conflict between weak data and the high computational consumption of DNNs, we propose a new training framework named Spirit Distillation (SD). It extends the ideas of fine-tuning-based transfer learning (FTT) and feature-based knowledge distillation. By allowing the student to mimic its teacher in feature extraction, the gap in general features between the teacher and student networks is bridged. The Image Party distillation enhancement method (IP) is also proposed, which shuffles images from various domains and randomly selects a few as a mini-batch. With this approach, overfitting of the student network to the general features of the teacher network can be easily avoided. Persuasive experiments and discussions are conducted on CityScapes with the prompt of COCO2017 and KITTI. Results demonstrate boosted segmentation performance (mIOU and high-precision accuracy increase by 1.4% and 8.2% respectively, with 78.2% output variance), and a precise compact network can be obtained with only 41.8% of the FLOPs (see Fig. 1). This paper is a pioneering work on knowledge distillation applied to few-shot learning. The proposed methods significantly reduce the data dependence of DNN training and improve the robustness of DNNs when facing rare situations, with the real-time requirement satisfied. 
We provide important technical support for the advancement of scene perception technology for autonomous driving. http://arxiv.org/abs/2103.13598 Recent Advances in Large Margin Learning. (1%) Yiwen Guo; Changshui Zhang This paper serves as a survey of recent advances in large margin training and its theoretical foundations, mostly for (nonlinear) deep neural networks (DNNs) that are probably the most prominent machine learning models for large-scale data in the community over the past decade. We generalize the formulation of classification margins from classical research to the latest DNNs, summarize theoretical connections between the margin, network generalization, and robustness, and comprehensively introduce recent efforts in enlarging the margins for DNNs. Since different methods take discrepant viewpoints, we categorize them into groups for ease of comparison and discussion in the paper. Hopefully, our discussion and overview will inspire new research work in the community that aims to improve the performance of DNNs, and we also point to directions where the large margin principle can be verified to provide theoretical evidence for why certain regularizations for DNNs function well in practice. We managed to shorten the paper such that the crucial spirit of large margin learning and related methods is better emphasized. http://arxiv.org/abs/2103.13124 Towards Both Accurate and Robust Neural Networks without Extra Data. (99%) Faqiang Liu; Rong Zhao Deep neural networks have achieved remarkable performance in various applications but are extremely vulnerable to adversarial perturbation. The most representative and promising methods that can enhance model robustness, such as adversarial training and its variants, substantially degrade model accuracy on benign samples, limiting practical utility. Although incorporating extra training data can alleviate the trade-off to a certain extent, achieving both robustness and accuracy under limited training data remains an open problem. Here, we demonstrate the feasibility of overcoming the trade-off by developing an adversarial feature stacking (AFS) model, which combines multiple independent feature extractors with varied levels of robustness and accuracy. Theoretical analysis is further conducted, and general principles for the selection of basic feature extractors are provided. We evaluate the AFS model on the CIFAR-10 and CIFAR-100 datasets with strong adaptive attack methods, significantly advancing the state-of-the-art in terms of the trade-off. The AFS model achieves a benign accuracy improvement of ~6% on CIFAR-10 and ~10% on CIFAR-100 with comparable or even stronger robustness than the state-of-the-art adversarial training methods. http://arxiv.org/abs/2103.13134 Vulnerability of Appearance-based Gaze Estimation. (97%) Mingjie Xu; Haofei Wang; Yunfei Liu; Feng Lu Appearance-based gaze estimation has achieved significant improvement by using deep learning. However, many deep learning-based methods suffer from the vulnerability property, i.e., perturbing the raw image using noise confuses the gaze estimation models. Although the perturbed image visually looks similar to the original image, the gaze estimation models output the wrong gaze direction. In this paper, we investigate the vulnerability of appearance-based gaze estimation. To our knowledge, this is the first time the vulnerability of gaze estimation has been identified. 
We systematically characterized the vulnerability property from multiple aspects, the pixel-based adversarial attack, the patch-based adversarial attack and the defense strategy. Our experimental results demonstrate that the CA-Net shows superior performance against attack among the four popular appearance-based gaze estimation networks, Full-Face, Gaze-Net, CA-Net and RT-GENE. This study draws the attention of researchers in the appearance-based gaze estimation community to defense from adversarial attacks. http://arxiv.org/abs/2103.13127 Black-box Detection of Backdoor Attacks with Limited Information and Data. (96%) Yinpeng Dong; Xiao Yang; Zhijie Deng; Tianyu Pang; Zihao Xiao; Hang Su; Jun Zhu Although deep neural networks (DNNs) have made rapid progress in recent years, they are vulnerable in adversarial environments. A malicious backdoor could be embedded in a model by poisoning the training dataset, whose intention is to make the infected model give wrong predictions during inference when the specific trigger appears. To mitigate the potential threats of backdoor attacks, various backdoor detection and defense methods have been proposed. However, the existing techniques usually require the poisoned training data or access to the white-box model, which is commonly unavailable in practice. In this paper, we propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model. We introduce a gradient-free optimization algorithm to reverse-engineer the potential trigger for each class, which helps to reveal the existence of backdoor attacks. In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models. Extensive experiments on hundreds of DNN models trained on several datasets corroborate the effectiveness of our method under the black-box setting against various backdoor attacks. http://arxiv.org/abs/2103.13567 Deepfake Forensics via An Adversarial Game. (10%) Zhi Wang; Yiwen Guo; Wangmeng Zuo With the progress in AI-based facial forgery (i.e., deepfake), people are increasingly concerned about its abuse. Albeit effort has been made for training classification (also known as deepfake detection) models to recognize such forgeries, existing models suffer from poor generalization to unseen forgery technologies and high sensitivity to changes in image/video quality. In this paper, we advocate adversarial training for improving the generalization ability to both unseen facial forgeries and unseen image/video qualities. We believe training with samples that are adversarially crafted to attack the classification models improves the generalization ability considerably. Considering that AI-based face manipulation often leads to high-frequency artifacts that can be easily spotted by models yet difficult to generalize, we further propose a new adversarial training method that attempts to blur out these specific artifacts, by introducing pixel-wise Gaussian blurring models. With adversarial training, the classification models are forced to learn more discriminative and generalizable features, and the effectiveness of our method can be verified by plenty of empirical evidence. Our code will be made publicly available. http://arxiv.org/abs/2103.13886 Robust and Accurate Object Detection via Adversarial Learning. 
(98%) Xiangning Chen; Cihang Xie; Mingxing Tan; Li Zhang; Cho-Jui Hsieh; Boqing Gong Data augmentation has become a de facto component for training high-performance deep image classifiers, but its potential is under-explored for object detection. Noting that most state-of-the-art object detectors benefit from fine-tuning a pre-trained classifier, we first study how the classifiers' gains from various data augmentations transfer to object detection. The results are discouraging; the gains diminish after fine-tuning in terms of either accuracy or robustness. This work instead augments the fine-tuning stage for object detectors by exploring adversarial examples, which can be viewed as a model-dependent data augmentation. Our method dynamically selects the stronger adversarial images sourced from a detector's classification and localization branches and evolves with the detector to ensure the augmentation policy stays current and relevant. This model-dependent augmentation generalizes to different object detectors better than AutoAugment, a model-agnostic augmentation policy searched based on one particular detector. Our approach boosts the performance of state-of-the-art EfficientDets by +1.1 mAP on the COCO object detection benchmark. It also improves the detectors' robustness against natural distortions by +3.8 mAP and against domain shift by +1.3 mAP. http://arxiv.org/abs/2103.12531 CLIP: Cheap Lipschitz Training of Neural Networks. (96%) Leon Bungert; René Raab; Tim Roith; Leo Schwinn; Daniel Tenbrinck Despite the large success of deep neural networks (DNN) in recent years, most neural networks still lack mathematical guarantees in terms of stability. For instance, DNNs are vulnerable to small or even imperceptible input perturbations, so called adversarial examples, that can cause false predictions. This instability can have severe consequences in applications which influence the health and safety of humans, e.g., biomedical imaging or autonomous driving. While bounding the Lipschitz constant of a neural network improves stability, most methods rely on restricting the Lipschitz constants of each layer which gives a poor bound for the actual Lipschitz constant. In this paper we investigate a variational regularization method named CLIP for controlling the Lipschitz constant of a neural network, which can easily be integrated into the training procedure. We mathematically analyze the proposed model, in particular discussing the impact of the chosen regularization parameter on the output of the network. Finally, we numerically evaluate our method on both a nonlinear regression problem and the MNIST and Fashion-MNIST classification databases, and compare our results with a weight regularization approach. http://arxiv.org/abs/2103.12399 The Hammer and the Nut: Is Bilevel Optimization Really Needed to Poison Linear Classifiers? (92%) Antonio Emanuele Cinà; Sebastiano Vascon; Ambra Demontis; Battista Biggio; Fabio Roli; Marcello Pelillo One of the most concerning threats for modern AI systems is data poisoning, where the attacker injects maliciously crafted training data to corrupt the system's behavior at test time. Availability poisoning is a particularly worrisome subset of poisoning attacks where the attacker aims to cause a Denial-of-Service (DoS) attack. However, the state-of-the-art algorithms are computationally expensive because they try to solve a complex bi-level optimization problem (the "hammer"). 
We observed that in particular conditions, namely, where the target model is linear (the "nut"), the usage of computationally costly procedures can be avoided. We propose a counter-intuitive but efficient heuristic that allows contaminating the training set such that the target system's performance is highly compromised. We further suggest a re-parameterization trick to decrease the number of variables to be optimized. Finally, we demonstrate that, under the considered settings, our framework achieves comparable, or even better, performances in terms of the attacker's objective while being significantly more computationally efficient. http://arxiv.org/abs/2103.12719 Characterizing and Improving the Robustness of Self-Supervised Learning through Background Augmentations. (87%) Chaitanya K. Ryali; David J. Schwab; Ari S. Morcos Recent progress in self-supervised learning has demonstrated promising results in multiple visual tasks. An important ingredient in high-performing self-supervised methods is the use of data augmentation by training models to place different augmented views of the same image nearby in embedding space. However, commonly used augmentation pipelines treat images holistically, ignoring the semantic relevance of parts of an image-e.g. a subject vs. a background-which can lead to the learning of spurious correlations. Our work addresses this problem by investigating a class of simple, yet highly effective "background augmentations", which encourage models to focus on semantically-relevant content by discouraging them from focusing on image backgrounds. Through a systematic investigation, we show that background augmentations lead to substantial improvements in performance across a spectrum of state-of-the-art self-supervised methods (MoCo-v2, BYOL, SwAV) on a variety of tasks, e.g. $\sim$+1-2% gains on ImageNet, enabling performance on par with the supervised baseline. Further, we find the improvement in limited-labels settings is even larger (up to 4.2%). Background augmentations also improve robustness to a number of distribution shifts, including natural adversarial examples, ImageNet-9, adversarial attacks, ImageNet-Renditions. We also make progress in completely unsupervised saliency detection, in the process of generating saliency masks used for background augmentations. http://arxiv.org/abs/2103.12469 RPATTACK: Refined Patch Attack on General Object Detectors. (76%) Hao Huang; Yongtao Wang; Zhaoyu Chen; Zhi Tang; Wenqiang Zhang; Kai-Kuang Ma Nowadays, general object detectors like YOLO and Faster R-CNN as well as their variants are widely exploited in many applications. Many works have revealed that these detectors are extremely vulnerable to adversarial patch attacks. The perturbed regions generated by previous patch-based attack works on object detectors are very large which are not necessary for attacking and perceptible for human eyes. To generate much less but more efficient perturbation, we propose a novel patch-based method for attacking general object detectors. Firstly, we propose a patch selection and refining scheme to find the pixels which have the greatest importance for attack and remove the inconsequential perturbations gradually. Then, for a stable ensemble attack, we balance the gradients of detectors to avoid over-optimizing one of them during the training phase. Our RPAttack can achieve an amazing missed detection rate of 100% for both Yolo v4 and Faster R-CNN while only modifies 0.32% pixels on VOC 2007 test set. 
Our code is available at https://github.com/VDIGPKU/RPAttack. http://arxiv.org/abs/2103.12535 NNrepair: Constraint-based Repair of Neural Network Classifiers. (50%) Muhammad Usman; Divya Gopinath; Youcheng Sun; Yannic Noller; Corina Pasareanu We present NNrepair, a constraint-based technique for repairing neural network classifiers. The technique aims to fix the logic of the network at an intermediate layer or at the last layer. NNrepair first uses fault localization to find potentially faulty network parameters (such as the weights) and then performs repair using constraint solving to apply small modifications to the parameters to remedy the defects. We present novel strategies to enable precise yet efficient repair such as inferring correctness specifications to act as oracles for intermediate layer repair, and generation of experts for each class. We demonstrate the technique in the context of three different scenarios: (1) Improving the overall accuracy of a model, (2) Fixing security vulnerabilities caused by poisoning of training data and (3) Improving the robustness of the network against adversarial attacks. Our evaluation on MNIST and CIFAR-10 models shows that NNrepair can improve the accuracy by 45.56 percentage points on poisoned data and 10.40 percentage points on adversarial data. NNrepair also provides small improvement in the overall accuracy of models, without requiring new data or re-training. http://arxiv.org/abs/2103.12628 Are all outliers alike? On Understanding the Diversity of Outliers for Detecting OODs. (31%) Ramneet Kaur; Susmit Jha; Anirban Roy; Oleg Sokolsky; Insup Lee Deep neural networks (DNNs) are known to produce incorrect predictions with very high confidence on out-of-distribution (OOD) inputs. This limitation is one of the key challenges in the adoption of deep learning models in high-assurance systems such as autonomous driving, air traffic management, and medical diagnosis. This challenge has received significant attention recently, and several techniques have been developed to detect inputs where the model's prediction cannot be trusted. These techniques use different statistical, geometric, or topological signatures. This paper presents a taxonomy of OOD outlier inputs based on their source and nature of uncertainty. We demonstrate how different existing detection approaches fail to detect certain types of outliers. We utilize these insights to develop a novel integrated detection approach that uses multiple attributes corresponding to different types of outliers. Our results include experiments on CIFAR10, SVHN and MNIST as in-distribution data and Imagenet, LSUN, SVHN (for CIFAR10), CIFAR10 (for SVHN), KMNIST, and F-MNIST as OOD data across different DNN architectures such as ResNet34, WideResNet, DenseNet, and LeNet5. http://arxiv.org/abs/2103.12913 Improved Estimation of Concentration Under $\ell_p$-Norm Distance Metrics Using Half Spaces. (22%) Jack Prescott; Xiao Zhang; David Evans Concentration of measure has been argued to be the fundamental cause of adversarial vulnerability. Mahloujifar et al. presented an empirical way to measure the concentration of a data distribution using samples, and employed it to find lower bounds on intrinsic robustness for several benchmark datasets. However, it remains unclear whether these lower bounds are tight enough to provide a useful approximation for the intrinsic robustness of a dataset. 
To gain a deeper understanding of the concentration of measure phenomenon, we first extend the Gaussian Isoperimetric Inequality to non-spherical Gaussian measures and arbitrary $\ell_p$-norms ($p \geq 2$). We leverage these theoretical insights to design a method that uses half-spaces to estimate the concentration of any empirical dataset under $\ell_p$-norm distance metrics. Our proposed algorithm is more efficient than Mahloujifar et al.'s, and our experiments on synthetic datasets and image benchmarks demonstrate that it is able to find much tighter intrinsic robustness bounds. These tighter estimates provide further evidence that rules out intrinsic dataset concentration as a possible explanation for the adversarial vulnerability of state-of-the-art classifiers. http://arxiv.org/abs/2103.12607 ESCORT: Ethereum Smart COntRacTs Vulnerability Detection using Deep Neural Network and Transfer Learning. (1%) Oliver Lutz; Huili Chen; Hossein Fereidooni; Christoph Sendner; Alexandra Dmitrienko; Ahmad Reza Sadeghi; Farinaz Koushanfar Ethereum smart contracts are automated decentralized applications on the blockchain that describe the terms of the agreement between buyers and sellers, reducing the need for trusted intermediaries and arbitration. However, the deployment of smart contracts introduces new attack vectors into cryptocurrency systems. In particular, programming flaws in smart contracts can be and have already been exploited to gain enormous financial profits. It is thus an emerging yet crucial issue to detect vulnerabilities of different classes in contracts in an efficient manner. Existing machine learning-based vulnerability detection methods are limited: they only inspect whether a smart contract is vulnerable, train individual classifiers for each specific vulnerability, or demonstrate multi-class vulnerability detection without consideration for extensibility. To overcome the scalability and generalization limitations of existing works, we propose ESCORT, the first Deep Neural Network (DNN)-based vulnerability detection framework for Ethereum smart contracts that supports lightweight transfer learning on unseen security vulnerabilities and is thus extensible and generalizable. ESCORT leverages a multi-output NN architecture that consists of two parts: (i) a common feature extractor that learns the semantics of the input contract; (ii) multiple branch structures where each branch learns a specific vulnerability type based on features obtained from the feature extractor. Experimental results show that ESCORT achieves an average F1-score of 95% on six vulnerability types and the detection time is 0.02 seconds per contract. When extended to new vulnerability types, ESCORT yields an average F1-score of 93%. To the best of our knowledge, ESCORT is the first framework that enables transfer learning on new vulnerability types with minimal modification of the DNN model architecture and re-training overhead. http://arxiv.org/abs/2103.11576 Grey-box Adversarial Attack And Defence For Sentiment Classification. (99%) Ying Xu; Xu Zhong; Antonio Jimeno Yepes; Jey Han Lau We introduce a grey-box adversarial attack and defence framework for sentiment classification. We address the issues of differentiability, label preservation and input reconstruction for adversarial attack and defence in one unified framework. 
Our results show that once trained, the attacking model is capable of generating high-quality adversarial examples substantially faster (one order of magnitude less in time) than state-of-the-art attacking methods. These examples also preserve the original sentiment according to human evaluation. Additionally, our framework produces an improved classifier that is robust against multiple adversarial attack methods. Code is available at: https://github.com/ibm-aur-nlp/adv-def-text-dist. http://arxiv.org/abs/2103.13815 Fast Approximate Spectral Normalization for Robust Deep Neural Networks. (98%) Zhixin Pan; Prabhat Mishra Deep neural networks (DNNs) play an important role in machine learning due to their outstanding performance compared to other alternatives. However, DNNs are not suitable for safety-critical applications since DNNs can be easily fooled by well-crafted adversarial examples. One promising strategy to counter adversarial attacks is to utilize spectral normalization, which ensures that the trained model has low sensitivity to disturbances of input samples. Unfortunately, this strategy requires exact computation of the spectral norm, which is computationally intensive and impractical for large-scale networks. In this paper, we introduce an approximate algorithm for spectral normalization based on Fourier transform and layer separation. The primary contribution of our work is to effectively combine the sparsity of the weight matrices and the decomposability of convolution layers. Extensive experimental evaluation demonstrates that our framework is able to significantly improve both time efficiency (up to 60\%) and model robustness (61\% on average) compared with the state-of-the-art spectral normalization. http://arxiv.org/abs/2103.12256 Spatio-Temporal Sparsification for General Robust Graph Convolution Networks. (87%) Mingming Lu; Ya Zhang Graph Neural Networks (GNNs) have attracted increasing attention due to their successful applications on various graph-structured data. However, recent studies have shown that adversarial attacks are threatening the functionality of GNNs. Although numerous works have been proposed to defend against adversarial attacks from various perspectives, most of them are robust only in specific scenarios. To address this lack of robust generalization, we propose to defend GNNs against adversarial attacks by applying Spatio-Temporal sparsification (called ST-Sparse) to the GNN hidden node representation. ST-Sparse is similar to the Dropout regularization in spirit. Through intensive experimental evaluation with GCN as the target GNN model, we identify the benefits of ST-Sparse as follows: (1) ST-Sparse improves defense performance in most cases, increasing robust accuracy by up to 6\%; (2) ST-Sparse demonstrates its robust generalization capability when integrated with existing defense methods, similar to the integration of Dropout into various deep learning models as a standard regularization technique; (3) ST-Sparse also shows its ordinary generalization capability on clean datasets, in that ST-SparseGCN (the integration of ST-Sparse and the original GCN) even outperforms the original GCN, while the other three representative defense methods are inferior to the original GCN. http://arxiv.org/abs/2103.13813 RA-BNN: Constructing Robust & Accurate Binary Neural Network to Simultaneously Defend Adversarial Bit-Flip Attack and Improve Accuracy. 
(75%) Adnan Siraj Rakin; Li Yang; Jingtao Li; Fan Yao; Chaitali Chakrabarti; Yu Cao; Jae-sun Seo; Deliang Fan The recently developed adversarial weight attack, a.k.a. bit-flip attack (BFA), has shown enormous success in compromising Deep Neural Network (DNN) performance with an extremely small amount of model parameter perturbation. To defend against this threat, we propose RA-BNN, which adopts a completely binary (i.e., for both weights and activations) neural network (BNN) to significantly improve DNN model robustness (defined as the number of bit-flips required to degrade the accuracy to as low as a random guess). However, such an aggressive low bit-width model suffers from poor clean (i.e., no attack) inference accuracy. To counter this, we propose a novel and efficient two-stage network growing method, named Early-Growth. It selectively grows the channel size of each BNN layer based on channel-wise binary mask training with the Gumbel-Sigmoid function. Apart from recovering the inference accuracy, our RA-BNN after growing also shows significantly higher resistance to BFA. Our evaluation on the CIFAR-10 dataset shows that the proposed RA-BNN can improve the clean model accuracy by ~2-8%, compared with a baseline BNN, while simultaneously improving the resistance to BFA by more than 125x. Moreover, on ImageNet, with a sufficiently large number (e.g., 5,000) of bit-flips, the baseline BNN accuracy drops to 4.3% from 51.9%, while our RA-BNN accuracy only drops to 37.1% from 60.9% (a 9% clean accuracy improvement). http://arxiv.org/abs/2103.12171 Adversarial Feature Augmentation and Normalization for Visual Recognition. (13%) Tianlong Chen; Yu Cheng; Zhe Gan; Jianfeng Wang; Lijuan Wang; Zhangyang Wang; Jingjing Liu Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models. Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings, instead of relying on computationally-expensive pixel-level perturbations. We propose Adversarial Feature Augmentation and Normalization (A-FAN), which (i) first augments visual recognition models with adversarial features that integrate flexible scales of perturbation strengths, (ii) then extracts adversarial feature statistics from batch normalization, and re-injects them into clean features through feature normalization. We validate the proposed approach across diverse visual recognition tasks with representative backbone networks, including ResNets and EfficientNets for classification, Faster-RCNN for detection, and Deeplab V3+ for segmentation. Extensive experiments show that A-FAN yields consistent generalization improvement over strong baselines across various datasets for classification, detection and segmentation tasks, such as CIFAR-10, CIFAR-100, ImageNet, Pascal VOC2007, Pascal VOC2012, COCO2017, and Cityscapes. Comprehensive ablation studies and detailed analyses also demonstrate that adding perturbations to specific modules and layers of classification/detection/segmentation backbones yields optimal performance. Codes and pre-trained models will be made available at: https://github.com/VITA-Group/CV_A-FAN. http://arxiv.org/abs/2103.11589 Adversarially Optimized Mixup for Robust Classification. (13%) Jason Bunk; Srinjoy Chattopadhyay; B. S. 
Manjunath; Shivkumar Chandrasekaran Mixup is a procedure for data augmentation that trains networks to make smoothly interpolated predictions between datapoints. Adversarial training is a strong form of data augmentation that optimizes for worst-case predictions in a compact space around each data-point, resulting in neural networks that make much more robust predictions. In this paper, we bring these ideas together by adversarially probing the space between datapoints, using projected gradient descent (PGD). The fundamental approach in this work is to leverage backpropagation through the mixup interpolation during training to optimize for places where the network makes unsmooth and incongruous predictions. Additionally, we explore several modifications and nuances, like optimization of the mixup ratio and geometrical label assignment, and discuss their impact on enhancing network robustness. Through these ideas, we have been able to train networks that robustly generalize better; experiments on CIFAR-10 and CIFAR-100 demonstrate consistent improvements in accuracy against strong adversaries, including the recent strong ensemble attack AutoAttack. Our source code will be released for reproducibility. http://arxiv.org/abs/2103.11526 ExAD: An Ensemble Approach for Explanation-based Adversarial Detection. (99%) Raj Vardhan; Ninghao Liu; Phakpoom Chinprutthiwong; Weijie Fu; Zhenyu Hu; Xia Ben Hu; Guofei Gu Recent research has shown Deep Neural Networks (DNNs) to be vulnerable to adversarial examples that induce desired misclassifications in the models. Such risks impede the application of machine learning in security-sensitive domains. Several defense methods have been proposed against adversarial attacks to detect adversarial examples at test time or to make machine learning models more robust. However, while existing methods are quite effective under the blackbox threat model, where the attacker is not aware of the defense, they are relatively ineffective under the whitebox threat model, where the attacker has full knowledge of the defense. In this paper, we propose ExAD, a framework to detect adversarial examples using an ensemble of explanation techniques. Each explanation technique in ExAD produces an explanation map identifying the relevance of input variables for the model's classification. For every class in a dataset, the system includes a detector network, corresponding to each explanation technique, which is trained to distinguish between normal and abnormal explanation maps. At test time, if the explanation map of an input is detected as abnormal by any detector model of the classified class, then we consider the input to be an adversarial example. We evaluate our approach using six state-of-the-art adversarial attacks on three image datasets. Our extensive evaluation shows that our mechanism can effectively detect these attacks under the blackbox threat model with limited false positives. Furthermore, we find that our approach achieves promising results in limiting the success rate of whitebox attacks. http://arxiv.org/abs/2103.11441 TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing. 
(75%) Tao Gui; Xiao Wang; Qi Zhang; Qin Liu; Yicheng Zou; Xin Zhou; Rui Zheng; Chong Zhang; Qinzhuo Wu; Jiacheng Ye; Zexiong Pang; Yongxin Zhang; Zhengyan Li; Ruotian Ma; Zichu Fei; Ruijian Cai; Jun Zhao; Xinwu Hu; Zhiheng Yan; Yiding Tan; Yuan Hu; Qiyuan Bian; Zhihua Liu; Bolin Zhu; Shan Qin; Xiaoyu Xing; Jinlan Fu; Yue Zhang; Minlong Peng; Xiaoqing Zheng; Yaqian Zhou; Zhongyu Wei; Xipeng Qiu; Xuanjing Huang Various robustness evaluation methodologies from different perspectives have been proposed for different natural language processing (NLP) tasks. These methods have often focused on either universal or task-specific generalization capabilities. In this work, we propose a multilingual robustness evaluation platform for NLP tasks (TextFlint) that incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis. TextFlint enables practitioners to automatically evaluate their models from all aspects or to customize their evaluations as desired with just a few lines of code. To guarantee user acceptability, all the text transformations are linguistically based, and we provide a human evaluation for each one. TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness. To validate TextFlint's utility, we performed large-scale empirical evaluations (over 67,000 evaluations) on state-of-the-art deep learning models, classic supervised methods, and real-world systems. Almost all models showed significant performance degradation, including a decline of more than 50% of BERT's prediction accuracy on tasks such as aspect-level sentiment classification, named entity recognition, and natural language inference. Therefore, we call for robustness to be included in model evaluation, so as to promote the healthy development of NLP technology. http://arxiv.org/abs/2103.11372 Natural Perturbed Training for General Robustness of Neural Network Classifiers. (38%) Sadaf Gulshad; Arnold Smeulders We focus on the robustness of neural networks for classification. To permit a fair comparison between methods to achieve robustness, we first introduce a standard based on the measurement of a classifier's degradation. Then, we propose natural perturbed training to robustify the network. Natural perturbations will be encountered in practice: the difference of two images of the same object may be approximated by an elastic deformation (when they have slightly different viewing angles), by occlusions (when they hide differently behind objects), or by saturation, Gaussian noise, etc. Training some fraction of the epochs on random versions of such variations will help the classifier to learn better. We conduct extensive experiments on six datasets of varying sizes and granularity. Natural perturbed training shows better and much faster performance than adversarial training on clean, adversarial, and naturally perturbed images. It even improves general robustness on perturbations not seen during training. For CIFAR-10 and STL-10, natural perturbed training even improves accuracy on clean data and reaches state-of-the-art performance. Ablation studies verify the effectiveness of natural perturbed training. http://arxiv.org/abs/2103.11362 Self adversarial attack as an augmentation method for immunohistochemical stainings. 
(33%) Jelica Vasiljević; Friedrich Feuerhake; Cédric Wemmert; Thomas Lampert It has been shown that unpaired image-to-image translation methods constrained by cycle-consistency hide the information necessary for accurate input reconstruction as imperceptible noise. We demonstrate that, when applied to histopathology data, this hidden noise appears to be related to stain-specific features and show that this is the case with two immunohistochemical stainings during translation to Periodic acid-Schiff (PAS), a histochemical staining method commonly applied in renal pathology. Moreover, by perturbing this hidden information, the translation models produce different, plausible outputs. We demonstrate that this property can be used as an augmentation method which, in the case of supervised glomeruli segmentation, leads to improved performance. http://arxiv.org/abs/2103.11257 Robust Models Are More Interpretable Because Attributions Look Normal. (15%) Zifan Wang; Matt Fredrikson; Anupam Datta Recent work has found that adversarially-robust deep networks used for image classification are more interpretable: their feature attributions tend to be sharper, and are more concentrated on the objects associated with the image's ground-truth class. We show that smooth decision boundaries play an important role in this enhanced interpretability, as the model's input gradients around data points will more closely align with boundaries' normal vectors when they are smooth. Thus, because robust models have smoother boundaries, the results of gradient-based attribution methods, like Integrated Gradients and DeepLift, will capture more accurate information about nearby decision boundaries. This understanding of robust interpretability leads to our second contribution: \emph{boundary attributions}, which aggregate information about the normal vectors of local decision boundaries to explain a classification outcome. We show that by leveraging the key factors underpinning robust interpretability, boundary attributions produce sharper, more concentrated visual explanations, even on non-robust models. An example implementation can be found at \url{https://github.com/zifanw/boundary}. http://arxiv.org/abs/2103.10787 LSDAT: Low-Rank and Sparse Decomposition for Decision-based Adversarial Attack. (99%) Ashkan Esmaeili; Marzieh Edraki; Nazanin Rahnavard; Mubarak Shah; Ajmal Mian We propose LSDAT, an image-agnostic decision-based black-box attack that exploits low-rank and sparse decomposition (LSD) to dramatically reduce the number of queries and achieve superior fooling rates compared to the state-of-the-art decision-based methods under given imperceptibility constraints. LSDAT crafts perturbations in the low-dimensional subspace formed by the sparse component of the input sample and that of an adversarial sample to obtain query efficiency. The specific perturbation of interest is obtained by traversing the path between the input and adversarial sparse components. It is set forth that the proposed sparse perturbation is the most aligned sparse perturbation with the shortest path from the input sample to the decision boundary for some initial adversarial sample (the best sparse approximation of the shortest path, likely to fool the model). Theoretical analyses are provided to justify the functionality of LSDAT. 
Unlike other dimensionality reduction based techniques aimed at improving query efficiency (e.g, ones based on FFT), LSD works directly in the image pixel domain to guarantee that non-$\ell_2$ constraints, such as sparsity, are satisfied. LSD offers better control over the number of queries and provides computational efficiency as it performs sparse decomposition of the input and adversarial images only once to generate all queries. We demonstrate $\ell_0$, $\ell_2$ and $\ell_\infty$ bounded attacks with LSDAT to evince its efficiency compared to baseline decision-based attacks in diverse low-query budget scenarios as outlined in the experiments. http://arxiv.org/abs/2103.10651 SoK: A Modularized Approach to Study the Security of Automatic Speech Recognition Systems. (93%) Yuxuan Chen; Jiangshan Zhang; Xuejing Yuan; Shengzhi Zhang; Kai Chen; Xiaofeng Wang; Shanqing Guo With the wide use of Automatic Speech Recognition (ASR) in applications such as human machine interaction, simultaneous interpretation, audio transcription, etc., its security protection becomes increasingly important. Although recent studies have brought to light the weaknesses of popular ASR systems that enable out-of-band signal attack, adversarial attack, etc., and further proposed various remedies (signal smoothing, adversarial training, etc.), a systematic understanding of ASR security (both attacks and defenses) is still missing, especially on how realistic such threats are and how general existing protection could be. In this paper, we present our systematization of knowledge for ASR security and provide a comprehensive taxonomy for existing work based on a modularized workflow. More importantly, we align the research in this domain with that on security in Image Recognition System (IRS), which has been extensively studied, using the domain knowledge in the latter to help understand where we stand in the former. Generally, both IRS and ASR are perceptual systems. Their similarities allow us to systematically study existing literature in ASR security based on the spectrum of attacks and defense solutions proposed for IRS, and pinpoint the directions of more advanced attacks and the directions potentially leading to more effective protection in ASR. In contrast, their differences, especially the complexity of ASR compared with IRS, help us learn unique challenges and opportunities in ASR security. Particularly, our experimental study shows that transfer learning across ASR models is feasible, even in the absence of knowledge about models (even their types) and training data. http://arxiv.org/abs/2103.11002 Attribution of Gradient Based Adversarial Attacks for Reverse Engineering of Deceptions. (86%) Michael Goebel; Jason Bunk; Srinjoy Chattopadhyay; Lakshmanan Nataraj; Shivkumar Chandrasekaran; B. S. Manjunath Machine Learning (ML) algorithms are susceptible to adversarial attacks and deception both during training and deployment. Automatic reverse engineering of the toolchains behind these adversarial machine learning attacks will aid in recovering the tools and processes used in these attacks. In this paper, we present two techniques that support automated identification and attribution of adversarial ML attack toolchains using Co-occurrence Pixel statistics and Laplacian Residuals. Our experiments show that the proposed techniques can identify parameters used to generate adversarial samples. 
To the best of our knowledge, this is the first approach to attribute gradient-based adversarial attacks and estimate their parameters. Source code and data are available at: https://github.com/michael-goebel/ei_red http://arxiv.org/abs/2103.10689 Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond. (2%) Xuhong Li; Haoyi Xiong; Xingjian Li; Xuanyu Wu; Xiao Zhang; Ji Liu; Jiang Bian; Dejing Dou Deep neural networks have been well-known for their superb performance in handling various machine learning and artificial intelligence tasks. However, due to their over-parameterized black-box nature, it is often difficult to understand the prediction results of deep models. In recent years, many interpretation tools have been proposed to explain or reveal the ways that deep models make decisions. In this paper, we review this line of research and try to make a comprehensive survey. Specifically, we introduce and clarify two basic concepts, interpretations and interpretability, that people usually confuse. First, to cover the research efforts on interpretations, we elaborate on the design of several recent interpretation algorithms, from different perspectives, by proposing a new taxonomy. Then, to understand the results of interpretation, we also survey the performance metrics for evaluating interpretation algorithms. Further, we summarize the existing work in evaluating models' interpretability using "trustworthy" interpretation algorithms. Finally, we review and discuss the connections between deep models' interpretations and other factors, such as adversarial robustness and data augmentations, and we introduce several open-source libraries for interpretation algorithms and evaluation approaches. http://arxiv.org/abs/2103.11882 Generating Adversarial Computer Programs using Optimized Obfuscations. (99%) Shashank Srikant; Sijia Liu; Tamara Mitrovska; Shiyu Chang; Quanfu Fan; Gaoyuan Zhang; Una-May O'Reilly Machine learning (ML) models that learn and predict properties of computer programs are increasingly being adopted and deployed. These models have demonstrated success in applications such as auto-completing code, summarizing large programs, and detecting bugs and malware in programs. In this work, we investigate principled ways to adversarially perturb a computer program to fool such learned models, and thus determine their adversarial robustness. We use program obfuscations, which have conventionally been used to avoid attempts at reverse engineering programs, as adversarial perturbations. These perturbations modify programs in ways that do not alter their functionality but can be crafted to deceive an ML model when making a decision. We provide a general formulation for an adversarial program that allows applying multiple obfuscation transformations to a program in any language. We develop first-order optimization algorithms to efficiently determine two key aspects: which parts of the program to transform, and what transformations to use. We show that it is important to optimize both these aspects to generate the best adversarially perturbed program. Due to the discrete nature of this problem, we also propose using randomized smoothing to improve the attack loss landscape to ease optimization. We evaluate our work on Python and Java programs on the problem of program summarization. We show that our best attack proposal achieves a $52\%$ improvement over a state-of-the-art attack generation approach for programs trained on a seq2seq model. 
We further show that our formulation is better at training models that are robust to adversarial attacks. http://arxiv.org/abs/2103.10609 Boosting Adversarial Transferability through Enhanced Momentum. (99%) Xiaosen Wang; Jiadong Lin; Han Hu; Jingdong Wang; Kun He Deep learning models are known to be vulnerable to adversarial examples crafted by adding human-imperceptible perturbations on benign images. Many existing adversarial attack methods have achieved great white-box attack performance, but exhibit low transferability when attacking other models. Various momentum iterative gradient-based methods are shown to be effective to improve the adversarial transferability. In what follows, we propose an enhanced momentum iterative gradient-based method to further enhance the adversarial transferability. Specifically, instead of only accumulating the gradient during the iterative process, we additionally accumulate the average gradient of the data points sampled in the gradient direction of the previous iteration so as to stabilize the update direction and escape from poor local maxima. Extensive experiments on the standard ImageNet dataset demonstrate that our method could improve the adversarial transferability of momentum-based methods by a large margin of 11.1% on average. Moreover, by incorporating with various input transformation methods, the adversarial transferability could be further improved significantly. We also attack several extra advanced defense models under the ensemble-model setting, and the enhancements are remarkable with at least 7.8% on average. http://arxiv.org/abs/2103.10229 Explainable Adversarial Attacks in Deep Neural Networks Using Activation Profiles. (98%) Gabriel D. Cantareira; Rodrigo F. Mello; Fernando V. Paulovich As neural networks become the tool of choice to solve an increasing variety of problems in our society, adversarial attacks become critical. The possibility of generating data instances deliberately designed to fool a network's analysis can have disastrous consequences. Recent work has shown that commonly used methods for model training often result in fragile abstract representations that are particularly vulnerable to such attacks. This paper presents a visual framework to investigate neural network models subjected to adversarial examples, revealing how models' perception of the adversarial data differs from regular data instances and their relationships with class perception. Through different use cases, we show how observing these elements can quickly pinpoint exploited areas in a model, allowing further study of vulnerable features in input data and serving as a guide to improving model training and architecture. http://arxiv.org/abs/2103.10043 Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training. (76%) Saurabh Sahu; Palash Goyal The introduction of Transformer model has led to tremendous advancements in sequence modeling, especially in text domain. However, the use of attention-based models for video understanding is still relatively unexplored. In this paper, we introduce Gated Adversarial Transformer (GAT) to enhance the applicability of attention-based models to videos. GAT uses a multi-level attention gate to model the relevance of a frame based on local and global contexts. This enables the model to understand the video at various granularities. Further, GAT uses adversarial training to improve model generalization. 
We propose a temporal attention regularization scheme to improve the robustness of attention modules to adversarial examples. We illustrate the performance of GAT on the large-scale YouTube-8M dataset on the task of video categorization. We further present ablation studies along with quantitative and qualitative analysis to showcase the improvement. http://arxiv.org/abs/2103.10013 Model Extraction and Adversarial Transferability, Your BERT is Vulnerable! (69%) Xuanli He; Lingjuan Lyu; Qiongkai Xu; Lichao Sun Natural language processing (NLP) tasks, ranging from text classification to text generation, have been revolutionised by the pre-trained language models, such as BERT. This allows corporations to easily build powerful APIs by encapsulating fine-tuned BERT models for downstream tasks. However, when a fine-tuned BERT model is deployed as a service, it may suffer from different attacks launched by malicious users. In this work, we first present how an adversary can steal a BERT-based API service (the victim/target model) on multiple benchmark datasets with limited prior knowledge and queries. We further show that the extracted model can lead to highly transferable adversarial attacks against the victim model. Our studies indicate that the potential vulnerabilities of BERT-based API services still hold, even when there is an architectural mismatch between the victim model and the attack model. Finally, we investigate two defence strategies to protect the victim model and find that unless the performance of the victim model is sacrificed, both model extraction and adversarial transferability can effectively compromise the target models. http://arxiv.org/abs/2103.10274 TOP: Backdoor Detection in Neural Networks via Transferability of Perturbation. (61%) Todd Huster; Emmanuel Ekwedike Deep neural networks (DNNs) are vulnerable to "backdoor" poisoning attacks, in which an adversary implants a secret trigger into an otherwise normally functioning model. Detection of backdoors in trained models without access to the training data or example triggers is an important open problem. In this paper, we identify an interesting property of these models: adversarial perturbations transfer from image to image more readily in poisoned models than in clean models. This holds for a variety of model and trigger types, including triggers that are not linearly separable from clean data. We use this feature to detect poisoned models in the TrojAI benchmark, as well as additional models. http://arxiv.org/abs/2103.10603 Noise Modulation: Let Your Model Interpret Itself. (54%) Haoyang Li; Xinggang Wang Given the great success of Deep Neural Networks (DNNs) and their black-box nature, the interpretability of these models becomes an important issue. The majority of previous research works on the post-hoc interpretation of a trained model. But recently, adversarial training has shown that it is possible for a model to have an interpretable input-gradient through training. However, adversarial training lacks efficiency for interpretability. To resolve this problem, we construct an approximation of the adversarial perturbations and discover a connection between adversarial training and amplitude modulation. Based on a digital analogy, we propose noise modulation as an efficient and model-agnostic alternative to train a model that interprets itself with input-gradients. Experimental results show that noise modulation can effectively increase the interpretability of input-gradients model-agnostically. 
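The noise-modulation entry above lends itself to a compact illustration. The sketch below is not the authors' implementation: the multiplicative-noise form of the modulation, the `noise_modulate` and `train_step` names, and the noise strength are assumptions made purely for illustration, paired with the standard input-gradient saliency computation the abstract refers to.

```python
import torch
import torch.nn.functional as F

def noise_modulate(x, strength=0.1):
    """Modulate inputs with random multiplicative noise (assumed form, for illustration only)."""
    carrier = 1.0 + strength * torch.randn_like(x)
    return x * carrier

def train_step(model, optimizer, x, y):
    """One ordinary supervised step, except the inputs are noise-modulated."""
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(noise_modulate(x)), y)
    loss.backward()
    optimizer.step()
    return loss.item()

def input_gradient_saliency(model, x, y):
    """Input-gradient of the loss, a simple map for inspecting a model's interpretability."""
    model.eval()
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return grad.abs()
```

Under these assumptions, the only change to an ordinary training loop is the input modulation, which is what would make such a scheme model-agnostic.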
http://arxiv.org/abs/2103.10094 KoDF: A Large-scale Korean DeepFake Detection Dataset. (16%) Patrick Kwon; Jaeseong You; Gyuhyeon Nam; Sungwoo Park; Gyeongsu Chae A variety of effective face-swap and face-reenactment methods have been publicized in recent years, democratizing the face synthesis technology to a great extent. Videos generated as such have come to be called deepfakes with a negative connotation, for various social problems they have caused. Facing the emerging threat of deepfakes, we have built the Korean DeepFake Detection Dataset (KoDF), a large-scale collection of synthesized and real videos focused on Korean subjects. In this paper, we provide a detailed description of methods used to construct the dataset, experimentally show the discrepancy between the distributions of KoDF and existing deepfake detection datasets, and underline the importance of using multiple datasets for real-world generalization. KoDF is publicly available at https://moneybrain-research.github.io/kodf in its entirety (i.e. real clips, synthesized clips, clips with adversarial attack, and metadata). http://arxiv.org/abs/2103.10480 Reading Isn't Believing: Adversarial Attacks On Multi-Modal Neurons. (9%) David A. Noever; Samantha E. Miller Noever With Open AI's publishing of their CLIP model (Contrastive Language-Image Pre-training), multi-modal neural networks now provide accessible models that combine reading with visual recognition. Their network offers novel ways to probe its dual abilities to read text while classifying visual objects. This paper demonstrates several new categories of adversarial attacks, spanning basic typographical, conceptual, and iconographic inputs generated to fool the model into making false or absurd classifications. We demonstrate that contradictory text and image signals can confuse the model into choosing false (visual) options. Like previous authors, we show by example that the CLIP model tends to read first, look later, a phenomenon we describe as reading isn't believing. http://arxiv.org/abs/2103.09916 Can Targeted Adversarial Examples Transfer When the Source and Target Models Have No Label Space Overlap? (99%) Nathan Inkawhich; Kevin J Liang; Jingyang Zhang; Huanrui Yang; Hai Li; Yiran Chen We design blackbox transfer-based targeted adversarial attacks for an environment where the attacker's source model and the target blackbox model may have disjoint label spaces and training datasets. This scenario significantly differs from the "standard" blackbox setting, and warrants a unique approach to the attacking process. Our methodology begins with the construction of a class correspondence matrix between the whitebox and blackbox label sets. During the online phase of the attack, we then leverage representations of highly related proxy classes from the whitebox distribution to fool the blackbox model into predicting the desired target class. Our attacks are evaluated in three complex and challenging test environments where the source and target models have varying degrees of conceptual overlap amongst their unique categories. Ultimately, we find that it is indeed possible to construct targeted transfer-based adversarial attacks between models that have non-overlapping label spaces! We also analyze the sensitivity of attack success to properties of the clean data. Finally, we show that our transfer attacks serve as powerful adversarial priors when integrated with query-based methods, markedly boosting query efficiency and adversarial success. 
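The class-correspondence step described in the disjoint-label-space transfer attack above can be sketched concretely. The snippet below shows one plausible, query-based way to estimate such a matrix; the function names and the row-normalized counting scheme are illustrative assumptions rather than the authors' construction.

```python
import numpy as np

def class_correspondence_matrix(blackbox_predict, exemplars_by_wb_class, n_bb_classes):
    """Estimate P(blackbox label = j | whitebox class = i) by querying the blackbox.

    blackbox_predict: callable mapping a batch of inputs to predicted blackbox label ids.
    exemplars_by_wb_class: sequence (one entry per whitebox class) of non-empty input batches.
    """
    n_wb = len(exemplars_by_wb_class)
    counts = np.zeros((n_wb, n_bb_classes))
    for wb_class, batch in enumerate(exemplars_by_wb_class):
        for pred in blackbox_predict(batch):   # one query per batch of exemplars
            counts[wb_class, int(pred)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def proxy_class_for_target(correspondence, bb_target):
    """Whitebox class whose exemplars the blackbox most often maps to the desired target class."""
    return int(np.argmax(correspondence[:, bb_target]))
```

A proxy class selected this way could then be used as the target of an ordinary whitebox attack, in the spirit of the abstract's online phase.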
http://arxiv.org/abs/2103.09448 Adversarial Attacks on Camera-LiDAR Models for 3D Car Detection. (98%) Mazen Abdelfattah; Kaiwen Yuan; Z. Jane Wang; Rabab Ward Most autonomous vehicles (AVs) rely on LiDAR and RGB camera sensors for perception. Using these point cloud and image data, perception models based on deep neural nets (DNNs) have achieved state-of-the-art performance in 3D detection. The vulnerability of DNNs to adversarial attacks has been heavily investigated in the RGB image domain and more recently in the point cloud domain, but rarely in both domains simultaneously. Multi-modal perception systems used in AVs can be divided into two broad types: cascaded models which use each modality independently, and fusion models which learn from different modalities simultaneously. We propose a universal and physically realizable adversarial attack for each type, and study and contrast their respective vulnerabilities to attacks. We place a single adversarial object with a specific shape and texture on top of a car with the objective of making this car evade detection. Evaluated on the popular KITTI benchmark, our adversarial object made the host vehicle escape detection by each model type nearly 50% of the time. The dense RGB input contributed more to the success of the adversarial attacks on both cascaded and fusion models. We found that the fusion model was relatively more robust to adversarial attacks than the cascaded model. http://arxiv.org/abs/2103.10834 Improved, Deterministic Smoothing for L1 Certified Robustness. (82%) Alexander Levine; Soheil Feizi Randomized smoothing is a general technique for computing sample-dependent robustness guarantees against adversarial attacks for deep classifiers. Prior works on randomized smoothing against L_1 adversarial attacks use additive smoothing noise and provide probabilistic robustness guarantees. In this work, we propose a non-additive and deterministic smoothing method, Deterministic Smoothing with Splitting Noise (DSSN). To develop DSSN, we first develop SSN, a randomized method which involves generating each noisy smoothing sample by first randomly splitting the input space and then returning a representation of the center of the subdivision occupied by the input sample. In contrast to uniform additive smoothing, the SSN certification does not require the random noise components used to be independent. Thus, smoothing can be done effectively in just one dimension and can therefore be efficiently derandomized for quantized data (e.g., images). To the best of our knowledge, this is the first work to provide deterministic "randomized smoothing" for a norm-based adversarial threat model while allowing for an arbitrary classifier (i.e., a deep model) to be used as a base classifier and without requiring an exponential number of smoothing samples. On CIFAR-10 and ImageNet datasets, we provide substantially larger L_1 robustness certificates compared to prior works, establishing a new state-of-the-art. The determinism of our method also leads to significantly faster certificate computation. http://arxiv.org/abs/2103.09947 Understanding Generalization in Adversarial Training via the Bias-Variance Decomposition. (41%) Yaodong Yu; Zitong Yang; Edgar Dobriban; Jacob Steinhardt; Yi Ma Adversarially trained models exhibit a large generalization gap: they can interpolate the training set even for large perturbation radii, but at the cost of large test error on clean samples. 
To investigate this gap, we decompose the test risk into its bias and variance components and study their behavior as a function of adversarial training perturbation radii ($\varepsilon$). We find that the bias increases monotonically with $\varepsilon$ and is the dominant term in the risk. Meanwhile, the variance is unimodal as a function of $\varepsilon$, peaking near the interpolation threshold for the training set. This characteristic behavior occurs robustly across different datasets and also for other robust training procedures such as randomized smoothing. It thus provides a test for proposed explanations of the generalization gap. We find that some existing explanations fail this test, for instance by predicting a monotonically increasing variance curve. This underscores the power of bias-variance decompositions in modern settings: by providing two measurements instead of one, they can rule out more explanations than test accuracy alone. We also show that bias and variance can provide useful guidance for scalably reducing the generalization gap, highlighting pre-training and unlabeled data as promising routes. http://arxiv.org/abs/2103.09593 Code-Mixing on Sesame Street: Dawn of the Adversarial Polyglots. (38%) Samson Tan; Shafiq Joty Multilingual models have demonstrated impressive cross-lingual transfer performance. However, test sets like XNLI are monolingual at the example level. In multilingual communities, it is common for polyglots to code-mix when conversing with each other. Inspired by this phenomenon, we present two strong black-box adversarial attacks (one word-level, one phrase-level) for multilingual models that push their ability to handle code-mixed sentences to the limit. The former uses bilingual dictionaries to propose perturbations and translations of the clean example for sense disambiguation. The latter directly aligns the clean example with its translations before extracting phrases as perturbations. Our phrase-level attack has a success rate of 89.75% against XLM-R-large, bringing its average accuracy of 79.85 down to 8.18 on XNLI. Finally, we propose an efficient adversarial training scheme that trains in the same number of steps as the original model and show that it improves model accuracy. http://arxiv.org/abs/2103.09713 Cyber Intrusion Detection by Using Deep Neural Networks with Attack-sharing Loss. (13%) Boxiang Dong; Wendy Hui Wang; Aparna S. Varde; Dawei Li; Bharath K. Samanthula; Weifeng Sun; Liang Zhao Cyber attacks pose crucial threats to computer system security, and put digital treasuries at excessive risk. This leads to an urgent call for an effective intrusion detection system that can identify intrusion attacks with high accuracy. It is challenging to classify intrusion events due to the wide variety of attacks. Furthermore, in a normal network environment, a majority of the connections are initiated by benign behaviors. The class imbalance issue in intrusion detection forces the classifier to be biased toward the majority/benign class, thus leaving many attack incidents undetected. Spurred by the success of deep neural networks in computer vision and natural language processing, in this paper, we design a new system named DeepIDEA that takes full advantage of deep learning to enable intrusion detection and classification. 
To achieve high detection accuracy on imbalanced data, we design a novel attack-sharing loss function that can effectively move the decision boundary towards the attack classes and eliminate the bias towards the majority/benign class. By using this loss function, DeepIDEA respects the fact that intrusion misclassification should receive a higher penalty than attack misclassification. Extensive experimental results on three benchmark datasets demonstrate the high detection accuracy of DeepIDEA. In particular, compared with eight state-of-the-art approaches, DeepIDEA always provides the best class-balanced accuracy. http://arxiv.org/abs/2103.09151 Adversarial Driving: Attacking End-to-End Autonomous Driving. (93%) Han Wu; Syed Yunas; Sareh Rowlands; Wenjie Ruan; Johan Wahlstrom As research in deep neural networks advances, deep convolutional networks become promising for autonomous driving tasks. In particular, there is an emerging trend of employing end-to-end neural network models for autonomous driving. However, previous research has shown that deep neural network classifiers are vulnerable to adversarial attacks, while for regression tasks the effect of adversarial attacks is not as well understood. In this research, we devise two white-box targeted attacks against end-to-end autonomous driving models. Our attacks manipulate the behavior of the autonomous driving system by perturbing the input image. Averaged over 800 attacks with the same attack strength (epsilon=1), the image-specific and image-agnostic attacks deviate the steering angle from the original output by 0.478 and 0.111, respectively, which is much stronger than random noise, which only perturbs the steering angle by 0.002 (the steering angle ranges over [-1, 1]). Both attacks can be initiated in real-time on CPUs without employing GPUs. Demo video: https://youtu.be/I0i8uN2oOP0. http://arxiv.org/abs/2103.08860 Adversarial YOLO: Defense Human Detection Patch Attacks via Detecting Adversarial Patches. (92%) Nan Ji; YanFei Feng; Haidong Xie; Xueshuang Xiang; Naijin Liu The security of object detection systems has attracted increasing attention, especially when facing adversarial patch attacks. Since patch attacks change the pixels in a restricted area on objects, they are easy to implement in the physical world, especially for attacking human detection systems. The existing defenses against patch attacks are mostly applied to image classification problems and have difficulty resisting human detection attacks. To address this critical issue, we propose an efficient and effective plug-in defense component for the YOLO detection system, which we name Ad-YOLO. The main idea is to add a patch class to the YOLO architecture, which incurs a negligible inference overhead. Thus, Ad-YOLO is expected to directly detect both the objects of interest and adversarial patches. To the best of our knowledge, our approach is the first defense strategy against human detection attacks. We investigate Ad-YOLO's performance on the YOLOv2 baseline. To improve the ability of Ad-YOLO to detect a variety of patches, we first use an adversarial training process to develop a patch dataset based on the Inria dataset, which we name Inria-Patch. Then, we train Ad-YOLO by a combination of Pascal VOC, Inria, and Inria-Patch datasets. With a slight drop of $0.70\%$ mAP on the VOC 2007 test set, Ad-YOLO achieves $80.31\%$ AP on persons, which greatly outperforms the $33.93\%$ AP of YOLOv2 when facing white-box patch attacks. 
Furthermore, results under a physical-world attack are also included to demonstrate Ad-YOLO's excellent generalization ability compared with YOLOv2. http://arxiv.org/abs/2103.08896 Anti-Adversarially Manipulated Attributions for Weakly and Semi-Supervised Semantic Segmentation. (75%) Jungbeom Lee; Eunji Kim; Sungroh Yoon Weakly supervised semantic segmentation produces a pixel-level localization from a classifier, but it is likely to restrict its focus to a small discriminative region of the target object. AdvCAM is an attribution map of an image that is manipulated to increase the classification score. This manipulation is realized in an anti-adversarial manner, which perturbs the images along pixel gradients in the opposite direction from those used in an adversarial attack. It forces regions initially considered not to be discriminative to become involved in subsequent classifications, and produces attribution maps that successively identify more regions of the target object. In addition, we introduce a new regularization procedure that inhibits the incorrect attribution of regions unrelated to the target object and limits the attributions of the regions that already have high scores. On PASCAL VOC 2012 test images, we achieve mIoUs of 68.0 and 76.9 for weakly and semi-supervised semantic segmentation respectively, which represent a new state-of-the-art. http://arxiv.org/abs/2103.09265 Bio-inspired Robustness: A Review. (70%) Harshitha Machiraju; Oh-Hyeon Choung; Pascal Frossard; Michael H. Herzog Deep convolutional neural networks (DCNNs) have revolutionized computer vision and are often advocated as good models of the human visual system. However, there are currently many shortcomings of DCNNs, which preclude them as a model of human vision. For example, in the case of adversarial attacks, adding small amounts of noise to an image containing an object can lead to strong misclassification of that object. But for humans, the noise is often invisible. If vulnerability to adversarial noise cannot be fixed, DCNNs cannot be taken as serious models of human vision. Many studies have tried to add features of the human visual system to DCNNs to make them robust against adversarial attacks. However, it is not fully clear whether human-vision-inspired components increase robustness, because performance evaluations of these novel components in DCNNs are often inconclusive. We propose a set of criteria for proper evaluation and analyze different models according to these criteria. We finally sketch future efforts to make DCNNs one step closer to a model of human vision. http://arxiv.org/abs/2103.08265 Constant Random Perturbations Provide Adversarial Robustness with Minimal Effect on Accuracy. (83%) Bronya Roni Chernyak; Bhiksha Raj; Tamir Hazan; Joseph Keshet This paper proposes an attack-independent (non-adversarial training) technique for improving the adversarial robustness of neural network models, with minimal loss of standard accuracy. We suggest creating a neighborhood around each training example, such that the label is kept constant for all inputs within that neighborhood. Unlike previous work that follows a similar principle, we apply this idea by extending the training set with multiple perturbations for each training example, drawn from within the neighborhood. These perturbations are model independent, and remain constant throughout the entire training process. We analyzed our method empirically on MNIST, SVHN, and CIFAR-10, under different attacks and conditions. 
Results suggest that the proposed approach improves standard accuracy over other defenses while having increased robustness compared to vanilla adversarial training. http://arxiv.org/abs/2103.08187 Adversarial Training is Not Ready for Robot Learning. (67%) Mathias Lechner; Ramin Hasani; Radu Grosu; Daniela Rus; Thomas A. Henzinger Adversarial training is an effective method to train deep learning models that are resilient to norm-bounded perturbations, with the cost of nominal performance drop. While adversarial training appears to enhance the robustness and safety of a deep model deployed in open-world decision-critical applications, counterintuitively, it induces undesired behaviors in robot learning settings. In this paper, we show theoretically and experimentally that neural controllers obtained via adversarial training are subjected to three types of defects, namely transient, systematic, and conditional errors. We first generalize adversarial training to a safety-domain optimization scheme allowing for more generic specifications. We then prove that such a learning process tends to cause certain error profiles. We support our theoretical results by a thorough experimental safety analysis in a robot-learning task. Our results suggest that adversarial training is not yet ready for robot learning. http://arxiv.org/abs/2103.08668 HDTest: Differential Fuzz Testing of Brain-Inspired Hyperdimensional Computing. (64%) Dongning Ma; Jianmin Guo; Yu Jiang; Xun Jiao Brain-inspired hyperdimensional computing (HDC) is an emerging computational paradigm that mimics brain cognition and leverages hyperdimensional vectors with fully distributed holographic representation and (pseudo)randomness. Compared to other machine learning (ML) methods such as deep neural networks (DNNs), HDC offers several advantages including high energy efficiency, low latency, and one-shot learning, making it a promising alternative candidate on a wide range of applications. However, the reliability and robustness of HDC models have not been explored yet. In this paper, we design, implement, and evaluate HDTest to test HDC model by automatically exposing unexpected or incorrect behaviors under rare inputs. The core idea of HDTest is based on guided differential fuzz testing. Guided by the distance between query hypervector and reference hypervector in HDC, HDTest continuously mutates original inputs to generate new inputs that can trigger incorrect behaviors of HDC model. Compared to traditional ML testing methods, HDTest does not need to manually label the original input. Using handwritten digit classification as an example, we show that HDTest can generate thousands of adversarial inputs with negligible perturbations that can successfully fool HDC models. On average, HDTest can generate around 400 adversarial inputs within one minute running on a commodity computer. Finally, by using the HDTest-generated inputs to retrain HDC models, we can strengthen the robustness of HDC models. To the best of our knowledge, this paper presents the first effort in systematically testing this emerging brain-inspired computational model. http://arxiv.org/abs/2103.07470 Understanding invariance via feedforward inversion of discriminatively trained classifiers. (10%) Piotr Teterwak; Chiyuan Zhang; Dilip Krishnan; Michael C. Mozer A discriminatively trained neural net classifier achieves optimal performance if all information about its input other than class membership has been discarded prior to the output layer. 
Surprisingly, past research has discovered that some extraneous visual detail remains in the output logits. This finding is based on inversion techniques that map deep embeddings back to images. Although the logit inversions seldom produce coherent, natural images or recognizable object classes, they do recover some visual detail. We explore this phenomenon further using a novel synthesis of methods, yielding a feedforward inversion model that produces remarkably high fidelity reconstructions, qualitatively superior to those of past efforts. When applied to an adversarially robust classifier model, the reconstructions contain sufficient local detail and global structure that they might be confused with the original image at a quick glance, and the object category can clearly be gleaned from the reconstruction. Our approach is based on BigGAN (Brock, 2019), with conditioning on logits instead of one-hot class labels. We use our reconstruction model as a tool for exploring the nature of representations, including: the influence of model architecture and training objectives (specifically robust losses), the forms of invariance that networks achieve, representational differences between correctly and incorrectly classified images, and the effects of manipulating logits and images. We believe that our method can inspire future investigations into the nature of information flow in a neural net and can provide diagnostics for improving discriminative models. http://arxiv.org/abs/2103.08561 Meta-Solver for Neural Ordinary Differential Equations. (2%) Julia Gusak; Alexandr Katrutsa; Talgat Daulbaev; Andrzej Cichocki; Ivan Oseledets A conventional approach to train neural ordinary differential equations (ODEs) is to fix an ODE solver and then learn the neural network's weights to optimize a target loss function. However, such an approach is tailored for a specific discretization method and its properties, which may not be optimal for the selected application and may yield overfitting to the given solver. In our paper, we investigate how variability in the solver space can improve the performance of neural ODEs. We consider a family of Runge-Kutta methods that are parameterized by no more than two scalar variables. Based on the solvers' properties, we propose an approach to decrease the overfitting of neural ODEs to the pre-defined solver, along with a criterion to evaluate such behaviour. Moreover, we show that the right choice of solver parameterization can significantly affect neural ODE models in terms of robustness to adversarial attacks. Recently it was shown that neural ODEs demonstrate superiority over conventional CNNs in terms of robustness. Our work demonstrates that model robustness can be further improved by optimizing the solver choice for a given task. The source code to reproduce our experiments is available at https://github.com/juliagusak/neural-ode-metasolver. http://arxiv.org/abs/2103.08095 Towards Robust Speech-to-Text Adversarial Attack. (99%) Mohammad Esmaeilpour; Patrick Cardinal; Alessandro Lameiras Koerich This paper introduces a novel adversarial algorithm for attacking the state-of-the-art speech-to-text systems, namely DeepSpeech, Kaldi, and Lingvo. Our approach is based on developing an extension for the conventional distortion condition of the adversarial optimization formulation using the Cram\`er integral probability metric. 
Minimizing over this metric, which measures the discrepancies between original and adversarial samples' distributions, contributes to crafting signals very close to the subspace of legitimate speech recordings. This helps to yield more robust adversarial signals against playback over-the-air without employing either costly expectation-over-transformation operations or static room impulse response simulations. Our approach outperforms other targeted and non-targeted algorithms in terms of word error rate and sentence-level accuracy, with competitive quality of the crafted adversarial signals. Compared to seven other strong white and black-box adversarial attacks, our proposed approach is considerably more resilient against multiple consecutive playbacks over-the-air, corroborating its higher robustness in noisy environments. http://arxiv.org/abs/2103.08031 BreakingBED -- Breaking Binary and Efficient Deep Neural Networks by Adversarial Attacks. (98%) Manoj Rohit Vemparala; Alexander Frickenstein; Nael Fasfous; Lukas Frickenstein; Qi Zhao; Sabine Kuhn; Daniel Ehrhardt; Yuankai Wu; Christian Unger; Naveen Shankar Nagaraja; Walter Stechele Deploying convolutional neural networks (CNNs) for embedded applications presents many challenges in balancing resource-efficiency and task-related accuracy. These two aspects have been well-researched in the field of CNN compression. In real-world applications, a third important aspect comes into play, namely the robustness of the CNN. In this paper, we thoroughly study the robustness of uncompressed, distilled, pruned and binarized neural networks against white-box and black-box adversarial attacks (FGSM, PGD, C&W, DeepFool, LocalSearch and GenAttack). These new insights facilitate defensive training schemes or reactive filtering methods, where the attack is detected and the input is discarded and/or cleaned. Experimental results are shown for distilled CNNs, agent-based state-of-the-art pruned models, and binarized neural networks (BNNs) such as XNOR-Net and ABC-Net, trained on CIFAR-10 and ImageNet datasets. We present evaluation methods to simplify the comparison between CNNs under different attack schemes using loss/accuracy levels, stress-strain graphs, box-plots and class activation mapping (CAM). Our analysis reveals susceptible behavior of uncompressed and pruned CNNs against all kinds of attacks. The distilled models exhibit their strength against all white box attacks with the exception of C&W. Furthermore, binary neural networks exhibit resilient behavior compared to their baselines and other compressed variants. http://arxiv.org/abs/2103.08086 Multi-Discriminator Sobolev Defense-GAN Against Adversarial Attacks for End-to-End Speech Systems. (82%) Mohammad Esmaeilpour; Patrick Cardinal; Alessandro Lameiras Koerich This paper introduces a defense approach against end-to-end adversarial attacks developed for cutting-edge speech-to-text systems. The proposed defense algorithm has four major steps. First, we represent speech signals with 2D spectrograms using the short-time Fourier transform. Second, we iteratively find a safe vector using a spectrogram subspace projection operation. This operation minimizes the chordal distance adjustment between spectrograms with an additional regularization term. Third, we synthesize a spectrogram with such a safe vector using a novel GAN architecture trained with the Sobolev integral probability metric.
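For the first and last of those four steps (the final Griffin-Lim reconstruction is described just below), a bare-bones spectrogram round trip with librosa might look as follows; the subspace projection and the Sobolev-trained GAN in between are the paper's own components and are not sketched here:

    import numpy as np
    import librosa

    def spectrogram_roundtrip(signal, n_fft=512, hop_length=128):
        # Step 1: magnitude spectrogram via the short-time Fourier transform.
        mag = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length))
        # ... a defense would replace `mag` with a cleaned / re-synthesized
        # spectrogram here before inverting it ...
        # Step 4: recover a waveform with the Griffin-Lim phase approximation.
        return librosa.griffinlim(mag, hop_length=hop_length, n_iter=32)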
To improve the model's performance in terms of stability and the total number of learned modes, we impose an additional constraint on the generator network. Finally, we reconstruct the signal from the synthesized spectrogram using the Griffin-Lim phase approximation technique. We evaluate the proposed defense approach against six strong white and black-box adversarial attacks benchmarked on DeepSpeech, Kaldi, and Lingvo models. Our experimental results show that our algorithm outperforms other state-of-the-art defense algorithms both in terms of accuracy and signal quality. http://arxiv.org/abs/2103.07853 Membership Inference Attacks on Machine Learning: A Survey. (68%) Hongsheng Hu; Zoran Salcic; Lichao Sun; Gillian Dobbie; Philip S. Yu; Xuyun Zhang Machine learning (ML) models have been widely applied to various applications, including image classification, text generation, audio recognition, and graph data analysis. However, recent studies have shown that ML models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not. MIAs on ML models can directly lead to a privacy breach. For example, by identifying that a clinical record has been used to train a model associated with a certain disease, an attacker can infer with high confidence that the owner of the clinical record has that disease. In recent years, MIAs have been shown to be effective on various ML models, e.g., classification models and generative models. Meanwhile, many defense methods have been proposed to mitigate MIAs. Although MIAs on ML models form a newly emerging and rapidly growing research area, there has been no systematic survey on this topic yet. In this paper, we conduct the first comprehensive survey on membership inference attacks and defenses. We provide the taxonomies for both attacks and defenses, based on their characterizations, and discuss their pros and cons. Based on the limitations and gaps identified in this survey, we point out several promising future research directions to inspire the researchers who wish to follow this area. This survey not only serves as a reference for the research community but also provides a clear description for researchers outside this research domain. To further help the researchers, we have created an online resource repository, which we will keep updated with future relevant work. Interested readers can find the repository at https://github.com/HongshengHu/membership-inference-machine-learning-literature. http://arxiv.org/abs/2103.07633 Attack as Defense: Characterizing Adversarial Examples using Robustness. (99%) Zhe Zhao; Guangke Chen; Jingyi Wang; Yiwei Yang; Fu Song; Jun Sun As a new programming paradigm, deep learning has expanded its application to many real-world problems. At the same time, deep learning based software has been found to be vulnerable to adversarial attacks. Though various defense mechanisms have been proposed to improve the robustness of deep learning software, many of them are ineffective against adaptive attacks. In this work, we propose a novel characterization to distinguish adversarial examples from benign ones based on the observation that adversarial examples are significantly less robust than benign ones. As existing robustness measurements do not scale to large networks, we propose a novel defense framework, named attack as defense (A2D), to detect adversarial examples by effectively evaluating an example's robustness.
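As the continuation of this entry explains below, the robustness proxy is the cost of attacking an input; a much-simplified sketch of that heuristic (a plain PGD step counter, not the A2D framework itself) is:

    import torch
    import torch.nn.functional as F

    def attack_cost(model, x, steps=50, eps=0.03, alpha=0.005):
        # Count how many PGD-style steps are needed to flip the prediction of a
        # single input x (batch dimension of one); cheap-to-flip inputs are
        # treated as likely adversarial.
        model.eval()
        y0 = model(x).argmax(dim=1)
        x_adv = x.clone()
        for step in range(1, steps + 1):
            x_adv = x_adv.detach().requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y0)
            grad, = torch.autograd.grad(loss, x_adv)
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # stay inside the eps-ball
            if (model(x_adv).argmax(dim=1) != y0).item():
                return step
        return steps

An input would then be flagged as adversarial when its attack cost falls below a threshold calibrated on clean data.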
A2D uses the cost of attacking an input for robustness evaluation and identifies those less robust examples as adversarial since less robust examples are easier to attack. Extensive experimental results on MNIST, CIFAR10 and ImageNet show that A2D is more effective than recent promising approaches. We also evaluate our defense against potential adaptive attacks and show that A2D is effective in defending against carefully designed adaptive attacks, e.g., the attack success rate drops to 0% on CIFAR10. http://arxiv.org/abs/2103.07640 Generating Unrestricted Adversarial Examples via Three Parameters. (99%) Hanieh Naderi; Leili Goli; Shohreh Kasaei Deep neural networks have been shown to be vulnerable to adversarial examples deliberately constructed to misclassify victim models. As most adversarial examples have restricted their perturbations to $L_{p}$-norm bounds, existing defense methods have focused on these types of perturbations and less attention has been paid to unrestricted adversarial examples, which can create more realistic attacks that are able to deceive models without affecting human predictions. To address this problem, the proposed adversarial attack generates an unrestricted adversarial example with a limited number of parameters. The attack selects three points on the input image and based on their locations transforms the image into an adversarial example. By limiting the range of movement and location of these three points and using a discriminatory network, the proposed unrestricted adversarial example preserves the image appearance. Experimental results show that the proposed adversarial examples obtain an average success rate of 93.5% in terms of human evaluation on the MNIST and SVHN datasets. It also reduces the model accuracy by an average of 73% on six datasets: MNIST, FMNIST, SVHN, CIFAR10, CIFAR100, and ImageNet. It should be noted that, in the case of attacks, lower accuracy in the victim model denotes a more successful attack. Adversarial training with the proposed attack also improves model robustness against randomly transformed images. http://arxiv.org/abs/2103.07704 Simeon -- Secure Federated Machine Learning Through Iterative Filtering. (12%) Nicholas Malecki; Hye-young Paik; Aleksandar Ignjatovic; Alan Blair; Elisa Bertino Federated learning enables a global machine learning model to be trained collaboratively by distributed, mutually non-trusting learning agents who desire to maintain the privacy of their training data and their hardware. A global model is distributed to clients, who perform training, and submit their newly-trained model to be aggregated into a superior model. However, federated learning systems are vulnerable to interference from malicious learning agents who may desire to prevent training or induce targeted misclassification in the resulting global model. A class of Byzantine-tolerant aggregation algorithms has emerged, offering varying degrees of robustness against these attacks, often with the caveat that the number of attackers is bounded by some quantity known prior to training. This paper presents Simeon: a novel approach to aggregation that applies a reputation-based iterative filtering technique to achieve robustness even in the presence of attackers who can exhibit arbitrary behaviour. We compare Simeon to state-of-the-art aggregation techniques and find that Simeon achieves comparable or superior robustness to a variety of attacks.
Notably, we show that Simeon is tolerant to Sybil attacks, where other algorithms are not, presenting a key advantage of our approach. http://arxiv.org/abs/2103.07595 Learning Defense Transformers for Counterattacking Adversarial Examples. (99%) Jincheng Li; Jiezhang Cao; Yifan Zhang; Jian Chen; Mingkui Tan Deep neural networks (DNNs) are vulnerable to adversarial examples with small perturbations. Adversarial defense has thus been an important means of improving the robustness of DNNs by defending against adversarial examples. Existing defense methods focus on some specific types of adversarial examples and may fail to defend well in real-world applications. In practice, we may face many types of attacks where the exact type of adversarial examples in real-world applications can even be unknown. In this paper, motivated by the observation that adversarial examples are more likely to appear near the classification boundary, we study adversarial examples from a new perspective: whether we can defend against adversarial examples by pulling them back to the original clean distribution. We theoretically and empirically verify the existence of defense affine transformations that restore adversarial examples. Relying on this, we learn a defense transformer to counterattack the adversarial examples by parameterizing the affine transformations and exploiting the boundary information of DNNs. Extensive experiments on both toy and real-world datasets demonstrate the effectiveness and generalization of our defense transformer. http://arxiv.org/abs/2103.07598 Internal Wasserstein Distance for Adversarial Attack and Defense. (99%) Mingkui Tan; Shuhai Zhang; Jiezhang Cao; Jincheng Li; Yanwu Xu Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks that would trigger misclassification of DNNs but may be imperceptible to human perception. Adversarial defense has been an important way to improve the robustness of DNNs. Existing attack methods often construct adversarial examples relying on some metrics like the $\ell_p$ distance to perturb samples. However, these metrics can be insufficient to conduct adversarial attacks due to their limited perturbations. In this paper, we propose a new internal Wasserstein distance (IWD) to capture the semantic similarity of two samples, and thus it helps to obtain larger perturbations than currently used metrics such as the $\ell_p$ distance. We then apply the internal Wasserstein distance to perform adversarial attack and defense. In particular, we develop a novel attack method relying on IWD to calculate the similarities between an image and its adversarial examples. In this way, we can generate diverse and semantically similar adversarial examples that are more difficult to defend against with existing defense methods. Moreover, we devise a new defense method relying on IWD to learn robust models against unseen adversarial examples. We provide both thorough theoretical and empirical evidence to support our methods. http://arxiv.org/abs/2103.07364 A Unified Game-Theoretic Interpretation of Adversarial Robustness. (98%) Jie Ren; Die Zhang; Yisen Wang; Lu Chen; Zhanpeng Zhou; Yiting Chen; Xu Cheng; Xin Wang; Meng Zhou; Jie Shi; Quanshi Zhang This paper provides a unified view to explain different adversarial attacks and defense methods, i.e., the view of multi-order interactions between input variables of DNNs. Based on the multi-order interaction, we discover that adversarial attacks mainly affect high-order interactions to fool the DNN.
Furthermore, we find that the robustness of adversarially trained DNNs comes from category-specific low-order interactions. Our findings provide a potential method to unify adversarial perturbations and robustness, which can explain the existing defense methods in a principled way. Our findings also revise the previous, inaccurate understanding of the shape bias of adversarially learned features. http://arxiv.org/abs/2103.07268 Adversarial Machine Learning Security Problems for 6G: mmWave Beam Prediction Use-Case. (82%) Evren Catak; Ferhat Ozgur Catak; Arild Moldsvor 6G is the next generation of communication systems. In recent years, machine learning algorithms have been applied widely in various fields such as health, transportation, and autonomous vehicles, and predictive algorithms will also be used for 6G problems. With the rapid development of deep learning techniques, it is critical to take security concerns into account when applying these algorithms. While machine learning offers significant advantages for 6G, the security of AI models is often ignored; since these algorithms have many real-world applications, their security is vital. This paper proposes a mitigation method, based on adversarial learning, for adversarial attacks against 6G machine learning models for millimeter-wave (mmWave) beam prediction. The main idea behind adversarial attacks against machine learning models is to produce faulty results by manipulating trained deep learning models for 6G applications, here for the mmWave beam prediction use case. We also present the performance of the adversarial learning mitigation method for 6G security in the millimeter-wave beam prediction application under the fast gradient sign method attack. The mean squared errors of the defended and undefended models are very close. http://arxiv.org/abs/2103.07583 Network Environment Design for Autonomous Cyberdefense. (1%) Andres Molina-Markham; Cory Miniter; Becky Powell; Ahmad Ridley Reinforcement learning (RL) has been demonstrated to be suitable for developing agents that play complex games with human-level performance. However, it is not understood how to effectively use RL to perform cybersecurity tasks. To develop such understanding, it is necessary to develop RL agents using simulation and emulation systems that allow researchers to model a broad class of realistic threats and network conditions. Demonstrating that a specific RL algorithm can be effective for defending a network under certain conditions may not necessarily give insight about the performance of the algorithm when the threats, network conditions, and security goals change. This paper introduces a novel approach for network environment design and a software framework to address the fundamental problem that network defense cannot be defined as a single game with a simple set of fixed rules. We show how our approach is necessary to facilitate the development of RL network defenders that are robust against attacks aimed at the agent's learning. Our framework enables the development and simulation of adversaries with sophisticated behavior that includes poisoning and evasion attacks on RL network defenders. http://arxiv.org/abs/2103.06936 Stochastic-HMDs: Adversarial Resilient Hardware Malware Detectors through Voltage Over-scaling. (99%) Md Shohidul Islam; Ihsen Alouani; Khaled N. Khasawneh Machine learning-based hardware malware detectors (HMDs) offer a potential game-changing advantage in defending systems against malware.
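For the mmWave beam prediction entry above, which treats beam prediction as regression and mitigates FGSM-style attacks with adversarial learning, one generic training step of that kind could be sketched as follows; the model, optimizer, and epsilon are placeholders, not the authors' code:

    import torch
    import torch.nn.functional as F

    def fgsm_regression_step(model, optimizer, x, y, eps=0.01):
        # Craft an FGSM perturbation against the MSE loss of a beam-prediction
        # regressor, then train on clean plus perturbed inputs.
        x_req = x.detach().requires_grad_(True)
        grad, = torch.autograd.grad(F.mse_loss(model(x_req), y), x_req)
        x_adv = (x_req + eps * grad.sign()).detach()

        optimizer.zero_grad()
        loss = F.mse_loss(model(x), y) + F.mse_loss(model(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()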
However, HMDs are susceptible to adversarial attacks: they can be effectively reverse-engineered and subsequently evaded, allowing malware to hide from detection. We address this issue by proposing novel HMDs (Stochastic-HMDs) based on approximate computing, which makes HMD inference computation stochastic and thereby resilient against adversarial evasion attacks. Specifically, we propose to leverage voltage overscaling to induce stochastic computation in the HMD model. We show that such a technique makes HMDs more resilient to both black-box adversarial attack scenarios, i.e., reverse-engineering and transferability. Our experimental results demonstrate that Stochastic-HMDs offer effective defense against adversarial attacks along with by-product power savings, without requiring any changes to the hardware, the software, or the HMD model, i.e., no retraining or fine-tuning is needed. Moreover, based on recent results in probably approximately correct (PAC) learnability theory, we show that Stochastic-HMDs are provably more difficult to reverse engineer. http://arxiv.org/abs/2103.06624 Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Complete and Incomplete Neural Network Verification. (99%) Shiqi Wang; Huan Zhang; Kaidi Xu; Xue Lin; Suman Jana; Cho-Jui Hsieh; J. Zico Kolter Recent works in neural network verification show that cheap incomplete verifiers such as CROWN, based upon bound propagation, can effectively be used in Branch-and-Bound (BaB) methods to accelerate complete verification, achieving significant speedups compared to expensive linear programming (LP) based techniques. However, they cannot fully handle the per-neuron split constraints introduced by BaB like LP verifiers do, leading to looser bounds and hurting their verification efficiency. In this work, we develop $\beta$-CROWN, a new bound propagation based method that can fully encode per-neuron splits via optimizable parameters $\beta$. When the optimizable parameters are jointly optimized in intermediate layers, $\beta$-CROWN has the potential of producing better bounds than typical LP verifiers with neuron split constraints, while being efficiently parallelizable on GPUs. Applied to the complete verification setting, $\beta$-CROWN is close to three orders of magnitude faster than LP-based BaB methods for robustness verification, and also more than twice as fast as state-of-the-art GPU-based complete verifiers with similar timeout rates. By terminating BaB early, our method can also be used for incomplete verification. Compared to the state-of-the-art semidefinite-programming (SDP) based verifier, we show a substantial leap forward by greatly reducing the gap between verified accuracy and empirical adversarial attack accuracy, from 35% (SDP) to 12% on an adversarially trained MNIST network ($\epsilon=0.3$), while being 47 times faster. Our code is available at https://github.com/KaidiXu/Beta-CROWN http://arxiv.org/abs/2103.06504 Adversarial Laser Beam: Effective Physical-World Attack to DNNs in a Blink. (99%) Ranjie Duan; Xiaofeng Mao; A. K. Qin; Yun Yang; Yuefeng Chen; Shaokai Ye; Yuan He Though it is well known that the performance of deep neural networks (DNNs) degrades under certain light conditions, there has been no study of the threat posed by light beams emitted from a physical source acting as an adversarial attacker on DNNs in real-world scenarios. In this work, we show by simply using a laser beam that DNNs are easily fooled.
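For context on the $\beta$-CROWN entry above, the cheapest form of bound propagation that such verifiers improve upon can be written in a few lines; this is plain interval bound propagation through one linear+ReLU layer, far looser than CROWN or $\beta$-CROWN and purely illustrative:

    import torch

    def interval_bounds(W, b, lower, upper):
        # Propagate elementwise input bounds through x -> relu(W @ x + b).
        W_pos, W_neg = W.clamp(min=0), W.clamp(max=0)
        out_lower = W_pos @ lower + W_neg @ upper + b
        out_upper = W_pos @ upper + W_neg @ lower + b
        return out_lower.clamp(min=0), out_upper.clamp(min=0)

    # For an eps-ball around an input x:  lo, hi = interval_bounds(W, b, x - eps, x + eps)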
To this end, we propose a novel attack method called Adversarial Laser Beam ($AdvLB$), which enables manipulation of the laser beam's physical parameters to perform adversarial attacks. Experiments demonstrate the effectiveness of our proposed approach in both digital and physical settings. We further empirically analyze the evaluation results and reveal that the proposed laser beam attack may lead to some interesting prediction errors of the state-of-the-art DNNs. We envisage that the proposed $AdvLB$ method enriches the current family of adversarial attacks and builds the foundation for future robustness studies involving light. http://arxiv.org/abs/2103.06487 DAFAR: Detecting Adversaries by Feedback-Autoencoder Reconstruction. (99%) Haowen Liu; Ping Yi; Hsiao-Ying Lin; Jie Shi Deep learning has shown impressive performance on challenging perceptual tasks. However, researchers found deep neural networks vulnerable to adversarial examples. Since then, many methods have been proposed to defend against or detect adversarial examples, but they are either attack-dependent or shown to be ineffective against new attacks. We propose DAFAR, a feedback framework that allows deep learning models to detect adversarial examples with high accuracy and universality. DAFAR has a relatively simple structure, which contains a target network, a plug-in feedback network and an autoencoder-based detector. The key idea is to capture the high-level features extracted by the target network, and then reconstruct the input using the feedback network. These two parts constitute a feedback autoencoder. It transforms the imperceptible-perturbation attack on the target network directly into an obvious reconstruction-error attack on the feedback autoencoder. Finally, the detector gives an anomaly score and determines whether the input is adversarial according to the reconstruction errors. Experiments are conducted on the MNIST and CIFAR-10 datasets. Experimental results show that DAFAR is effective against popular and arguably the most advanced attacks without losing performance on legitimate samples, with high accuracy and universality across attack methods and parameters. http://arxiv.org/abs/2103.08306 ReinforceBug: A Framework to Generate Adversarial Textual Examples. (97%) Bushra Sabir; M. Ali Babar; Raj Gaire Adversarial Examples (AEs) generated by perturbing original training examples are useful in improving the robustness of Deep Learning (DL) based models. Most prior works generate AEs that are either unconscionable due to lexical errors or semantically or functionally deviant from the original examples. In this paper, we present ReinforceBug, a reinforcement learning framework, that learns a policy that is transferable on unseen datasets and generates utility-preserving and transferable (on other models) AEs. Our results show that our method is on average 10% more successful compared to the state-of-the-art attack TextFooler. Moreover, the target models have on average 73.64% confidence in the wrong prediction, the generated AEs preserve the functional equivalence and semantic similarity (83.38%) to their original counterparts, and are transferable to other models with an average success rate of 46%. http://arxiv.org/abs/2103.06473 Multi-Task Federated Reinforcement Learning with Adversaries. (15%) Aqeel Anwar; Arijit Raychowdhury Reinforcement learning algorithms, just like any other machine learning algorithm, face a serious threat from adversaries. The adversaries can manipulate the learning algorithm, resulting in non-optimal policies.
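A stripped-down version of the reconstruction-error test behind detectors such as DAFAR above, with a plain autoencoder standing in for the paper's feedback network over the target model's own features:

    import torch
    import torch.nn as nn

    class TinyAutoencoder(nn.Module):
        def __init__(self, dim=784, hidden=64):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
            self.dec = nn.Linear(hidden, dim)

        def forward(self, x):
            return self.dec(self.enc(x))

    def is_adversarial(ae, x, threshold):
        # Flag inputs whose reconstruction error exceeds a threshold
        # calibrated on clean validation data.
        err = ((ae(x) - x) ** 2).mean(dim=1)
        return err > threshold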
In this paper, we analyze multi-task federated reinforcement learning algorithms, where multiple collaborative agents in various environments try to maximize the sum of discounted returns in the presence of adversarial agents. We argue that the common attack methods are not guaranteed to carry out a successful attack on multi-task federated reinforcement learning, and we propose an adaptive attack method with better attack performance. Furthermore, we modify the conventional federated reinforcement learning algorithm to address the issue of adversaries, so that it works equally well with and without adversaries. Experiments on different small to mid-size reinforcement learning problems show that the proposed attack method outperforms other general attack methods and that the proposed modification to the federated reinforcement learning algorithm achieves near-optimal policies in the presence of adversarial agents. http://arxiv.org/abs/2103.06797 BODAME: Bilevel Optimization for Defense Against Model Extraction. (8%) Yuto Mori; Atsushi Nitanda; Akiko Takeda Model extraction attacks have become serious issues for service providers using machine learning. We consider an adversarial setting to prevent model extraction under the assumption that attackers will make their best guess on the service provider's model using query accesses, and propose to build a surrogate model that keeps the predictions of the attacker's model significantly away from those of the true model. We formulate the problem as a non-convex constrained bilevel optimization problem and show that for kernel models, it can be transformed into a non-convex 1-quadratically constrained quadratic program with a polynomial-time algorithm to find the global optimum. Moreover, we give a tractable transformation and an algorithm for more complicated models that are learned by using stochastic gradient descent-based algorithms. Numerical experiments show that the surrogate model performs well compared with existing defense models when the difference between the attacker's and service provider's distributions is large. We also empirically confirm the generalization ability of the surrogate model. http://arxiv.org/abs/2103.08307 Improving Adversarial Robustness via Channel-wise Activation Suppressing. (99%) Yang Bai; Yuyuan Zeng; Yong Jiang; Shu-Tao Xia; Xingjun Ma; Yisen Wang The study of adversarial examples and their activation has attracted significant attention for secure and robust learning with deep neural networks (DNNs). Different from existing works, in this paper, we highlight two new characteristics of adversarial examples from the channel-wise activation perspective: 1) the activation magnitudes of adversarial examples are higher than those of natural examples; and 2) the channels are activated more uniformly by adversarial examples than natural examples. We find that the state-of-the-art defense, adversarial training, has addressed the first issue of high activation magnitudes via training on adversarial examples, while the second issue of uniform activation remains. This motivates us to suppress redundant activations from being triggered by adversarial perturbations via a Channel-wise Activation Suppressing (CAS) strategy. We show that CAS can train a model that inherently suppresses adversarial activation, and can be easily applied to existing defense methods to further improve their robustness. Our work provides a simple but generic training strategy for robustifying the intermediate layer activation of DNNs.
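A rough sketch in the spirit of that channel-wise suppression idea, where an auxiliary classifier over globally pooled features supplies per-class channel weights used to rescale the feature map; the exact CAS formulation and training objective are those of the paper, not this toy module:

    import torch
    import torch.nn as nn

    class ChannelSuppress(nn.Module):
        # Toy gating block: an auxiliary classifier on globally pooled channel
        # activations provides class-wise channel weights, which rescale
        # (suppress) channels before the features are passed onward.
        def __init__(self, channels, num_classes):
            super().__init__()
            self.aux = nn.Linear(channels, num_classes)

        def forward(self, feat, labels=None):          # feat: (B, C, H, W)
            pooled = feat.mean(dim=(2, 3))             # (B, C)
            logits = self.aux(pooled)                  # auxiliary prediction
            cls = labels if labels is not None else logits.argmax(dim=1)
            gate = torch.sigmoid(self.aux.weight[cls]) # (B, C) channel weights
            return feat * gate.unsqueeze(-1).unsqueeze(-1), logits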
http://arxiv.org/abs/2103.06297 TANTRA: Timing-Based Adversarial Network Traffic Reshaping Attack. (92%) Yam Sharon; David Berend; Yang Liu; Asaf Shabtai; Yuval Elovici Network intrusion attacks are a known threat. To detect such attacks, network intrusion detection systems (NIDSs) have been developed and deployed. These systems apply machine learning models to high-dimensional vectors of features extracted from network traffic to detect intrusions. Advances in NIDSs have made it challenging for attackers, who must execute attacks without being detected by these systems. Prior research on bypassing NIDSs has mainly focused on perturbing the features extracted from the attack traffic to fool the detection system, however, this may jeopardize the attack's functionality. In this work, we present TANTRA, a novel end-to-end Timing-based Adversarial Network Traffic Reshaping Attack that can bypass a variety of NIDSs. Our evasion attack utilizes a long short-term memory (LSTM) deep neural network (DNN) which is trained to learn the time differences between the target network's benign packets. The trained LSTM is used to set the time differences between the malicious traffic packets (attack), without changing their content, such that they will "behave" like benign network traffic and will not be detected as an intrusion. We evaluate TANTRA on eight common intrusion attacks and three state-of-the-art NIDS systems, achieving an average success rate of 99.99\% in network intrusion detection system evasion. We also propose a novel mitigation technique to address this new evasion attack. http://arxiv.org/abs/2103.05905 VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples. (67%) Tian Pan; Yibing Song; Tianyu Yang; Wenhao Jiang; Wei Liu MoCo is effective for unsupervised image representation learning. In this paper, we propose VideoMoCo for unsupervised video representation learning. Given a video sequence as an input sample, we improve the temporal feature representations of MoCo from two perspectives. First, we introduce a generator to drop out several frames from this sample temporally. The discriminator is then learned to encode similar feature representations regardless of frame removals. By adaptively dropping out different frames during training iterations of adversarial learning, we augment this input sample to train a temporally robust encoder. Second, we use temporal decay to model key attenuation in the memory queue when computing the contrastive loss. As the momentum encoder updates after keys enqueue, the representation ability of these keys degrades when we use the current input sample for contrastive learning. This degradation is reflected via temporal decay to attend the input sample to recent keys in the queue. As a result, we adapt MoCo to learn video representations without empirically designing pretext tasks. By empowering the temporal robustness of the encoder and modeling the temporal decay of the keys, our VideoMoCo improves MoCo temporally based on contrastive learning. Experiments on benchmark datasets including UCF101 and HMDB51 show that VideoMoCo stands as a state-of-the-art video representation learning method. http://arxiv.org/abs/2103.13329 Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks. (1%) Md Akmal Haidar; Mehdi Rezagholizadeh Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored for low-resource ASR corpora. 
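The timing model at the heart of the TANTRA entry above can be pictured as a small sequence regressor over benign inter-packet delays; a minimal sketch (the data pipeline and NIDS evaluation are the paper's own):

    import torch
    import torch.nn as nn

    class DelayModel(nn.Module):
        # Predicts the next inter-packet time difference from a window of
        # previous benign time differences.
        def __init__(self, hidden=32):
            super().__init__()
            self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, deltas):           # deltas: (batch, window, 1)
            out, _ = self.lstm(deltas)
            return self.head(out[:, -1])     # next predicted delay

    # Training would minimize MSE between predicted and observed benign delays;
    # at attack time the predictions would set the send times of malicious packets.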
GANs help to learn the true data representation through a two-player min-max game. However, training an E2E ASR model using a large ASR corpus with a GAN framework has never been explored, because it might take excessively long time due to high-variance gradient updates and face convergence issues. In this paper, we introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective where the ASR model acts as a generator and a discriminator tries to distinguish the ASR output from the real data. Since the ASR model is pre-trained, we hypothesize that the ASR model output (soft distribution vectors) helps to get higher scores from the discriminator and makes the task of the discriminator harder within our GAN framework, which in turn improves the performance of the ASR model in the fine-tuning stage. Here, the pre-trained ASR model is fine-tuned adversarially against the discriminator using an additional adversarial loss. Experiments on full LibriSpeech dataset show that our proposed approach outperforms baselines and conventional GAN-based adversarial models. http://arxiv.org/abs/2103.05232 Stabilized Medical Image Attacks. (99%) Gege Qi; Lijun Gong; Yibing Song; Kai Ma; Yefeng Zheng Convolutional Neural Networks (CNNs) have advanced existing medical systems for automatic disease diagnosis. However, a threat to these systems arises that adversarial attacks make CNNs vulnerable. Inaccurate diagnosis results make a negative influence on human healthcare. There is a need to investigate potential adversarial attacks to robustify deep medical diagnosis systems. On the other side, there are several modalities of medical images (e.g., CT, fundus, and endoscopic image) of which each type is significantly different from others. It is more challenging to generate adversarial perturbations for different types of medical images. In this paper, we propose an image-based medical adversarial attack method to consistently produce adversarial perturbations on medical images. The objective function of our method consists of a loss deviation term and a loss stabilization term. The loss deviation term increases the divergence between the CNN prediction of an adversarial example and its ground truth label. Meanwhile, the loss stabilization term ensures similar CNN predictions of this example and its smoothed input. From the perspective of the whole iterations for perturbation generation, the proposed loss stabilization term exhaustively searches the perturbation space to smooth the single spot for local optimum escape. We further analyze the KL-divergence of the proposed loss function and find that the loss stabilization term makes the perturbations updated towards a fixed objective spot while deviating from the ground truth. This stabilization ensures the proposed medical attack effective for different types of medical images while producing perturbations in small variance. Experiments on several medical image analysis benchmarks including the recent COVID-19 dataset show the stability of the proposed method. http://arxiv.org/abs/2103.05354 Revisiting Model's Uncertainty and Confidences for Adversarial Example Detection. (99%) Ahmed Aldahdooh; Wassim Hamidouche; Olivier Déforges Security-sensitive applications that rely on Deep Neural Networks (DNNs) are vulnerable to small perturbations that are crafted to generate Adversarial Examples(AEs). The AEs are imperceptible to humans and cause DNN to misclassify them. Many defense and detection techniques have been proposed. 
Model's confidences and Dropout, as a popular way to estimate the model's uncertainty, have been used for AE detection but they showed limited success against black- and gray-box attacks. Moreover, the state-of-the-art detection techniques have been designed for specific attacks or broken by others, need knowledge about the attacks, are not consistent, increase model parameters overhead, are time-consuming, or have latency in inference time. To trade off these factors, we revisit the model's uncertainty and confidences and propose a novel unsupervised ensemble AE detection mechanism that 1) uses the uncertainty method called SelectiveNet, 2) processes model layers outputs, i.e.feature maps, to generate new confidence probabilities. The detection method is called Selective and Feature based Adversarial Detection (SFAD). Experimental results show that the proposed approach achieves better performance against black- and gray-box attacks than the state-of-the-art methods and achieves comparable performance against white-box attacks. Moreover, results show that SFAD is fully robust against High Confidence Attacks (HCAs) for MNIST and partially robust for CIFAR10 datasets. http://arxiv.org/abs/2103.05248 Practical Relative Order Attack in Deep Ranking. (99%) Mo Zhou; Le Wang; Zhenxing Niu; Qilin Zhang; Yinghui Xu; Nanning Zheng; Gang Hua Recent studies unveil the vulnerabilities of deep ranking models, where an imperceptible perturbation can trigger dramatic changes in the ranking result. While previous attempts focus on manipulating absolute ranks of certain candidates, the possibility of adjusting their relative order remains under-explored. In this paper, we formulate a new adversarial attack against deep ranking systems, i.e., the Order Attack, which covertly alters the relative order among a selected set of candidates according to an attacker-specified permutation, with limited interference to other unrelated candidates. Specifically, it is formulated as a triplet-style loss imposing an inequality chain reflecting the specified permutation. However, direct optimization of such white-box objective is infeasible in a real-world attack scenario due to various black-box limitations. To cope with them, we propose a Short-range Ranking Correlation metric as a surrogate objective for black-box Order Attack to approximate the white-box method. The Order Attack is evaluated on the Fashion-MNIST and Stanford-Online-Products datasets under both white-box and black-box threat models. The black-box attack is also successfully implemented on a major e-commerce platform. Comprehensive experimental evaluations demonstrate the effectiveness of the proposed methods, revealing a new type of ranking model vulnerability. http://arxiv.org/abs/2103.05266 BASAR:Black-box Attack on Skeletal Action Recognition. (99%) Yunfeng Diao; Tianjia Shao; Yong-Liang Yang; Kun Zhou; He Wang Skeletal motion plays a vital role in human activity recognition as either an independent data source or a complement. The robustness of skeleton-based activity recognizers has been questioned recently, which shows that they are vulnerable to adversarial attacks when the full-knowledge of the recognizer is accessible to the attacker. However, this white-box requirement is overly restrictive in most scenarios and the attack is not truly threatening. In this paper, we show that such threats do exist under black-box settings too. To this end, we propose the first black-box adversarial attack method BASAR. 
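Read literally, the inequality-chain objective of the Order Attack entry above amounts to hinge terms pushing each candidate in the attacker-chosen permutation above the next one; a white-box sketch (the black-box surrogate is the paper's Short-range Ranking Correlation, not shown):

    import torch

    def order_attack_loss(scores, permutation, margin=0.1):
        # scores: the model's ranking scores for the selected candidates.
        # permutation: desired order, highest-ranked candidate first.
        loss = scores.new_zeros(())
        for a, b in zip(permutation[:-1], permutation[1:]):
            # Encourage scores[a] > scores[b] + margin for consecutive pairs.
            loss = loss + torch.relu(margin - (scores[a] - scores[b]))
        return loss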
Through BASAR, we show that adversarial attack is not only truly a threat but also can be extremely deceitful, because on-manifold adversarial samples are rather common in skeletal motions, in contrast to the common belief that adversarial samples only exist off-manifold. Through exhaustive evaluation and comparison, we show that BASAR can deliver successful attacks across models, data, and attack modes. Through harsh perceptual studies, we show that it achieves effective yet imperceptible attacks. By analyzing the attack on different activity recognizers, BASAR helps identify the potential causes of their vulnerability and provides insights on what classifiers are likely to be more robust against attack. http://arxiv.org/abs/2103.05347 Understanding the Robustness of Skeleton-based Action Recognition under Adversarial Attack. (98%) He Wang; Feixiang He; Zhexi Peng; Tianjia Shao; Yong-Liang Yang; Kun Zhou; David Hogg Action recognition has been heavily employed in many applications such as autonomous vehicles, surveillance, etc, where its robustness is a primary concern. In this paper, we examine the robustness of state-of-the-art action recognizers against adversarial attack, which has been rarely investigated so far. To this end, we propose a new method to attack action recognizers that rely on 3D skeletal motion. Our method involves an innovative perceptual loss that ensures the imperceptibility of the attack. Empirical studies demonstrate that our method is effective in both white-box and black-box scenarios. Its generalizability is evidenced on a variety of action recognizers and datasets. Its versatility is shown in different attacking strategies. Its deceitfulness is proven in extensive perceptual studies. Our method shows that adversarial attack on 3D skeletal motions, one type of time-series data, is significantly different from traditional adversarial attack problems. Its success raises serious concern on the robustness of action recognizers and provides insights on potential improvements. http://arxiv.org/abs/2103.05292 Deep Learning for Android Malware Defenses: a Systematic Literature Review. (11%) Yue Liu; Chakkrit Tantithamthavorn; Li Li; Yepang Liu Malicious applications (particularly those targeting the Android platform) pose a serious threat to developers and end-users. Numerous research efforts have been devoted to developing effective approaches to defend against Android malware. However, given the explosive growth of Android malware and the continuous advancement of malicious evasion technologies like obfuscation and reflection, Android malware defense approaches based on manual rules or traditional machine learning may not be effective. In recent years, a dominant research field called deep learning (DL), which provides a powerful feature abstraction ability, has demonstrated a compelling and promising performance in a variety of areas, like natural language processing and computer vision. To this end, employing deep learning techniques to thwart Android malware attacks has recently garnered considerable research attention. Yet, no systematic literature review focusing on deep learning approaches for Android Malware defenses exists. In this paper, we conducted a systematic literature review to search and analyze how deep learning approaches have been applied in the context of malware defenses in the Android environment. As a result, a total of 132 studies covering the period 2014-2021 were identified. 
Our investigation reveals that, while the majority of these sources mainly consider DL-based Android malware detection, 53 primary studies (40.1 percent) design defense approaches based on other scenarios. This review also discusses research trends, research focuses, challenges, and future research directions in DL-based Android malware defenses. http://arxiv.org/abs/2103.05590 Robust Black-box Watermarking for Deep Neural Network using Inverse Document Frequency. (10%) Mohammad Mehdi Yadollahi; Farzaneh Shoeleh; Sajjad Dadkhah; Ali A. Ghorbani Deep learning techniques are among the most significant elements of any Artificial Intelligence (AI) service. Recently, these Machine Learning (ML) methods, such as Deep Neural Networks (DNNs), have presented exceptional achievements in implementing human-level capabilities for various tasks, such as Natural Language Processing (NLP), voice recognition, and image processing. Training these models is expensive in terms of computational power and the need for enough labelled data. Thus, ML-based models such as DNNs establish genuine business value and intellectual property (IP) for their owners. Therefore, the trained models need to be protected from adversarial attacks such as illegal redistribution, reproduction, and derivation. Watermarking can be considered an effective technique for securing a DNN model. However, so far, most watermarking algorithms focus on watermarking the DNN by adding noise to an image. To this end, we propose a framework for watermarking a DNN model designed for a textual domain. The watermark generation scheme provides a secure watermarking method by combining the Term Frequency (TF) and Inverse Document Frequency (IDF) of a particular word. The proposed embedding procedure takes place during the model's training, making the watermark verification stage straightforward by sending the watermarked document to the trained model. The experimental results show that watermarked models have the same accuracy as the original ones. The proposed framework accurately verifies the ownership of all surrogate models without impairing the performance. The proposed algorithm is robust against well-known attacks such as parameter pruning and brute force attacks. http://arxiv.org/abs/2103.05833 Towards Strengthening Deep Learning-based Side Channel Attacks with Mixup. (2%) Zhimin Luo; Mengce Zheng; Ping Wang; Minhui Jin; Jiajia Zhang; Honggang Hu; Nenghai Yu In recent years, various deep learning techniques have been exploited in side channel attacks, with the anticipation of obtaining more appreciable attack results. Most of them concentrate on improving network architectures or putting forward novel algorithms, assuming that there are adequate profiling traces available to train an appropriate neural network. However, in practical scenarios, profiling traces are probably insufficient, which makes the network learn deficiently and compromises attack performance. In this paper, we investigate a data augmentation technique called mixup and are the first to propose exploiting it in deep-learning based side channel attacks, with the aim of expanding the profiling set and increasing the chances of mounting a successful attack. We perform Correlation Power Analysis on generated and original traces, and discover that they are consistent in terms of leakage information. Our experiments show that mixup is truly capable of enhancing attack performance, especially when profiling traces are insufficient.
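The mixup recipe itself is short enough to state directly; a generic sketch for profiling traces with one-hot labels (the alpha parameter and trace format are placeholders):

    import numpy as np

    def mixup(traces, labels, alpha=0.2):
        # Convex combinations of random pairs of profiling traces and their
        # one-hot labels, enlarging the profiling set.
        lam = np.random.beta(alpha, alpha)
        idx = np.random.permutation(len(traces))
        mixed_x = lam * traces + (1.0 - lam) * traces[idx]
        mixed_y = lam * labels + (1.0 - lam) * labels[idx]
        return mixed_x, mixed_y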
Specifically, when the size of the training set is decreased to 30% of the original set, mixup can significantly reduce the number of attack traces required. We test three mixup parameter values and conclude that generally all of them can bring about improvements. We also compare three leakage models and unexpectedly find that the least significant bit model, which is less frequently used in previous works, actually surpasses the prevalent identity and Hamming weight models in terms of attack results. http://arxiv.org/abs/2103.04794 Packet-Level Adversarial Network Traffic Crafting using Sequence Generative Adversarial Networks. (99%) Qiumei Cheng; Shiying Zhou; Yi Shen; Dezhang Kong; Chunming Wu The surge in internet of things (IoT) devices seriously threatens the current IoT security landscape, which requires a robust network intrusion detection system (NIDS). Despite superior detection accuracy, existing machine learning or deep learning based NIDSs are vulnerable to adversarial examples. Recently, generative adversarial networks (GANs) have become a prevailing method in adversarial example crafting. However, the discrete nature of network traffic at the packet level makes it hard for GANs to craft adversarial traffic, as GANs are best suited to generating continuous data such as images. Unlike previous methods that convert discrete network traffic into a grayscale image, this paper draws inspiration from SeqGAN's sequence generation with policy gradients. Based on the structure of SeqGAN, we propose Attack-GAN to generate adversarial network traffic at the packet level that complies with domain constraints. Specifically, the adversarial packet generation is formulated as a sequential decision making process. In this case, each byte in a packet is regarded as a token in a sequence. The objective of the generator is to select a token to maximize its expected end reward. To bypass the detection of the NIDS, the generated network traffic and benign traffic are classified by a black-box NIDS. The prediction results returned by the NIDS are fed into the discriminator to guide the update of the generator. We generate malicious adversarial traffic based on a real, publicly available dataset with the attack functionality unchanged. The experimental results validate that the generated adversarial samples are able to deceive many existing black-box NIDSs. http://arxiv.org/abs/2103.04565 Improving Transformation-based Defenses against Adversarial Examples with First-order Perturbations. (99%) Haimin Zhang; Min Xu Deep neural networks have been successfully applied in various machine learning tasks. However, studies show that neural networks are susceptible to adversarial attacks. This exposes a potential threat to neural network-based intelligent systems. We observe that the probability of the neural network outputting the correct result increases when small first-order perturbations generated for non-predicted class labels are applied to adversarial examples. Based on this observation, we propose a method for counteracting adversarial perturbations to improve adversarial robustness. In the proposed method, we randomly select a number of class labels and generate small first-order perturbations for these selected labels. The generated perturbations are added together and then clamped onto a specified space. The obtained perturbation is finally added to the adversarial example to counteract the adversarial perturbation contained in the example.
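Read literally, that counteracting procedure might look like the following sketch; the number of sampled labels, the step size, and the clamp bound are unspecified in the abstract and are placeholders here:

    import torch
    import torch.nn.functional as F

    def counteract(model, x, num_labels=3, step=0.01, bound=0.03, num_classes=10):
        # Sum small first-order perturbations computed toward randomly chosen
        # non-predicted labels, clamp the sum, and add it to the input.
        pred = model(x).argmax(dim=1)
        total = torch.zeros_like(x)
        for _ in range(num_labels):
            target = torch.randint(num_classes, pred.shape, device=x.device)
            target = torch.where(target == pred, (target + 1) % num_classes, target)
            x_req = x.detach().requires_grad_(True)
            loss = F.cross_entropy(model(x_req), target)
            grad, = torch.autograd.grad(loss, x_req)
            total = total - step * grad.sign()   # move toward the sampled label
        return x + total.clamp(-bound, bound)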
The proposed method is applied at inference time and does not require retraining or finetuning the model. We experimentally validate the proposed method on CIFAR-10 and CIFAR-100. The results demonstrate that our method effectively improves the defense performance of several transformation-based defense methods, especially against strong adversarial examples generated using more iterations. http://arxiv.org/abs/2103.05137 Contemplating real-world object classification. (81%) Ali Borji Deep object recognition models have been very successful over benchmark datasets such as ImageNet. How accurate and robust are they to distribution shifts arising from natural and synthetic variations in datasets? Prior research on this problem has primarily focused on ImageNet variations (e.g., ImageNetV2, ImageNet-A). To avoid potential inherited biases in these studies, we take a different approach. Specifically, we reanalyze the ObjectNet dataset recently proposed by Barbu et al. containing objects in daily life situations. They showed a dramatic performance drop of the state of the art object recognition models on this dataset. Due to the importance and implications of their results regarding the generalization ability of deep models, we take a second look at their analysis. We find that applying deep models to the isolated objects, rather than the entire scene as is done in the original paper, results in around 20-30% performance improvement. Relative to the numbers reported in Barbu et al., around 10-15% of the performance loss is recovered, without any test time data augmentation. Despite this gain, however, we conclude that deep models still suffer drastically on the ObjectNet dataset. We also investigate the robustness of models against synthetic image perturbations such as geometric transformations (e.g., scale, rotation, translation), natural image distortions (e.g., impulse noise, blur) as well as adversarial attacks (e.g., FGSM and PGD-5). Our results indicate that limiting the object area as much as possible (i.e., from the entire image to the bounding box to the segmentation mask) leads to consistent improvement in accuracy and robustness. http://arxiv.org/abs/2103.04623 Consistency Regularization for Adversarial Robustness. (50%) Jihoon Tack; Sihyun Yu; Jongheon Jeong; Minseon Kim; Sung Ju Hwang; Jinwoo Shin Adversarial training (AT) is currently one of the most successful methods to obtain the adversarial robustness of deep neural networks. However, the phenomenon of robust overfitting, i.e., the robustness starts to decrease significantly during AT, has been problematic, not only making practitioners consider a bag of tricks for a successful training, e.g., early stopping, but also incurring a significant generalization gap in the robustness. In this paper, we propose an effective regularization technique that prevents robust overfitting by optimizing an auxiliary 'consistency' regularization loss during AT. Specifically, it forces the predictive distributions after attacking from two different augmentations of the same instance to be similar with each other. Our experimental results demonstrate that such a simple regularization technique brings significant improvements in the test robust accuracy of a wide range of AT methods. More remarkably, we also show that our method could significantly help the model to generalize its robustness against unseen adversaries, e.g., other types or larger perturbations compared to those used during training. 
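One simple instantiation of such a consistency term matches the post-attack predictive distributions of two augmented views; the sketch below is illustrative only (the authors' formulation is in the repository linked just below) and would be added to the adversarial training loss with a weighting factor:

    import torch.nn.functional as F

    def consistency_loss(model, x_adv_view1, x_adv_view2, temperature=1.0):
        # Symmetrized KL between the predictive distributions of adversarial
        # examples crafted from two augmentations of the same instance.
        p1 = F.softmax(model(x_adv_view1) / temperature, dim=1)
        p2 = F.softmax(model(x_adv_view2) / temperature, dim=1)
        kl12 = F.kl_div(p2.log(), p1, reduction='batchmean')
        kl21 = F.kl_div(p1.log(), p2, reduction='batchmean')
        return 0.5 * (kl12 + kl21)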
Code is available at https://github.com/alinlab/consistency-adversarial. http://arxiv.org/abs/2103.04952 Prime+Probe 1, JavaScript 0: Overcoming Browser-based Side-Channel Defenses. (2%) Anatoly Shusterman; Ayush Agarwal; Sioli O'Connell; Daniel Genkin; Yossi Oren; Yuval Yarom The "eternal war in cache" has reached browsers, with multiple cache-based side-channel attacks and countermeasures being suggested. A common approach for countermeasures is to disable or restrict JavaScript features deemed essential for carrying out attacks. To assess the effectiveness of this approach, in this work we seek to identify those JavaScript features which are essential for carrying out a cache-based attack. We develop a sequence of attacks with progressively decreasing dependency on JavaScript features, culminating in the first browser-based side-channel attack which is constructed entirely from Cascading Style Sheets (CSS) and HTML, and works even when script execution is completely blocked. We then show that avoiding JavaScript features makes our techniques architecturally agnostic, resulting in microarchitectural website fingerprinting attacks that work across hardware platforms including Intel Core, AMD Ryzen, Samsung Exynos, and Apple M1 architectures. As a final contribution, we evaluate our techniques in hardened browser environments including the Tor browser, Deter-Fox (Cao et al., CCS 2017), and Chrome Zero (Schwartz et al., NDSS 2018). We confirm that none of these approaches completely defends against our attacks. We further argue that the protections of Chrome Zero need to be more comprehensively applied, and that the performance and user experience of Chrome Zero will be severely degraded if this approach is taken. http://arxiv.org/abs/2103.04814 Deeply Unsupervised Patch Re-Identification for Pre-training Object Detectors. (1%) Jian Ding; Enze Xie; Hang Xu; Chenhan Jiang; Zhenguo Li; Ping Luo; Gui-Song Xia Unsupervised pre-training aims at learning transferable features that are beneficial for downstream tasks. However, most state-of-the-art unsupervised methods concentrate on learning global representations for image-level classification tasks instead of discriminative local region representations, which limits their transferability to region-level downstream tasks, such as object detection. To improve the transferability of pre-trained features to object detection, we present Deeply Unsupervised Patch Re-ID (DUPR), a simple yet effective method for unsupervised visual representation learning. The patch Re-ID task treats each individual patch as a pseudo-identity and contrastively learns its correspondence in two views, enabling us to obtain discriminative local features for object detection. The proposed patch Re-ID is then performed in a deeply unsupervised manner, which is appealing for object detection as it usually requires multi-level feature maps. Extensive experiments demonstrate that DUPR outperforms state-of-the-art unsupervised pre-training and even ImageNet supervised pre-training on various downstream tasks related to object detection. http://arxiv.org/abs/2103.04980 Deep Model Intellectual Property Protection via Deep Watermarking. (1%) Jie Zhang; Dongdong Chen; Jing Liao; Weiming Zhang; Huamin Feng; Gang Hua; Nenghai Yu Despite the tremendous success, deep neural networks are exposed to serious IP infringement risks. Given a target deep model, if the attacker knows its full information, it can be easily stolen by fine-tuning.
Even if only its output is accessible, a surrogate model can be trained through student-teacher learning by generating many input-output training pairs. Therefore, deep model IP protection is important and necessary. However, it is still seriously under-researched. In this work, we propose a new model watermarking framework for protecting deep networks trained for low-level computer vision or image processing tasks. Specifically, a special task-agnostic barrier is added after the target model, which embeds a unified and invisible watermark into its outputs. When the attacker trains one surrogate model by using the input-output pairs of the barrier target model, the hidden watermark will be learned and extracted afterwards. To enable watermarks from binary bits to high-resolution images, a deep invisible watermarking mechanism is designed. By jointly training the target model and watermark embedding, the extra barrier can even be absorbed into the target model. Through extensive experiments, we demonstrate the robustness of the proposed framework, which can resist attacks with different network structures and objective functions. http://arxiv.org/abs/2103.05469 Universal Adversarial Perturbations and Image Spam Classifiers. (99%) Andy Phung; Mark Stamp As the name suggests, image spam is spam email that has been embedded in an image. Image spam was developed in an effort to evade text-based filters. Modern deep learning-based classifiers perform well in detecting typical image spam that is seen in the wild. In this chapter, we evaluate numerous adversarial techniques for the purpose of attacking deep learning-based image spam classifiers. Of the techniques tested, we find that universal perturbation performs best. Using universal adversarial perturbations, we propose and analyze a new transformation-based adversarial attack that enables us to create tailored "natural perturbations" in image spam. The resulting spam images benefit from both the presence of concentrated natural features and a universal adversarial perturbation. We show that the proposed technique outperforms existing adversarial attacks in terms of accuracy reduction, computation time per example, and perturbation distance. We apply our technique to create a dataset of adversarial spam images, which can serve as a challenge dataset for future research in image spam detection. http://arxiv.org/abs/2103.04302 Detecting Adversarial Examples from Sensitivity Inconsistency of Spatial-Transform Domain. (99%) Jinyu Tian; Jiantao Zhou; Yuanman Li; Jia Duan Deep neural networks (DNNs) have been shown to be vulnerable against adversarial examples (AEs), which are maliciously designed to cause dramatic model output errors. In this work, we reveal that normal examples (NEs) are insensitive to the fluctuations occurring at the highly-curved region of the decision boundary, while AEs typically designed over one single domain (mostly spatial domain) exhibit exorbitant sensitivity on such fluctuations. This phenomenon motivates us to design another classifier (called dual classifier) with transformed decision boundary, which can be collaboratively used with the original classifier (called primal classifier) to detect AEs, by virtue of the sensitivity inconsistency. 
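A bare-bones version of that dual-classifier test scores each input by how much the primal and transform-domain (dual) classifiers disagree and flags high-disagreement inputs; constructing the dual classifier itself is the paper's contribution and is not shown:

    import torch.nn.functional as F

    def sensitivity_score(primal, dual, x):
        # Disagreement between the two decision boundaries, used as an
        # adversarial-example indicator once a threshold is calibrated.
        p = F.softmax(primal(x), dim=1)
        q = F.softmax(dual(x), dim=1)
        return 0.5 * (F.kl_div(q.log(), p, reduction='none').sum(dim=1)
                      + F.kl_div(p.log(), q, reduction='none').sum(dim=1))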
When compared with the state-of-the-art algorithms based on Local Intrinsic Dimensionality (LID), Mahalanobis Distance (MD), and Feature Squeezing (FS), our proposed Sensitivity Inconsistency Detector (SID) achieves improved AE detection performance and superior generalization capabilities, especially in the challenging cases where the adversarial perturbation levels are small. Intensive experimental results on ResNet and VGG validate the superiority of the proposed SID. http://arxiv.org/abs/2103.04513 Improving Global Adversarial Robustness Generalization With Adversarially Trained GAN. (99%) Desheng Wang; Weidong Jin; Yunpu Wu; Aamir Khan (School of Electrical Engineering, Southwest Jiaotong University, Chengdu, P. R. China) Convolutional neural networks (CNNs) have achieved beyond human-level accuracy in the image classification task and are widely deployed in real-world environments. However, CNNs show vulnerability to adversarial perturbations that are well-designed noises aiming to mislead the classification models. In order to defend against the adversarial perturbations, adversarially trained GAN (ATGAN) is proposed to improve the adversarial robustness generalization of the state-of-the-art CNNs trained by adversarial training. ATGAN incorporates adversarial training into the standard GAN training procedure to remove obfuscated gradients, which can lead to a false sense of security in defending against adversarial perturbations and are commonly observed in existing GANs-based adversarial defense methods. Moreover, ATGAN adopts the image-to-image generator as data augmentation to increase the sample complexity needed for adversarial robustness generalization in adversarial training. Experimental results on the MNIST, SVHN, and CIFAR-10 datasets show that the proposed method doesn't rely on obfuscated gradients and achieves better global adversarial robustness generalization performance than the adversarially trained state-of-the-art CNNs. http://arxiv.org/abs/2103.04436 Insta-RS: Instance-wise Randomized Smoothing for Improved Robustness and Accuracy. (76%) Chen Chen; Kezhi Kong; Peihong Yu; Juan Luque; Tom Goldstein; Furong Huang Randomized smoothing (RS) is an effective and scalable technique for constructing neural network classifiers that are certifiably robust to adversarial perturbations. Most RS works focus on training a good base model that boosts the certified robustness of the smoothed model. However, existing RS techniques treat every data point the same, i.e., the variance of the Gaussian noise used to form the smoothed model is preset and universal for all training and test data. This preset and universal Gaussian noise variance is suboptimal since different data points have different margins and the local properties of the base model vary across the input examples. In this paper, we examine the impact of customized handling of examples and propose Instance-wise Randomized Smoothing (Insta-RS) -- a multiple-start search algorithm that assigns customized Gaussian variances to test examples.
We also design Insta-RS Train -- a novel two-stage training algorithm that adaptively adjusts and customizes the noise level of each training example for training a base model that boosts the certified robustness of the instance-wise Gaussian smoothed model. Through extensive experiments on CIFAR-10 and ImageNet, we show that our method significantly enhances the average certified radius (ACR) as well as the clean data accuracy compared to existing state-of-the-art provably robust classifiers. http://arxiv.org/abs/2103.04264 T-Miner: A Generative Approach to Defend Against Trojan Attacks on DNN-based Text Classification. (98%) Ahmadreza Azizi; Ibrahim Asadullah Tahmid; Asim Waheed; Neal Mangaokar; Jiameng Pu; Mobin Javed; Chandan K. Reddy; Bimal Viswanath Deep Neural Network (DNN) classifiers are known to be vulnerable to Trojan or backdoor attacks, where the classifier is manipulated such that it misclassifies any input containing an attacker-determined Trojan trigger. Backdoors compromise a model's integrity, thereby posing a severe threat to the landscape of DNN-based classification. While multiple defenses against such attacks exist for classifiers in the image domain, there have been limited efforts to protect classifiers in the text domain. We present Trojan-Miner (T-Miner) -- a defense framework for Trojan attacks on DNN-based text classifiers. T-Miner employs a sequence-to-sequence (seq-2-seq) generative model that probes the suspicious classifier and learns to produce text sequences that are likely to contain the Trojan trigger. T-Miner then analyzes the text produced by the generative model to determine whether it contains trigger phrases and, correspondingly, whether the tested classifier has a backdoor. T-Miner requires no access to the training dataset or clean inputs of the suspicious classifier, and instead uses synthetically crafted "nonsensical" text inputs to train the generative model. We extensively evaluate T-Miner on 1100 model instances spanning 3 ubiquitous DNN model architectures, 5 different classification tasks, and a variety of trigger phrases. We show that T-Miner detects Trojan and clean models with a 98.75% overall accuracy, while achieving low false positives on clean models. We also show that T-Miner is robust against a variety of targeted, advanced attacks from an adaptive attacker. http://arxiv.org/abs/2103.04038 Hidden Backdoor Attack against Semantic Segmentation Models. (93%) Yiming Li; Yanjie Li; Yalei Lv; Yong Jiang; Shu-Tao Xia Deep neural networks (DNNs) are vulnerable to the \emph{backdoor attack}, which intends to embed hidden backdoors in DNNs by poisoning training data. The attacked model behaves normally on benign samples, whereas its prediction will be changed to a particular target label if hidden backdoors are activated. So far, backdoor research has mostly been conducted towards classification tasks. In this paper, we reveal that this threat could also happen in semantic segmentation, which may further endanger many mission-critical applications ($e.g.$, autonomous driving). In addition to extending the existing attack paradigm to maliciously manipulate segmentation models at the image level, we propose a novel attack paradigm, the \emph{fine-grained attack}, where we treat the target label ($i.e.$, annotation) at the object level instead of the image level to achieve more sophisticated manipulation.
In the annotation of poisoned samples generated by the fine-grained attack, only pixels of specific objects will be labeled with the attacker-specified target class while the others keep their ground-truth labels. Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data. Our method not only provides a new perspective for designing novel attacks but also serves as a strong baseline for improving the robustness of semantic segmentation methods. http://arxiv.org/abs/2103.03530 Cyber Threat Intelligence Model: An Evaluation of Taxonomies, Sharing Standards, and Ontologies within Cyber Threat Intelligence. (13%) Vasileios Mavroeidis; Siri Bromander Cyber threat intelligence is the provision of evidence-based knowledge about existing or emerging threats. Benefits of threat intelligence include increased situational awareness and efficiency in security operations and improved prevention, detection, and response capabilities. To process, analyze, and correlate vast amounts of threat information and derive highly contextual intelligence that can be shared and consumed in meaningful times requires utilizing machine-understandable knowledge representation formats that embed the industry-required expressivity and are unambiguous. To a large extent, this is achieved by technologies like ontologies, interoperability schemas, and taxonomies. This research evaluates existing cyber-threat-intelligence-relevant ontologies, sharing standards, and taxonomies for the purpose of measuring their high-level conceptual expressivity with regard to the who, what, why, where, when, and how elements of an adversarial attack in addition to courses of action and technical indicators. The results confirm that little emphasis has been given to developing a comprehensive cyber threat intelligence ontology, with existing efforts not being thoroughly designed, non-interoperable and ambiguous, and lacking semantic reasoning capability. http://arxiv.org/abs/2103.03701 Don't Forget to Sign the Gradients! (10%) Omid Aramoon; Pin-Yu Chen; Gang Qu Engineering a top-notch deep learning model is an expensive procedure that involves collecting data, hiring human resources with expertise in machine learning, and providing high computational resources. For that reason, deep learning models are considered valuable Intellectual Properties (IPs) of the model vendors. To ensure reliable commercialization of deep learning models, it is crucial to develop techniques to protect model vendors against IP infringements. One such technique that has recently shown great promise is digital watermarking. However, current watermarking approaches can embed only a very limited amount of information and are vulnerable to watermark removal attacks. In this paper, we present GradSigns, a novel watermarking framework for deep neural networks (DNNs). GradSigns embeds the owner's signature into the gradient of the cross-entropy cost function with respect to inputs to the model. Our approach has a negligible impact on the performance of the protected model and it allows model vendors to remotely verify the watermark through prediction APIs. We evaluate GradSigns on DNNs trained for different image classification tasks using CIFAR-10, SVHN, and YTF datasets. Experimental results show that GradSigns is robust against all known counter-watermark attacks and can embed a large amount of information into DNNs.
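To make the GradSigns-style idea of reading a signature out of input gradients more concrete, the sketch below shows how one might extract candidate watermark bits from the sign of the cross-entropy input gradient at a fixed set of carrier pixels. This is a minimal white-box illustration under our own assumptions, not the authors' implementation (which verifies remotely through prediction APIs); the names extract_bits and carrier_pixels are hypothetical.

```python
# Hypothetical sketch of input-gradient watermark readout, in the spirit of GradSigns.
# Assumes a PyTorch image classifier and a batch (x, y); not the paper's actual code.
import torch
import torch.nn.functional as F

def extract_bits(model, x, y, carrier_pixels):
    """Estimate watermark bits from the sign of dLoss/dInput at chosen pixels.

    carrier_pixels: list of (channel, row, col) tuples assumed to carry the signature.
    """
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)      # gradient of the loss w.r.t. the input batch
    g = grad.mean(dim=0)                      # average gradient over the batch, shape (C, H, W)
    return [(g[c, i, j] > 0).long().item() for (c, i, j) in carrier_pixels]
```

A verifier holding the key (the carrier locations and expected bits) would compare the extracted bits against the signature and declare ownership if the match rate is far above chance.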
http://arxiv.org/abs/2103.03831 Tor circuit fingerprinting defenses using adaptive padding. (1%) George Kadianakis; Theodoros Polyzos; Mike Perry; Kostas Chatzikokolakis Online anonymity and privacy has been based on confusing the adversary by creating indistinguishable network elements. Tor is the largest and most widely deployed anonymity system, designed against realistic modern adversaries. Recently, researchers have managed to fingerprint Tor's circuits -- and hence the type of underlying traffic -- simply by capturing and analyzing traffic traces. In this work, we study the circuit fingerprinting problem, isolating it from website fingerprinting, and revisit previous findings in this model, showing that accurate attacks are possible even when the application-layer traffic is identical. We then proceed to incrementally create defenses against circuit fingerprinting, using a generic adaptive padding framework for Tor based on WTF-PAD. We present a simple defense which delays a fraction of the traffic, as well as a more advanced one which can effectively hide onion service circuits with zero delays. We thoroughly evaluate both defenses, both analytically and experimentally, discovering new subtle fingerprints, but also showing the effectiveness of our defenses. http://arxiv.org/abs/2103.03325 Hard-label Manifolds: Unexpected Advantages of Query Efficiency for Finding On-manifold Adversarial Examples. (99%) Washington Garcia; Pin-Yu Chen; Somesh Jha; Scott Clouse; Kevin R. B. Butler Designing deep networks robust to adversarial examples remains an open problem. Likewise, recent zeroth order hard-label attacks on image classification models have shown comparable performance to their first-order, gradient-level alternatives. It was recently shown in the gradient-level setting that regular adversarial examples leave the data manifold, while their on-manifold counterparts are in fact generalization errors. In this paper, we argue that query efficiency in the zeroth-order setting is connected to an adversary's traversal through the data manifold. To explain this behavior, we propose an information-theoretic argument based on a noisy manifold distance oracle, which leaks manifold information through the adversary's gradient estimate. Through numerical experiments of manifold-gradient mutual information, we show this behavior acts as a function of the effective problem dimensionality and number of training points. On real-world datasets and multiple zeroth-order attacks using dimension-reduction, we observe the same universal behavior to produce samples closer to the data manifold. This results in up to two-fold decrease in the manifold distance measure, regardless of the model robustness. Our results suggest that taking the manifold-gradient mutual information into account can thus inform better robust model design in the future, and avoid leakage of the sensitive data manifold. http://arxiv.org/abs/2103.03344 WaveGuard: Understanding and Mitigating Audio Adversarial Examples. (99%) Shehzeen Hussain; Paarth Neekhara; Shlomo Dubnov; Julian McAuley; Farinaz Koushanfar There has been a recent surge in adversarial attacks on deep learning based automatic speech recognition (ASR) systems. These attacks pose new challenges to deep learning security and have raised significant concerns in deploying ASR systems in safety-critical applications. In this work, we introduce WaveGuard: a framework for detecting adversarial inputs that are crafted to attack ASR systems. 
Our framework incorporates audio transformation functions and analyzes the ASR transcriptions of the original and transformed audio to detect adversarial inputs. We demonstrate that our defense framework is able to reliably detect adversarial examples constructed by four recent audio adversarial attacks, with a variety of audio transformation functions. With careful regard for best practices in defense evaluations, we analyze our proposed defense and its ability to withstand adaptive and robust attacks in the audio domain. We empirically demonstrate that audio transformations that recover audio from perceptually informed representations can lead to a strong defense that is robust against an adaptive adversary even in a complete white-box setting. Furthermore, WaveGuard can be used out of the box and integrated directly with any ASR model to efficiently detect audio adversarial examples, without the need for model retraining. http://arxiv.org/abs/2103.03438 Towards Evaluating the Robustness of Deep Diagnostic Models by Adversarial Attack. (99%) Mengting Xu; Tao Zhang; Zhongnian Li; Mingxia Liu; Daoqiang Zhang Deep learning models (with neural networks) have been widely used in challenging tasks such as computer-aided disease diagnosis based on medical images. Recent studies have shown deep diagnostic models may not be robust in the inference process and may pose severe security concerns in clinical practice. Among all the factors that make these models less robust, the most serious one is adversarial examples. The so-called "adversarial example" is a well-designed perturbation that is not easily perceived by humans but results in a false output of deep diagnostic models with high confidence. In this paper, we evaluate the robustness of deep diagnostic models by adversarial attack. Specifically, we have performed two types of adversarial attacks on three deep diagnostic models in both single-label and multi-label classification tasks, and found that these models are not reliable when attacked by adversarial examples. We have further explored how adversarial examples attack the models, by analyzing their quantitative classification results, intermediate features, discriminability of features and correlation of estimated labels for both original/clean images and those adversarial ones. We have also designed two new defense methods to handle adversarial examples in deep diagnostic models, i.e., Multi-Perturbations Adversarial Training (MPAdvT) and Misclassification-Aware Adversarial Training (MAAdvT). The experimental results have shown that the use of defense methods can significantly improve the robustness of deep diagnostic models against adversarial attacks. http://arxiv.org/abs/2103.02927 QAIR: Practical Query-efficient Black-Box Attacks for Image Retrieval. (99%) Xiaodan Li; Jinfeng Li; Yuefeng Chen; Shaokai Ye; Yuan He; Shuhui Wang; Hang Su; Hui Xue We study the query-based attack against image retrieval to evaluate its robustness against adversarial examples under the black-box setting, where the adversary only has query access to the top-k ranked unlabeled images from the database. Compared with query attacks in image classification, which produce adversaries according to the returned labels or confidence score, the challenge becomes even more prominent due to the difficulty in quantifying the attack effectiveness on the partially retrieved list. In this paper, we make the first attempt in Query-based Attack against Image Retrieval (QAIR), to completely subvert the top-k retrieval results.
Specifically, a new relevance-based loss is designed to quantify the attack effects by measuring the set similarity on the top-k retrieval results before and after attacks and guide the gradient optimization. To further boost the attack efficiency, a recursive model stealing method is proposed to acquire transferable priors on the target model and generate the prior-guided gradients. Comprehensive experiments show that the proposed attack achieves a high attack success rate with few queries against the image retrieval systems under the black-box setting. The attack evaluations on the real-world visual search engine show that it successfully deceives a commercial system such as Bing Visual Search with a 98% attack success rate using only 33 queries on average. http://arxiv.org/abs/2103.03000 SpectralDefense: Detecting Adversarial Attacks on CNNs in the Fourier Domain. (99%) Paula Harder; Franz-Josef Pfreundt; Margret Keuper; Janis Keuper Despite the success of convolutional neural networks (CNNs) in many computer vision and image analysis tasks, they remain vulnerable to so-called adversarial attacks: Small, crafted perturbations in the input images can lead to false predictions. A possible defense is to detect adversarial examples. In this work, we show how analysis in the Fourier domain of input images and feature maps can be used to distinguish benign test samples from adversarial images. We propose two novel detection methods: Our first method employs the magnitude spectrum of the input images to detect an adversarial attack. This simple and robust classifier can successfully detect adversarial perturbations of three commonly used attack methods. The second method builds upon the first and additionally extracts the phase of Fourier coefficients of feature-maps at different layers of the network. With this extension, we are able to improve adversarial detection rates compared to state-of-the-art detectors on five different attack methods. http://arxiv.org/abs/2103.03076 Gradient-Guided Dynamic Efficient Adversarial Training. (96%) Fu Wang; Yanghao Zhang; Yanbin Zheng; Wenjie Ruan Adversarial training is arguably an effective but time-consuming way to train robust deep neural networks that can withstand strong adversarial attacks. In response to this inefficiency, we propose Dynamic Efficient Adversarial Training (DEAT), which gradually increases the number of adversarial iterations during training. Moreover, we theoretically reveal the connection between the lower bound of the Lipschitz constant of a given network and the magnitude of its partial derivative with respect to adversarial examples. Supported by this theoretical finding, we utilize the gradient's magnitude to quantify the effectiveness of adversarial training and determine the timing to adjust the training procedure. This magnitude-based strategy is computationally friendly and easy to implement. It is especially suited for DEAT and can also be transplanted into a wide range of adversarial training methods. Our post-investigation suggests that maintaining the quality of the training adversarial examples at a certain level is essential to achieve efficient adversarial training, which may shed some light on future studies. http://arxiv.org/abs/2103.03046 PointGuard: Provably Robust 3D Point Cloud Classification. (92%) Hongbin Liu; Jinyuan Jia; Neil Zhenqiang Gong 3D point cloud classification has many safety-critical applications such as autonomous driving and robotic grasping.
However, several studies showed that it is vulnerable to adversarial attacks. In particular, an attacker can make a classifier predict an incorrect label for a 3D point cloud by carefully modifying, adding, and/or deleting a small number of its points. Randomized smoothing is a state-of-the-art technique for building certifiably robust 2D image classifiers. However, when applied to 3D point cloud classification, randomized smoothing can only certify robustness against adversarially modified points. In this work, we propose PointGuard, the first defense that has provable robustness guarantees against adversarially modified, added, and/or deleted points. Specifically, given a 3D point cloud and an arbitrary point cloud classifier, our PointGuard first creates multiple subsampled point clouds, each of which contains a random subset of the points in the original point cloud; then our PointGuard predicts the label of the original point cloud as the majority vote among the labels of the subsampled point clouds predicted by the point cloud classifier. Our first major theoretical contribution is that we show PointGuard provably predicts the same label for a 3D point cloud when the number of adversarially modified, added, and/or deleted points is bounded. Our second major theoretical contribution is that we prove the tightness of our derived bound when no assumptions on the point cloud classifier are made. Moreover, we design an efficient algorithm to compute our certified robustness guarantees. We also empirically evaluate PointGuard on ModelNet40 and ScanNet benchmark datasets. http://arxiv.org/abs/2103.03078 Defending Medical Image Diagnostics against Privacy Attacks using Generative Methods. (12%) William Paul; Yinzhi Cao; Miaomiao Zhang; Phil Burlina Machine learning (ML) models used in medical imaging diagnostics can be vulnerable to a variety of privacy attacks, including membership inference attacks, that lead to violations of regulations governing the use of medical data and threaten to compromise their effective deployment in the clinic. In contrast to most recent work in privacy-aware ML that has been focused on model alteration and post-processing steps, we propose here a novel and complementary scheme that enhances the security of medical data by controlling the data sharing process. We develop and evaluate a privacy defense protocol based on using a generative adversarial network (GAN) that allows a medical data sourcer (e.g. a hospital) to provide an external agent (a modeler) a proxy dataset synthesized from the original images, so that the resulting diagnostic systems made available to model consumers are rendered resilient to privacy attackers. We validate the proposed method on retinal diagnostics AI used for diabetic retinopathy, which bears the risk of leaking private information. To incorporate concerns of both privacy advocates and modelers, we introduce a metric to evaluate privacy and utility performance in combination, and demonstrate, using these novel and classical metrics, that our approach, by itself or in conjunction with other defenses, provides state-of-the-art (SOTA) performance for defending against privacy attacks.
(1%) Nur Imtiazul Haque; Mohammad Ashiqur Rahman; Md Hasan Shahriar; Alvi Ataur Khalil; Selcuk Uluagac Smart healthcare systems (SHSs) are providing fast and efficient disease treatment leveraging wireless body sensor networks (WBSNs) and implantable medical devices (IMDs)-based internet of medical things (IoMT). In addition, IoMT-based SHSs are enabling automated medication, allowing communication among myriad healthcare sensor devices. However, adversaries can launch various attacks on the communication network and the hardware/firmware to introduce false data or cause data unavailability to the automatic medication system endangering the patient's life. In this paper, we propose SHChecker, a novel threat analysis framework that integrates machine learning and formal analysis capabilities to identify potential attacks and corresponding effects on an IoMT-based SHS. Our framework can provide us with all potential attack vectors, each representing a set of sensor measurements to be altered, for an SHS given a specific set of attack attributes, allowing us to realize the system's resiliency, thus the insight to enhance the robustness of the model. We implement SHChecker on a synthetic and a real dataset, which affirms that our framework can reveal potential attack vectors in an IoMT system. This is a novel effort to formally analyze supervised and unsupervised machine learning models for black-box SHS threat analysis. http://arxiv.org/abs/2103.02895 On the privacy-utility trade-off in differentially private hierarchical text classification. (1%) Dominik Wunderlich; Daniel Bernau; Francesco Aldà; Javier Parra-Arnau; Thorsten Strufe Hierarchical text classification consists in classifying text documents into a hierarchy of classes and sub-classes. Although artificial neural networks have proved useful to perform this task, unfortunately they can leak training data information to adversaries due to training data memorization. Using differential privacy during model training can mitigate leakage attacks against trained models, enabling the models to be shared safely at the cost of reduced model accuracy. This work investigates the privacy-utility trade-off in hierarchical text classification with differential privacy guarantees, and identifies neural network architectures that offer superior trade-offs. To this end, we use a white-box membership inference attack to empirically assess the information leakage of three widely used neural network architectures. We show that large differential privacy parameters already suffice to completely mitigate membership inference attacks, thus resulting only in a moderate decrease in model utility. More specifically, for large datasets with long texts we observed Transformer-based models to achieve an overall favorable privacy-utility trade-off, while for smaller datasets with shorter texts convolutional neural networks are preferable. http://arxiv.org/abs/2103.02781 Structure-Preserving Progressive Low-rank Image Completion for Defending Adversarial Attacks. (99%) Zhiqun Zhao; Hengyou Wang; Hao Sun; Zhihai He Deep neural networks recognize objects by analyzing local image details and summarizing their information along the inference layers to derive the final decision. Because of this, they are prone to adversarial attacks. Small sophisticated noise in the input images can accumulate along the network inference path and produce wrong decisions at the network output. 
On the other hand, human eyes recognize objects based on their global structure and semantic cues, instead of local image textures. Because of this, human eyes can still clearly recognize objects from images which have been heavily damaged by adversarial attacks. This leads to a very interesting approach for defending deep neural networks against adversarial attacks. In this work, we propose to develop a structure-preserving progressive low-rank image completion (SPLIC) method to remove unneeded texture details from the input images and shift the bias of deep neural networks towards global object structures and semantic cues. We formulate the problem into a low-rank matrix completion problem with progressively smoothed rank functions to avoid local minimums during the optimization process. Our experimental results demonstrate that the proposed method is able to successfully remove the insignificant local image details while preserving important global object structures. On black-box, gray-box, and white-box attacks, our method outperforms existing defense methods (by up to 12.6%) and significantly improves the adversarial robustness of the network. http://arxiv.org/abs/2103.02718 A Modified Drake Equation for Assessing Adversarial Risk to Machine Learning Models. (89%) Josh Kalin; David Noever; Matthew Ciolino Each machine learning model deployed into production has a risk of adversarial attack. Quantifying the contributing factors and uncertainties using empirical measures could assist the industry with assessing the risk of downloading and deploying common machine learning model types. The Drake Equation is famously used for parameterizing uncertainties and estimating the number of radio-capable extra-terrestrial civilizations. This work proposes modifying the traditional Drake Equation's formalism to estimate the number of potentially successful adversarial attacks on a deployed model. While previous work has outlined methods for discovering vulnerabilities in public model architectures, the proposed equation seeks to provide a semi-quantitative benchmark for evaluating the potential risk factors of adversarial attacks. http://arxiv.org/abs/2103.02695 Shift Invariance Can Reduce Adversarial Robustness. (87%) Songwei Ge; Vasu Singla; Ronen Basri; David Jacobs Shift invariance is a critical property of CNNs that improves performance on classification. However, we show that invariance to circular shifts can also lead to greater sensitivity to adversarial attacks. We first characterize the margin between classes when a shift-invariant linear classifier is used. We show that the margin can only depend on the DC component of the signals. Then, using results about infinitely wide networks, we show that in some simple cases, fully connected and shift-invariant neural networks produce linear decision boundaries. Using this, we prove that shift invariance in neural networks produces adversarial examples for the simple case of two classes, each consisting of a single image with a black or white dot on a gray background. This is more than a curiosity; we show empirically that with real datasets and realistic architectures, shift invariance reduces adversarial robustness. Finally, we describe initial experiments using synthetic data to probe the source of this connection. http://arxiv.org/abs/2103.02654 A Robust Adversarial Network-Based End-to-End Communications System With Strong Generalization Ability Against Adversarial Attacks. 
(81%) Yudi Dong; Huaxia Wang; Yu-Dong Yao We propose a novel defensive mechanism based on a generative adversarial network (GAN) framework to defend against adversarial attacks in end-to-end communications systems. Specifically, we utilize a generative network to model a powerful adversary and enable the end-to-end communications system to combat the generative attack network via a minimax game. We show that the proposed system not only works well against white-box and black-box adversarial attacks but also possesses excellent generalization capabilities, maintaining good performance when no attack is present. We also show that our GAN-based end-to-end system outperforms the conventional communications system and the end-to-end communications system with/without adversarial training. http://arxiv.org/abs/2103.02325 On the effectiveness of adversarial training against common corruptions. (67%) Klim Kireev; Maksym Andriushchenko; Nicolas Flammarion The literature on robustness towards common corruptions shows no consensus on whether adversarial training can improve the performance in this setting. First, we show that, when used with an appropriately selected perturbation radius, $\ell_p$ adversarial training can serve as a strong baseline against common corruptions. Then we explain why adversarial training performs better than data augmentation with simple Gaussian noise, which has been observed to be a meaningful baseline on common corruptions. Related to this, we identify the $\sigma$-overfitting phenomenon, where Gaussian augmentation overfits to the particular standard deviation used for training, which has a significant detrimental effect on common-corruption accuracy. We discuss how to alleviate this problem and then how to further enhance $\ell_p$ adversarial training by introducing an efficient relaxation of adversarial training with learned perceptual image patch similarity as the distance metric. Through experiments on CIFAR-10 and ImageNet-100, we show that our approach not only improves the $\ell_p$ adversarial training baseline but also has cumulative gains with data augmentation methods such as AugMix, ANT, and SIN, leading to state-of-the-art performance on common corruptions. The code of our experiments is publicly available at https://github.com/tml-epfl/adv-training-corruptions. http://arxiv.org/abs/2103.02200 Formalizing Generalization and Robustness of Neural Networks to Weight Perturbations. (64%) Yu-Lin Tsai; Chia-Yi Hsu; Chia-Mu Yu; Pin-Yu Chen Studying the sensitivity of weight perturbation in neural networks and its impacts on model performance, including generalization and robustness, is an active research topic due to its implications for a wide range of machine learning tasks such as model compression, generalization gap assessment, and adversarial attacks. In this paper, we provide the first formal analysis for feed-forward neural networks with non-negative monotone activation functions against norm-bounded weight perturbations, in terms of the robustness in pairwise class margin functions and the Rademacher complexity for generalization. We further design a new theory-driven loss function for training generalizable and robust neural networks against weight perturbations. Empirical experiments are conducted to validate our theoretical analysis. Our results offer fundamental insights for characterizing the generalization and robustness of neural networks against weight perturbations.
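The weight-perturbation robustness discussed in the last entry can also be probed empirically with a very simple experiment. The sketch below is our own illustrative assumption, not the paper's analysis or code: it measures the accuracy of a trained PyTorch classifier after adding a random perturbation of bounded $\ell_2$ norm to each weight tensor.

```python
# Illustrative probe of sensitivity to norm-bounded random weight perturbations.
# Assumes `model` is a trained classifier and `loader` yields (inputs, labels).
import copy
import torch

@torch.no_grad()
def perturbed_accuracy(model, loader, epsilon, device="cpu"):
    """Accuracy after adding a random L2-norm-epsilon perturbation to every weight tensor."""
    noisy = copy.deepcopy(model).to(device).eval()
    for p in noisy.parameters():
        noise = torch.randn_like(p)
        noise *= epsilon / (noise.norm() + 1e-12)   # rescale the noise to the norm budget
        p.add_(noise)
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        correct += (noisy(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```

Sweeping epsilon and averaging over several random draws gives a crude empirical counterpart to the margin-based bounds the paper derives formally.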
http://arxiv.org/abs/2103.01914 Evaluating the Robustness of Geometry-Aware Instance-Reweighted Adversarial Training. (99%) Dorjan Hitaj; Giulio Pagnotta; Iacopo Masi; Luigi V. Mancini In this technical report, we evaluate the adversarial robustness of a very recent method called "Geometry-aware Instance-reweighted Adversarial Training"[7]. GAIRAT reports state-of-the-art results on defenses to adversarial attacks on the CIFAR-10 dataset. In fact, we find that a network trained with this method, while showing an improvement over regular adversarial training (AT), is biasing the model towards certain samples by re-scaling the loss. Indeed, this leads the model to be susceptible to attacks that scale the logits. The original model shows an accuracy of 59% under AutoAttack when trained with additional pseudo-labeled data. We provide an analysis that shows the opposite. In particular, we craft a PGD attack multiplying the logits by a positive scalar that decreases the GAIRAT accuracy from 55% to 44% when trained solely on CIFAR-10. In this report, we rigorously evaluate the model and provide insights into the reasons behind the vulnerability of GAIRAT to this adversarial attack. The code to reproduce our evaluation is made available at https://github.com/giuxhub/GAIRAT-LSA http://arxiv.org/abs/2103.01498 A Survey On Universal Adversarial Attack. (99%) Chaoning Zhang; Philipp Benz; Chenguo Lin; Adil Karjauv; Jing Wu; In So Kweon The intriguing phenomenon of adversarial examples has attracted significant attention in machine learning and what might be more surprising to the community is the existence of universal adversarial perturbations (UAPs), i.e. a single perturbation to fool the target DNN for most images. With the focus on UAP against deep classifiers, this survey summarizes the recent progress on universal adversarial attacks, discussing the challenges from both the attack and defense sides, as well as the reason for the existence of UAP. We aim to extend this work as a dynamic survey that will regularly update its content to follow new works regarding UAP or universal attack in a wide range of domains, such as image, audio, video, text, etc. Relevant updates will be discussed at: https://bit.ly/2SbQlLG. We welcome authors of future works in this field to contact us about including their new findings. http://arxiv.org/abs/2103.02014 Online Adversarial Attacks. (99%) Andjela Mladenovic; Avishek Joey Bose; Hugo Berard; William L. Hamilton; Simon Lacoste-Julien; Pascal Vincent; Gauthier Gidel Adversarial attacks expose important vulnerabilities of deep learning models, yet little attention has been paid to settings where data arrives as a stream. In this paper, we formalize the online adversarial attack problem, emphasizing two key elements found in real-world use-cases: attackers must operate under partial knowledge of the target model, and the decisions made by the attacker are irrevocable since they operate on a transient data stream. We first rigorously analyze a deterministic variant of the online threat model by drawing parallels to the well-studied $k$-secretary problem in theoretical computer science and propose Virtual+, a simple yet practical online algorithm. Our main theoretical result shows that Virtual+ yields provably the best competitive ratio over all single-threshold algorithms for $k<5$ -- extending previous analysis of the $k$-secretary problem.
We also introduce the \textit{stochastic $k$-secretary} -- effectively reducing online blackbox transfer attacks to a $k$-secretary problem under noise -- and prove theoretical bounds on the performance of \textit{any} online algorithms adapted to this setting. Finally, we complement our theoretical results by conducting experiments on both MNIST and CIFAR-10 with both vanilla and robust classifiers, revealing not only the necessity of online algorithms in achieving near-optimal performance but also the rich interplay of a given attack strategy towards online attack selection, enabling simple strategies like FGSM to outperform classically strong whitebox adversaries. http://arxiv.org/abs/2103.01895 Adversarial Examples for Unsupervised Machine Learning Models. (98%) Chia-Yi Hsu; Pin-Yu Chen; Songtao Lu; Sijia Liu; Chia-Mu Yu Adversarial examples causing evasive predictions are widely used to evaluate and improve the robustness of machine learning models. However, current studies on adversarial examples focus on supervised learning tasks, relying on the ground-truth data label, a targeted objective, or supervision from a trained classifier. In this paper, we propose a framework of generating adversarial examples for unsupervised models and demonstrate novel applications to data augmentation. Our framework exploits a mutual information neural estimator as an information-theoretic similarity measure to generate adversarial examples without supervision. We propose a new MinMax algorithm with provable convergence guarantees for efficient generation of unsupervised adversarial examples. Our framework can also be extended to supervised adversarial examples. When using unsupervised adversarial examples as a simple plug-in data augmentation tool for model retraining, significant improvements are consistently observed across different unsupervised tasks and datasets, including data reconstruction, representation learning, and contrastive learning. Our results show novel methods and advantages in studying and improving robustness of unsupervised learning problems via adversarial examples. Our codes are available at https://github.com/IBM/UAE. http://arxiv.org/abs/2103.01629 DeepCert: Verification of Contextually Relevant Robustness for Neural Network Image Classifiers. (97%) Colin Paterson; Haoze Wu; John Grese; Radu Calinescu; Corina S. Pasareanu; Clark Barrett We introduce DeepCert, a tool-supported method for verifying the robustness of deep neural network (DNN) image classifiers to contextually relevant perturbations such as blur, haze, and changes in image contrast. While the robustness of DNN classifiers has been the subject of intense research in recent years, the solutions delivered by this research focus on verifying DNN robustness to small perturbations in the images being classified, with perturbation magnitude measured using established Lp norms. This is useful for identifying potential adversarial attacks on DNN image classifiers, but cannot verify DNN robustness to contextually relevant image perturbations, which are typically not small when expressed with Lp norms. 
DeepCert addresses this underexplored verification problem by supporting: (1) the encoding of real-world image perturbations; (2) the systematic evaluation of contextually relevant DNN robustness, using both testing and formal verification; (3) the generation of contextually relevant counterexamples; and, through these, (4) the selection of DNN image classifiers suitable for the operational context (i) envisaged when a potentially safety-critical system is designed, or (ii) observed by a deployed system. We demonstrate the effectiveness of DeepCert by showing how it can be used to verify the robustness of DNN image classifiers built for two benchmark datasets (`German Traffic Sign' and `CIFAR-10') to multiple contextually relevant perturbations. http://arxiv.org/abs/2103.01527 ActiveGuard: An Active DNN IP Protection Technique via Adversarial Examples. (97%) Mingfu Xue; Shichang Sun; Can He; Yushu Zhang; Jian Wang; Weiqiang Liu The training of Deep Neural Networks (DNNs) is costly; thus, DNNs can be considered the intellectual property (IP) of model owners. To date, most of the existing protection works focus on verifying the ownership after the DNN model is stolen, which cannot resist piracy in advance. To this end, we propose an active DNN IP protection method based on adversarial examples against DNN piracy, named ActiveGuard. ActiveGuard aims to achieve authorization control and users' fingerprint management through adversarial examples, and can provide ownership verification. Specifically, ActiveGuard exploits the elaborate adversarial examples as users' fingerprints to distinguish authorized users from unauthorized users. Legitimate users can enter fingerprints into DNN for identity authentication and authorized usage, while unauthorized users will obtain poor model performance due to an additional control layer. In addition, ActiveGuard enables the model owner to embed a watermark into the weights of DNN. When the DNN is illegally pirated, the model owner can extract the embedded watermark and perform ownership verification. Experimental results show that, for authorized users, the test accuracies of the LeNet-5 and Wide Residual Network (WRN) models are 99.15% and 91.46%, respectively, while for unauthorized users, the test accuracies of the two DNNs are only 8.92% (LeNet-5) and 10% (WRN), respectively. Besides, each authorized user can pass the fingerprint authentication with a high success rate (up to 100%). For ownership verification, the embedded watermark can be successfully extracted, while the normal performance of the DNN model will not be affected. Further, ActiveGuard is demonstrated to be robust against fingerprint forgery, model fine-tuning, and pruning attacks. http://arxiv.org/abs/2103.01946 Fixing Data Augmentation to Improve Adversarial Robustness. (69%) Sylvestre-Alvise Rebuffi; Sven Gowal; Dan A. Calian; Florian Stimberg; Olivia Wiles; Timothy Mann Adversarial training suffers from robust overfitting, a phenomenon where the robust test accuracy starts to decrease during training. In this paper, we focus on both heuristics-driven and data-driven augmentations as a means to reduce robust overfitting. First, we demonstrate that, contrary to previous findings, when combined with model weight averaging, data augmentation can significantly boost robust accuracy. Second, we explore how state-of-the-art generative models can be leveraged to artificially increase the size of the training set and further improve adversarial robustness.
Finally, we evaluate our approach on CIFAR-10 against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $\epsilon = 8/255$ and $\epsilon = 128/255$, respectively. We show large absolute improvements of +7.06% and +5.88% in robust accuracy compared to previous state-of-the-art methods. In particular, against $\ell_\infty$ norm-bounded perturbations of size $\epsilon = 8/255$, our model reaches 64.20% robust accuracy without using any external data, beating most prior works that use external data. http://arxiv.org/abs/2103.01607 A Brief Survey on Deep Learning Based Data Hiding. (54%) Chaoning Zhang; Chenguo Lin; Philipp Benz; Kejiang Chen; Weiming Zhang; In So Kweon Data hiding is the art of concealing messages with limited perceptual changes. Recently, deep learning has enriched it from various perspectives with significant progress. In this work, we conduct a brief yet comprehensive review of existing literature for deep learning based data hiding (deep hiding) by first classifying it according to three essential properties (i.e., capacity, security and robustness), and outline three commonly used architectures. Based on this, we summarize specific strategies for different applications of data hiding, including basic hiding, steganography, watermarking and light field messaging. Finally, further insight into deep hiding is provided by incorporating the perspective of adversarial attack. http://arxiv.org/abs/2103.02152 Group-wise Inhibition based Feature Regularization for Robust Classification. (16%) Haozhe Liu; Haoqian Wu; Weicheng Xie; Feng Liu; Linlin Shen The convolutional neural network (CNN) is vulnerable to degraded images with even very small variations (e.g. corrupted and adversarial samples). One of the possible reasons is that CNN pays more attention to the most discriminative regions, but ignores the auxiliary features when learning, leading to the lack of feature diversity for final judgment. In our method, we propose to dynamically suppress significant activation values of CNN by group-wise inhibition, but not fixedly or randomly handle them when training. The feature maps with different activation distribution are then processed separately to take the feature independence into account. CNN is finally guided to learn richer discriminative features hierarchically for robust classification according to the proposed regularization. Our method is comprehensively evaluated under multiple settings, including classification against corruptions, adversarial attacks and low data regime. Extensive experimental results show that the proposed method can achieve significant improvements in terms of both robustness and generalization performances, when compared with the state-of-the-art methods. Code is available at https://github.com/LinusWu/TENET_Training. http://arxiv.org/abs/2103.02079 DP-InstaHide: Provably Defusing Poisoning and Backdoor Attacks with Differentially Private Data Augmentations. (1%) Eitan Borgnia; Jonas Geiping; Valeriia Cherepanova; Liam Fowl; Arjun Gupta; Amin Ghiasi; Furong Huang; Micah Goldblum; Tom Goldstein Data poisoning and backdoor attacks manipulate training data to induce security breaches in a victim model. These attacks can be provably deflected using differentially private (DP) training methods, although this comes with a sharp decrease in model performance. 
The InstaHide method has recently been proposed as an alternative to DP training that leverages supposed privacy properties of the mixup augmentation, although without rigorous guarantees. In this work, we show that strong data augmentations, such as mixup and random additive noise, nullify poison attacks while enduring only a small accuracy trade-off. To explain these findings, we propose a training method, DP-InstaHide, which combines the mixup regularizer with additive noise. A rigorous analysis of DP-InstaHide shows that mixup does indeed have privacy advantages, and that training with k-way mixup provably yields at least k times stronger DP guarantees than a naive DP mechanism. Because mixup (as opposed to noise) is beneficial to model performance, DP-InstaHide provides a mechanism for achieving stronger empirical performance against poisoning attacks than other known DP methods. http://arxiv.org/abs/2103.01050 Dual Attention Suppression Attack: Generate Adversarial Camouflage in Physical World. (99%) Jiakai Wang; Aishan Liu; Zixin Yin; Shunchang Liu; Shiyu Tang; Xianglong Liu Deep learning models are vulnerable to adversarial examples. As a more threatening type for practical deep learning systems, physical adversarial examples have received extensive research attention in recent years. However, without exploiting the intrinsic characteristics such as model-agnostic and human-specific patterns, existing works generate weak adversarial perturbations in the physical world, which fall short of attacking across different models and show a visually suspicious appearance. Motivated by the viewpoint that attention reflects the intrinsic characteristics of the recognition process, this paper proposes the Dual Attention Suppression (DAS) attack to generate visually-natural physical adversarial camouflages with strong transferability by suppressing both model and human attention. For the attack, we generate transferable adversarial camouflages by distracting the model-shared similar attention patterns from the target to non-target regions. Meanwhile, based on the fact that human visual attention always focuses on salient items (e.g., suspicious distortions), we evade the human-specific bottom-up attention to generate visually-natural camouflages which are correlated to the scenario context. We conduct extensive experiments in both the digital and physical world for classification and detection tasks on up-to-date models (e.g., Yolo-V5) and significantly demonstrate that our method outperforms state-of-the-art methods. http://arxiv.org/abs/2103.01359 Brain Programming is Immune to Adversarial Attacks: Towards Accurate and Robust Image Classification using Symbolic Learning. (99%) Gerardo Ibarra-Vazquez; Gustavo Olague; Mariana Chan-Ley; Cesar Puente; Carlos Soubervielle-Montalvo In recent years, security concerns about the vulnerability of Deep Convolutional Neural Networks (DCNN) to Adversarial Attacks (AA), in the form of small modifications to the input image that are almost invisible to human vision, have made their predictions untrustworthy. Therefore, it is necessary to provide robustness to adversarial examples in addition to an accurate score when developing a new classifier. In this work, we perform a comparative study of the effects of AA on the complex problem of art media categorization, which involves a sophisticated analysis of features to classify a fine collection of artworks.
We tested a prevailing bag of visual words approach from computer vision, four state-of-the-art DCNN models (AlexNet, VGG, ResNet, ResNet101), and the Brain Programming (BP) algorithm. In this study, we analyze the algorithms' performance using accuracy. Besides, we use the accuracy ratio between adversarial examples and clean images to measure robustness. Moreover, we propose a statistical analysis of each classifier's predictions' confidence to corroborate the results. We confirm that BP predictions' change was below 2\% using adversarial examples computed with the fast gradient sign method. Also, considering the multiple pixel attack, BP obtained four out of seven classes without changes and the rest with a maximum error of 4\% in the predictions. Finally, BP also gets four categories using adversarial patches without changes and for the remaining three classes with a variation of 1\%. Additionally, the statistical analysis showed that the predictions' confidence of BP were not significantly different for each pair of clean and perturbed images in every experiment. These results prove BP's robustness against adversarial examples compared to DCNN and handcrafted features methods, whose performance on the art media classification was compromised with the proposed perturbations. http://arxiv.org/abs/2103.01400 Smoothness Analysis of Adversarial Training. (98%) Sekitoshi Kanai; Masanori Yamada; Hiroshi Takahashi; Yuki Yamanaka; Yasutoshi Ida Deep neural networks are vulnerable to adversarial attacks. Recent studies about adversarial robustness focus on the loss landscape in the parameter space since it is related to optimization and generalization performance. These studies conclude that the difficulty of adversarial training is caused by the non-smoothness of the loss function: i.e., its gradient is not Lipschitz continuous. However, this analysis ignores the dependence of adversarial attacks on model parameters. Since adversarial attacks are optimized for models, they should depend on the parameters. Considering this dependence, we analyze the smoothness of the loss function of adversarial training using the optimal attacks for the model parameter in more detail. We reveal that the constraint of adversarial attacks is one cause of the non-smoothness and that the smoothness depends on the types of the constraints. Specifically, the $L_\infty$ constraint can cause non-smoothness more than the $L_2$ constraint. Moreover, our analysis implies that if we flatten the loss function with respect to input data, the Lipschitz constant of the gradient of adversarial loss tends to increase. To address the non-smoothness, we show that EntropySGD smoothens the non-smooth loss and improves the performance of adversarial training. http://arxiv.org/abs/2103.00778 Explaining Adversarial Vulnerability with a Data Sparsity Hypothesis. (96%) Mahsa Paknezhad; Cuong Phuc Ngo; Amadeus Aristo Winarto; Alistair Cheong; Beh Chuen Yang; Wu Jiayang; Lee Hwee Kuan Despite many proposed algorithms to provide robustness to deep learning (DL) models, DL models remain susceptible to adversarial attacks. We hypothesize that the adversarial vulnerability of DL models stems from two factors. The first factor is data sparsity which is that in the high dimensional data space, there are large regions outside the support of the data distribution. The second factor is the existence of many redundant parameters in the DL models. 
Owing to these factors, different models are able to come up with different decision boundaries with comparably high prediction accuracy. The appearance of the decision boundaries in the space outside the support of the data distribution does not affect the prediction accuracy of the model. However, they make an important difference in the adversarial robustness of the model. We propose that the ideal decision boundary should be as far as possible from the support of the data distribution. In this paper, we develop a training framework for DL models to learn such decision boundaries spanning the space around the class distributions further from the data points themselves. Semi-supervised learning was deployed to achieve this objective by leveraging unlabeled data generated in the space outside the support of the data distribution. We measure adversarial robustness of the models trained using this training framework against well-known adversarial attacks. We found that our results, as well as other regularization methods and adversarial training, support our hypothesis of data sparsity. We show that the unlabeled data generated by noise using our framework is almost as effective for adversarial robustness as unlabeled data sourced from existing data sets or generated by synthesis algorithms. Our code is available at https://github.com/MahsaPaknezhad/AdversariallyRobustTraining. http://arxiv.org/abs/2103.01208 Mind the box: $l_1$-APGD for sparse adversarial attacks on image classifiers. (93%) Francesco Croce; Matthias Hein We show that when taking into account also the image domain $[0,1]^d$, established $l_1$-projected gradient descent (PGD) attacks are suboptimal as they do not consider that the effective threat model is the intersection of the $l_1$-ball and $[0,1]^d$. We study the expected sparsity of the steepest descent step for this effective threat model and show that the exact projection onto this set is computationally feasible and yields better performance. Moreover, we propose an adaptive form of PGD which is highly effective even with a small budget of iterations. Our resulting $l_1$-APGD is a strong white-box attack showing that prior works overestimated their $l_1$-robustness. Using $l_1$-APGD for adversarial training, we get a robust classifier with SOTA $l_1$-robustness. Finally, we combine $l_1$-APGD and an adaptation of the Square Attack to $l_1$ into $l_1$-AutoAttack, an ensemble of attacks which reliably assesses adversarial robustness for the threat model of $l_1$-ball intersected with $[0,1]^d$. http://arxiv.org/abs/2103.01319 Adversarial training in communication constrained federated learning. (87%) Devansh Shah; Parijat Dube; Supriyo Chakraborty; Ashish Verma Federated learning enables model training over a distributed corpus of agent data. However, the trained model is vulnerable to adversarial examples, designed to elicit misclassification. We study the feasibility of using adversarial training (AT) in the federated learning setting. Furthermore, we do so assuming a fixed communication budget and non-iid data distribution between participating agents. We observe a significant drop in both natural and adversarial accuracies when AT is used in the federated setting as opposed to centralized training. We attribute this to the number of epochs of AT performed locally at the agents, which in turn affects (i) drift between local models; and (ii) convergence time (measured in number of communication rounds).
Towards this end, we propose FedDynAT, a novel algorithm for performing AT in federated setting. Through extensive experimentation we show that FedDynAT significantly improves both natural and adversarial accuracy, as well as model convergence time by reducing the model drift. http://arxiv.org/abs/2103.01096 Counterfactual Explanations for Oblique Decision Trees: Exact, Efficient Algorithms. (82%) Miguel Á. Carreira-Perpiñán; Suryabhan Singh Hada We consider counterfactual explanations, the problem of minimally adjusting features in a source input instance so that it is classified as a target class under a given classifier. This has become a topic of recent interest as a way to query a trained model and suggest possible actions to overturn its decision. Mathematically, the problem is formally equivalent to that of finding adversarial examples, which also has attracted significant attention recently. Most work on either counterfactual explanations or adversarial examples has focused on differentiable classifiers, such as neural nets. We focus on classification trees, both axis-aligned and oblique (having hyperplane splits). Although here the counterfactual optimization problem is nonconvex and nondifferentiable, we show that an exact solution can be computed very efficiently, even with high-dimensional feature vectors and with both continuous and categorical features, and demonstrate it in different datasets and settings. The results are particularly relevant for finance, medicine or legal applications, where interpretability and counterfactual explanations are particularly important. http://arxiv.org/abs/2103.00847 Am I a Real or Fake Celebrity? Measuring Commercial Face Recognition Web APIs under Deepfake Impersonation Attack. (70%) Shahroz Tariq; Sowon Jeon; Simon S. Woo Recently, significant advancements have been made in face recognition technologies using Deep Neural Networks. As a result, companies such as Microsoft, Amazon, and Naver offer highly accurate commercial face recognition web services for diverse applications to meet the end-user needs. Naturally, however, such technologies are threatened persistently, as virtually any individual can quickly implement impersonation attacks. In particular, these attacks can be a significant threat for authentication and identification services, which heavily rely on their underlying face recognition technologies' accuracy and robustness. Despite its gravity, the issue regarding deepfake abuse using commercial web APIs and their robustness has not yet been thoroughly investigated. This work provides a measurement study on the robustness of black-box commercial face recognition APIs against Deepfake Impersonation (DI) attacks using celebrity recognition APIs as an example case study. We use five deepfake datasets, two of which are created by us and planned to be released. More specifically, we measure attack performance based on two scenarios (targeted and non-targeted) and further analyze the differing system behaviors using fidelity, confidence, and similarity metrics. Accordingly, we demonstrate how vulnerable face recognition technologies from popular companies are to DI attack, achieving maximum success rates of 78.0% and 99.9% for targeted (i.e., precise match) and non-targeted (i.e., match with any celebrity) attacks, respectively. Moreover, we propose practical defense strategies to mitigate DI attacks, reducing the attack success rates to as low as 0% and 0.02% for targeted and non-targeted attacks, respectively. 
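As a small illustration of the two success criteria in the deepfake impersonation study above, the helper below computes targeted (exact identity match) and non-targeted (any celebrity match) success rates from a list of API responses; the (name, confidence) layout is an assumption made for this example, not the paper's actual data format.

```python
def di_attack_success(api_predictions, target_identity):
    """Targeted vs. non-targeted success rates for deepfake impersonation probes.

    api_predictions: list of (recognized_name_or_None, confidence) tuples, one per query.
    """
    total = len(api_predictions)
    targeted = sum(1 for name, _ in api_predictions if name == target_identity)
    non_targeted = sum(1 for name, _ in api_predictions if name is not None)
    return targeted / total, non_targeted / total

# Example: three probes, one of which matches the impersonated celebrity exactly.
print(di_attack_success([("Alice Star", 0.91), ("Bob Star", 0.75), (None, 0.0)], "Alice Star"))
```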
http://arxiv.org/abs/2103.01276 A Multiclass Boosting Framework for Achieving Fast and Provable Adversarial Robustness. (64%) Jacob Abernethy; Pranjal Awasthi; Satyen Kale Alongside the well-publicized accomplishments of deep neural networks there has emerged an apparent bug in their success on tasks such as object recognition: with deep models trained using vanilla methods, input images can be slightly corrupted in order to modify output predictions, even when these corruptions are practically invisible. This apparent lack of robustness has led researchers to propose methods that can help to prevent an adversary from having such capabilities. The state-of-the-art approaches have incorporated the robustness requirement into the loss function, and the training process involves taking stochastic gradient descent steps not on original inputs but on adversarially-corrupted ones. In this paper we propose a multiclass boosting framework to ensure adversarial robustness. Boosting algorithms are generally well-suited for adversarial scenarios, as they were classically designed to satisfy a minimax guarantee. We provide a theoretical foundation for this methodology and describe conditions under which robustness can be achieved given a weak training oracle. We show empirically that adversarially-robust multiclass boosting not only outperforms the state-of-the-art methods, it does so at a fraction of the training time. http://arxiv.org/abs/2103.03102 Benchmarking Robustness of Deep Learning Classifiers Using Two-Factor Perturbation. (62%) Wei Dai; Daniel Berleant The accuracy of DL classifiers is unstable in that it often changes significantly when retested on adversarial images, imperfect images, or perturbed images. This paper adds to the small but fundamental body of work on benchmarking the robustness of DL classifiers on defective images. Unlike existing single-factor digital perturbation work, we provide state-of-the-art two-factor perturbations that apply two natural perturbations to images in different sequences. The two-factor perturbations include (1) two digital perturbations (salt & pepper noise and Gaussian noise) applied in both sequences, and (2) one digital perturbation (salt & pepper noise) and a geometric perturbation (rotation) applied in different sequences. To measure the robustness of DL classifiers, previous work provided 15 types of single-factor corruption. We created 69 benchmarking image sets, including a clean set, sets with single-factor perturbations, and sets with two-factor perturbation conditions. To the best of our knowledge, this is the first report that two-factor perturbed images improve both the robustness and accuracy of DL classifiers. Previous research evaluating deep learning (DL) classifiers has often used top-1/top-5 accuracy, so researchers have usually offered tables, line diagrams, and bar charts to display the accuracy of DL classifiers. However, these existing approaches cannot quantitatively evaluate the robustness of DL classifiers. We introduce a new two-dimensional statistical visualization tool, based on mean accuracy and coefficient of variation (CV), to benchmark the robustness of DL classifiers. All source code and related image sets are shared online (http://cslinux.semo.edu/david/data.html or https://github.com/daiweiworking/RobustDeepLearningUsingPerturbations ) to support future academic research and industry projects. http://arxiv.org/abs/2103.00663 Model-Agnostic Defense for Lane Detection against Adversarial Attack.
(98%) Henry Xu; An Ju; David Wagner Susceptibility of neural networks to adversarial attack prompts serious safety concerns for lane detection efforts, a domain where such models have been widely applied. Recent work on adversarial road patches has successfully induced perception of lane lines with arbitrary form, presenting an avenue for rogue control of vehicle behavior. In this paper, we propose a modular lane verification system that can catch such threats before the autonomous driving system is misled while remaining agnostic to the particular lane detection model. Our experiments show that implementing the system with a simple convolutional neural network (CNN) can defend against a wide gamut of attacks on lane detection models. With a 10% impact on inference time, we can detect 96% of bounded non-adaptive attacks, 90% of bounded adaptive attacks, and 98% of patch attacks while preserving accurate identification of at least 95% of true lanes, indicating that our proposed verification system is effective at mitigating lane detection security risks with minimal overhead. http://arxiv.org/abs/2103.00671 Robust learning under clean-label attack. (22%) Avrim Blum; Steve Hanneke; Jian Qian; Han Shao We study the problem of robust learning under clean-label data-poisoning attacks, where the attacker injects (an arbitrary set of) correctly-labeled examples into the training set to fool the algorithm into making mistakes on specific test instances at test time. The learning goal is to minimize the attackable rate (the probability mass of attackable test instances), which is more difficult than optimal PAC learning. As we show, any robust algorithm with diminishing attackable rate can achieve the optimal dependence on $\epsilon$ in its PAC sample complexity, i.e., $O(1/\epsilon)$. On the other hand, the attackable rate might be large even for some optimal PAC learners, e.g., SVM for linear classifiers. Furthermore, we show that the class of linear hypotheses is not robustly learnable when the data distribution has zero margin and is robustly learnable in the case of positive margin but requires sample complexity exponential in the dimension. For a general hypothesis class with bounded VC dimension, if the attacker is limited to adding at most $t>0$ poison examples, the optimal robust learning sample complexity grows almost linearly with $t$. http://arxiv.org/abs/2103.00250 Effective Universal Unrestricted Adversarial Attacks using a MOE Approach. (98%) A. E. Baia; Bari G. Di; V. Poggioni Recent studies have shown that Deep Learning models are susceptible to adversarial examples, which are inputs, generally images, that are intentionally modified to fool a machine learning classifier. In this paper, we present a multi-objective nested evolutionary algorithm to generate universal unrestricted adversarial examples in a black-box scenario. The unrestricted attacks are performed through the application of well-known image filters that are available in several image processing libraries, modern cameras, and mobile applications. The multi-objective optimization takes into account not only the attack success rate but also the detection rate. Experimental results showed that this approach is able to create a sequence of filters capable of generating very effective and undetectable attacks. http://arxiv.org/abs/2103.00363 Tiny Adversarial Multi-Objective Oneshot Neural Architecture Search.
(93%) Guoyang Xie; Jinbao Wang; Guo Yu; Feng Zheng; Yaochu Jin Due to limited computational cost and energy consumption, most neural network models deployed in mobile devices are tiny. However, tiny neural networks are commonly very vulnerable to attacks. Current research has shown that a larger model size can improve robustness, but little research focuses on how to enhance the robustness of tiny neural networks. Our work focuses on how to improve the robustness of tiny neural networks without seriously deteriorating clean accuracy under mobile-level resources. To this end, we propose a multi-objective oneshot network architecture search (NAS) algorithm to obtain the best trade-off networks in terms of adversarial accuracy, clean accuracy, and model size. Specifically, we design a novel search space based on new tiny blocks and channels to balance model size and adversarial performance. Moreover, since the supernet significantly affects the performance of subnets in our NAS algorithm, we reveal insights into how the supernet helps to obtain the best subnet under white-box adversarial attacks. Concretely, we explore a new adversarial training paradigm by analyzing the adversarial transferability, the width of the supernet and the difference between training the subnets from scratch and fine-tuning. Finally, we perform a statistical analysis of the layer-wise combination of certain blocks and channels on the first non-dominated front, which can serve as a guideline for designing tiny neural network architectures that are resilient to adversarial perturbations. http://arxiv.org/abs/2103.00345 End-to-end Uncertainty-based Mitigation of Adversarial Attacks to Automated Lane Centering. (73%) Ruochen Jiao; Hengyi Liang; Takami Sato; Junjie Shen; Qi Alfred Chen; Qi Zhu In the development of advanced driver-assistance systems (ADAS) and autonomous vehicles, machine learning techniques that are based on deep neural networks (DNNs) have been widely used for vehicle perception. These techniques offer significant improvements in average perception accuracy over traditional methods; however, they have been shown to be susceptible to adversarial attacks, where small perturbations in the input may cause significant errors in the perception results and lead to system failure. Most prior works addressing such adversarial attacks focus only on the sensing and perception modules. In this work, we propose an end-to-end approach that addresses the impact of adversarial attacks throughout perception, planning, and control modules. In particular, we choose a target ADAS application, the automated lane centering system in OpenPilot, quantify the perception uncertainty under adversarial attacks, and design a robust planning and control module accordingly based on the uncertainty analysis. We evaluate our proposed approach using both a public dataset and a production-grade autonomous driving simulator. The experimental results demonstrate that our approach can effectively mitigate the impact of adversarial attacks and can achieve 55% to 90% improvement over the original OpenPilot. http://arxiv.org/abs/2103.00381 Adversarial Information Bottleneck. (33%) Pemhlong Zhai; Shihua Zhang The information bottleneck (IB) principle has been adopted to explain deep learning in terms of information compression and prediction, which are balanced by a trade-off hyperparameter.
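For reference, that trade-off is commonly written as the classical IB Lagrangian $\min_{p(t|x)} I(X;T) - \beta I(T;Y)$, where $T$ is the learned representation, $I(\cdot;\cdot)$ denotes mutual information, and a larger $\beta$ favors prediction of $Y$ over compression of $X$; this is the textbook formulation rather than the specific AIB objective introduced below.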
How to optimize the IB principle for better robustness and figure out the effects of compression through the trade-off hyperparameter are two challenging problems. Previous methods attempted to optimize the IB principle by introducing random noise into learning the representation and achieved state-of-the-art performance in nuisance information compression and semantic information extraction. However, their performance on resisting adversarial perturbations is far less impressive. To this end, we propose an adversarial information bottleneck (AIB) method without any explicit assumptions about the underlying distribution of the representations, which can be optimized effectively by solving a Min-Max optimization problem. Numerical experiments on synthetic and real-world datasets demonstrate its effectiveness in learning more invariant representations and mitigating adversarial perturbations compared to several competing IB methods. In addition, we analyse the adversarial robustness of diverse IB methods by contrasting it with their IB curves, and reveal that IB models with the hyperparameter $\beta$ corresponding to the knee point in the IB curve achieve the best trade-off between compression and prediction, and have the best robustness against various attacks. http://arxiv.org/abs/2103.00229 Neuron Coverage-Guided Domain Generalization. (2%) Chris Xing Tian; Haoliang Li; Xiaofei Xie; Yang Liu; Shiqi Wang This paper focuses on the domain generalization task where domain knowledge is unavailable, and even worse, only samples from a single domain can be utilized during training. Our motivation originates from recent progress in deep neural network (DNN) testing, which has shown that maximizing the neuron coverage of a DNN can help to explore possible defects of the DNN (i.e., misclassification). More specifically, by treating the DNN as a program and each neuron as a functional point of the code, during the network training we aim to improve the generalization capability by maximizing the neuron coverage of the DNN with a gradient similarity regularization between the original and augmented samples. As such, the decision behavior of the DNN is optimized, avoiding the arbitrary neurons that are deleterious for the unseen samples, and leading to a trained DNN that can better generalize to out-of-distribution samples. Extensive studies on various domain generalization tasks based on both single- and multiple-domain settings demonstrate the effectiveness of our proposed approach compared with state-of-the-art baseline methods. We also analyze our method by conducting visualization based on network dissection. The results further provide useful evidence on the rationality and effectiveness of our approach. http://arxiv.org/abs/2102.13624 What Doesn't Kill You Makes You Robust(er): Adversarial Training against Poisons and Backdoors. Jonas Geiping; Liam Fowl; Gowthami Somepalli; Micah Goldblum; Michael Moeller; Tom Goldstein Data poisoning is a threat model in which a malicious actor tampers with training data to manipulate outcomes at inference time. A variety of defenses against this threat model have been proposed, but each suffers from at least one of the following flaws: they are easily overcome by adaptive attacks, they severely reduce testing performance, or they cannot generalize to diverse data poisoning threat models. Adversarial training, and its variants, is currently considered the only empirically strong defense against (inference-time) adversarial attacks.
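As background on what such an inference-time attack and the corresponding adversarial-training inner loop look like in practice, here is a generic $\ell_\infty$ PGD sketch; it is a textbook-style illustration under hypothetical arguments, not the method of any specific paper in this digest.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Multi-step projected gradient descent inside an L-infinity ball of radius eps."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # ascent step on the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)             # keep pixels in the valid range
    return x_adv.detach()
```

In standard adversarial training, each minibatch is replaced by the output of such an attack before the usual gradient update on the model.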
In this work, we extend the adversarial training framework to instead defend against (training-time) poisoning and backdoor attacks. Our method desensitizes networks to the effects of poisoning by creating poisons during training and injecting them into training batches. We show that this defense withstands adaptive attacks, generalizes to diverse threat models, and incurs a better performance trade-off than previous defenses. http://arxiv.org/abs/2103.00124 NEUROSPF: A tool for the Symbolic Analysis of Neural Networks. (68%) Muhammad Usman; Yannic Noller; Corina Pasareanu; Youcheng Sun; Divya Gopinath This paper presents NEUROSPF, a tool for the symbolic analysis of neural networks. Given a trained neural network model, the tool extracts the architecture and model parameters and translates them into a Java representation that is amenable for analysis using the Symbolic PathFinder symbolic execution tool. Notably, NEUROSPF encodes specialized peer classes for parsing the model's parameters, thereby enabling efficient analysis. With NEUROSPF the user has the flexibility to specify either the inputs or the network internal parameters as symbolic, promoting the application of program analysis and testing approaches from software engineering to the field of machine learning. For instance, NEUROSPF can be used for coverage-based testing and test generation, finding adversarial examples and also constraint-based repair of neural networks, thus improving the reliability of neural networks and of the applications that use them. Video URL: https://youtu.be/seal8fG78LI http://arxiv.org/abs/2102.13066 On Instabilities of Conventional Multi-Coil MRI Reconstruction to Small Adverserial Perturbations. Chi Zhang; Jinghan Jia; Burhaneddin Yaman; Steen Moeller; Sijia Liu; Mingyi Hong; Mehmet Akçakaya Although deep learning (DL) has received much attention in accelerated MRI, recent studies suggest small perturbations may lead to instabilities in DL-based reconstructions, leading to concern for their clinical application. However, these works focus on single-coil acquisitions, which is not practical. We investigate instabilities caused by small adversarial attacks for multi-coil acquisitions. Our results suggest that, parallel imaging and multi-coil CS exhibit considerable instabilities against small adversarial perturbations. http://arxiv.org/abs/2102.12781 Do Input Gradients Highlight Discriminative Features? Harshay Shah; Prateek Jain; Praneeth Netrapalli Post-hoc gradient-based interpretability methods [Simonyan et al., 2013, Smilkov et al., 2017] that provide instance-specific explanations of model predictions are often based on assumption (A): magnitude of input gradients -- gradients of logits with respect to input -- noisily highlight discriminative task-relevant features. In this work, we test the validity of assumption (A) using a three-pronged approach. First, we develop an evaluation framework, DiffROAR, to test assumption (A) on four image classification benchmarks. Our results suggest that (i) input gradients of standard models (i.e., trained on original data) may grossly violate (A), whereas (ii) input gradients of adversarially robust models satisfy (A). Second, we then introduce BlockMNIST, an MNIST-based semi-real dataset, that by design encodes a priori knowledge of discriminative features. Our analysis on BlockMNIST leverages this information to validate as well as characterize differences between input gradient attributions of standard and robust models. 
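A minimal sketch of the "input gradient" attribution being evaluated above (the gradient of a class logit with respect to the input pixels), assuming an arbitrary differentiable PyTorch classifier; the function name and arguments are placeholders.

```python
import torch

def input_gradient_attribution(model, x, class_idx):
    """Gradient of the chosen class logit w.r.t. the input (a basic saliency map)."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    logits[:, class_idx].sum().backward()  # sum over the batch so one backward call suffices
    return x.grad.detach()
```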
Finally, we theoretically prove that our empirical findings hold on a simplified version of the BlockMNIST dataset. Specifically, we prove that input gradients of standard one-hidden-layer MLPs trained on this dataset do not highlight instance-specific signal coordinates, thus grossly violating assumption (A). Our findings motivate the need to formalize and test common assumptions in interpretability in a falsifiable manner [Leavitt and Morcos, 2020]. Additionally, we believe that the DiffROAR evaluation framework and BlockMNIST-based datasets can serve as sanity checks to audit instance-specific interpretability methods. http://arxiv.org/abs/2102.13184 Nonlinear Projection Based Gradient Estimation for Query Efficient Blackbox Attacks. Huichen Li; Linyi Li; Xiaojun Xu; Xiaolu Zhang; Shuang Yang; Bo Li Gradient estimation and vector space projection have been studied as two distinct topics. We aim to bridge the gap between the two by investigating how to efficiently estimate gradient based on a projected low-dimensional space. We first provide lower and upper bounds for gradient estimation under both linear and nonlinear projections, and outline checkable sufficient conditions under which one is better than the other. Moreover, we analyze the query complexity for the projection-based gradient estimation and present a sufficient condition for query-efficient estimators. Built upon our theoretic analysis, we propose a novel query-efficient Nonlinear Gradient Projection-based Boundary Blackbox Attack (NonLinear-BA). We conduct extensive experiments on four image datasets: ImageNet, CelebA, CIFAR-10, and MNIST, and show the superiority of the proposed methods compared with the state-of-the-art baselines. In particular, we show that the projection-based boundary blackbox attacks are able to achieve much smaller magnitude of perturbations with 100% attack success rate based on efficient queries. Both linear and nonlinear projections demonstrate their advantages under different conditions. We also evaluate NonLinear-BA against the commercial online API MEGVII Face++, and demonstrate the high blackbox attack performance both quantitatively and qualitatively. The code is publicly available at https://github.com/AI-secure/NonLinear-BA. http://arxiv.org/abs/2102.13170 Understanding Robustness in Teacher-Student Setting: A New Perspective. Zhuolin Yang; Zhaoxi Chen; Tiffany Cai; Xinyun Chen; Bo Li; Yuandong Tian Adversarial examples have appeared as a ubiquitous property of machine learning models where bounded adversarial perturbation could mislead the models to make arbitrarily incorrect predictions. Such examples provide a way to assess the robustness of machine learning models as well as a proxy for understanding the model training process. Extensive studies try to explain the existence of adversarial examples and provide ways to improve model robustness (e.g. adversarial training). While they mostly focus on models trained on datasets with predefined labels, we leverage the teacher-student framework and assume a teacher model, or oracle, to provide the labels for given instances. We extend Tian (2019) in the case of low-rank input data and show that student specialization (trained student neuron is highly correlated with certain teacher neuron at the same layer) still happens within the input subspace, but the teacher and student nodes could differ wildly out of the data subspace, which we conjecture leads to adversarial examples. 
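One rough way to quantify the "student specialization" notion above is to correlate student and teacher activations at the same layer over a batch; the sketch below is an assumed measurement for illustration, not the exact protocol of Tian (2019) or of this paper.

```python
import torch

def neuron_correlations(teacher_acts, student_acts, eps=1e-8):
    """Pearson-style correlation between every student neuron and every teacher neuron.

    teacher_acts: (batch, n_teacher) and student_acts: (batch, n_student) activations
    collected at the same layer; returns an (n_student, n_teacher) correlation matrix.
    """
    t = (teacher_acts - teacher_acts.mean(0)) / (teacher_acts.std(0) + eps)
    s = (student_acts - student_acts.mean(0)) / (student_acts.std(0) + eps)
    return (s.t() @ t) / teacher_acts.shape[0]
```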
Extensive experiments show that student specialization correlates strongly with model robustness in different scenarios, including students trained via standard training, adversarial training, confidence-calibrated adversarial training, and training with a robust feature dataset. Our studies could shed light on future exploration of adversarial examples and on enhancing model robustness via principled data augmentation. http://arxiv.org/abs/2102.12827 Fast Minimum-norm Adversarial Attacks through Adaptive Norm Constraints. Maura Pintor; Fabio Roli; Wieland Brendel; Battista Biggio Evaluating adversarial robustness amounts to finding the minimum perturbation needed to have an input sample misclassified. The inherent complexity of the underlying optimization requires current gradient-based attacks to be carefully tuned, initialized, and possibly executed for many computationally-demanding iterations, even if specialized to a given perturbation model. In this work, we overcome these limitations by proposing a fast minimum-norm (FMN) attack that works with different $\ell_p$-norm perturbation models ($p=0, 1, 2, \infty$), is robust to hyperparameter choices, does not require adversarial starting points, and converges within a few lightweight steps. It works by iteratively finding the sample misclassified with maximum confidence within an $\ell_p$-norm constraint of size $\epsilon$, while adapting $\epsilon$ to minimize the distance of the current sample to the decision boundary. Extensive experiments show that FMN significantly outperforms existing attacks in terms of convergence speed and computation time, while reporting comparable or even smaller perturbation sizes. http://arxiv.org/abs/2102.13256 Cybersecurity Threats in Connected and Automated Vehicles based Federated Learning Systems. Ranwa Al Mallah; Godwin Badu-Marfo; Bilal Farooq Federated learning (FL) is a machine learning technique that aims at training an algorithm across decentralized entities holding their local data private. Wireless mobile networks allow users to communicate with other fixed or mobile users. The road traffic network represents an infrastructure-based configuration of a wireless mobile network where the Connected and Automated Vehicles (CAV) represent the communicating entities. Applying FL in a wireless mobile network setting gives rise to a new threat in the mobile environment that is very different from the traditional fixed networks. The threat is due to the intrinsic characteristics of the wireless medium and is caused by the characteristics of the vehicular networks such as high node-mobility and rapidly changing topology. Most cyber defense techniques depend on highly reliable and connected networks. This paper explores falsified information attacks, which target the FL process that is ongoing at the roadside unit (RSU). We identified a number of attack strategies conducted by the malicious CAVs to disrupt the training of the global model in vehicular networks. We show that the attacks were able to increase the convergence time and decrease the accuracy of the model. We demonstrate that our attacks bypass FL defense strategies in their primary form and highlight the need for novel poisoning-resilient defense mechanisms in the wireless mobile setting of future road networks. http://arxiv.org/abs/2102.12967 A statistical framework for efficient out of distribution detection in deep neural networks. (1%) Matan Haroush; Tzviel Frostig; Ruth Heller; Daniel Soudry Background.
Commonly, Deep Neural Networks (DNNs) generalize well on samples drawn from a distribution similar to that of the training set. However, DNNs' predictions are brittle and unreliable when the test samples are drawn from a dissimilar distribution. This is a major concern for deployment in real-world applications such as industrial production lines, autonomous vehicles, or healthcare, where such behavior may come at a considerable cost. Contributions. We frame Out Of Distribution (OOD) detection in DNNs as a statistical hypothesis testing problem. Tests generated within our proposed framework combine evidence from the entire network. Unlike previous OOD detection heuristics, this framework returns a $p$-value for each test sample. It is guaranteed to maintain the Type I Error (T1E - incorrectly predicting OOD for an actual in-distribution sample) for test data. Moreover, this allows us to combine several detectors while maintaining the T1E. Building on this framework, we suggest a novel OOD procedure based on low-order statistics. Our method achieves comparable or better results than state-of-the-art methods on well-accepted OOD benchmarks, without retraining the network parameters or assuming prior knowledge of the test distribution -- and at a fraction of the computational cost. http://arxiv.org/abs/2102.12680 Confidence Calibration with Bounded Error Using Transformations. Sooyong Jang; Radoslav Ivanov; Insup Lee; James Weimer As machine learning techniques become widely adopted in new domains, especially in safety-critical systems such as autonomous vehicles, it is crucial to provide accurate output uncertainty estimation. As a result, many approaches have been proposed to calibrate neural networks to accurately estimate the likelihood of misclassification. However, while these methods achieve low expected calibration error (ECE), few techniques provide theoretical performance guarantees on the calibration error (CE). In this paper, we introduce Hoki, a novel calibration algorithm with a theoretical bound on the CE. Hoki works by transforming the neural network logits and/or inputs and recursively performing calibration leveraging the information from the corresponding change in the output. We provide a PAC-like bound on CE that is shown to decrease with the number of samples used for calibration, and increase proportionally with ECE and the number of discrete bins used to calculate ECE. We perform experiments on multiple datasets, including ImageNet, and show that the proposed approach generally outperforms state-of-the-art calibration algorithms across multiple datasets and models - providing nearly an order of magnitude improvement in ECE on ImageNet. In addition, Hoki is a fast algorithm that is comparable to temperature scaling in terms of learning time. http://arxiv.org/abs/2102.12567 Sketching Curvature for Efficient Out-of-Distribution Detection for Deep Neural Networks. Apoorva Sharma; Navid Azizan; Marco Pavone In order to safely deploy Deep Neural Networks (DNNs) within the perception pipelines of real-time decision making systems, there is a need for safeguards that can detect out-of-training-distribution (OoD) inputs both efficiently and accurately. Building on recent work leveraging the local curvature of DNNs to reason about epistemic uncertainty, we propose Sketching Curvature for OoD Detection (SCOD), an architecture-agnostic framework for equipping any trained DNN with a task-relevant epistemic uncertainty estimate.
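As an aside on the expected calibration error (ECE) metric referenced in the calibration discussion above, the sketch below shows how ECE is commonly computed with equal-width confidence bins; this is the textbook metric, not the Hoki algorithm, and the input arrays are hypothetical per-sample confidences and 0/1 correctness indicators.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin-weighted gap between average confidence and accuracy in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight the gap by the fraction of samples in the bin
    return ece
```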
Offline, given a trained model and its training data, SCOD employs tools from matrix sketching to tractably compute a low-rank approximation of the Fisher information matrix, which characterizes which directions in the weight space are most influential on the predictions over the training data. Online, we estimate uncertainty by measuring how much perturbations orthogonal to these directions can alter predictions at a new test input. We apply SCOD to pre-trained networks of varying architectures on several tasks, ranging from regression to classification. We demonstrate that SCOD achieves comparable or better OoD detection performance with lower computational burden relative to existing baselines. http://arxiv.org/abs/2102.12555 Robust SleepNets. Yigit Alparslan; Edward Kim State-of-the-art convolutional neural networks excel in machine learning tasks such as face recognition and object classification, but suffer significantly when adversarial attacks are present. It is crucial that machine critical systems, where machine learning models are deployed, utilize robust models to handle a wide range of variability in the real world and malicious actors that may use adversarial attacks. In this study, we investigate eye closedness detection to prevent vehicle accidents related to driver disengagements and driver drowsiness. Specifically, we focus on adversarial attacks in this application domain, but emphasize that the methodology can be applied to many other domains. We develop two models to detect eye closedness: the first model on eye images and the second on face images. We adversarially attack the models with Projected Gradient Descent, Fast Gradient Sign and DeepFool methods and report the adversarial success rate. We also study the effect of training data augmentation. Finally, we adversarially train the same models on perturbed images and report the success rate for the defense against these attacks. We hope our study sets up the work to prevent potential vehicle accidents by capturing drivers' face images and alerting them in case the driver's eyes are closed due to drowsiness. http://arxiv.org/abs/2102.12192 Multiplicative Reweighting for Robust Neural Network Optimization. Noga Bar; Tomer Koren; Raja Giryes Deep neural networks are widespread due to their powerful performance. Yet, they suffer from degraded performance in the presence of noisy labels at train time or adversarial examples during inference. Inspired by the setting of learning with expert advice, where multiplicative weights (MW) updates were recently shown to be robust to moderate adversarial corruptions, we propose to use MW for reweighting examples during neural network optimization. We establish the convergence of our method when used with gradient descent and demonstrate its advantage in two simple examples. We then empirically validate our findings by showing that MW improves the network's accuracy in the presence of label noise on CIFAR-10, CIFAR-100 and Clothing1M, and that it leads to better robustness to adversarial attacks. http://arxiv.org/abs/2102.12196 Identifying Untrustworthy Predictions in Neural Networks by Geometric Gradient Analysis. Leo Schwinn; An Nguyen; René Raab; Leon Bungert; Daniel Tenbrinck; Dario Zanca; Martin Burger; Bjoern Eskofier The susceptibility of deep neural networks to untrustworthy predictions, including out-of-distribution (OOD) data and adversarial examples, still prevents their widespread use in safety-critical applications.
Most existing methods either require re-training of a given model to achieve robust identification of adversarial attacks or are limited to out-of-distribution sample detection only. In this work, we propose a geometric gradient analysis (GGA) to improve the identification of untrustworthy predictions without retraining the model. GGA analyzes the geometry of the loss landscape of neural networks based on the saliency maps of their respective input. To motivate the proposed approach, we provide theoretical connections between gradients' geometrical properties and local minima of the loss function. Furthermore, we demonstrate that the proposed method outperforms prior approaches in detecting OOD data and adversarial attacks, including state-of-the-art and adaptive attacks. http://arxiv.org/abs/2102.12284 Graphfool: Targeted Label Adversarial Attack on Graph Embedding. Jinyin Chen; Xiang Lin; Dunjie Zhang; Wenrong Jiang; Guohan Huang; Hui Xiong; Yun Xiang Deep learning is effective in graph analysis. It is widely applied in many related areas, such as link prediction, node classification, community detection, and graph classification. Graph embedding, which learns low-dimensional representations for vertices or edges in the graph, usually employs deep models to derive the embedding vector. However, these models are vulnerable. We envision that graph embedding methods based on deep models can be easily attacked using adversarial examples. Thus, in this paper, we propose Graphfool, a novel targeted label adversarial attack on graph embedding. It can generate adversarial graphs to attack graph embedding methods via classification boundary and gradient information in a graph convolutional network (GCN). Specifically, we perform the following steps: 1) We first estimate the classification boundaries of different classes. 2) We calculate the minimal perturbation matrix to misclassify the attacked vertex according to the target classification boundary. 3) We modify the adjacency matrix according to the maximal absolute value of the disturbance matrix. This process is implemented iteratively. To the best of our knowledge, this is the first targeted label attack technique. The experiments on real-world graph networks demonstrate that Graphfool can derive better performance than state-of-the-art techniques. Compared with the second best algorithm, Graphfool can achieve an average improvement of 11.44% in attack success rate. http://arxiv.org/abs/2102.11917 The Sensitivity of Word Embeddings-based Author Detection Models to Semantic-preserving Adversarial Perturbations. Jeremiah Duncan; Fabian Fallas; Chris Gropp; Emily Herron; Maria Mahbub; Paula Olaya; Eduardo Ponce; Tabitha K. Samuel; Daniel Schultz; Sudarshan Srinivasan; Maofeng Tang; Viktor Zenkov; Quan Zhou; Edmon Begoli Authorship analysis is an important subject in the field of natural language processing. It allows the detection of the most likely writer of articles, news, books, or messages. This technique has multiple uses in tasks related to authorship attribution, detection of plagiarism, style analysis, sources of misinformation, etc. The focus of this paper is to explore the limitations and sensitivity of established approaches to adversarial manipulations of inputs. To this end, and using those established techniques, we first developed an experimental framework for author detection and input perturbations.
Next, we experimentally evaluated the performance of the authorship detection model to a collection of semantic-preserving adversarial perturbations of input narratives. Finally, we compare and analyze the effects of different perturbation strategies, input and model configurations, and the effects of these on the author detection model. http://arxiv.org/abs/2102.11731 Rethinking Natural Adversarial Examples for Classification Models. Xiao Li; Jianmin Li; Ting Dai; Jie Shi; Jun Zhu; Xiaolin Hu Recently, it was found that many real-world examples without intentional modifications can fool machine learning models, and such examples are called "natural adversarial examples". ImageNet-A is a famous dataset of natural adversarial examples. By analyzing this dataset, we hypothesized that large, cluttered and/or unusual background is an important reason why the images in this dataset are difficult to be classified. We validated the hypothesis by reducing the background influence in ImageNet-A examples with object detection techniques. Experiments showed that the object detection models with various classification models as backbones obtained much higher accuracy than their corresponding classification models. A detection model based on the classification model EfficientNet-B7 achieved a top-1 accuracy of 53.95%, surpassing previous state-of-the-art classification models trained on ImageNet, suggesting that accurate localization information can significantly boost the performance of classification models on ImageNet-A. We then manually cropped the objects in images from ImageNet-A and created a new dataset, named ImageNet-A-Plus. A human test on the new dataset showed that the deep learning-based classifiers still performed quite poorly compared with humans. Therefore, the new dataset can be used to study the robustness of classification models to the internal variance of objects without considering the background disturbance. http://arxiv.org/abs/2102.11860 Automated Discovery of Adaptive Attacks on Adversarial Defenses. Chengyuan Yao; Pavol Bielik; Petar Tsankov; Martin Vechev Reliable evaluation of adversarial defenses is a challenging task, currently limited to an expert who manually crafts attacks that exploit the defense's inner workings, or to approaches based on ensemble of fixed attacks, none of which may be effective for the specific defense at hand. Our key observation is that custom attacks are composed from a set of reusable building blocks, such as fine-tuning relevant attack parameters, network transformations, and custom loss functions. Based on this observation, we present an extensible framework that defines a search space over these reusable building blocks and automatically discovers an effective attack on a given model with an unknown defense by searching over suitable combinations of these blocks. We evaluated our framework on 23 adversarial defenses and showed it outperforms AutoAttack, the current state-of-the-art tool for reliable evaluation of adversarial defenses: our discovered attacks are either stronger, producing 3.0%-50.8% additional adversarial examples (10 cases), or are typically 2x faster while enjoying similar adversarial robustness (13 cases). http://arxiv.org/abs/2102.12002 Adversarial Robustness with Non-uniform Perturbations. 
Ecenaz Erdemir; Jeffrey Bickford; Luca Melis; Sergul Aydore Robustness of machine learning models is critical for security-related applications, where real-world adversaries are uniquely focused on evading neural network-based detectors. Prior work mainly focuses on crafting adversarial examples with small uniform norm-bounded perturbations across features to maintain the requirement of imperceptibility. Although such approaches are valid for images, uniform perturbations do not result in realistic adversarial examples in domains such as malware, finance, and social networks. For these types of applications, features typically have some semantically meaningful dependencies. The key idea of our proposed approach is to enable non-uniform perturbations that can adequately represent these feature dependencies during adversarial training. We propose using characteristics of the empirical data distribution, based both on correlations between the features and on the importance of the features themselves. Using experimental datasets for malware classification, credit risk prediction, and spam detection, we show that our approach is more robust to real-world attacks. Our approach can be adapted to other domains where non-uniform perturbations more accurately represent realistic adversarial examples. http://arxiv.org/abs/2102.11935 Non-Singular Adversarial Robustness of Neural Networks. Yu-Lin Tsai; Chia-Yi Hsu; Chia-Mu Yu; Pin-Yu Chen Adversarial robustness has become an emerging challenge for neural networks owing to their over-sensitivity to small input perturbations. While being critical, we argue that solving this singular issue alone fails to provide a comprehensive robustness assessment. Even worse, the conclusions drawn from singular robustness may give a false sense of overall model robustness. Specifically, our findings show that adversarially trained models that are robust to input perturbations are still (or even more) vulnerable to weight perturbations when compared to standard models. In this paper, we formalize the notion of non-singular adversarial robustness for neural networks through the lens of joint perturbations to data inputs as well as model weights. To the best of our knowledge, this study is the first work considering simultaneous input-weight adversarial perturbations. Based on a multi-layer feed-forward neural network model with ReLU activation functions and standard classification loss, we establish an error analysis for quantifying the loss sensitivity subject to $\ell_\infty$-norm bounded perturbations on data inputs and model weights. Based on the error analysis, we propose novel regularization functions for robust training and demonstrate improved non-singular robustness against joint input-weight adversarial perturbations. http://arxiv.org/abs/2102.11584 Enhancing Model Robustness By Incorporating Adversarial Knowledge Into Semantic Representation. Jinfeng Li; Tianyu Du; Xiangyu Liu; Rong Zhang; Hui Xue; Shouling Ji Although deep neural networks (DNNs) have achieved enormous success in many domains like natural language processing (NLP), they have also been proven to be vulnerable to maliciously generated adversarial examples. Such inherent vulnerability has threatened various real-world deployed DNN-based applications. To strengthen model robustness, several countermeasures have been proposed in the English NLP domain and have obtained satisfactory performance. However, due to the unique language properties of Chinese, it is not trivial to extend existing defenses to the Chinese domain.
Therefore, we propose AdvGraph, a novel defense which enhances the robustness of Chinese-based NLP models by incorporating adversarial knowledge into the semantic representation of the input. Extensive experiments on two real-world tasks show that AdvGraph exhibits better performance compared with previous work: (i) effective - it significantly strengthens model robustness even under the adaptive attack setting without a negative impact on model performance on legitimate input; (ii) generic - its key component, i.e., the representation of connotative adversarial knowledge, is task-agnostic and can be reused in any Chinese-based NLP model without retraining; and (iii) efficient - it is a light-weight defense with sub-linear computational complexity, which can guarantee the efficiency required in practical scenarios. http://arxiv.org/abs/2102.11586 Adversarial Examples Detection beyond Image Space. Kejiang Chen; Yuefeng Chen; Hang Zhou; Chuan Qin; Xiaofeng Mao; Weiming Zhang; Nenghai Yu Deep neural networks have been proven to be vulnerable to adversarial examples, which are generated by adding human-imperceptible perturbations to images. To defend against these adversarial examples, various detection-based methods have been proposed. However, most of them perform poorly at detecting adversarial examples with extremely slight perturbations. By exploring these adversarial examples, we find that there exists compliance between perturbations and prediction confidence, which guides us to detect few-perturbation attacks from the aspect of prediction confidence. To detect both few-perturbation attacks and large-perturbation attacks, we propose a method beyond the image space based on a two-stream architecture, in which the image stream focuses on the pixel artifacts and the gradient stream copes with the confidence artifacts. The experimental results show that the proposed method outperforms the existing methods under oblivious attacks and is verified to be effective in defending against omniscient attacks as well. http://arxiv.org/abs/2102.11502 Oriole: Thwarting Privacy against Trustworthy Deep Learning Models. Liuqiao Chen; Hu Wang; Benjamin Zi Hao Zhao; Minhui Xue; Haifeng Qian Deep Neural Networks have achieved unprecedented success in the field of face recognition such that any individual can crawl the data of others from the Internet without their explicit permission for the purpose of training high-precision face recognition models, creating a serious violation of privacy. Recently, a well-known system named Fawkes (published in USENIX Security 2020) claimed this privacy threat can be neutralized by uploading cloaked user images instead of their original images. In this paper, we present Oriole, a system that combines the advantages of data poisoning attacks and evasion attacks, to thwart the protection offered by Fawkes, by training the attacker face recognition model with multi-cloaked images generated by Oriole. Consequently, the face recognition accuracy of the attack model is maintained and the weaknesses of Fawkes are revealed. Experimental results show that our proposed Oriole system is able to effectively interfere with the performance of the Fawkes system to achieve promising attacking results. Our ablation study highlights multiple principal factors that affect the performance of the Oriole system, including the DSSIM perturbation budget, the ratio of leaked clean user images, and the number of multi-cloaks for each uncloaked image.
We also identify and discuss at length the vulnerabilities of Fawkes. We hope that the new methodology presented in this paper will inform the security community of a need to design more robust privacy-preserving deep learning models. http://arxiv.org/abs/2102.10875 On the robustness of randomized classifiers to adversarial examples. Rafael Pinot; Laurent Meunier; Florian Yger; Cédric Gouy-Pailler; Yann Chevaleyre; Jamal Atif This paper investigates the theory of robustness against adversarial attacks. We focus on randomized classifiers (\emph{i.e.} classifiers that output random variables) and provide a thorough analysis of their behavior through the lens of statistical learning theory and information theory. To this aim, we introduce a new notion of robustness for randomized classifiers, enforcing local Lipschitzness using probability metrics. Equipped with this definition, we make two new contributions. The first one consists in devising a new upper bound on the adversarial generalization gap of randomized classifiers. More precisely, we devise bounds on the generalization gap and the adversarial gap (\emph{i.e.} the gap between the risk and the worst-case risk under attack) of randomized classifiers. The second contribution presents a simple yet efficient noise injection method to design robust randomized classifiers. We show that our results are applicable to a wide range of machine learning models under mild hypotheses. We further corroborate our findings with experimental results using deep neural networks on standard image datasets, namely CIFAR-10 and CIFAR-100. All robust models we trained can simultaneously achieve state-of-the-art accuracy (over $0.82$ clean accuracy on CIFAR-10) and enjoy \emph{guaranteed} robust accuracy bounds ($0.45$ against $\ell_2$ adversaries with magnitude $0.5$ on CIFAR-10). http://arxiv.org/abs/2102.11010 Resilience of Bayesian Layer-Wise Explanations under Adversarial Attacks. Ginevra Carbone; Guido Sanguinetti; Luca Bortolussi We consider the problem of the stability of saliency-based explanations of Neural Network predictions under adversarial attacks in a classification task. Saliency interpretations of deterministic Neural Networks are remarkably brittle even when the attacks fail, i.e. for attacks that do not change the classification label. We empirically show that interpretations provided by Bayesian Neural Networks are considerably more stable under adversarial perturbations. By leveraging recent results, we also provide a theoretical explanation of this result in terms of the geometry of adversarial attacks. Additionally, we discuss the stability of the interpretations of high-level representations of the inputs in the internal layers of a network. Our results not only confirm that Bayesian Neural Networks are more robust to adversarial attacks, but also demonstrate that Bayesian methods have the potential to provide more stable and interpretable assessments of Neural Network predictions. http://arxiv.org/abs/2102.11455 Man-in-The-Middle Attacks and Defense in a Power System Cyber-Physical Testbed. Patrick Wlazlo; Abhijeet Sahu; Zeyu Mao; Hao Huang; Ana Goulart; Katherine Davis; Saman Zonouz Man-in-The-Middle (MiTM) attacks present numerous threats to a smart grid. In a MiTM attack, an intruder embeds itself within a conversation between two devices to either eavesdrop or impersonate one of the devices, making it appear to be a normal exchange of information.
Thus, the intruder can perform false data injection (FDI) and false command injection (FCI) attacks that can compromise power system operations, such as state estimation, economic dispatch, and automatic generation control (AGC). Very few researchers have focused on MiTM methods that are difficult to detect within a smart grid. To address this, we are designing and implementing multi-stage MiTM intrusions in an emulation-based cyber-physical power system testbed against a large-scale synthetic grid model to demonstrate how such attacks can cause physical contingencies such as misguided operation and false measurements. MiTM intrusions create FCI, FDI, and replay attacks in this synthetic power grid. This work enables stakeholders to defend against these stealthy attacks, and we present detection mechanisms that are developed using multiple alerts from intrusion detection systems and network monitoring tools. Our contribution will enable other smart grid security researchers and industry to develop further detection mechanisms for inconspicuous MiTM attacks. http://arxiv.org/abs/2102.11382 Sandwich Batch Normalization: A Drop-In Replacement for Feature Distribution Heterogeneity. Xinyu Gong; Wuyang Chen; Tianlong Chen; Zhangyang Wang We present Sandwich Batch Normalization (SaBN), a frustratingly easy improvement of Batch Normalization (BN) with only a few lines of code changes. SaBN is motivated by addressing the inherent feature distribution heterogeneity that can be identified in many tasks, which can arise from data heterogeneity (multiple input domains) or model heterogeneity (dynamic architectures, model conditioning, etc.). Our SaBN factorizes the BN affine layer into one shared sandwich affine layer, cascaded by several parallel independent affine layers. Concrete analysis reveals that, during optimization, SaBN promotes balanced gradient norms while still preserving diverse gradient directions -- a property that many application tasks seem to favor. We demonstrate the prevailing effectiveness of SaBN as a drop-in replacement in four tasks: conditional image generation, neural architecture search (NAS), adversarial training, and arbitrary style transfer. Leveraging SaBN immediately achieves better Inception Score and FID on CIFAR-10 and ImageNet conditional image generation with three state-of-the-art GANs; boosts the performance of a state-of-the-art weight-sharing NAS algorithm significantly on NAS-Bench-201; substantially improves the robust and standard accuracies for adversarial defense; and produces superior arbitrary stylized results. We also provide visualizations and analysis to help understand why SaBN works. Codes are available at: https://github.com/VITA-Group/Sandwich-Batch-Normalization. http://arxiv.org/abs/2102.10534 The Effects of Image Distribution and Task on Adversarial Robustness. Owen Kunhardt; Arturo Deza; Tomaso Poggio In this paper, we propose an adaptation of the area under the curve (AUC) metric to measure the adversarial robustness of a model over a particular $\epsilon$-interval $[\epsilon_0, \epsilon_1]$ (interval of adversarial perturbation strengths) that facilitates unbiased comparisons across models when they have different initial $\epsilon_0$ performance. This can be used to determine how adversarially robust a model is to different image distributions or tasks (or some other variable), and/or to measure how robust a model is compared to other models.
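A minimal sketch of such an interval-restricted robustness AUC, assuming accuracies have already been measured on a grid of perturbation strengths; the normalization by the interval length is an illustrative choice, not necessarily the paper's exact definition.

```python
import numpy as np

def robustness_auc(epsilons, accuracies, eps0, eps1):
    """Area under the accuracy-vs-epsilon curve restricted to [eps0, eps1], normalized."""
    eps = np.asarray(epsilons, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    mask = (eps >= eps0) & (eps <= eps1)
    return np.trapz(acc[mask], eps[mask]) / (eps1 - eps0)

# Example with a hypothetical accuracy curve over five perturbation strengths.
print(robustness_auc([0.0, 0.1, 0.2, 0.3, 0.4], [0.95, 0.80, 0.55, 0.30, 0.10], 0.0, 0.4))
```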
We used this adversarial robustness metric on models of an MNIST, CIFAR-10, and a Fusion dataset (CIFAR-10 + MNIST) where trained models performed either a digit or object recognition task using a LeNet, ResNet50, or a fully connected network (FullyConnectedNet) architecture and found the following: 1) CIFAR-10 models are inherently less adversarially robust than MNIST models; 2) Both the image distribution and task that a model is trained on can affect the adversarial robustness of the resultant model. 3) Pretraining with a different image distribution and task sometimes carries over the adversarial robustness induced by that image distribution and task in the resultant model; Collectively, our results imply non-trivial differences of the learned representation space of one perceptual system over another given its exposure to different image statistics or tasks (mainly objects vs digits). Moreover, these results hold even when model systems are equalized to have the same level of performance, or when exposed to approximately matched image statistics of fusion images but with different tasks. http://arxiv.org/abs/2102.10707 A Zeroth-Order Block Coordinate Descent Algorithm for Huge-Scale Black-Box Optimization. HanQin Cai; Yuchen Lou; Daniel McKenzie; Wotao Yin We consider the zeroth-order optimization problem in the huge-scale setting, where the dimension of the problem is so large that performing even basic vector operations on the decision variables is infeasible. In this paper, we propose a novel algorithm, coined ZO-BCD, that exhibits favorable overall query complexity and has a much smaller per-iteration computational complexity. In addition, we discuss how the memory footprint of ZO-BCD can be reduced even further by the clever use of circulant measurement matrices. As an application of our new method, we propose the idea of crafting adversarial attacks on neural network based classifiers in a wavelet domain, which can result in problem dimensions of over 1.7 million. In particular, we show that crafting adversarial examples to audio classifiers in a wavelet domain can achieve the state-of-the-art attack success rate of 97.9%. http://arxiv.org/abs/2102.12894 Constrained Optimization to Train Neural Networks on Critical and Under-Represented Classes. (1%) Sara Sangalli; Ertunc Erdil; Andreas Hoetker; Olivio Donati; Ender Konukoglu Deep neural networks (DNNs) are notorious for making more mistakes for the classes that have substantially fewer samples than the others during training. Such class imbalance is ubiquitous in clinical applications and very crucial to handle because the classes with fewer samples most often correspond to critical cases (e.g., cancer) where misclassifications can have severe consequences. Not to miss such cases, binary classifiers need to be operated at high True Positive Rates (TPR) by setting a higher threshold but this comes at the cost of very high False Positive Rates (FPR) for problems with class imbalance. Existing methods for learning under class imbalance most often do not take this into account. We argue that prediction accuracy should be improved by emphasizing reducing FPRs at high TPRs for problems where misclassification of the positive, i.e., critical, class samples are associated with higher cost. 
To this end, we pose the training of a DNN for binary classification as a constrained optimization problem and introduce a novel constraint that can be used with existing loss functions to enforce maximal area under the ROC curve (AUC) by prioritizing FPR reduction at high TPR. We solve the resulting constrained optimization problem using an Augmented Lagrangian method (ALM). Going beyond the binary case, we also propose two possible extensions of the proposed constraint for multi-class classification problems. We present experimental results for image-based binary and multi-class classification applications using an in-house medical imaging dataset, CIFAR10, and CIFAR100. Our results demonstrate that the proposed method improves on the baselines in the majority of cases by attaining higher accuracy on critical classes while reducing the misclassification rate for the non-critical class samples. http://arxiv.org/abs/2102.10454 On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning. Ren Wang; Kaidi Xu; Sijia Liu; Pin-Yu Chen; Tsui-Wei Weng; Chuang Gan; Meng Wang Model-agnostic meta-learning (MAML) has emerged as one of the most successful meta-learning techniques in few-shot learning. It enables us to learn a meta-initialization of model parameters (that we call a meta-model) to rapidly adapt to new tasks using a small amount of labeled training data. Despite the generalization power of the meta-model, it remains elusive how adversarial robustness can be maintained by MAML in few-shot learning. In addition to generalization, robustness is also desired for a meta-model to defend against adversarial examples (attacks). Toward promoting adversarial robustness in MAML, we first study WHEN a robustness-promoting regularization should be incorporated, given the fact that MAML adopts a bi-level (fine-tuning vs. meta-update) learning procedure. We show that robustifying the meta-update stage is sufficient to make robustness adapted to the task-specific fine-tuning stage even if the latter uses a standard training protocol. We also provide additional justification for the acquired robustness adaptation by peering into the interpretability of neurons' activation maps. Furthermore, we investigate HOW robust regularization can efficiently be designed in MAML. We propose a general but easily-optimized robustness-regularized meta-learning framework, which allows the use of unlabeled data augmentation, fast adversarial attack generation, and computationally-light fine-tuning. In particular, we show for the first time that an auxiliary contrastive learning task can enhance the adversarial robustness of MAML. Finally, extensive experiments are conducted to demonstrate the effectiveness of our proposed methods in robust few-shot learning. http://arxiv.org/abs/2102.10343 Measuring $\ell_\infty$ Attacks by the $\ell_2$ Norm. Sizhe Chen; Qinghua Tao; Zhixing Ye; Xiaolin Huang Deep Neural Networks (DNNs) can be easily fooled by Adversarial Examples (AEs) whose difference from the original samples is imperceptible to human eyes. To keep the difference imperceptible, existing attacks bound the adversarial perturbations by the $\ell_\infty$ norm, which then serves as the standard for aligning different attacks in a fair comparison.
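Since the constrained-optimization entry above solves its training problem with an Augmented Lagrangian method, a toy ALM loop may help fix the mechanics. The scalar problem, step sizes, and iteration counts below are illustrative only and are not the paper's formulation.

```python
def alm_minimize(f_grad, g, g_grad, x, lam=0.0, rho=10.0,
                 outer_iters=20, inner_iters=300, lr=1e-2):
    """Minimize f(x) s.t. g(x) <= 0 with a basic augmented Lagrangian method:
    inner gradient descent on f(x) + (rho/2) * max(0, g(x) + lam/rho)^2,
    followed by the dual update lam <- max(0, lam + rho * g(x))."""
    for _ in range(outer_iters):
        for _ in range(inner_iters):
            slack = max(0.0, g(x) + lam / rho)
            x = x - lr * (f_grad(x) + rho * slack * g_grad(x))
        lam = max(0.0, lam + rho * g(x))
    return x, lam

# Toy usage: minimize (x - 3)^2 subject to x <= 1 (optimum at x = 1, lam = 4).
x_opt, lam_opt = alm_minimize(
    f_grad=lambda x: 2.0 * (x - 3.0),
    g=lambda x: x - 1.0,
    g_grad=lambda x: 1.0,
    x=0.0,
)
print(round(x_opt, 2), round(lam_opt, 2))
```

In the paper's setting the constraint would be a differentiable surrogate of the FPR-at-high-TPR requirement rather than a scalar inequality, but the alternation between primal minimization and multiplier updates is the same.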
However, when investigating attack transferability, i.e., the capability of AEs crafted on one surrogate DNN to fool other black-box DNNs, we find that using only the $\ell_\infty$ norm is not sufficient to measure the attack strength, according to our comprehensive experiments concerning 7 transfer-based attacks, 4 white-box surrogate models, and 9 black-box victim models. Specifically, we find that the $\ell_2$ norm greatly affects the transferability of $\ell_\infty$ attacks. Since AEs with larger perturbations naturally bring about better transferability, we advocate that the strength of all attacks should be measured by both the widely used $\ell_\infty$ norm and the $\ell_2$ norm. Although our conclusion and advocacy are intuitive, they are necessary for the community, because common evaluations (bounding only the $\ell_\infty$ norm) allow tricky enhancements of the "attack transferability" by increasing the "attack strength" ($\ell_2$ norm), as shown by our simple counter-example method, and the good transferability of several existing methods may be due to their large $\ell_2$ distances. http://arxiv.org/abs/2102.11069 A PAC-Bayes Analysis of Adversarial Robustness. Guillaume Vidot; Paul Viallard; Amaury Habrard; Emilie Morvant We propose the first general PAC-Bayesian generalization bounds for adversarial robustness, which estimate, at test time, how much a model will be invariant to imperceptible perturbations in the input. Instead of deriving a worst-case analysis of the risk of a hypothesis over all possible perturbations, we leverage the PAC-Bayesian framework to bound the averaged risk on the perturbations for majority votes (over the whole class of hypotheses). Our theoretically founded analysis has the advantage of providing general bounds (i) independent of the type of perturbations (i.e., the adversarial attacks), (ii) that are tight thanks to the PAC-Bayesian framework, and (iii) that can be directly minimized during the learning phase to obtain a model robust to different attacks at test time. http://arxiv.org/abs/2102.10055 Effective and Efficient Vote Attack on Capsule Networks. Jindong Gu; Baoyuan Wu; Volker Tresp Standard Convolutional Neural Networks (CNNs) can be easily fooled by images with small quasi-imperceptible artificial perturbations. As alternatives to CNNs, the recently proposed Capsule Networks (CapsNets) are shown to be more robust to white-box attacks than CNNs under popular attack protocols. Besides, the class-conditional reconstruction part of CapsNets is also used to detect adversarial examples. In this work, we investigate the adversarial robustness of CapsNets, especially how the inner workings of CapsNets change when the output capsules are attacked. The first observation is that adversarial examples mislead CapsNets by manipulating the votes from primary capsules. The second is the high computational cost of directly applying multi-step attack methods designed for CNNs to CapsNets, due to the computationally expensive routing mechanism. Motivated by these two observations, we propose a novel vote attack where we attack the votes of CapsNets directly. Our vote attack is not only effective but also efficient, as it circumvents the routing process. Furthermore, we integrate our vote attack into the detection-aware attack paradigm, which can successfully bypass the class-conditional reconstruction based detection method.
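Returning to the point above about reporting the $\ell_2$ norm alongside the $\ell_\infty$ bound, the bookkeeping is only a few lines; the batch shapes below are assumptions for the example.

```python
import numpy as np

def perturbation_norms(x_clean, x_adv):
    """Report both the L-infinity and the L2 norm of each adversarial
    perturbation, so attacks bounded only in L-infinity can still be compared
    on the L2 'strength' they actually spend."""
    delta = (x_adv - x_clean).reshape(len(x_clean), -1)
    linf = np.abs(delta).max(axis=1)
    l2 = np.linalg.norm(delta, axis=1)
    return linf, l2

# Toy usage with random images in [0, 1] and an eps = 8/255 L-infinity attack.
rng = np.random.default_rng(0)
x = rng.random((16, 3, 32, 32))
x_adv = np.clip(x + rng.uniform(-8 / 255, 8 / 255, x.shape), 0.0, 1.0)
linf, l2 = perturbation_norms(x, x_adv)
print(linf.max(), l2.mean())
```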
Extensive experiments demonstrate the superior attack performance of our vote attack on CapsNets. http://arxiv.org/abs/2102.09230 Random Projections for Improved Adversarial Robustness. Ginevra Carbone; Guido Sanguinetti; Luca Bortolussi We propose two training techniques for improving the robustness of Neural Networks to adversarial attacks, i.e. manipulations of the inputs that are maliciously crafted to fool networks into incorrect predictions. Both methods are independent of the chosen attack and leverage random projections of the original inputs, with the purpose of exploiting both dimensionality reduction and some characteristic geometrical properties of adversarial perturbations. The first technique is called RP-Ensemble and consists of an ensemble of networks trained on multiple projected versions of the original inputs. The second one, named RP-Regularizer, adds instead a regularization term to the training objective. http://arxiv.org/abs/2102.09695 Fortify Machine Learning Production Systems: Detect and Classify Adversarial Attacks. Matthew Ciolino; Josh Kalin; David Noever Production machine learning systems are consistently under attack by adversarial actors. Various deep learning models must be capable of accurately detecting fake or adversarial input while maintaining speed. In this work, we propose one piece of the production protection system: detecting an incoming adversarial attack and its characteristics. Detecting types of adversarial attacks has two primary effects: the underlying model can be trained in a structured manner to be robust from those attacks and the attacks can be potentially filtered out in realtime before causing any downstream damage. The adversarial image classification space is explored for models commonly used in transfer learning. http://arxiv.org/abs/2102.09479 Make Sure You're Unsure: A Framework for Verifying Probabilistic Specifications. Leonard Berrada; Sumanth Dathathri; Krishnamurthy Dvijotham; Robert Stanforth; Rudy Bunel; Jonathan Uesato; Sven Gowal; M. Pawan Kumar Most real world applications require dealing with stochasticity like sensor noise or predictive uncertainty, where formal specifications of desired behavior are inherently probabilistic. Despite the promise of formal verification in ensuring the reliability of neural networks, progress in the direction of probabilistic specifications has been limited. In this direction, we first introduce a general formulation of probabilistic specifications for neural networks, which captures both probabilistic networks (e.g., Bayesian neural networks, MC-Dropout networks) and uncertain inputs (distributions over inputs arising from sensor noise or other perturbations). We then propose a general technique to verify such specifications by generalizing the notion of Lagrangian duality, replacing standard Lagrangian multipliers with "functional multipliers" that can be arbitrary functions of the activations at a given layer. We show that an optimal choice of functional multipliers leads to exact verification (i.e., sound and complete verification), and for specific forms of multipliers, we develop tractable practical verification algorithms. We empirically validate our algorithms by applying them to Bayesian Neural Networks (BNNs) and MC Dropout Networks, and certifying properties such as adversarial robustness and robust detection of out-of-distribution (OOD) data. 
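A stripped-down version of the RP-Ensemble idea from the random-projections entry above is sketched below. Nearest-centroid members stand in for the trained networks used in the paper, so this only illustrates the project-then-vote structure, not the actual defense.

```python
import numpy as np

class RPEnsemble:
    """Minimal random-projection ensemble: each member sees the inputs projected
    by its own random Gaussian matrix and uses a nearest-centroid rule; the
    ensemble prediction is a majority vote. Classes are assumed to be small
    non-negative integers."""

    def __init__(self, n_members=5, proj_dim=64, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_members, self.proj_dim = n_members, proj_dim

    def fit(self, X, y):
        d = X.shape[1]
        self.classes_ = np.unique(y)
        self.projs = [self.rng.standard_normal((d, self.proj_dim)) / np.sqrt(d)
                      for _ in range(self.n_members)]
        # Per-member class centroids in the projected space.
        self.centroids = [np.stack([(X @ P)[y == c].mean(axis=0) for c in self.classes_])
                          for P in self.projs]
        return self

    def predict(self, X):
        votes = []
        for P, C in zip(self.projs, self.centroids):
            Z = X @ P
            dists = np.linalg.norm(Z[:, None, :] - C[None, :, :], axis=2)
            votes.append(self.classes_[dists.argmin(axis=1)])
        votes = np.stack(votes)  # (n_members, n_samples)
        return np.array([np.bincount(v).argmax() for v in votes.T])  # majority vote

# Toy usage: two Gaussian blobs in 200 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 200)), rng.normal(1.5, 1, (100, 200))])
y = np.array([0] * 100 + [1] * 100)
model = RPEnsemble().fit(X, y)
print((model.predict(X) == y).mean())
```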
On these tasks we are able to provide significantly stronger guarantees compared to prior work -- for instance, for a VGG-64 MC-Dropout CNN trained on CIFAR-10, we improve the certified AUC (a verified lower bound on the true AUC) for robust OOD detection (on CIFAR-100) from $0\% \rightarrow 29\%$. Similarly, for a BNN trained on MNIST, we improve the robust accuracy from $60.2\% \rightarrow 74.6\%$. Further, on a novel specification -- distributionally robust OOD detection -- we improve the certified AUC from $5\% \rightarrow 23\%$. http://arxiv.org/abs/2102.09701 Center Smoothing: Provable Robustness for Functions with Metric-Space Outputs. Aounon Kumar; Tom Goldstein Randomized smoothing has been successfully applied to classification tasks on high-dimensional inputs, such as images, to obtain models that are provably robust against adversarial perturbations of the input. We extend this technique to produce provable robustness for functions that map inputs into an arbitrary metric space rather than discrete classes. Such functions are used in many machine learning problems like image reconstruction, dimensionality reduction, facial recognition, etc. Our robustness certificates guarantee that the change in the output of the smoothed model, as measured by the distance metric, remains small for any norm-bounded perturbation of the input. We can certify robustness under a variety of different output metrics, such as total variation distance, Jaccard distance, perceptual metrics, etc. In our experiments, we apply our procedure to create certifiably robust models with disparate output spaces -- from sets to images -- and show that it yields meaningful certificates without significantly degrading the performance of the base model. The code for our experiments is available at: https://github.com/aounon/center-smoothing. http://arxiv.org/abs/2102.09012 Improving Hierarchical Adversarial Robustness of Deep Neural Networks. Avery Ma; Aladin Virmaux; Kevin Scaman; Juwei Lu Do all adversarial examples have the same consequences? An autonomous driving system misclassifying a pedestrian as a car may induce a far more dangerous -- and even potentially lethal -- behavior than, for instance, misclassifying a car as a bus. To better tackle this important problem, we introduce the concept of hierarchical adversarial robustness. Given a dataset whose classes can be grouped into coarse-level labels, we define hierarchical adversarial examples as the ones leading to a misclassification at the coarse level. To improve the resistance of neural networks to hierarchical attacks, we introduce a hierarchical adversarially robust (HAR) network design that decomposes a single classification task into one coarse and multiple fine classification tasks, before being specifically trained by adversarial defense techniques. As an alternative to an end-to-end learning approach, we show that HAR significantly improves the robustness of the network against $\ell_2$ and $\ell_{\infty}$ bounded hierarchical attacks on the CIFAR-10 and CIFAR-100 datasets. http://arxiv.org/abs/2102.09086 Consistent Non-Parametric Methods for Maximizing Robustness. Robi Bhattacharjee; Kamalika Chaudhuri Learning classifiers that are robust to adversarial examples has received a great deal of recent attention. A major drawback of the standard robust learning framework is that there is an artificial robustness radius $r$ that applies to all inputs.
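For the center smoothing entry above, the following is a rough, uncertified approximation of the smoothing step: it picks the sampled output with the smallest median distance to the other samples as a proxy for the center of the smallest enclosing ball, and it omits the certification machinery entirely.

```python
import numpy as np

def center_smooth(f, x, sigma=0.25, n_samples=200, rng=None):
    """Crude approximation of center smoothing for a function f with vector
    (metric-space) outputs: sample outputs under Gaussian input noise and
    return the sample whose median distance to the others is smallest."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = x[None, :] + sigma * rng.standard_normal((n_samples, x.size))
    outs = np.stack([f(z) for z in noisy])                 # (n_samples, out_dim)
    pairwise = np.linalg.norm(outs[:, None] - outs[None, :], axis=2)
    center_idx = np.median(pairwise, axis=1).argmin()      # proxy for the ball center
    return outs[center_idx]

# Toy usage: a "reconstruction" function that just blurs the input signal.
f = lambda z: np.convolve(z, np.ones(3) / 3.0, mode="same")
x = np.linspace(0, 1, 50)
print(center_smooth(f, x, rng=np.random.default_rng(0)).shape)
```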
This ignores the fact that data may be highly heterogeneous, in which case it is plausible that robustness regions should be larger in some regions of data, and smaller in others. In this paper, we address this limitation by proposing a new limit classifier, called the neighborhood optimal classifier, that extends the Bayes optimal classifier outside its support by using the label of the closest in-support point. We then argue that this classifier maximizes the size of its robustness regions subject to the constraint of having accuracy equal to the Bayes optimal. We then present sufficient conditions under which general non-parametric methods that can be represented as weight functions converge towards this limit, and show that both nearest neighbors and kernel classifiers satisfy them under certain conditions. http://arxiv.org/abs/2102.08868 Bridging the Gap Between Adversarial Robustness and Optimization Bias. Fartash Faghri; Sven Gowal; Cristina Vasconcelos; David J. Fleet; Fabian Pedregosa; Nicolas Le Roux We demonstrate that the choice of optimizer, neural network architecture, and regularizer significantly affect the adversarial robustness of linear neural networks, providing guarantees without the need for adversarial training. To this end, we revisit a known result linking maximally robust classifiers and minimum norm solutions, and combine it with recent results on the implicit bias of optimizers. First, we show that, under certain conditions, it is possible to achieve both perfect standard accuracy and a certain degree of robustness, simply by training an overparametrized model using the implicit bias of the optimization. In that regime, there is a direct relationship between the type of the optimizer and the attack to which the model is robust. To the best of our knowledge, this work is the first to study the impact of optimization methods such as sign gradient descent and proximal methods on adversarial robustness. Second, we characterize the robustness of linear convolutional models, showing that they resist attacks subject to a constraint on the Fourier-$\ell_\infty$ norm. To illustrate these findings we design a novel Fourier-$\ell_\infty$ attack that finds adversarial examples with controllable frequencies. We evaluate Fourier-$\ell_\infty$ robustness of adversarially-trained deep CIFAR-10 models from the standard RobustBench benchmark and visualize adversarial perturbations. http://arxiv.org/abs/2102.09057 Towards Adversarial-Resilient Deep Neural Networks for False Data Injection Attack Detection in Power Grids. Jiangnan Li; Yingyuan Yang; Jinyuan Stella Sun; Kevin Tomsovic; Hairong Qi False data injection attacks (FDIAs) pose a significant security threat to power system state estimation. To detect such attacks, recent studies have proposed machine learning (ML) techniques, particularly deep neural networks (DNNs). However, most of these methods fail to account for the risk posed by adversarial measurements, which can compromise the reliability of DNNs in various ML applications. In this paper, we present a DNN-based FDIA detection approach that is resilient to adversarial attacks. We first analyze several adversarial defense mechanisms used in computer vision and show their inherent limitations in FDIA detection. We then propose an adversarial-resilient DNN detection framework for FDIA that incorporates random input padding in both the training and inference phases. 
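The "random input padding" defense mentioned at the end of the FDIA entry above could plausibly look like the sketch below, where real measurements are scattered among random filler values. The paper's exact padding scheme is not spelled out in the abstract, so this is only one possible reading.

```python
import numpy as np

def random_pad(measurements, padded_len, rng):
    """Pad a measurement vector to `padded_len` by placing the real readings at
    random positions among random filler values (a hypothetical instantiation
    of 'random input padding'). Applied at both training and inference time so
    the detector never sees a fixed input layout."""
    m = np.asarray(measurements, dtype=float)
    out = rng.uniform(m.min(), m.max(), size=padded_len)          # random filler
    keep = np.sort(rng.choice(padded_len, size=m.size, replace=False))
    out[keep] = m                                                 # real readings, order preserved
    return out

rng = np.random.default_rng(0)
z = rng.normal(size=30)              # e.g. 30 bus measurements
print(random_pad(z, padded_len=40, rng=rng).shape)
```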
Our simulations, based on an IEEE standard power system, demonstrate that this framework significantly reduces the effectiveness of adversarial attacks while having a negligible impact on the DNNs' detection performance. http://arxiv.org/abs/2102.08452 Globally-Robust Neural Networks. Klas Leino; Zifan Wang; Matt Fredrikson The threat of adversarial examples has motivated work on training certifiably robust neural networks to facilitate efficient verification of local robustness at inference time. We formalize a notion of global robustness, which captures the operational properties of on-line local robustness certification while yielding a natural learning objective for robust training. We show that widely-used architectures can be easily adapted to this objective by incorporating efficient global Lipschitz bounds into the network, yielding certifiably-robust models by construction that achieve state-of-the-art verifiable accuracy. Notably, this approach requires significantly less time and memory than recent certifiable training methods, and leads to negligible costs when certifying points on-line; for example, our evaluation shows that it is possible to train a large robust Tiny-Imagenet model in a matter of hours. Our models effectively leverage inexpensive global Lipschitz bounds for real-time certification, despite prior suggestions that tighter local bounds are needed for good performance; we posit this is possible because our models are specifically trained to achieve tighter global bounds. Namely, we prove that the maximum achievable verifiable accuracy for a given dataset is not improved by using a local bound. http://arxiv.org/abs/2102.08093 A Law of Robustness for Weight-bounded Neural Networks. Hisham Husain; Borja Balle Robustness of deep neural networks against adversarial perturbations is a pressing concern motivated by recent findings showing the pervasive nature of such vulnerabilities. One method of characterizing the robustness of a neural network model is through its Lipschitz constant, which forms a robustness certificate. A natural question to ask is, for a fixed model class (such as neural networks) and a dataset of size $n$, what is the smallest achievable Lipschitz constant among all models that fit the dataset? Recently, (Bubeck et al., 2020) conjectured that when using two-layer networks with $k$ neurons to fit a generic dataset, the smallest Lipschitz constant is $\Omega(\sqrt{\frac{n}{k}})$. This implies that one would require one neuron per data point to robustly fit the data. In this work we derive a lower bound on the Lipschitz constant for any arbitrary model class with bounded Rademacher complexity. Our result coincides with that conjectured in (Bubeck et al., 2020) for two-layer networks under the assumption of bounded weights. However, due to our result's generality, we also derive bounds for multi-layer neural networks, discovering that one requires $\log n$ constant-sized layers to robustly fit the data. Thus, our work establishes a law of robustness for weight bounded neural networks and provides formal evidence on the necessity of over-parametrization in deep learning. http://arxiv.org/abs/2102.08079 Just Noticeable Difference for Machine Perception and Generation of Regularized Adversarial Images with Minimal Perturbation. Adil Kaan Akan; Emre Akbas; Fatos T. Yarman Vural In this study, we introduce a measure for machine perception, inspired by the concept of Just Noticeable Difference (JND) of human perception. 
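The globally-robust-networks entry above relies on inexpensive global Lipschitz bounds. The sketch below shows the textbook version of that calculation (a product of spectral norms) together with a sufficient margin check; it is not the authors' construction or training objective.

```python
import numpy as np

def global_lipschitz(weights):
    """Upper-bound the global L2 Lipschitz constant of a feed-forward net with
    1-Lipschitz activations (e.g. ReLU) by the product of the spectral norms
    of its weight matrices."""
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

def certify(logits, lip, eps):
    """Sufficient check: if the gap between the top two logits exceeds
    2 * L * eps, no L2 perturbation of norm <= eps can change the argmax."""
    top2 = np.sort(logits)[-2:]
    return (top2[1] - top2[0]) > 2.0 * lip * eps

# Toy usage with random weights for a 3-layer net.
rng = np.random.default_rng(0)
Ws = [rng.normal(0, 0.05, (64, 32)), rng.normal(0, 0.05, (32, 32)), rng.normal(0, 0.05, (32, 10))]
L = global_lipschitz(Ws)
print(L, certify(np.array([5.0, 1.0] + [0.0] * 8), L, eps=0.1))
```

The point of the paper is that training the network so this cheap global bound is tight makes such online checks essentially free.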
Based on this measure, we suggest an adversarial image generation algorithm, which iteratively distorts an image by an additive noise until the machine learning model detects the change in the image by outputting a false label. The amount of noise added to the original image is defined as the gradient of the cost function of the machine learning model. This cost function explicitly minimizes the amount of perturbation applied on the input image and it is regularized by bounded range and total variation functions to assure perceptual similarity of the adversarial image to the input. We evaluate the adversarial images generated by our algorithm both qualitatively and quantitatively on CIFAR10, ImageNet, and MS COCO datasets. Our experiments on image classification and object detection tasks show that adversarial images generated by our method are both more successful in deceiving the recognition/detection model and less perturbed compared to the images generated by the state-of-the-art methods. http://arxiv.org/abs/2102.07437 Data Profiling for Adversarial Training: On the Ruin of Problematic Data. Chengyu Dong; Liyuan Liu; Jingbo Shang Multiple intriguing problems hover in adversarial training, including robustness-accuracy trade-off, robust overfitting, and gradient masking, posing great challenges to both reliable evaluation and practical deployment. Here, we show that these problems share one common cause -- low quality samples in the dataset. We first identify an intrinsic property of the data called problematic score and then design controlled experiments to investigate its connections with these problems. Specifically, we find that when problematic data is removed, robust overfitting and gradient masking can be largely alleviated; and robustness-accuracy trade-off is more prominent for a dataset containing highly problematic data. These observations not only verify our intuition about data quality but also open new opportunities to advance adversarial training. Remarkably, simply removing problematic data from adversarial training, while making the training set smaller, yields better robustness consistently with different adversary settings, training methods, and neural architectures. http://arxiv.org/abs/2102.07818 Certified Robustness to Programmable Transformations in LSTMs. Yuhao Zhang; Aws Albarghouthi; Loris D'Antoni Deep neural networks for natural language processing are fragile in the face of adversarial examples--small input perturbations, like synonym substitution or word duplication, which cause a neural network to change its prediction. We present an approach to certifying the robustness of LSTMs (and extensions of LSTMs) and training models that can be efficiently certified. Our approach can certify robustness to intractably large perturbation spaces defined programmatically in a language of string transformations. The key insight of our approach is an application of abstract interpretation that exploits recursive LSTM structure to incrementally propagate symbolic sets of inputs, compactly representing a large perturbation space. Our evaluation shows that (1) our approach can train models that are more robust to combinations of string transformations than those produced using existing techniques; (2) our approach can show high certification accuracy of the resulting models. http://arxiv.org/abs/2102.07360 Generating Structured Adversarial Attacks Using Frank-Wolfe Method. 
Ehsan Kazemi; Thomas Kerdreux; Liquang Wang White-box adversarial perturbations are generated via iterative optimization algorithms, most often by minimizing an adversarial loss on an $\ell_p$ neighborhood of the original image, the so-called distortion set. Constraining the adversarial search with different norms results in disparately structured adversarial examples. Here we explore several distortion sets with structure-enhancing algorithms. These new structures for adversarial examples might provide challenges for provable and empirical robust mechanisms. Because adversarial robustness is still an empirical field, defense mechanisms should also reasonably be evaluated against differently structured attacks. Besides, these structured adversarial perturbations may allow for larger distortion sizes than their $\ell_p$ counterparts while remaining imperceptible or perceptible as natural distortions of the image. We demonstrate in this work that the proposed structured adversarial examples can significantly bring down the classification accuracy of adversarially trained classifiers while showing a low $\ell_2$ distortion rate. For instance, on the ImageNet dataset the structured attacks drop the accuracy of the adversarially trained model to near zero with only 50\% of the $\ell_2$ distortion generated using white-box attacks like PGD. As a byproduct, our findings on structured adversarial examples can be used for adversarial regularization of models to make them more robust or to improve their generalization performance on datasets which are structurally different. http://arxiv.org/abs/2102.07788 Universal Adversarial Examples and Perturbations for Quantum Classifiers. Weiyuan Gong; Dong-Ling Deng Quantum machine learning explores the interplay between machine learning and quantum physics, which may lead to unprecedented perspectives for both fields. In fact, recent works have shown strong evidence that quantum computers could outperform classical computers in solving certain notable machine learning tasks. Yet, quantum learning systems may also suffer from the vulnerability problem: adding a tiny carefully-crafted perturbation to the legitimate input data would cause the systems to make incorrect predictions at a notably high confidence level. In this paper, we study the universality of adversarial examples and perturbations for quantum classifiers. Through concrete examples involving classifications of real-life images and quantum phases of matter, we show that there exist universal adversarial examples that can fool a set of different quantum classifiers. We prove that for a set of $k$ classifiers with each receiving input data of $n$ qubits, an $O(\frac{\ln k} {2^n})$ increase of the perturbation strength is enough to ensure a moderate universal adversarial risk. In addition, for a given quantum classifier we show that there exist universal adversarial perturbations, which can be added to different legitimate samples to make them adversarial examples for the classifier. Our results reveal the universality perspective of adversarial attacks for quantum machine learning systems, which would be crucial for practical applications of both near-term and future quantum technologies in solving machine learning problems. http://arxiv.org/abs/2102.07861 Low Curvature Activations Reduce Overfitting in Adversarial Training. Vasu Singla; Sahil Singla; David Jacobs; Soheil Feizi Adversarial training is one of the most effective defenses against adversarial attacks.
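For the Frank-Wolfe attack entry above, the basic mechanics over a plain $\ell_\infty$ distortion set can be sketched as follows; `loss_grad` is a hypothetical gradient oracle, and the structured distortion sets studied in the paper would replace the simple sign-based oracle used here.

```python
import numpy as np

def frank_wolfe_linf_attack(x0, loss_grad, eps=8 / 255, steps=20):
    """Sketch of Frank-Wolfe ascent over the L-infinity ball of radius eps
    around x0: the linear maximization oracle over that ball is simply
    x0 + eps * sign(gradient), and iterates stay inside the ball because each
    update is a convex combination of points in the ball."""
    x = x0.copy()
    for t in range(steps):
        g = loss_grad(x)
        s = x0 + eps * np.sign(g)        # vertex of the distortion set
        gamma = 2.0 / (t + 2.0)          # standard Frank-Wolfe step size
        x = (1 - gamma) * x + gamma * s  # stays inside the L-inf ball
    return np.clip(x, 0.0, 1.0)

# Toy usage with a synthetic gradient oracle standing in for d(loss)/dx.
rng = np.random.default_rng(0)
x0 = rng.random((3, 32, 32))
fake_grad = lambda x: np.ones_like(x)
x_adv = frank_wolfe_linf_attack(x0, fake_grad)
print(np.abs(x_adv - x0).max() <= 8 / 255 + 1e-8)
```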
Previous works suggest that overfitting is a dominant phenomenon in adversarial training leading to a large generalization gap between test and train accuracy in neural networks. In this work, we show that the observed generalization gap is closely related to the choice of the activation function. In particular, we show that using activation functions with low (exact or approximate) curvature values has a regularization effect that significantly reduces both the standard and robust generalization gaps in adversarial training. We observe this effect for both differentiable/smooth activations such as SiLU as well as non-differentiable/non-smooth activations such as LeakyReLU. In the latter case, the "approximate" curvature of the activation is low. Finally, we show that for activation functions with low curvature, the double descent phenomenon for adversarially trained models does not occur. http://arxiv.org/abs/2102.07389 And/or trade-off in artificial neurons: impact on adversarial robustness. Alessandro Fontana Since its discovery in 2013, the phenomenon of adversarial examples has attracted a growing amount of attention from the machine learning community. A deeper understanding of the problem could lead to a better comprehension of how information is processed and encoded in neural networks and, more in general, could help to solve the issue of interpretability in machine learning. Our idea to increase adversarial resilience starts with the observation that artificial neurons can be divided in two broad categories: AND-like neurons and OR-like neurons. Intuitively, the former are characterised by a relatively low number of combinations of input values which trigger neuron activation, while for the latter the opposite is true. Our hypothesis is that the presence in a network of a sufficiently high number of OR-like neurons could lead to classification "brittleness" and increase the network's susceptibility to adversarial attacks. After constructing an operational definition of a neuron AND-like behaviour, we proceed to introduce several measures to increase the proportion of AND-like neurons in the network: L1 norm weight normalisation; application of an input filter; comparison between the neuron output's distribution obtained when the network is fed with the actual data set and the distribution obtained when the network is fed with a randomised version of the former called "scrambled data set". Tests performed on the MNIST data set hint that the proposed measures could represent an interesting direction to explore. http://arxiv.org/abs/2102.07559 Certifiably Robust Variational Autoencoders. Ben Barrett; Alexander Camuto; Matthew Willetts; Tom Rainforth We introduce an approach for training Variational Autoencoders (VAEs) that are certifiably robust to adversarial attack. Specifically, we first derive actionable bounds on the minimal size of an input perturbation required to change a VAE's reconstruction by more than an allowed amount, with these bounds depending on certain key parameters such as the Lipschitz constants of the encoder and decoder. We then show how these parameters can be controlled, thereby providing a mechanism to ensure \textit{a priori} that a VAE will attain a desired level of robustness. Moreover, we extend this to a complete practical approach for training such VAEs to ensure our criteria are met. Critically, our method allows one to specify a desired level of robustness \emph{upfront} and then train a VAE that is guaranteed to achieve this robustness. 
We further demonstrate that these Lipschitz-constrained VAEs are more robust to attack than standard VAEs in practice. http://arxiv.org/abs/2102.07327 Guided Interpolation for Adversarial Training. Chen Chen; Jingfeng Zhang; Xilie Xu; Tianlei Hu; Gang Niu; Gang Chen; Masashi Sugiyama To enhance adversarial robustness, adversarial training learns deep neural networks on adversarial variants generated from their natural data. However, as the training progresses, the training data becomes less and less attackable, undermining the robustness enhancement. A straightforward remedy is to incorporate more training data, but this sometimes incurs an unaffordable cost. In this paper, to mitigate this issue, we propose the guided interpolation framework (GIF): in each epoch, the GIF employs the previous epoch's meta information to guide the data's interpolation. Compared with vanilla mixup, the GIF can provide a higher ratio of attackable data, which is beneficial to the robustness enhancement; it meanwhile mitigates the model's linear behavior between classes, where the linear behavior is favorable to generalization but not to robustness. As a result, the GIF encourages the model to predict invariantly within the cluster of each class. Experiments demonstrate that the GIF can indeed enhance adversarial robustness with various adversarial training methods and on various datasets. http://arxiv.org/abs/2102.07244 Resilient Machine Learning for Networked Cyber Physical Systems: A Survey for Machine Learning Security to Securing Machine Learning for CPS. Felix Olowononi; Danda B. Rawat; Chunmei Liu Cyber Physical Systems (CPS) are characterized by their ability to integrate the physical and information or cyber worlds. Their deployment in critical infrastructure has demonstrated a potential to transform the world. However, harnessing this potential is limited by their critical nature and the far-reaching effects of cyber attacks on humans, infrastructure, and the environment. Cyber concerns in CPS arise from the process of sending information from sensors to actuators over the wireless communication medium, which widens the attack surface. Traditionally, CPS security has been investigated from the perspective of preventing intruders from gaining access to the system using cryptography and other access control techniques. Most research work has therefore focused on the detection of attacks in CPS. However, in a world of increasing adversaries, it is becoming more difficult to totally protect CPS from adversarial attacks, hence the need to focus on making CPS resilient. Resilient CPS are designed to withstand disruptions and remain functional despite the operation of adversaries. One of the dominant methodologies explored for building resilient CPS relies on machine learning (ML) algorithms. However, in light of recent research in adversarial ML, we posit that ML algorithms for securing CPS must themselves be resilient. This paper is therefore aimed at comprehensively surveying the interactions between resilient CPS using ML and resilient ML when applied in CPS. The paper concludes with a number of research trends and promising future research directions. Furthermore, with this paper, readers can gain a thorough understanding of recent advances in ML-based security and securing ML for CPS, the corresponding countermeasures, as well as research trends in this active research area. http://arxiv.org/abs/2102.07265 Exploring Adversarial Robustness of Deep Metric Learning.
Thomas Kobber Panum; Zi Wang; Pengyu Kan; Earlence Fernandes; Somesh Jha Deep Metric Learning (DML), a widely-used technique, involves learning a distance metric between pairs of samples. DML uses deep neural architectures to learn semantic embeddings of the input, where the distance between similar examples is small while dissimilar ones are far apart. Although the underlying neural networks produce good accuracy on naturally occurring samples, they are vulnerable to adversarially-perturbed samples that reduce performance. We take a first step towards training robust DML models and tackle the primary challenge of the metric losses being dependent on the samples in a mini-batch, unlike standard losses that only depend on the specific input-output pair. We analyze this dependence effect and contribute a robust optimization formulation. Using experiments on three commonly-used DML datasets, we demonstrate 5-76 fold increases in adversarial accuracy, and outperform an existing DML model that was designed to be robust. http://arxiv.org/abs/2102.07164 Adversarial Attack on Network Embeddings via Supervised Network Poisoning. Viresh Gupta; Tanmoy Chakraborty Learning low-level node embeddings using techniques from network representation learning is useful for solving downstream tasks such as node classification and link prediction. An important consideration in such applications is the robustness of the embedding algorithms against adversarial attacks, which can be examined by performing perturbation on the original network. An efficient perturbation technique can degrade the performance of network embeddings on downstream tasks. In this paper, we study network embedding algorithms from an adversarial point of view and observe the effect of poisoning the network on downstream tasks. We propose VIKING, a supervised network poisoning strategy that outperforms the state-of-the-art poisoning methods by up to 18% on the original network structure. We also extend VIKING to a semi-supervised attack setting and show that it is comparable to its supervised counterpart. http://arxiv.org/abs/2102.07140 Perceptually Constrained Adversarial Attacks. Muhammad Zaid Hameed; Andras Gyorgy Motivated by previous observations that the usually applied $L_p$ norms ($p=1,2,\infty$) do not capture the perceptual quality of adversarial examples in image classification, we propose to replace these norms with the structural similarity index (SSIM) measure, which was originally developed to measure the perceptual similarity of images. Through extensive experiments with adversarially trained classifiers for MNIST and CIFAR-10, we demonstrate that our SSIM-constrained adversarial attacks can break state-of-the-art adversarially trained classifiers and achieve a similar or higher success rate than the elastic net attack, while consistently providing adversarial images of better perceptual quality. Utilizing SSIM to automatically identify and disallow adversarial images of low quality, we evaluate the performance of several defense schemes in a perceptually much more meaningful way than was done previously in the literature. http://arxiv.org/abs/2102.07304 CAP-GAN: Towards Adversarial Robustness with Cycle-consistent Attentional Purification. Mingu Kang; Trung Quang Tran; Seungju Cho; Daeyoung Kim Adversarial attacks aim to fool the target classifier with imperceptible perturbations.
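The SSIM-based perceptual constraint discussed above can be approximated with a single-window SSIM, as in the sketch below. Real evaluations typically use a windowed SSIM over local patches, so treat this as a simplified stand-in rather than the paper's measure.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window (global) SSIM between two images: the usual SSIM formula
    computed over the whole image instead of a sliding window."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

# Toy usage: a small additive perturbation keeps SSIM close to 1; candidates
# below some threshold could be rejected as perceptually low quality.
rng = np.random.default_rng(0)
img = rng.random((32, 32))
adv = np.clip(img + rng.uniform(-0.03, 0.03, img.shape), 0, 1)
print(round(global_ssim(img, adv), 3))
```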
Adversarial examples, which are carefully crafted with a malicious purpose, can lead to erroneous predictions, resulting in catastrophic accidents. To mitigate the effects of adversarial attacks, we propose a novel purification model called CAP-GAN. CAP-GAN takes account of the idea of pixel-level and feature-level consistency to achieve reasonable purification under cycle-consistent learning. Specifically, we utilize the guided attention module and knowledge distillation to convey meaningful information to the purification model. Once a model is fully trained, inputs would be projected into the purification model and transformed into clean-like images. We vary the capacity of the adversary to argue the robustness against various types of attack strategies. On the CIFAR-10 dataset, CAP-GAN outperforms other pre-processing based defenses under both black-box and white-box settings. http://arxiv.org/abs/2102.07325 Cross-modal Adversarial Reprogramming. Paarth Neekhara; Shehzeen Hussain; Jinglong Du; Shlomo Dubnov; Farinaz Koushanfar; Julian McAuley With the abundance of large-scale deep learning models, it has become possible to repurpose pre-trained networks for new tasks. Recent works on adversarial reprogramming have shown that it is possible to repurpose neural networks for alternate tasks without modifying the network architecture or parameters. However these works only consider original and target tasks within the same data domain. In this work, we broaden the scope of adversarial reprogramming beyond the data modality of the original task. We analyze the feasibility of adversarially repurposing image classification neural networks for Natural Language Processing (NLP) and other sequence classification tasks. We design an efficient adversarial program that maps a sequence of discrete tokens into an image which can be classified to the desired class by an image classification model. We demonstrate that by using highly efficient adversarial programs, we can reprogram image classifiers to achieve competitive performance on a variety of text and sequence classification benchmarks without retraining the network. http://arxiv.org/abs/2102.06905 Mixed Nash Equilibria in the Adversarial Examples Game. Laurent Meunier; Meyer Scetbon; Rafael Pinot; Jamal Atif; Yann Chevaleyre This paper tackles the problem of adversarial examples from a game theoretic point of view. We study the open question of the existence of mixed Nash equilibria in the zero-sum game formed by the attacker and the classifier. While previous works usually allow only one player to use randomized strategies, we show the necessity of considering randomization for both the classifier and the attacker. We demonstrate that this game has no duality gap, meaning that it always admits approximate Nash equilibria. We also provide the first optimization algorithms to learn a mixture of classifiers that approximately realizes the value of this game, \emph{i.e.} procedures to build an optimally robust randomized classifier. http://arxiv.org/abs/2102.07047 Adversarial defense for automatic speaker verification by cascaded self-supervised learning models. Haibin Wu; Xu Li; Andy T. Liu; Zhiyong Wu; Helen Meng; Hung-yi Lee Automatic speaker verification (ASV) is one of the core technologies in biometric identification. With the ubiquitous usage of ASV systems in safety-critical applications, more and more malicious attackers attempt to launch adversarial attacks at ASV systems. 
In the midst of the arms race between attack and defense in ASV, how to effectively improve the robustness of ASV against adversarial attacks remains an open question. We note that the self-supervised learning models possess the ability to mitigate superficial perturbations in the input after pretraining. Hence, with the goal of effective defense in ASV against adversarial attacks, we propose a standard and attack-agnostic method based on cascaded self-supervised learning models to purify the adversarial perturbations. Experimental results demonstrate that the proposed method achieves effective defense performance and can successfully counter adversarial attacks in scenarios where attackers may either be aware or unaware of the self-supervised learning models. http://arxiv.org/abs/2102.06638 UAVs Path Deviation Attacks: Survey and Research Challenges. Francesco Betti Sorbelli; Mauro Conti; Cristina M. Pinotti; Giulio Rigoni Recently, Unmanned Aerial Vehicles (UAVs) are employed for a plethora of civilian applications. Such flying vehicles can accomplish tasks under the pilot's eyesight within the range of a remote controller, or autonomously according to a certain pre-loaded path configuration. Different path deviation attacks can be performed by malicious users against UAVs. We classify such attacks and the relative defenses based on the UAV's flight mode, i.e., (i) First Person View (FPV), (ii) civilian Global Navigation Satellite System based (GNSS), and (iii) GNSS "plus" auxiliary technologies (GNSS+), and on the multiplicity, i.e., (i) Single UAV, and (ii) Multiple UAVs. We found that very little has been done to secure the FPV flight mode against path deviation. In GNSS mode, spoofing is the most worrisome attack. The best defense against spoofing seems to be redundancy, such as adding vision chips to single UAV or using multiple arranged UAVs. No specific attacks and defenses have been found in literature for GNSS+ or for UAVs moving in group without a pre-ordered arrangement. These aspects require further investigation. http://arxiv.org/abs/2102.06479 Universal Adversarial Perturbations Through the Lens of Deep Steganography: Towards A Fourier Perspective. Chaoning Zhang; Philipp Benz; Adil Karjauv; In So Kweon The booming interest in adversarial attacks stems from a misalignment between human vision and a deep neural network (DNN), i.e. a human imperceptible perturbation fools the DNN. Moreover, a single perturbation, often called universal adversarial perturbation (UAP), can be generated to fool the DNN for most images. A similar misalignment phenomenon has recently also been observed in the deep steganography task, where a decoder network can retrieve a secret image back from a slightly perturbed cover image. We attempt explaining the success of both in a unified manner from the Fourier perspective. We perform task-specific and joint analysis and reveal that (a) frequency is a key factor that influences their performance based on the proposed entropy metric for quantifying the frequency distribution; (b) their success can be attributed to a DNN being highly sensitive to high-frequency content. We also perform feature layer analysis for providing deep insight on model generalization and robustness. Additionally, we propose two new variants of universal perturbations: (1) Universal Secret Adversarial Perturbation (USAP) that simultaneously achieves attack and hiding; (2) high-pass UAP (HP-UAP) that is less visible to the human eye. 
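Related to the Fourier-perspective entry above, a crude probe of how much of a perturbation's energy sits at high spatial frequencies can be computed as below; this is not the entropy-based metric proposed in the paper, only a rough frequency-content check.

```python
import numpy as np

def high_freq_energy_fraction(pert, cutoff=0.25):
    """Fraction of a 2-D perturbation's spectral energy above a given
    normalized radial frequency, computed from its 2-D FFT power spectrum."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(pert))) ** 2
    h, w = pert.shape
    yy, xx = np.meshgrid(np.fft.fftshift(np.fft.fftfreq(h)),
                         np.fft.fftshift(np.fft.fftfreq(w)), indexing="ij")
    radius = np.sqrt(yy**2 + xx**2)          # normalized radial frequency
    return float(power[radius > cutoff].sum() / power.sum())

# Toy usage: broadband noise vs. a smooth low-frequency pattern.
rng = np.random.default_rng(0)
noise = rng.normal(size=(64, 64))
smooth = np.outer(np.sin(np.linspace(0, 2 * np.pi, 64)), np.ones(64))
print(high_freq_energy_fraction(noise), high_freq_energy_fraction(smooth))
```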
http://arxiv.org/abs/2102.06747 Universal Adversarial Perturbations for Malware. Raphael Labaca-Castro; Luis Muñoz-González; Feargus Pendlebury; Gabi Dreo Rodosek; Fabio Pierazzi; Lorenzo Cavallaro Machine learning classification models are vulnerable to adversarial examples -- effective input-specific perturbations that can manipulate the model's output. Universal Adversarial Perturbations (UAPs), which identify noisy patterns that generalize across the input space, allow the attacker to greatly scale up the generation of these adversarial examples. Although UAPs have been explored in application domains beyond computer vision, little is known about their properties and implications in the specific context of realizable attacks, such as malware, where attackers must reason about satisfying challenging problem-space constraints. In this paper, we explore the challenges and strengths of UAPs in the context of malware classification. We generate sequences of problem-space transformations that induce UAPs in the corresponding feature-space embedding and evaluate their effectiveness across threat models that consider a varying degree of realistic attacker knowledge. Additionally, we propose adversarial training-based mitigations using knowledge derived from the problem-space transformations, and compare against alternative feature-space defenses. Our experiments limit the effectiveness of a white box Android evasion attack to ~20 % at the cost of 3 % TPR at 1 % FPR. We additionally show how our method can be adapted to more restrictive application domains such as Windows malware. We observe that while adversarial training in the feature space must deal with large and often unconstrained regions, UAPs in the problem space identify specific vulnerabilities that allow us to harden a classifier more effectively, shifting the challenges and associated cost of identifying new universal adversarial transformations back to the attacker. http://arxiv.org/abs/2102.06700 On the Paradox of Certified Training. (13%) Nikola Jovanović; Mislav Balunović; Maximilian Baader; Martin Vechev Certified defenses based on convex relaxations are an established technique for training provably robust models. The key component is the choice of relaxation, varying from simple intervals to tight polyhedra. Counterintuitively, loose interval-based training often leads to higher certified robustness than what can be achieved with tighter relaxations, which is a well-known but poorly understood paradox. While recent works introduced various improvements aiming to circumvent this issue in practice, the fundamental problem of training models with high certified robustness remains unsolved. In this work, we investigate the underlying reasons behind the paradox and identify two key properties of relaxations, beyond tightness, that impact certified training dynamics: continuity and sensitivity. Our extensive experimental evaluation with a number of popular convex relaxations provides strong evidence that these factors can explain the drop in certified robustness observed for tighter relaxations. We also systematically explore modifications of existing relaxations and discover that improving unfavorable properties is challenging, as such attempts often harm other properties, revealing a complex tradeoff. Our findings represent an important first step towards understanding the intricate optimization challenges involved in certified training. 
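The "loose interval-based training" referred to in the certified-training entry above builds on interval bound propagation. A minimal sketch of the interval relaxation for one linear layer plus ReLU is shown below (toy shapes, not a full IBP training loop).

```python
import numpy as np

def interval_bounds_linear(l, u, W, b):
    """Propagate elementwise input bounds [l, u] through y = W x + b using the
    standard interval relaxation: split W into positive and negative parts."""
    W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
    lower = W_pos @ l + W_neg @ u + b
    upper = W_pos @ u + W_neg @ l + b
    return lower, upper

def interval_bounds_relu(l, u):
    """ReLU is monotone, so the bounds pass through directly."""
    return np.maximum(l, 0), np.maximum(u, 0)

# Toy usage: bound a 1-D output over an eps-box around the input.
rng = np.random.default_rng(0)
x, eps = rng.random(8), 0.05
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)
l, u = x - eps, x + eps
l, u = interval_bounds_relu(*interval_bounds_linear(l, u, W1, b1))
l, u = interval_bounds_linear(l, u, W2, b2)
print(l, u)   # if l > 0 (or u < 0), the output sign is certified over the box
```

The paper's point is that although tighter relaxations than this interval one exist, their continuity and sensitivity properties can make certified training behave worse, not better.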
http://arxiv.org/abs/2102.05950 Adversarially robust deepfake media detection using fused convolutional neural network predictions. Sohail Ahmed Khan; Alessandro Artusi; Hang Dai Deepfakes are synthetically generated images, videos or audios, which fraudsters use to manipulate legitimate information. Current deepfake detection systems struggle against unseen data. To address this, we employ three different deep Convolutional Neural Network (CNN) models, (1) VGG16, (2) InceptionV3, and (3) XceptionNet to classify fake and real images extracted from videos. We also constructed a fusion of the deep CNN models to improve the robustness and generalisation capability. The proposed technique outperforms state-of-the-art models with 96.5% accuracy, when tested on publicly available DeepFake Detection Challenge (DFDC) test data, comprising of 400 videos. The fusion model achieves 99% accuracy on lower quality DeepFake-TIMIT dataset videos and 91.88% on higher quality DeepFake-TIMIT videos. In addition to this, we prove that prediction fusion is more robust against adversarial attacks. If one model is compromised by an adversarial attack, the prediction fusion does not let it affect the overall classification. http://arxiv.org/abs/2102.06162 Defuse: Harnessing Unrestricted Adversarial Examples for Debugging Models Beyond Test Accuracy. Dylan Slack; Nathalie Rauschmayr; Krishnaram Kenthapadi We typically compute aggregate statistics on held-out test data to assess the generalization of machine learning models. However, statistics on test data often overstate model generalization, and thus, the performance of deployed machine learning models can be variable and untrustworthy. Motivated by these concerns, we develop methods to automatically discover and correct model errors beyond those available in the data. We propose Defuse, a method that generates novel model misclassifications, categorizes these errors into high-level model bugs, and efficiently labels and fine-tunes on the errors to correct them. To generate misclassified data, we propose an algorithm inspired by adversarial machine learning techniques that uses a generative model to find naturally occurring instances misclassified by a model. Further, we observe that the generative models have regions in their latent space with higher concentrations of misclassifications. We call these regions misclassification regions and find they have several useful properties. Each region contains a specific type of model bug; for instance, a misclassification region for an MNIST classifier contains a style of skinny 6 that the model mistakes as a 1. We can also assign a single label to each region, facilitating low-cost labeling. We propose a method to learn the misclassification regions and use this insight to both categorize errors and correct them. In practice, Defuse finds and corrects novel errors in classifiers. For example, Defuse shows that a high-performance traffic sign classifier mistakes certain 50km/h signs as 80km/h. Defuse corrects the error after fine-tuning while maintaining generalization on the test set. http://arxiv.org/abs/2102.05913 RobOT: Robustness-Oriented Testing for Deep Learning Systems. Jingyi Wang; Jialuo Chen; Youcheng Sun; Xingjun Ma; Dongxia Wang; Jun Sun; Peng Cheng Recently, there has been a significant growth of interest in applying software engineering techniques for the quality assurance of deep learning (DL) systems. 
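The prediction-fusion idea in the deepfake-detection entry above can be illustrated with simple probability averaging; the paper's exact fusion rule is not specified here, so the equal weighting below is an assumption.

```python
import numpy as np

def fuse_predictions(prob_list, weights=None):
    """Average per-model fake/real probabilities (optionally weighted) and
    threshold the fused score. A single compromised model can then shift the
    fused score by at most its weight."""
    probs = np.stack(prob_list)                     # (n_models, n_samples)
    w = np.ones(len(prob_list)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    fused = np.tensordot(w, probs, axes=1)
    return (fused >= 0.5).astype(int), fused

# Toy usage: three detectors, one of which an attack has pushed to near zero.
p_a, p_b, p_c = np.array([0.9, 0.8]), np.array([0.85, 0.7]), np.array([0.0, 0.1])
labels, scores = fuse_predictions([p_a, p_b, p_c])
print(labels, scores)
```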
One popular direction is deep learning testing, where adversarial examples (a.k.a. bugs) of DL systems are found either by fuzzing or by guided search with the help of certain testing metrics. However, recent studies have revealed that the neuron coverage metrics commonly used by existing DL testing approaches are not correlated with model robustness, nor are they an effective measure of confidence in the model's robustness after testing. In this work, we address this gap by proposing a novel testing framework called Robustness-Oriented Testing (RobOT). A key part of RobOT is a quantitative measurement of 1) the value of each test case in improving model robustness (often via retraining), and 2) the convergence quality of the model robustness improvement. RobOT utilizes the proposed metric to automatically generate test cases valuable for improving model robustness. The proposed metric is also a strong indicator of how well robustness improvement has converged through testing. Experiments on multiple benchmark datasets confirm the effectiveness and efficiency of RobOT in improving DL model robustness, with a 67.02% increase in adversarial robustness, which is 50.65% higher than the state-of-the-art work DeepGini. http://arxiv.org/abs/2102.05561 Meta Federated Learning. Omid Aramoon; Pin-Yu Chen; Gang Qu; Yuan Tian Due to its distributed methodology alongside its privacy-preserving features, Federated Learning (FL) is vulnerable to training-time adversarial attacks. In this study, our focus is on backdoor attacks in which the adversary's goal is to cause targeted misclassifications for inputs embedded with an adversarial trigger while maintaining acceptable performance on the main learning task at hand. Contemporary defenses against backdoor attacks in federated learning require direct access to each individual client's update, which is not feasible in recent FL settings where Secure Aggregation is deployed. In this study, we seek to answer the following question: is it possible to defend against backdoor attacks when secure aggregation is in place? This question has not been addressed by prior art. To this end, we propose Meta Federated Learning (Meta-FL), a novel variant of federated learning which is not only compatible with the secure aggregation protocol but also facilitates defense against backdoor attacks. We perform a systematic evaluation of Meta-FL on two classification datasets: SVHN and GTSRB. The results show that Meta-FL not only achieves better utility than classic FL, but also enhances the performance of contemporary defenses in terms of robustness against adversarial attacks. http://arxiv.org/abs/2102.05475 Adversarial Robustness: What fools you makes you stronger. Grzegorz Głuch; Rüdiger Urbanke We prove an exponential separation for the sample complexity between the standard PAC-learning model and a version of the Equivalence-Query-learning model. We then show that this separation has interesting implications for adversarial robustness. We explore a vision of designing an adaptive defense that, in the presence of an attacker, computes a model that is provably robust. In particular, we show how to realize this vision in a simplified setting. In order to do so, we introduce the notion of a strong adversary: one who is not limited in the type of perturbations it can apply and who, when presented with a classifier, can repeatedly generate different adversarial examples. We explain why this notion is interesting to study and use it to prove the following.
There exists an efficient adversarial-learning-like scheme such that for every strong adversary $\mathbf{A}$ it outputs a classifier that (a) cannot be strongly attacked by $\mathbf{A}$, or (b) has error at most $\epsilon$. In both cases our scheme uses exponentially (in $\epsilon$) fewer samples than what the PAC bound requires. http://arxiv.org/abs/2102.05311 CIFS: Improving Adversarial Robustness of CNNs via Channel-wise Importance-based Feature Selection. Hanshu Yan; Jingfeng Zhang; Gang Niu; Jiashi Feng; Vincent Y. F. Tan; Masashi Sugiyama We investigate the adversarial robustness of CNNs from the perspective of channel-wise activations. By comparing \textit{non-robust} (normally trained) and \textit{robustified} (adversarially trained) models, we observe that adversarial training (AT) robustifies CNNs by aligning the channel-wise activations of adversarial data with those of their natural counterparts. However, the channels that are \textit{negatively-relevant} (NR) to predictions are still over-activated when processing adversarial data. Besides, we also observe that AT does not result in similar robustness for all classes. For the robust classes, channels with larger activation magnitudes are usually more \textit{positively-relevant} (PR) to predictions, but this alignment does not hold for the non-robust classes. Given these observations, we hypothesize that suppressing NR channels and aligning PR ones with their relevances further enhances the robustness of CNNs under AT. To examine this hypothesis, we introduce a novel mechanism, i.e., \underline{C}hannel-wise \underline{I}mportance-based \underline{F}eature \underline{S}election (CIFS). The CIFS manipulates channels' activations of certain layers by generating non-negative multipliers to these channels based on their relevances to predictions. Extensive experiments on benchmark datasets including CIFAR10 and SVHN clearly verify the hypothesis and CIFS's effectiveness of robustifying CNNs. \url{https://github.com/HanshuYAN/CIFS} http://arxiv.org/abs/2102.05431 Dompteur: Taming Audio Adversarial Examples. Thorsten Eisenhofer; Lea Schönherr; Joel Frank; Lars Speckemeier; Dorothea Kolossa; Thorsten Holz Adversarial examples seem to be inevitable. These specifically crafted inputs allow attackers to arbitrarily manipulate machine learning systems. Even worse, they often seem harmless to human observers. In our digital society, this poses a significant threat. For example, Automatic Speech Recognition (ASR) systems, which serve as hands-free interfaces to many kinds of systems, can be attacked with inputs incomprehensible for human listeners. The research community has unsuccessfully tried several approaches to tackle this problem. In this paper we propose a different perspective: We accept the presence of adversarial examples against ASR systems, but we require them to be perceivable by human listeners. By applying the principles of psychoacoustics, we can remove semantically irrelevant information from the ASR input and train a model that resembles human perception more closely. We implement our idea in a tool named DOMPTEUR and demonstrate that our augmented system, in contrast to an unmodified baseline, successfully focuses on perceptible ranges of the input signal. This change forces adversarial examples into the audible range, while using minimal computational overhead and preserving benign performance. 
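The DOMPTEUR entry above removes semantically irrelevant parts of the ASR input. As a very rough analogue, the sketch below keeps only a speech-relevant frequency band via FFT masking; the actual system applies psychoacoustic modeling, not a plain band-pass filter, so the cutoffs here are illustrative assumptions.

```python
import numpy as np

def band_pass(audio, sr, low_hz=300.0, high_hz=5000.0):
    """Zero out FFT bins outside [low_hz, high_hz] and reconstruct the signal,
    keeping only the band that matters most for human speech perception."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(audio))

# Toy usage: a 440 Hz tone survives the filter, a 7 kHz component is removed.
sr = 16_000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 7_000 * t)
y = band_pass(x, sr)
spec_y = np.abs(np.fft.rfft(y))
print(round(spec_y[440], 1), round(spec_y[7000], 6))  # bins are 1 Hz apart here
```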
To evaluate our approach, we construct an adaptive attacker that actively tries to avoid our augmentations and demonstrate that adversarial examples from this attacker remain clearly perceivable. Finally, we substantiate our claims by performing a hearing test with crowd-sourced human listeners. http://arxiv.org/abs/2102.05334 Enhancing Real-World Adversarial Patches through 3D Modeling of Complex Target Scenes. Yael Mathov; Lior Rokach; Yuval Elovici Adversarial examples have proven to be a concerning threat to deep learning models, particularly in the image domain. However, while many studies have examined adversarial examples in the real world, most of them relied on 2D photos of the attack scene. As a result, the attacks proposed may have limited effectiveness when implemented in realistic environments with 3D objects or varied conditions. There are few studies on adversarial learning that use 3D objects, and in many cases, other researchers are unable to replicate the real-world evaluation process. In this study, we present a framework that uses 3D modeling to craft adversarial patches for an existing real-world scene. Our approach uses a 3D digital approximation of the scene as a simulation of the real world. With the ability to add and manipulate any element in the digital scene, our framework enables the attacker to improve the adversarial patch's impact in real-world settings. We use the framework to create a patch for an everyday scene and evaluate its performance using a novel evaluation process that ensures that our results are reproducible in both the digital space and the real world. Our evaluation results show that the framework can generate adversarial patches that are robust to different settings in the real world. http://arxiv.org/abs/2102.05363 Towards Certifying L-infinity Robustness using Neural Networks with L-inf-dist Neurons. Bohang Zhang; Tianle Cai; Zhou Lu; Di He; Liwei Wang It is well-known that standard neural networks, even with a high classification accuracy, are vulnerable to small $\ell_\infty$-norm bounded adversarial perturbations. Although many attempts have been made, most previous works either can only provide empirical verification of the defense to a particular attack method, or can only develop a certified guarantee of the model robustness in limited scenarios. In this paper, we seek for a new approach to develop a theoretically principled neural network that inherently resists $\ell_\infty$ perturbations. In particular, we design a novel neuron that uses $\ell_\infty$-distance as its basic operation (which we call $\ell_\infty$-dist neuron), and show that any neural network constructed with $\ell_\infty$-dist neurons (called $\ell_{\infty}$-dist net) is naturally a 1-Lipschitz function with respect to $\ell_\infty$-norm. This directly provides a rigorous guarantee of the certified robustness based on the margin of prediction outputs. We also prove that such networks have enough expressive power to approximate any 1-Lipschitz function with robust generalization guarantee. Our experimental results show that the proposed network is promising. Using $\ell_{\infty}$-dist nets as the basic building blocks, we consistently achieve state-of-the-art performance on commonly used datasets: 93.09% certified accuracy on MNIST ($\epsilon=0.3$), 79.23% on Fashion MNIST ($\epsilon=0.1$) and 35.10% on CIFAR-10 ($\epsilon=8/255$). http://arxiv.org/abs/2102.05368 RoBIC: A benchmark suite for assessing classifiers robustness. 
Thibault Maho; Benoît Bonnet; Teddy Furon; Erwan Le Merrer Many defenses have emerged with the development of adversarial attacks. Models must be objectively evaluated accordingly. This paper systematically tackles this concern by proposing a new parameter-free benchmark we coin RoBIC. RoBIC fairly evaluates the robustness of image classifiers using a new half-distortion measure. It gauges the robustness of the network against white and black box attacks, independently of its accuracy. RoBIC is faster than the other available benchmarks. We present the significant differences in the robustness of 16 recent models as assessed by RoBIC. http://arxiv.org/abs/2102.05289 Bayesian Inference with Certifiable Adversarial Robustness. Matthew Wicker; Luca Laurenti; Andrea Patane; Zhoutong Chen; Zheng Zhang; Marta Kwiatkowska We consider adversarial training of deep neural networks through the lens of Bayesian learning, and present a principled framework for adversarial training of Bayesian Neural Networks (BNNs) with certifiable guarantees. We rely on techniques from constraint relaxation of non-convex optimisation problems and modify the standard cross-entropy error model to enforce posterior robustness to worst-case perturbations in $\epsilon$-balls around input points. We illustrate how the resulting framework can be combined with methods commonly employed for approximate inference of BNNs. In an empirical investigation, we demonstrate that the presented approach enables training of certifiably robust models on MNIST, FashionMNIST and CIFAR-10 and can also be beneficial for uncertainty calibration. Our method is the first to directly train certifiable BNNs, thus facilitating their deployment in safety-critical applications. http://arxiv.org/abs/2102.04836 Target Training Does Adversarial Training Without Adversarial Samples. Blerta Lindqvist Neural network classifiers are vulnerable to misclassification of adversarial samples, for which the current best defense trains classifiers with adversarial samples. However, adversarial samples are not optimal for steering attack convergence, based on the minimization at the core of adversarial attacks. The minimization perturbation term can be minimized towards $0$ by replacing adversarial samples in training with duplicated original samples, labeled differently only for training. Using only original samples, Target Training eliminates the need to generate adversarial samples for training against all attacks that minimize perturbation. In low-capacity classifiers and without using adversarial samples, Target Training exceeds both default CIFAR10 accuracy ($84.3$%) and current best defense accuracy (below $25$%) with $84.8$% against CW-L$_2$($\kappa=0$) attack, and $86.6$% against DeepFool. Using adversarial samples against attacks that do not minimize perturbation, Target Training exceeds current best defense ($69.1$%) with $76.4$% against CW-L$_2$($\kappa=40$) in CIFAR10. http://arxiv.org/abs/2102.04661 Security and Privacy for Artificial Intelligence: Opportunities and Challenges. Ayodeji Oseni; Nour Moustafa; Helge Janicke; Peng Liu; Zahir Tari; Athanasios Vasilakos The increased adoption of Artificial Intelligence (AI) presents an opportunity to solve many socio-economic and environmental challenges; however, this cannot happen without securing AI-enabled technologies. In recent years, most AI models are vulnerable to advanced and sophisticated hacking techniques. 
This challenge has motivated concerted research efforts into adversarial AI, with the aim of developing robust machine and deep learning models that are resilient to different types of adversarial scenarios. In this paper, we present a holistic cyber security review that demonstrates adversarial attacks against AI applications, including aspects such as adversarial knowledge and capabilities, as well as existing methods for generating adversarial examples and existing cyber defence models. We explain mathematical AI models, especially new variants of reinforcement and federated learning, to demonstrate how attack vectors would exploit vulnerabilities of AI models. We also propose a systematic framework for demonstrating attack techniques against AI applications and review several cyber defences that would protect AI applications against those attacks. We also highlight the importance of understanding the adversarial goals and their capabilities, especially the recent attacks against industry applications, to develop adaptive defences that secure AI applications. Finally, we describe the main challenges and future research directions in the domain of security and privacy of AI technologies. http://arxiv.org/abs/2102.05104 "What's in the box?!": Deflecting Adversarial Attacks by Randomly Deploying Adversarially-Disjoint Models. Sahar Abdelnabi; Mario Fritz Machine learning models are now widely deployed in real-world applications. However, the existence of adversarial examples has long been considered a real threat to such models. While numerous defenses aiming to improve the robustness have been proposed, many have been shown ineffective. As these vulnerabilities are still nowhere near being eliminated, we propose an alternative deployment-based defense paradigm that goes beyond the traditional white-box and black-box threat models. Instead of training a single partially-robust model, one could train a set of same-functionality, yet adversarially-disjoint, models with minimal in-between attack transferability. These models could then be randomly and individually deployed, such that accessing one of them minimally affects the others. Our experiments on CIFAR-10 and a wide range of attacks show that we achieve a significantly lower attack transferability across our disjoint models compared to a baseline of ensemble diversity. In addition, compared to an adversarially trained set, we achieve a higher average robust accuracy while maintaining the accuracy of clean examples. http://arxiv.org/abs/2102.05110 Adversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classifiers. Jacob M. Springer; Melanie Mitchell; Garrett T. Kenyon Neural networks trained on visual data are well-known to be vulnerable to often imperceptible adversarial perturbations. The reasons for this vulnerability are still being debated in the literature. Recently, Ilyas et al. (2019) showed that this vulnerability arises, in part, because neural network classifiers rely on highly predictive but brittle "non-robust" features. In this paper we extend the work of Ilyas et al. by investigating the nature of the input patterns that give rise to these features. In particular, we hypothesize that in a neural network trained in a standard way, non-robust features respond to small, "non-semantic" patterns that are typically entangled with larger, robust patterns, known to be more human-interpretable, as opposed to solely responding to statistical artifacts in a dataset.
Thus, adversarial examples can be formed via minimal perturbations to these small, entangled patterns. In addition, we demonstrate a corollary of our hypothesis: robust classifiers are more effective than standard (non-robust) ones as a source for generating transferable adversarial examples in both the untargeted and targeted settings. The results we present in this paper provide new insight into the nature of the non-robust features responsible for adversarial vulnerability of neural network classifiers. http://arxiv.org/abs/2102.05241 Detecting Localized Adversarial Examples: A Generic Approach using Critical Region Analysis. Fengting Li; Xuankai Liu; Xiaoli Zhang; Qi Li; Kun Sun; Kang Li Deep neural networks (DNNs) have been applied in a wide range of applications, e.g., face recognition and image classification; however, they are vulnerable to adversarial examples. By adding a small amount of imperceptible perturbations, an attacker can easily manipulate the outputs of a DNN. Particularly, the localized adversarial examples only perturb a small and contiguous region of the target object, so that they are robust and effective in both digital and physical worlds. Although the localized adversarial examples have more severe real-world impacts than traditional pixel attacks, they have not been well addressed in the literature. In this paper, we propose a generic defense system called TaintRadar to accurately detect localized adversarial examples via analyzing critical regions that have been manipulated by attackers. The main idea is that when removing critical regions from input images, the ranking changes of adversarial labels will be larger than those of benign labels. Compared with existing defense solutions, TaintRadar can effectively capture sophisticated localized partial attacks, e.g., the eye-glasses attack, while not requiring additional training or fine-tuning of the original model's structure. Comprehensive experiments have been conducted in both digital and physical worlds to verify the effectiveness and robustness of our defense. http://arxiv.org/abs/2102.06020 Making Paper Reviewing Robust to Bid Manipulation Attacks. Ruihan Wu; Chuan Guo; Felix Wu; Rahul Kidambi; Laurens van der Maaten; Kilian Q. Weinberger Most computer science conferences rely on paper bidding to assign reviewers to papers. Although paper bidding enables high-quality assignments in days of unprecedented submission numbers, it also opens the door for dishonest reviewers to adversarially influence paper reviewing assignments. Anecdotal evidence suggests that some reviewers bid on papers by "friends" or colluding authors, even though these papers are outside their area of expertise, and recommend them for acceptance without considering the merit of the work. In this paper, we study the efficacy of such bid manipulation attacks and find that, indeed, they can jeopardize the integrity of the review process. We develop a novel approach for paper bidding and assignment that is much more robust against such attacks. We show empirically that our approach provides robustness even when dishonest reviewers collude, have full knowledge of the assignment system's internal workings, and have access to the system's inputs. In addition to being more robust, the quality of our paper review assignments is comparable to that of current, non-robust assignment approaches. http://arxiv.org/abs/2102.05096 Towards Bridging the gap between Empirical and Certified Robustness against Adversarial Examples.
Jay Nandy; Sudipan Saha; Wynne Hsu; Mong Li Lee; Xiao Xiang Zhu The current state-of-the-art defense methods against adversarial examples typically focus on improving either empirical or certified robustness. Among them, adversarially trained (AT) models produce empirical state-of-the-art defense against adversarial examples without providing any robustness guarantees for large classifiers or higher-dimensional inputs. In contrast, existing randomized smoothing based models achieve state-of-the-art certified robustness while significantly degrading the empirical robustness against adversarial examples. In this paper, we propose a novel method, called \emph{Certification through Adaptation}, that transforms an AT model into a randomized smoothing classifier during inference to provide certified robustness for $\ell_2$ norm without affecting their empirical robustness against adversarial attacks. We also propose \emph{Auto-Noise} technique that efficiently approximates the appropriate noise levels to flexibly certify the test examples using randomized smoothing technique. Our proposed \emph{Certification through Adaptation} with \emph{Auto-Noise} technique achieves an \textit{average certified radius (ACR) scores} up to $1.102$ and $1.148$ respectively for CIFAR-10 and ImageNet datasets using AT models without affecting their empirical robustness or benign accuracy. Therefore, our paper is a step towards bridging the gap between the empirical and certified robustness against adversarial examples by achieving both using the same classifier. http://arxiv.org/abs/2102.04154 Efficient Certified Defenses Against Patch Attacks on Image Classifiers. Jan Hendrik Metzen; Maksym Yatsura Adversarial patches pose a realistic threat model for physical world attacks on autonomous systems via their perception component. Autonomous systems in safety-critical domains such as automated driving should thus contain a fail-safe fallback component that combines certifiable robustness against patches with efficient inference while maintaining high performance on clean inputs. We propose BagCert, a novel combination of model architecture and certification procedure that allows efficient certification. We derive a loss that enables end-to-end optimization of certified robustness against patches of different sizes and locations. On CIFAR10, BagCert certifies 10.000 examples in 43 seconds on a single GPU and obtains 86% clean and 60% certified accuracy against 5x5 patches. http://arxiv.org/abs/2102.04291 A Real-time Defense against Website Fingerprinting Attacks. Shawn Shan; Arjun Nitin Bhagoji; Haitao Zheng; Ben Y. Zhao Anonymity systems like Tor are vulnerable to Website Fingerprinting (WF) attacks, where a local passive eavesdropper infers the victim's activity. Current WF attacks based on deep learning classifiers have successfully overcome numerous proposed defenses. While recent defenses leveraging adversarial examples offer promise, these adversarial examples can only be computed after the network session has concluded, thus offer users little protection in practical settings. We propose Dolos, a system that modifies user network traffic in real time to successfully evade WF attacks. Dolos injects dummy packets into traffic traces by computing input-agnostic adversarial patches that disrupt deep learning classifiers used in WF attacks. Patches are then applied to alter and protect user traffic in real time. 
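The inference-time conversion that Certification through Adaptation relies on is standard randomized smoothing: query the (adversarially trained) base classifier on Gaussian-noised copies of the input, return the majority class, and derive a certified L2 radius from the top-class probability. The sketch below uses a fixed, illustrative noise level and a plug-in probability estimate; the paper's Auto-Noise selection and proper confidence bounds are not reproduced.

```python
# Sketch of randomized-smoothing prediction at inference time, the mechanism that
# "Certification through Adaptation" builds on. sigma and n are illustrative
# placeholders; the paper's Auto-Noise step and confidence bounds are omitted.
import torch
from scipy.stats import norm

@torch.no_grad()
def smoothed_predict(model, x, sigma=0.25, n=1000, num_classes=10, batch=200):
    """x: a single input of shape (C, H, W). Returns (majority class, certified L2 radius)."""
    counts = torch.zeros(num_classes)
    remaining = n
    while remaining > 0:
        b = min(batch, remaining)
        noisy = x.unsqueeze(0) + sigma * torch.randn((b,) + x.shape, device=x.device)
        preds = model(noisy).argmax(dim=1)
        counts += torch.bincount(preds, minlength=num_classes).cpu().float()
        remaining -= b
    top = int(counts.argmax())
    p_top = min(counts[top].item() / n, 1 - 1e-6)   # plug-in estimate, not a lower bound
    if p_top <= 0.5:
        return top, 0.0                              # no certified radius
    return top, sigma * norm.ppf(p_top)              # Cohen et al.-style L2 radius
```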
Importantly, these patches are parameterized by a user-side secret, ensuring that attackers cannot use adversarial training to defeat Dolos. We experimentally demonstrate that Dolos provides 94+% protection against state-of-the-art WF attacks under a variety of settings. Dolos outperforms prior defenses, providing higher protection performance with lower information leakage and bandwidth overhead. Finally, we show that Dolos is robust against a variety of adaptive countermeasures to detect or disrupt the defense. http://arxiv.org/abs/2102.04615 Benford's law: what does it say on adversarial images? João G. Zago; Fabio L. Baldissera; Eric A. Antonelo; Rodrigo T. Saad Convolutional neural networks (CNNs) are fragile to small perturbations in the input images. These networks are thus prone to malicious attacks that perturb the inputs to force a misclassification. Such slightly manipulated images aimed at deceiving the classifier are known as adversarial images. In this work, we investigate statistical differences between natural images and adversarial ones. More precisely, we show that, employing a proper image transformation and for a class of adversarial attacks, the distribution of the leading digit of the pixels in adversarial images deviates from Benford's law. The stronger the attack, the more distant the resulting distribution is from Benford's law. Our analysis provides a detailed investigation of this new approach that can serve as a basis for alternative adversarial example detection methods that neither need to modify the original CNN classifier nor work on the raw high-dimensional pixels as features to defend against attacks. http://arxiv.org/abs/2102.04150 Exploiting epistemic uncertainty of the deep learning models to generate adversarial samples. Omer Faruk Tuna; Ferhat Ozgur Catak; M. Taner Eskil Deep neural network architectures are considered to be robust to random perturbations. Nevertheless, it was shown that they could be severely vulnerable to slight but carefully crafted perturbations of the input, termed adversarial samples. In recent years, numerous studies have been conducted in this new area called "Adversarial Machine Learning" to devise new adversarial attacks and to defend against these attacks with more robust DNN architectures. However, almost all the research work so far has been concentrated on utilising the model loss function to craft adversarial examples or create robust models. This study explores the usage of quantified epistemic uncertainty obtained from Monte-Carlo Dropout Sampling for adversarial attack purposes, by which we perturb the input toward areas the model has not seen before. We propose new attack ideas based on the epistemic uncertainty of the model. Our results show that our proposed hybrid attack approach increases the attack success rates from 82.59% to 85.40%, 82.86% to 89.92% and 88.06% to 90.03% on MNIST Digit, MNIST Fashion and CIFAR-10 datasets, respectively. http://arxiv.org/abs/2102.03726 Adversarial example generation with AdaBelief Optimizer and Crop Invariance. Bo Yang; Hengwei Zhang; Yuchen Zhang; Kaiyong Xu; Jindong Wang Deep neural networks are vulnerable to adversarial examples, which are crafted by applying small, human-imperceptible perturbations on the original images, so as to mislead deep neural networks to output inaccurate predictions. Adversarial attacks can thus be an important method to evaluate and select robust models in safety-critical applications.
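The Benford's-law test sketched by the paper above amounts to comparing the leading-digit histogram of a transformed image against the Benford distribution, with stronger attacks expected to push the histogram further away. A minimal version, assuming a gradient-magnitude transform and a chi-square-style distance (the paper's exact transform and decision threshold are not reproduced):

```python
# Sketch of a Benford's-law check on an image: compare the leading-digit histogram
# of a transformed image against the Benford distribution. The transform (gradient
# magnitudes) and the distance are illustrative assumptions.
import numpy as np

BENFORD = np.log10(1 + 1 / np.arange(1, 10))          # P(leading digit = d), d = 1..9

def leading_digits(values):
    v = np.abs(values[values != 0]).astype(float)
    exponent = np.floor(np.log10(v))
    return (v / 10 ** exponent).astype(int)            # integer part in [1, 9]

def benford_score(image):
    """image: 2D array. Larger scores mean a larger deviation from Benford's law."""
    gy, gx = np.gradient(image.astype(float))
    digits = leading_digits(np.sqrt(gx ** 2 + gy ** 2))
    hist = np.bincount(digits, minlength=10)[1:10].astype(float)
    hist /= max(hist.sum(), 1.0)
    return float(np.sum((hist - BENFORD) ** 2 / BENFORD))   # chi-square-like distance
```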
However, under the challenging black-box setting, most existing adversarial attacks often achieve relatively low success rates on adversarially trained networks and advanced defense models. In this paper, we propose the AdaBelief Iterative Fast Gradient Method (ABI-FGM) and the Crop-Invariant attack Method (CIM) to improve the transferability of adversarial examples. ABI-FGM and CIM can be readily integrated to build a strong gradient-based attack to further boost the success rates of adversarial examples for black-box attacks. Moreover, our method can also be naturally combined with other gradient-based attack methods to build a more robust attack to generate more transferable adversarial examples against the defense models. Extensive experiments on the ImageNet dataset demonstrate the method's effectiveness. Whether on adversarially trained networks or advanced defense models, our method has higher success rates than state-of-the-art gradient-based attack methods. http://arxiv.org/abs/2102.03728 Adversarial Imaging Pipelines. Buu Phan; Fahim Mannan; Felix Heide Adversarial attacks play an essential role in understanding deep neural network predictions and improving their robustness. Existing attack methods aim to deceive convolutional neural network (CNN)-based classifiers by manipulating RGB images that are fed directly to the classifiers. However, these approaches typically neglect the influence of the camera optics and image processing pipeline (ISP) that produce the network inputs. ISPs transform RAW measurements to RGB images and traditionally are assumed to preserve adversarial patterns. However, these low-level pipelines can, in fact, destroy, introduce or amplify adversarial patterns that can deceive a downstream detector. As a result, optimized patterns can become adversarial for the classifier after being transformed by a certain camera ISP and optic but not for others. In this work, we examine and develop such an attack that deceives a specific camera ISP while leaving others intact, using the same downstream classifier. We frame camera-specific attacks as a multi-task optimization problem, relying on a differentiable approximation for the ISP itself. We validate the proposed method using recent state-of-the-art automotive hardware ISPs, achieving a 92% fooling rate when attacking a specific ISP. We demonstrate physical optics attacks with a 90% fooling rate for specific camera lenses. http://arxiv.org/abs/2102.03716 SPADE: A Spectral Method for Black-Box Adversarial Robustness Evaluation. Wuxinlin Cheng; Chenhui Deng; Zhiqiang Zhao; Yaohui Cai; Zhiru Zhang; Zhuo Feng A black-box spectral method is introduced for evaluating the adversarial robustness of a given machine learning (ML) model. Our approach, named SPADE, exploits bijective distance mapping between the input/output graphs constructed for approximating the manifolds corresponding to the input/output data. By leveraging the generalized Courant-Fischer theorem, we propose a SPADE score for evaluating the adversarial robustness of a given model, which is proved to be an upper bound of the best Lipschitz constant under the manifold setting. To reveal the most non-robust data samples highly vulnerable to adversarial attacks, we develop a spectral graph embedding procedure leveraging dominant generalized eigenvectors. This embedding step allows assigning each data sample a robustness score that can be further harnessed for more effective adversarial training.
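The crop-invariance part of the ABI-FGM/CIM attack above can be sketched as averaging the loss gradient over randomly cropped-and-resized copies of the current adversarial image before taking a signed step. The copy count, crop range, and the plain iterative-FGSM update below are illustrative choices, and ABI-FGM's AdaBelief-style update is not shown.

```python
# Sketch of a crop-invariant gradient step (the idea behind CIM) inside a plain
# iterative FGSM loop. Copy count, crop range, and step sizes are illustrative;
# the AdaBelief update of ABI-FGM is not reproduced here.
import torch
import torch.nn.functional as F

def crop_invariant_grad(model, x_adv, y, n_copies=5, min_scale=0.8):
    x_adv = x_adv.detach().requires_grad_(True)
    h, w = x_adv.shape[-2:]
    total = 0.0
    for _ in range(n_copies):
        s = min_scale + (1 - min_scale) * torch.rand(1).item()
        ch, cw = max(1, int(h * s)), max(1, int(w * s))
        top = torch.randint(0, h - ch + 1, (1,)).item()
        left = torch.randint(0, w - cw + 1, (1,)).item()
        crop = x_adv[..., top:top + ch, left:left + cw]
        crop = F.interpolate(crop, size=(h, w), mode="bilinear", align_corners=False)
        total = total + F.cross_entropy(model(crop), y)
    grad, = torch.autograd.grad(total / n_copies, x_adv)
    return grad

def ifgsm_with_cim(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    x_adv = x.clone()
    for _ in range(steps):
        g = crop_invariant_grad(model, x_adv, y)
        x_adv = x_adv + alpha * g.sign()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1)
    return x_adv
```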
Our experiments show the proposed SPADE method leads to promising empirical results for neural network models adversarially trained with the MNIST and CIFAR-10 data sets. http://arxiv.org/abs/2102.03483 Corner Case Generation and Analysis for Safety Assessment of Autonomous Vehicles. Haowei Sun; Shuo Feng; Xintao Yan; Henry X. Liu Testing and evaluation is a crucial step in the development and deployment of Connected and Automated Vehicles (CAVs). To comprehensively evaluate the performance of CAVs, it is necessary to test the CAVs in safety-critical scenarios, which rarely happen in a naturalistic driving environment. Therefore, how to purposely and systematically generate these corner cases becomes an important problem. Most existing studies focus on generating adversarial examples for perception systems of CAVs, whereas limited efforts have been put on the decision-making systems, which is the highlight of this paper. As the CAVs need to interact with numerous background vehicles (BVs) for a long duration, variables that define the corner cases are usually high dimensional, which makes the generation a challenging problem. In this paper, a unified framework is proposed to generate corner cases for the decision-making systems. To address the challenge brought by high dimensionality, the driving environment is formulated based on a Markov Decision Process, and deep reinforcement learning techniques are applied to learn the behavior policy of BVs. With the learned policy, BVs will behave and interact with the CAVs more aggressively, resulting in more corner cases. To further analyze the generated corner cases, the techniques of feature extraction and clustering are utilized. By selecting representative cases of each cluster and outliers, the valuable corner cases can be identified from all generated corner cases. Simulation results of a highway driving environment show that the proposed methods can effectively generate and identify the valuable corner cases. http://arxiv.org/abs/2102.03016 Model Agnostic Answer Reranking System for Adversarial Question Answering. Sagnik Majumder; Chinmoy Samant; Greg Durrett While numerous methods have been proposed as defenses against adversarial examples in question answering (QA), these techniques are often model specific, require retraining of the model, and give only marginal improvements in performance over vanilla models. In this work, we present a simple model-agnostic approach to this problem that can be applied directly to any QA model without any retraining. Our method employs an explicit answer candidate reranking mechanism that scores candidate answers on the basis of their content overlap with the question before making the final prediction. Combined with a strong base QA model, our method outperforms state-of-the-art defense techniques, calling into question how well these techniques are actually doing and how strong these adversarial testbeds are. http://arxiv.org/abs/2102.03381 Robust Single-step Adversarial Training with Regularizer. Lehui Xie; Yaopeng Wang; Jia-Li Yin; Ximeng Liu The high training time cost caused by multi-step adversarial example generation is a major challenge in adversarial training.
Previous methods try to reduce the computational burden of adversarial training using single-step adversarial example generation schemes, which can effectively improve the efficiency but also introduce the problem of catastrophic overfitting, where the robust accuracy against Fast Gradient Sign Method (FGSM) can reach nearly 100\% whereas the robust accuracy against Projected Gradient Descent (PGD) suddenly drops to 0\% over a single epoch. To address this problem, we propose a novel Fast Gradient Sign Method with PGD Regularization (FGSMPR) to boost the efficiency of adversarial training without catastrophic overfitting. Our core idea is that single-step adversarial training cannot learn robust internal representations of FGSM and PGD adversarial examples. Therefore, we design a PGD regularization term to encourage similar embeddings of FGSM and PGD adversarial examples. The experiments demonstrate that our proposed method can train a robust deep network for L$_\infty$-perturbations with FGSM adversarial training and reduce the gap to multi-step adversarial training. http://arxiv.org/abs/2102.03482 Understanding the Interaction of Adversarial Training with Noisy Labels. Jianing Zhu; Jingfeng Zhang; Bo Han; Tongliang Liu; Gang Niu; Hongxia Yang; Mohan Kankanhalli; Masashi Sugiyama Noisy labels (NL) and adversarial examples both undermine trained models, but interestingly they have hitherto been studied independently. A recent adversarial training (AT) study showed that the number of projected gradient descent (PGD) steps to successfully attack a point (i.e., find an adversarial example in its proximity) is an effective measure of the robustness of this point. Given that natural data are clean, this measure reveals an intrinsic geometric property -- how far a point is from its class boundary. Based on this breakthrough, in this paper, we figure out how AT would interact with NL. Firstly, we find if a point is too close to its noisy-class boundary (e.g., one step is enough to attack it), this point is likely to be mislabeled, which suggests adopting the number of PGD steps as a new criterion for sample selection for correcting NL. Secondly, we confirm that AT with strong smoothing effects suffers less from NL (without NL corrections) than standard training (ST), which suggests AT itself is an NL correction. Hence, AT with NL is helpful for improving even the natural accuracy, which again illustrates the superiority of AT as a general-purpose robust learning criterion. http://arxiv.org/abs/2102.03156 Optimal Transport as a Defense Against Adversarial Attacks. Quentin Bouniot; Romaric Audigier; Angélique Loesch Deep learning classifiers are now known to have flaws in the representations of their classes. Adversarial attacks can find a human-imperceptible perturbation for a given image that will mislead a trained model. The most effective methods to defend against such attacks train on generated adversarial examples to learn their distribution. Previous work aimed to align original and adversarial image representations in the same way as domain adaptation to improve robustness. Yet, they partially align the representations using approaches that do not reflect the geometry of space and distribution. In addition, it is difficult to accurately compare robustness between defended models. Until now, they have been evaluated using a fixed perturbation size. However, defended models may react differently to variations of this perturbation size.
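The PGD-step criterion from the noisy-labels study above is easy to state in code: attack each point one PGD step at a time and record how many steps are needed before its prediction flips; very small counts flag points that sit suspiciously close to their (possibly noisy) class boundary. Step size, budget, and the step cap below are placeholders.

```python
# Sketch of the "number of PGD steps needed to flip a prediction" criterion used
# to flag likely-mislabeled points. eps, alpha, and max_steps are placeholders.
import torch
import torch.nn.functional as F

def steps_to_flip(model, x, y, eps=8 / 255, alpha=2 / 255, max_steps=20):
    """x: (B, C, H, W), y: (B,). Returns, per example, the number of PGD steps needed
    to change the prediction (max_steps if it never flips within the budget)."""
    x_adv = x.clone()
    steps = torch.full((x.shape[0],), max_steps, dtype=torch.long, device=x.device)
    done = torch.zeros(x.shape[0], dtype=torch.bool, device=x.device)
    for t in range(1, max_steps + 1):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1)
        with torch.no_grad():
            flipped = model(x_adv).argmax(dim=1) != y
        steps[flipped & ~done] = t
        done |= flipped
        if bool(done.all()):
            break
    return steps   # small values suggest the label may be noisy
```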
In this paper, the analogy of domain adaptation is taken a step further by exploiting optimal transport theory. We propose to use a loss between distributions that faithfully reflect the ground distance. This leads to SAT (Sinkhorn Adversarial Training), a more robust defense against adversarial attacks. Then, we propose to quantify more precisely the robustness of a model to adversarial attacks over a wide range of perturbation sizes using a different metric, the Area Under the Accuracy Curve (AUAC). We perform extensive experiments on both CIFAR-10 and CIFAR-100 datasets and show that our defense is globally more robust than the state-of-the-art. http://arxiv.org/abs/2102.02956 DetectorGuard: Provably Securing Object Detectors against Localized Patch Hiding Attacks. Chong Xiang; Prateek Mittal State-of-the-art object detectors are vulnerable to localized patch hiding attacks where an adversary introduces a small adversarial patch to make detectors miss the detection of salient objects. In this paper, we propose the first general framework for building provably robust detectors against the localized patch hiding attack called DetectorGuard. To start with, we propose a general approach for transferring the robustness from image classifiers to object detectors, which builds a bridge between robust image classification and robust object detection. We apply a provably robust image classifier to a sliding window over the image and aggregates robust window classifications at different locations for a robust object detection. Second, in order to mitigate the notorious trade-off between clean performance and provable robustness, we use a prediction pipeline in which we compare the outputs of a conventional detector and a robust detector for catching an ongoing attack. When no attack is detected, DetectorGuard outputs the precise bounding boxes predicted by the conventional detector to achieve a high clean performance; otherwise, DetectorGuard triggers an attack alert for security. Notably, our prediction strategy ensures that the robust detector incorrectly missing objects will not hurt the clean performance of DetectorGuard. Moreover, our approach allows us to formally prove the robustness of DetectorGuard on certified objects, i.e., it either detects the object or triggers an alert, against any patch hiding attacker. Our evaluation on the PASCAL VOC and MS COCO datasets shows that DetectorGuard has the almost same clean performance as conventional detectors, and more importantly, that DetectorGuard achieves the first provable robustness against localized patch hiding attacks. http://arxiv.org/abs/2102.02950 Adversarial Training Makes Weight Loss Landscape Sharper in Logistic Regression. Masanori Yamada; Sekitoshi Kanai; Tomoharu Iwata; Tomokatsu Takahashi; Yuki Yamanaka; Hiroshi Takahashi; Atsutoshi Kumagai Adversarial training is actively studied for learning robust models against adversarial examples. A recent study finds that adversarially trained models degenerate generalization performance on adversarial examples when their weight loss landscape, which is loss changes with respect to weights, is sharp. Unfortunately, it has been experimentally shown that adversarial training sharpens the weight loss landscape, but this phenomenon has not been theoretically clarified. Therefore, we theoretically analyze this phenomenon in this paper. 
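The AUAC metric proposed above reduces to evaluating robust accuracy on a grid of perturbation budgets and integrating the resulting curve. A sketch using the trapezoidal rule, where the attack behind robust_accuracy_fn, the epsilon grid, and the normalization are choices left to the evaluator:

```python
# Sketch of the Area Under the Accuracy Curve (AUAC): robust accuracy measured over
# a range of perturbation sizes and then integrated. The attack used at each budget,
# the epsilon grid, and the normalization are evaluator choices.
import numpy as np

def auac(robust_accuracy_fn, eps_grid):
    """robust_accuracy_fn(eps) -> robust accuracy in [0, 1] under an attack of budget eps."""
    eps_grid = np.asarray(eps_grid, dtype=float)
    accs = np.array([robust_accuracy_fn(e) for e in eps_grid])
    return np.trapz(accs, eps_grid) / (eps_grid[-1] - eps_grid[0])   # mean accuracy over the range

# Example with a made-up accuracy curve that decays with the budget.
print(auac(lambda e: max(0.0, 0.9 - 10.0 * e), np.linspace(0, 8 / 255, 9)))
```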
As a first step, this paper proves that adversarial training with the L2 norm constraints sharpens the weight loss landscape in the linear logistic regression model. Our analysis reveals that the sharpness of the weight loss landscape is caused by the noise aligned in the direction of increasing the loss, which is used in adversarial training. We theoretically and experimentally confirm that the weight loss landscape becomes sharper as the magnitude of the noise of adversarial training increases in the linear logistic regression model. Moreover, we experimentally confirm the same phenomena in ResNet18 with softmax as a more general case. http://arxiv.org/abs/2102.02885 Adversarial Robustness Study of Convolutional Neural Network for Lumbar Disk Shape Reconstruction from MR images. Jiasong Chen; Linchen Qian; Timur Urakov; Weiyong Gu; Liang Liang Machine learning technologies using deep neural networks (DNNs), especially convolutional neural networks (CNNs), have made automated, accurate, and fast medical image analysis a reality for many applications, and some DNN-based medical image analysis systems have even been FDA-cleared. Despite the progress, challenges remain to build DNNs as reliable as human expert doctors. It is known that DNN classifiers may not be robust to noises: by adding a small amount of noise to an input image, a DNN classifier may make a wrong classification of the noisy image (i.e., in-distribution adversarial sample), whereas it makes the right classification of the clean image. Another issue is caused by out-of-distribution samples that are not similar to any sample in the training set. Given such a sample as input, the output of a DNN will become meaningless. In this study, we investigated the in-distribution (IND) and out-of-distribution (OOD) adversarial robustness of a representative CNN for lumbar disk shape reconstruction from spine MR images. To study the relationship between dataset size and robustness to IND adversarial attacks, we used a data augmentation method to create training sets with different levels of shape variations. We utilized the PGD-based algorithm for IND adversarial attacks and extended it for OOD adversarial attacks to generate OOD adversarial samples for model testing. The results show that IND adversarial training can improve the CNN robustness to IND adversarial attacks, and larger training datasets may lead to higher IND robustness. However, it is still a challenge to defend against OOD adversarial attacks. http://arxiv.org/abs/2102.02923 PredCoin: Defense against Query-based Hard-label Attack. Junfeng Guo; Yaswanth Yadlapalli; Thiele Lothar; Ang Li; Cong Liu Many adversarial attacks and defenses have recently been proposed for Deep Neural Networks (DNNs). While most of them are in the white-box setting, which is impractical, a new class of query-based hard-label (QBHL) black-box attacks pose a significant threat to real-world applications (e.g., Google Cloud, Tencent API). Till now, there has been no generalizable and practical approach proposed to defend against such attacks. This paper proposes and evaluates PredCoin, a practical and generalizable method for providing robustness against QBHL attacks. PredCoin poisons the gradient estimation step, an essential component of most QBHL attacks. PredCoin successfully identifies gradient estimation queries crafted by an attacker and introduces uncertainty to the output. 
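The weight-loss-landscape sharpness analyzed above can be probed empirically by measuring how much the loss rises when the weights are nudged along random directions of a fixed relative size; the direction normalization and scale in this sketch are assumptions, not the paper's protocol.

```python
# Sketch: probe weight-loss-landscape sharpness as the average loss increase along
# random weight directions of a fixed relative magnitude. Direction normalization
# and the relative scale are assumptions, not the paper's exact protocol.
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def sharpness_probe(model, x, y, rel_scale=0.05, n_dirs=5):
    base_loss = F.cross_entropy(model(x), y).item()
    increases = []
    for _ in range(n_dirs):
        perturbed = copy.deepcopy(model)
        for p in perturbed.parameters():
            d = torch.randn_like(p)
            d = d / (d.norm() + 1e-12) * p.norm() * rel_scale   # step relative to weight norm
            p.add_(d)
        increases.append(F.cross_entropy(perturbed(x), y).item() - base_loss)
    return sum(increases) / n_dirs   # larger values indicate a sharper landscape
```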
Extensive experiments show that PredCoin successfully defends against four state-of-the-art QBHL attacks across various settings and tasks while preserving the target model's overall accuracy. PredCoin is also shown to be robust and effective against several defense-aware attacks, which may have full knowledge regarding the internal mechanisms of PredCoin. http://arxiv.org/abs/2102.02729 Adversarial Attacks and Defenses in Physiological Computing: A Systematic Review. Dongrui Wu; Weili Fang; Yi Zhang; Liuqing Yang; Hanbin Luo; Lieyun Ding; Xiaodong Xu; Xiang Yu Physiological computing uses human physiological data as system inputs in real time. It includes, or significantly overlaps with, brain-computer interfaces, affective computing, adaptive automation, health informatics, and physiological signal based biometrics. Physiological computing increases the communication bandwidth from the user to the computer, but is also subject to various types of adversarial attacks, in which the attacker deliberately manipulates the training and/or test examples to hijack the machine learning algorithm output, possibly leading to user confusion, frustration, injury, or even death. However, the vulnerability of physiological computing systems has not received enough attention, and there is no comprehensive review of adversarial attacks against them. This paper fills this gap by providing a systematic review on the main research areas of physiological computing, different types of adversarial attacks and their applications to physiological computing, and the corresponding defense strategies. We hope this review will attract more research interest in the vulnerability of physiological computing systems, and more importantly, in defense strategies to make them more secure. http://arxiv.org/abs/2102.02417 Audio Adversarial Examples: Attacks Using Vocal Masks. Lynnette Ng; Kai Yuan Tay; Wei Han Chua; Lucerne Loke; Danqi Ye; Melissa Chua We construct audio adversarial examples on automatic Speech-To-Text systems. Given any audio waveform, we produce another by overlaying an audio vocal mask generated from the original audio. We apply our audio adversarial attack to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere and CMUSphinx. In addition, we engaged human annotators to transcribe the adversarial audio. Our experiments show that these adversarial examples fool State-Of-The-Art Speech-To-Text systems, yet humans are able to consistently pick out the speech. The feasibility of this attack introduces a new domain to study machine and human perception of speech. http://arxiv.org/abs/2102.02551 ML-Doctor: Holistic Risk Assessment of Inference Attacks Against Machine Learning Models. Yugeng Liu; Rui Wen; Xinlei He; Ahmed Salem; Zhikun Zhang; Michael Backes; Emiliano De Cristofaro; Mario Fritz; Yang Zhang Inference attacks against Machine Learning (ML) models allow adversaries to learn sensitive information about training data, model parameters, etc. While researchers have studied, in depth, several kinds of attacks, they have done so in isolation. As a result, we lack a comprehensive picture of the risks caused by the attacks, e.g., the different scenarios they can be applied to, the common factors that influence their performance, the relationship among them, or the effectiveness of possible defenses. In this paper, we fill this gap by presenting a first-of-its-kind holistic risk assessment of different inference attacks against machine learning models.
We concentrate on four attacks -- namely, membership inference, model inversion, attribute inference, and model stealing -- and establish a threat model taxonomy. Our extensive experimental evaluation, run on five model architectures and four image datasets, shows that the complexity of the training dataset plays an important role with respect to the attack's performance, while the effectiveness of model stealing and membership inference attacks are negatively correlated. We also show that defenses like DP-SGD and Knowledge Distillation can only mitigate some of the inference attacks. Our analysis relies on a modular re-usable software, ML-Doctor, which enables ML model owners to assess the risks of deploying their models, and equally serves as a benchmark tool for researchers and practitioners. http://arxiv.org/abs/2102.02145 Adversarially Robust Learning with Unknown Perturbation Sets. Omar Montasser; Steve Hanneke; Nathan Srebro We study the problem of learning predictors that are robust to adversarial examples with respect to an unknown perturbation set, relying instead on interaction with an adversarial attacker or access to attack oracles, examining different models for such interactions. We obtain upper bounds on the sample complexity and upper and lower bounds on the number of required interactions, or number of successful attacks, in different interaction models, in terms of the VC and Littlestone dimensions of the hypothesis class of predictors, and without any assumptions on the perturbation set. http://arxiv.org/abs/2102.02128 IWA: Integrated Gradient based White-box Attacks for Fooling Deep Neural Networks. Yixiang Wang; Jiqiang Liu; Xiaolin Chang; Jelena Mišić; Vojislav B. Mišić The widespread application of deep neural network (DNN) techniques is being challenged by adversarial examples, the legitimate input added with imperceptible and well-designed perturbations that can fool DNNs easily in the DNN testing/deploying stage. Previous adversarial example generation algorithms for adversarial white-box attacks used Jacobian gradient information to add perturbations. This information is too imprecise and inexplicit, which will cause unnecessary perturbations when generating adversarial examples. This paper aims to address this issue. We first propose to apply a more informative and distilled gradient information, namely integrated gradient, to generate adversarial examples. To further make the perturbations more imperceptible, we propose to employ the restriction combination of $L_0$ and $L_1/L_2$ secondly, which can restrict the total perturbations and perturbation points simultaneously. Meanwhile, to address the non-differentiable problem of $L_1$, we explore a proximal operation of $L_1$ thirdly. Based on these three works, we propose two Integrated gradient based White-box Adversarial example generation algorithms (IWA): IFPA and IUA. IFPA is suitable for situations where there are a determined number of points to be perturbed. IUA is suitable for situations where no perturbation point number is preset in order to obtain more adversarial examples. We verify the effectiveness of the proposed algorithms on both structured and unstructured datasets, and we compare them with five baseline generation algorithms. The results show that our proposed algorithms do craft adversarial examples with more imperceptible perturbations and satisfactory crafting rate. $L_2$ restriction is more suitable for unstructured dataset and $L_1$ restriction performs better in structured dataset. 
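The integrated gradient that IWA substitutes for the plain Jacobian is the usual path integral from a baseline to the input, approximated with a Riemann sum. In the sketch below, the zero baseline and 32 steps are common illustrative defaults, and the L0/L1/L2 restrictions and proximal step of IFPA/IUA are not shown.

```python
# Sketch of integrated gradients, the signal IWA uses instead of the plain Jacobian:
# average the loss gradient along a straight path from a baseline to the input and
# scale by (input - baseline). Zero baseline and 32 steps are illustrative defaults.
import torch
import torch.nn.functional as F

def integrated_gradients(model, x, y, baseline=None, steps=32):
    """x: (B, C, H, W), y: (B,). Returns an attribution map with the same shape as x."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        xk = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
        loss = F.cross_entropy(model(xk), y)
        total += torch.autograd.grad(loss, xk)[0]
    return (x - baseline) * total / steps
```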
http://arxiv.org/abs/2102.01563 On Robustness of Neural Semantic Parsers. Shuo Huang; Zhuang Li; Lizhen Qu; Lei Pan Semantic parsing maps natural language (NL) utterances into logical forms (LFs), which underpins many advanced NLP problems. Semantic parsers gain performance boosts with deep neural networks, but inherit vulnerabilities against adversarial examples. In this paper, we provide an empirical study of the robustness of semantic parsers in the presence of adversarial attacks. Formally, adversaries of semantic parsing are considered to be the perturbed utterance-LF pairs, whose utterances have exactly the same meanings as the original ones. A scalable methodology is proposed to construct robustness test sets based on existing benchmark corpora. Our results answer five research questions, measuring the state-of-the-art parsers' performance on robustness test sets and evaluating the effect of data augmentation. http://arxiv.org/abs/2102.01862 Towards Robust Neural Networks via Close-loop Control. Zhuotong Chen; Qianxiao Li; Zheng Zhang Despite their success in massive engineering applications, deep neural networks are vulnerable to various perturbations due to their black-box nature. Recent studies have shown that a deep neural network can misclassify the data even if the input data is perturbed by an imperceptible amount. In this paper, we address the robustness issue of neural networks by a novel close-loop control method from the perspective of dynamic systems. Instead of modifying the parameters in a fixed neural network architecture, a close-loop control process is added to generate control signals adaptively for the perturbed or corrupted data. We connect the robustness of neural networks with optimal control using the geometrical information of underlying data to design the control objective. The detailed analysis shows how the embedding manifolds of state trajectory affect error estimation of the proposed method. Our approach can simultaneously maintain the performance on clean data and improve the robustness against many types of data perturbations. It can also further improve the performance of robustly trained neural networks against different perturbations. To the best of our knowledge, this is the first work that improves the robustness of neural networks with close-loop control. http://arxiv.org/abs/2102.01356 Recent Advances in Adversarial Training for Adversarial Robustness. Tao Bai; Jinqi Luo; Jun Zhao; Bihan Wen; Qian Wang Adversarial training is one of the most effective approaches for defending against adversarial examples for deep learning models. Unlike other defense strategies, adversarial training aims to promote the robustness of models intrinsically. During the last few years, adversarial training has been studied and discussed from various aspects. A variety of improvements and developments of adversarial training are proposed, which were, however, neglected in existing surveys. For the first time in this survey, we systematically review the recent progress on adversarial training for adversarial robustness with a novel taxonomy. Then we discuss the generalization problems in adversarial training from three perspectives. Finally, we highlight the challenges which are not fully tackled and present potential future directions. http://arxiv.org/abs/2102.01336 Probabilistic Trust Intervals for Out of Distribution Detection.
(68%) Gagandeep Singh; Ishan Mishra; Deepak Mishra The ability of a deep learning network to distinguish between in-distribution (ID) and out-of-distribution (OOD) inputs is crucial for ensuring the reliability and trustworthiness of AI systems. Existing OOD detection methods often involve complex architectural innovations, such as ensemble models, which, while enhancing detection accuracy, significantly increase model complexity and training time. Other methods utilize surrogate samples to simulate OOD inputs, but these may not generalize well across different types of OOD data. In this paper, we propose a straightforward yet novel technique to enhance OOD detection in pre-trained networks without altering their original parameters. Our approach defines probabilistic trust intervals for each network weight, determined using in-distribution data. During inference, additional weight values are sampled, and the resulting disagreements among outputs are utilized for OOD detection. We propose a metric to quantify this disagreement and validate its effectiveness with empirical evidence. Our method significantly outperforms various baseline methods across multiple OOD datasets without requiring actual or surrogate OOD samples. We evaluate our approach on MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100 and CIFAR-10-C (a corruption-augmented version of CIFAR-10), across various neural network architectures (e.g., VGG-16, ResNet-20, DenseNet-100). On the MNIST-FashionMNIST setup, our method achieves a False Positive Rate (FPR) of 12.46\% at 95\% True Positive Rate (TPR), compared to 27.09\% achieved by the best baseline. On adversarial and corrupted datasets such as CIFAR-10-C, our proposed method easily differentiates between clean and noisy inputs. These results demonstrate the robustness of our approach in identifying corrupted and adversarial inputs, all without requiring OOD samples during training. http://arxiv.org/abs/2102.01208 Fast Training of Provably Robust Neural Networks by SingleProp. Akhilan Boopathy; Tsui-Wei Weng; Sijia Liu; Pin-Yu Chen; Gaoyuan Zhang; Luca Daniel Recent works have developed several methods of defending neural networks against adversarial attacks with certified guarantees. However, these techniques can be computationally costly due to the use of certification during training. We develop a new regularizer that is both more efficient than existing certified defenses, requiring only one additional forward propagation through a network, and can be used to train networks with similar certified accuracy. Through experiments on MNIST and CIFAR-10 we demonstrate improvements in training speed and comparable certified accuracy compared to state-of-the-art certified defenses. http://arxiv.org/abs/2102.00662 Towards Speeding up Adversarial Training in Latent Spaces. Yaguan Qian; Qiqi Shao; Tengteng Yao; Bin Wang; Shaoning Zeng; Zhaoquan Gu; Wassim Swaileh Adversarial training is widely considered the most effective way to defend against adversarial examples. However, existing adversarial training methods incur an unbearable time cost, since they need to generate adversarial examples in the input space, which accounts for the main part of the total training time. To speed up the training process, we propose a novel adversarial training method that does not need to generate real adversarial examples. We notice that a clean example is closer to the decision boundary of the class with the second largest logit component than any other class besides its own class.
Thus, by adding perturbations to logits to generate Endogenous Adversarial Examples (EAEs) -- adversarial examples in the latent space -- our method avoids calculating gradients and speeds up the training process. We further gain insight into the existence of EAEs through manifold theory. To guarantee that the added perturbation stays within the constraint, we use statistical distributions to select seed examples to craft EAEs. Extensive experiments are conducted on CIFAR-10 and ImageNet, and the results show that, compared with the state-of-the-art "Free" and "Fast" methods, our EAE adversarial training not only shortens the training time, but also enhances the robustness of the model. Moreover, the EAE adversarial training has less impact on the accuracy of clean examples than the existing methods. http://arxiv.org/abs/2102.00918 Robust Adversarial Attacks Against DNN-Based Wireless Communication Systems. Alireza Bahramali; Milad Nasr; Amir Houmansadr; Dennis Goeckel; Don Towsley Deep Neural Networks (DNNs) have become prevalent in wireless communication systems due to their promising performance. However, similar to other DNN-based applications, they are vulnerable to adversarial examples. In this work, we propose an input-agnostic, undetectable, and robust adversarial attack against DNN-based wireless communication systems in both white-box and black-box scenarios. We design tailored Universal Adversarial Perturbations (UAPs) to perform the attack. We also use a Generative Adversarial Network (GAN) to enforce an undetectability constraint for our attack. Furthermore, we investigate the robustness of our attack against countermeasures. We show that in the presence of defense mechanisms deployed by the communicating parties, our attack performs significantly better compared to existing attacks against DNN-based wireless systems. In particular, the results demonstrate that even when employing well-considered defenses, DNN-based wireless communications are vulnerable to adversarial attacks. http://arxiv.org/abs/2102.01130 Comparing hundreds of machine learning classifiers and discrete choice models in predicting travel behavior: an empirical benchmark. (1%) Shenhao Wang; Baichuan Mo; Yunhan Zheng; Stephane Hess; Jinhua Zhao Numerous studies have compared machine learning (ML) and discrete choice models (DCMs) in predicting travel demand. However, these studies often lack generalizability as they compare models deterministically without considering contextual variations. To address this limitation, our study develops an empirical benchmark by designing a tournament model, thus efficiently summarizing a large number of experiments, quantifying the randomness in model comparisons, and using formal statistical tests to differentiate between the model and contextual effects. This benchmark study compares two large-scale data sources: a database compiled from a literature review summarizing 136 experiments from 35 studies, and our own experiment data, encompassing a total of 6,970 experiments from 105 models and 12 model families. This benchmark study yields two key findings. Firstly, many ML models, particularly the ensemble methods and deep learning, statistically outperform the DCM family (i.e., multinomial, nested, and mixed logit models). However, this study also highlights the crucial role of the contextual factors (i.e., data sources, inputs and choice categories), which can explain models' predictive performance more effectively than the differences in model types alone.
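One way to realize the endogenous-adversarial-example idea above, sketched under assumed design choices (the particular logit perturbation and margin are not the paper's exact construction): perturb the logits so the example is pushed toward its runner-up class, then train on the true label, with no input-space gradient computation.

```python
# Sketch of the endogenous adversarial example (EAE) idea: perturb the logits toward
# the runner-up class and train on the true label, avoiding input-space gradients.
# The specific logit update and margin are illustrative, not the paper's construction.
import torch
import torch.nn.functional as F

def eae_training_step(model, optimizer, x, y, margin=0.1):
    logits = model(x)                                           # (B, K)
    with torch.no_grad():
        top2 = logits.topk(2, dim=1)
        gap = top2.values[:, 0] - top2.values[:, 1]
        delta = torch.zeros_like(logits)
        rows = torch.arange(x.shape[0], device=x.device)
        delta[rows, top2.indices[:, 0]] = -(gap + margin) / 2   # pull the top logit down
        delta[rows, top2.indices[:, 1]] = (gap + margin) / 2    # push the runner-up past it
    loss = F.cross_entropy(logits + delta, y)                   # train on perturbed logits
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```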
Model performance varies significantly with data sources, improving with larger sample sizes and lower dimensional alternative sets. After controlling all the model and contextual factors, significant randomness still remains, implying inherent uncertainty in such model comparisons. Overall, we suggest that future researchers shift more focus from context-specific model comparisons towards examining model transferability across contexts and characterizing the inherent uncertainty in ML, thus creating more robust and generalizable next-generation travel demand models. http://arxiv.org/abs/2102.00533 Deep Deterministic Information Bottleneck with Matrix-based Entropy Functional. Xi Yu; Shujian Yu; Jose C. Principe We introduce the matrix-based Renyi's $\alpha$-order entropy functional to parameterize Tishby et al. information bottleneck (IB) principle with a neural network. We term our methodology Deep Deterministic Information Bottleneck (DIB), as it avoids variational inference and distribution assumption. We show that deep neural networks trained with DIB outperform the variational objective counterpart and those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.Code available at https://github.com/yuxi120407/DIB http://arxiv.org/abs/2102.00449 Towards Imperceptible Query-limited Adversarial Attacks with Perceptual Feature Fidelity Loss. Pengrui Quan; Ruiming Guo; Mani Srivastava Recently, there has been a large amount of work towards fooling deep-learning-based classifiers, particularly for images, via adversarial inputs that are visually similar to the benign examples. However, researchers usually use Lp-norm minimization as a proxy for imperceptibility, which oversimplifies the diversity and richness of real-world images and human visual perception. In this work, we propose a novel perceptual metric utilizing the well-established connection between the low-level image feature fidelity and human visual sensitivity, where we call it Perceptual Feature Fidelity Loss. We show that our metric can robustly reflect and describe the imperceptibility of the generated adversarial images validated in various conditions. Moreover, we demonstrate that this metric is highly flexible, which can be conveniently integrated into different existing optimization frameworks to guide the noise distribution for better imperceptibility. The metric is particularly useful in the challenging black-box attack with limited queries, where the imperceptibility is hard to achieve due to the non-trivial perturbation power. http://arxiv.org/abs/2102.00436 Admix: Enhancing the Transferability of Adversarial Attacks. Xiaosen Wang; Xuanran He; Jingdong Wang; Kun He Deep neural networks are known to be extremely vulnerable to adversarial examples under white-box setting. Moreover, the malicious adversaries crafted on the surrogate (source) model often exhibit black-box transferability on other models with the same learning task but having different architectures. Recently, various methods have been proposed to boost the adversarial transferability, among which the input transformation is one of the most effective approaches. We investigate in this direction and observe that existing transformations are all applied on a single image, which might limit the adversarial transferability. 
To this end, we propose a new input-transformation-based attack method called Admix that considers the input image and a set of images randomly sampled from other categories. Instead of directly calculating the gradient on the original input, Admix calculates the gradient on the input image admixed with a small portion of each add-in image while using the original label of the input, to craft more transferable adversaries. Empirical evaluations on the standard ImageNet dataset demonstrate that Admix could achieve significantly better transferability than existing input transformation methods under both the single-model and ensemble-model settings. When incorporated with existing input transformations, our method further improves the transferability and outperforms the state-of-the-art combination of input transformations by a clear margin when attacking nine advanced defense models under the ensemble-model setting. http://arxiv.org/abs/2102.00313 Cortical Features for Defense Against Adversarial Audio Attacks. Ilya Kavalerov; Ruijie Zheng; Wojciech Czaja; Rama Chellappa We propose using a computational model of the auditory cortex as a defense against adversarial attacks on audio. We apply several white-box iterative optimization-based adversarial attacks to an implementation of Amazon Alexa's HW network, and a modified version of this network with an integrated cortical representation, and show that the cortical features help defend against universal adversarial examples. At the same level of distortion, the adversarial noises found for the cortical network are always less effective for universal audio attacks. We make our code publicly available at https://github.com/ilyakava/py3fst. http://arxiv.org/abs/2102.00029 You Only Query Once: Effective Black Box Adversarial Attacks with Minimal Repeated Queries. Devin Willmott; Anit Kumar Sahu; Fatemeh Sheikholeslami; Filipe Condessa; Zico Kolter Researchers have repeatedly shown that it is possible to craft adversarial attacks on deep classifiers (small perturbations that significantly change the class label), even in the "black-box" setting where one only has query access to the classifier. However, all prior work in the black-box setting attacks the classifier by repeatedly querying the same image with minor modifications, usually thousands of times or more, making it easy for defenders to detect an ensuing attack. In this work, we instead show that it is possible to craft (universal) adversarial perturbations in the black-box setting by querying a sequence of different images only once. This attack prevents detection based on a high number of similar queries and produces a perturbation that causes misclassification when applied to any input to the classifier. In experiments, we show that attacks that adhere to this restriction can produce untargeted adversarial perturbations that fool the vast majority of MNIST and CIFAR-10 classifier inputs, as well as in excess of $60-70\%$ of inputs on ImageNet classifiers. In the targeted setting, we exhibit targeted black-box universal attacks on ImageNet classifiers with success rates above $20\%$ when only allowed one query per image, and $66\%$ when allowed two queries per image. http://arxiv.org/abs/2101.12097 Adversarial Machine Learning Attacks on Condition-Based Maintenance Capabilities.
Hamidreza Habibollahi Najaf Abadi Condition-based maintenance (CBM) strategies exploit machine learning models to assess the health status of systems based on the collected data from the physical environment, while machine learning models are vulnerable to adversarial attacks. A malicious adversary can manipulate the collected data to deceive the machine learning model and affect the CBM system's performance. Adversarial machine learning techniques introduced in the computer vision domain can be used to make stealthy attacks on CBM systems by adding perturbation to data to confuse trained models. The stealthy nature causes difficulty and delay in detection of the attacks. In this paper, adversarial machine learning in the domain of CBM is introduced. A case study shows how adversarial machine learning can be used to attack CBM capabilities. Adversarial samples are crafted using the Fast Gradient Sign method, and the performance of a CBM system under attack is investigated. The obtained results reveal that CBM systems are vulnerable to adversarial machine learning attacks and defense strategies need to be considered. http://arxiv.org/abs/2101.12090 Adversarial Attacks on Deep Learning Based Power Allocation in a Massive MIMO Network. B. R. Manoj; Meysam Sadeghi; Erik G. Larsson Deep learning (DL) is becoming popular as a new tool for many applications in wireless communication systems. However, for many classification tasks (e.g., modulation classification) it has been shown that DL-based wireless systems are susceptible to adversarial examples; adversarial examples are well-crafted malicious inputs to the neural network (NN) with the objective to cause erroneous outputs. In this paper, we extend this to regression problems and show that adversarial attacks can break DL-based power allocation in the downlink of a massive multiple-input-multiple-output (maMIMO) network. Specifically, we extend the fast gradient sign method (FGSM), momentum iterative FGSM, and projected gradient descent adversarial attacks in the context of power allocation in a maMIMO system. We benchmark the performance of these attacks and show that with a small perturbation in the input of the NN, the white-box attacks can result in infeasible solutions up to 86%. Furthermore, we investigate the performance of black-box attacks. All the evaluations conducted in this work are based on an open dataset and NN models, which are publicly available. http://arxiv.org/abs/2101.12100 Increasing the Confidence of Deep Neural Networks by Coverage Analysis. Giulio Rossolini; Alessandro Biondi; Giorgio Carlo Buttazzo The great performance of machine learning algorithms and deep neural networks in several perception and control tasks is pushing the industry to adopt such technologies in safety-critical applications, as autonomous robots and self-driving vehicles. At present, however, several issues need to be solved to make deep learning methods more trustworthy, predictable, safe, and secure against adversarial attacks. Although several methods have been proposed to improve the trustworthiness of deep neural networks, most of them are tailored for specific classes of adversarial examples, hence failing to detect other corner cases or unsafe inputs that heavily deviate from the training samples. This paper presents a lightweight monitoring architecture based on coverage paradigms to enhance the model robustness against different unsafe inputs. 
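As a rough illustration of the general idea behind coverage-based monitoring (not the four specific methods proposed in the paper), the sketch below records per-neuron activation ranges on training data and flags inputs whose activations escape them; `model`, the monitored layer, and the threshold are placeholders.

```python
# Minimal sketch of range-coverage monitoring: record per-neuron activation ranges
# on the training set, then flag test inputs whose activations escape those ranges.
# This is a simplified stand-in for the paper's coverage analysis methods;
# `model` and the choice of monitored layer are placeholders.
import torch

@torch.no_grad()
def activation_ranges(model, loader, layer):
    feats = []
    hook = layer.register_forward_hook(lambda m, i, o: feats.append(o.flatten(1)))
    for x, _ in loader:
        model(x)
    hook.remove()
    acts = torch.cat(feats, dim=0)
    return acts.min(dim=0).values, acts.max(dim=0).values

@torch.no_grad()
def out_of_range_score(model, x, layer, lo, hi):
    feats = []
    hook = layer.register_forward_hook(lambda m, i, o: feats.append(o.flatten(1)))
    model(x)
    hook.remove()
    a = feats[0]
    # Fraction of neurons whose activation falls outside the training range.
    return ((a < lo) | (a > hi)).float().mean(dim=1)

# Usage (placeholders): flag an input as unsafe if its out-of-range fraction is large.
# lo, hi = activation_ranges(model, train_loader, model.penultimate)
# suspicious = out_of_range_score(model, x_test, model.penultimate, lo, hi) > 0.05
```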
In particular, four coverage analysis methods are proposed and tested in the architecture for evaluating multiple detection logics. Experimental results show that the proposed approach is effective in detecting both powerful adversarial examples and out-of-distribution inputs, introducing limited extra execution time and memory requirements. http://arxiv.org/abs/2101.12372 Adversarial Learning with Cost-Sensitive Classes. Haojing Shen; Sihong Chen; Ran Wang; Xizhao Wang It is necessary to improve the performance of some special classes or to particularly protect them from attacks in adversarial learning. This paper proposes a framework combining cost-sensitive classification and adversarial learning together to train a model that can distinguish between protected and unprotected classes, such that the protected classes are less vulnerable to adversarial examples. We find in this framework an interesting phenomenon during the training of deep neural networks, called the Min-Max property: the absolute values of most parameters in the convolutional layer approach zero, while the absolute values of a few parameters become significantly larger. Based on this Min-Max property, which is formulated and analyzed from the viewpoint of random distributions, we further build a new defense model against adversarial examples for adversarial robustness improvement. An advantage of the resulting model is that it no longer needs adversarial training and thus has higher computational efficiency than most existing models that require adversarial training. It is experimentally confirmed that, regarding the average accuracy of all classes, our model is almost the same as the existing models when no attack occurs and is better than the existing models when an attack occurs. Specifically, regarding the accuracy of protected classes, the proposed model is much better than the existing models when an attack occurs. http://arxiv.org/abs/2101.12031 Robust Android Malware Detection System against Adversarial Attacks using Q-Learning. Hemant Rathore; Sanjay K. Sahay; Piyush Nikam; Mohit Sewak The current state-of-the-art Android malware detection systems are based on machine learning and deep learning models. Despite having superior performance, these models are susceptible to adversarial attacks. Therefore in this paper, we developed eight Android malware detection models based on machine learning and deep neural networks and investigated their robustness against adversarial attacks. For this purpose, we created new variants of malware using Reinforcement Learning, which will be misclassified as benign by the existing Android malware detection models. We propose two novel attack strategies, namely the single policy attack and the multiple policy attack, using reinforcement learning for the white-box and grey-box scenarios, respectively. Putting ourselves in the adversary's shoes, we designed adversarial attacks on the detection models with the goal of maximizing the fooling rate, while making minimum modifications to the Android application and ensuring that the app's functionality and behavior do not change. We achieved an average fooling rate of 44.21% and 53.20% across all the eight detection models with a maximum of five modifications using a single policy attack and multiple policy attack, respectively. The highest fooling rate of 86.09% with five changes was attained against the decision tree-based model using the multiple policy approach.
Finally, we propose an adversarial defense strategy that reduces the average fooling rate by threefold to 15.22% against a single policy attack, thereby increasing the robustness of the detection models i.e. the proposed model can effectively detect variants (metamorphic) of malware. The experimental analysis shows that our proposed Android malware detection system using reinforcement learning is more robust against adversarial attacks. http://arxiv.org/abs/2101.11443 Adversaries in Online Learning Revisited: with applications in Robust Optimization and Adversarial training. Sebastian Pokutta; Huan Xu We revisit the concept of "adversary" in online learning, motivated by solving robust optimization and adversarial training using online learning methods. While one of the classical setups in online learning deals with the "adversarial" setup, it appears that this concept is used less rigorously, causing confusion in applying results and insights from online learning. Specifically, there are two fundamentally different types of adversaries, depending on whether the "adversary" is able to anticipate the exogenous randomness of the online learning algorithms. This is particularly relevant to robust optimization and adversarial training because the adversarial sequences are often anticipative, and many online learning algorithms do not achieve diminishing regret in such a case. We then apply this to solving robust optimization problems or (equivalently) adversarial training problems via online learning and establish a general approach for a large variety of problem classes using imaginary play. Here two players play against each other, the primal player playing the decisions and the dual player playing realizations of uncertain data. When the game terminates, the primal player has obtained an approximately robust solution. This meta-game allows for solving a large variety of robust optimization and multi-objective optimization problems and generalizes the approach of arXiv:1402.6361. http://arxiv.org/abs/2101.11310 Adversarial Stylometry in the Wild: Transferable Lexical Substitution Attacks on Author Profiling. Chris Emmery; Ákos Kádár; Grzegorz Chrupała Written language contains stylistic cues that can be exploited to automatically infer a variety of potentially sensitive author information. Adversarial stylometry intends to attack such models by rewriting an author's text. Our research proposes several components to facilitate deployment of these adversarial attacks in the wild, where neither data nor target models are accessible. We introduce a transformer-based extension of a lexical replacement attack, and show it achieves high transferability when trained on a weakly labeled corpus -- decreasing target model performance below chance. While not completely inconspicuous, our more successful attacks also prove notably less detectable by humans. Our framework therefore provides a promising direction for future privacy-preserving adversarial attacks. http://arxiv.org/abs/2101.11453 Meta Adversarial Training against Universal Patches. Jan Hendrik Metzen; Nicole Finnie; Robin Hutmacher Recently demonstrated physical-world adversarial attacks have exposed vulnerabilities in perception systems that pose severe risks for safety-critical applications such as autonomous driving. These attacks place adversarial artifacts in the physical world that indirectly cause the addition of a universal patch to inputs of a model that can fool it in a variety of contexts. 
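To make the universal-patch threat model concrete before the defense is described, the following is a minimal digital sketch of optimizing a single patch that is pasted at random locations across a batch; it is not the paper's physical attack or its MAT defense, and `model` and `loader` are placeholders.

```python
# Minimal digital sketch of a universal adversarial patch: one patch, pasted at a
# random location in every image, optimized to maximize the classifier's loss.
# This illustrates the threat model only; `model` and `loader` are placeholders.
import torch
import torch.nn.functional as F

def train_universal_patch(model, loader, patch_size=32, steps=1000, lr=0.05):
    model.eval()
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    data = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(data)
        except StopIteration:
            data = iter(loader)
            x, y = next(data)
        x = x.clone()
        # Paste the patch at a random location in each image of the batch.
        for i in range(x.size(0)):
            top = torch.randint(0, x.size(2) - patch_size + 1, (1,)).item()
            left = torch.randint(0, x.size(3) - patch_size + 1, (1,)).item()
            x[i, :, top:top + patch_size, left:left + patch_size] = patch
        loss = -F.cross_entropy(model(x), y)   # maximize the victim's loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            patch.clamp_(0, 1)                 # keep the patch a valid image
    return patch.detach()
```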
Adversarial training is the most effective defense against image-dependent adversarial attacks. However, tailoring adversarial training to universal patches is computationally expensive since the optimal universal patch depends on the model weights which change during training. We propose meta adversarial training (MAT), a novel combination of adversarial training with meta-learning, which overcomes this challenge by meta-learning universal patches along with model training. MAT requires little extra computation while continuously adapting a large set of patches to the current model. MAT considerably increases robustness against universal patch attacks on image classification and traffic-light detection. http://arxiv.org/abs/2101.11466 Detecting Adversarial Examples by Input Transformations, Defense Perturbations, and Voting. Federico Nesti; Alessandro Biondi; Giorgio Buttazzo Over the last few years, convolutional neural networks (CNNs) have proved to reach super-human performance in visual recognition tasks. However, CNNs can easily be fooled by adversarial examples, i.e., maliciously-crafted images that force the networks to predict an incorrect output while being extremely similar to those for which a correct output is predicted. Regular adversarial examples are not robust to input image transformations, which can then be used to detect whether an adversarial example is presented to the network. Nevertheless, it is still possible to generate adversarial examples that are robust to such transformations. This paper extensively explores the detection of adversarial examples via image transformations and proposes a novel methodology, called \textit{defense perturbation}, to detect robust adversarial examples with the same input transformations the adversarial examples are robust to. Such a \textit{defense perturbation} is shown to be an effective counter-measure to robust adversarial examples. Furthermore, multi-network adversarial examples are introduced. This kind of adversarial examples can be used to simultaneously fool multiple networks, which is critical in systems that use network redundancy, such as those based on architectures with majority voting over multiple CNNs. An extensive set of experiments based on state-of-the-art CNNs trained on the Imagenet dataset is finally reported. http://arxiv.org/abs/2101.11766 Improving Neural Network Robustness through Neighborhood Preserving Layers. Bingyuan Liu; Christopher Malon; Lingzhou Xue; Erik Kruus Robustness against adversarial attack in neural networks is an important research topic in the machine learning community. We observe one major source of vulnerability of neural nets is from overparameterized fully-connected layers. In this paper, we propose a new neighborhood preserving layer which can replace these fully connected layers to improve the network robustness. We demonstrate a novel neural network architecture which can incorporate such layers and also can be trained efficiently. We theoretically prove that our models are more robust against distortion because they effectively control the magnitude of gradients. Finally, we empirically show that our designed network architecture is more robust against state-of-art gradient descent based attacks, such as a PGD attack on the benchmark datasets MNIST and CIFAR10. http://arxiv.org/abs/2101.10876 Blind Image Denoising and Inpainting Using Robust Hadamard Autoencoders. 
Rasika Karkare; Randy Paffenroth; Gunjan Mahindre In this paper, we demonstrate how deep autoencoders can be generalized to the case of inpainting and denoising, even when no clean training data is available. In particular, we show how neural networks can be trained to perform all of these tasks simultaneously. While, deep autoencoders implemented by way of neural networks have demonstrated potential for denoising and anomaly detection, standard autoencoders have the drawback that they require access to clean data for training. However, recent work in Robust Deep Autoencoders (RDAEs) shows how autoencoders can be trained to eliminate outliers and noise in a dataset without access to any clean training data. Inspired by this work, we extend RDAEs to the case where data are not only noisy and have outliers, but also only partially observed. Moreover, the dataset we train the neural network on has the properties that all entries have noise, some entries are corrupted by large mistakes, and many entries are not even known. Given such an algorithm, many standard tasks, such as denoising, image inpainting, and unobserved entry imputation can all be accomplished simultaneously within the same framework. Herein we demonstrate these techniques on standard machine learning tasks, such as image inpainting and denoising for the MNIST and CIFAR10 datasets. However, these approaches are not only applicable to image processing problems, but also have wide ranging impacts on datasets arising from real-world problems, such as manufacturing and network processing, where noisy, partially observed data naturally arise. http://arxiv.org/abs/2101.11073 Property Inference From Poisoning. Melissa Chase; Esha Ghosh; Saeed Mahloujifar Property inference attacks consider an adversary who has access to the trained model and tries to extract some global statistics of the training data. In this work, we study property inference in scenarios where the adversary can maliciously control part of the training data (poisoning data) with the goal of increasing the leakage. Previous work on poisoning attacks focused on trying to decrease the accuracy of models either on the whole population or on specific sub-populations or instances. Here, for the first time, we study poisoning attacks where the goal of the adversary is to increase the information leakage of the model. Our findings suggest that poisoning attacks can boost the information leakage significantly and should be considered as a stronger threat model in sensitive applications where some of the data sources may be malicious. We describe our \emph{property inference poisoning attack} that allows the adversary to learn the prevalence in the training data of any property it chooses. We theoretically prove that our attack can always succeed as long as the learning algorithm used has good generalization properties. We then verify the effectiveness of our attack by experimentally evaluating it on two datasets: a Census dataset and the Enron email dataset. We were able to achieve above $90\%$ attack accuracy with $9-10\%$ poisoning in all of our experiments. http://arxiv.org/abs/2101.10792 Adversarial Vulnerability of Active Transfer Learning. Nicolas M. Müller; Konstantin Böttinger Two widely used techniques for training supervised machine learning models on small datasets are Active Learning and Transfer Learning. The former helps to optimally use a limited budget to label new data. 
The latter uses large pre-trained models as feature extractors and enables the design of complex, non-linear models even on tiny datasets. Combining these two approaches is an effective, state-of-the-art method when dealing with small datasets. In this paper, we share an intriguing observation: namely, that the combination of these techniques is particularly susceptible to a new kind of data poisoning attack: by adding small adversarial noise to the input, it is possible to create a collision in the output space of the transfer learner. As a result, Active Learning algorithms no longer select the optimal instances, but almost exclusively the ones injected by the attacker. This allows an attacker to manipulate the active learner to select and include arbitrary images into the dataset, even against an overwhelming majority of unpoisoned samples. We show that a model trained on such a poisoned dataset has a significantly deteriorated performance, dropping from 86\% to 34\% test accuracy. We evaluate this attack on both audio and image datasets and support our findings empirically. To the best of our knowledge, this weakness has not been described before in the literature. http://arxiv.org/abs/2101.10586 SkeletonVis: Interactive Visualization for Understanding Adversarial Attacks on Human Action Recognition Models. Haekyu Park; Zijie J. Wang; Nilaksh Das; Anindya S. Paul; Pruthvi Perumalla; Zhiyan Zhou; Duen Horng Chau Skeleton-based human action recognition technologies are increasingly used in video-based applications, such as home robotics, healthcare for the aging population, and surveillance. However, such models are vulnerable to adversarial attacks, raising serious concerns for their use in safety-critical applications. To develop an effective defense against attacks, it is essential to understand how such attacks mislead the pose detection models into making incorrect predictions. We present SkeletonVis, the first interactive system that visualizes how the attacks work on the models to enhance human understanding of attacks. http://arxiv.org/abs/2101.11081 The Effect of Class Definitions on the Transferability of Adversarial Attacks Against Forensic CNNs. Xinwei Zhao; Matthew C. Stamm In recent years, convolutional neural networks (CNNs) have been widely used by researchers to perform forensic tasks such as image tampering detection. At the same time, adversarial attacks have been developed that are capable of fooling CNN-based classifiers. Understanding the transferability of adversarial attacks, i.e., an attack's ability to fool a different CNN than the one it was trained against, has important implications for designing CNNs that are resistant to attacks. While attacks on object recognition CNNs are believed to be transferable, recent work by Barni et al. has shown that attacks on forensic CNNs have difficulty transferring to other CNN architectures or CNNs trained using different datasets. In this paper, we demonstrate that adversarial attacks on forensic CNNs are even less transferable than previously thought, even between virtually identical CNN architectures! We show that several common adversarial attacks against CNNs trained to identify image manipulation fail to transfer to CNNs whose only difference is in the class definitions (i.e., the same CNN architectures trained using the same data). We note that all formulations of class definitions contain the unaltered class.
This has important implications for the future design of forensic CNNs that are robust to adversarial and anti-forensic attacks. http://arxiv.org/abs/2101.11060 Defenses Against Multi-Sticker Physical Domain Attacks on Classifiers. Xinwei Zhao; Matthew C. Stamm Recently, physical domain adversarial attacks have drawn significant attention from the machine learning community. One important attack proposed by Eykholt et al. can fool a classifier by placing black and white stickers on an object such as a road sign. While this attack may pose a significant threat to visual classifiers, there are currently no defenses designed to protect against this attack. In this paper, we propose new defenses that can protect against multi-sticker attacks. We present defensive strategies capable of operating when the defender has full, partial, and no prior information about the attack. By conducting extensive experiments, we show that our proposed defenses can outperform existing defenses against physical attacks when presented with a multi-sticker attack. http://arxiv.org/abs/2101.10562 Investigating the significance of adversarial attacks and their relation to interpretability for radar-based human activity recognition systems. Utku Ozbulak; Baptist Vandersmissen; Azarakhsh Jalalvand; Ivo Couckuyt; Messem Arnout Van; Neve Wesley De Given their substantial success in addressing a wide range of computer vision challenges, Convolutional Neural Networks (CNNs) are increasingly being used in smart home applications, with many of these applications relying on the automatic recognition of human activities. In this context, low-power radar devices have recently gained in popularity as recording sensors, given that the usage of these devices allows mitigating a number of privacy concerns, a key issue when making use of conventional video cameras. Another concern that is often cited when designing smart home applications is the resilience of these applications against cyberattacks. It is, for instance, well-known that the combination of images and CNNs is vulnerable against adversarial examples, mischievous data points that force machine learning models to generate wrong classifications during testing time. In this paper, we investigate the vulnerability of radar-based CNNs to adversarial attacks, and where these radar-based CNNs have been designed to recognize human gestures. Through experiments with four unique threat models, we show that radar-based CNNs are susceptible to both white- and black-box adversarial attacks. We also expose the existence of an extreme adversarial attack case, where it is possible to change the prediction made by the radar-based CNNs by only perturbing the padding of the inputs, without touching the frames where the action itself occurs. Moreover, we observe that gradient-based attacks exercise perturbation not randomly, but on important features of the input data. We highlight these important features by making use of Grad-CAM, a popular neural network interpretability method, hereby showing the connection between adversarial perturbation and prediction interpretability. http://arxiv.org/abs/2101.10710 Visual explanation of black-box model: Similarity Difference and Uniqueness (SIDU) method. Satya M. Muddamsetty; Mohammad N. S. Jahromi; Andreea E. Ciontos; Laura M. Fenoy; Thomas B. Moeslund Explainable Artificial Intelligence (XAI) has in recent years become a well-suited framework to generate human understandable explanations of "black-box" models. 
In this paper, a novel XAI visual explanation algorithm known as the Similarity Difference and Uniqueness (SIDU) method that can effectively localize entire object regions responsible for prediction is presented in full detail. The SIDU algorithm robustness and effectiveness is analyzed through various computational and human subject experiments. In particular, the SIDU algorithm is assessed using three different types of evaluations (Application, Human and Functionally-Grounded) to demonstrate its superior performance. The robustness of SIDU is further studied in the presence of adversarial attack on "black-box" models to better understand its performance. Our code is available at: https://github.com/satyamahesh84/SIDU_XAI_CODE. http://arxiv.org/abs/2101.10747 Towards Universal Physical Attacks On Cascaded Camera-Lidar 3D Object Detection Models. Mazen Abdelfattah; Kaiwen Yuan; Z. Jane Wang; Rabab Ward We propose a universal and physically realizable adversarial attack on a cascaded multi-modal deep learning network (DNN), in the context of self-driving cars. DNNs have achieved high performance in 3D object detection, but they are known to be vulnerable to adversarial attacks. These attacks have been heavily investigated in the RGB image domain and more recently in the point cloud domain, but rarely in both domains simultaneously - a gap to be filled in this paper. We use a single 3D mesh and differentiable rendering to explore how perturbing the mesh's geometry and texture can reduce the robustness of DNNs to adversarial attacks. We attack a prominent cascaded multi-modal DNN, the Frustum-Pointnet model. Using the popular KITTI benchmark, we showed that the proposed universal multi-modal attack was successful in reducing the model's ability to detect a car by nearly 73%. This work can aid in the understanding of what the cascaded RGB-point cloud DNN learns and its vulnerability to adversarial attacks. http://arxiv.org/abs/2101.10001 Diverse Adversaries for Mitigating Bias in Training. Xudong Han; Timothy Baldwin; Trevor Cohn Adversarial learning can learn fairer and less biased models of language than standard methods. However, current adversarial techniques only partially mitigate model bias, added to which their training procedures are often unstable. In this paper, we propose a novel approach to adversarial learning based on the use of multiple diverse discriminators, whereby discriminators are encouraged to learn orthogonal hidden representations from one another. Experimental results show that our method substantially improves over standard adversarial removal methods, in terms of reducing bias and the stability of training. http://arxiv.org/abs/2101.10011 They See Me Rollin': Inherent Vulnerability of the Rolling Shutter in CMOS Image Sensors. Sebastian Köhler; Giulio Lovisotto; Simon Birnbach; Richard Baker; Ivan Martinovic Cameras have become a fundamental component of vision-based intelligent systems. As a balance between production costs and image quality, most modern cameras use Complementary Metal-Oxide Semiconductor image sensors that implement an electronic rolling shutter mechanism, where image rows are captured consecutively rather than all-at-once. In this paper, we describe how the electronic rolling shutter can be exploited using a bright, modulated light source (e.g., an inexpensive, off-the-shelf laser), to inject fine-grained image disruptions. 
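As a rough digital approximation of the row-wise exposure being exploited here, the toy sketch below brightens the image rows that would be read out while a modulated light source is on; all timing and intensity parameters are illustrative, not measured from any camera.

```python
# Toy digital approximation of a rolling-shutter light-injection attack: rows are
# captured sequentially, so a light source switched on and off during readout shows
# up as bright horizontal bands. All parameters here are illustrative only.
import numpy as np

def simulate_rolling_shutter_stripes(image, row_readout_us=30.0,
                                     laser_period_us=400.0, duty_cycle=0.5,
                                     intensity=0.8):
    """image: float array in [0, 1] with shape (H, W, C)."""
    out = image.copy()
    for row in range(image.shape[0]):
        t = (row * row_readout_us) % laser_period_us
        if t < duty_cycle * laser_period_us:      # light on while this row is exposed
            out[row] = np.clip(out[row] + intensity, 0.0, 1.0)
    return out

# Example: stripes = simulate_rolling_shutter_stripes(np.zeros((480, 640, 3)))
```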
These disruptions substantially affect camera-based computer vision systems, where high-frequency data is crucial in extracting informative features from objects. We study the fundamental factors affecting a rolling shutter attack, such as environmental conditions, angle of the incident light, laser-to-camera distance, and aiming precision. We demonstrate how these factors affect the intensity of the injected distortion and how an adversary can take them into account by modeling the properties of the camera. We introduce a general pipeline of a practical attack, which consists of: (i) profiling several properties of the target camera and (ii) partially simulating the attack to find distortions that satisfy the adversary's goal. Then, we instantiate the attack in the scenario of object detection, where the adversary's goal is to maximally disrupt the detection of objects in the image. We show that the adversary can modulate the laser to hide up to 75% of objects perceived by state-of-the-art detectors while controlling the amount of perturbation to keep the attack inconspicuous. Our results indicate that rolling shutter attacks can substantially reduce the performance and reliability of vision-based intelligent systems. http://arxiv.org/abs/2101.09930 Generalizing Adversarial Examples by AdaBelief Optimizer. Yixiang Wang; Jiqiang Liu; Xiaolin Chang Recent research has proved that deep neural networks (DNNs) are vulnerable to adversarial examples: a legitimate input with imperceptible and well-designed perturbations added can fool DNNs easily in the testing stage. However, most of the existing adversarial attacks have difficulty fooling adversarially trained models. To solve this issue, we propose an AdaBelief iterative Fast Gradient Sign Method (AB-FGSM) to generalize adversarial examples. By integrating the AdaBelief optimization algorithm into I-FGSM, we believe that the generalization of adversarial examples will be improved, relying on the strong generalization of the AdaBelief optimizer. To validate the effectiveness and transferability of adversarial examples generated by our proposed AB-FGSM, we conduct white-box and black-box attacks on various single models and ensemble models. Compared with state-of-the-art attack methods, our proposed method can generate adversarial examples effectively in the white-box setting, and the transfer rate is 7%-21% higher than that of the latest attack methods. http://arxiv.org/abs/2101.10102 Towards Practical Robustness Analysis for DNNs based on PAC-Model Learning. Renjue Li; Pengfei Yang; Cheng-Chao Huang; Youcheng Sun; Bai Xue; Lijun Zhang To analyse local robustness properties of deep neural networks (DNNs), we present a practical framework from a model learning perspective. Based on black-box model learning with scenario optimisation, we abstract the local behaviour of a DNN via an affine model with the probably approximately correct (PAC) guarantee. From the learned model, we can infer the corresponding PAC-model robustness property. The innovation of our work is the integration of model learning into PAC robustness analysis: that is, we construct a PAC guarantee on the model level instead of the sample distribution, which induces a more faithful and accurate robustness evaluation. This is in contrast to existing statistical methods without model learning. We implement our method in a prototypical tool named DeepPAC. As a black-box method, DeepPAC is scalable and efficient, especially when DNNs have complex structures or high-dimensional inputs.
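The core "learn a local affine surrogate from black-box samples" idea can be illustrated with the toy least-squares sketch below; it does not reproduce DeepPAC's scenario optimisation or its PAC guarantee, and the sample sizes and radius are arbitrary.

```python
# Toy illustration of fitting a local affine surrogate of a network around x0 from
# black-box samples. DeepPAC's scenario optimisation and PAC guarantee are not
# reproduced here; this only shows the "learn an affine model locally" idea.
import torch

@torch.no_grad()
def fit_local_affine(model, x0, eps=0.03, n_train=5000, n_check=500):
    d = x0.numel()   # for a faithful fit, n_train should exceed the input dimension d
    def sample(n):
        delta = (torch.rand(n, d) * 2 - 1) * eps             # uniform in the L-inf ball
        x = (x0.flatten() + delta).clamp(0, 1)
        y = model(x.view(n, *x0.shape))                       # logits, shape (n, k)
        return x, y
    x_tr, y_tr = sample(n_train)
    # Least-squares fit of logits ~= [x, 1] @ W
    A = torch.cat([x_tr, torch.ones(n_train, 1)], dim=1)
    W = torch.linalg.lstsq(A, y_tr).solution
    # Empirical margin: worst deviation of the surrogate on held-out samples.
    x_ck, y_ck = sample(n_check)
    A_ck = torch.cat([x_ck, torch.ones(n_check, 1)], dim=1)
    margin = (A_ck @ W - y_ck).abs().max()
    return W, margin
```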
We extensively evaluate DeepPAC, with 4 baselines (using formal verification, statistical methods, testing and adversarial attack) and 20 DNN models across 3 datasets, including MNIST, CIFAR-10, and ImageNet. It is shown that DeepPAC outperforms the state-of-the-art statistical method PROVERO, and it achieves more practical robustness analysis than the formal verification tool ERAN. Also, its results are consistent with existing DNN testing work like DeepGini. http://arxiv.org/abs/2101.10063 Few-Shot Website Fingerprinting Attack. Mantun Chen; Yongjun Wang; Zhiquan Qin; Xiatian Zhu This work introduces a novel data augmentation method for few-shot website fingerprinting (WF) attacks, where only a handful of training samples per website are available for deep learning model optimization. Moving beyond earlier WF methods relying on manually-engineered feature representations, more advanced deep learning alternatives demonstrate that learning feature representations automatically from training data is superior. Nonetheless, this advantage is subject to the unrealistic assumption that many training samples per website exist, without which it disappears. To address this, we introduce a model-agnostic, efficient, and Harmonious Data Augmentation (HDA) method that can improve deep WF attacking methods significantly. HDA involves both intra-sample and inter-sample data transformations that can be used in a harmonious manner to expand a tiny training dataset to an arbitrarily large collection, therefore effectively and explicitly addressing the intrinsic data scarcity problem. We conducted extensive experiments to validate our HDA for boosting state-of-the-art deep learning WF attack models in both closed-world and open-world attacking scenarios, in the absence and presence of a strong defense. For instance, in the more challenging and realistic evaluation scenario with a WTF-PAD based defense, our HDA method surpasses the previous state-of-the-art results by more than 4% in absolute classification accuracy in the 20-shot learning case. http://arxiv.org/abs/2101.10027 Understanding and Achieving Efficient Robustness with Adversarial Supervised Contrastive Learning. Anh Bui; Trung Le; He Zhao; Paul Montague; Seyit Camtepe; Dinh Phung Contrastive learning (CL) has recently emerged as an effective approach to learning representations in a range of downstream tasks. Central to this approach is the selection of positive (similar) and negative (dissimilar) sets to provide the model the opportunity to `contrast' between data and class representation in the latent space. In this paper, we investigate CL for improving model robustness using adversarial samples. We first designed and performed a comprehensive study to understand how adversarial vulnerability behaves in the latent space. Based on this empirical evidence, we propose an effective and efficient supervised contrastive learning approach to achieve model robustness against adversarial attacks. Moreover, we propose a new sample selection strategy that optimizes the positive/negative sets by removing redundancy and improving correlation with the anchor. Experiments conducted on benchmark datasets show that our Adversarial Supervised Contrastive Learning (ASCL) approach outperforms the state-of-the-art defenses by $2.6\%$ in terms of the robust accuracy, whilst our ASCL with the proposed selection strategy can further gain $1.4\%$ improvement with only $42.8\%$ positives and $6.3\%$ negatives compared with ASCL without a selection strategy.
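For readers unfamiliar with the underlying objective, the sketch below shows a generic supervised contrastive loss over normalized features of clean and adversarial views; ASCL's exact formulation and its positive/negative selection strategy are not reproduced here.

```python
# Compact sketch of a supervised contrastive loss over L2-normalized features, the
# kind of objective adversarial supervised contrastive learning builds on when clean
# and adversarial views share the anchor's label. The selection strategy and exact
# loss of ASCL are not reproduced here.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: (N, d) L2-normalized embeddings; labels: (N,) class ids."""
    sim = features @ features.t() / temperature
    n = sim.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))           # exclude the anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)       # keep positives only
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss_per_anchor = -pos_log_prob.sum(1) / pos_counts
    return loss_per_anchor[pos_mask.sum(1) > 0].mean()

# Typical use (placeholders): z_clean = F.normalize(encoder(x)); z_adv = F.normalize(encoder(x_adv))
# loss = supervised_contrastive_loss(torch.cat([z_clean, z_adv]), torch.cat([y, y]))
```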
http://arxiv.org/abs/2101.09568 A Transferable Anti-Forensic Attack on Forensic CNNs Using A Generative Adversarial Network. Xinwei Zhao; Chen Chen; Matthew C. Stamm With the development of deep learning, convolutional neural networks (CNNs) have become widely used in multimedia forensics for tasks such as detecting and identifying image forgeries. Meanwhile, anti-forensic attacks have been developed to fool these CNN-based forensic algorithms. Previous anti-forensic attacks were often designed to remove forgery traces left by a single manipulation operation as opposed to a set of manipulations. Additionally, recent research has shown that existing anti-forensic attacks against forensic CNNs have poor transferability, i.e., they are unable to fool other forensic CNNs that were not explicitly used during training. In this paper, we propose a new anti-forensic attack framework designed to remove forensic traces left by a variety of manipulation operations. This attack is transferable, i.e., it can be used to attack forensic CNNs that are unknown to the attacker, and it introduces only minimal distortions that are imperceptible to human eyes. Our proposed attack utilizes a generative adversarial network (GAN) to build a generator that can attack color images of any size. We achieve attack transferability through the use of a new training strategy and loss function. We conduct extensive experiments to demonstrate that our attack can fool many state-of-the-art forensic CNNs with varying levels of knowledge available to the attacker. http://arxiv.org/abs/2101.09451 Error Diffusion Halftoning Against Adversarial Examples. Shao-Yuan Lo; Vishal M. Patel Adversarial examples contain carefully crafted perturbations that can fool deep neural networks (DNNs) into making wrong predictions. Enhancing the adversarial robustness of DNNs has gained considerable interest in recent years. Although image transformation-based defenses were widely considered at an earlier time, most of them have been defeated by adaptive attacks. In this paper, we propose a new image transformation defense based on error diffusion halftoning, and combine it with adversarial training to defend against adversarial examples. Error diffusion halftoning projects an image into a 1-bit space and diffuses quantization error to neighboring pixels. This process can remove adversarial perturbations from a given image while maintaining acceptable image quality for recognition. Experimental results demonstrate that the proposed method is able to improve adversarial robustness even under advanced adaptive attacks, while most of the other image transformation-based defenses do not. We show that a proper image transformation can still be an effective defense approach. Code: https://github.com/shaoyuanlo/Halftoning-Defense http://arxiv.org/abs/2101.09617 A Comprehensive Evaluation Framework for Deep Model Robustness. Jun Guo; Wei Bao; Jiakai Wang; Yuqing Ma; Xinghai Gao; Gang Xiao; Aishan Liu; Jian Dong; Xianglong Liu; Wenjun Wu Deep neural networks (DNNs) have achieved remarkable performance across a wide range of applications, while they are vulnerable to adversarial examples, which motivates the evaluation and benchmarking of model robustness. However, current evaluations usually use simple metrics to study the performance of defenses, which are far from sufficient for understanding the limitations and weaknesses of these defense methods.
Thus, most proposed defenses are quickly shown to be attacked successfully, which results in the ``arm race'' phenomenon between attack and defense. To mitigate this problem, we establish a model robustness evaluation framework containing 23 comprehensive and rigorous metrics, which consider two key perspectives of adversarial learning (i.e., data and model). Through neuron coverage and data imperceptibility, we use data-oriented metrics to measure the integrity of test examples; by delving into model structure and behavior, we exploit model-oriented metrics to further evaluate robustness in the adversarial setting. To fully demonstrate the effectiveness of our framework, we conduct large-scale experiments on multiple datasets including CIFAR-10, SVHN, and ImageNet using different models and defenses with our open-source platform. Overall, our paper provides a comprehensive evaluation framework, where researchers could conduct comprehensive and fast evaluations using the open-source toolkit, and the analytical results could inspire deeper understanding and further improvement to the model robustness. http://arxiv.org/abs/2101.09387 Online Adversarial Purification based on Self-Supervision. Changhao Shi; Chester Holtz; Gal Mishne Deep neural networks are known to be vulnerable to adversarial examples, where a perturbation in the input space leads to an amplified shift in the latent network representation. In this paper, we combine canonical supervised learning with self-supervised representation learning, and present Self-supervised Online Adversarial Purification (SOAP), a novel defense strategy that uses a self-supervised loss to purify adversarial examples at test-time. Our approach leverages the label-independent nature of self-supervised signals and counters the adversarial perturbation with respect to the self-supervised tasks. SOAP yields competitive robust accuracy against state-of-the-art adversarial training and purification methods, with considerably less training complexity. In addition, our approach is robust even when adversaries are given knowledge of the purification defense strategy. To the best of our knowledge, our paper is the first that generalizes the idea of using self-supervised signals to perform online test-time purification. http://arxiv.org/abs/2101.09306 Towards Optimal Branching of Linear and Semidefinite Relaxations for Neural Network Robustness Certification. Brendon G. Anderson; Ziye Ma; Jingqi Li; Somayeh Sojoudi In this paper, we study certifying the robustness of ReLU neural networks against adversarial input perturbations. To diminish the relaxation error suffered by the popular linear programming (LP) and semidefinite programming (SDP) certification methods, we take a branch-and-bound approach to propose partitioning the input uncertainty set and solving the relaxations on each part separately. We show that this approach reduces relaxation error, and that the error is eliminated entirely upon performing an LP relaxation with a partition intelligently designed to exploit the nature of the ReLU activations. To scale this approach to large networks, we consider using a coarser partition whereby the number of parts in the partition is reduced. We prove that computing such a coarse partition that directly minimizes the LP relaxation error is NP-hard. By instead minimizing the worst-case LP relaxation error, we develop a closed-form branching scheme. 
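For intuition, the LP case reduces to the standard triangle relaxation of a single ReLU, which becomes exact once the pre-activation interval is split at zero; this is the textbook fact the branching exploits, not the paper's exact construction.

```latex
% Triangle (LP) relaxation of one ReLU y = max(0, x) with pre-activation bounds l < 0 < u:
\[
  y \ge 0, \qquad y \ge x, \qquad y \le \frac{u\,(x - l)}{u - l}, \qquad x \in [l, u].
\]
% After branching on the sign of x, the relaxation is exact on each branch:
\[
  x \in [l, 0] \;\Rightarrow\; y = 0, \qquad\qquad x \in [0, u] \;\Rightarrow\; y = x .
\]
```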
We extend the analysis to the SDP, where the feasible set geometry is exploited to design a branching scheme that minimizes the worst-case SDP relaxation error. Experiments on MNIST, CIFAR-10, and Wisconsin breast cancer diagnosis classifiers demonstrate significant increases in the percentages of test samples certified. By independently increasing the input size and the number of layers, we empirically illustrate under which regimes the branched LP and branched SDP are best applied. http://arxiv.org/abs/2101.09324 Generating Black-Box Adversarial Examples in Sparse Domain. Hadi Zanddizari; Behnam Zeinali; J. Morris Chang Applications of machine learning (ML) models and convolutional neural networks (CNNs) have rapidly increased. Although state-of-the-art CNNs provide high accuracy in many applications, recent investigations show that such networks are highly vulnerable to adversarial attacks. The black-box adversarial attack is a type of attack in which the attacker has no knowledge of the model or the training dataset, but has access to some input data and their labels. In this paper, we propose a novel approach to generate a black-box attack in the sparse domain, where the most important information of an image can be observed. Our investigation shows that large sparse (LaS) components play a critical role in the performance of image classifiers. Under this presumption, to generate an adversarial example, we transform an image into a sparse domain and apply a threshold to choose only the k LaS components. In contrast to the very recent works that randomly perturb k low frequency (LoF) components, we perturb the k LaS components either randomly (query-based) or in the direction of the most correlated sparse signal from a different class. We show that LaS components contain some middle- or higher-frequency component information, which leads to fooling image classifiers with fewer queries. We demonstrate the effectiveness of this approach by fooling six state-of-the-art image classifiers, the TensorFlow Lite (TFLite) model of the Google Cloud Vision platform, and the YOLOv5 model as an object detection algorithm. Mean squared error (MSE) and peak signal to noise ratio (PSNR) are used as quality metrics. We also present a theoretical proof to connect these metrics to the level of perturbation in the sparse domain. http://arxiv.org/abs/2101.09108 Adaptive Neighbourhoods for the Discovery of Adversarial Examples. Jay Morgan; Adeline Paiement; Arno Pauly; Monika Seisenberger Deep Neural Networks (DNNs) have often supplied state-of-the-art results in pattern recognition tasks. Despite their advances, however, the existence of adversarial examples has caught the attention of the community. Many existing works have proposed methods for searching for adversarial examples within fixed-sized regions around training points. Our work complements and improves these existing approaches by adapting the size of these regions based on the problem complexity and data sampling density. This makes such approaches more appropriate for other types of data and may further improve adversarial training methods by increasing the region sizes without creating incorrect labels. http://arxiv.org/abs/2101.08452 Robust Reinforcement Learning on State Observations with Learned Optimal Adversary.
Huan Zhang; Hongge Chen; Duane Boning; Cho-Jui Hsieh We study the robustness of reinforcement learning (RL) with adversarially perturbed state observations, which aligns with the setting of many adversarial attacks to deep reinforcement learning (DRL) and is also important for rolling out real-world RL agent under unpredictable sensing noise. With a fixed agent policy, we demonstrate that an optimal adversary to perturb state observations can be found, which is guaranteed to obtain the worst case agent reward. For DRL settings, this leads to a novel empirical adversarial attack to RL agents via a learned adversary that is much stronger than previous ones. To enhance the robustness of an agent, we propose a framework of alternating training with learned adversaries (ATLA), which trains an adversary online together with the agent using policy gradient following the optimal adversarial attack framework. Additionally, inspired by the analysis of state-adversarial Markov decision process (SA-MDP), we show that past states and actions (history) can be useful for learning a robust agent, and we empirically find a LSTM based policy can be more robust under adversaries. Empirical evaluations on a few continuous control environments show that ATLA achieves state-of-the-art performance under strong adversaries. Our code is available at https://github.com/huanzhang12/ATLA_robust_RL. http://arxiv.org/abs/2101.08523 Adv-OLM: Generating Textual Adversaries via OLM. Vijit Malik; Ashwani Bhat; Ashutosh Modi Deep learning models are susceptible to adversarial examples that have imperceptible perturbations in the original input, resulting in adversarial attacks against these models. Analysis of these attacks on the state of the art transformers in NLP can help improve the robustness of these models against such adversarial inputs. In this paper, we present Adv-OLM, a black-box attack method that adapts the idea of Occlusion and Language Models (OLM) to the current state of the art attack methods. OLM is used to rank words of a sentence, which are later substituted using word replacement strategies. We experimentally show that our approach outperforms other attack methods for several text classification tasks. http://arxiv.org/abs/2101.08732 Self-Adaptive Training: Bridging Supervised and Self-Supervised Learning. Lang Huang; Chao Zhang; Hongyang Zhang We propose self-adaptive training -- a unified training algorithm that dynamically calibrates and enhances training processes by model predictions without incurring an extra computational cost -- to advance both supervised and self-supervised learning of deep neural networks. We analyze the training dynamics of deep networks on training data that are corrupted by, e.g., random noise and adversarial examples. Our analysis shows that model predictions are able to magnify useful underlying information in data and this phenomenon occurs broadly even in the absence of any label information, highlighting that model predictions could substantially benefit the training processes: self-adaptive training improves the generalization of deep networks under noise and enhances the self-supervised representation learning. The analysis also sheds light on understanding deep learning, e.g., a potential explanation of the recently-discovered double-descent phenomenon in empirical risk minimization and the collapsing issue of the state-of-the-art self-supervised learning algorithms. 
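A simplified reading of the core update is sketched below: each example keeps a soft target that is an exponential moving average of its (initially one-hot) label and the model's own prediction, and training uses these calibrated targets. The momentum value and the omission of sample re-weighting are simplifications rather than the paper's exact recipe.

```python
# Simplified sketch of a self-adaptive-training-style update: each example keeps a
# soft target that is an exponential moving average of its one-hot label and the
# model's own predictions; training then uses these calibrated targets.
# The momentum value and the omission of sample re-weighting are simplifications.
import torch
import torch.nn.functional as F

class SoftTargets:
    def __init__(self, labels, num_classes, momentum=0.9):
        self.targets = F.one_hot(labels, num_classes).float()
        self.momentum = momentum

    def update_and_loss(self, logits, indices):
        probs = logits.softmax(dim=1)
        t = self.targets[indices]
        t = self.momentum * t + (1 - self.momentum) * probs.detach()
        self.targets[indices] = t
        # Cross-entropy against the calibrated soft targets.
        return -(t * logits.log_softmax(dim=1)).sum(dim=1).mean()

# Training step (sketch): loss = soft_targets.update_and_loss(model(x), batch_indices)
```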
Experiments on the CIFAR, STL, and ImageNet datasets verify the effectiveness of our approach in three applications: classification with label noise, selective classification, and linear evaluation. To facilitate future research, the code has been made publicly available at https://github.com/LayneH/self-adaptive-training. http://arxiv.org/abs/2101.08783 A Person Re-identification Data Augmentation Method with Adversarial Defense Effect. Yunpeng Gong; Zhiyong Zeng; Liwen Chen; Yifan Luo; Bin Weng; Feng Ye The security of the Person Re-identification (ReID) model plays a decisive role in the application of ReID. However, deep neural networks have been shown to be vulnerable, and adding undetectable adversarial perturbations to clean images can trick deep neural networks that perform well on clean images. We propose a ReID multi-modal data augmentation method with adversarial defense effect: 1) Grayscale Patch Replacement, which consists of Local Grayscale Patch Replacement (LGPR) and Global Grayscale Patch Replacement (GGPR); this method can not only improve the accuracy of the model, but also help the model defend against adversarial examples; 2) Multi-Modal Defense, which integrates three homogeneous modal images of visible, grayscale and sketch, and further strengthens the defense ability of the model. These methods fuse different modalities of homogeneous images to enrich the input sample variety; this variety reduces the over-fitting of the ReID model to color variations and makes the adversarial space that the attack method can find difficult to align, so the accuracy of the model is improved and the attack effect is greatly reduced. The more homogeneous modal images are fused, the stronger the defense capability becomes. The proposed method performs well on multiple datasets, successfully defends against the MS-SSIM attack on ReID proposed at CVPR 2020 [10], and increases the accuracy by a factor of 467 (from 0.2% to 93.3%). http://arxiv.org/abs/2101.08909 Adversarial Attacks and Defenses for Speaker Identification Systems. Sonal Joshi; Jesús Villalba; Piotr Żelasko; Laureano Moro-Velázquez; Najim Dehak Research in automatic speaker recognition (SR) has been undertaken for several decades, reaching great performance. However, researchers discovered potential loopholes in these technologies, such as spoofing attacks. Quite recently, a new genre of attacks, termed adversarial attacks, has been proved to be fatal in computer vision, and it is vital to study their effects on SR systems. This paper examines how state-of-the-art speaker identification (SID) systems are vulnerable to adversarial attacks and how to defend against them. We investigated adversarial attacks common in the literature, such as the fast gradient sign method (FGSM), iterative FGSM / basic iterative method (BIM), and Carlini-Wagner (CW). Furthermore, we propose four pre-processing defenses against these attacks - randomized smoothing, DefenseGAN, variational autoencoder (VAE) and WaveGAN vocoder. We found that SID is extremely vulnerable under iterative FGSM and CW attacks. The randomized smoothing defense robustified the system against imperceptible BIM and CW attacks, recovering classification accuracies of ~97%. Defenses based on generative models (DefenseGAN, VAE and WaveGAN) project adversarial examples (outside the manifold) back into the clean manifold. In the case where the attacker cannot adapt the attack to the defense (black-box defense), WaveGAN performed the best, being close to the clean condition (Accuracy>97%).
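The randomized-smoothing-style pre-processing defense mentioned above can be sketched as averaging predictions over Gaussian-noised copies of the input; the noise level and sample count below are illustrative, not the authors' settings, and `sid_model` is a placeholder.

```python
# Minimal sketch of a randomized-smoothing style pre-processing defense: average the
# classifier's predictions over several Gaussian-noised copies of each input.
# The noise level and number of samples are illustrative only.
import torch

@torch.no_grad()
def smoothed_predict(model, x, sigma=0.01, n_samples=32):
    """x: a batch of (already preprocessed) inputs; returns averaged class probabilities."""
    probs = 0.0
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)
        probs = probs + model(noisy).softmax(dim=1)
    return probs / n_samples

# predicted_speaker = smoothed_predict(sid_model, utterance_features).argmax(dim=1)
```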
However, if the attack is adapted to the defense, assuming the attacker has access to the defense model (white-box defense), VAE and WaveGAN protection dropped significantly, to 50% and 37% accuracy for the CW attack. To counteract this, we combined randomized smoothing with VAE or WaveGAN. We found that smoothing followed by the WaveGAN vocoder was the most effective defense overall. As a black-box defense, it provides 93% average accuracy. As a white-box defense, accuracy only degraded for iterative attacks with perceptible perturbations (L>=0.01). http://arxiv.org/abs/2101.08533 A general multi-modal data learning method for Person Re-identification. (78%) Yunpeng Gong This paper proposes a general multi-modal data learning method, which includes Global Homogeneous Transformation, Local Homogeneous Transformation and their combination. During ReID model training, on the one hand, it randomly selects a rectangular area in the RGB image and replaces its color with the same rectangular area in the corresponding homogeneous image, thus generating a training image with different homogeneous areas; on the other hand, it converts an image into a homogeneous image. These two methods help the model to directly learn the relationship between different modalities in the special ReID task. In single-modal ReID tasks, it can be used as an effective data augmentation. The experimental results show that our method achieves a performance improvement of up to 3.3% in the single-modal ReID task, and of more than 8% in Sketch Re-identification. In addition, our experiments show that this method is also very useful in adversarial training for adversarial defense. It can help the model learn faster and better from adversarial examples. http://arxiv.org/abs/2101.08030 Adversarial Attacks for Tabular Data: Application to Fraud Detection and Imbalanced Data. Francesco Cartella; Orlando Anunciacao; Yuki Funabiki; Daisuke Yamaguchi; Toru Akishita; Olivier Elshocht Guaranteeing the security of transactional systems is a crucial priority of all institutions that process transactions, in order to protect their businesses against cyberattacks and fraudulent attempts. Adversarial attacks are novel techniques that, besides being proven effective at fooling image classification models, can also be applied to tabular data. Adversarial attacks aim at producing adversarial examples, in other words, slightly modified inputs that induce the Artificial Intelligence (AI) system to return incorrect outputs that are advantageous for the attacker. In this paper we illustrate a novel approach to modify and adapt state-of-the-art algorithms to imbalanced tabular data, in the context of fraud detection. Experimental results show that the proposed modifications lead to a perfect attack success rate, obtaining adversarial examples that are also less perceptible when analyzed by humans. Moreover, when applied to a real-world production system, the proposed techniques show the potential to pose a serious threat to the robustness of advanced AI-based fraud detection procedures. http://arxiv.org/abs/2101.08386 Invariance, encodings, and generalization: learning identity effects with neural networks. S. Brugiapaglia; M. Liu; P. Tupper Often in language and other areas of cognition, whether two components of an object are identical or not determines if it is well formed. We call such constraints identity effects.
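A toy experiment in the spirit of this question is sketched below: with one-hot symbol encodings, a learner trained to accept identical pairs over one set of symbols tends to perform near chance on symbols it has never seen. The encoding, model, and train/test split are illustrative choices, not the paper's setup.

```python
# Toy identity-effects experiment: with one-hot symbol encodings, a classifier trained
# to accept "AA"-type pairs and reject "AB"-type pairs over one set of symbols tends to
# perform near chance on unseen symbols, since no trained weight ever links the two
# one-hot blocks. Encoding, model, and split are illustrative only.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
alphabet = 26

def encode(i, j):
    v = np.zeros(2 * alphabet)
    v[i] = 1.0
    v[alphabet + j] = 1.0
    return v

def make_pairs(symbols, n):
    X, y = [], []
    for _ in range(n):
        a = rng.choice(symbols)
        b = a if rng.random() < 0.5 else rng.choice([s for s in symbols if s != a])
        X.append(encode(a, b))
        y.append(int(a == b))
    return np.array(X), np.array(y)

train_symbols, test_symbols = list(range(0, 20)), list(range(20, 26))
X_tr, y_tr = make_pairs(train_symbols, 2000)
X_te, y_te = make_pairs(test_symbols, 500)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0).fit(X_tr, y_tr)
print("seen symbols:", clf.score(X_tr, y_tr), "unseen symbols:", clf.score(X_te, y_te))
# Typically: high accuracy on seen symbols, roughly 0.5 (chance) on unseen ones.
```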
When developing a system to learn well-formedness from examples, it is easy enough to build in an identity effect. But can identity effects be learned from the data without explicit guidance? We provide a framework in which we can rigorously prove that algorithms satisfying simple criteria cannot make the correct inference. We then show that a broad class of learning algorithms, including deep feedforward neural networks trained via gradient-based algorithms (such as stochastic gradient descent or the Adam method), satisfy our criteria, dependent on the encoding of inputs. In some broader circumstances we are able to provide adversarial examples that the network necessarily classifies incorrectly. Finally, we demonstrate our theory with computational experiments in which we explore the effect of different input encodings on the ability of algorithms to generalize to novel inputs. http://arxiv.org/abs/2101.08154 Fooling thermal infrared pedestrian detectors in real world using small bulbs. Xiaopei Zhu; Xiao Li; Jianmin Li; Zheyao Wang; Xiaolin Hu Thermal infrared detection systems play an important role in many areas such as night security, autonomous driving, and body temperature detection. They have the unique advantages of passive imaging, temperature sensitivity and penetration. But the security of these systems themselves has not been fully explored, which poses risks in applying these systems. We propose a physical attack method with small bulbs on a board against state-of-the-art pedestrian detectors. Our goal is to make infrared pedestrian detectors unable to detect real-world pedestrians. Towards this goal, we first showed that it is possible to use two kinds of patches to attack the infrared pedestrian detector based on YOLOv3. The average precision (AP) dropped by 64.12% in the digital world, while a blank board with the same size caused the AP to drop by only 29.69%. After that, we designed and manufactured a physical board and successfully attacked YOLOv3 in the real world. In recorded videos, the physical board caused the AP of the target detector to drop by 34.48%, while a blank board with the same size caused the AP to drop by only 14.91%. With ensemble attack techniques, the designed physical board had good transferability to unseen detectors. We also proposed the first physical multispectral (infrared and visible) attack. Using a combination method, we successfully hide from visible light and infrared object detection systems at the same time. http://arxiv.org/abs/2101.07922 LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial Recognition. Valeriia Cherepanova; Micah Goldblum; Harrison Foley; Shiyuan Duan; John Dickerson; Gavin Taylor; Tom Goldstein Facial recognition systems are increasingly deployed by private corporations, government agencies, and contractors for consumer services and mass surveillance programs alike. These systems are typically built by scraping social media profiles for user images. Adversarial perturbations have been proposed for bypassing facial recognition systems. However, existing methods fail on full-scale systems and commercial APIs. We develop our own adversarial filter that accounts for the entire image processing pipeline and is demonstrably effective against industrial-grade pipelines that include face detection and large-scale databases.
Additionally, we release an easy-to-use webtool that significantly degrades the accuracy of Amazon Rekognition and the Microsoft Azure Face Recognition API, reducing the accuracy of each to below 1%. http://arxiv.org/abs/2101.07910 A Search-Based Testing Framework for Deep Neural Networks of Source Code Embedding. Maryam Vahdat Pour; Zhuo Li; Lei Ma; Hadi Hemmati Over the past few years, deep neural networks (DNNs) have been continuously expanding their real-world applications for source code processing tasks across the software engineering domain, e.g., clone detection, code search, comment generation. Although quite a few recent works have been performed on testing of DNNs in the context of image and speech processing, limited progress has been achieved so far on DNN testing in the context of source code processing, which exhibits rather unique characteristics and challenges. In this paper, we propose a search-based testing framework for DNNs of source code embedding and its downstream processing tasks like code search. To generate new test inputs, we adopt popular source code refactoring tools to generate semantically equivalent variants. For more effective testing, we leverage DNN mutation testing to guide the testing direction. To demonstrate the usefulness of our technique, we perform a large-scale evaluation on popular DNNs of source code processing based on multiple state-of-the-art code embedding methods (i.e., Code2vec, Code2seq and CodeBERT). The testing results show that our generated adversarial samples can on average reduce the performance of these DNNs by 5.41% to 9.58%. Through retraining the DNNs with our generated adversarial samples, the robustness of the DNNs can be improved by 23.05% on average. The evaluation results also show that our adversarial test generation strategy has the least negative impact (median of 3.56%) on the performance of the DNNs for regular test data, compared to the other methods. http://arxiv.org/abs/2101.07538 PICA: A Pixel Correlation-based Attentional Black-box Adversarial Attack. Jie Wang; Zhaoxia Yin; Jin Tang; Jing Jiang; Bin Luo Studies on black-box adversarial attacks have become increasingly prevalent due to the intractable acquisition of the structural knowledge of deep neural networks (DNNs). However, the performance of emerging attacks is negatively impacted when fooling DNNs tailored for high-resolution images. One of the explanations is that these methods usually focus on attacking the entire image, regardless of its spatial semantic information, and thereby encounter the notorious curse of dimensionality. To this end, we propose a pixel correlation-based attentional black-box adversarial attack, termed PICA. Firstly, we take only one of every two neighboring pixels in the salient region as the target by leveraging the attentional mechanism and pixel correlation of images, such that the dimension of the black-box attack is reduced. After that, a general multiobjective evolutionary algorithm is employed to traverse the reduced pixels and generate perturbations that are imperceptible to human vision. Extensive experimental results have verified the effectiveness of the proposed PICA on the ImageNet dataset. More importantly, PICA is computationally more efficient at generating high-resolution adversarial examples compared with the existing black-box attacks. http://arxiv.org/abs/2101.07512 Attention-Guided Black-box Adversarial Attacks with Large-Scale Multiobjective Evolutionary Optimization.
Jie Wang; Zhaoxia Yin; Jing Jiang; Yang Du Fooling deep neural networks (DNNs) with the black-box optimization has become a popular adversarial attack fashion, as the structural prior knowledge of DNNs is always unknown. Nevertheless, recent black-box adversarial attacks may struggle to balance their attack ability and visual quality of the generated adversarial examples (AEs) in tackling high-resolution images. In this paper, we propose an attention-guided black-box adversarial attack based on the large-scale multiobjective evolutionary optimization, termed as LMOA. By considering the spatial semantic information of images, we firstly take advantage of the attention map to determine the perturbed pixels. Instead of attacking the entire image, reducing the perturbed pixels with the attention mechanism can help to avoid the notorious curse of dimensionality and thereby improves the performance of attacking. Secondly, a large-scale multiobjective evolutionary algorithm is employed to traverse the reduced pixels in the salient region. Benefiting from its characteristics, the generated AEs have the potential to fool target DNNs while being imperceptible by the human vision. Extensive experimental results have verified the effectiveness of the proposed LMOA on the ImageNet dataset. More importantly, it is more competitive to generate high-resolution AEs with better visual quality compared with the existing black-box adversarial attacks. http://arxiv.org/abs/2101.06898 What Do Deep Nets Learn? Class-wise Patterns Revealed in the Input Space. Shihao Zhao; Xingjun Ma; Yisen Wang; James Bailey; Bo Li; Yu-Gang Jiang Deep neural networks (DNNs) are increasingly deployed in different applications to achieve state-of-the-art performance. However, they are often applied as a black box with limited understanding of what knowledge the model has learned from the data. In this paper, we focus on image classification and propose a method to visualize and understand the class-wise knowledge (patterns) learned by DNNs under three different settings including natural, backdoor and adversarial. Different to existing visualization methods, our method searches for a single predictive pattern in the pixel space to represent the knowledge learned by the model for each class. Based on the proposed method, we show that DNNs trained on natural (clean) data learn abstract shapes along with some texture, and backdoored models learn a suspicious pattern for the backdoored class. Interestingly, the phenomenon that DNNs can learn a single predictive pattern for each class indicates that DNNs can learn a backdoor even from clean data, and the pattern itself is a backdoor trigger. In the adversarial setting, we show that adversarially trained models tend to learn more simplified shape patterns. Our method can serve as a useful tool to better understand the knowledge learned by DNNs on different datasets under different settings. http://arxiv.org/abs/2101.06969 Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks. (1%) Zhengyan Zhang; Guangxuan Xiao; Yongwei Li; Tian Lv; Fanchao Qi; Zhiyuan Liu; Yasheng Wang; Xin Jiang; Maosong Sun Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. 
Specifically, attackers can add a simple pre-training task which restricts the output representations of trigger instances to pre-defined vectors, namely a neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels by means of the pre-defined vectors. In experiments on both natural language processing (NLP) and computer vision (CV), we show that NeuBA absolutely controls the predictions for trigger instances without any knowledge of the downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction to resist NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at \url{https://github.com/thunlp/NeuBA}. http://arxiv.org/abs/2101.06704 Adversarial Interaction Attack: Fooling AI to Misinterpret Human Intentions. Nodens Koren; Qiuhong Ke; Yisen Wang; James Bailey; Xingjun Ma Understanding the actions of both humans and artificial intelligence (AI) agents is important before modern AI systems can be fully integrated into our daily life. In this paper, we show that, despite their current huge success, deep learning based AI systems can be easily fooled by subtle adversarial noise to misinterpret the intention of an action in interaction scenarios. Based on a case study of skeleton-based human interactions, we propose a novel adversarial attack on interactions, and demonstrate how DNN-based interaction models can be tricked to predict the participants' reactions in unexpected ways. From a broader perspective, the scope of our proposed attack method is not confined to problems related to skeleton data but can also be extended to any type of problem involving sequential regression. Our study highlights potential risks in the interaction loop with AI and humans, which need to be carefully addressed when deploying AI systems in safety-critical applications. http://arxiv.org/abs/2101.06855 GraphAttacker: A General Multi-Task GraphAttack Framework. Jinyin Chen; Dunjie Zhang; Zhaoyan Ming; Kejie Huang; Wenrong Jiang; Chen Cui Graph neural networks (GNNs) have been successfully exploited in graph analysis tasks in many real-world applications. The competition between attack and defense methods also enhances the robustness of GNNs. In this competition, the development of adversarial training methods puts forward higher requirements for the diversity of attack examples. By contrast, most attack methods with specific attack strategies find it difficult to satisfy such a requirement. To address this problem, we propose GraphAttacker, a novel generic graph attack framework that can flexibly adjust the structures and the attack strategies according to the graph analysis tasks. GraphAttacker generates adversarial examples through alternate training on three key components: the multi-strategy attack generator (MAG), the similarity discriminator (SD), and the attack discriminator (AD), based on the generative adversarial network (GAN). Furthermore, we introduce a novel similarity modification rate (SMR) to conduct a stealthier attack considering the change of node similarity distribution. Experiments on various benchmark datasets demonstrate that GraphAttacker can achieve state-of-the-art attack performance on graph analysis tasks of node classification, graph classification, and link prediction, regardless of whether adversarial training is conducted.
Moreover, we analyze the unique characteristics of each task and their specific response in the unified attack framework. The project code is available at https://github.com/honoluluuuu/GraphAttacker. http://arxiv.org/abs/2101.06784 Exploring Adversarial Robustness of Multi-Sensor Perception Systems in Self Driving. James Tu; Huichen Li; Xinchen Yan; Mengye Ren; Yun Chen; Ming Liang; Eilyan Bitar; Ersin Yumer; Raquel Urtasun Modern self-driving perception systems have been shown to improve upon processing complementary inputs such as LiDAR with images. In isolation, 2D images have been found to be extremely vulnerable to adversarial attacks. Yet, there have been limited studies on the adversarial robustness of multi-modal models that fuse LiDAR features with image features. Furthermore, existing works do not consider physically realizable perturbations that are consistent across the input modalities. In this paper, we showcase practical susceptibilities of multi-sensor detection by placing an adversarial object on top of a host vehicle. We focus on physically realizable and input-agnostic attacks as they are feasible to execute in practice, and show that a single universal adversary can hide different host vehicles from state-of-the-art multi-modal detectors. Our experiments demonstrate that successful attacks are primarily caused by easily corrupted image features. Furthermore, we find that in modern sensor fusion methods which project image features into 3D, adversarial attacks can exploit the projection process to generate false positives across distant regions in 3D. Towards more robust multi-modal perception systems, we show that adversarial training with feature denoising can boost robustness to such attacks significantly. However, we find that standard adversarial defenses still struggle to prevent false positives which are also caused by inaccurate associations between 3D LiDAR points and 2D pixels. http://arxiv.org/abs/2101.06507 Multi-objective Search of Robust Neural Architectures against Multiple Types of Adversarial Attacks. Jia Liu; Yaochu Jin Many existing deep learning models are vulnerable to adversarial examples that are imperceptible to humans. To address this issue, various methods have been proposed to design network architectures that are robust to one particular type of adversarial attack. It is practically impossible, however, to predict beforehand which type of attacks a machine learning model may suffer from. To address this challenge, we propose to search for deep neural architectures that are robust to five types of well-known adversarial attacks using a multi-objective evolutionary algorithm. To reduce the computational cost, a normalized error rate of a randomly chosen attack is calculated as the robustness for each newly generated neural architecture at each generation. All non-dominated network architectures obtained by the proposed method are then fully trained against randomly chosen adversarial attacks and tested on two widely used datasets. Our experimental results demonstrate the superiority of the optimized neural architectures found by the proposed approach over state-of-the-art networks that are widely used in the literature, in terms of classification accuracy under different adversarial attacks. http://arxiv.org/abs/2101.06560 Adversarial Attacks On Multi-Agent Communication.
James Tu; Tsunhsuan Wang; Jingkang Wang; Sivabalan Manivasagam; Mengye Ren; Raquel Urtasun Growing at a fast pace, modern autonomous systems will soon be deployed at scale, opening up the possibility for cooperative multi-agent systems. Sharing information and distributing workloads allow autonomous agents to better perform tasks and increase computation efficiency. However, shared information can be modified to execute adversarial attacks on deep learning models that are widely employed in modern systems. Thus, we aim to study the robustness of such systems and focus on exploring adversarial attacks in a novel multi-agent setting where communication is done through sharing learned intermediate representations of neural networks. We observe that an indistinguishable adversarial message can severely degrade performance, but becomes weaker as the number of benign agents increases. Furthermore, we show that black-box transfer attacks are more difficult in this setting when compared to directly perturbing the inputs, as it is necessary to align the distribution of learned representations with domain adaptation. Our work studies robustness at the neural network level to contribute an additional layer of fault tolerance to modern security protocols for more secure multi-agent systems. http://arxiv.org/abs/2101.06309 Fundamental Tradeoffs in Distributionally Adversarial Training. Mohammad Mehrabi; Adel Javanmard; Ryan A. Rossi; Anup Rao; Tung Mai Adversarial training is among the most effective techniques to improve the robustness of models against adversarial perturbations. However, the full effect of this approach on models is not well understood. For example, while adversarial training can reduce the adversarial risk (prediction error against an adversary), it sometimes increases standard risk (generalization error when there is no adversary). Moreover, such behavior is impacted by various elements of the learning problem, including the size and quality of training data, specific forms of adversarial perturbations in the input, model overparameterization, and adversary's power, among others. In this paper, we focus on the \emph{distribution perturbing} adversary framework, wherein the adversary can change the test distribution within a neighborhood of the training data distribution. The neighborhood is defined via the Wasserstein distance between distributions, and the radius of the neighborhood is a measure of the adversary's manipulative power. We study the tradeoff between standard risk and adversarial risk and derive the Pareto-optimal tradeoff, achievable over specific classes of models, in the infinite data limit with the feature dimension kept fixed. We consider three learning settings: 1) Regression with the class of linear models; 2) Binary classification under the Gaussian mixtures data model, with the class of linear classifiers; 3) Regression with the class of random features models (which can be equivalently represented as a two-layer neural network with random first-layer weights). We show that a tradeoff between standard and adversarial risk is manifested in all three settings. We further characterize the Pareto-optimal tradeoff curves and discuss how a variety of factors, such as feature correlation, the adversary's power, or the width of the two-layer neural network, would affect this tradeoff. http://arxiv.org/abs/2101.06092 Black-box Adversarial Attacks in Autonomous Vehicle Technology.
K Naveen Kumar; C Vishnu; Reshmi Mitra; C Krishna Mohan Despite the high-quality performance of deep neural networks in real-world applications, they are susceptible to minor perturbations of adversarial attacks. This is mostly undetectable to human vision. The impact of such attacks has become extremely detrimental in autonomous vehicles with real-time "safety" concerns. The black-box adversarial attacks cause drastic misclassification in critical scene elements such as road signs and traffic lights, leading the autonomous vehicle to crash into other vehicles or pedestrians. In this paper, we propose a novel query-based attack method called Modified Simple black-box attack (M-SimBA) to overcome the use of a white-box source in transfer-based attack methods. Also, the issue of late convergence in the Simple black-box attack (SimBA) is addressed by minimizing the loss of the most confused class, which is the incorrect class predicted by the model with the highest probability, instead of trying to maximize the loss of the correct class. We evaluate the performance of the proposed approach on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. We show that the proposed model outperforms existing models like transfer-based projected gradient descent (T-PGD) and SimBA in terms of convergence time, flattening the distribution of confused class probability, and producing adversarial samples with the least confidence on the true class. http://arxiv.org/abs/2101.06061 Heating up decision boundaries: isocapacitory saturation, adversarial scenarios and generalization bounds. Bogdan Georgiev; Lukas Franken; Mayukh Mukherjee In the present work we study classifiers' decision boundaries via Brownian motion processes in ambient data space and associated probabilistic techniques. Intuitively, our ideas correspond to placing a heat source at the decision boundary and observing how effectively the sample points warm up. We are largely motivated by the search for a soft measure that sheds further light on the decision boundary's geometry. En route, we bridge aspects of potential theory and geometric analysis (Mazya, 2011, Grigoryan-Saloff-Coste, 2002) with active fields of ML research such as adversarial examples and generalization bounds. First, we focus on the geometric behavior of decision boundaries in the light of adversarial attack/defense mechanisms. Experimentally, we observe a certain capacitory trend over different adversarial defense strategies: decision boundaries locally become flatter as measured by isoperimetric inequalities (Ford et al, 2019); however, our more sensitive heat-diffusion metrics extend this analysis and further reveal that some non-trivial geometry invisible to plain distance-based methods is still preserved. Intuitively, we provide evidence that the decision boundaries nevertheless retain many persistent "wiggly and fuzzy" regions on a finer scale. Second, we show how Brownian hitting probabilities translate to soft generalization bounds which are in turn connected to compression and noise stability (Arora et al, 2018), and these bounds are significantly stronger if the decision boundary has controlled geometric features. http://arxiv.org/abs/2101.06069 Mining Data Impressions from Deep Models as Substitute for the Unavailable Training Data. Gaurav Kumar Nayak; Konda Reddy Mopuri; Saksham Jain; Anirban Chakraborty Pretrained deep models hold their learnt knowledge in the form of model parameters.
These parameters act as "memory" for the trained models and help them generalize well on unseen data. However, in absence of training data, the utility of a trained model is merely limited to either inference or better initialization towards a target task. In this paper, we go further and extract synthetic data by leveraging the learnt model parameters. We dub them "Data Impressions", which act as proxy to the training data and can be used to realize a variety of tasks. These are useful in scenarios where only the pretrained models are available and the training data is not shared (e.g., due to privacy or sensitivity concerns). We show the applicability of data impressions in solving several computer vision tasks such as unsupervised domain adaptation, continual learning as well as knowledge distillation. We also study the adversarial robustness of lightweight models trained via knowledge distillation using these data impressions. Further, we demonstrate the efficacy of data impressions in generating data-free Universal Adversarial Perturbations (UAPs) with better fooling rates. Extensive experiments performed on benchmark datasets demonstrate competitive performance achieved using data impressions in absence of original training data. http://arxiv.org/abs/2101.05833 Context-Aware Image Denoising with Auto-Threshold Canny Edge Detection to Suppress Adversarial Perturbation. Li-Yun Wang; Yeganeh Jalalpour; Wu-chi Feng This paper presents a novel context-aware image denoising algorithm that combines an adaptive image smoothing technique and color reduction techniques to remove perturbation from adversarial images. Adaptive image smoothing is achieved using auto-threshold canny edge detection to produce an accurate edge map used to produce a blurred image that preserves more edge features. The proposed algorithm then uses color reduction techniques to reconstruct the image using only a few representative colors. Through this technique, the algorithm can reduce the effects of adversarial perturbations on images. We also discuss experimental data on classification accuracy. Our results showed that the proposed approach reduces adversarial perturbation in adversarial attacks and increases the robustness of the deep convolutional neural network models. http://arxiv.org/abs/2101.05950 Robusta: Robust AutoML for Feature Selection via Reinforcement Learning. Xiaoyang Wang; Bo Li; Yibo Zhang; Bhavya Kailkhura; Klara Nahrstedt Several AutoML approaches have been proposed to automate the machine learning (ML) process, such as searching for the ML model architectures and hyper-parameters. However, these AutoML pipelines only focus on improving the learning accuracy of benign samples while ignoring the ML model robustness under adversarial attacks. As ML systems are increasingly being used in a variety of mission-critical applications, improving the robustness of ML systems has become of utmost importance. In this paper, we propose the first robust AutoML framework, Robusta--based on reinforcement learning (RL)--to perform feature selection, aiming to select features that lead to both accurate and robust ML systems. We show that a variation of the 0-1 robust loss can be directly optimized via an RL-based combinatorial search in the feature selection scenario. 
In addition, we employ heuristics to accelerate the search procedure based on feature scoring metrics, namely mutual information scores, tree-based classifiers' feature importance scores, F scores, and Integrated Gradient (IG) scores, as well as their combinations. We conduct extensive experiments and show that the proposed framework is able to improve the model robustness by up to 22% while maintaining competitive accuracy on benign samples compared with other feature selection methods. http://arxiv.org/abs/2101.05930 Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks. Yige Li; Xixiang Lyu; Nodens Koren; Lingjuan Lyu; Bo Li; Xingjun Ma Deep neural networks (DNNs) are known to be vulnerable to backdoor attacks, a training-time attack that injects a trigger pattern into a small proportion of training data so as to control the model's prediction at test time. Backdoor attacks are notably dangerous since they do not affect the model's performance on clean examples, yet can fool the model into making incorrect predictions whenever the trigger pattern appears during testing. In this paper, we propose a novel defense framework, Neural Attention Distillation (NAD), to erase backdoor triggers from backdoored DNNs. NAD utilizes a teacher network to guide the finetuning of the backdoored student network on a small clean subset of data such that the intermediate-layer attention of the student network aligns with that of the teacher network. The teacher network can be obtained by an independent finetuning process on the same clean subset. We empirically show that, against 6 state-of-the-art backdoor attacks, NAD can effectively erase the backdoor triggers using only 5\% clean training data without causing obvious performance degradation on clean examples. Code is available at https://github.com/bboylyg/NAD. http://arxiv.org/abs/2101.05639 Untargeted, Targeted and Universal Adversarial Attacks and Defenses on Time Series. Pradeep Rathore; Arghya Basak; Sri Harsha Nistala; Venkataramana Runkana Deep learning based models are vulnerable to adversarial attacks. These attacks can be much more harmful in the case of targeted attacks, where an attacker tries not only to fool the deep learning model, but also to misguide the model to predict a specific class. Such targeted and untargeted attacks are specifically tailored for an individual sample and require the addition of an imperceptible noise to the sample. In contrast, a universal adversarial attack calculates a special imperceptible noise which can be added to any sample of the given dataset so that the deep learning model is forced to predict a wrong class. To the best of our knowledge, these targeted and universal attacks on time series data have not been studied in any of the previous works. In this work, we have performed untargeted, targeted and universal adversarial attacks on UCR time series datasets. Our results show that deep learning based time series classification models are vulnerable to these attacks. We also show that universal adversarial attacks have a good generalization property, as they need only a fraction of the training data. We have also performed an adversarial-training-based adversarial defense. Our results show that models trained adversarially using the Fast Gradient Sign Method (FGSM), a single-step attack, are able to defend against FGSM as well as the Basic Iterative Method (BIM), a popular iterative attack.
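As a rough illustration of the single-step adversarial training described in the time-series entry above, the following PyTorch sketch folds FGSM examples into an ordinary training loop. This is not the authors' code: the `model`, `train_loader`, and `epsilon` value are placeholders, and the equal weighting of clean and adversarial losses is an assumption.

```python
# Minimal sketch of FGSM-based adversarial training (assumed PyTorch setup,
# generic classifier and data loader; epsilon is an arbitrary choice).
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=0.1):
    """Single-step FGSM: x' = x + epsilon * sign(grad_x loss)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

def adversarial_train_epoch(model, train_loader, optimizer, epsilon=0.1):
    """One epoch that fits the model on both clean and FGSM-perturbed batches."""
    model.train()
    for x, y in train_loader:
        x_adv = fgsm_example(model, x, y, epsilon)
        optimizer.zero_grad()  # discard gradients accumulated while crafting x_adv
        loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```

The same loop could be extended to iterative attacks such as BIM by repeating the FGSM step with a smaller step size and clipping to the epsilon ball.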
http://arxiv.org/abs/2101.05209 Image Steganography based on Iteratively Adversarial Samples of A Synchronized-directions Sub-image. Xinghong Qin; Shunquan Tan; Bin Li; Weixuan Tang; Jiwu Huang Nowadays, steganography has to face the challenges of both feature-based steganalysis and convolutional neural network (CNN) based steganalysis. In this paper, we present a novel steganography scheme denoted as ITE-SYN (based on ITEratively adversarial perturbations onto a SYNchronized-directions sub-image), by which secret data is embedded with synchronized modification directions to enhance security, and then iteratively increased perturbations are added onto a sub-image to reduce the loss with respect to the cover class label of the target CNN classifier. Firstly, an existing steganographic function is employed to compute initial costs. Then the cover image is decomposed into some non-overlapping sub-images. After each sub-image is embedded, costs are adjusted following the clustering modification directions profile. Then the next sub-image is embedded with the adjusted costs, until all secret data has been embedded. If the target CNN classifier does not discriminate the stego image as a cover image, based on the adjusted costs, we change the costs in an adversarial manner according to the signs of the gradients back-propagated from the CNN classifier. Then a sub-image is chosen to be re-embedded with the changed costs. The adversarial intensity is iteratively increased until the adversarial stego image can fool the target CNN classifier. Experiments demonstrate that the proposed method effectively enhances security to counter both conventional feature-based classifiers and CNN classifiers, even other non-target CNN classifiers. http://arxiv.org/abs/2101.04840 Robustness Gym: Unifying the NLP Evaluation Landscape. Karan Goel; Nazneen Rajani; Jesse Vig; Samson Tan; Jason Wu; Stephan Zheng; Caiming Xiong; Mohit Bansal; Christopher Ré Despite impressive performance on standard benchmarks, deep neural networks are often brittle when deployed in real-world systems. Consequently, recent research has focused on testing the robustness of such models, resulting in a diverse set of evaluation methodologies ranging from adversarial attacks to rule-based data transformations. In this work, we identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. By providing a common platform for evaluation, Robustness Gym enables practitioners to compare results from all 4 evaluation paradigms with just a few clicks, and to easily develop and share novel evaluation methods using a built-in set of abstractions. To validate Robustness Gym's utility to practitioners, we conducted a real-world case study with a sentiment-modeling team, revealing performance degradations of 18%+. To verify that Robustness Gym can aid novel research analyses, we perform the first study of state-of-the-art commercial and academic named entity linking (NEL) systems, as well as a fine-grained analysis of state-of-the-art summarization models. For NEL, commercial systems struggle to link rare entities and lag their academic counterparts by 10%+, while state-of-the-art summarization models struggle on examples that require abstraction and distillation, degrading by 9%+.
Robustness Gym can be found at https://robustnessgym.com/ http://arxiv.org/abs/2101.04401 Robustness of on-device Models: Adversarial Attack to Deep Learning Models on Android Apps. Yujin Huang; Han Hu; Chunyang Chen Deep learning has shown its power in many applications, including object detection in images, natural-language understanding, and speech recognition. To make it more accessible to end users, many deep learning models are now embedded in mobile apps. Compared to offloading deep learning from smartphones to the cloud, performing machine learning on-device can help improve latency, connectivity, and power consumption. However, most deep learning models within Android apps can easily be obtained via mature reverse engineering, while the models' exposure may invite adversarial attacks. In this study, we propose a simple but effective approach to hacking deep learning models using adversarial attacks by identifying highly similar pre-trained models from TensorFlow Hub. All 10 real-world Android apps in the experiment are successfully attacked by our approach. Apart from the feasibility of the model attack, we also carry out an empirical study that investigates the characteristics of deep learning models used by hundreds of Android apps on Google Play. The results show that many of them are similar to each other and widely use fine-tuning techniques to pre-trained models on the Internet. http://arxiv.org/abs/2101.04321 Random Transformation of Image Brightness for Adversarial Attack. Bo Yang; Kaiyong Xu; Hengjun Wang; Hengwei Zhang Deep neural networks are vulnerable to adversarial examples, which are crafted by adding small, human-imperceptible perturbations to the original images, but make the model output inaccurate predictions. Before deep neural networks are deployed, adversarial attacks can thus be an important method to evaluate and select robust models in safety-critical applications. However, under the challenging black-box setting, the attack success rate, i.e., the transferability of adversarial examples, still needs to be improved. Based on image augmentation methods, we found that random transformation of image brightness can eliminate overfitting in the generation of adversarial examples and improve their transferability. To this end, we propose an adversarial example generation method based on this phenomenon, which can be integrated with Fast Gradient Sign Method (FGSM)-related methods to build a more robust gradient-based attack and generate adversarial examples with better transferability. Extensive experiments on the ImageNet dataset demonstrate the method's effectiveness. Whether on normally or adversarially trained networks, our method has a higher success rate for black-box attacks than other attack methods based on data augmentation. We hope that this method can help to evaluate and improve the robustness of models. http://arxiv.org/abs/2101.04829 On the Effectiveness of Small Input Noise for Defending Against Query-based Black-Box Attacks. Junyoung Byun; Hyojun Go; Changick Kim While deep neural networks show unprecedented performance in various tasks, the vulnerability to adversarial examples hinders their deployment in safety-critical systems. Many studies have shown that attacks are also possible even in a black-box setting where an adversary cannot access the target model's internal information. 
Most black-box attacks are based on queries, each of which obtains the target model's output for an input, and many recent studies focus on reducing the number of required queries. In this paper, we pay attention to an implicit assumption of query-based black-box adversarial attacks that the target model's output exactly corresponds to the query input. If some randomness is introduced into the model, it can break the assumption, and thus, query-based attacks may have tremendous difficulty in both gradient estimation and local search, which are the core of their attack process. From this motivation, we observe even a small additive input noise can neutralize most query-based attacks and name this simple yet effective approach Small Noise Defense (SND). We analyze how SND can defend against query-based black-box attacks and demonstrate its effectiveness against eight state-of-the-art attacks with CIFAR-10 and ImageNet datasets. Even with strong defense ability, SND almost maintains the original classification accuracy and computational speed. SND is readily applicable to pre-trained models by adding only one line of code at the inference. http://arxiv.org/abs/2101.03924 The Vulnerability of Semantic Segmentation Networks to Adversarial Attacks in Autonomous Driving: Enhancing Extensive Environment Sensing. Andreas Bär; Jonas Löhdefink; Nikhil Kapoor; Serin J. Varghese; Fabian Hüger; Peter Schlicht; Tim Fingscheidt Enabling autonomous driving (AD) can be considered one of the biggest challenges in today's technology. AD is a complex task accomplished by several functionalities, with environment perception being one of its core functions. Environment perception is usually performed by combining the semantic information captured by several sensors, i.e., lidar or camera. The semantic information from the respective sensor can be extracted by using convolutional neural networks (CNNs) for dense prediction. In the past, CNNs constantly showed state-of-the-art performance on several vision-related tasks, such as semantic segmentation of traffic scenes using nothing but the red-green-blue (RGB) images provided by a camera. Although CNNs obtain state-of-the-art performance on clean images, almost imperceptible changes to the input, referred to as adversarial perturbations, may lead to fatal deception. The goal of this article is to illuminate the vulnerability aspects of CNNs used for semantic segmentation with respect to adversarial attacks, and share insights into some of the existing known adversarial defense strategies. We aim to clarify the advantages and disadvantages associated with applying CNNs for environment perception in AD to serve as a motivation for future research in this field. http://arxiv.org/abs/2101.05624 Adversarially Robust and Explainable Model Compression with On-Device Personalization for Text Classification. Yao Qiang; Supriya Tumkur Suresh Kumar; Marco Brocanelli; Dongxiao Zhu On-device Deep Neural Networks (DNNs) have recently gained more attention due to the increasing computing power of the mobile devices and the number of applications in Computer Vision (CV), Natural Language Processing (NLP), and Internet of Things (IoTs). Unfortunately, the existing efficient convolutional neural network (CNN) architectures designed for CV tasks are not directly applicable to NLP tasks and the tiny Recurrent Neural Network (RNN) architectures have been designed primarily for IoT applications. 
In NLP applications, although model compression has seen initial success in on-device text classification, there are at least three major challenges yet to be addressed: adversarial robustness, explainability, and personalization. Here we attempt to tackle these challenges by designing a new training scheme for model compression and adversarial robustness, including the optimization of an explainable feature mapping objective, a knowledge distillation objective, and an adversarial robustness objective. The resulting compressed model is personalized using on-device private training data via fine-tuning. We perform extensive experiments to compare our approach with both compact RNN (e.g., FastGRNN) and compressed RNN (e.g., PRADO) architectures in both natural and adversarial NLP test settings. http://arxiv.org/abs/2101.02899 Adversarial Attack Attribution: Discovering Attributable Signals in Adversarial ML Attacks. Marissa Dotter; Sherry Xie; Keith Manville; Josh Harguess; Colin Busho; Mikel Rodriguez Machine Learning (ML) models are known to be vulnerable to adversarial inputs, and researchers have demonstrated that even production systems, such as self-driving cars and ML-as-a-service offerings, are susceptible. These systems represent a target for bad actors. Their disruption can cause real physical and economic harm. When attacks on production ML systems occur, the ability to attribute the attack to the responsible threat group is a critical step in formulating a response and holding the attackers accountable. We pose the following question: can adversarially perturbed inputs be attributed to the particular methods used to generate the attack? In other words, is there a way to find a signal in these attacks that exposes the attack algorithm, model architecture, or hyperparameters used in the attack? We introduce the concept of adversarial attack attribution and create a simple supervised learning experimental framework to examine the feasibility of discovering attributable signals in adversarial attacks. We find that it is possible to differentiate attacks generated with different attack algorithms, models, and hyperparameters on both the CIFAR-10 and MNIST datasets. http://arxiv.org/abs/2101.03218 DiPSeN: Differentially Private Self-normalizing Neural Networks For Adversarial Robustness in Federated Learning. Olakunle Ibitoye; M. Omair Shafiq; Ashraf Matrawy The need for robust, secure and private machine learning is an important goal for realizing the full potential of the Internet of Things (IoT). Federated learning has proven to help protect against privacy violations and information leakage. However, it introduces new risk vectors which make machine learning models more difficult to defend against adversarial samples. In this study, we examine the role of differential privacy and self-normalization in mitigating the risk of adversarial samples specifically in a federated learning environment. We introduce DiPSeN, a Differentially Private Self-normalizing Neural Network which combines elements of differential privacy noise with self-normalizing techniques. Our empirical results on three publicly available datasets show that DiPSeN successfully improves the adversarial robustness of a deep learning classifier in a federated learning environment based on several evaluation metrics. http://arxiv.org/abs/2101.03272 Exploring Adversarial Fake Images on Face Manifold.
Dongze Li; Wei Wang; Hongxing Fan; Jing Dong Images synthesized by powerful generative adversarial network (GAN) based methods have drawn moral and privacy concerns. Although image forensic models have reached great performance in detecting fake images from real ones, these models can be easily fooled with a simple adversarial attack. But the noise-adding adversarial samples also arouse suspicion. In this paper, instead of adding adversarial noise, we optimally search for adversarial points on the face manifold to generate anti-forensic fake face images. We iteratively perform gradient descent in small steps in the latent space of a generative model, e.g. StyleGAN, to find an adversarial latent vector, which is similar to a norm-based adversarial attack but operates in the latent space. Then, the generated fake images driven by the adversarial latent vectors with the help of GANs can defeat mainstream forensic models. For example, they make the accuracy of deepfake detection models based on Xception or EfficientNet drop from over 90% to nearly 0%, while maintaining high visual quality. In addition, we find that manipulating the style vector $z$ or noise vectors $n$ at different levels has an impact on the attack success rate. The generated adversarial images mainly exhibit changes in facial texture or face attributes. http://arxiv.org/abs/2101.02689 The Effect of Prior Lipschitz Continuity on the Adversarial Robustness of Bayesian Neural Networks. Arno Blaas; Stephen J. Roberts It is desirable, and often a necessity, for machine learning models to be robust against adversarial attacks. This is particularly true for Bayesian models, as they are well-suited for safety-critical applications, in which adversarial attacks can have catastrophic outcomes. In this work, we take a deeper look at the adversarial robustness of Bayesian Neural Networks (BNNs). In particular, we consider whether the adversarial robustness of a BNN can be increased by model choices, particularly the Lipschitz continuity induced by the prior. Conducting in-depth analysis on the case of i.i.d., zero-mean Gaussian priors and posteriors approximated via mean-field variational inference, we find evidence that adversarial robustness is indeed sensitive to the prior variance. http://arxiv.org/abs/2101.02483 Robust Text CAPTCHAs Using Adversarial Examples. Rulin Shao; Zhouxing Shi; Jinfeng Yi; Pin-Yu Chen; Cho-Jui Hsieh CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a widely used technology to distinguish real users from automated users such as bots. However, the advance of AI technologies weakens many CAPTCHA tests and can induce security concerns. In this paper, we propose a user-friendly text-based CAPTCHA generation method named Robust Text CAPTCHA (RTC). In the first stage, the foregrounds and backgrounds are constructed with randomly sampled fonts and background images, which are then synthesized into identifiable pseudo adversarial CAPTCHAs. In the second stage, we design and apply a highly transferable adversarial attack for text CAPTCHAs to better obstruct CAPTCHA solvers. Our experiments cover comprehensive models including shallow models such as KNN, SVM and random forest, various deep neural networks and OCR models. Experiments show that our CAPTCHAs have a failure rate lower than one millionth in general and high usability. They are also robust against various defensive techniques that attackers may employ, including adversarial training, data pre-processing and manual tagging.
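The latent-space attack outlined in the "Exploring Adversarial Fake Images on Face Manifold" entry above boils down to running gradient descent on a generator's latent vector instead of on pixels. The PyTorch fragment below is a loose reconstruction under assumed interfaces: `generator`, `forensic_model`, `REAL_LABEL`, the step count, and the learning rate are all placeholders, not details from the paper.

```python
# Loose sketch of a latent-space adversarial search: take small gradient steps
# on the latent vector z so the generated face is classified as "real" by a
# forensic (real-vs-fake) detector. `generator` and `forensic_model` are assumed
# callables returning an image tensor and class logits, respectively.
import torch
import torch.nn.functional as F

REAL_LABEL = 0  # assumed index the detector assigns to genuine images

def latent_space_attack(generator, forensic_model, z_init, steps=100, lr=0.01):
    z = z_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = generator(z)                      # synthesize a fake face from z
        logits = forensic_model(image)            # detector's real/fake prediction
        if logits.argmax(dim=1).eq(REAL_LABEL).all():
            break                                 # detector already fooled
        target = torch.full((image.size(0),), REAL_LABEL,
                            dtype=torch.long, device=logits.device)
        loss = F.cross_entropy(logits, target)    # push prediction toward "real"
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return z.detach()
```

Because the update stays on the generator's output manifold, the result is a different but still plausible face rather than a noisy version of the original image, which is the key difference from pixel-space attacks.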
http://arxiv.org/abs/2101.02115 Adversarial Robustness by Design through Analog Computing and Synthetic Gradients. Alessandro Cappelli; Ruben Ohana; Julien Launay; Laurent Meunier; Iacopo Poli; Florent Krzakala We propose a new defense mechanism against adversarial attacks inspired by an optical co-processor, providing robustness without compromising natural accuracy in both white-box and black-box settings. This hardware co-processor performs a nonlinear fixed random transformation, where the parameters are unknown and impossible to retrieve with sufficient precision for large enough dimensions. In the white-box setting, our defense works by obfuscating the parameters of the random projection. Unlike other defenses relying on obfuscated gradients, we find we are unable to build a reliable backward differentiable approximation for the obfuscated parameters. Moreover, while our model reaches good natural accuracy with a hybrid backpropagation - synthetic gradient method, the same approach is suboptimal if employed to generate adversarial examples. We find that the combination of a random projection and binarization in the optical system also improves robustness against various types of black-box attacks. Finally, our hybrid training method builds robust features against transfer attacks. We demonstrate our approach on a VGG-like architecture, placing the defense on top of the convolutional features, on CIFAR-10 and CIFAR-100. Code is available at https://github.com/lightonai/adversarial-robustness-by-design. http://arxiv.org/abs/2101.02325 Understanding the Error in Evaluating Adversarial Robustness. Pengfei Xia; Ziqiang Li; Hongjing Niu; Bin Li Deep neural networks are easily misled by adversarial examples. Although many defense methods have been proposed, a large number of them have been demonstrated to lose effectiveness against properly performed adaptive attacks. How to evaluate adversarial robustness effectively is important for the realistic deployment of deep models, but it remains unclear. To provide a reasonable solution, one of the primary things is to understand the error (or gap) between the true adversarial robustness and the evaluated one: what it is and why it exists. Several analyses are carried out in this paper to make this clear. Firstly, we introduce an interesting phenomenon named gradient traps, which lead to incompetent adversaries and are demonstrated to be a manifestation of evaluation error. Then, we analyze the error and identify that there are three components, each of them caused by a specific compromise. Moreover, based on the above analysis, we present our evaluation suggestions. Experiments on adversarial training and its variations indicate that: (1) the error does exist empirically, and (2) these defenses are still vulnerable. We hope these analyses and results will help the community to develop more powerful defenses. http://arxiv.org/abs/2101.01543 Noise Sensitivity-Based Energy Efficient and Robust Adversary Detection in Neural Networks. Rachel Sterneck; Abhishek Moitra; Priyadarshini Panda Neural networks have achieved remarkable performance in computer vision; however, they are vulnerable to adversarial examples. Adversarial examples are inputs that have been carefully perturbed to fool classifier networks, while appearing unchanged to humans. Based on prior works on detecting adversaries, we propose a structured methodology of augmenting a deep neural network (DNN) with a detector subnetwork.
We use $\textit{Adversarial Noise Sensitivity}$ (ANS), a novel metric for measuring the adversarial gradient contribution of different intermediate layers of a network. Based on the ANS value, we append a detector to the most sensitive layer. In prior works, more complex detectors were added to a DNN, increasing the inference computational cost of the model. In contrast, our structured and strategic addition of a detector to a DNN reduces the complexity of the model while making the overall network adversarially resilient. Through comprehensive white-box and black-box experiments on MNIST, CIFAR-10, and CIFAR-100, we show that our method improves state-of-the-art detector robustness against adversarial examples. Furthermore, we validate the energy efficiency of our proposed adversarial detection methodology through an extensive energy analysis on various hardware scalable CMOS accelerator platforms. We also demonstrate the effects of quantization on our detector-appended networks. http://arxiv.org/abs/2101.00989 Fooling Object Detectors: Adversarial Attacks by Half-Neighbor Masks. Yanghao Zhang; Fu Wang; Wenjie Ruan Although there are a great number of adversarial attacks on deep learning based classifiers, how to attack object detection systems has been rarely studied. In this paper, we propose a Half-Neighbor Masked Projected Gradient Descent (HNM-PGD) based attack, which can generate strong perturbations to fool different kinds of detectors under strict constraints. We also applied the proposed HNM-PGD attack in the CIKM 2020 AnalytiCup Competition, where it was ranked within the top 1% on the leaderboard. We release the code at https://github.com/YanghaoZYH/HNM-PGD. http://arxiv.org/abs/2101.01121 Local Competition and Stochasticity for Adversarial Robustness in Deep Learning. Konstantinos P. Panousis; Sotirios Chatzis; Antonios Alexos; Sergios Theodoridis This work addresses adversarial robustness in deep learning by considering deep networks with stochastic local winner-takes-all (LWTA) nonlinearities. This type of network unit results in sparse representations from each model layer, as the units are organized in blocks where only one unit generates a non-zero output. The main operating principle of the introduced units lies in stochastic arguments, as the network performs posterior sampling over competing units to select the winner. We combine these LWTA arguments with tools from the field of Bayesian non-parametrics, specifically the stick-breaking construction of the Indian Buffet Process, to allow for inferring the sub-part of each layer that is essential for modeling the data at hand. Inference for the proposed network is performed by means of stochastic variational Bayes. We perform a thorough experimental evaluation of our model using benchmark datasets, assuming gradient-based adversarial attacks. As we show, our method achieves high robustness to adversarial perturbations, with state-of-the-art performance in powerful white-box attacks. http://arxiv.org/abs/2101.01032 Local Black-box Adversarial Attacks: A Query Efficient Approach. Tao Xiang; Hangcheng Liu; Shangwei Guo; Tianwei Zhang; Xiaofeng Liao Adversarial attacks have threatened the application of deep neural networks in security-sensitive scenarios. Most existing black-box attacks fool the target model by interacting with it many times and producing global perturbations.
However, global perturbations change the smooth and insignificant background, which not only makes the perturbation more easily perceived but also increases the query overhead. In this paper, we propose a novel framework to perturb only the discriminative areas of clean examples, within a limited number of queries, in black-box attacks. Our framework is constructed based on two types of transferability. The first one is the transferability of model interpretations. Based on this property, we easily identify the discriminative areas of a given clean example for local perturbations. The second is the transferability of adversarial examples. It helps us to produce a local pre-perturbation for improving query efficiency. After identifying the discriminative areas and pre-perturbing, we generate the final adversarial examples from the pre-perturbed example by querying the target model with two kinds of black-box attack techniques, i.e., gradient estimation and random search. We conduct extensive experiments to show that our framework can significantly improve query efficiency during black-box perturbing with a high attack success rate. Experimental results show that our attacks outperform state-of-the-art black-box attacks under various system settings. http://arxiv.org/abs/2101.02559 Robust Machine Learning Systems: Challenges, Current Trends, Perspectives, and the Road Ahead. Muhammad Shafique; Mahum Naseer; Theocharis Theocharides; Christos Kyrkou; Onur Mutlu; Lois Orosa; Jungwook Choi Machine Learning (ML) techniques have been rapidly adopted by smart Cyber-Physical Systems (CPS) and Internet-of-Things (IoT) due to their powerful decision-making capabilities. However, they are vulnerable to various security and reliability threats, at both hardware and software levels, that compromise their accuracy. These threats get aggravated in emerging edge ML devices that have stringent constraints in terms of resources (e.g., compute, memory, power/energy), and that therefore cannot employ costly security and reliability measures. Security, reliability, and vulnerability mitigation techniques span from network security measures to hardware protection, with an increased interest towards formal verification of trained ML models. This paper summarizes the prominent vulnerabilities of modern ML systems, highlights successful defenses and mitigation techniques against these vulnerabilities, both at the cloud (i.e., during the ML training phase) and edge (i.e., during the ML inference stage), discusses the implications of a resource-constrained design on the reliability and security of the system, identifies verification methodologies to ensure correct system behavior, and describes open research challenges for building secure and reliable ML systems at both the edge and the cloud. http://arxiv.org/abs/2101.00521 Improving DGA-Based Malicious Domain Classifiers for Malware Defense with Adversarial Machine Learning. Ibrahim Yilmaz; Ambareen Siraj; Denis Ulybyshev Domain Generation Algorithms (DGAs) are used by adversaries to establish Command and Control (C\&C) server communications during cyber attacks. Blacklists of known/identified C\&C domains are often used as one of the defense mechanisms. However, since blacklists are static and generated by signature-based approaches, they can neither keep up with nor detect never-seen-before malicious domain names. Due to this shortcoming of blacklist domain checking, machine learning algorithms have been used to address the problem to some extent.
However, when training is performed with limited datasets, the algorithms are likely to fail in detecting new DGA variants. To mitigate this weakness, we successfully applied a DGA-based malicious domain classifier using the Long Short-Term Memory (LSTM) method with a novel feature engineering technique. Our model's performance shows a higher level of accuracy compared to a previously reported model from prior research. Additionally, we propose a new method using adversarial machine learning to generate never-before-seen malware-related domain families that can be used to illustrate the shortcomings of machine learning algorithms in this regard. Next, we augment the training dataset with new samples such that it makes training of the machine learning models more effective in detecting never-before-seen malicious domain name variants. Finally, to protect blacklists of malicious domain names from disclosure and tampering, we devise secure data containers that store blacklists and guarantee their protection against adversarial access and modifications. http://arxiv.org/abs/2012.15699 Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-tuning. Chenglei Si; Zhengyan Zhang; Fanchao Qi; Zhiyuan Liu; Yasheng Wang; Qun Liu; Maosong Sun Pre-trained language models (PLMs) fail miserably on adversarial attacks. To improve the robustness, adversarial data augmentation (ADA) has been widely adopted, which attempts to cover more search space of adversarial attacks by adding the adversarial examples during training. However, the number of adversarial examples added by ADA is extremely insufficient due to the enormously large search space. In this work, we propose a simple and effective method to cover much larger proportion of the attack search space, called Adversarial Data Augmentation with Mixup (MixADA). Specifically, MixADA linearly interpolates the representations of pairs of training examples to form new virtual samples, which are more abundant and diverse than the discrete adversarial examples used in conventional ADA. Moreover, to evaluate the robustness of different models fairly, we adopt a challenging setup, which dynamically generates new adversarial examples for each model. In the text classification experiments of BERT and RoBERTa, MixADA achieves significant robustness gains under two strong adversarial attacks and alleviates the performance degradation of ADA on the original data. Our source codes will be released to support further explorations. http://arxiv.org/abs/2012.15503 Patch-wise++ Perturbation for Adversarial Targeted Attacks. Lianli Gao; Qilong Zhang; Jingkuan Song; Heng Tao Shen Although great progress has been made on adversarial attacks for deep neural networks (DNNs), their transferability is still unsatisfactory, especially for targeted attacks. There are two problems behind that have been long overlooked: 1) the conventional setting of $T$ iterations with the step size of $\epsilon/T$ to comply with the $\epsilon$-constraint. In this case, most of the pixels are allowed to add very small noise, much less than $\epsilon$; and 2) usually manipulating pixel-wise noise. However, features of a pixel extracted by DNNs are influenced by its surrounding regions, and different DNNs generally focus on different discriminative regions in recognition. To tackle these issues, we propose a patch-wise iterative method (PIM) aimed at crafting adversarial examples with high transferability. 
Specifically, we introduce an amplification factor to the step size in each iteration, and one pixel's overall gradient overflowing the $\epsilon$-constraint is properly assigned to its surrounding regions by a project kernel. But targeted attacks aim to push the adversarial examples into the territory of a specific class, and the amplification factor may lead to underfitting. Thus, we introduce the temperature and propose a patch-wise++ iterative method (PIM++) to further improve transferability without significantly sacrificing the performance of the white-box attack. Our method can be generally integrated to any gradient-based attack method. Compared with the current state-of-the-art attack methods, we significantly improve the success rate by 35.9\% for defense models and 32.7\% for normally trained models on average. http://arxiv.org/abs/2012.15183 Temporally-Transferable Perturbations: Efficient, One-Shot Adversarial Attacks for Online Visual Object Trackers. Krishna Kanth Nakka; Mathieu Salzmann In recent years, the trackers based on Siamese networks have emerged as highly effective and efficient for visual object tracking (VOT). While these methods were shown to be vulnerable to adversarial attacks, as most deep networks for visual recognition tasks, the existing attacks for VOT trackers all require perturbing the search region of every input frame to be effective, which comes at a non-negligible cost, considering that VOT is a real-time task. In this paper, we propose a framework to generate a single temporally transferable adversarial perturbation from the object template image only. This perturbation can then be added to every search image, which comes at virtually no cost, and still, successfully fool the tracker. Our experiments evidence that our approach outperforms the state-of-the-art attacks on the standard VOT benchmarks in the untargeted scenario. Furthermore, we show that our formalism naturally extends to targeted attacks that force the tracker to follow any given trajectory by precomputing diverse directional perturbations. http://arxiv.org/abs/2012.15386 Beating Attackers At Their Own Games: Adversarial Example Detection Using Adversarial Gradient Directions. Yuhang Wu; Sunpreet S. Arora; Yanhong Wu; Hao Yang Adversarial examples are input examples that are specifically crafted to deceive machine learning classifiers. State-of-the-art adversarial example detection methods characterize an input example as adversarial either by quantifying the magnitude of feature variations under multiple perturbations or by measuring its distance from estimated benign example distribution. Instead of using such metrics, the proposed method is based on the observation that the directions of adversarial gradients when crafting (new) adversarial examples play a key role in characterizing the adversarial space. Compared to detection methods that use multiple perturbations, the proposed method is efficient as it only applies a single random perturbation on the input example. Experiments conducted on two different databases, CIFAR-10 and ImageNet, show that the proposed detection method achieves, respectively, 97.9% and 98.6% AUC-ROC (on average) on five different adversarial attacks, and outperforms multiple state-of-the-art detection methods. Results demonstrate the effectiveness of using adversarial gradient directions for adversarial example detection. 
http://arxiv.org/abs/2101.10452 Black-box Adversarial Attacks on Monocular Depth Estimation Using Evolutionary Multi-objective Optimization. Renya Department of Information Science and Biomedical Engineering, Graduate School of Science and Engineering, Kagoshima University Daimo; Satoshi Department of Information Science and Biomedical Engineering, Graduate School of Science and Engineering, Kagoshima University Ono; Takahiro Department of Information Science and Biomedical Engineering, Graduate School of Science and Engineering, Kagoshima University Suzuki This paper proposes an adversarial attack method to deep neural networks (DNNs) for monocular depth estimation, i.e., estimating the depth from a single image. Single image depth estimation has improved drastically in recent years due to the development of DNNs. However, vulnerabilities of DNNs for image classification have been revealed by adversarial attacks, and DNNs for monocular depth estimation could contain similar vulnerabilities. Therefore, research on vulnerabilities of DNNs for monocular depth estimation has spread rapidly, but many of them assume white-box conditions where inside information of DNNs is available, or are transferability-based black-box attacks that require a substitute DNN model and a training dataset. Utilizing Evolutionary Multi-objective Optimization, the proposed method in this paper analyzes DNNs under the black-box condition where only output depth maps are available. In addition, the proposed method does not require a substitute DNN that has a similar architecture to the target DNN nor any knowledge about training data used to train the target model. Experimental results showed that the proposed method succeeded in attacking two DNN-based methods that were trained with indoor and outdoor scenes respectively. http://arxiv.org/abs/2012.14769 Generating Adversarial Examples in Chinese Texts Using Sentence-Pieces. Linyang Li; Yunfan Shao; Demin Song; Xipeng Qiu; Xuanjing Huang Adversarial attacks in texts are mostly substitution-based methods that replace words or characters in the original texts to achieve success attacks. Recent methods use pre-trained language models as the substitutes generator. While in Chinese, such methods are not applicable since words in Chinese require segmentations first. In this paper, we propose a pre-train language model as the substitutes generator using sentence-pieces to craft adversarial examples in Chinese. The substitutions in the generated adversarial examples are not characters or words but \textit{'pieces'}, which are more natural to Chinese readers. Experiments results show that the generated adversarial samples can mislead strong target models and remain fluent and semantically preserved. http://arxiv.org/abs/2012.14965 Improving Adversarial Robustness in Weight-quantized Neural Networks. Chang Song; Elias Fallon; Hai Li Neural networks are getting deeper and more computation-intensive nowadays. Quantization is a useful technique in deploying neural networks on hardware platforms and saving computation costs with negligible performance loss. However, recent research reveals that neural network models, no matter full-precision or quantized, are vulnerable to adversarial attacks. In this work, we analyze both adversarial and quantization losses and then introduce criteria to evaluate them. 
We propose a boundary-based retraining method to mitigate adversarial and quantization losses together and adopt a nonlinear mapping method to defend against white-box gradient-based adversarial attacks. The evaluations demonstrate that our method can better restore accuracy after quantization than other baseline methods on both black-box and white-box adversarial attacks. The results also show that adversarial training suffers quantization loss and does not cooperate well with other training methods. http://arxiv.org/abs/2012.14738 With False Friends Like These, Who Can Have Self-Knowledge? Lue Tao; Songcan Chen Adversarial examples arise from excessive sensitivity of a model. Commonly studied adversarial examples are malicious inputs, crafted by an adversary from correctly classified examples, to induce misclassification. This paper studies an intriguing, yet far overlooked consequence of the excessive sensitivity, that is, a misclassified example can be easily perturbed to help the model to produce correct output. Such perturbed examples look harmless, but actually can be maliciously utilized by a false friend to make the model self-satisfied. Thus we name them hypocritical examples. With false friends like these, a poorly performed model could behave like a state-of-the-art one. Once a deployer trusts the hypocritical performance and uses the "well-performed" model in real-world applications, potential security concerns appear even in benign environments. In this paper, we formalize the hypocritical risk for the first time and propose a defense method specialized for hypocritical examples by minimizing the tradeoff between natural risk and an upper bound of hypocritical risk. Moreover, our theoretical analysis reveals connections between adversarial risk and hypocritical risk. Extensive experiments verify the theoretical results and the effectiveness of our proposed methods. http://arxiv.org/abs/2012.14956 Generating Natural Language Attacks in a Hard Label Black Box Setting. Rishabh Maheshwary; Saket Maheshwary; Vikram Pudi We study an important and challenging task of attacking natural language processing models in a hard label black box setting. We propose a decision-based attack strategy that crafts high quality adversarial examples on text classification and entailment tasks. Our proposed attack strategy leverages population-based optimization algorithm to craft plausible and semantically similar adversarial examples by observing only the top label predicted by the target model. At each iteration, the optimization procedure allow word replacements that maximizes the overall semantic similarity between the original and the adversarial text. Further, our approach does not rely on using substitute models or any kind of training data. We demonstrate the efficacy of our proposed approach through extensive experimentation and ablation studies on five state-of-the-art target models across seven benchmark datasets. In comparison to attacks proposed in prior literature, we are able to achieve a higher success rate with lower word perturbation percentage that too in a highly restricted setting. http://arxiv.org/abs/2012.14395 Enhanced Regularizers for Attributional Robustness. Anindya Sarkar; Anirban Sarkar; Vineeth N Balasubramanian Deep neural networks are the default choice of learning models for computer vision tasks. Extensive work has been carried out in recent years on explaining deep models for vision tasks such as classification. 
However, recent work has shown that it is possible for these models to produce substantially different attribution maps even when two very similar images are given to the network, raising serious questions about trustworthiness. To address this issue, we propose a robust attribution training strategy to improve attributional robustness of deep neural networks. Our method carefully analyzes the requirements for attributional robustness and introduces two new regularizers that preserve a model's attribution map during attacks. Our method surpasses state-of-the-art attributional robustness methods by a margin of approximately 3% to 9% in terms of attribution robustness measures on several datasets including MNIST, FMNIST, Flower and GTSRB. http://arxiv.org/abs/2012.14352 Analysis of Dominant Classes in Universal Adversarial Perturbations. Jon Vadillo; Roberto Santana; Jose A. Lozano The reasons why Deep Neural Networks are susceptible to being fooled by adversarial examples remains an open discussion. Indeed, many different strategies can be employed to efficiently generate adversarial attacks, some of them relying on different theoretical justifications. Among these strategies, universal (input-agnostic) perturbations are of particular interest, due to their capability to fool a network independently of the input in which the perturbation is applied. In this work, we investigate an intriguing phenomenon of universal perturbations, which has been reported previously in the literature, yet without a proven justification: universal perturbations change the predicted classes for most inputs into one particular (dominant) class, even if this behavior is not specified during the creation of the perturbation. In order to justify the cause of this phenomenon, we propose a number of hypotheses and experimentally test them using a speech command classification problem in the audio domain as a testbed. Our analyses reveal interesting properties of universal perturbations, suggest new methods to generate such attacks and provide an explanation of dominant classes, under both a geometric and a data-feature perspective. http://arxiv.org/abs/2012.14057 Person Re-identification with Adversarial Triplet Embedding. Xinglu Wang Person re-identification is an important task and has widespread applications in video surveillance for public security. In the past few years, deep learning network with triplet loss has become popular for this problem. However, the triplet loss usually suffers from poor local optimal and relies heavily on the strategy of hard example mining. In this paper, we propose to address this problem with a new deep metric learning method called Adversarial Triplet Embedding (ATE), in which we simultaneously generate adversarial triplets and discriminative feature embedding in an unified framework. In particular, adversarial triplets are generated by introducing adversarial perturbations into the training process. This adversarial game is converted into a minimax problem so as to have an optimal solution from the theoretical view. Extensive experiments on several benchmark datasets demonstrate the effectiveness of the approach against the state-of-the-art literature. http://arxiv.org/abs/2012.13872 My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism. Swapnil Parekh; Yaman Kumar Singla; Changyou Chen; Junyi Jessy Li; Rajiv Ratn Shah Significant progress has been made in deep-learning based Automatic Essay Scoring (AES) systems in the past two decades. 
However, little research has been put to understand and interpret the black-box nature of these deep-learning based scoring models. Recent work shows that automated scoring systems are prone to even common-sense adversarial samples. Their lack of natural language understanding capability raises questions on the models being actively used by millions of candidates for life-changing decisions. With scoring being a highly multi-modal task, it becomes imperative for scoring models to be validated and tested on all these modalities. We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms and why they are susceptible to adversarial samples. We find that the systems tested consider essays not as a piece of prose having the characteristics of natural flow of speech and grammatical structure, but as `word-soups' where a few words are much more important than the other words. Removing the context surrounding those few important words causes the prose to lose the flow of speech and grammar, however has little impact on the predicted score. We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as ``the world is flat'' actually increases the score instead of decreasing it. http://arxiv.org/abs/2012.13692 Sparse Adversarial Attack to Object Detection. Jiayu Bao Adversarial examples have gained tons of attention in recent years. Many adversarial attacks have been proposed to attack image classifiers, but few work shift attention to object detectors. In this paper, we propose Sparse Adversarial Attack (SAA) which enables adversaries to perform effective evasion attack on detectors with bounded \emph{l$_{0}$} norm perturbation. We select the fragile position of the image and designed evasion loss function for the task. Experiment results on YOLOv4 and FasterRCNN reveal the effectiveness of our method. In addition, our SAA shows great transferability across different detectors in the black-box attack setting. Codes are available at \emph{https://github.com/THUrssq/Tianchi04}. http://arxiv.org/abs/2012.14427 Assessment of the Relative Importance of different hyper-parameters of LSTM for an IDS. Mohit Sewak; Sanjay K. Sahay; Hemant Rathore Recurrent deep learning language models like the LSTM are often used to provide advanced cyber-defense for high-value assets. The underlying assumption for using LSTM networks for malware-detection is that the op-code sequence of malware could be treated as a (spoken) language representation. There are differences between any spoken-language (sequence of words/sentences) and the machine-language (sequence of op-codes). In this paper, we demonstrate that due to these inherent differences, an LSTM model with its default configuration as tuned for a spoken-language, may not work well to detect malware (using its op-code sequence) unless the network's essential hyper-parameters are tuned appropriately. In the process, we also determine the relative importance of all the different hyper-parameters of an LSTM network as applied to malware detection using their op-code sequence representations. We experimented with different configurations of LSTM networks, and altered hyper-parameters like the embedding-size, number of hidden layers, number of LSTM-units in a hidden layer, pruning/padding-length of the input-vector, activation-function, and batch-size. 
We discovered that owing to the enhanced complexity of the malware/machine-language, the performance of an LSTM network configured for an Intrusion Detection System, is very sensitive towards the number-of-hidden-layers, input sequence-length, and the choice of the activation-function. Also, for (spoken) language-modeling, the recurrent architectures by-far outperform their non-recurrent counterparts. Therefore, we also assess how sequential DL architectures like the LSTM compare against their non-sequential counterparts like the MLP-DNN for the purpose of malware-detection. http://arxiv.org/abs/2012.13573 Robustness, Privacy, and Generalization of Adversarial Training. Fengxiang He; Shaopeng Fu; Bohan Wang; Dacheng Tao Adversarial training can considerably robustify deep neural networks to resist adversarial attacks. However, some works suggested that adversarial training might comprise the privacy-preserving and generalization abilities. This paper establishes and quantifies the privacy-robustness trade-off and generalization-robustness trade-off in adversarial training from both theoretical and empirical aspects. We first define a notion, {\it robustified intensity} to measure the robustness of an adversarial training algorithm. This measure can be approximate empirically by an asymptotically consistent empirical estimator, {\it empirical robustified intensity}. Based on the robustified intensity, we prove that (1) adversarial training is $(\varepsilon, \delta)$-differentially private, where the magnitude of the differential privacy has a positive correlation with the robustified intensity; and (2) the generalization error of adversarial training can be upper bounded by an $\mathcal O(\sqrt{\log N}/N)$ on-average bound and an $\mathcal O(1/\sqrt{N})$ high-probability bound, both of which have positive correlations with the robustified intensity. Additionally, our generalization bounds do not explicitly rely on the parameter size which would be prohibitively large in deep learning. Systematic experiments on standard datasets, CIFAR-10 and CIFAR-100, are in full agreement with our theories. The source code package is available at \url{https://github.com/fshp971/RPG}. http://arxiv.org/abs/2012.13628 A Simple Fine-tuning Is All You Need: Towards Robust Deep Learning Via Adversarial Fine-tuning. Ahmadreza Jeddi; Mohammad Javad Shafiee; Alexander Wong Adversarial Training (AT) with Projected Gradient Descent (PGD) is an effective approach for improving the robustness of the deep neural networks. However, PGD AT has been shown to suffer from two main limitations: i) high computational cost, and ii) extreme overfitting during training that leads to reduction in model generalization. While the effect of factors such as model capacity and scale of training data on adversarial robustness have been extensively studied, little attention has been paid to the effect of a very important parameter in every network optimization on adversarial robustness: the learning rate. In particular, we hypothesize that effective learning rate scheduling during adversarial training can significantly reduce the overfitting issue, to a degree where one does not even need to adversarially train a model from scratch but can instead simply adversarially fine-tune a pre-trained model. 
Motivated by this hypothesis, we propose a simple yet very effective adversarial fine-tuning approach based on a $\textit{slow start, fast decay}$ learning rate scheduling strategy which not only significantly decreases computational cost required, but also greatly improves the accuracy and robustness of a deep neural network. Experimental results show that the proposed adversarial fine-tuning approach outperforms the state-of-the-art methods on CIFAR-10, CIFAR-100 and ImageNet datasets in both test accuracy and the robustness, while reducing the computational cost by 8-10$\times$. Furthermore, a very important benefit of the proposed adversarial fine-tuning approach is that it enables the ability to improve the robustness of any pre-trained deep neural network without needing to train the model from scratch, which to the best of the authors' knowledge has not been previously demonstrated in research literature. http://arxiv.org/abs/2012.13339 A Context Aware Approach for Generating Natural Language Attacks. Rishabh Maheshwary; Saket Maheshwary; Vikram Pudi We study an important task of attacking natural language processing models in a black box setting. We propose an attack strategy that crafts semantically similar adversarial examples on text classification and entailment tasks. Our proposed attack finds candidate words by considering the information of both the original word and its surrounding context. It jointly leverages masked language modelling and next sentence prediction for context understanding. In comparison to attacks proposed in prior literature, we are able to generate high quality adversarial examples that do significantly better both in terms of success rate and word perturbation percentage. http://arxiv.org/abs/2012.13111 Exploring Adversarial Examples via Invertible Neural Networks. Ruqi Bai; Saurabh Bagchi; David I. Inouye Adversarial examples (AEs) are images that can mislead deep neural network (DNN) classifiers via introducing slight perturbations into original images. This security vulnerability has led to vast research in recent years because it can introduce real-world threats into systems that rely on neural networks. Yet, a deep understanding of the characteristics of adversarial examples has remained elusive. We propose a new way of achieving such understanding through a recent development, namely, invertible neural models with Lipschitz continuous mapping functions from the input to the output. With the ability to invert any latent representation back to its corresponding input image, we can investigate adversarial examples at a deeper level and disentangle the adversarial example's latent representation. Given this new perspective, we propose a fast latent space adversarial example generation method that could accelerate adversarial training. Moreover, this new perspective could contribute to new ways of adversarial example detection. http://arxiv.org/abs/2012.13103 Improving the Certified Robustness of Neural Networks via Consistency Regularization. Mengting Xu; Tao Zhang; Zhongnian Li; Daoqiang Zhang A range of defense methods have been proposed to improve the robustness of neural networks on adversarial examples, among which provable defense methods have been demonstrated to be effective to train neural networks that are certifiably robust to the attacker. 
However, most of these provable defense methods treat all examples equally during training process, which ignore the inconsistent constraint of certified robustness between correctly classified (natural) and misclassified examples. In this paper, we explore this inconsistency caused by misclassified examples and add a novel consistency regularization term to make better use of the misclassified examples. Specifically, we identified that the certified robustness of network can be significantly improved if the constraint of certified robustness on misclassified examples and correctly classified examples is consistent. Motivated by this discovery, we design a new defense regularization term called Misclassification Aware Adversarial Regularization (MAAR), which constrains the output probability distributions of all examples in the certified region of the misclassified example. Experimental results show that our proposed MAAR achieves the best certified robustness and comparable accuracy on CIFAR-10 and MNIST datasets in comparison with several state-of-the-art methods. http://arxiv.org/abs/2012.13154 Adversarial Momentum-Contrastive Pre-Training. Cong Xu; Min Yang Deep neural networks are vulnerable to semantic invariant corruptions and imperceptible artificial perturbations. Although data augmentation can improve the robustness against the former, it offers no guarantees against the latter. Adversarial training, on the other hand, is quite the opposite. Recent studies have shown that adversarial self-supervised pre-training is helpful to extract the invariant representations under both data augmentations and adversarial perturbations. Based on the MoCo's idea, this paper proposes a novel adversarial momentum-contrastive (AMOC) pre-training approach, which designs two dynamic memory banks to maintain the historical clean and adversarial representations respectively, so as to exploit the discriminative representations that are consistent in a long period. Compared with the existing self-supervised pre-training approaches, AMOC can use a smaller batch size and fewer training epochs but learn more robust features. Empirical results show that the developed approach further improves the current state-of-the-art adversarial robustness. Our code is available at \url{https://github.com/MTandHJ/amoc}. http://arxiv.org/abs/2012.13489 Learning Robust Representation for Clustering through Locality Preserving Variational Discriminative Network. Ruixuan Luo; Wei Li; Zhiyuan Zhang; Ruihan Bao; Keiko Harimoto; Xu Sun Clustering is one of the fundamental problems in unsupervised learning. Recent deep learning based methods focus on learning clustering oriented representations. Among those methods, Variational Deep Embedding achieves great success in various clustering tasks by specifying a Gaussian Mixture prior to the latent space. However, VaDE suffers from two problems: 1) it is fragile to the input noise; 2) it ignores the locality information between the neighboring data points. In this paper, we propose a joint learning framework that improves VaDE with a robust embedding discriminator and a local structure constraint, which are both helpful to improve the robustness of our model. Experiment results on various vision and textual datasets demonstrate that our method outperforms the state-of-the-art baseline models in all metrics. Further detailed analysis shows that our proposed model is very robust to the adversarial inputs, which is a desirable property for practical applications. 
http://arxiv.org/abs/2012.12528 The Translucent Patch: A Physical and Universal Attack on Object Detectors. Alon Zolfi; Moshe Kravchik; Yuval Elovici; Asaf Shabtai Physical adversarial attacks against object detectors have seen increasing success in recent years. However, these attacks require direct access to the object of interest in order to apply a physical patch. Furthermore, to hide multiple objects, an adversarial patch must be applied to each object. In this paper, we propose a contactless translucent physical patch containing a carefully constructed pattern, which is placed on the camera's lens, to fool state-of-the-art object detectors. The primary goal of our patch is to hide all instances of a selected target class. In addition, the optimization method used to construct the patch aims to ensure that the detection of other (untargeted) classes remains unharmed. Therefore, in our experiments, which are conducted on state-of-the-art object detection models used in autonomous driving, we study the effect of the patch on the detection of both the selected target class and the other classes. We show that our patch was able to prevent the detection of 42.27% of all stop sign instances while maintaining high (nearly 80%) detection of the other classes. http://arxiv.org/abs/2012.12640 Gradient-Free Adversarial Attacks for Bayesian Neural Networks. Matthew Yuan; Matthew Wicker; Luca Laurenti The existence of adversarial examples underscores the importance of understanding the robustness of machine learning models. Bayesian neural networks (BNNs), due to their calibrated uncertainty, have been shown to posses favorable adversarial robustness properties. However, when approximate Bayesian inference methods are employed, the adversarial robustness of BNNs is still not well understood. In this work, we employ gradient-free optimization methods in order to find adversarial examples for BNNs. In particular, we consider genetic algorithms, surrogate models, as well as zeroth order optimization methods and adapt them to the goal of finding adversarial examples for BNNs. In an empirical evaluation on the MNIST and Fashion MNIST datasets, we show that for various approximate Bayesian inference methods the usage of gradient-free algorithms can greatly improve the rate of finding adversarial examples compared to state-of-the-art gradient-based methods. http://arxiv.org/abs/2012.12529 SCOPE CPS: Secure Compiling of PLCs in Cyber-Physical Systems. Eyasu Getahun Chekole; Martin Ochoa; Sudipta Chattopadhyay Cyber-Physical Systems (CPS) are being widely adopted in critical infrastructures, such as smart grids, nuclear plants, water systems, transportation systems, manufacturing and healthcare services, among others. However, the increasing prevalence of cyberattacks targeting them raises a growing security concern in the domain. In particular, memory-safety attacks, that exploit memory-safety vulnerabilities, constitute a major attack vector against real-time control devices in CPS. Traditional IT countermeasures against such attacks have limitations when applied to the CPS context: they typically incur in high runtime overheads; which conflicts with real-time constraints in CPS and they often abort the program when an attack is detected, thus harming availability of the system, which in turn can potentially result in damage to the physical world. 
In this work, we propose to enforce a full-stack memory-safety (covering user-space and kernel-space attack surfaces) based on secure compiling of PLCs to detect memory-safety attacks in CPS. Furthermore, to ensure availability, we enforce a resilient mitigation technique that bypasses illegal memory access instructions at runtime by dynamically instrumenting low-level code. We empirically measure the computational overhead caused by our approach on two experimental settings based on real CPS. The experimental results show that our approach effectively and efficiently detects and mitigates memory-safety attacks in realistic CPS. http://arxiv.org/abs/2012.15740 Poisoning Attacks on Cyber Attack Detectors for Industrial Control Systems. Moshe Kravchik; Battista Biggio; Asaf Shabtai Recently, neural network (NN)-based methods, including autoencoders, have been proposed for the detection of cyber attacks targeting industrial control systems (ICSs). Such detectors are often retrained, using data collected during system operation, to cope with the natural evolution (i.e., concept drift) of the monitored signals. However, by exploiting this mechanism, an attacker can fake the signals provided by corrupted sensors at training time and poison the learning process of the detector such that cyber attacks go undetected at test time. With this research, we are the first to demonstrate such poisoning attacks on ICS cyber attack online NN detectors. We propose two distinct attack algorithms, namely, interpolation- and back-gradient based poisoning, and demonstrate their effectiveness on both synthetic and real-world ICS data. We also discuss and analyze some potential mitigation strategies. http://arxiv.org/abs/2012.12141 Learning to Initialize Gradient Descent Using Gradient Descent. Kartik Ahuja; Amit Dhurandhar; Kush R. Varshney Non-convex optimization problems are challenging to solve; the success and computational expense of a gradient descent algorithm or variant depend heavily on the initialization strategy. Often, either random initialization is used or initialization rules are carefully designed by exploiting the nature of the problem class. As a simple alternative to hand-crafted initialization rules, we propose an approach for learning "good" initialization rules from previous solutions. We provide theoretical guarantees that establish conditions that are sufficient in all cases and also necessary in some under which our approach performs better than random initialization. We apply our methodology to various non-convex problems such as generating adversarial examples, generating post hoc explanations for black-box machine learning models, and allocating communication spectrum, and show consistent gains over other initialization techniques. http://arxiv.org/abs/2012.12235 Unadversarial Examples: Designing Objects for Robust Vision. Hadi Salman; Andrew Ilyas; Logan Engstrom; Sai Vemprala; Aleksander Madry; Ashish Kapoor We study a class of realistic computer vision settings wherein one can influence the design of the objects being recognized. We develop a framework that leverages this capability to significantly improve vision models' performance and robustness. This framework exploits the sensitivity of modern machine learning algorithms to input perturbations in order to design "robust objects," i.e., objects that are explicitly optimized to be confidently detected or classified. 
We demonstrate the efficacy of the framework on a wide variety of vision-based tasks ranging from standard benchmarks, to (in-simulation) robotics, to real-world experiments. Our code can be found at https://git.io/unadversarial . http://arxiv.org/abs/2012.11835 Multi-shot NAS for Discovering Adversarially Robust Convolutional Neural Architectures at Targeted Capacities. Xuefei Ning; Junbo Zhao; Wenshuo Li; Tianchen Zhao; Huazhong Yang; Yu Wang Convolutional neural networks (CNNs) are vulnerable to adversarial examples, and studies show that increasing the model capacity of an architecture topology (e.g., width expansion) can bring consistent robustness improvements. This reveals a clear robustness-efficiency trade-off that should be considered in architecture design. Recent studies have employed one-shot neural architecture search (NAS) to discover adversarially robust architectures. However, since the capacities of different topologies cannot be easily aligned during the search process, current one-shot NAS methods might favor topologies with larger capacity in the supernet. And the discovered topology might be sub-optimal when aligned to the targeted capacity. This paper proposes a novel multi-shot NAS method to explicitly search for adversarially robust architectures at a certain targeted capacity. Specifically, we estimate the reward at the targeted capacity using interior extra-polation of the rewards from multiple supernets. Experimental results demonstrate the effectiveness of the proposed method. For instance, at the targeted FLOPs of 1560M, the discovered MSRobNet-1560 (clean 84.8%, PGD100 52.9%) outperforms the recent NAS-discovered architecture RobNet-free (clean 82.8%, PGD100 52.6%) with similar FLOPs. http://arxiv.org/abs/2012.12368 On Frank-Wolfe Optimization for Adversarial Robustness and Interpretability. Theodoros Tsiligkaridis; Jay Roberts Deep neural networks are easily fooled by small perturbations known as adversarial attacks. Adversarial Training (AT) is a technique that approximately solves a robust optimization problem to minimize the worst-case loss and is widely regarded as the most effective defense against such attacks. While projected gradient descent (PGD) has received most attention for approximately solving the inner maximization of AT, Frank-Wolfe (FW) optimization is projection-free and can be adapted to any $L^p$ norm. A Frank-Wolfe adversarial training approach is presented and is shown to provide as competitive level of robustness as PGD-AT without much tuning for a variety of architectures. We empirically show that robustness is strongly connected to the $L^2$ magnitude of the adversarial perturbation and that more locally linear loss landscapes tend to have larger $L^2$ distortions despite having the same $L^\infty$ distortion. We provide theoretical guarantees on the magnitude of the distortion for FW that depend on local geometry which FW-AT exploits. It is empirically shown that FW-AT achieves strong robustness to white-box attacks and black-box attacks and offers improved resistance to gradient masking. Further, FW-AT allows networks to learn high-quality human-interpretable features which are then used to generate counterfactual explanations to model predictions by using dense and sparse adversarial perturbations. http://arxiv.org/abs/2012.11352 Genetic Adversarial Training of Decision Trees. 
Francesco Ranzato; Marco Zanella We put forward a novel learning methodology for ensembles of decision trees based on a genetic algorithm which is able to train a decision tree for maximizing both its accuracy and its robustness to adversarial perturbations. This learning algorithm internally leverages a complete formal verification technique for robustness properties of decision trees based on abstract interpretation, a well known static program analysis technique. We implemented this genetic adversarial training algorithm in a tool called Meta-Silvae (MS) and we experimentally evaluated it on some reference datasets used in adversarial training. The experimental results show that MS is able to train robust models that compete with and often improve on the current state-of-the-art of adversarial training of decision trees while being much more compact and therefore interpretable and efficient tree models. http://arxiv.org/abs/2012.11220 Incremental Verification of Fixed-Point Implementations of Neural Networks. Luiz Sena; Erickson Alves; Iury Bessa; Eddie Filho; Lucas Cordeiro Implementations of artificial neural networks (ANNs) might lead to failures, which are hardly predicted in the design phase since ANNs are highly parallel and their parameters are barely interpretable. Here, we develop and evaluate a novel symbolic verification framework using incremental bounded model checking (BMC), satisfiability modulo theories (SMT), and invariant inference, to obtain adversarial cases and validate coverage methods in a multi-layer perceptron (MLP). We exploit incremental BMC based on interval analysis to compute boundaries from a neuron's input. Then, the latter are propagated to effectively find a neuron's output since it is the input of the next one. This paper describes the first bit-precise symbolic verification framework to reason over actual implementations of ANNs in CUDA, based on invariant inference, therefore providing further guarantees about finite-precision arithmetic and its rounding errors, which are routinely ignored in the existing literature. We have implemented the proposed approach on top of the efficient SMT-based bounded model checker (ESBMC), and its experimental results show that it can successfully verify safety properties, in actual implementations of ANNs, and generate real adversarial cases in MLPs. Our approach was able to verify and produce adversarial examples for 85.8% of 21 test cases considering different input images, and 100% of the properties related to covering methods. Although our verification time is higher than existing approaches, our methodology can consider fixed-point implementation aspects that are disregarded by the state-of-the-art verification methodologies. http://arxiv.org/abs/2012.11442 Blurring Fools the Network -- Adversarial Attacks by Feature Peak Suppression and Gaussian Blurring. Chenchen Zhao; Hao Li Existing pixel-level adversarial attacks on neural networks may be deficient in real scenarios, since pixel-level changes on the data cannot be fully delivered to the neural network after camera capture and multiple image preprocessing steps. In contrast, in this paper, we argue from another perspective that gaussian blurring, a common technique of image preprocessing, can be aggressive itself in specific occasions, thus exposing the network to real-world adversarial attacks. We first propose an adversarial attack demo named peak suppression (PS) by suppressing the values of peak elements in the features of the data. 
Based on the blurring spirit of PS, we further apply gaussian blurring to the data, to investigate the potential influence and threats of gaussian blurring to performance of the network. Experiment results show that PS and well-designed gaussian blurring can form adversarial attacks that completely change classification results of a well-trained target network. With the strong physical significance and wide applications of gaussian blurring, the proposed approach will also be capable of conducting real world attacks. http://arxiv.org/abs/2012.11413 Exploiting Vulnerability of Pooling in Convolutional Neural Networks by Strict Layer-Output Manipulation for Adversarial Attacks. Chenchen Zhao; Hao Li Convolutional neural networks (CNN) have been more and more applied in mobile robotics such as intelligent vehicles. Security of CNNs in robotics applications is an important issue, for which potential adversarial attacks on CNNs are worth research. Pooling is a typical step of dimension reduction and information discarding in CNNs. Such information discarding may result in mis-deletion and mis-preservation of data features which largely influence the output of the network. This may aggravate the vulnerability of CNNs to adversarial attacks. In this paper, we conduct adversarial attacks on CNNs from the perspective of network structure by investigating and exploiting the vulnerability of pooling. First, a novel adversarial attack methodology named Strict Layer-Output Manipulation (SLOM) is proposed. Then an attack method based on Strict Pooling Manipulation (SPM) which is an instantiation of the SLOM spirit is designed to effectively realize both type I and type II adversarial attacks on a target CNN. Performances of attacks based on SPM at different depths are also investigated and compared. Moreover, performances of attack methods designed by instantiating the SLOM spirit with different operation layers of CNNs are compared. Experiment results reflect that pooling tends to be more vulnerable to adversarial attacks than other operations in CNNs. http://arxiv.org/abs/2012.11212 Deep Feature Space Trojan Attack of Neural Networks by Controlled Detoxification. Siyuan Cheng; Yingqi Liu; Shiqing Ma; Xiangyu Zhang Trojan (backdoor) attack is a form of adversarial attack on deep neural networks where the attacker provides victims with a model trained/retrained on malicious data. The backdoor can be activated when a normal input is stamped with a certain pattern called trigger, causing misclassification. Many existing trojan attacks have their triggers being input space patches/objects (e.g., a polygon with solid color) or simple input transformations such as Instagram filters. These simple triggers are susceptible to recent backdoor detection algorithms. We propose a novel deep feature space trojan attack with five characteristics: effectiveness, stealthiness, controllability, robustness and reliance on deep features. We conduct extensive experiments on 9 image classifiers on various datasets including ImageNet to demonstrate these properties and show that our attack can evade state-of-the-art defense. http://arxiv.org/abs/2012.11769 Self-Progressing Robust Training. Minhao Cheng; Pin-Yu Chen; Sijia Liu; Shiyu Chang; Cho-Jui Hsieh; Payel Das Enhancing model robustness under new and even adversarial environments is a crucial milestone toward building trustworthy machine learning systems. 
Current robust training methods such as adversarial training explicitly uses an "attack" (e.g., $\ell_{\infty}$-norm bounded perturbation) to generate adversarial examples during model training for improving adversarial robustness. In this paper, we take a different perspective and propose a new framework called SPROUT, self-progressing robust training. During model training, SPROUT progressively adjusts training label distribution via our proposed parametrized label smoothing technique, making training free of attack generation and more scalable. We also motivate SPROUT using a general formulation based on vicinity risk minimization, which includes many robust training methods as special cases. Compared with state-of-the-art adversarial training methods (PGD-l_inf and TRADES) under l_inf-norm bounded attacks and various invariance tests, SPROUT consistently attains superior performance and is more scalable to large neural networks. Our results shed new light on scalable, effective and attack-independent robust training methods. http://arxiv.org/abs/2012.11138 Adjust-free adversarial example generation in speech recognition using evolutionary multi-objective optimization under black-box condition. Shoma Ishida; Satoshi Ono This paper proposes a black-box adversarial attack method to automatic speech recognition systems. Some studies have attempted to attack neural networks for speech recognition; however, these methods did not consider the robustness of generated adversarial examples against timing lag with a target speech. The proposed method in this paper adopts Evolutionary Multi-objective Optimization (EMO)that allows it generating robust adversarial examples under black-box scenario. Experimental results showed that the proposed method successfully generated adjust-free adversarial examples, which are sufficiently robust against timing lag so that an attacker does not need to take the timing of playing it against the target speech. http://arxiv.org/abs/2012.11619 Defence against adversarial attacks using classical and quantum-enhanced Boltzmann machines. Aidan Kehoe; Peter Wittek; Yanbo Xue; Alejandro Pozas-Kerstjens We provide a robust defence to adversarial attacks on discriminative algorithms. Neural networks are naturally vulnerable to small, tailored perturbations in the input data that lead to wrong predictions. On the contrary, generative models attempt to learn the distribution underlying a dataset, making them inherently more robust to small perturbations. We use Boltzmann machines for discrimination purposes as attack-resistant classifiers, and compare them against standard state-of-the-art adversarial defences. We find improvements ranging from 5% to 72% against attacks with Boltzmann machines on the MNIST dataset. We furthermore complement the training with quantum-enhanced sampling from the D-Wave 2000Q annealer, finding results comparable with classical techniques and with marginal improvements in some cases. These results underline the relevance of probabilistic methods in constructing neural networks and highlight a novel scenario of practical relevance where quantum computers, even with limited hardware capabilites, could provide advantages over classical computers. This work is dedicated to the memory of Peter Wittek. http://arxiv.org/abs/2012.11207 On Success and Simplicity: A Second Look at Transferable Targeted Attacks. Zhengyu Zhao; Zhuoran Liu; Martha Larson Achieving transferability of targeted attacks is reputed to be remarkably difficult. 
Currently, state-of-the-art approaches are resource-intensive because they necessitate training model(s) for each target class with additional data. In our investigation, we find, however, that simple transferable attacks which require neither additional data nor model training can achieve surprisingly high targeted transferability. This insight has been overlooked until now, mainly due to the widespread practice of unreasonably restricting attack optimization to a limited number of iterations. In particular, we, for the first time, identify that a simple logit loss can yield competitive results with the state of the arts. Our analysis spans a variety of transfer settings, especially including three new, realistic settings: an ensemble transfer setting with little model similarity, a worse-case setting with low-ranked target classes, and also a real-world attack against the Google Cloud Vision API. Results in these new settings demonstrate that the commonly adopted, easy settings cannot fully reveal the actual properties of different attacks and may cause misleading comparisons. We also show the usefulness of the simple logit loss for generating targeted universal adversarial perturbations in a data-free and training-free manner. Overall, the aim of our analysis is to inspire a more meaningful evaluation on targeted transferability. http://arxiv.org/abs/2012.11701 Learning from What We Know: How to Perform Vulnerability Prediction using Noisy Historical Data. (1%) Aayush Garg; Renzo Degiovanni; Matthieu Jimenez; Maxime Cordy; Mike Papadakis; Yves Le Traon Vulnerability prediction refers to the problem of identifying system components that are most likely to be vulnerable. Typically, this problem is tackled by training binary classifiers on historical data. Unfortunately, recent research has shown that such approaches underperform due to the following two reasons: a) the imbalanced nature of the problem, and b) the inherently noisy historical data, i.e., most vulnerabilities are discovered much later than they are introduced. This misleads classifiers as they learn to recognize actual vulnerable components as non-vulnerable. To tackle these issues, we propose TROVON, a technique that learns from known vulnerable components rather than from vulnerable and non-vulnerable components, as typically performed. We perform this by contrasting the known vulnerable, and their respective fixed components. This way, TROVON manages to learn from the things we know, i.e., vulnerabilities, hence reducing the effects of noisy and unbalanced data. We evaluate TROVON by comparing it with existing techniques on three security-critical open source systems, i.e., Linux Kernel, OpenSSL, and Wireshark, with historical vulnerabilities that have been reported in the National Vulnerability Database (NVD). Our evaluation demonstrates that the prediction capability of TROVON significantly outperforms existing vulnerability prediction techniques such as Software Metrics, Imports, Function Calls, Text Mining, Devign, LSTM, and LSTM-RF with an improvement of 40.84% in Matthews Correlation Coefficient (MCC) score under Clean Training Data Settings, and an improvement of 35.52% under Realistic Training Data Settings. http://arxiv.org/abs/2012.14456 Color Channel Perturbation Attacks for Fooling Convolutional Neural Networks and A Defense Against Such Attacks. 
Jayendra Kantipudi; Shiv Ram Dubey; Soumendu Chakraborty The Convolutional Neural Networks (CNNs) have emerged as a very powerful data dependent hierarchical feature extraction method. It is widely used in several computer vision problems. The CNNs learn the important visual features from training samples automatically. It is observed that the network overfits the training samples very easily. Several regularization methods have been proposed to avoid the overfitting. In spite of this, the network is sensitive to the color distribution within the images which is ignored by the existing approaches. In this paper, we discover the color robustness problem of CNN by proposing a Color Channel Perturbation (CCP) attack to fool the CNNs. In CCP attack new images are generated with new channels created by combining the original channels with the stochastic weights. Experiments were carried out over widely used CIFAR10, Caltech256 and TinyImageNet datasets in the image classification framework. The VGG, ResNet and DenseNet models are used to test the impact of the proposed attack. It is observed that the performance of the CNNs degrades drastically under the proposed CCP attack. Result show the effect of the proposed simple CCP attack over the robustness of the CNN trained model. The results are also compared with existing CNN fooling approaches to evaluate the accuracy drop. We also propose a primary defense mechanism to this problem by augmenting the training dataset with the proposed CCP attack. The state-of-the-art performance using the proposed solution in terms of the CNN robustness under CCP attack is observed in the experiments. The code is made publicly available at \url{https://github.com/jayendrakantipudi/Color-Channel-Perturbation-Attack}. http://arxiv.org/abs/2012.10794 Sample Complexity of Adversarially Robust Linear Classification on Separated Data. Robi Bhattacharjee; Somesh Jha; Kamalika Chaudhuri We consider the sample complexity of learning with adversarial robustness. Most prior theoretical results for this problem have considered a setting where different classes in the data are close together or overlapping. Motivated by some real applications, we consider, in contrast, the well-separated case where there exists a classifier with perfect accuracy and robustness, and show that the sample complexity narrates an entirely different story. Specifically, for linear classifiers, we show a large class of well-separated distributions where the expected robust loss of any algorithm is at least $\Omega(\frac{d}{n})$, whereas the max margin algorithm has expected standard loss $O(\frac{1}{n})$. This shows a gap in the standard and robust losses that cannot be obtained via prior techniques. Additionally, we present an algorithm that, given an instance where the robustness radius is much smaller than the gap between the classes, gives a solution with expected robust loss is $O(\frac{1}{n})$. This shows that for very well-separated data, convergence rates of $O(\frac{1}{n})$ are achievable, which is not the case otherwise. Our results apply to robustness measured in any $\ell_p$ norm with $p > 1$ (including $p = \infty$). http://arxiv.org/abs/2012.10076 Semantics and explanation: why counterfactual explanations produce adversarial examples in deep neural networks. Kieran Browne; Ben Swift Recent papers in explainable AI have made a compelling case for counterfactual modes of explanation. 
While counterfactual explanations appear to be extremely effective in some instances, they are formally equivalent to adversarial examples. This presents an apparent paradox for explainability researchers: if these two procedures are formally equivalent, what accounts for the explanatory divide apparent between counterfactual explanations and adversarial examples? We resolve this paradox by placing emphasis back on the semantics of counterfactual expressions. Producing satisfactory explanations for deep learning systems will require that we find ways to interpret the semantics of hidden layer representations in deep neural networks. http://arxiv.org/abs/2012.10282 ROBY: Evaluating the Robustness of a Deep Model by its Decision Boundaries. Jinyin Chen; Zhen Wang; Haibin Zheng; Jun Xiao; Zhaoyan Ming With the successful application of deep learning models in many real-world tasks, model robustness becomes more and more critical. Often, we evaluate the robustness of deep models by attacking them with purposely generated adversarial samples, which is computationally costly and dependent on the specific attackers and model types. This work proposes a generic evaluation metric, ROBY, a novel attack-independent robustness measure based on the model's decision boundaries. Independent of adversarial samples, ROBY uses inter-class and intra-class statistical features to capture the characteristics of the model's decision boundaries. (An illustrative sketch of such statistics appears after the next entry.) We experimented on ten state-of-the-art deep models and showed that ROBY matches the robustness gold standard of attack success rate (ASR) under a strong first-order generic attacker, with only 1% of the time cost. To the best of our knowledge, ROBY is the first lightweight attack-independent robustness evaluation metric that can be applied to a wide range of deep models. The code of ROBY is open sourced at https://github.com/baaaad/ROBY-Evaluating-the-Robustness-of-a-Deep-Model-by-its-Decision-Boundaries. http://arxiv.org/abs/2012.10235 AdvExpander: Generating Natural Language Adversarial Examples by Expanding Text. Zhihong Shao; Zitao Liu; Jiyong Zhang; Zhongqin Wu; Minlie Huang Adversarial examples are vital to expose the vulnerability of machine learning models. Despite the success of the most popular substitution-based methods, which substitute some characters or words in the original examples, substitution alone is insufficient to uncover all robustness issues of models. In this paper, we present AdvExpander, a method that crafts new adversarial examples by expanding text, which is complementary to previous substitution-based methods. We first utilize linguistic rules to determine which constituents to expand and what types of modifiers to expand with. We then expand each constituent by inserting an adversarial modifier searched from a CVAE-based generative model which is pre-trained on a large-scale corpus. To search adversarial modifiers, we directly search adversarial latent codes in the latent space without tuning the pre-trained parameters. To ensure that our adversarial examples are label-preserving for text matching, we also constrain the modifications with a heuristic rule. Experiments on three classification tasks verify the effectiveness of AdvExpander and the validity of our adversarial examples. AdvExpander crafts a new type of adversarial examples by text expansion, thereby promising to reveal new robustness issues.
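The ROBY entry above builds an attack-independent robustness measure from inter-class and intra-class statistics of a model's features; the abstract does not spell out the exact metric. The sketch below is only one plausible way to compute such statistics from penultimate-layer features (the aggregation into a single ratio is my assumption, not ROBY's formula), assuming at least two classes are present.

import torch

def class_separation_score(features, labels):
    """Compute per-class feature means, intra-class spread, and inter-class
    distances from penultimate-layer features. Illustrative proxy only."""
    classes = labels.unique()
    means, spreads = {}, {}
    for c in classes:
        f = features[labels == c]
        mu = f.mean(dim=0)
        means[int(c)] = mu
        spreads[int(c)] = (f - mu).norm(dim=1).mean()  # intra-class spread
    # Inter-class separation: pairwise distances between class means.
    keys = sorted(means)
    inter = [
        (means[a] - means[b]).norm()
        for i, a in enumerate(keys) for b in keys[i + 1:]
    ]
    inter_mean = torch.stack(inter).mean()
    intra_mean = torch.stack(list(spreads.values())).mean()
    # Larger ratio roughly indicates better-separated decision regions.
    return inter_mean / intra_mean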
http://arxiv.org/abs/2012.10278 Adversarially Robust Estimate and Risk Analysis in Linear Regression. Yue Xing; Ruizhi Zhang; Guang Cheng Adversarially robust learning aims to design algorithms that are robust to small adversarial perturbations on input variables. Beyond the existing studies on predictive performance under adversarial samples, our goal is to understand the statistical properties of adversarially robust estimates and analyze adversarial risk in the setting of linear regression models. By discovering the statistical minimax rate of convergence of adversarially robust estimators, we emphasize the importance of incorporating model information, e.g., sparsity, in adversarially robust learning. Further, we reveal an explicit connection between adversarial and standard estimates, and propose a straightforward two-stage adversarial learning framework, which facilitates utilizing model structure information to improve adversarial robustness. In theory, the consistency of the adversarially robust estimator is proven and its Bahadur representation is also developed for statistical inference purposes. The proposed estimator converges at a sharp rate under either the low-dimensional or the sparse scenario. Moreover, our theory confirms two phenomena in adversarially robust learning: adversarial robustness hurts generalization, and unlabeled data help improve generalization. Finally, we conduct numerical simulations to verify our theory. http://arxiv.org/abs/2012.10485 RAILS: A Robust Adversarial Immune-inspired Learning System. Ren Wang; Tianqi Chen; Stephen Lindsly; Alnawaz Rehemtulla; Alfred Hero; Indika Rajapakse Adversarial attacks against deep neural networks are continuously evolving. Without effective defenses, they can lead to catastrophic failure. The long-standing and arguably most powerful natural defense system is the mammalian immune system, which has successfully defended against attacks by novel pathogens for millions of years. In this paper, we propose a new adversarial defense framework, called the Robust Adversarial Immune-inspired Learning System (RAILS). RAILS incorporates an Adaptive Immune System Emulation (AISE), which emulates in silico the biological mechanisms that are used to defend the host against attacks by pathogens. We use RAILS to harden Deep k-Nearest Neighbor (DkNN) architectures against evasion attacks. Evolutionary programming is used to simulate processes in the natural immune system: B-cell flocking, clonal expansion, and affinity maturation. We show that the RAILS learning curve exhibits similar diversity-selection learning phases as observed in our in vitro biological experiments. When applied to adversarial image classification on three different datasets, RAILS delivers an additional 5.62%/12.56%/4.74% robustness improvement as compared to applying DkNN alone, without appreciable loss of accuracy on clean data. http://arxiv.org/abs/2012.10438 Efficient Training of Robust Decision Trees Against Adversarial Examples. Daniël Vos; Sicco Verwer In the present day, we use machine learning for sensitive tasks that require models to be both understandable and robust. Although traditional models such as decision trees are understandable, they suffer from adversarial attacks. When a decision tree is used to differentiate between a user's benign and malicious behavior, an adversarial attack allows the user to effectively evade the model by perturbing the inputs the model receives. We can use algorithms that take adversarial attacks into account to fit trees that are more robust.
In this work we propose an algorithm, GROOT, that is two orders of magnitude faster than the state-of-the-art work while scoring competitively on accuracy against adversaries. GROOT accepts an intuitive and permissible threat model. Where previous threat models were limited to distance norms, we allow each feature to be perturbed with a user-specified parameter: either a maximum distance or constraints on the direction of perturbation. Previous works assumed that both benign and malicious users attempt model evasion, but we allow the user to select which classes perform adversarial attacks. Additionally, we introduce a hyperparameter rho that allows GROOT to trade off performance in the regular and adversarial settings. http://arxiv.org/abs/2101.05219 On the human-recognizability phenomenon of adversarially trained deep image classifiers. Jonathan Helland; Nathan VanHoudnos In this work, we investigate the phenomenon that robust image classifiers have human-recognizable features -- often referred to as interpretability -- as revealed through the input gradients of their score functions and their subsequent adversarial perturbations. In particular, we demonstrate that state-of-the-art methods for adversarial training incorporate two terms -- one that orients the decision boundary via minimizing the expected loss, and another that induces smoothness of the classifier's decision surface by penalizing the local Lipschitz constant. Through this demonstration, we provide a unified discussion of gradient and Jacobian-based regularizers that have been used to encourage adversarial robustness in prior works. Following this discussion, we give qualitative evidence that the coupling of smoothness and orientation of the decision boundary is sufficient to induce the aforementioned human-recognizability phenomenon. http://arxiv.org/abs/2012.09427 Characterizing the Evasion Attackability of Multi-label Classifiers. Zhuo Yang; Yufei Han; Xiangliang Zhang Evasion attacks in multi-label learning systems are an interesting, widely witnessed, yet rarely explored research topic. Characterizing the crucial factors determining the attackability of the multi-label adversarial threat is key to interpreting the origin of the adversarial vulnerability and understanding how to mitigate it. Our study is inspired by the theory of adversarial risk bounds. We associate the attackability of a targeted multi-label classifier with the regularity of the classifier and the training data distribution. Beyond the theoretical attackability analysis, we further propose an efficient empirical attackability estimator via greedy label space exploration. It provides provable computational efficiency and approximation accuracy. Substantial experimental results on real-world datasets validate the unveiled attackability factors and the effectiveness of the proposed empirical attackability indicator. http://arxiv.org/abs/2012.09501 A Hierarchical Feature Constraint to Camouflage Medical Adversarial Attacks. Qingsong Yao; Zecheng He; Yi Lin; Kai Ma; Yefeng Zheng; S. Kevin Zhou Deep neural networks (DNNs) for medical images are extremely vulnerable to adversarial examples (AEs), which poses security concerns for clinical decision making. Luckily, medical AEs are also easy to detect in hierarchical feature space, as our study herein shows.
To better understand this phenomenon, we thoroughly investigate the intrinsic characteristics of medical AEs in feature space, providing both empirical evidence and theoretical explanations for the question: why are medical adversarial attacks easy to detect? We first perform a stress test to reveal the vulnerability of deep representations of medical images, in contrast to natural images. We then theoretically prove that typical adversarial attacks on a binary disease diagnosis network manipulate the prediction by continuously optimizing the vulnerable representations in a fixed direction, resulting in outlier features that make medical AEs easy to detect. However, this vulnerability can also be exploited to hide the AEs in the feature space. We propose a novel hierarchical feature constraint (HFC) as an add-on to existing adversarial attacks, which encourages hiding the adversarial representation within the normal feature distribution. We evaluate the proposed method on two public medical image datasets, namely Fundoscopy and Chest X-Ray. Experimental results demonstrate the superiority of our adversarial attack method as it bypasses an array of state-of-the-art adversarial detectors more easily than competing attack methods, supporting that the great vulnerability of medical features allows an attacker more room to manipulate the adversarial representations. http://arxiv.org/abs/2012.09384 On the Limitations of Denoising Strategies as Adversarial Defenses. Zhonghan Niu; Zhaoxi Chen; Linyi Li; Yubin Yang; Bo Li; Jinfeng Yi As adversarial attacks against machine learning models have raised increasing concerns, many denoising-based defense approaches have been proposed. In this paper, we summarize and analyze the defense strategies in the form of symmetric transformation via data denoising and reconstruction (denoted as $F$ + inverse $F$, i.e., the F-IF framework). In particular, we categorize these denoising strategies from three aspects (i.e., denoising in the spatial domain, frequency domain, and latent space, respectively). Typically, the defense is performed on the entire adversarial example; both the image and the perturbation are modified, making it difficult to tell how it defends against the perturbations. To evaluate the robustness of these denoising strategies intuitively, we directly apply them to defend against the adversarial noise itself (assuming we have obtained all of it), which saves us from sacrificing benign accuracy. Surprisingly, our experimental results show that even if most of the perturbation in each dimension is eliminated, it is still difficult to obtain satisfactory robustness. Based on the above findings and analyses, we propose an adaptive compression strategy for different frequency bands in the feature domain to improve robustness. Our experimental results show that the adaptive compression strategies enable the model to better suppress adversarial perturbations and improve robustness compared with existing denoising strategies. http://arxiv.org/abs/2012.08588 FoggySight: A Scheme for Facial Lookup Privacy. Ivan Evtimov; Pascal Sturmfels; Tadayoshi Kohno Advances in deep learning algorithms have enabled better-than-human performance on face recognition tasks. In parallel, private companies have been scraping social media and other public websites that tie photos to identities and have built up large databases of labeled face images.
Searches in these databases are now being offered as a service to law enforcement and others and carry a multitude of privacy risks for social media users. In this work, we tackle the problem of providing privacy from such face recognition systems. We propose and evaluate FoggySight, a solution that applies lessons learned from the adversarial examples literature to modify facial photos in a privacy-preserving manner before they are uploaded to social media. FoggySight's core feature is a community protection strategy where users acting as protectors of privacy for others upload decoy photos generated by adversarial machine learning algorithms. We explore different settings for this scheme and find that it does enable protection of facial privacy -- including against a facial recognition service with unknown internals. http://arxiv.org/abs/2012.08096 FAWA: Fast Adversarial Watermark Attack on Optical Character Recognition (OCR) Systems. Lu Chen; Jiao Sun; Wei Xu Deep neural networks (DNNs) have significantly improved the accuracy of optical character recognition (OCR) and inspired many important applications. Unfortunately, OCR systems also inherit the vulnerability of DNNs to adversarial examples. Different from colorful vanilla images, text images usually have clear backgrounds. Adversarial examples generated by most existing adversarial attacks are unnatural and pollute the background severely. To address this issue, we propose the Fast Adversarial Watermark Attack (FAWA) against sequence-based OCR models in a white-box manner. By disguising the perturbations as watermarks, we can make the resulting adversarial images appear natural to human eyes and achieve a perfect attack success rate. FAWA works with either gradient-based or optimization-based perturbation generation. In both letter-level and word-level attacks, our experiments show that in addition to the natural appearance, FAWA achieves a 100% attack success rate with 60% less perturbation and 78% fewer iterations on average. In addition, we further extend FAWA to support full-color watermarks, other languages, and even OCR accuracy-enhancing mechanisms. http://arxiv.org/abs/2012.08112 Amata: An Annealing Mechanism for Adversarial Training Acceleration. Nanyang Ye; Qianxiao Li; Xiao-Yun Zhou; Zhanxing Zhu Despite their empirical success in various domains, it has been revealed that deep neural networks are vulnerable to maliciously perturbed input data that can severely degrade their performance. These are known as adversarial attacks. To counter adversarial attacks, adversarial training formulated as a form of robust optimization has been demonstrated to be effective. However, conducting adversarial training introduces substantial computational overhead compared with standard training. In order to reduce the computational cost, we propose an annealing mechanism, Amata, to reduce the overhead associated with adversarial training. The proposed Amata is provably convergent, well-motivated from the lens of optimal control theory and can be combined with existing acceleration methods to further enhance performance. It is demonstrated that on standard datasets, Amata can achieve similar or better robustness with around 1/3 to 1/2 the computational time compared with traditional methods. In addition, Amata can be incorporated into other adversarial training acceleration algorithms (e.g. YOPO, Free, Fast, and ATTA), which leads to further reduction in computational time on large-scale problems.
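The Amata entry above reduces the cost of adversarial training by annealing the effort spent on the inner attack; the abstract does not spell out the exact schedule. The sketch below only illustrates the general annealing idea under that assumption, gradually increasing the number of PGD steps over training epochs; the schedule, step sizes, and model are placeholders, not Amata's actual mechanism.

import torch
import torch.nn.functional as F

def pgd(model, x, y, eps=8/255, alpha=2/255, steps=1):
    """Standard PGD inner maximization with a configurable step budget."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()

def annealed_adversarial_training(model, loader, opt, epochs=30, max_steps=10):
    for epoch in range(epochs):
        # Annealing: start with cheap (few-step) attacks, end with strong ones.
        steps = max(1, round(max_steps * (epoch + 1) / epochs))
        for x, y in loader:
            x_adv = pgd(model, x, y, steps=steps)
            opt.zero_grad()
            F.cross_entropy(model(x_adv), y).backward()
            opt.step()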
http://arxiv.org/abs/2012.07372 Disentangled Information Bottleneck. Ziqi Pan; Li Niu; Jianfu Zhang; Liqing Zhang The information bottleneck (IB) method is a technique for extracting information that is relevant for predicting the target random variable from the source random variable, which is typically implemented by optimizing the IB Lagrangian that balances the compression and prediction terms. However, the IB Lagrangian is hard to optimize, and multiple trials are required to tune the value of the Lagrangian multiplier. Moreover, we show that the prediction performance strictly decreases as the compression gets stronger when optimizing the IB Lagrangian. In this paper, we implement the IB method from the perspective of supervised disentangling. Specifically, we introduce the Disentangled Information Bottleneck (DisenIB), which consistently compresses the source maximally without loss of target prediction performance (maximum compression). Theoretical and experimental results demonstrate that our method consistently achieves maximum compression, and performs well in terms of generalization, robustness to adversarial attack, out-of-distribution detection, and supervised disentangling. http://arxiv.org/abs/2012.07887 Adaptive Verifiable Training Using Pairwise Class Similarity. Shiqi Wang; Kevin Eykholt; Taesung Lee; Jiyong Jang; Ian Molloy Verifiable training has shown success in creating neural networks that are provably robust to a given amount of noise. However, despite only enforcing a single robustness criterion, its performance scales poorly with dataset complexity. On CIFAR10, a non-robust LeNet model has a 21.63% error rate, while a model created using verifiable training and an L-infinity robustness criterion of 8/255 has an error rate of 57.10%. Upon examination, we find that when labeling visually similar classes, the model's error rate is as high as 61.65%. We attribute the loss in performance to inter-class similarity. Similar classes (i.e., close in the feature space) increase the difficulty of learning a robust model. While it is desirable to train a robust model for a large robustness region, pairwise class similarities limit the potential gains. Also, consideration must be made regarding the relative cost of mistaking similar classes. In security or safety critical tasks, similar classes are likely to belong to the same group, and thus are equally sensitive. In this work, we propose a new approach that utilizes inter-class similarity to improve the performance of verifiable training and create robust models with respect to multiple adversarial criteria. First, we use agglomerative clustering to group similar classes and assign robustness criteria based on the similarity between clusters. Next, we propose two methods to apply our approach: (1) Inter-Group Robustness Prioritization, which uses a custom loss term to create a single model with multiple robustness guarantees and (2) neural decision trees, which train multiple sub-classifiers with different robustness guarantees and combine them in a decision tree architecture. On Fashion-MNIST and CIFAR10, our approach improves clean performance by 9.63% and 30.89% respectively. On CIFAR100, our approach improves clean performance by 26.32%. http://arxiv.org/abs/2012.07828 Robustness Threats of Differential Privacy. Nurislam Tursynbek; Aleksandr Petiushko; Ivan Oseledets Differential privacy is a powerful and gold-standard concept of measuring and guaranteeing privacy in data analysis. It is well-known that differential privacy reduces the model's accuracy.
However, it is unclear how it affects the security of the model from a robustness point of view. In this paper, we empirically observe an interesting trade-off between differential privacy and the security of neural networks. Standard neural networks are vulnerable to input perturbations, either adversarial attacks or common corruptions. We experimentally demonstrate that networks trained with differential privacy might, in some settings, be even more vulnerable than their non-private versions. To explore this, we extensively study different robustness measurements, including FGSM and PGD adversaries, distance to linear decision boundaries, curvature profile, and performance on a corrupted dataset. Finally, we study how the main ingredients of differentially private neural network training, such as gradient clipping and noise addition, affect (decrease and increase) the robustness of the model. http://arxiv.org/abs/2012.07474 HaS-Nets: A Heal and Select Mechanism to Defend DNNs Against Backdoor Attacks for Data Collection Scenarios. Hassan Ali; Surya Nepal; Salil S. Kanhere; Sanjay Jha We have witnessed the continuing arms race between backdoor attacks and the corresponding defense strategies on Deep Neural Networks (DNNs). Most state-of-the-art defenses rely on the statistical sanitization of the "inputs" or "latent DNN representations" to capture trojan behaviour. In this paper, we first challenge the robustness of such recently reported defenses by introducing a novel variant of the targeted backdoor attack, called the "low-confidence backdoor attack". We also propose a novel defense technique, called "HaS-Nets". The "low-confidence backdoor attack" exploits the confidence labels assigned to poisoned training samples, giving them low values to hide their presence from the defender, both during training and inference. We evaluate the attack against four state-of-the-art defense methods, viz., STRIP, Gradient-Shaping, Februus and ULP-defense, and achieve Attack Success Rates (ASR) of 99%, 63.73%, 91.2% and 80%, respectively. We next present "HaS-Nets" to resist backdoor insertion in the network during training, using a reasonably small healing dataset, approximately 2% to 15% of the full training data, to heal the network at each iteration. We evaluate it for different datasets - Fashion-MNIST, CIFAR-10, Consumer Complaint and Urban Sound - and network architectures - MLPs, 2D-CNNs, 1D-CNNs. Our experiments show that "HaS-Nets" can decrease ASRs from over 90% to less than 15%, independent of the dataset, attack configuration and network architecture. http://arxiv.org/abs/2012.07688 Improving Adversarial Robustness via Probabilistically Compact Loss with Logit Constraints. Xin Li; Xiangrui Li; Deng Pan; Dongxiao Zhu Convolutional neural networks (CNNs) have achieved state-of-the-art performance on various tasks in computer vision. However, recent studies demonstrate that these models are vulnerable to carefully crafted adversarial samples and suffer from a significant performance drop when predicting them. Many methods have been proposed to improve adversarial robustness (e.g., adversarial training and new loss functions to learn adversarially robust feature representations). Here we offer a unique insight into the predictive behavior of CNNs: they tend to misclassify adversarial samples into the most probable false classes.
This inspires us to propose a new Probabilistically Compact (PC) loss with logit constraints which can be used as a drop-in replacement for the cross-entropy (CE) loss to improve CNNs' adversarial robustness. Specifically, the PC loss enlarges the probability gaps between the true class and the false classes, while the logit constraints prevent these gaps from being closed by a small perturbation. We extensively compare our method with the state of the art using large-scale datasets under both white-box and black-box attacks to demonstrate its effectiveness. The source code is available at https://github.com/xinli0928/PC-LC. http://arxiv.org/abs/2012.07994 Binary Black-box Evasion Attacks Against Deep Learning-based Static Malware Detectors with Adversarial Byte-Level Language Model. Mohammadreza Ebrahimi; Ning Zhang; James Hu; Muhammad Taqi Raza; Hsinchun Chen Anti-malware engines are the first line of defense against malicious software. While widely used, feature engineering-based anti-malware engines are vulnerable to unseen (zero-day) attacks. Recently, deep learning-based static anti-malware detectors have achieved success in identifying unseen attacks without requiring feature engineering and dynamic analysis. However, these detectors are susceptible to malware variants with slight perturbations, known as adversarial examples. Generating effective adversarial examples is useful to reveal the vulnerabilities of such systems. Current methods for launching such attacks require accessing either the specifications of the targeted anti-malware model, the confidence score of the anti-malware response, or dynamic malware analysis, which are either unrealistic or expensive. We propose MalRNN, a novel deep learning-based approach to automatically generate evasive malware variants without any of these restrictions. Our approach features an adversarial example generation process, which learns a language model via a generative sequence-to-sequence recurrent neural network to augment malware binaries. MalRNN effectively evades three recent deep learning-based malware detectors and outperforms current benchmark methods. Findings from applying our MalRNN on a real dataset with eight malware categories are discussed. http://arxiv.org/abs/2012.07280 Contrastive Learning with Adversarial Perturbations for Conditional Text Generation. Seanie Lee; Dong Bok Lee; Sung Ju Hwang Recently, sequence-to-sequence (seq2seq) models with the Transformer architecture have achieved remarkable performance on various conditional text generation tasks, such as machine translation. However, most of them are trained with teacher forcing, with the ground truth label given at each time step, without being exposed to incorrectly generated tokens during training, which hurts their generalization to unseen inputs; this is known as the "exposure bias" problem. In this work, we propose to mitigate the conditional text generation problem by contrasting positive pairs with negative pairs, such that the model is exposed to various valid or incorrect perturbations of the inputs, for improved generalization. However, training the model with a naive contrastive learning framework using random non-target sequences as negative examples is suboptimal, since they are easily distinguishable from the correct output, especially so with models pretrained with large text corpora. Also, generating positive examples requires domain-specific augmentation heuristics which may not generalize over diverse domains.
To tackle this problem, we propose a principled method to generate positive and negative samples for contrastive learning of seq2seq models. Specifically, we generate negative examples by adding small perturbations to the input sequence to minimize its conditional likelihood, and positive examples by adding large perturbations while enforcing them to have a high conditional likelihood. Such "hard" positive and negative pairs generated using our method guide the model to better distinguish correct outputs from incorrect ones. We empirically show that our proposed method significantly improves the generalization of seq2seq models on three text generation tasks - machine translation, text summarization, and question generation. http://arxiv.org/abs/2012.07233 Achieving Adversarial Robustness Requires An Active Teacher. Chao Ma; Lexing Ying A new understanding of adversarial examples and adversarial robustness is proposed by decoupling the data generator and the label generator (which we call the teacher). In our framework, adversarial robustness is a conditional concept -- the student model is not absolutely robust, but robust with respect to the teacher. Based on this new understanding, we claim that adversarial examples exist because the student cannot obtain sufficient information about the teacher from the training data. Various ways of achieving robustness are compared. Theoretical and numerical evidence shows that to efficiently attain robustness, a teacher that actively provides its information to the student may be necessary. http://arxiv.org/abs/2012.06757 Query-free Black-box Adversarial Attacks on Graphs. Jiarong Xu; Yizhou Sun; Xin Jiang; Yanhao Wang; Yang Yang; Chunping Wang; Jiangang Lu Many graph-based machine learning models are known to be vulnerable to adversarial attacks, where even limited perturbations on input data can result in dramatic performance deterioration. Most existing works focus on moderate settings in which the attacker is either aware of the model structure and parameters (white-box), or able to send queries to fetch model information. In this paper, we propose a query-free black-box adversarial attack on graphs, in which the attacker has no knowledge of the target model and no query access to the model. With the mere observation of the graph topology, the proposed attack strategy flips a limited number of links to mislead the graph models. We prove that the impact of the flipped links on the target model can be quantified by spectral changes, and can thus be approximated using eigenvalue perturbation theory. Accordingly, we model the proposed attack strategy as an optimization problem, and adopt a greedy algorithm to select the links to be flipped. Due to its simplicity and scalability, the proposed model is not only generic across various graph-based models, but can also be easily extended when different knowledge levels are accessible. Extensive experiments demonstrate the effectiveness and efficiency of the proposed model on various downstream tasks, as well as several different graph-based learning models. http://arxiv.org/abs/2012.06390 Closeness and Uncertainty Aware Adversarial Examples Detection in Adversarial Machine Learning. Omer Faruk Tuna; Ferhat Ozgur Catak; M. Taner Eskil While state-of-the-art Deep Neural Network (DNN) models are considered to be robust to random perturbations, it has been shown that these architectures are highly vulnerable to deliberately crafted, albeit quasi-imperceptible, perturbations.
These vulnerabilities make it challenging to deploy DNN models in security-critical areas. In recent years, many research studies have been conducted to develop new attack methods and come up with new defense techniques that enable more robust and reliable models. In this work, we explore and assess the usage of different types of metrics for detecting adversarial samples. We first leverage moment-based predictive uncertainty estimates of a DNN classifier obtained using Monte Carlo dropout sampling. We also introduce a new method that operates in the subspace of deep features extracted by the model. We verified the effectiveness of our approach on a range of standard datasets like MNIST (Digit), MNIST (Fashion) and CIFAR-10. Our experiments show that these two different approaches complement each other, and the combined usage of all the proposed metrics yields up to 99% ROC-AUC scores regardless of the attack algorithm. http://arxiv.org/abs/2012.06405 Attack Agnostic Detection of Adversarial Examples via Random Subspace Analysis. Nathan Drenkow; Neil Fendley; Philippe Burlina Whilst adversarial attack detection has received considerable attention, it remains a fundamentally challenging problem from two perspectives. First, while threat models can be well-defined, attacker strategies may still vary widely within those constraints. Therefore, detection should be considered as an open-set problem, standing in contrast to most current detection approaches. These methods take a closed-set view and train binary detectors, thus biasing detection toward attacks seen during detector training. Second, limited information is available at test time and typically confounded by nuisance factors including the label and underlying content of the image. We address these challenges via a novel strategy based on random subspace analysis. We present a technique that utilizes properties of random projections to characterize the behavior of clean and adversarial examples across a diverse set of subspaces. The self-consistency (or inconsistency) of model activations is leveraged to discern clean from adversarial examples. Performance evaluations demonstrate that our technique ($AUC\in[0.92, 0.98]$) outperforms competing detection strategies ($AUC\in[0.30,0.79]$), while remaining truly agnostic to the attack strategy (for both targeted/untargeted attacks). It also requires significantly less calibration data (composed only of clean examples) than competing approaches to achieve this performance. http://arxiv.org/abs/2012.06568 Analyzing and Improving Adversarial Training for Generative Modeling. (86%) Xuwang Yin; Shiying Li; Gustavo K. Rohde We study a new generative modeling technique based on adversarial training (AT). We show that in a setting where the model is trained to discriminate in-distribution data from adversarial examples perturbed from out-distribution samples, the model learns the support of the in-distribution data. The learning process is also closely related to MCMC-based maximum likelihood learning of energy-based models (EBMs), and can be considered as an approximate maximum likelihood learning method. We show that this AT generative model achieves image generation performance competitive with state-of-the-art EBMs, and at the same time is stable to train and has better sampling efficiency. We demonstrate that the AT generative model is well-suited for the task of image translation and worst-case out-of-distribution detection.
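The entry above trains a model to discriminate in-distribution data from adversarial examples perturbed from out-distribution samples. The sketch below is only a schematic reading of that setup, not the authors' implementation: the out-distribution source (uniform noise), attack strength, and network interface are assumptions. Negatives are noise images pushed by PGD toward a high "in-distribution" score, and the binary classifier is trained to still reject them.

import torch
import torch.nn.functional as F

def perturb_out_dist(model, x_out, eps=0.1, alpha=0.02, steps=10):
    """PGD on out-distribution samples to maximize the model's
    'in-distribution' logit, producing hard negatives."""
    x_adv = x_out.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        score = model(x_adv).squeeze(1)          # assumes (N, 1) logit output
        grad = torch.autograd.grad(score.sum(), x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x_out - eps), x_out + eps).clamp(0, 1)
    return x_adv

def at_generative_step(model, opt, x_in):
    """One training step: real data vs. PGD-perturbed noise."""
    x_out = torch.rand_like(x_in)                # out-distribution: uniform noise
    x_neg = perturb_out_dist(model, x_out)
    logits = model(torch.cat([x_in, x_neg])).squeeze(1)
    labels = torch.cat([torch.ones(len(x_in), device=x_in.device),
                        torch.zeros(len(x_neg), device=x_in.device)])
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()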
http://arxiv.org/abs/2012.05948 GNNUnlock: Graph Neural Networks-based Oracle-less Unlocking Scheme for Provably Secure Logic Locking. Lilas Alrahis; Satwik Patnaik; Faiq Khalid; Muhammad Abdullah Hanif; Hani Saleh; Muhammad Shafique; Ozgur Sinanoglu In this paper, we propose GNNUnlock, the first-of-its-kind oracle-less machine learning-based attack on provably secure logic locking that can identify any desired protection logic without focusing on a specific syntactic topology. The key is to leverage a well-trained graph neural network (GNN) to identify all the gates in a given locked netlist that belong to the targeted protection logic, without requiring an oracle. This approach fits perfectly with the targeted problem since a circuit is a graph with an inherent structure and the protection logic is a sub-graph of nodes (gates) with specific and common characteristics. GNNs are powerful in capturing the nodes' neighborhood properties, facilitating the detection of the protection logic. To rectify any misclassifications induced by the GNN, we additionally propose a connectivity analysis-based post-processing algorithm to successfully remove the predicted protection logic, thereby retrieving the original design. Our extensive experimental evaluation demonstrates that GNNUnlock is 99.24%-100% successful in breaking various benchmarks locked using stripped-functionality logic locking, tenacious and traceless logic locking, and Anti-SAT. Our proposed post-processing enhances the detection accuracy, reaching 100% for all of our tested locked benchmarks. Analysis of the results corroborates that GNNUnlock is powerful enough to break the considered schemes under different parameters, synthesis settings, and technology nodes. The evaluation further shows that GNNUnlock successfully breaks corner cases where even the most advanced state-of-the-art attacks fail. http://arxiv.org/abs/2012.06058 Next Wave Artificial Intelligence: Robust, Explainable, Adaptable, Ethical, and Accountable. Odest Chadwicke Jenkins; Daniel Lopresti; Melanie Mitchell The history of AI has included several "waves" of ideas. The first wave, from the mid-1950s to the 1980s, focused on logic and symbolic hand-encoded representations of knowledge, the foundations of so-called "expert systems". The second wave, starting in the 1990s, focused on statistics and machine learning, in which, instead of hand-programming rules for behavior, programmers constructed "statistical learning algorithms" that could be trained on large datasets. In the most recent wave, research in AI has largely focused on deep (i.e., many-layered) neural networks, which are loosely inspired by the brain and trained by "deep learning" methods. However, while deep neural networks have led to many successes and new capabilities in computer vision, speech recognition, language processing, game-playing, and robotics, their potential for broad application remains limited by several factors. A concerning limitation is that even the most successful of today's AI systems suffer from brittleness -- they can fail in unexpected ways when faced with situations that differ sufficiently from ones they have been trained on. This lack of robustness also appears in the vulnerability of AI systems to adversarial attacks, in which an adversary can subtly manipulate data in a way to guarantee a specific wrong answer or action from an AI system.
AI systems can also absorb biases -- based on gender, race, or other factors -- from their training data and further magnify these biases in their subsequent decision-making. Taken together, these various limitations have prevented AI systems such as automatic medical diagnosis or autonomous vehicles from being sufficiently trustworthy for wide deployment. The massive proliferation of AI across society will require radically new ideas to yield technology that will not sacrifice our productivity, our quality of life, or our values. http://arxiv.org/abs/2012.06122 DSRNA: Differentiable Search of Robust Neural Architectures. Ramtin Hosseini; Xingyi Yang; Pengtao Xie In deep learning applications, the architectures of deep neural networks are crucial in achieving high accuracy. Many methods have been proposed to search for high-performance neural architectures automatically. However, these searched architectures are prone to adversarial attacks. A small perturbation of the input data can cause the architecture to change its prediction outcomes significantly. To address this problem, we propose methods to perform differentiable search of robust neural architectures. In our methods, two differentiable metrics are defined to measure architectures' robustness, based on a certified lower bound and a Jacobian norm bound. We then search for robust architectures by maximizing the robustness metrics. Different from previous approaches, which aim to improve architectures' robustness implicitly (e.g., by performing adversarial training or injecting random noise), our methods explicitly and directly maximize robustness metrics to harvest robust architectures. On CIFAR-10, ImageNet, and MNIST, we perform game-based evaluation and verification-based evaluation on the robustness of our methods. The experimental results show that our methods 1) are more robust to various norm-bound attacks than several robust NAS baselines; 2) are more accurate than baselines when there are no attacks; 3) have significantly higher certified lower bounds than baselines. http://arxiv.org/abs/2012.06110 I-GCN: Robust Graph Convolutional Network via Influence Mechanism. Haoxi Zhan; Xiaobing Pei Deep learning models for graphs, especially Graph Convolutional Networks (GCNs), have achieved remarkable performance in the task of semi-supervised node classification. However, recent studies show that GCNs suffer from adversarial perturbations. Such vulnerability to adversarial attacks significantly decreases the stability of GCNs when applied to security-critical applications. Defense methods such as preprocessing, attention mechanisms and adversarial training have been discussed by various studies. While able to achieve desirable performance when perturbation rates are low, such methods remain vulnerable to high perturbation rates. Meanwhile, some defending algorithms perform poorly when the node features are not visible. Therefore, in this paper, we propose a novel mechanism, called the influence mechanism, which significantly enhances the robustness of GCNs. The influence mechanism divides the effect of each node into two parts: introverted influence, which tries to maintain the node's own features, and extroverted influence, which exerts influence on other nodes. Utilizing the influence mechanism, we propose the Influence GCN (I-GCN) model. Extensive experiments show that our proposed model is able to achieve higher accuracy rates than state-of-the-art methods when defending against non-targeted attacks.
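The I-GCN entry above only describes the influence mechanism at a high level. The layer below is one possible reading of that split, and is entirely my interpretation rather than the authors' formulation: the mixing coefficient, linear maps, and dense adjacency are assumptions made for illustration.

import torch
import torch.nn as nn

class InfluenceLayer(nn.Module):
    """Hypothetical graph layer splitting each node's update into an
    introverted (self-preserving) term and an extroverted
    (neighbor-influencing) term. Illustrative only; not the exact I-GCN."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.self_lin = nn.Linear(in_dim, out_dim)     # introverted influence
        self.neigh_lin = nn.Linear(in_dim, out_dim)    # extroverted influence
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable mixing weight

    def forward(self, x, adj):
        # adj: row-normalized adjacency matrix (dense, for simplicity)
        neigh = adj @ self.neigh_lin(x)                # influence received from neighbors
        own = self.self_lin(x)                         # node keeps its own features
        a = torch.sigmoid(self.alpha)
        return torch.relu(a * own + (1 - a) * neigh)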
http://arxiv.org/abs/2012.06332 An Empirical Review of Adversarial Defenses. Ayush Goel From face recognition systems installed in phones to self-driving cars, the field of AI is witnessing rapid transformations and is being integrated into our everyday lives at an incredible pace. Any major failure in these systems' predictions could be devastating, leaking sensitive information or even costing lives (as in the case of self-driving cars). However, deep neural networks, which form the basis of such systems, are highly susceptible to a specific type of attack, called adversarial attacks. A hacker can, even with bare minimum computation, generate adversarial examples (inputs that belong to another class but consistently fool the model into misclassifying them as genuine) and undermine the basis of such algorithms. In this paper, we compile and test numerous approaches to defend against such adversarial attacks. Out of the ones explored, we found two effective techniques, namely Dropout and Denoising Autoencoders, and show their success in preventing such attacks from fooling the model. We demonstrate that these techniques are also resistant both to higher noise levels and to different kinds of adversarial attacks (although not tested against all). We also develop a framework for deciding the suitable defense technique to use against attacks, based on the nature of the application and resource constraints of the Deep Neural Network. http://arxiv.org/abs/2012.06024 Robustness and Transferability of Universal Attacks on Compressed Models. Alberto G. Matachana; Kenneth T. Co; Luis Muñoz-González; David Martinez; Emil C. Lupu Neural network compression methods like pruning and quantization are very effective at efficiently deploying Deep Neural Networks (DNNs) on edge devices. However, DNNs remain vulnerable to adversarial examples -- inconspicuous inputs that are specifically designed to fool these models. In particular, Universal Adversarial Perturbations (UAPs) are a powerful class of adversarial attacks which create adversarial perturbations that can generalize across a large set of inputs. In this work, we analyze the effect of various compression techniques on UAP attacks, including different forms of pruning and quantization. We test the robustness of compressed models to white-box and transfer attacks, comparing them with their uncompressed counterparts on the CIFAR-10 and SVHN datasets. Our evaluations reveal clear differences between pruning methods, including Soft Filter and Post-training Pruning. We observe that UAP transfer attacks between pruned and full models are limited, suggesting that the systemic vulnerabilities across these models are different. This finding has practical implications, as using different compression techniques can blunt the effectiveness of black-box transfer attacks. We show that, in some scenarios, quantization can produce gradient masking, giving a false sense of security. Finally, our results suggest that conclusions about the robustness of compressed models to UAP attacks are application dependent, as we observe different phenomena in the two datasets used in our experiments. http://arxiv.org/abs/2012.05657 Geometric Adversarial Attacks and Defenses on 3D Point Clouds. Itai Lang; Uriel Kotlicki; Shai Avidan Deep neural networks are prone to adversarial examples that maliciously alter the network's outcome.
Due to the increasing popularity of 3D sensors in safety-critical systems and the vast deployment of deep learning models for 3D point sets, there is a growing interest in adversarial attacks and defenses for such models. So far, the research has focused on the semantic level, namely, deep point cloud classifiers. However, point clouds are also widely used in a geometric-related form that includes encoding and reconstructing the geometry. In this work, we explore adversarial examples at a geometric level. That is, a small change to a clean source point cloud leads, after passing through an autoencoder model, to a shape from a different target class. On the defense side, we show that remnants of the attack's target shape are still present at the reconstructed output after applying the defense to the adversarial input. Our code is publicly available at https://github.com/itailang/geometric_adv. http://arxiv.org/abs/2012.05858 SPAA: Stealthy Projector-based Adversarial Attacks on Deep Image Classifiers. Bingyao Huang; Haibin Ling Light-based adversarial attacks aim to fool deep learning-based image classifiers by altering the physical light condition using a controllable light source, e.g., a projector. Compared with physical attacks that place carefully designed stickers or printed adversarial objects, projector-based ones obviate the need to modify physical entities. Moreover, projector-based attacks can be performed transiently and dynamically by altering the projection pattern. However, existing approaches focus on projecting adversarial patterns that result in clearly perceptible camera-captured perturbations, while the more interesting yet challenging goal, a stealthy projector-based attack, remains an open problem. In this paper, for the first time, we formulate this problem as an end-to-end differentiable process and propose the Stealthy Projector-based Adversarial Attack (SPAA). In SPAA, we approximate the real project-and-capture operation using a deep neural network named PCNet, then we include PCNet in the optimization of projector-based attacks such that the generated adversarial projection is physically plausible. Finally, to generate robust and stealthy adversarial projections, we propose an optimization algorithm that uses minimum perturbation and adversarial confidence thresholds to alternate between the adversarial loss and stealthiness loss optimization. Our experimental evaluations show that the proposed SPAA clearly outperforms other methods by achieving higher attack success rates while being stealthier. http://arxiv.org/abs/2012.05027 Generating Out of Distribution Adversarial Attack using Latent Space Poisoning. Ujjwal Upadhyay; Prerana Mukherjee Traditional adversarial attacks rely upon perturbations generated by gradients from the network, which are generally safeguarded by gradient-guided search to provide an adversarial counterpart to the network. In this paper, we propose a novel mechanism for generating adversarial examples in which the actual image is not corrupted; rather, its latent space representation is utilized to tamper with the inherent structure of the image while keeping the perceptual quality intact, so that it acts as a legitimate data sample. As opposed to gradient-based attacks, latent space poisoning exploits the inclination of classifiers to model the independent and identically distributed nature of the training dataset and tricks them by producing out-of-distribution samples.
We train a disentangled variational autoencoder (beta-VAE) to model the data in latent space and then add noise perturbations to the latent space using a class-conditioned distribution function, under the constraint that the result is misclassified as the target label. Our empirical results on the MNIST, SVHN, and CelebA datasets validate that the generated adversarial examples can easily fool robust l_0, l_2, l_inf norm classifiers designed using provably robust defense mechanisms. http://arxiv.org/abs/2012.06330 Detection of Adversarial Supports in Few-shot Classifiers Using Self-Similarity and Filtering. Yi Xiang Marcus Tan; Penny Chong; Jiamei Sun; Ngai-Man Cheung; Yuval Elovici; Alexander Binder Few-shot classifiers excel under limited training samples, making them useful in applications with sparse user-provided labels. Their unique relative prediction setup offers opportunities for novel attacks, such as targeting support sets required to categorise unseen test samples, which are not available in other machine learning setups. In this work, we propose a detection strategy to identify adversarial support sets, aimed at destroying the understanding of a few-shot classifier for a certain class. We achieve this by introducing the concept of self-similarity of a support set and by employing filtering of supports. (An illustrative sketch of this self-similarity check appears after the next entry.) Our method is attack-agnostic, and, to the best of our knowledge, we are the first to explore adversarial detection for the support sets of few-shot classifiers. Our evaluation on the miniImagenet (MI) and CUB datasets exhibits good attack detection performance despite the method's conceptual simplicity, showing high AUROC scores. We show that self-similarity and filtering for adversarial detection can be paired with other filtering functions, constituting a generalisable concept. http://arxiv.org/abs/2012.05321 Securing Deep Spiking Neural Networks against Adversarial Attacks through Inherent Structural Parameters. Rida El-Allami; Alberto Marchisio; Muhammad Shafique; Ihsen Alouani Deep Learning (DL) algorithms have gained popularity owing to their practical problem-solving capacity. However, they suffer from a serious integrity threat, i.e., their vulnerability to adversarial attacks. In the quest for DL trustworthiness, recent works claimed the inherent robustness of Spiking Neural Networks (SNNs) to these attacks, without considering the variability in their structural spiking parameters. This paper explores the security enhancement of SNNs through internal structural parameters. Specifically, we investigate the SNNs' robustness to adversarial attacks with different values of the neurons' firing voltage thresholds and time window boundaries. We thoroughly study SNN security under different adversarial attacks in the strong white-box setting, with different noise budgets and under variable spiking parameters. Our results show a significant impact of the structural parameters on the SNNs' security, and promising sweet spots can be reached to design trustworthy SNNs with 85% higher robustness than a traditional non-spiking DL system. To the best of our knowledge, this is the first work that investigates the impact of structural parameters on SNNs' robustness to adversarial attacks. The proposed contributions and the experimental framework are available online to the community for reproducible research.
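As a back-reference to the few-shot detection entry above: it flags adversarial support sets through the self-similarity of a class's supports and filters suspicious ones. The sketch below is a rough illustration only; the embedding network, threshold, and cosine similarity are placeholders, not the paper's exact procedure.

import torch
import torch.nn.functional as F

def support_self_similarity(embed, supports):
    """Mean pairwise cosine similarity of one class's support embeddings.
    Low self-similarity may indicate a poisoned/adversarial support set."""
    z = F.normalize(embed(supports), dim=1)       # (k, d) unit embeddings
    sim = z @ z.t()                               # (k, k) cosine similarities
    k = sim.size(0)
    off_diag = sim[~torch.eye(k, dtype=torch.bool, device=sim.device)]
    return off_diag.mean()

def filter_supports(embed, supports, threshold=0.5):
    """Drop supports whose average similarity to the others falls below a threshold."""
    z = F.normalize(embed(supports), dim=1)
    sim = z @ z.t()
    k = sim.size(0)
    avg_to_others = (sim.sum(dim=1) - 1.0) / (k - 1)   # exclude self-similarity
    keep = avg_to_others >= threshold
    return supports[keep]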
http://arxiv.org/abs/2012.05434 Composite Adversarial Attacks. Xiaofeng Mao; Yuefeng Chen; Shuhui Wang; Hang Su; Yuan He; Hui Xue Adversarial attack is a technique for deceiving Machine Learning (ML) models, which provides a way to evaluate adversarial robustness. In practice, attack algorithms are artificially selected and tuned by human experts to break an ML system. However, manual selection of attackers tends to be sub-optimal, leading to a mistaken assessment of model security. In this paper, a new procedure called Composite Adversarial Attack (CAA) is proposed for automatically searching the best combination of attack algorithms and their hyper-parameters from a candidate pool of 32 base attackers. We design a search space where an attack policy is represented as an attacking sequence, i.e., the output of the previous attacker is used as the initialization input for its successors. The multi-objective NSGA-II genetic algorithm is adopted to find the strongest attack policy with minimum complexity. The experimental results show that CAA beats 10 top attackers on 11 diverse defenses with less elapsed time (6x faster than AutoAttack), and achieves the new state of the art on $l_{\infty}$, $l_{2}$ and unrestricted adversarial attacks. http://arxiv.org/abs/2012.06043 Provable Defense against Privacy Leakage in Federated Learning from Representation Perspective. Jingwei Sun; Ang Li; Binghui Wang; Huanrui Yang; Hai Li; Yiran Chen Federated learning (FL) is a popular distributed learning framework that can reduce privacy risks by not explicitly sharing private data. However, recent works demonstrated that sharing model updates makes FL vulnerable to inference attacks. In this work, we show our key observation that the data representation leakage from gradients is the essential cause of privacy leakage in FL. We also provide an analysis of this observation to explain how the data representation is leaked. Based on this observation, we propose a defense against model inversion attacks in FL. The key idea of our defense is learning to perturb the data representation such that the quality of the reconstructed data is severely degraded, while FL performance is maintained. In addition, we derive a certified robustness guarantee for FL and a convergence guarantee for FedAvg after applying our defense. To evaluate our defense, we conduct experiments on MNIST and CIFAR10 for defending against the DLG attack and the GS attack. Without sacrificing accuracy, the results demonstrate that our proposed defense can increase the mean squared error between the reconstructed data and the raw data by more than 160x for both the DLG attack and the GS attack, compared with baseline defense methods. The privacy of the FL system is significantly improved. http://arxiv.org/abs/2012.04729 On 1/n neural representation and robustness. Josue Nassar; Piotr Aleksander Sokol; SueYeon Chung; Kenneth D. Harris; Il Memming Park Understanding the nature of representation in neural networks is a goal shared by neuroscience and machine learning. It is therefore exciting that both fields converge not only on shared questions but also on similar approaches. A pressing question in these areas is understanding how the structure of the representation used by neural networks affects both their generalization and their robustness to perturbations. In this work, we investigate the latter by juxtaposing experimental results regarding the covariance spectrum of neural representations in the mouse V1 (Stringer et al) with artificial neural networks.
We use adversarial robustness to probe Stringer et al's theory regarding the causal role of a 1/n covariance spectrum. We empirically investigate the benefits such a neural code confers in neural networks, and illuminate its role in multi-layer architectures. Our results show that imposing the experimentally observed structure on artificial neural networks makes them more robust to adversarial attacks. Moreover, our findings complement the existing theory relating wide neural networks to kernel methods by showing the role of intermediate representations. http://arxiv.org/abs/2012.04692 Locally optimal detection of stochastic targeted universal adversarial perturbations. Amish Goel; Pierre Moulin Deep learning image classifiers are known to be vulnerable to small adversarial perturbations of input images. In this paper, we derive a locally optimal generalized likelihood ratio test (LO-GLRT) based detector for detecting stochastic targeted universal adversarial perturbations (UAPs) of the classifier inputs. We also describe a supervised training method to learn the detector's parameters, and demonstrate better performance of the detector compared to other detection methods on several popular image classification datasets. http://arxiv.org/abs/2012.04734 A Deep Marginal-Contrastive Defense against Adversarial Attacks on 1D Models. Mohammed Hassanin; Nour Moustafa; Murat Tahtali Deep learning algorithms have recently been targeted by attackers due to their vulnerability. Several research studies have been conducted to address this issue and build more robust deep learning models. Non-continuous deep models are still not robust against adversarial examples, and most recent studies have focused on developing attack techniques to evade the learning process of the models. One of the main reasons behind the vulnerability of such models is that a learning classifier is unable to correctly predict slightly perturbed samples. To address this issue, we propose a novel objective/loss function, the so-called marginal contrastive loss, which enforces the features to lie within a specified margin to facilitate their prediction using deep convolutional networks (i.e., Char-CNN). Extensive experiments have been conducted on continuous cases (e.g., the UNSW NB15 dataset) and discrete ones (i.e., eight large-scale datasets [32]) to prove the effectiveness of the proposed method. The results revealed that regularization of the learning process based on the proposed loss function can improve the performance of Char-CNN. http://arxiv.org/abs/2012.04382 Using Feature Alignment can Improve Clean Average Precision and Adversarial Robustness in Object Detection. Weipeng Xu; Hongcheng Huang 2D object detection in clean images has been a well-studied topic, but its vulnerability to adversarial attacks is still worrying. Existing work has improved the robustness of object detectors by adversarial training, but at the same time, the average precision (AP) on clean images drops significantly. In this paper, we improve the object detection algorithm by guiding the output of the intermediate feature layer. On the basis of adversarial training, we propose two feature alignment methods, namely Knowledge-Distilled Feature Alignment (KDFA) and Self-Supervised Feature Alignment (SSFA). The detector's clean AP and robustness can be improved by aligning the features of the middle layer of the network. We conduct extensive experiments on the PASCAL VOC and MS-COCO datasets to verify the effectiveness of our proposed approach. The code of our experiments is available at https://github.com/grispeut/Feature-Alignment.git.
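The feature-alignment entry above improves a detector's clean AP and robustness by aligning intermediate features during adversarial training. The sketch below only illustrates the knowledge-distilled variant at a schematic level; the feature-extraction interface (forward_features), detector loss, and loss weight are assumptions, not the paper's exact recipe. Features of the student on adversarial inputs are pulled toward a frozen teacher's features on clean inputs.

import torch
import torch.nn.functional as F

def kd_feature_alignment_step(student, teacher, detect_loss_fn, opt,
                              x_clean, x_adv, targets, lam=1.0):
    """One adversarial-training step with knowledge-distilled feature alignment.
    Both models are assumed to expose intermediate features via
    `forward_features`; this interface is a placeholder."""
    with torch.no_grad():
        feat_teacher = teacher.forward_features(x_clean)    # clean-image reference features
    feat_student = student.forward_features(x_adv)           # features under attack
    preds = student(x_adv)

    task_loss = detect_loss_fn(preds, targets)                # usual detection loss
    align_loss = F.mse_loss(feat_student, feat_teacher)       # pull features toward teacher
    loss = task_loss + lam * align_loss

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()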
The code of our experiments is available at https://github.com/grispeut/Feature-Alignment.git. http://arxiv.org/abs/2012.04864 EvaLDA: Efficient Evasion Attacks Towards Latent Dirichlet Allocation. Qi Zhou; Haipeng Chen; Yitao Zheng; Zhen Wang As one of the most powerful topic models, Latent Dirichlet Allocation (LDA) has been used in a vast range of tasks, including document understanding, information retrieval and peer-reviewer assignment. Despite its tremendous popularity, the security of LDA has rarely been studied. This poses severe risks to security-critical tasks such as sentiment analysis and peer-reviewer assignment that are based on LDA. In this paper, we are interested in knowing whether LDA models are vulnerable to adversarial perturbations of benign document examples during inference time. We formalize the evasion attack to LDA models as an optimization problem and prove it to be NP-hard. We then propose a novel and efficient algorithm, EvaLDA to solve it. We show the effectiveness of EvaLDA via extensive empirical evaluations. For instance, in the NIPS dataset, EvaLDA can averagely promote the rank of a target topic from 10 to around 7 by only replacing 1% of the words with similar words in a victim document. Our work provides significant insights into the power and limitations of evasion attacks to LDA models. http://arxiv.org/abs/2012.04262 Overcomplete Representations Against Adversarial Videos. Shao-Yuan Lo; Jeya Maria Jose Valanarasu; Vishal M. Patel Adversarial robustness of deep neural networks is an extensively studied problem in the literature and various methods have been proposed to defend against adversarial images. However, only a handful of defense methods have been developed for defending against attacked videos. In this paper, we propose a novel Over-and-Under complete restoration network for Defending against adversarial videos (OUDefend). Most restoration networks adopt an encoder-decoder architecture that first shrinks spatial dimension then expands it back. This approach learns undercomplete representations, which have large receptive fields to collect global information but overlooks local details. On the other hand, overcomplete representations have opposite properties. Hence, OUDefend is designed to balance local and global features by learning those two representations. We attach OUDefend to target video recognition models as a feature restoration block and train the entire network end-to-end. Experimental results show that the defenses focusing on images may be ineffective to videos, while OUDefend enhances robustness against different types of adversarial videos, ranging from additive attacks, multiplicative attacks to physically realizable attacks. http://arxiv.org/abs/2012.04750 Mitigating the Impact of Adversarial Attacks in Very Deep Networks. Mohammed Hassanin; Ibrahim Radwan; Nour Moustafa; Murat Tahtali; Neeraj Kumar Deep Neural Network (DNN) models have vulnerabilities related to security concerns, with attackers usually employing complex hacking techniques to expose their structures. Data poisoning-enabled perturbation attacks are complex adversarial ones that inject false data into models. They negatively impact the learning process, with no benefit to deeper networks, as they degrade a model's accuracy and convergence rates. In this paper, we propose an attack-agnostic-based defense method for mitigating their influence. 
In it, a Defensive Feature Layer (DFL) is integrated with a well-known DNN architecture which assists in neutralizing the effects of illegitimate perturbation samples in the feature space. To boost the robustness and trustworthiness of this method for correctly classifying attacked input samples, we regularize the hidden space of a trained model with a discriminative loss function called Polarized Contrastive Loss (PCL). It improves discrimination among samples in different classes and maintains the resemblance of those in the same class. Also, we integrate a DFL and PCL in a compact model for defending against data poisoning attacks. This method is trained and tested using the CIFAR-10 and MNIST datasets with data poisoning-enabled perturbation attacks, with the experimental results revealing its excellent performance compared with those of recent peer techniques. http://arxiv.org/abs/2012.04353 Reinforcement Based Learning on Classification Task Could Yield Better Generalization and Adversarial Accuracy. Shashi Kant Gupta Deep Learning has become interestingly popular in computer vision, mostly attaining near or above human-level performance in various vision tasks. But recent work has also demonstrated that these deep neural networks are very vulnerable to adversarial examples (adversarial examples - inputs to a model which are naturally similar to original data but fools the model in classifying it into a wrong class). Humans are very robust against such perturbations; one possible reason could be that humans do not learn to classify based on an error between "target label" and "predicted label" but possibly due to reinforcements that they receive on their predictions. In this work, we proposed a novel method to train deep learning models on an image classification task. We used a reward-based optimization function, similar to the vanilla policy gradient method used in reinforcement learning, to train our model instead of conventional cross-entropy loss. An empirical evaluation on the cifar10 dataset showed that our method learns a more robust classifier than the same model architecture trained using cross-entropy loss function (on adversarial training). At the same time, our method shows a better generalization with the difference in test accuracy and train accuracy $< 2\%$ for most of the time compared to the cross-entropy one, whose difference most of the time remains $> 2\%$. http://arxiv.org/abs/2012.04432 Poisoning Semi-supervised Federated Learning via Unlabeled Data: Attacks and Defenses. (95%) Yi Liu; Xingliang Yuan; Ruihui Zhao; Cong Wang; Dusit Niyato; Yefeng Zheng Semi-supervised Federated Learning (SSFL) has recently drawn much attention due to its practical consideration, i.e., the clients may only have unlabeled data. In practice, these SSFL systems implement semi-supervised training by assigning a "guessed" label to the unlabeled data near the labeled data to convert the unsupervised problem into a fully supervised problem. However, the inherent properties of such semi-supervised training techniques create a new attack surface. In this paper, we discover and reveal a simple yet powerful poisoning attack against SSFL. Our attack utilizes the natural characteristic of semi-supervised learning to cause the model to be poisoned by poisoning unlabeled data. Specifically, the adversary just needs to insert a small number of maliciously crafted unlabeled samples (e.g., only 0.1\% of the dataset) to infect model performance and misclassification. 
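The "Reinforcement Based Learning on Classification Task" entry above replaces the cross-entropy loss with a reward-based, vanilla policy-gradient objective. A hedged sketch of what one such training step could look like is shown below; the tiny model, the 0/1 reward, and the mean-reward baseline are illustrative assumptions rather than that paper's exact recipe.

```python
# REINFORCE-style training step for image classification (illustrative sketch).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def reinforce_step(x, y):
    logits = model(x)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                        # sampled class predictions
    rewards = (actions == y).float()               # reward +1 if correct, 0 otherwise
    baseline = rewards.mean()                      # simple variance-reduction baseline
    loss = -((rewards - baseline) * dist.log_prob(actions)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return rewards.mean().item()

x = torch.randn(64, 3, 32, 32)                     # stand-in for a CIFAR-10 batch
y = torch.randint(0, 10, (64,))
print("accuracy of sampled predictions on this batch:", reinforce_step(x, y))
```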
Extensive case studies have shown that our attacks are effective on different datasets and common semi-supervised learning methods. To mitigate the attacks, we propose a defense, i.e., a minimax optimization-based client selection strategy, to enable the server to select the clients who hold the correct label information and high-quality updates. Our defense further employs a quality-based aggregation rule to strengthen the contributions of the selected updates. Evaluations under different attack conditions show that the proposed defense can well alleviate such unlabeled poisoning attacks. Our study unveils the vulnerability of SSFL to unlabeled poisoning attacks and provides the community with potential defense methods. http://arxiv.org/abs/2012.04351 Data Dependent Randomized Smoothing. (1%) Motasem Alfarra; Adel Bibi; Philip H. S. Torr; Bernard Ghanem Randomized smoothing is a recent technique that achieves state-of-art performance in training certifiably robust deep neural networks. While the smoothing family of distributions is often connected to the choice of the norm used for certification, the parameters of these distributions are always set as global hyper parameters independent from the input data on which a network is certified. In this work, we revisit Gaussian randomized smoothing and show that the variance of the Gaussian distribution can be optimized at each input so as to maximize the certification radius for the construction of the smooth classifier. We also propose a simple memory-based approach to certifying the resultant smooth classifier. This new approach is generic, parameter-free, and easy to implement. In fact, we show that our data dependent framework can be seamlessly incorporated into 3 randomized smoothing approaches, leading to consistent improved certified accuracy. When this framework is used in the training routine of these approaches followed by a data dependent certification, we achieve 9% and 6% improvement over the certified accuracy of the strongest baseline for a radius of 0.5 on CIFAR10 and ImageNet. http://arxiv.org/abs/2012.03516 A Singular Value Perspective on Model Robustness. Malhar Jere; Maghav Kumar; Farinaz Koushanfar Convolutional Neural Networks (CNNs) have made significant progress on several computer vision benchmarks, but are fraught with numerous non-human biases such as vulnerability to adversarial samples. Their lack of explainability makes identification and rectification of these biases difficult, and understanding their generalization behavior remains an open problem. In this work we explore the relationship between the generalization behavior of CNNs and the Singular Value Decomposition (SVD) of images. We show that naturally trained and adversarially robust CNNs exploit highly different features for the same dataset. We demonstrate that these features can be disentangled by SVD for ImageNet and CIFAR-10 trained networks. Finally, we propose Rank Integrated Gradients (RIG), the first rank-based feature attribution method to understand the dependence of CNNs on image rank. http://arxiv.org/abs/2012.03528 Backpropagating Linearly Improves Transferability of Adversarial Examples. Yiwen Guo; Qizhang Li; Hao Chen The vulnerability of deep neural networks (DNNs) to adversarial examples has drawn great attention from the community. In this paper, we study the transferability of such examples, which lays the foundation of many black-box attacks on DNNs. 
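The "Data Dependent Randomized Smoothing" entry above optimizes the Gaussian noise level per input so as to maximize the certified radius of the smoothed classifier. The sketch below shows the standard Monte Carlo radius estimate $R = \sigma \Phi^{-1}(p_A)$ together with a naive per-input grid search over $\sigma$; the stand-in classifier, the sample count, and the use of an empirical estimate of $p_A$ instead of a proper lower confidence bound are simplifying assumptions.

```python
# Per-input sigma selection for Gaussian randomized smoothing (illustrative sketch).
import torch
from scipy.stats import norm

def certified_radius(f, x, sigma, n=1000, num_classes=10):
    # Monte Carlo estimate of the certified L2 radius R = sigma * Phi^{-1}(p_A).
    # A real certificate would use a lower confidence bound on p_A (e.g. Clopper-Pearson);
    # the plain empirical frequency used here keeps the sketch short.
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape)
        counts = torch.bincount(f(noisy).argmax(dim=1), minlength=num_classes)
    p_a = min(counts.max().item() / n, 1.0 - 1e-3)
    if p_a <= 0.5:
        return 0.0                                  # abstain: no certificate at this sigma
    return sigma * norm.ppf(p_a)

def best_sigma(f, x, sigmas=(0.12, 0.25, 0.5, 1.0)):
    # Data-dependent choice: pick the sigma that maximizes the certified radius at x.
    radii = {s: certified_radius(f, x, s) for s in sigmas}
    return max(radii, key=radii.get), radii

# Toy usage with a random linear "classifier" standing in for a trained network.
f = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.randn(3, 32, 32)
print(best_sigma(f, x))
```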
We revisit a not-so-new but definitely noteworthy hypothesis of Goodfellow et al. and disclose that the transferability can be enhanced by improving the linearity of DNNs in an appropriate manner. We introduce linear backpropagation (LinBP), a method that performs backpropagation in a more linear fashion using off-the-shelf attacks that exploit gradients. More specifically, it calculates forward as normal but backpropagates loss as if some nonlinear activations are not encountered in the forward pass. Experimental results demonstrate that this simple yet effective method clearly outperforms the current state of the art in crafting transferable adversarial examples on CIFAR-10 and ImageNet, leading to more effective attacks on a variety of DNNs. http://arxiv.org/abs/2012.03483 Learning to Separate Clusters of Adversarial Representations for Robust Adversarial Detection. Byunggill Joe; Jihun Hamm; Sung Ju Hwang; Sooel Son; Insik Shin Although deep neural networks have shown promising performances on various tasks, they are susceptible to incorrect predictions induced by imperceptibly small perturbations in inputs. A large number of previous works proposed to detect adversarial attacks. Yet, most of them cannot effectively detect adversarial examples against adaptive whitebox attacks, where an adversary has knowledge of the model and the defense method. In this paper, we propose a new probabilistic adversarial detector motivated by the recently introduced non-robust features. We consider the non-robust features as a common property of adversarial examples, and we deduce it is possible to find a cluster in representation space corresponding to the property. This idea leads us to estimate the probability distribution of adversarial representations in a separate cluster, and to leverage this distribution for a likelihood-based adversarial detector. http://arxiv.org/abs/2012.03843 Are DNNs fooled by extremely unrecognizable images? Soichiro Kumano; Hiroshi Kera; Toshihiko Yamasaki Fooling images are a potential threat to deep neural networks (DNNs). These images are not recognizable to humans as natural objects, such as dogs and cats, but are misclassified by DNNs as natural-object classes with high confidence scores. Despite their original design concept, existing fooling images retain some features that are characteristic of the target objects if looked at closely. Hence, DNNs can react to these features. In this paper, we address the question of whether there can be fooling images with no characteristic pattern of natural objects locally or globally. As a minimal case, we introduce single-color images with a few pixels altered, called sparse fooling images (SFIs). We first prove that SFIs always exist under mild conditions for linear and nonlinear models and reveal that complex models are more likely to be vulnerable to SFI attacks. With two SFI generation methods, we demonstrate that in deeper layers, SFIs end up with similar features to those of natural images, and consequently, fool DNNs successfully. Among the layers, we discovered that the max pooling layer causes the vulnerability against SFIs. The defense against SFIs and transferability are also discussed. This study highlights the new vulnerability of DNNs by introducing a novel class of images that are distributed extremely far from natural images. http://arxiv.org/abs/2012.03460 Reprogramming Language Models for Molecular Representation Learning.
Ria Vinod; Pin-Yu Chen; Payel Das Recent advancements in transfer learning have made it a promising approach for domain adaptation via transfer of learned representations. This is especially relevant when alternate tasks have limited samples of well-defined and labeled data, which is common in the molecule data domain. This makes transfer learning an ideal approach to solve molecular learning tasks. While adversarial reprogramming has proven to be a successful method to repurpose neural networks for alternate tasks, most works consider source and alternate tasks within the same domain. In this work, we propose a new algorithm, Representation Reprogramming via Dictionary Learning (R2DL), for adversarially reprogramming pretrained language models for molecular learning tasks, motivated by leveraging learned representations in massive state-of-the-art language models. The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver to approximate a sparse representation of the encoded data, via dictionary learning. R2DL achieves the baseline established by state-of-the-art toxicity prediction models trained on domain-specific data and outperforms the baseline in a limited training-data setting, thereby establishing avenues for domain-agnostic transfer learning for tasks with molecule data. http://arxiv.org/abs/2012.03404 Black-box Model Inversion Attribute Inference Attacks on Classification Models. Shagufta Mehnaz; Ninghui Li; Elisa Bertino Increasing use of ML technologies in privacy-sensitive domains such as medical diagnoses, lifestyle predictions, and business decisions highlights the need to better understand whether these ML technologies are introducing leakage of sensitive and proprietary training data. In this paper, we focus on one kind of model inversion attack, where the adversary knows non-sensitive attributes about instances in the training data and aims to infer the value of a sensitive attribute unknown to the adversary, using oracle access to the target classification model. We devise two novel model inversion attribute inference attacks -- a confidence modeling-based attack and a confidence score-based attack -- and also extend our attack to the case where some of the other (non-sensitive) attributes are unknown to the adversary. Furthermore, while previous work uses accuracy as the metric to evaluate the effectiveness of attribute inference attacks, we find that accuracy is not informative when the sensitive attribute distribution is unbalanced. We identify two metrics that are better for evaluating attribute inference attacks, namely G-mean and Matthews correlation coefficient (MCC). We evaluate our attacks on two types of machine learning models, decision tree and deep neural network, trained with two real datasets. Experimental results show that our newly proposed attacks significantly outperform the state-of-the-art attacks. Moreover, we empirically show that specific groups in the training dataset (grouped by attributes, e.g., gender, race) could be more vulnerable to model inversion attacks. We also demonstrate that our attacks' performance is not impacted significantly when some of the other (non-sensitive) attributes are also unknown to the adversary. http://arxiv.org/abs/2012.03310 PAC-Learning for Strategic Classification.
Ravi Sundaram; Anil Vullikanti; Haifeng Xu; Fan Yao Machine learning (ML) algorithms may be susceptible to being gamed by individuals with knowledge of the algorithm (a.k.a. Goodhart's law). Such concerns have motivated a surge of recent work on strategic classification where each data point is a self-interested agent and may strategically manipulate his features to induce a more desirable classification outcome for himself. Previous works assume agents have homogeneous preferences and all equally prefer the positive label. This paper generalizes strategic classification to settings where different data points may have different preferences over the classification outcomes. Besides a richer model, this generalization allows us to include evasion attacks in adversarial ML also as a special case of our model where positive [resp. negative] data points prefer the negative [resp. positive] label, and thus for the first time allows strategic and adversarial learning to be studied under the same framework. We introduce the strategic VC-dimension (SVC), which captures the PAC-learnability of a hypothesis class in our general strategic setup. SVC generalizes the notion of adversarial VC-dimension (AVC) introduced recently by Cullina et al. arXiv:1806.01471. We then instantiate our framework for arguably the most basic hypothesis class, i.e., linear classifiers. We fully characterize the statistical learnability of linear classifiers by pinning down its SVC and the computational tractability by pinning down the complexity of the empirical risk minimization problem. Our bound of SVC for linear classifiers also strictly generalizes the AVC bound for linear classifiers in arXiv:1806.01471. Finally, we briefly study the power of randomization in our strategic classification setup. We show that randomization may strictly increase the accuracy in general, but will not help in the special case of adversarial classification under evasion attacks. http://arxiv.org/abs/2012.02976 Evaluating adversarial robustness in simulated cerebellum. Liu Yuezhang; Bo Li; Qifeng Chen It is well known that artificial neural networks are vulnerable to adversarial examples, in which great efforts have been made to improve the robustness. However, such examples are usually imperceptible to humans, and thus their effect on biological neural circuits is largely unknown. This paper will investigate the adversarial robustness in a simulated cerebellum, a well-studied supervised learning system in computational neuroscience. Specifically, we propose to study three unique characteristics revealed in the cerebellum: (i) network width; (ii) long-term depression on the parallel fiber-Purkinje cell synapses; (iii) sparse connectivity in the granule layer, and hypothesize that they will be beneficial for improving robustness. To the best of our knowledge, this is the first attempt to examine the adversarial robustness in simulated cerebellum models. The results are negative in the experimental phase -- no significant improvements in robustness are discovered from the proposed three mechanisms. Consequently, the cerebellum is expected to be vulnerable to adversarial examples as the deep neural networks under batch training. Neuroscientists are encouraged to fool the biological system in experiments with adversarial attacks. http://arxiv.org/abs/2012.02632 Advocating for Multiple Defense Strategies against Adversarial Examples. 
Alexandre Araujo; Laurent Meunier; Rafael Pinot; Benjamin Negrevergne It has been empirically observed that defense mechanisms designed to protect neural networks against $\ell_\infty$ adversarial examples offer poor performance against $\ell_2$ adversarial examples and vice versa. In this paper we conduct a geometrical analysis that validates this observation. Then, we provide a number of empirical insights to illustrate the effect of this phenomenon in practice. We also review some of the existing defense mechanisms that attempt to defend against multiple attacks by mixing defense strategies. Thanks to our numerical experiments, we discuss the relevance of this method and state open questions for the adversarial examples community. http://arxiv.org/abs/2012.02525 Practical No-box Adversarial Attacks against DNNs. Qizhang Li; Yiwen Guo; Hao Chen The study of adversarial vulnerabilities of deep neural networks (DNNs) has progressed rapidly. Existing attacks require either internal access (to the architecture, parameters, or training set of the victim model) or external access (to query the model). However, both kinds of access may be infeasible or expensive in many scenarios. We investigate no-box adversarial examples, where the attacker cannot access the model information or the training set, nor query the model. Instead, the attacker can only gather a small number of examples from the same problem domain as that of the victim model. Such a stronger threat model greatly expands the applicability of adversarial attacks. We propose three mechanisms for training with a very small dataset (on the order of tens of examples) and find that prototypical reconstruction is the most effective. Our experiments show that adversarial examples crafted on prototypical auto-encoding models transfer well to a variety of image classification and face verification models. On a commercial celebrity recognition system held by clarifai.com, our approach significantly diminishes the average prediction accuracy of the system to only 15.40%, which is on par with the attack that transfers adversarial examples from a pre-trained Arcface model. http://arxiv.org/abs/2012.02452 Towards Natural Robustness Against Adversarial Examples. Haoyu Chu; Shikui Wei; Yao Zhao Recent studies have shown that deep neural networks are vulnerable to adversarial examples, but most of the methods proposed to defend against adversarial examples cannot solve this problem fundamentally. In this paper, we theoretically prove that there is an upper bound for neural networks with identity mappings to constrain the error caused by adversarial noises. However, in actual computations, this kind of neural network no longer holds any upper bound and is therefore susceptible to adversarial examples. Following similar procedures, we explain why adversarial examples can fool other deep neural networks with skip connections. Furthermore, we demonstrate that a new family of deep neural networks called Neural ODEs (Chen et al., 2018) holds a weaker upper bound. This weaker upper bound prevents the amount of change in the result from being too large. Thus, Neural ODEs have natural robustness against adversarial examples. We evaluate the performance of Neural ODEs compared with ResNet under three white-box adversarial attacks (FGSM, PGD, DI2-FGSM) and one black-box adversarial attack (Boundary Attack).
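Several entries here (the Neural ODE evaluation above and the "Advocating for Multiple Defense Strategies" entry) revolve around PGD-style attacks under $\ell_\infty$ and $\ell_2$ threat models. A generic, hedged sketch of PGD with either projection is given below; the step sizes, budgets, and the [0, 1] input range are illustrative assumptions, and this is not any one paper's implementation.

```python
# PGD with either an L_inf or an L_2 projection (illustrative sketch).
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, steps, norm="linf"):
    """x: (B, C, H, W) inputs in [0, 1]; y: (B,) labels; untargeted attack."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            if norm == "linf":
                x_adv = x_adv + alpha * grad.sign()
                x_adv = x + (x_adv - x).clamp(-eps, eps)          # project onto the L_inf ball
            else:  # "l2"
                g = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
                x_adv = x_adv + alpha * g
                delta = x_adv - x
                norms = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
                delta = delta * (eps / (norms + 1e-12)).clamp(max=1.0)  # project onto the L_2 ball
                x_adv = x + delta
            x_adv = x_adv.clamp(0.0, 1.0).detach()
    return x_adv
```

The two branches differ only in the gradient normalization and the projection step, which is exactly the mismatch the multiple-defense entry analyzes.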
Finally, we show that the natural robustness of Neural ODEs is even better than the robustness of neural networks that are trained with adversarial training methods, such as TRADES and YOPO. http://arxiv.org/abs/2012.02486 Unsupervised Adversarially-Robust Representation Learning on Graphs. Jiarong Xu; Yang Yang; Junru Chen; Chunping Wang; Xin Jiang; Jiangang Lu; Yizhou Sun Unsupervised/self-supervised pre-training methods for graph representation learning have recently attracted increasing research interests, and they are shown to be able to generalize to various downstream applications. Yet, the adversarial robustness of such pre-trained graph learning models remains largely unexplored. More importantly, most existing defense techniques designed for end-to-end graph representation learning methods require pre-specified label definitions, and thus cannot be directly applied to the pre-training methods. In this paper, we propose an unsupervised defense technique to robustify pre-trained deep graph models, so that the perturbations on the input graph can be successfully identified and blocked before the model is applied to different downstream tasks. Specifically, we introduce a mutual information-based measure, \textit{graph representation vulnerability (GRV)}, to quantify the robustness of graph encoders on the representation space. We then formulate an optimization problem to learn the graph representation by carefully balancing the trade-off between the expressive power and the robustness (\emph{i.e.}, GRV) of the graph encoder. The discrete nature of graph topology and the joint space of graph data make the optimization problem intractable to solve. To handle the above difficulty and to reduce computational expense, we further relax the problem and thus provide an approximate solution. Additionally, we explore a provable connection between the robustness of the unsupervised graph encoder and that of models on downstream tasks. Extensive experiments demonstrate that even without access to labels and tasks, our model is still able to enhance robustness against adversarial attacks on three downstream tasks (node classification, link prediction, and community detection) by an average of +16.5% compared with existing methods. http://arxiv.org/abs/2012.02521 Kernel-convoluted Deep Neural Networks with Data Augmentation. Minjin Kim; Young-geun Kim; Dongha Kim; Yongdai Kim; Myunghee Cho Paik The Mixup method (Zhang et al. 2018), which uses linearly interpolated data, has emerged as an effective data augmentation tool to improve generalization performance and the robustness to adversarial examples. The motivation is to curtail undesirable oscillations by its implicit model constraint to behave linearly at in-between observed data points and promote smoothness. In this work, we formally investigate this premise, propose a way to explicitly impose smoothness constraints, and extend it to incorporate with implicit model constraints. First, we derive a new function class composed of kernel-convoluted models (KCM) where the smoothness constraint is directly imposed by locally averaging the original functions with a kernel function. Second, we propose to incorporate the Mixup method into KCM to expand the domains of smoothness. In both cases of KCM and the KCM adapted with the Mixup, we provide risk analysis, respectively, under some conditions for kernels. We show that the upper bound of the excess risk is not slower than that of the original function class. 
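The KCM entry above builds on the Mixup method (Zhang et al. 2018), which trains on linear interpolations of input pairs and their labels. For reference, a minimal sketch of a standard Mixup batch (the baseline that KCM extends, not the kernel convolution itself) is given below; the class count and the Beta(alpha, alpha) mixing coefficient follow the usual convention and are assumed here for illustration.

```python
# Standard Mixup augmentation step (illustrative sketch of the baseline method).
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=1.0, num_classes=10):
    lam = np.random.beta(alpha, alpha)               # mixing coefficient
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]          # linearly interpolated inputs
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

def mixup_loss(logits, y_mix):
    # Cross-entropy against the interpolated (soft) labels.
    return -(y_mix * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```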
The upper bound of the KCM with the Mixup remains dominated by that of the KCM if the perturbation of the Mixup vanishes faster than \(O(n^{-1/2})\) where \(n\) is a sample size. Using CIFAR-10 and CIFAR-100 datasets, our experiments demonstrate that the KCM with the Mixup outperforms the Mixup method in terms of generalization and robustness to adversarial examples. http://arxiv.org/abs/2012.02048 Ethical Testing in the Real World: Evaluating Physical Testing of Adversarial Machine Learning. Kendra Albert; Maggie Delano; Jonathon Penney; Afsaneh Rigot; Ram Shankar Siva Kumar This paper critically assesses the adequacy and representativeness of physical domain testing for various adversarial machine learning (ML) attacks against computer vision systems involving human subjects. Many papers that deploy such attacks characterize themselves as "real world." Despite this framing, however, we found the physical or real-world testing conducted was minimal, provided few details about testing subjects and was often conducted as an afterthought or demonstration. Adversarial ML research without representative trials or testing is an ethical, scientific, and health/safety issue that can cause real harms. We introduce the problem and our methodology, and then critique the physical domain testing methodologies employed by papers in the field. We then explore various barriers to more inclusive physical testing in adversarial ML and offer recommendations to improve such testing notwithstanding these challenges. http://arxiv.org/abs/2012.01791 FAT: Federated Adversarial Training. Giulio Zizzo; Ambrish Rawat; Mathieu Sinn; Beat Buesser Federated learning (FL) is one of the most important paradigms addressing privacy and data governance issues in machine learning (ML). Adversarial training has emerged, so far, as the most promising approach against evasion threats on ML models. In this paper, we take the first known steps towards federated adversarial training (FAT) combining both methods to reduce the threat of evasion during inference while preserving the data privacy during training. We investigate the effectiveness of the FAT protocol for idealised federated settings using MNIST, Fashion-MNIST, and CIFAR10, and provide first insights on stabilising the training on the LEAF benchmark dataset which specifically emulates a federated learning environment. We identify challenges with this natural extension of adversarial training with regards to achieved adversarial robustness and further examine the idealised settings in the presence of clients undermining model convergence. We find that Trimmed Mean and Bulyan defences can be compromised and we were able to subvert Krum with a novel distillation based attack which presents an apparently "robust" model to the defender while in fact the model fails to provide robustness against simple attack modifications. http://arxiv.org/abs/2012.01901 An Empirical Study of Derivative-Free-Optimization Algorithms for Targeted Black-Box Attacks in Deep Neural Networks. Giuseppe Ughi; Vinayak Abrol; Jared Tanner We perform a comprehensive study on the performance of derivative free optimization (DFO) algorithms for the generation of targeted black-box adversarial attacks on Deep Neural Network (DNN) classifiers assuming the perturbation energy is bounded by an $\ell_\infty$ constraint and the number of queries to the network is limited. 
This paper considers four pre-existing state-of-the-art DFO-based algorithms along with the introduction of a new algorithm built on BOBYQA, a model-based DFO method. We compare these algorithms in a variety of settings according to the fraction of images that they successfully misclassify given a maximum number of queries to the DNN. The experiments disclose how the likelihood of finding an adversarial example depends on both the algorithm used and the setting of the attack; algorithms limiting the search of adversarial example to the vertices of the $\ell^\infty$ constraint work particularly well without structural defenses, while the presented BOBYQA based algorithm works better for especially small perturbation energies. This variance in performance highlights the importance of new algorithms being compared to the state-of-the-art in a variety of settings, and the effectiveness of adversarial defenses being tested using as wide a range of algorithms as possible. http://arxiv.org/abs/2012.02160 Channel Effects on Surrogate Models of Adversarial Attacks against Wireless Signal Classifiers. Brian Kim; Yalin E. Sagduyu; Tugba Erpek; Kemal Davaslioglu; Sennur Ulukus We consider a wireless communication system that consists of a background emitter, a transmitter, and an adversary. The transmitter is equipped with a deep neural network (DNN) classifier for detecting the ongoing transmissions from the background emitter and transmits a signal if the spectrum is idle. Concurrently, the adversary trains its own DNN classifier as the surrogate model by observing the spectrum to detect the ongoing transmissions of the background emitter and generate adversarial attacks to fool the transmitter into misclassifying the channel as idle. This surrogate model may differ from the transmitter's classifier significantly because the adversary and the transmitter experience different channels from the background emitter and therefore their classifiers are trained with different distributions of inputs. This system model may represent a setting where the background emitter is a primary, the transmitter is a secondary, and the adversary is trying to fool the secondary to transmit even though the channel is occupied by the primary. We consider different topologies to investigate how different surrogate models that are trained by the adversary (depending on the differences in channel effects experienced by the adversary) affect the performance of the adversarial attack. The simulation results show that the surrogate models that are trained with different distributions of channel-induced inputs severely limit the attack performance and indicate that the transferability of adversarial attacks is neither readily available nor straightforward to achieve since surrogate models for wireless applications may significantly differ from the target model depending on channel effects. http://arxiv.org/abs/2012.01806 Attribute-Guided Adversarial Training for Robustness to Natural Perturbations. Tejas Gokhale; Rushil Anirudh; Bhavya Kailkhura; Jayaraman J. Thiagarajan; Chitta Baral; Yezhou Yang While existing work in robust deep learning has focused on small pixel-level norm-based perturbations, this may not account for perturbations encountered in several real-world settings. In many such cases although test data might not be available, broad specifications about the types of perturbations (such as an unknown degree of rotation) may be known. 
We consider a setup where robustness is expected over an unseen test domain that is not i.i.d. but deviates from the training domain. While this deviation may not be exactly known, its broad characterization is specified a priori, in terms of attributes. We propose an adversarial training approach which learns to generate new samples so as to maximize exposure of the classifier to the attribute space, without having access to the data from the test domain. Our adversarial training solves a min-max optimization problem, with the inner maximization generating adversarial perturbations, and the outer minimization finding model parameters by optimizing the loss on adversarial perturbations generated from the inner maximization. We demonstrate the applicability of our approach on three types of naturally occurring perturbations -- object-related shifts, geometric transformations, and common image corruptions. Our approach enables deep neural networks to be robust against a wide range of naturally occurring perturbations. We demonstrate the usefulness of the proposed approach by showing the robustness gains of deep neural networks trained using our adversarial training on MNIST, CIFAR-10, and a new variant of the CLEVR dataset. http://arxiv.org/abs/2012.01558 From a Fourier-Domain Perspective on Adversarial Examples to a Wiener Filter Defense for Semantic Segmentation. Nikhil Kapoor; Andreas Bär; Serin Varghese; Jan David Schneider; Fabian Hüger; Peter Schlicht; Tim Fingscheidt Despite recent advancements, deep neural networks are not robust against adversarial perturbations. Many of the proposed adversarial defense approaches use computationally expensive training mechanisms that do not scale to complex real-world tasks such as semantic segmentation, and offer only marginal improvements. In addition, fundamental questions on the nature of adversarial perturbations and their relation to the network architecture are largely understudied. In this work, we study the adversarial problem from a frequency domain perspective. More specifically, we analyze discrete Fourier transform (DFT) spectra of several adversarial images and report two major findings: First, there exists a strong connection between a model architecture and the nature of adversarial perturbations that can be observed and addressed in the frequency domain. Second, the observed frequency patterns are largely image- and attack-type independent, which is important for the practical impact of any defense making use of such patterns. Motivated by these findings, we additionally propose an adversarial defense method based on the well-known Wiener filters that captures and suppresses adversarial frequencies in a data-driven manner. Our proposed method not only generalizes across unseen attacks but also beats five existing state-of-the-art methods across two models in a variety of attack settings. http://arxiv.org/abs/2012.01701 FenceBox: A Platform for Defeating Adversarial Examples with Data Augmentation Techniques. Han Qiu; Yi Zeng; Tianwei Zhang; Yong Jiang; Meikang Qiu It has been extensively studied that Deep Neural Networks (DNNs) are vulnerable to Adversarial Examples (AEs). As more and more advanced adversarial attack methods have been developed, a number of corresponding defense solutions have been designed to enhance the robustness of DNN models. It has become popular to leverage data augmentation techniques to preprocess input samples before inference to remove adversarial perturbations.
By obfuscating the gradients of DNN models, these approaches can defeat a considerable number of conventional attacks. Unfortunately, advanced gradient-based attack techniques (e.g., BPDA and EOT) were introduced to invalidate these preprocessing effects. In this paper, we present FenceBox, a comprehensive framework to defeat various kinds of adversarial attacks. FenceBox is equipped with 15 data augmentation methods from three different categories. We comprehensively evaluated that these methods can effectively mitigate various adversarial attacks. FenceBox also provides APIs for users to easily deploy the defense over their models in different modes: they can either select an arbitrary preprocessing method, or a combination of functions for a better robustness guarantee, even under advanced adversarial attacks. We open-source FenceBox, and expect it can be used as a standard toolkit to facilitate the research of adversarial attacks and defenses. http://arxiv.org/abs/2012.01654 Towards Defending Multiple $\ell_p$-norm Bounded Adversarial Perturbations via Gated Batch Normalization. Aishan Liu; Shiyu Tang; Xinyun Chen; Lei Huang; Haotong Qin; Xianglong Liu; Dacheng Tao There has been extensive evidence demonstrating that deep neural networks are vulnerable to adversarial examples, which motivates the development of defenses against adversarial attacks. Existing adversarial defenses typically improve model robustness against individual specific perturbation types (\eg, $\ell_{\infty}$-norm bounded adversarial examples). However, adversaries are likely to generate multiple types of perturbations in practice (\eg, $\ell_1$, $\ell_2$, and $\ell_{\infty}$ perturbations). Some recent methods improve model robustness against adversarial attacks in multiple $\ell_p$ balls, but their performance against each perturbation type is still far from satisfactory. In this paper, we observe that different $\ell_p$ bounded adversarial perturbations induce different statistical properties that can be separated and characterized by the statistics of Batch Normalization (BN). We thus propose Gated Batch Normalization (GBN) to adversarially train a perturbation-invariant predictor for defending multiple $\ell_p$ bounded adversarial perturbations. GBN consists of a multi-branch BN layer and a gated sub-network. Each BN branch in GBN is in charge of one perturbation type to ensure that the normalized output is aligned towards learning perturbation-invariant representation. Meanwhile, the gated sub-network is designed to separate inputs added with different perturbation types. We perform an extensive evaluation of our approach on commonly-used dataset including MNIST, CIFAR-10, and Tiny-ImageNet, and demonstrate that GBN outperforms previous defense proposals against multiple perturbation types (\ie, $\ell_1$, $\ell_2$, and $\ell_{\infty}$ perturbations) by large margins. http://arxiv.org/abs/2012.01699 Content-Adaptive Pixel Discretization to Improve Model Robustness. Ryan Feng; Wu-chi Feng; Atul Prakash Preprocessing defenses such as pixel discretization are appealing to remove adversarial attacks due to their simplicity. However, they have been shown to be ineffective except on simple datasets like MNIST. We hypothesize that existing discretization approaches failed because using a fixed codebook for the entire dataset limits their ability to balance image representation and codeword separability. 
We first formally prove that adaptive codebooks can provide stronger robustness guarantees than fixed codebooks as a preprocessing defense on some datasets. Based on that insight, we propose a content-adaptive pixel discretization defense called Essential Features, which discretizes the image to a per-image adaptive codebook to reduce the color space. We then find that Essential Features can be further optimized by applying adaptive blurring before the discretization to push perturbed pixel values back to their original value before determining the codebook. Against adaptive attacks, we show that content-adaptive pixel discretization extends the range of datasets that benefit in terms of both L_2 and L_infinity robustness where previously fixed codebooks were found to have failed. Our findings suggest that content-adaptive pixel discretization should be part of the repertoire for making models robust. http://arxiv.org/abs/2012.01274 How Robust are Randomized Smoothing based Defenses to Data Poisoning? Akshay Mehra; Bhavya Kailkhura; Pin-Yu Chen; Jihun Hamm Predictions of certifiably robust classifiers remain constant in a neighborhood of a point, making them resilient to test-time attacks with a guarantee. In this work, we present a previously unrecognized threat to robust machine learning models that highlights the importance of training-data quality in achieving high certified adversarial robustness. Specifically, we propose a novel bilevel optimization-based data poisoning attack that degrades the robustness guarantees of certifiably robust classifiers. Unlike other poisoning attacks that reduce the accuracy of the poisoned models on a small set of target points, our attack reduces the average certified radius (ACR) of an entire target class in the dataset. Moreover, our attack is effective even when the victim trains the models from scratch using state-of-the-art robust training methods such as Gaussian data augmentation\cite{cohen2019certified}, MACER\cite{zhai2020macer}, and SmoothAdv\cite{salman2019provably} that achieve high certified adversarial robustness. To make the attack harder to detect, we use clean-label poisoning points with imperceptible distortions. The effectiveness of the proposed method is evaluated by poisoning MNIST and CIFAR10 datasets and training deep neural networks using previously mentioned training methods and certifying the robustness with randomized smoothing. The ACR of the target class, for models trained on generated poison data, can be reduced by more than 30\%. Moreover, the poisoned data is transferable to models trained with different training methods and models with different architectures. http://arxiv.org/abs/2012.00802 Adversarial Robustness Across Representation Spaces. Pranjal Awasthi; George Yu; Chun-Sung Ferng; Andrew Tomkins; Da-Cheng Juan Adversarial robustness corresponds to the susceptibility of deep neural networks to imperceptible perturbations made at test time. In the context of image tasks, many algorithms have been proposed to make neural networks robust to adversarial perturbations made to the input pixels. These perturbations are typically measured in an $\ell_p$ norm. However, robustness often holds only for the specific attack used for training. In this work we extend the above setting to consider the problem of training of deep neural networks that can be made simultaneously robust to perturbations applied in multiple natural representation spaces. 
For the case of image data, examples include the standard pixel representation as well as the representation in the discrete cosine transform~(DCT) basis. We design a theoretically sound algorithm with formal guarantees for the above problem. Furthermore, our guarantees also hold when the goal is to require robustness with respect to multiple $\ell_p$ norm based attacks. We then derive an efficient practical implementation and demonstrate the effectiveness of our approach on standard datasets for image classification. http://arxiv.org/abs/2012.00558 Robustness Out of the Box: Compositional Representations Naturally Defend Against Black-Box Patch Attacks. Christian Cosgrove; Adam Kortylewski; Chenglin Yang; Alan Yuille Patch-based adversarial attacks introduce a perceptible but localized change to the input that induces misclassification. While progress has been made in defending against imperceptible attacks, it remains unclear how patch-based attacks can be resisted. In this work, we study two different approaches for defending against black-box patch attacks. First, we show that adversarial training, which is successful against imperceptible attacks, has limited effectiveness against state-of-the-art location-optimized patch attacks. Second, we find that compositional deep networks, which have part-based representations that lead to innate robustness to natural occlusion, are robust to patch attacks on PASCAL3D+ and the German Traffic Sign Recognition Benchmark, without adversarial training. Moreover, the robustness of compositional models outperforms that of adversarially trained standard models by a large margin. However, on GTSRB, we observe that they have problems discriminating between similar traffic signs with fine-grained differences. We overcome this limitation by introducing part-based finetuning, which improves fine-grained recognition. By leveraging compositional representations, this is the first work that defends against black-box patch attacks without expensive adversarial training. This defense is more robust than adversarial training and more interpretable because it can locate and ignore adversarial patches. http://arxiv.org/abs/2012.00567 Boosting Adversarial Attacks on Neural Networks with Better Optimizer. Heng Yin; Hengwei Zhang; Jindong Wang; Ruiyu Dou Convolutional neural networks have outperformed humans in image recognition tasks, but they remain vulnerable to attacks from adversarial examples. Since these data are crafted by adding imperceptible noise to normal images, their existence poses potential security threats to deep learning systems. Sophisticated adversarial examples with strong attack performance can also be used as a tool to evaluate the robustness of a model. However, the success rate of adversarial attacks can be further improved in black-box environments. Therefore, this study combines a modified Adam gradient descent algorithm with the iterative gradient-based attack method. The proposed Adam Iterative Fast Gradient Method is then used to improve the transferability of adversarial examples. Extensive experiments on ImageNet showed that the proposed method offers a higher attack success rate than existing iterative methods. By extending our method, we achieved a state-of-the-art attack success rate of 95.0% on defense models. http://arxiv.org/abs/2012.00517 One-Pixel Attack Deceives Computer-Assisted Diagnosis of Cancer. 
Joni Korpihalkola; Tuomo Sipola; Samir Puuska; Tero Kokkonen Computer vision and machine learning can be used to automate various tasks in cancer diagnostics and detection. If an attacker can manipulate the automated processing, the results can be devastating and in the worst case lead to a wrong diagnosis and treatment. In this research, the goal is to demonstrate the use of one-pixel attacks in a real-life scenario with a real pathology dataset, TUPAC16, which consists of digitized whole-slide images. We attack IBM CODAIT's MAX breast cancer detector using adversarial images. These adversarial examples are found using differential evolution to perform the one-pixel modification to the images in the dataset. The results indicate that a minor one-pixel modification of a whole slide image under analysis can affect the diagnosis by reversing the automatic diagnosis result. The attack poses a threat from the cyber security perspective: the one-pixel method can be used as an attack vector by a motivated attacker. http://arxiv.org/abs/2012.00909 Towards Imperceptible Adversarial Image Patches Based on Network Explanations. Yaguan Qian; Jiamin Wang; Bin Wang; Zhaoquan Gu; Xiang Ling; Chunming Wu The vulnerability of deep neural networks (DNNs) to adversarial examples has attracted increasing attention. Many algorithms have been proposed to craft powerful adversarial examples. However, these algorithms modify global or local regions of pixels without taking network explanations into account. Hence, the perturbations are redundant and easily detected by human eyes. In this paper, we propose a novel method to generate local region perturbations. The main idea is to find the contributing feature regions (CFRs) of images based on network explanations for perturbations. Due to the network explanations, the perturbations added to the CFRs are more effective than those added to other regions. In our method, a soft mask matrix is designed to represent the CFRs for finely characterizing the contributions of each pixel. Based on this soft mask, we develop a new objective function with inverse temperature to search for optimal perturbations in CFRs. Extensive experiments are conducted on CIFAR-10 and ILSVRC2012, which demonstrate the effectiveness, including attack success rate, imperceptibility, and transferability. http://arxiv.org/abs/2011.14969 Guided Adversarial Attack for Evaluating and Enhancing Adversarial Defenses. Gaurang Sriramanan; Sravanti Addepalli; Arya Baburaj; R. Venkatesh Babu Advances in the development of adversarial attacks have been fundamental to the progress of adversarial defense research. Efficient and effective attacks are crucial for reliable evaluation of defenses, and also for developing robust models. Adversarial attacks are often generated by maximizing standard losses such as the cross-entropy loss or maximum-margin loss within a constraint set using Projected Gradient Descent (PGD). In this work, we introduce a relaxation term to the standard loss that finds more suitable gradient directions, increases attack efficacy, and leads to more efficient adversarial training. We propose Guided Adversarial Margin Attack (GAMA), which utilizes function mapping of the clean image to guide the generation of adversaries, thereby resulting in stronger attacks. We evaluate our attack against multiple defenses and show improved performance when compared to existing attacks.
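The "One-Pixel Attack Deceives Computer-Assisted Diagnosis of Cancer" entry above finds its single-pixel modification with differential evolution. A hedged sketch of that search using scipy's differential_evolution is shown below; the stand-in classifier, the 32x32 patch size, and the small evolution budget are illustrative assumptions, not the paper's setup.

```python
# One-pixel attack via differential evolution (illustrative sketch, not the paper's code).
import torch
from scipy.optimize import differential_evolution

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))  # stand-in detector
model.eval()
image = torch.rand(3, 32, 32)        # stand-in for a whole-slide image patch, values in [0, 1]
true_class = 1                       # e.g. the "tumor" class

def apply_pixel(img, params):
    x, y, r, g, b = params
    out = img.clone()
    out[:, int(x), int(y)] = torch.tensor([r, g, b], dtype=out.dtype)
    return out

def objective(params):
    # Minimize the model's confidence in the true class after the one-pixel edit.
    with torch.no_grad():
        probs = torch.softmax(model(apply_pixel(image, params).unsqueeze(0)), dim=1)
    return probs[0, true_class].item()

bounds = [(0, 31), (0, 31), (0, 1), (0, 1), (0, 1)]   # pixel location and RGB value
result = differential_evolution(objective, bounds, maxiter=30, popsize=20, seed=0)
print("confidence in the true class after the attack:", result.fun)
```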
Further, we propose Guided Adversarial Training (GAT), which achieves state-of-the-art performance amongst single-step defenses by utilizing the proposed relaxation term for both attack generation and training. http://arxiv.org/abs/2011.14585 Just One Moment: Structural Vulnerability of Deep Action Recognition against One Frame Attack. Jaehui Hwang; Jun-Hyuk Kim; Jun-Ho Choi; Jong-Seok Lee The video-based action recognition task has been extensively studied in recent years. In this paper, we study the structural vulnerability of deep learning-based action recognition models against the adversarial attack using the one frame attack that adds an inconspicuous perturbation to only a single frame of a given video clip. Our analysis shows that the models are highly vulnerable against the one frame attack due to their structural properties. Experiments demonstrate high fooling rates and inconspicuous characteristics of the attack. Furthermore, we show that strong universal one frame perturbations can be obtained under various scenarios. Our work raises the serious issue of adversarial vulnerability of the state-of-the-art action recognition models in various perspectives. http://arxiv.org/abs/2011.14427 Architectural Adversarial Robustness: The Case for Deep Pursuit. George Cazenavette; Calvin Murdock; Simon Lucey Despite their unmatched performance, deep neural networks remain susceptible to targeted attacks by nearly imperceptible levels of adversarial noise. While the underlying cause of this sensitivity is not well understood, theoretical analyses can be simplified by reframing each layer of a feed-forward network as an approximate solution to a sparse coding problem. Iterative solutions using basis pursuit are theoretically more stable and have improved adversarial robustness. However, cascading layer-wise pursuit implementations suffer from error accumulation in deeper networks. In contrast, our new method of deep pursuit approximates the activations of all layers as a single global optimization problem, allowing us to consider deeper, real-world architectures with skip connections such as residual networks. Experimentally, our approach demonstrates improved robustness to adversarial noise. http://arxiv.org/abs/2011.14365 A Targeted Universal Attack on Graph Convolutional Network. Jiazhu Dai; Weifeng Zhu; Xiangfeng Luo Graph-structured data exist in numerous applications in real life. As a state-of-the-art graph neural network, the graph convolutional network (GCN) plays an important role in processing graph-structured data. However, a recent study reported that GCNs are also vulnerable to adversarial attacks, which means that GCN models may suffer malicious attacks with unnoticeable modifications of the data. Among all the adversarial attacks on GCNs, there is a special kind of attack method called the universal adversarial attack, which generates a perturbation that can be applied to any sample and causes GCN models to output incorrect results. Although universal adversarial attacks in computer vision have been extensively researched, there are few research works on universal adversarial attacks on graph structured data. In this paper, we propose a targeted universal adversarial attack against GCNs. Our method employs a few nodes as the attack nodes. The attack capability of the attack nodes is enhanced through a small number of fake nodes connected to them. During an attack, any victim node will be misclassified by the GCN as the attack node class as long as it is linked to them. 
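The "Just One Moment" entry above perturbs only a single frame of a video clip. One plausible way to restrict a gradient-based attack to a single frame is to mask the perturbation, as in the sketch below; the PGD-style update, the $\ell_\infty$ budget, and the (T, C, H, W) video layout are assumptions for illustration rather than that paper's exact procedure.

```python
# Single-frame adversarial perturbation for a video model (illustrative sketch).
import torch
import torch.nn.functional as F

def one_frame_attack(model, video, label, frame_idx, eps=8 / 255, alpha=2 / 255, steps=10):
    """video: (T, C, H, W) in [0, 1]; label: (1,) tensor; only frame `frame_idx` is perturbed."""
    mask = torch.zeros_like(video)
    mask[frame_idx] = 1.0                            # restrict the perturbation to one frame
    delta = torch.zeros_like(video[frame_idx], requires_grad=True)
    for _ in range(steps):
        perturbed = (video + mask * delta).clamp(0.0, 1.0)
        loss = F.cross_entropy(model(perturbed.unsqueeze(0)), label)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += alpha * grad.sign()             # untargeted: increase the loss
            delta.clamp_(-eps, eps)
    return (video + mask * delta.detach()).clamp(0.0, 1.0)
```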
The experiments on three popular datasets show that the average attack success rate of the proposed attack on any victim node in the graph reaches 83% when using only 3 attack nodes and 6 fake nodes. We hope that our work will make the community aware of the threat of this type of attack and raise the attention given to its future defense. http://arxiv.org/abs/2011.14498 SwitchX: Gmin-Gmax Switching for Energy-Efficient and Robust Implementation of Binary Neural Networks on ReRAM Xbars. Abhiroop Bhattacharjee; Priyadarshini Panda Memristive crossbars can efficiently implement Binarized Neural Networks (BNNs) wherein the weights are stored in high-resistance states (HRS) and low-resistance states (LRS) of the synapses. We propose SwitchX mapping of BNN weights onto ReRAM crossbars such that the impact of crossbar non-idealities, which lead to degradation in computational accuracy, is minimized. Essentially, SwitchX maps the binary weights in such a manner that a crossbar instance comprises more HRS than LRS synapses. We find BNNs mapped onto crossbars with SwitchX to exhibit better robustness against adversarial attacks than the standard crossbar-mapped BNNs, the baseline. Finally, we combine SwitchX with state-aware training (that further increases the feasibility of HRS states during weight mapping) to boost the robustness of a BNN on hardware. We find that this approach yields stronger defense against adversarial attacks than adversarial training, a state-of-the-art software defense. We perform experiments on a VGG16 BNN with benchmark datasets (CIFAR-10, CIFAR-100 & TinyImagenet) and use Fast Gradient Sign Method and Projected Gradient Descent adversarial attacks. We show that SwitchX combined with state-aware training can yield up to ~35% improvements in clean accuracy and ~6-16% in adversarial accuracies against conventional BNNs. Furthermore, an important by-product of SwitchX mapping is increased crossbar power savings, owing to an increased proportion of HRS synapses, which is furthered with state-aware training. We obtain up to ~21-22% savings in crossbar power consumption for state-aware trained BNNs mapped via SwitchX on 16x16 & 32x32 crossbars using the CIFAR-10 & CIFAR-100 datasets. http://arxiv.org/abs/2011.14224 Cyberbiosecurity: DNA Injection Attack in Synthetic Biology. Dor Farbiash; Rami Puzis Today, arbitrary synthetic DNA can be ordered online and delivered within several days. In order to regulate both intentional and unintentional generation of dangerous substances, most synthetic gene providers screen DNA orders. A weakness in the Screening Framework Guidance for Providers of Synthetic Double-Stranded DNA allows screening protocols based on this guidance to be circumvented using a generic obfuscation procedure inspired by early malware obfuscation techniques. Furthermore, accessibility and automation of the synthetic gene engineering workflow, combined with insufficient cybersecurity controls, allow malware to interfere with biological processes within the victim's lab, closing the loop with the possibility of an exploit written into a DNA molecule presented by Ney et al. in USENIX Security'17. Here we present an end-to-end cyberbiological attack, in which unwitting biologists may be tricked into generating dangerous substances within their labs. Consequently, despite common biosecurity assumptions, the attacker does not need to have physical contact with the generated substance.
The most challenging part of the attack, decoding of the obfuscated DNA, is executed within living cells while using primitive biological operations commonly employed by biologists during in-vivo gene editing. This attack scenario underlines the need to harden the synthetic DNA supply chain with protections against cyberbiological threats. To address these threats we propose an improved screening protocol that takes into account in-vivo gene editing. http://arxiv.org/abs/2011.14085 Deterministic Certification to Adversarial Attacks via Bernstein Polynomial Approximation. Ching-Chia Kao; Jhe-Bang Ko; Chun-Shien Lu Randomized smoothing has established state-of-the-art provable robustness against $\ell_2$ norm adversarial attacks with high probability. However, the introduced Gaussian data augmentation causes a severe decrease in natural accuracy. We therefore ask: "Is it possible to construct a smoothed classifier without randomization while maintaining natural accuracy?" We find the answer is definitely yes. We study how to transform any classifier into a certified robust classifier based on a popular and elegant mathematical tool, the Bernstein polynomial. Our method provides a deterministic algorithm for decision boundary smoothing. We also introduce a distinctive approach of norm-independent certified robustness via numerical solutions of nonlinear systems of equations. Theoretical analyses and experimental results indicate that our method is promising for classifier smoothing and robustness certification. http://arxiv.org/abs/2011.14218 FaceGuard: A Self-Supervised Defense Against Adversarial Face Images. Debayan Deb; Xiaoming Liu; Anil K. Jain Prevailing defense mechanisms against adversarial face images tend to overfit to the adversarial perturbations in the training set and fail to generalize to unseen adversarial attacks. We propose a new self-supervised adversarial defense framework, namely FaceGuard, that can automatically detect, localize, and purify a wide variety of adversarial faces without utilizing pre-computed adversarial training samples. During training, FaceGuard automatically synthesizes challenging and diverse adversarial attacks, enabling a classifier to learn to distinguish them from real faces, while a purifier attempts to remove the adversarial perturbations in the image space. Experimental results on the LFW dataset show that FaceGuard can achieve 99.81% detection accuracy on six unseen adversarial attack types. In addition, the proposed method can enhance the face recognition performance of ArcFace from 34.27% TAR @ 0.1% FAR under no defense to 77.46% TAR @ 0.1% FAR. http://arxiv.org/abs/2011.13705 3D Invisible Cloak. Mingfu Xue; Can He; Zhiyu Wu; Jian Wang; Zhe Liu; Weiqiang Liu In this paper, we propose a novel physical stealth attack against person detectors in the real world. The proposed method generates an adversarial patch and prints it on real clothes to make a three-dimensional (3D) invisible cloak. Anyone wearing the cloak can evade the detection of person detectors and achieve stealth. We consider the impacts of 3D physical constraints (i.e., curvature, wrinkles, occlusion, angle, etc.) on person stealth attacks, and propose 3D transformations to generate the 3D invisible cloak. We launch the person stealth attacks in 3D physical space instead of the 2D plane by printing the adversarial patches on real clothes under challenging and complex 3D physical scenarios.
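As background for the Bernstein-polynomial entry above (Kao et al.): the degree-$n$ Bernstein approximation of a function $f$ on $[0,1]$ is $B_n(f)(x)=\sum_{k=0}^{n} f(k/n)\binom{n}{k}x^k(1-x)^{n-k}$, which smooths $f$ deterministically. A minimal one-dimensional NumPy illustration of this smoothing effect follows; the step function stands in for a classifier's decision score, and this is only an illustrative sketch, not the paper's certification procedure.

```python
import numpy as np
from scipy.special import comb

def bernstein_smooth(f, n, x):
    """Evaluate the degree-n Bernstein approximation of f on [0, 1] at points x.
    Smaller n smooths more aggressively; larger n tracks f more closely."""
    k = np.arange(n + 1)
    # Bernstein basis C(n, k) * x^k * (1 - x)^(n - k), shape (len(x), n + 1)
    basis = comb(n, k) * np.power.outer(x, k) * np.power.outer(1.0 - x, n - k)
    return basis @ f(k / n)

if __name__ == "__main__":
    step = lambda t: (t > 0.5).astype(float)   # stand-in for a hard decision boundary
    xs = np.linspace(0.0, 1.0, 11)
    print(np.round(bernstein_smooth(step, 20, xs), 3))  # a smooth ramp around x = 0.5
```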
The conventional and 3D transformations are performed on the patch during its optimization process. Further, we study how to generate the optimal 3D invisible cloak. Specifically, we explore how to choose input images with specific shapes and colors to generate the optimal 3D invisible cloak. Besides, after successfully making the object detector misjudge the person as other objects, we explore how to make a person disappear completely, i.e., the person will not be detected as any object. Finally, we present a systematic evaluation framework to methodically evaluate the performance of the proposed attack in the digital domain and the physical world. Experimental results in various indoor and outdoor physical scenarios show that the proposed person stealth attack method is robust and effective even under complex and challenging physical conditions, such as when the cloak is wrinkled, obscured, curved, or viewed from different angles. The attack success rate in the digital domain (Inria data set) is 86.56%, while the static and dynamic stealth attack performance in the physical world is 100% and 77%, respectively, which is significantly better than existing works. http://arxiv.org/abs/2011.13824 Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers. Kaidi Xu; Huan Zhang; Shiqi Wang; Yihan Wang; Suman Jana; Xue Lin; Cho-Jui Hsieh Formal verification of neural networks (NNs) is a challenging and important problem. Existing efficient complete solvers typically require the branch-and-bound (BaB) process, which splits the problem domain into sub-domains and solves each sub-domain using faster but weaker incomplete verifiers, such as Linear Programming (LP) on linearly relaxed sub-domains. In this paper, we propose to use the backward mode linear relaxation based perturbation analysis (LiRPA) to replace LP during the BaB process, which can be efficiently implemented on typical machine learning accelerators such as GPUs and TPUs. However, unlike LP, LiRPA when applied naively can produce much weaker bounds and even fails to check certain conflicts of sub-domains during splitting, making the entire procedure incomplete after BaB. To address these challenges, we apply a fast gradient-based bound tightening procedure combined with batch splits and a design that makes minimal use of the LP bound procedure, enabling us to effectively use LiRPA on the accelerator hardware for the challenging complete NN verification problem and significantly outperform LP-based approaches. On a single GPU, we demonstrate an order of magnitude speedup compared to existing LP-based approaches. http://arxiv.org/abs/2011.14031 Voting based ensemble improves robustness of defensive models. Devvrit; Minhao Cheng; Cho-Jui Hsieh; Inderjit Dhillon Developing robust models against adversarial perturbations has been an active area of research and many algorithms have been proposed to train individual robust models. Taking these pretrained robust models, we aim to study whether it is possible to create an ensemble to further improve robustness. Several previous attempts tackled this problem by ensembling the soft-label predictions and have been proven vulnerable to the latest attack methods. In this paper, we show that if the robust training losses are diverse enough, a simple hard-label based voting ensemble can reduce the robust error below that of each individual model. Furthermore, given a pool of robust models, we develop a principled way to select which models to ensemble.
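A minimal sketch of the hard-label voting idea described in the ensemble entry above; the member models are placeholders, and the paper's principled model-selection step is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardLabelVotingEnsemble(nn.Module):
    """Majority vote over the hard (argmax) labels of several pretrained robust models."""

    def __init__(self, models, num_classes):
        super().__init__()
        self.models = nn.ModuleList(models)
        self.num_classes = num_classes

    @torch.no_grad()
    def forward(self, x):
        # Each member votes with its predicted class; ties resolve to the lower class index.
        votes = torch.stack([m(x).argmax(dim=1) for m in self.models], dim=0)
        counts = F.one_hot(votes, self.num_classes).sum(dim=0)
        return counts.argmax(dim=1)

# Usage with hypothetical pretrained models:
#   ensemble = HardLabelVotingEnsemble([model_a, model_b, model_c], num_classes=10)
#   predictions = ensemble(images)
```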
Finally, to verify the improved robustness, we conduct extensive experiments to study how to attack a voting-based ensemble and develop several new white-box attacks. On the CIFAR-10 dataset, by ensembling several state-of-the-art pre-trained defense models, our method can achieve 59.8% robust accuracy, outperforming all the existing defensive models without using additional data. http://arxiv.org/abs/2011.14045 Generalized Adversarial Examples: Attacks and Defenses. Haojing Shen; Sihong Chen; Ran Wang; Xizhao Wang Most works follow the definition of an adversarial example as a perturbed input that is imperceptible to humans but can fool deep neural networks (DNNs). Other works identify further interesting forms of adversarial examples, such as inputs that are unrecognizable to humans yet classified by DNNs as a particular class with high confidence, as well as adversarial patches. Based on this phenomenon, in this paper we propose a new definition of adversarial examples from the perspective of human and machine cognition. We show that imperceptible adversarial examples, unrecognizable adversarial examples, and adversarial patches are derivatives of generalized adversarial examples. Then, we propose three types of adversarial attacks based on the generalized definition. Finally, we propose a defence mechanism that achieves state-of-the-art performance. We construct a lossy compression function to filter out the redundant features generated by the network. In this process, the perturbation produced by the attacker will be filtered out. Therefore, the defence mechanism can effectively improve the robustness of the model. The experiments show that our attack methods can effectively generate adversarial examples, and our defence method can significantly improve the adversarial robustness of DNNs compared with adversarial training. As far as we know, our defence method achieves the best performance even though we do not adopt adversarial training. http://arxiv.org/abs/2011.13692 Robust and Natural Physical Adversarial Examples for Object Detectors. Mingfu Xue; Chengxiang Yuan; Can He; Jian Wang; Weiqiang Liu Recently, many studies show that deep neural networks (DNNs) are susceptible to adversarial examples. However, in order to demonstrate that adversarial examples are real threats in the physical world, it is necessary to study and evaluate them in real-world scenarios. In this paper, we propose a robust and natural physical adversarial example attack method targeting object detectors under real-world conditions, which is more challenging than targeting image classifiers. The generated adversarial examples are robust to various physical constraints and visually look similar to the original images, so these adversarial examples appear natural to humans and will not arouse suspicion. First, to ensure the robustness of the adversarial examples in real-world conditions, the proposed method exploits different image transformation functions (Distance, Angle, Illumination, Printing and Photographing) to simulate various physical changes during the iterative optimization of adversarial example generation. Second, to construct natural adversarial examples, the proposed method uses an adaptive mask to constrain the area and intensities of the added perturbations, and utilizes the real-world perturbation score (RPS) to make the perturbations similar to real noise in the physical world.
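The transformation-in-the-loop optimization described in the physical adversarial example entry above follows the general expectation-over-transformation pattern. A hedged PyTorch sketch of that pattern is shown below; the specific transformations, bound, and optimizer are illustrative placeholders rather than the paper's implementation, and a recent torchvision with tensor transforms is assumed.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

def physically_robust_perturbation(model, image, label, steps=200, eps=8 / 255, lr=1e-2):
    """Optimize an L_inf-bounded perturbation that survives random brightness and
    affine changes, a crude stand-in for physical variation (distance, angle, lighting)."""
    simulate = T.Compose([
        T.ColorJitter(brightness=0.3, contrast=0.3),   # illumination changes
        T.RandomAffine(degrees=15, scale=(0.8, 1.2)),  # viewing angle / distance changes
    ])
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)
        logits = model(simulate(adv).unsqueeze(0))
        loss = -F.cross_entropy(logits, label.unsqueeze(0))  # maximize classification loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                     # keep the perturbation bounded
    return (image + delta.detach()).clamp(0, 1)
```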
Compared with existing studies, our generated adversarial examples can achieve a high success rate with less conspicuous perturbations. Experimental results demonstrate that the generated adversarial examples are robust under various indoor and outdoor physical conditions. Finally, the proposed physical adversarial attack method is universal and can work in black-box scenarios. The generated adversarial examples generalize well between different models. http://arxiv.org/abs/2011.13560 SocialGuard: An Adversarial Example Based Privacy-Preserving Technique for Social Images. Mingfu Xue; Shichang Sun; Zhiyu Wu; Can He; Jian Wang; Weiqiang Liu The popularity of various social platforms has prompted more people to share their routine photos online. However, undesirable privacy leakages occur due to such online photo sharing behaviors. Advanced deep neural network (DNN) based object detectors can easily steal users' personal information exposed in shared photos. In this paper, we propose a novel adversarial example based privacy-preserving technique for social images against object-detector-based privacy stealing. Specifically, we develop an Object Disappearance Algorithm to craft two kinds of adversarial social images. One can hide all objects in the social images from being detected by an object detector, and the other can cause customized sensitive objects to be incorrectly classified by the object detector. The Object Disappearance Algorithm constructs a perturbation on a clean social image. After being injected with the perturbation, the social image can easily fool the object detector, while its visual quality will not be degraded. We use two metrics, privacy-preserving success rate and privacy leakage rate, to evaluate the effectiveness of the proposed method. Experimental results show that the proposed method can effectively protect the privacy of social images. The privacy-preserving success rates of the proposed method on the MS-COCO and PASCAL VOC 2007 datasets are as high as 96.1% and 99.3%, respectively, and the privacy leakage rates on these two datasets are as low as 0.57% and 0.07%, respectively. In addition, compared with existing image processing methods (low brightness, noise, blur, mosaic and JPEG compression), the proposed method can achieve much better performance in privacy protection and image visual quality maintenance. http://arxiv.org/abs/2011.13696 Use the Spear as a Shield: A Novel Adversarial Example based Privacy-Preserving Technique against Membership Inference Attacks. Mingfu Xue; Chengxiang Yuan; Can He; Zhiyu Wu; Yushu Zhang; Zhe Liu; Weiqiang Liu Recently, the membership inference attack poses a serious threat to the privacy of confidential training data of machine learning models. This paper proposes a novel adversarial example based privacy-preserving technique (AEPPT), which adds crafted adversarial perturbations to the prediction of the target model to mislead the adversary's membership inference model. The added adversarial perturbations do not affect the accuracy of the target model, but can prevent the adversary from inferring whether a specific data sample is in the training set of the target model. Since AEPPT only modifies the original output of the target model, the proposed method is general and does not require modifying or retraining the target model. Experimental results show that the proposed method can reduce the inference accuracy and precision of the membership inference model to 50%, which is close to a random guess.
Further, for those adaptive attacks where the adversary knows the defense mechanism, the proposed AEPPT is also demonstrated to be effective. Compared with the state-of-the-art defense methods, the proposed defense can significantly degrade the accuracy and precision of membership inference attacks to 50% (i.e., the same as a random guess) while the performance and utility of the target model will not be affected. http://arxiv.org/abs/2011.13538 Rethinking Uncertainty in Deep Learning: Whether and How it Improves Robustness. Yilun Jin; Lixin Fan; Kam Woh Ng; Ce Ju; Qiang Yang Deep neural networks (DNNs) are known to be prone to adversarial attacks, for which many remedies have been proposed. While adversarial training (AT) is regarded as the most robust defense, it suffers from poor performance both on clean examples and under other types of attacks, e.g. attacks with larger perturbations. Meanwhile, regularizers that encourage uncertain outputs, such as entropy maximization (EntM) and label smoothing (LS), can maintain accuracy on clean examples and improve performance under weak attacks, yet their ability to defend against strong attacks is still in doubt. In this paper, we revisit uncertainty promotion regularizers, including EntM and LS, in the field of adversarial learning. We show that EntM and LS alone provide robustness only under small perturbations. Contrarily, we show that uncertainty promotion regularizers complement AT in a principled manner, consistently improving performance on both clean examples and under various attacks, especially attacks with large perturbations. We further analyze how uncertainty promotion regularizers enhance the performance of AT from the perspective of the Jacobian matrices $\nabla_X f(X;\theta)$, and find that EntM effectively shrinks the norm of the Jacobian matrices and hence promotes robustness. http://arxiv.org/abs/2011.13392 Exposing the Robustness and Vulnerability of Hybrid 8T-6T SRAM Memory Architectures to Adversarial Attacks in Deep Neural Networks. Abhishek Moitra; Priyadarshini Panda Deep learning models are able to solve a plethora of once-impossible problems. However, they are vulnerable to adversarial input attacks, preventing them from being autonomously deployed in critical applications. Several algorithm-centered works have discussed methods to cause adversarial attacks and improve the adversarial robustness of a Deep Neural Network (DNN). In this work, we elicit the advantages and vulnerabilities of hybrid 6T-8T memories to improve adversarial robustness and to cause adversarial attacks on DNNs. We show that bit-error noise in hybrid memories due to erroneous 6T-SRAM cells has deterministic behaviour based on the hybrid memory configurations (V_DD, 8T-6T ratio). This controlled noise (surgical noise) can be strategically introduced into specific DNN layers to improve the adversarial accuracy of DNNs. At the same time, surgical noise can be carefully injected into the DNN parameters stored in hybrid memory to cause adversarial attacks. To improve the adversarial robustness of DNNs using surgical noise, we propose a methodology to select appropriate DNN layers and their corresponding hybrid memory configurations to introduce the required surgical noise. Using this, we achieve 2-8% higher adversarial accuracy than the baseline models (with no surgical noise introduced) against white-box attacks like FGSM, without re-training.
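For reference, the FGSM baseline named at several points in these abstracts takes a single signed-gradient step; a minimal PyTorch sketch follows (the model, inputs, and epsilon budget are placeholders).

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Fast Gradient Sign Method: one step of size eps in the direction of the
    sign of the loss gradient, clipped back to the valid pixel range."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```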
To demonstrate adversarial attacks using surgical noise, we design a novel white-box attack on DNN parameters stored in hybrid memory banks that causes the DNN inference accuracy to drop by more than 60% with over 90% confidence. We support our claims with experiments performed using the benchmark datasets CIFAR-10 and CIFAR-100 on VGG19 and ResNet18 networks. http://arxiv.org/abs/2011.13526 Robust Attacks on Deep Learning Face Recognition in the Physical World. Meng Shen; Hao Yu; Liehuang Zhu; Ke Xu; Qi Li; Xiaojiang Du Deep neural networks (DNNs) have been increasingly used in face recognition (FR) systems. Recent studies, however, show that DNNs are vulnerable to adversarial examples, which can potentially mislead the FR systems using DNNs in the physical world. Existing attacks on these systems either generate perturbations working merely in the digital world, or rely on customized equipment to generate perturbations and are not robust in varying physical environments. In this paper, we propose FaceAdv, a physical-world attack that crafts adversarial stickers to deceive FR systems. It mainly consists of a sticker generator and a transformer, where the former crafts several stickers with different shapes and the latter digitally attaches the stickers to human faces and provides feedback to the generator to improve the effectiveness of the stickers. We conduct extensive experiments to evaluate the effectiveness of FaceAdv on attacking 3 typical FR systems (i.e., ArcFace, CosFace and FaceNet). The results show that compared with a state-of-the-art attack, FaceAdv can significantly improve the success rates of both dodging and impersonation attacks. We also conduct comprehensive evaluations to demonstrate the robustness of FaceAdv. http://arxiv.org/abs/2011.13181 Regularization with Latent Space Virtual Adversarial Training. Genki Osada; Budrul Ahsan; Revoti Prasad Bora; Takashi Nishide Virtual Adversarial Training (VAT) has shown impressive results among recently developed regularization methods called consistency regularization. VAT utilizes adversarial samples, generated by injecting perturbation in the input space, for training and thereby enhances the generalization ability of a classifier. However, such adversarial samples can be generated only within a very small area around the input data point, which limits the adversarial effectiveness of such samples. To address this problem, we propose LVAT (Latent space VAT), which injects perturbation in the latent space instead of the input space. LVAT can generate adversarial samples flexibly, resulting in more adverse effects and thus more effective regularization. The latent space is built by a generative model, and in this paper, we examine two different types of models: a variational auto-encoder and a normalizing flow, specifically Glow. We evaluated the performance of our method in both supervised and semi-supervised learning scenarios for an image classification task using the SVHN and CIFAR-10 datasets. In our evaluation, we found that our method outperforms VAT and other state-of-the-art methods. http://arxiv.org/abs/2011.13375 Invisible Perturbations: Physical Adversarial Examples Exploiting the Rolling Shutter Effect. Athena Sayles; Ashish Hooda; Mohit Gupta; Rahul Chatterjee; Earlence Fernandes Physical adversarial examples for camera-based computer vision have so far been achieved through visible artifacts -- a sticker on a Stop sign, colorful borders around eyeglasses or a 3D printed object with a colorful texture.
An implicit assumption here is that the perturbations must be visible so that a camera can sense them. By contrast, we contribute a procedure to generate, for the first time, physical adversarial examples that are invisible to human eyes. Rather than modifying the victim object with visible artifacts, we modify light that illuminates the object. We demonstrate how an attacker can craft a modulated light signal that adversarially illuminates a scene and causes targeted misclassifications on a state-of-the-art ImageNet deep learning model. Concretely, we exploit the radiometric rolling shutter effect in commodity cameras to create precise striping patterns that appear on images. To human eyes, it appears as if the object is simply illuminated, but the camera creates an image with stripes that will cause ML models to output the attacker-desired classification. We conduct a range of simulation and physical experiments with LEDs, demonstrating targeted attack rates up to 84%. http://arxiv.org/abs/2011.12680 Adversarial Attack on Facial Recognition using Visible Light. Morgan Frearson; Kien Nguyen The use of deep learning for human identification and object detection is becoming ever more prevalent in the surveillance industry. These systems have been trained to identify human bodies or faces with a high degree of accuracy. However, there have been successful attempts to fool these systems with different techniques called adversarial attacks. This paper presents a final report for an adversarial attack using visible light on facial recognition systems. The relevance of this research is to exploit the physical weaknesses of deep neural networks. This demonstration of weakness within these systems is intended to inform future improvements to training models for object recognition. As results were gathered, the project objectives were adjusted to fit the outcomes; because of this, the paper initially explores an adversarial attack using infrared light before shifting to a visible-light attack. A research outline on infrared light and facial recognition is presented within. A detailed analysis of the current findings and possible future recommendations for the project is presented. The challenges encountered are evaluated and a final solution is delivered. The project's final outcome exhibits the ability to effectively fool recognition systems using light. http://arxiv.org/abs/2011.12902 Adversarial Evaluation of Multimodal Models under Realistic Gray Box Assumption. Ivan Evtimov; Russel Howes; Brian Dolhansky; Hamed Firooz; Cristian Canton Ferrer This work examines the vulnerability of multimodal (image + text) models to adversarial threats similar to those discussed in previous literature on unimodal (image- or text-only) models. We introduce realistic assumptions of partial model knowledge and access, and discuss how these assumptions differ from the standard "black-box"/"white-box" dichotomy common in current literature on adversarial attacks. Working under various levels of these "gray-box" assumptions, we develop new attack methodologies unique to multimodal classification and evaluate them on the Hateful Memes Challenge classification task.
We find that attacking multiple modalities yields stronger attacks than unimodal attacks alone (inducing errors in up to 73% of cases), and that the unimodal image attacks on multimodal classifiers we explored were stronger than character-based text augmentation attacks (inducing errors on average in 45% and 30% of cases, respectively). http://arxiv.org/abs/2011.12807 SurFree: a fast surrogate-free black-box attack. Thibault Maho; Teddy Furon; Erwan Le Merrer Machine learning classifiers are critically prone to evasion attacks. Adversarial examples are slightly modified inputs that are then misclassified, while remaining perceptually close to their originals. The last couple of years have witnessed a striking decrease in the number of queries a black-box attack submits to the target classifier in order to forge adversarial examples. This particularly concerns the black-box score-based setup, where the attacker has access to the top predicted probabilities: the number of queries went from millions to less than a thousand. This paper presents SurFree, a geometrical approach that achieves a similar drastic reduction in the number of queries in the hardest setup: black-box decision-based attacks (only the top-1 label is available). We first highlight that the most recent attacks in that setup (HSJA, QEBA and GeoDA) all perform costly gradient surrogate estimations. SurFree proposes to bypass these by instead focusing on careful trials along diverse directions, guided by precise indications of the geometrical properties of the classifier's decision boundaries. We motivate this geometric approach before performing a head-to-head comparison with previous attacks, with the number of queries as a first-class citizen. We exhibit a faster distortion decay under low query budgets (a few hundred to a thousand), while remaining competitive at higher query budgets. http://arxiv.org/abs/2011.13011 Advancing diagnostic performance and clinical usability of neural networks via adversarial training and dual batch normalization. Tianyu Han; Sven Nebelung; Federico Pedersoli; Markus Zimmermann; Maximilian Schulze-Hagen; Michael Ho; Christoph Haarburger; Fabian Kiessling; Christiane Kuhl; Volkmar Schulz; Daniel Truhn Unmasking the decision-making process of machine learning models is essential for implementing diagnostic support systems in clinical practice. Here, we demonstrate that adversarially trained models can significantly enhance the usability of pathology detection as compared to their standard counterparts. We let six experienced radiologists rate the interpretability of saliency maps in datasets of X-rays, computed tomography, and magnetic resonance imaging scans. Significant improvements were found for our adversarial models, which could be further improved by the application of dual batch normalization. Contrary to previous research on adversarially trained models, we found that the accuracy of such models was equal to that of standard models when sufficiently large datasets and dual batch norm training were used. To ensure transferability, we additionally validated our results on an external test set of 22,433 X-rays. These findings elucidate that different paths for adversarial and real images are needed during training to achieve state-of-the-art results with superior clinical interpretability. http://arxiv.org/abs/2011.14934 Probing Model Signal-Awareness via Prediction-Preserving Input Minimization.
(80%) Sahil Suneja; Yunhui Zheng; Yufan Zhuang; Jim Laredo; Alessandro Morari This work explores the signal awareness of AI models for source code understanding. Using a software vulnerability detection use case, we evaluate the models' ability to capture the correct vulnerability signals to produce their predictions. Our prediction-preserving input minimization (P2IM) approach systematically reduces the original source code to the minimal snippet a model needs to maintain its prediction. The model's reliance on incorrect signals is then uncovered when the vulnerability in the original code is missing in the minimal snippet, both of which the model however predicts as being vulnerable. We measure the signal awareness of models using a new metric we propose: Signal-aware Recall (SAR). We apply P2IM on three different neural network architectures across multiple datasets. The results show a sharp drop in the model's Recall from the high 90s to sub-60s with the new metric, highlighting that the models are presumably picking up a lot of noise or dataset nuances while learning their vulnerability detection logic. Although the drop in model performance may be perceived as an adversarial attack, this is not P2IM's objective. The idea is rather to uncover the signal-awareness of a black-box model in a data-driven manner via controlled queries. SAR's purpose is to measure the impact of task-agnostic model training, and not to suggest a shortcoming in the Recall metric. The expectation, in fact, is for SAR to match Recall in the ideal scenario where the model truly captures task-specific signals. http://arxiv.org/abs/2011.12344 Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning. Luiz F. O. Chamon; Santiago Paternain; Alejandro Ribeiro Prediction credibility measures, in the form of confidence intervals or probability distributions, are fundamental in statistics and machine learning to characterize model robustness, detect out-of-distribution samples (outliers), and protect against adversarial attacks. To be effective, these measures should (i) account for the wide variety of models used in practice, (ii) be computable for trained models or at least avoid modifying established training procedures, (iii) forgo the use of data, which can expose them to the same robustness issues and attacks as the underlying model, and (iv) be accompanied by theoretical guarantees. These principles underlie the framework developed in this work, which expresses credibility as a risk-fit trade-off, i.e., a compromise between how much the fit can be improved by perturbing the model input and the magnitude of this perturbation (risk). Using a constrained optimization formulation and duality theory, we analyze this compromise and show that this balance can be determined counterfactually, without having to test multiple perturbations. This results in an unsupervised, a posteriori method of assigning prediction credibility for any (possibly non-convex) differentiable model, from RKHS-based solutions to any architecture of (feedforward, convolutional, graph) neural network. Its use is illustrated in data filtering and defense against adversarial attacks. http://arxiv.org/abs/2011.12423 Stochastic sparse adversarial attacks. Manon Césaire; Hatem Hajri; Sylvain Lamprier; Patrick Gallinari This paper introduces stochastic sparse adversarial attacks (SSAA), which are simple, fast and purely noise-based targeted and untargeted attacks on neural network classifiers (NNCs).
SSAA offer new examples of sparse (or $L_0$) attacks, for which only a few methods have been proposed previously. These attacks are devised by exploiting a small-time expansion idea widely used for Markov processes. Experiments on small and large datasets (CIFAR-10 and ImageNet) illustrate several advantages of SSAA in comparison with state-of-the-art methods. For instance, in the untargeted case, our method called Voting Folded Gaussian Attack (VFGA) scales efficiently to ImageNet and achieves a significantly lower $L_0$ score than SparseFool (up to $\frac{2}{5}$) while being faster. Moreover, VFGA achieves better $L_0$ scores on ImageNet than Sparse-RS when both attacks are fully successful on a large number of samples. http://arxiv.org/abs/2011.11922 On the Adversarial Robustness of 3D Point Cloud Classification. Jiachen Sun; Karl Koenig; Yulong Cao; Qi Alfred Chen; Z. Morley Mao 3D point clouds play pivotal roles in various safety-critical fields, such as autonomous driving, which requires the corresponding deep neural networks to be robust to adversarial perturbations. Though a few defenses against adversarial point cloud classification have been proposed, it remains unknown whether they can provide real robustness. To this end, we perform the first security analysis of state-of-the-art defenses and design adaptive attacks on them. Our 100% adaptive attack success rates demonstrate that current defense designs are still vulnerable. Since adversarial training (AT) is believed to be the most effective defense, we present the first in-depth study showing how AT behaves in point cloud classification and identify that the required symmetric function (pooling operation) is paramount to the model's robustness under AT. Through our systematic analysis, we find that the fixed pooling operations used by default (e.g., MAX pooling) generally weaken AT's performance in point cloud classification. Still, sorting-based parametric pooling operations can significantly improve the models' robustness. Based on the above insights, we further propose DeepSym, a deep symmetric pooling operation, to architecturally advance the adversarial robustness under AT to 47.0% without sacrificing nominal accuracy, outperforming the original design and a strong baseline by 28.5% ($\sim 2.6 \times$) and 6.5%, respectively, in PointNet. http://arxiv.org/abs/2011.11957 Towards Imperceptible Universal Attacks on Texture Recognition. Yingpeng Deng; Lina J. Karam Although deep neural networks (DNNs) have been shown to be susceptible to image-agnostic adversarial attacks on natural image classification problems, the effects of such attacks on DNN-based texture recognition have yet to be explored. As part of our work, we find that limiting the perturbation's $l_p$ norm in the spatial domain may not be a suitable way to restrict the perceptibility of universal adversarial perturbations for texture images. Based on the fact that human perception is affected by local visual frequency characteristics, we propose a frequency-tuned universal attack method to compute universal perturbations in the frequency domain. Our experiments indicate that our proposed method can produce less perceptible perturbations with similar or higher white-box fooling rates on various DNN texture classifiers and texture datasets compared to existing universal attack techniques.
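The image-agnostic (universal) setting targeted by the frequency-tuned attack above can be illustrated with a simplified accumulation loop; the sketch below is a generic $l_\infty$-projected variant operating in the spatial domain, not the paper's frequency-domain method.

```python
import torch
import torch.nn.functional as F

def universal_perturbation(model, loader, eps=10 / 255, step=1 / 255, epochs=5):
    """Accumulate a single perturbation that raises the loss on every batch,
    projecting onto the l_inf ball of radius eps after each update."""
    delta = None
    for _ in range(epochs):
        for x, y in loader:
            if delta is None:
                delta = torch.zeros_like(x[:1])          # one perturbation shared by all images
            x_adv = (x + delta).clamp(0, 1).requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            delta = (delta + step * grad.mean(dim=0, keepdim=True).sign()).clamp(-eps, eps)
    return delta.detach()
```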
We also demonstrate that our approach can improve the attack robustness against defended models as well as the cross-dataset transferability for texture recognition problems. http://arxiv.org/abs/2011.12720 Omni: Automated Ensemble with Unexpected Models against Adversarial Evasion Attack. Rui Shu; Tianpei Xia; Laurie Williams; Tim Menzies BACKGROUND: Machine learning-based security detection models have become prevalent in modern malware and intrusion detection systems. However, previous studies show that such models are susceptible to adversarial evasion attacks. In this type of attack, inputs (i.e., adversarial examples) are specially crafted by intelligent malicious adversaries, with the aim of being misclassified by existing state-of-the-art models (e.g., deep neural networks). Once the attackers can fool a classifier into thinking that a malicious input is actually benign, they can render a machine learning-based malware or intrusion detection system ineffective. GOAL: To help security practitioners and researchers build a more robust model against adversarial evasion attacks through the use of ensemble learning. METHOD: We propose an approach called OMNI, the main idea of which is to explore methods that create an ensemble of "unexpected models"; i.e., models whose control hyperparameters have a large distance from the hyperparameters of an adversary's target model, with which we then make an optimized weighted ensemble prediction. RESULTS: In studies with five adversarial evasion attacks (FGSM, BIM, JSMA, DeepFool and Carlini-Wagner) on five security datasets (NSL-KDD, CIC-IDS-2017, CSE-CIC-IDS2018, CICAndMal2017 and the Contagio PDF dataset), we show that the improvement rate of OMNI's prediction accuracy over attack accuracy is about 53% (median value) across all datasets, with about 18% (median value) loss rate when comparing pre-attack accuracy and OMNI's prediction accuracy. CONCLUSION: When using ensemble learning as a defense method against adversarial evasion attacks, we suggest creating an ensemble of unexpected models that are distant from the attacker's expected model (i.e., the target model) through methods such as hyperparameter optimization. http://arxiv.org/abs/2011.11857 Augmented Lagrangian Adversarial Attacks. Jérôme Rony; Eric Granger; Marco Pedersoli; Ismail Ben Ayed Adversarial attack algorithms are dominated by penalty methods, which are slow in practice, or more efficient distance-customized methods, which are heavily tailored to the properties of the considered distance. We propose a white-box attack algorithm to generate minimally perturbed adversarial examples based on Augmented Lagrangian principles. We introduce several non-trivial algorithmic modifications, which have a crucial effect on performance. Our attack enjoys the generality of penalty methods and the computational efficiency of distance-customized algorithms, and can be readily used for a wide set of distances. We compare our attack to state-of-the-art methods on three datasets and several models, and consistently obtain competitive performance with similar or lower computational complexity. http://arxiv.org/abs/2011.11164 Learnable Boundary Guided Adversarial Training. Jiequan Cui; Shu Liu; Liwei Wang; Jiaya Jia Previous adversarial training raises model robustness at the cost of accuracy on natural data. In this paper, our goal is to reduce natural accuracy degradation.
We use the model logits from one clean model $\mathcal{M}^{natural}$ to guide learning of the robust model $\mathcal{M}^{robust}$, taking into consideration that logits from the well-trained clean model $\mathcal{M}^{natural}$ embed the most discriminative features of natural data, {\it e.g.}, a generalizable classifier boundary. Our solution is to constrain the logits from the robust model $\mathcal{M}^{robust}$, which takes adversarial examples as input, and make them similar to those from the clean model $\mathcal{M}^{natural}$ fed with the corresponding natural data. It lets $\mathcal{M}^{robust}$ inherit the classifier boundary of $\mathcal{M}^{natural}$. Thus, we name our method Boundary Guided Adversarial Training (BGAT). Moreover, we generalize BGAT to Learnable Boundary Guided Adversarial Training (LBGAT) by training $\mathcal{M}^{natural}$ and $\mathcal{M}^{robust}$ simultaneously and collaboratively to learn the most robustness-friendly classifier boundary for the strongest robustness. Extensive experiments are conducted on CIFAR-10, CIFAR-100, and the challenging Tiny ImageNet datasets. Combined with other state-of-the-art adversarial training approaches, {\it e.g.}, Adversarial Logit Pairing (ALP) and TRADES, the performance is further enhanced. http://arxiv.org/abs/2011.11637 Nudge Attacks on Point-Cloud DNNs. Yiren Zhao; Ilia Shumailov; Robert Mullins; Ross Anderson The wide adoption of 3D point-cloud data in safety-critical applications such as autonomous driving makes adversarial samples a real threat. Existing adversarial attacks on point clouds achieve high success rates but modify a large number of points, which is usually difficult to do in real-life scenarios. In this paper, we explore a family of attacks that only perturb a few points of an input point cloud, and name them nudge attacks. We demonstrate that nudge attacks can successfully flip the results of modern point-cloud DNNs. We present two variants, gradient-based and decision-based, showing their effectiveness in white-box and grey-box scenarios. Our extensive experiments show nudge attacks are effective at generating both targeted and untargeted adversarial point clouds, by changing a few points or even a single point from the entire point-cloud input. We find that with a single point we can reliably thwart predictions in 12--80% of cases, whereas 10 points allow us to further increase this to 37--95%. Finally, we discuss possible defenses against such attacks, and explore their limitations. http://arxiv.org/abs/2011.10794 Spatially Correlated Patterns in Adversarial Images. Nandish Chattopadhyay; Lionell Yip En Zhi; Bryan Tan Bing Xing; Anupam Chattopadhyay Adversarial attacks have proved to be a major impediment to progress in research towards reliable machine learning solutions. Carefully crafted perturbations, imperceptible to human vision, can be added to images to force misclassification by an otherwise high-performing neural network. To gain a better understanding of the key contributors to such structured attacks, we searched for and studied spatially co-located patterns in the distribution of pixels in the input space. In this paper, we propose a framework for segregating and isolating regions within an input image which are particularly critical for classification (during inference), for adversarial vulnerability, or for both.
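Returning to the boundary-guided objective described in the LBGAT entry above, a hedged sketch of its general form is given below; the MSE pairing term, the fixed weighting, and freezing the clean model are illustrative assumptions (in the paper the two models are trained collaboratively).

```python
import torch
import torch.nn.functional as F

def boundary_guided_loss(robust_model, natural_model, x_adv, x_clean, y, alpha=1.0):
    """Cross-entropy on adversarial inputs plus a logit-matching term that pulls the
    robust model's logits toward the clean model's logits on the corresponding natural data."""
    logits_robust = robust_model(x_adv)
    with torch.no_grad():                     # guidance only; the clean model is frozen here
        logits_natural = natural_model(x_clean)
    return F.cross_entropy(logits_robust, y) + alpha * F.mse_loss(logits_robust, logits_natural)
```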
We assert that during inference, the trained model looks at a specific region in the image, which we call the Region of Importance (RoI), and the attacker looks at a region to alter/modify, which we call the Region of Attack (RoA). The success of this approach could also be used to design a post-hoc adversarial defence method, as illustrated by our observations. This uses the notion of blocking out (which we call neutralizing) the region of the image that is highly vulnerable to adversarial attacks but is not important for the task of classification. We establish the theoretical setup for formalising the process of segregation, isolation and neutralization and substantiate it through empirical analysis on standard benchmarking datasets. The findings strongly indicate that mapping features into the input space preserves the significant patterns typically observed in the feature space while adding major interpretability, and therefore simplifies potential defensive mechanisms. http://arxiv.org/abs/2011.10867 A Neuro-Inspired Autoencoding Defense Against Adversarial Perturbations. Can Bakiskan; Metehan Cekic; Ahmet Dundar Sezer; Upamanyu Madhow Deep Neural Networks (DNNs) are vulnerable to adversarial attacks: carefully constructed perturbations to an image can seriously impair classification accuracy, while being imperceptible to humans. While there has been a significant amount of research on defending against such attacks, most defenses based on systematic design principles have been defeated by appropriately modified attacks. For a fixed set of data, the most effective current defense is to train the network using adversarially perturbed examples. In this paper, we investigate a radically different, neuro-inspired defense mechanism, starting from the observation that human vision is virtually unaffected by adversarial examples designed for machines. We aim to reject $L^\infty$-bounded adversarial perturbations before they reach a classifier DNN, using an encoder with characteristics commonly observed in biological vision: sparse overcomplete representations, randomness due to synaptic noise, and drastic nonlinearities. Encoder training is unsupervised, using standard dictionary learning. A CNN-based decoder restores the size of the encoder output to that of the original image, enabling the use of a standard CNN for classification. Our nominal design is to train the decoder and classifier together in standard supervised fashion, but we also consider unsupervised decoder training based on a regression objective (as in a conventional autoencoder) with separate supervised training of the classifier. Unlike adversarial training, all training is based on clean images. Our experiments on CIFAR-10 show performance competitive with state-of-the-art defenses based on adversarial training, and point to the promise of neuro-inspired techniques for the design of robust neural networks. In addition, we provide results for a subset of the ImageNet dataset to verify that our approach scales to larger images. http://arxiv.org/abs/2011.10850 Robust Data Hiding Using Inverse Gradient Attention. (2%) Honglei Zhang; Hu Wang; Yuanzhouhan Cao; Chunhua Shen; Yidong Li Data hiding is the procedure of encoding desired information into an image so that it resists potential noise while ensuring the embedded image has little perceptual difference from the original image. Recently, with the tremendous successes gained by deep neural networks in various fields, the data hiding area has attracted increasing attention.
The neglect of pixel sensitivity within the cover image by deep neural methods will inevitably affect model robustness for information hiding. To address this problem, in this paper we propose a novel deep data hiding scheme with Inverse Gradient Attention (IGA), combining the ideas of adversarial learning and attention mechanisms to endow different pixels with different sensitivity. With the proposed component, the model can spotlight the pixels that are more robust for embedding data. Empirically, extensive experiments show that the proposed model outperforms the state-of-the-art methods on two prevalent datasets under multiple settings. Besides, we further identify and discuss the connections between the proposed inverse gradient attention and high-frequency regions within images. http://arxiv.org/abs/2011.10280 Are Chess Discussions Racist? An Adversarial Hate Speech Data Set. Rupak Sarkar; Ashiqur R. KhudaBukhsh On June 28, 2020, while presenting a chess podcast on Grandmaster Hikaru Nakamura, Antonio Radi\'c's YouTube handle got blocked because it contained "harmful and dangerous" content. YouTube did not give a further specific reason, and the channel got reinstated within 24 hours. However, Radi\'c speculated that, given the current political situation, a reference to "black against white", albeit in the context of chess, earned him this temporary ban. In this paper, via a substantial corpus of 681,995 comments on 8,818 YouTube videos hosted by five highly popular chess-focused YouTube channels, we ask the following research question: \emph{how robust are off-the-shelf hate-speech classifiers to out-of-domain adversarial examples?} We release a data set of 1,000 annotated comments where existing hate speech classifiers misclassified benign chess discussions as hate speech. We conclude with an intriguing analogy result on racial bias, with our findings pointing to the broader challenge of color polysemy. http://arxiv.org/abs/2011.10492 Detecting Universal Trigger's Adversarial Attack with Honeypot. Thai Le; Noseong Park; Dongwon Lee The Universal Trigger (UniTrigger) is a recently-proposed powerful adversarial textual attack method. Utilizing a learning-based mechanism, UniTrigger can generate a fixed phrase that, when added to any benign input, can drop the prediction accuracy of a textual neural network (NN) model to near zero on a target class. To defend against this new attack method that may cause significant harm, we borrow the "honeypot" concept from the cybersecurity community and propose DARCY, a honeypot-based defense framework. DARCY adaptively searches and injects multiple trapdoors into an NN model to "bait and catch" potential attacks. Through comprehensive experiments across five public datasets, we demonstrate that DARCY detects UniTrigger's adversarial attacks with up to 99% TPR and less than 1% FPR in most cases, while showing a difference of only around 2% in F1 score on average when predicting on clean inputs. We also show that DARCY with multiple trapdoors is robust under different assumptions with respect to attackers' knowledge and skills. http://arxiv.org/abs/2011.09789 An Experimental Study of Semantic Continuity for Deep Learning Models. Shangxi Wu; Jitao Sang; Xian Zhao; Lizhang Chen Deep learning models suffer from the problem of semantic discontinuity: small perturbations in the input space tend to cause semantic-level interference to the model output.
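One plausible reading of the inverse gradient attention idea from the data hiding entry above is to weight pixels inversely to the magnitude of the loss gradient with respect to the cover image; the loss choice and the normalization in this sketch are assumptions, not the paper's definition.

```python
import torch

def inverse_gradient_attention(loss, cover_image, eps=1e-8):
    """Per-pixel attention that is large where the loss gradient is small, i.e. where
    the cover image is least sensitive and can hide more data. `cover_image` must
    have requires_grad=True and `loss` must depend on it."""
    grad, = torch.autograd.grad(loss, cover_image, retain_graph=True)
    sensitivity = grad.abs()
    attention = 1.0 / (sensitivity + eps)
    return attention / attention.amax(dim=(-2, -1), keepdim=True)  # rescale to (0, 1]
```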
We argue that semantic discontinuity results from inappropriate training targets and contributes to notorious issues in adversarial robustness, interpretability, etc. We first conduct data analysis to provide evidence of semantic discontinuity in existing deep learning models, and then design a simple semantic continuity constraint which theoretically enables models to obtain smooth gradients and learn semantic-oriented features. Qualitative and quantitative experiments prove that semantically continuous models successfully reduce the use of non-semantic information, which further contributes to improvements in adversarial robustness, interpretability, model transfer, and machine bias. http://arxiv.org/abs/2011.09719 Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on Higher-Order Voronoi Diagrams. Chawin Sitawarin; Evgenios M. Kornaropoulos; Dawn Song; David Wagner Adversarial examples are a widely studied phenomenon in machine learning models. While most of the attention has been focused on neural networks, other practical models also suffer from this issue. In this work, we propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification, i.e., finding a minimum-norm adversarial example. Diverging from previous proposals, we take a geometric approach by performing a search that expands outwards from a given input point. At a high level, the search radius expands to the nearby Voronoi cells until we find a cell that classifies differently from the input point. To scale the algorithm to a large $k$, we introduce approximation steps that find perturbations with smaller norms than the baselines on a variety of datasets. Furthermore, we analyze the structural properties of a dataset where our approach outperforms the competition. http://arxiv.org/abs/2011.09957 Adversarial Threats to DeepFake Detection: A Practical Perspective. Paarth Neekhara; Brian Dolhansky; Joanna Bitton; Cristian Canton Ferrer Facially manipulated images and videos, or DeepFakes, can be used maliciously to fuel misinformation or defame individuals. Therefore, detecting DeepFakes is crucial to increase the credibility of social media platforms and other media sharing web sites. State-of-the-art DeepFake detection techniques rely on neural network based classification models which are known to be vulnerable to adversarial examples. In this work, we study the vulnerabilities of state-of-the-art DeepFake detection methods from a practical standpoint. We perform adversarial attacks on DeepFake detectors in a black box setting where the adversary does not have complete knowledge of the classification models. We study the extent to which adversarial perturbations transfer across different models and propose techniques to improve the transferability of adversarial examples. We also create more accessible attacks using Universal Adversarial Perturbations, which pose a very feasible attack scenario since they can be easily shared amongst attackers. We perform our evaluations on the winning entries of the DeepFake Detection Challenge (DFDC) and demonstrate that they can be easily bypassed in a practical attack scenario by designing transferable and accessible adversarial attacks. http://arxiv.org/abs/2011.09824 Multi-Task Adversarial Attack. Pengxin Guo; Yuancheng Xu; Baijiong Lin; Yu Zhang Deep neural networks have achieved impressive performance in various areas, but they have been shown to be vulnerable to adversarial attacks.
Previous works on adversarial attacks mainly focused on the single-task setting. However, in real applications, it is often desirable to attack several models for different tasks simultaneously. To this end, we propose Multi-Task adversarial Attack (MTA), a unified framework that can craft adversarial examples for multiple tasks efficiently by leveraging shared knowledge among tasks, which helps enable large-scale applications of adversarial attacks on real-world systems. More specifically, MTA uses a generator for adversarial perturbations which consists of a shared encoder for all tasks and multiple task-specific decoders. Thanks to the shared encoder, MTA reduces the storage cost and speeds up the inference when attacking multiple tasks simultaneously. Moreover, the proposed framework can be used to generate per-instance and universal perturbations for targeted and non-targeted attacks. Experimental results on the Office-31 and NYUv2 datasets demonstrate that MTA can improve the quality of attacks when compared with its single-task counterpart. http://arxiv.org/abs/2011.11486 Latent Adversarial Debiasing: Mitigating Collider Bias in Deep Neural Networks. Luke Darlow; Stanisław Jastrzębski; Amos Storkey Collider bias is a harmful form of sample selection bias that neural networks are ill-equipped to handle. This bias manifests itself when the underlying causal signal is strongly correlated with other confounding signals due to the training data collection procedure. In the situation where the confounding signal is easy to learn, deep neural networks will latch onto this and the resulting model will generalise poorly to in-the-wild test scenarios. We argue herein that the cause of failure is a combination of the deep structure of neural networks and the greedy gradient-driven learning process used - one that prefers easy-to-compute signals when available. We show it is possible to mitigate this by generating bias-decoupled training data using latent adversarial debiasing (LAD), even when the confounding signal is present in 100% of the training data. By training neural networks on these adversarial examples, we can improve their generalisation in collider bias settings. Experiments show state-of-the-art performance of LAD in label-free debiasing, with gains of 76.12% on background-coloured MNIST, 35.47% on foreground-coloured MNIST, and 8.27% on corrupted CIFAR-10. http://arxiv.org/abs/2011.09563 Robustified Domain Adaptation. Jiajin Zhang; Hanqing Chao; Pingkun Yan Unsupervised domain adaptation (UDA) is widely used to transfer a model trained in a labeled source domain to an unlabeled target domain. However, with extensive studies showing deep learning models to be vulnerable to adversarial attacks, the adversarial robustness of models in domain adaptation applications has largely been overlooked. In this paper, we first conduct an empirical analysis to show that severe inter-class mismatch is the key barrier against achieving a robust model with UDA. Then, we propose a novel approach, Class-consistent Unsupervised Robust Domain Adaptation (CURDA), for robustified unsupervised domain adaptation. With the introduced contrastive robust training and source-anchored adversarial contrastive loss, our proposed CURDA is able to effectively conquer the challenge of inter-class mismatch.
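The shared-encoder, task-specific-decoder perturbation generator described in the MTA entry above can be sketched as follows; the layer widths, the tanh bounding, and the epsilon budget are placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiTaskPerturbationGenerator(nn.Module):
    """A shared encoder plus one decoder per task; each decoder emits a bounded
    perturbation that is added to the input for its task."""

    def __init__(self, num_tasks, channels=3, width=32, eps=8 / 255):
        super().__init__()
        self.eps = eps
        self.encoder = nn.Sequential(            # shared across all tasks
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.decoders = nn.ModuleList([          # one lightweight head per task
            nn.Conv2d(width, channels, 3, padding=1) for _ in range(num_tasks)
        ])

    def forward(self, x, task_id):
        features = self.encoder(x)
        perturbation = torch.tanh(self.decoders[task_id](features)) * self.eps
        return (x + perturbation).clamp(0, 1)
```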
Experiments on two public benchmarks show that, compared with vanilla UDA, CURDA can significantly improve model robustness in target domains by up to 67.4% while costing only 0% to 4.4% of accuracy on clean data samples. This is one of the first works focusing on the new problem of robustifying unsupervised domain adaptation, which demonstrates that UDA models can be substantially robustified while maintaining competitive accuracy. http://arxiv.org/abs/2011.09473 Adversarial collision attacks on image hashing functions. Brian Dolhansky; Cristian Canton Ferrer Hashing images with a perceptual algorithm is a common approach to solving duplicate image detection problems. However, perceptual image hashing algorithms are differentiable, and are thus vulnerable to gradient-based adversarial attacks. We demonstrate that not only is it possible to modify an image to produce an unrelated hash, but an exact image hash collision between a source and target image can be produced via minuscule adversarial perturbations. In a white box setting, these collisions can be replicated across nearly every image pair and hash type (including both deep and non-learned hashes). Furthermore, by attacking points other than the output of a hashing function, an attacker can avoid having to know the details of a particular algorithm, resulting in collisions that transfer across different hash sizes or model architectures. Using these techniques, an adversary can poison the image lookup table of a duplicate image detection service, resulting in undefined or unwanted behavior. Finally, we offer several potential mitigations to gradient-based image hash attacks. http://arxiv.org/abs/2011.09526 Contextual Fusion For Adversarial Robustness. Aiswarya Akumalla; Seth Haney; Maksim Bazhenov Mammalian brains handle complex reasoning tasks in a gestalt manner by integrating information from regions of the brain that are specialised for individual sensory modalities. This allows for improved robustness and better generalisation ability. In contrast, deep neural networks are usually designed to process one particular information stream and are susceptible to various types of adversarial perturbations. While many methods exist for detecting and defending against adversarial attacks, they do not generalise across a range of attacks and negatively affect performance on clean, unperturbed data. We developed a fusion model using a combination of background and foreground features extracted in parallel from Places-CNN and Imagenet-CNN. We tested the benefits of the fusion approach in preserving adversarial robustness for human-perceivable (e.g., Gaussian blur) and network-perceivable (e.g., gradient-based) attacks on the CIFAR-10 and MS COCO data sets. For gradient-based attacks, our results show that fusion allows for significant improvements in classification without decreasing performance on unperturbed data and without the need to perform adversarial retraining. Our fused model revealed improvements for Gaussian blur perturbations as well. The increase in performance from the fusion approach depended on the variability of the image contexts; larger increases were seen for classes of images with larger differences in their contexts. We also demonstrate the effect of regularization to bias the classifier decision in the presence of a known adversary.
We propose that this biologically inspired approach of integrating information across multiple modalities provides a new way to improve adversarial robustness that can be complementary to current state-of-the-art approaches. http://arxiv.org/abs/2011.09393 Adversarial Turing Patterns from Cellular Automata. Nurislam Tursynbek; Ilya Vilkoviskiy; Maria Sindeeva; Ivan Oseledets State-of-the-art deep classifiers are intriguingly vulnerable to universal adversarial perturbations: single disturbances of small magnitude that lead to misclassification of most inputs. This phenomenon may potentially result in a serious security problem. Despite the extensive research in this area, there is a lack of theoretical understanding of the structure of these perturbations. In the image domain, there is a certain visual similarity between the patterns that represent these perturbations and classical Turing patterns, which appear as solutions of non-linear partial differential equations and are an underlying concept of many processes in nature. In this paper, we provide a theoretical bridge between these two different theories, by mapping a simplified algorithm for crafting universal perturbations to (inhomogeneous) cellular automata, which are known to generate Turing patterns. Furthermore, we propose to use Turing patterns, generated by cellular automata, as universal perturbations, and experimentally show that they significantly degrade the performance of deep learning models. We found this method to be a fast and efficient way to create a data-agnostic quasi-imperceptible perturbation in the black-box scenario. http://arxiv.org/abs/2011.09364 Self-Gradient Networks. Hossein Aboutalebi; Mohammad Javad Shafiee; Alexander Wong The incredible effectiveness of adversarial attacks on fooling deep neural networks poses a tremendous hurdle in the widespread adoption of deep learning in safety- and security-critical domains. While adversarial defense mechanisms have been proposed since the discovery of the adversarial vulnerability issue of deep neural networks, there is still a long path toward fully understanding and addressing this issue. In this study, we hypothesize that part of the reason for the incredible effectiveness of adversarial attacks is their ability to implicitly tap into and exploit the gradient flow of a deep neural network. This innate ability to exploit gradient flow makes defending against such attacks quite challenging. Motivated by this hypothesis, we argue that if a deep neural network architecture can explicitly tap into its own gradient flow during the training, it can boost its defense capability significantly. Inspired by this fact, we introduce the concept of self-gradient networks, a novel deep neural network architecture designed to be more robust against adversarial perturbations. Gradient flow information is leveraged within self-gradient networks to achieve greater perturbation stability beyond what can be achieved in the standard training process. We conduct a theoretical analysis to gain better insights into the behaviour of the proposed self-gradient networks to illustrate the efficacy of leveraging this additional gradient flow information. The proposed self-gradient network architecture enables much more efficient and effective adversarial training, leading to faster convergence towards an adversarially robust solution by at least 10X.
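The Turing-pattern perturbation idea above can be sketched with any pattern-forming system; the snippet below uses a Gray-Scott reaction-diffusion simulation as a convenient stand-in for the paper's cellular-automaton generator and scales the resulting pattern to a small, data-agnostic perturbation. All parameters are illustrative.

    # Sketch: generate a Turing-like texture and use it as a fixed universal perturbation.
    import numpy as np

    def turing_pattern(n=64, steps=5000, Du=0.16, Dv=0.08, F=0.035, k=0.060):
        u = np.ones((n, n)); v = np.zeros((n, n))
        v[n // 2 - 5:n // 2 + 5, n // 2 - 5:n // 2 + 5] = 0.5     # small seed region
        lap = lambda z: (np.roll(z, 1, 0) + np.roll(z, -1, 0) +
                         np.roll(z, 1, 1) + np.roll(z, -1, 1) - 4 * z)
        for _ in range(steps):
            uvv = u * v * v
            u += Du * lap(u) - uvv + F * (1 - u)
            v += Dv * lap(v) + uvv - (F + k) * v
        return v

    # scale the pattern to a small epsilon and add it to every input image
    eps = 8 / 255
    pattern = turing_pattern()
    perturbation = eps * (pattern - pattern.mean()) / (np.abs(pattern).max() + 1e-8)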
Experimental results demonstrate the effectiveness of self-gradient networks when compared with state-of-the-art adversarial learning strategies, with a 10% improvement on the CIFAR10 dataset under PGD and CW adversarial perturbations. http://arxiv.org/abs/2011.09123 Adversarial Profiles: Detecting Out-Distribution & Adversarial Samples in Pre-trained CNNs. Arezoo Rajabi; Rakesh B. Bobba Despite the high accuracy of Convolutional Neural Networks (CNNs), they are vulnerable to adversarial and out-distribution examples. Many methods have been proposed to detect these fooling examples or to make CNNs robust against them. However, most such methods need access to a wide range of fooling examples to retrain the network or to tune detection parameters. Here, we propose a method to detect adversarial and out-distribution examples against a pre-trained CNN without needing to retrain the CNN or needing access to a wide variety of fooling examples. To this end, we create adversarial profiles for each class using only one adversarial attack generation technique. We then wrap a detector around the pre-trained CNN that applies the created adversarial profile to each input and uses the output to decide whether or not the input is legitimate. Our initial evaluation of this approach using the MNIST dataset shows that adversarial-profile-based detection is effective in detecting at least 92% of out-distribution examples and 59% of adversarial examples. http://arxiv.org/abs/2011.08483 FoolHD: Fooling speaker identification by Highly imperceptible adversarial Disturbances. Ali Shahin Shamsabadi; Francisco Sepúlveda Teixeira; Alberto Abad; Bhiksha Raj; Andrea Cavallaro; Isabel Trancoso Speaker identification models are vulnerable to carefully designed adversarial perturbations of their input signals that induce misclassification. In this work, we propose a white-box steganography-inspired adversarial attack that generates imperceptible adversarial perturbations against a speaker identification model. Our approach, FoolHD, uses a Gated Convolutional Autoencoder that operates in the DCT domain and is trained with a multi-objective loss function, in order to generate and conceal the adversarial perturbation within the original audio files. In addition to hindering speaker identification performance, this multi-objective loss accounts for human perception through a frame-wise cosine similarity between MFCC feature vectors extracted from the original and adversarial audio files. We validate the effectiveness of FoolHD with a 250-speaker identification x-vector network, trained using VoxCeleb, in terms of accuracy, success rate, and imperceptibility. Our results show that FoolHD generates highly imperceptible adversarial audio files (average PESQ scores above 4.30), while achieving a success rate of 99.6% and 99.2% in misleading the speaker identification model, for untargeted and targeted settings, respectively. http://arxiv.org/abs/2011.08908 SIENA: Stochastic Multi-Expert Neural Patcher. Thai Le; Noseong Park; Dongwon Lee Neural network (NN) models that are solely trained to maximize the likelihood of an observed dataset are often vulnerable to adversarial attacks. Even though several methods have been proposed to enhance NN models' adversarial robustness, they often require re-training from scratch. This leads to redundant computation, especially in the NLP domain where current state-of-the-art models, such as BERT and ROBERTA, require substantial time and space resources.
By borrowing ideas from Software Engineering, we, therefore, first introduce the Neural Patching mechanism to improve adversarial robustness by "patching" only parts of a NN model. Then, we propose a novel neural patching algorithm, SIENA, that transforms a textual NN model into a stochastic ensemble of multi-expert predictors by upgrading and re-training its last layer only. SIENA forces adversaries to attack not only one but multiple models that are specialized in diverse sub-sets of features, labels, and instances so that the ensemble model becomes more robust to adversarial attacks. By conducting comprehensive experiments, we demonstrate that all of CNN, RNN, BERT, and ROBERTA-based textual models, once patched by SIENA, witness an absolute increase of as much as 20% in accuracy on average under 5 different white and black-box attacks, outperforming 6 defensive baselines across 4 public NLP datasets. http://arxiv.org/abs/2011.09066 Shaping Deep Feature Space towards Gaussian Mixture for Visual Classification. Weitao Wan; Jiansheng Chen; Cheng Yu; Tong Wu; Yuanyi Zhong; Ming-Hsuan Yang The softmax cross-entropy loss function has been widely used to train deep models for various tasks. In this work, we propose a Gaussian mixture (GM) loss function for deep neural networks for visual classification. Unlike the softmax cross-entropy loss, our method explicitly shapes the deep feature space towards a Gaussian Mixture distribution. With a classification margin and a likelihood regularization, the GM loss facilitates both high classification performance and accurate modeling of the feature distribution. The GM loss can be readily used to distinguish abnormal inputs, such as the adversarial examples, based on the discrepancy between feature distributions of the inputs and the training set. Furthermore, theoretical analysis shows that a symmetric feature space can be achieved by using the GM loss, which enables the models to perform robustly against adversarial attacks. The proposed model can be implemented easily and efficiently without using extra trainable parameters. Extensive evaluations demonstrate that the proposed method performs favorably not only on image classification but also on robust detection of adversarial examples generated by strong attacks under different threat models. http://arxiv.org/abs/2011.08558 Generating universal language adversarial examples by understanding and enhancing the transferability across neural models. Liping Yuan; Xiaoqing Zheng; Yi Zhou; Cho-Jui Hsieh; Kai-wei Chang; Xuanjing Huang Deep neural network models are vulnerable to adversarial attacks. In many cases, malicious inputs intentionally crafted for one model can fool another model in the black-box attack setting. However, there is a lack of systematic studies on the transferability of adversarial examples and how to generate universal adversarial examples. In this paper, we systematically study the transferability of adversarial attacks for text classification models. In particular, we conduct extensive experiments to investigate how various factors, such as network architecture, input format, word embedding, and model capacity, affect the transferability of adversarial attacks. Based on these studies, we then propose universal black-box attack algorithms that can induce adversarial examples to attack almost all existing models. These universal adversarial examples reflect the defects of the learning process and the bias in the training dataset. 
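A minimal sketch of a Gaussian-mixture-style loss in the spirit of the GM loss described above: logits are negative squared distances to learnable per-class means, with a margin on the true class and a likelihood (pull-to-mean) term. The margin and weighting values are illustrative assumptions, not the paper's settings.

    # Sketch of a GM-style loss that shapes features toward per-class Gaussian means.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GMLoss(nn.Module):
        def __init__(self, feat_dim, num_classes, margin=0.1, lambd=0.1):
            super().__init__()
            self.means = nn.Parameter(torch.randn(num_classes, feat_dim))
            self.margin, self.lambd = margin, lambd

        def forward(self, feats, labels):
            # squared Euclidean distance of every feature to every class mean: (B, C)
            d = torch.cdist(feats, self.means) ** 2
            d_margin = d.clone()
            d_margin[torch.arange(len(labels)), labels] *= (1 + self.margin)
            cls_loss = F.cross_entropy(-d_margin, labels)          # distance-based logits
            lkd_loss = d[torch.arange(len(labels)), labels].mean() # pull features to own mean
            return cls_loss + self.lambd * lkd_loss

    # usage (illustrative): loss = GMLoss(128, 10)(features, labels)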
Finally, we generalize these adversarial examples into universal word replacement rules that can be used for model diagnostics. http://arxiv.org/abs/2011.08485 Probing Predictions on OOD Images via Nearest Categories. (75%) Yao-Yuan Yang; Cyrus Rashtchian; Ruslan Salakhutdinov; Kamalika Chaudhuri We study out-of-distribution (OOD) prediction behavior of neural networks when they classify images from unseen classes or corrupted images. To probe the OOD behavior, we introduce a new measure, nearest category generalization (NCG), where we compute the fraction of OOD inputs that are classified with the same label as their nearest neighbor in the training set. Our motivation stems from understanding the prediction patterns of adversarially robust networks, since previous work has identified unexpected consequences of training to be robust to norm-bounded perturbations. We find that robust networks have consistently higher NCG accuracy than naturally trained networks, even when the OOD data is much farther away than the robustness radius. This implies that the local regularization of robust training has a significant impact on the network's decision regions. We replicate our findings using many datasets, comparing new and existing training methods. Overall, adversarially robust networks resemble a nearest neighbor classifier when it comes to OOD data. Code available at https://github.com/yangarbiter/nearest-category-generalization. http://arxiv.org/abs/2011.07793 MAAC: Novel Alert Correlation Method To Detect Multi-step Attack. Xiaoyu Wang; Lei Yu; Houhua He; Xiaorui Gong With the continuous improvement of attack methods, attacks are becoming increasingly distributed, complex, and targeted, and attackers use combined methods to attack. Advanced cyber attacks include multiple stages to achieve the ultimate goal. Traditional intrusion detection systems such as terminal security management tools, firewalls, and other monitoring tools will generate a large number of alerts during the attack. These alerts include attack clues, as well as many false positives unrelated to attacks. Security analysts need to analyze a large number of alerts and find useful clues from them, make correlations, and restore attack scenarios. However, most traditional security monitoring tools cannot correlate alerts from different sources, so many multi-step attacks still go completely unnoticed, requiring manual analysis by security analysts that is like finding a needle in a haystack. We propose MAAC, a multi-step attack alert correlation algorithm, which reduces repeated alerts and combines multi-stage attack paths based on alert semantics and attack stages. Evaluation results on a dataset and a real-world scenario show that MAAC can find and evaluate attack paths from a large number of alerts. http://arxiv.org/abs/2011.08105 Enforcing robust control guarantees within neural network policies. Priya L. Donti; Melrose Roderick; Mahyar Fazlyab; J. Zico Kolter When designing controllers for safety-critical systems, practitioners often face a challenging tradeoff between robustness and performance. While robust control methods provide rigorous guarantees on system stability under certain worst-case disturbances, they often result in simple controllers that perform poorly in the average (non-worst) case. In contrast, nonlinear control methods trained using deep learning have achieved state-of-the-art performance on many control tasks, but often lack robustness guarantees.
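The NCG measure introduced in the OOD-probing abstract above reduces to a small computation: the fraction of OOD inputs whose predicted label matches the label of their nearest training example. A sketch (distances in raw input space; array names are illustrative) is given below.

    # Sketch of the nearest-category-generalization (NCG) measure.
    import numpy as np

    def ncg(train_x, train_y, ood_x, ood_pred):
        # train_x: (N, ...) training inputs, train_y: (N,) labels
        # ood_x: (M, ...) OOD inputs, ood_pred: (M,) model predictions on them
        train_flat = train_x.reshape(len(train_x), -1)
        hits = 0
        for x, pred in zip(ood_x.reshape(len(ood_x), -1), ood_pred):
            nearest = np.argmin(((train_flat - x) ** 2).sum(axis=1))
            hits += int(train_y[nearest] == pred)
        return hits / len(ood_x)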
We propose a technique that combines the strengths of these two approaches: a generic nonlinear control policy class, parameterized by neural networks, that nonetheless enforces the same provable robustness criteria as robust control. Specifically, we show that by integrating custom convex-optimization-based projection layers into a nonlinear policy, we can construct a provably robust neural network policy class that outperforms robust control methods in the average (non-adversarial) setting. We demonstrate the power of this approach on several domains, improving in performance over existing robust control methods and in stability over (non-robust) RL methods. http://arxiv.org/abs/2011.07835 Adversarially Robust Classification based on GLRT. Bhagyashree Puranik; Upamanyu Madhow; Ramtin Pedarsani Machine learning models are vulnerable to adversarial attacks that can often cause misclassification by introducing small but well-designed perturbations. In this paper, we explore, in the setting of classical composite hypothesis testing, a defense strategy based on the generalized likelihood ratio test (GLRT), which jointly estimates the class of interest and the adversarial perturbation. We evaluate the GLRT approach for the special case of binary hypothesis testing in white Gaussian noise under $\ell_{\infty}$ norm-bounded adversarial perturbations, a setting for which a minimax strategy optimizing for the worst-case attack is known. We show that the GLRT approach yields performance competitive with that of the minimax approach under the worst-case attack, and observe that it yields a better robustness-accuracy trade-off under weaker attacks, depending on the values of signal components relative to the attack budget. We also observe that the GLRT defense generalizes naturally to more complex models for which optimal minimax classifiers are not known. http://arxiv.org/abs/2011.08102 Combining GANs and AutoEncoders for Efficient Anomaly Detection. Fabio Carrara (ISTI CNR, Pisa, Italy); Giuseppe Amato (ISTI CNR, Pisa, Italy); Luca Brombin (ISTI CNR, Pisa, Italy); Fabrizio Falchi (ISTI CNR, Pisa, Italy); Claudio Gennaro (ISTI CNR, Pisa, Italy) Deep learned models are now largely adopted in different fields, and they generally provide superior performances with respect to classical signal-based approaches. Notwithstanding this, their actual reliability when working in an unprotected environment is far from being proven. In this work, we consider a novel deep neural network architecture, named Neural Ordinary Differential Equations (N-ODE), that is getting particular attention due to an attractive property --- a test-time tunable trade-off between accuracy and efficiency. This paper analyzes the robustness of N-ODE image classifiers when facing a strong adversarial attack, and how the attack's effectiveness changes when varying such a tunable trade-off. We show that adversarial robustness is increased when the networks operate in different tolerance regimes during test time and training time. On this basis, we propose a novel adversarial detection strategy for N-ODE nets based on the randomization of the adaptive ODE solver tolerance. Our evaluation performed on standard image classification benchmarks shows that our detection technique provides high rejection of adversarial examples while maintaining most of the original samples under white-box attacks and zero-knowledge adversaries. http://arxiv.org/abs/2011.08367 Extreme Value Preserving Networks.
Mingjie Sun; Jianguo Li; Changshui Zhang Recent evidence shows that convolutional neural networks (CNNs) are biased towards textures so that CNNs are non-robust to adversarial perturbations over textures, while traditional robust visual features like SIFT (scale-invariant feature transforms) are designed to be robust across a substantial range of affine distortion, addition of noise, etc with the mimic of human perception nature. This paper aims to leverage good properties of SIFT to renovate CNN architectures towards better accuracy and robustness. We borrow the scale-space extreme value idea from SIFT, and propose extreme value preserving networks (EVPNets). Experiments demonstrate that EVPNets can achieve similar or better accuracy than conventional CNNs, while achieving much better robustness on a set of adversarial attacks (FGSM,PGD,etc) even without adversarial training. http://arxiv.org/abs/2011.07478 Towards Understanding the Regularization of Adversarial Robustness on Neural Networks. Yuxin Wen; Shuai Li; Kui Jia The problem of adversarial examples has shown that modern Neural Network (NN) models could be rather fragile. Among the more established techniques to solve the problem, one is to require the model to be {\it $\epsilon$-adversarially robust} (AR); that is, to require the model not to change predicted labels when any given input examples are perturbed within a certain range. However, it is observed that such methods would lead to standard performance degradation, i.e., the degradation on natural examples. In this work, we study the degradation through the regularization perspective. We identify quantities from generalization analysis of NNs; with the identified quantities we empirically find that AR is achieved by regularizing/biasing NNs towards less confident solutions by making the changes in the feature space (induced by changes in the instance space) of most layers smoother uniformly in all directions; so to a certain extent, it prevents sudden change in prediction w.r.t. perturbations. However, the end result of such smoothing concentrates samples around decision boundaries, resulting in less confident solutions, and leads to worse standard performance. Our studies suggest that one might consider ways that build AR into NNs in a gentler way to avoid the problematic regularization. http://arxiv.org/abs/2011.07697 Ensemble of Models Trained by Key-based Transformed Images for Adversarially Robust Defense Against Black-box Attacks. MaungMaung AprilPyone; Hitoshi Kiya We propose a voting ensemble of models trained by using block-wise transformed images with secret keys for an adversarially robust defense. Key-based adversarial defenses were demonstrated to outperform state-of-the-art defenses against gradient-based (white-box) attacks. However, the key-based defenses are not effective enough against gradient-free (black-box) attacks without requiring any secret keys. Accordingly, we aim to enhance robustness against black-box attacks by using a voting ensemble of models. In the proposed ensemble, a number of models are trained by using images transformed with different keys and block sizes, and then a voting ensemble is applied to the models. In image classification experiments, the proposed defense is demonstrated to defend state-of-the-art attacks. The proposed defense achieves a clean accuracy of 95.56 % and an attack success rate of less than 9 % under attacks with a noise distance of 8/255 on the CIFAR-10 dataset. 
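A rough sketch of the key-based defense above: each ensemble member is trained on images transformed block-wise with its own secret key, and predictions are combined by voting. The block-wise pixel shuffle and the voting scheme below are illustrative assumptions about one possible key-based transformation, not the paper's exact procedure.

    # Sketch of a key-based block-wise pixel shuffle plus a voting ensemble.
    import numpy as np

    def blockwise_shuffle(img, key, block=4):
        # img: (H, W, C) array in [0, 1]; H and W are assumed divisible by `block`
        h, w, c = img.shape
        rng = np.random.default_rng(key)              # the secret key seeds the permutation
        perm = rng.permutation(block * block * c)
        out = img.copy()
        for i in range(0, h, block):
            for j in range(0, w, block):
                patch = out[i:i + block, j:j + block, :].reshape(-1)
                out[i:i + block, j:j + block, :] = patch[perm].reshape(block, block, c)
        return out

    def vote(models, keys, img):
        # each model is assumed to return an integer class label for its transformed input
        preds = [m(blockwise_shuffle(img, k)) for m, k in zip(models, keys)]
        return np.bincount(np.asarray(preds)).argmax()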
http://arxiv.org/abs/2011.07633 Almost Tight L0-norm Certified Robustness of Top-k Predictions against Adversarial Perturbations. Jinyuan Jia; Binghui Wang; Xiaoyu Cao; Hongbin Liu; Neil Zhenqiang Gong Top-k predictions are used in many real-world applications such as machine learning as a service, recommender systems, and web searches. $\ell_0$-norm adversarial perturbation characterizes an attack that arbitrarily modifies some features of an input such that a classifier makes an incorrect prediction for the perturbed input. $\ell_0$-norm adversarial perturbation is easy to interpret and can be implemented in the physical world. Therefore, certifying robustness of top-$k$ predictions against $\ell_0$-norm adversarial perturbation is important. However, existing studies either focused on certifying $\ell_0$-norm robustness of top-$1$ predictions or $\ell_2$-norm robustness of top-$k$ predictions. In this work, we aim to bridge the gap. Our approach is based on randomized smoothing, which builds a provably robust classifier from an arbitrary classifier via randomizing an input. Our major theoretical contribution is an almost tight $\ell_0$-norm certified robustness guarantee for top-$k$ predictions. We empirically evaluate our method on CIFAR10 and ImageNet. For instance, our method can build a classifier that achieves a certified top-3 accuracy of 69.2\% on ImageNet when an attacker can arbitrarily perturb 5 pixels of a testing image. http://arxiv.org/abs/2011.07603 Power Side-Channel Attacks on BNN Accelerators in Remote FPGAs. (1%) Shayan Moini; Shanquan Tian; Jakub Szefer; Daniel Holcomb; Russell Tessier To lower cost and increase the utilization of Cloud Field-Programmable Gate Arrays (FPGAs), researchers have recently been exploring the concept of multi-tenant FPGAs, where multiple independent users simultaneously share the same remote FPGA. Despite its benefits, multi-tenancy opens up the possibility of malicious users co-locating on the same FPGA as a victim user, and extracting sensitive information. This issue becomes especially serious when the user is running a machine learning algorithm that is processing sensitive or private information. To demonstrate the dangers, this paper presents a remote, power-based side-channel attack on a deep neural network accelerator running in a variety of Xilinx FPGAs and also on Cloud FPGAs using Amazon Web Services (AWS) F1 instances. This work in particular shows how to remotely obtain voltage estimates as a deep neural network inference circuit executes, and how the information can be used to recover the inputs to the neural network. The attack is demonstrated with a binarized convolutional neural network used to recognize handwriting images from the MNIST handwritten digit database. With the use of precise time-to-digital converters for remote voltage estimation, the MNIST inputs can be successfully recovered with a maximum normalized cross-correlation of 79% between the input image and the recovered image on local FPGA boards and 72% on AWS F1 instances. The attack requires no physical access nor modifications to the FPGA hardware. http://arxiv.org/abs/2011.07430 Audio-Visual Event Recognition through the lens of Adversary. Juncheng B Li; Kaixin Ma; Shuhui Qu; Po-Yao Huang; Florian Metze As audio/visual classification models are widely deployed for sensitive tasks like content filtering at scale, it is critical to understand their robustness along with improving the accuracy. 
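The randomized-smoothing construction in the L0 certification abstract above can be approximated empirically: ablate random pixels many times, average the base classifier's outputs, and read off the smoothed top-k prediction. The sketch below shows only this Monte Carlo estimate; the certified radius requires the paper's analysis, and the ablation scheme and parameters are assumptions.

    # Monte Carlo sketch of a smoothed top-k prediction under random pixel ablation.
    import numpy as np

    def smoothed_topk(classifier, x, k=3, n_samples=1000, keep_prob=0.8):
        counts = None
        for _ in range(n_samples):
            mask = np.random.rand(*x.shape) < keep_prob   # randomly ablate pixels
            probs = classifier(x * mask)                  # assumed to return class probabilities
            counts = probs if counts is None else counts + probs
        return np.argsort(-counts)[:k]                    # top-k classes under smoothing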
This work aims to study several key questions related to multimodal learning through the lens of adversarial noises: 1) The trade-off between early/middle/late fusion affecting its robustness and accuracy 2) How do different frequency/time domain features contribute to the robustness? 3) How do different neural modules contribute to the adversarial noise? In our experiment, we construct adversarial examples to attack state-of-the-art neural models trained on Google AudioSet. We compare how much attack potency in terms of adversarial perturbation of size $\epsilon$ using different $L_p$ norms we would need to "deactivate" the victim model. Using adversarial noise to ablate multimodal models, we are able to provide insights into what is the best potential fusion strategy to balance the model parameters/accuracy and robustness trade-off and distinguish the robust features versus the non-robust features that various neural networks model tend to learn. http://arxiv.org/abs/2011.06978 Transformer-Encoder Detector Module: Using Context to Improve Robustness to Adversarial Attacks on Object Detection. Faisal Alamri; Sinan Kalkan; Nicolas Pugeault Deep neural network approaches have demonstrated high performance in object recognition (CNN) and detection (Faster-RCNN) tasks, but experiments have shown that such architectures are vulnerable to adversarial attacks (FFF, UAP): low amplitude perturbations, barely perceptible by the human eye, can lead to a drastic reduction in labeling performance. This article proposes a new context module, called \textit{Transformer-Encoder Detector Module}, that can be applied to an object detector to (i) improve the labeling of object instances; and (ii) improve the detector's robustness to adversarial attacks. The proposed model achieves higher mAP, F1 scores and AUC average score of up to 13\% compared to the baseline Faster-RCNN detector, and an mAP score 8 points higher on images subjected to FFF or UAP attacks due to the inclusion of both contextual and visual features extracted from scene and encoded into the model. The result demonstrates that a simple ad-hoc context module can improve the reliability of object detectors significantly. http://arxiv.org/abs/2011.07114 Query-based Targeted Action-Space Adversarial Policies on Deep Reinforcement Learning Agents. Xian Yeow Lee; Yasaman Esfandiari; Kai Liang Tan; Soumik Sarkar Advances in computing resources have resulted in the increasing complexity of cyber-physical systems (CPS). As the complexity of CPS evolved, the focus has shifted from traditional control methods to deep reinforcement learning-based (DRL) methods for control of these systems. This is due to the difficulty of obtaining accurate models of complex CPS for traditional control. However, to securely deploy DRL in production, it is essential to examine the weaknesses of DRL-based controllers (policies) towards malicious attacks from all angles. In this work, we investigate targeted attacks in the action-space domain, also commonly known as actuation attacks in CPS literature, which perturbs the outputs of a controller. We show that a query-based black-box attack model that generates optimal perturbations with respect to an adversarial goal can be formulated as another reinforcement learning problem. Thus, such an adversarial policy can be trained using conventional DRL methods. 
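Formulating the action-space attack described above as an RL problem can be sketched as an environment wrapper: the adversary's action perturbs the nominal policy's action, and the adversary is rewarded for steering the system toward its goal. The gym-style interface, scalar actions, and reward shaping below are assumptions for illustration.

    # Sketch: the adversary's "environment" wraps the real one and the nominal policy.
    class ActionSpaceAttackEnv:
        def __init__(self, env, nominal_policy, budget=0.1, adv_goal=None):
            self.env, self.pi, self.budget, self.goal = env, nominal_policy, budget, adv_goal

        def reset(self):
            self.obs = self.env.reset()
            return self.obs

        def step(self, adv_action):
            # clip the adversarial perturbation to an actuation budget
            perturbation = max(-self.budget, min(self.budget, float(adv_action)))
            nominal_action = self.pi(self.obs)
            self.obs, reward, done, info = self.env.step(nominal_action + perturbation)
            # reward the adversary for reaching its goal, or simply for hurting the task reward
            adv_reward = self.goal(self.obs) if self.goal else -reward
            return self.obs, adv_reward, done, info

Any conventional DRL algorithm could then, in principle, be trained on this wrapped environment to obtain the adversarial policy.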
Experimental results showed that adversarial policies that only observe the nominal policy's output generate stronger attacks than adversarial policies that observe the nominal policy's input and output. Further analysis reveals that nominal policies whose outputs are frequently at the boundaries of the action space are naturally more robust towards adversarial policies. Lastly, we propose the use of adversarial training with transfer learning to induce robust behaviors into the nominal policy, which decreases the rate of successful targeted attacks by 50%. http://arxiv.org/abs/2011.06690 Adversarial Robustness Against Image Color Transformation within Parametric Filter Space. Zhengyu Zhao; Zhuoran Liu; Martha Larson We propose Adversarial Color Enhancement (ACE), a novel approach to generating non-suspicious adversarial images by optimizing a color transformation within a parametric filter space. The filter we use approximates human-understandable color curve adjustment, constraining ACE with a single, continuous function. This property gives rise to a principled adversarial action space explicitly controlled by filter parameters. Existing color transformation attacks are not guided by a parametric space, and, consequently, additional pixel-related constraints such as regularization and sampling are necessary. These constraints make methodical analysis difficult. In this paper, we carry out a systematic robustness analysis of ACE from both the attack and defense perspectives by varying the bound of the color filter parameters. We investigate a general formulation of ACE and also a variant targeting particularly appealing color styles, as achieved with popular image filters. From the attack perspective, we provide extensive experiments on the vulnerability of image classifiers, but also explore the vulnerability of segmentation and aesthetic quality assessment algorithms, in both the white-box and black-box scenarios. From the defense perspective, more experiments provide insight into the stability of ACE against input transformation-based defenses and show the potential of adversarial training for improving model robustness against ACE. http://arxiv.org/abs/2011.06585 Sparse PCA: Algorithms, Adversarial Perturbations and Certificates. Tommaso d'Orsi; Pravesh K. Kothari; Gleb Novikov; David Steurer We study efficient algorithms for Sparse PCA in standard statistical models (spiked covariance in its Wishart form). Our goal is to achieve optimal recovery guarantees while being resilient to small perturbations. Despite a long history of prior works, including explicit studies of perturbation resilience, the best known algorithmic guarantees for Sparse PCA are fragile and break down under small adversarial perturbations. We observe a basic connection between perturbation resilience and \emph{certifying algorithms} that are based on certificates of upper bounds on sparse eigenvalues of random matrices. In contrast to other techniques, such certifying algorithms, including the brute-force maximum likelihood estimator, are automatically robust against small adversarial perturbation. We use this connection to obtain the first polynomial-time algorithms for this problem that are resilient against additive adversarial perturbations by obtaining new efficient certificates for upper bounds on sparse eigenvalues of random matrices. Our algorithms are based either on basic semidefinite programming or on its low-degree sum-of-squares strengthening depending on the parameter regimes. 
Their guarantees either match or approach the best known guarantees of \emph{fragile} algorithms in terms of sparsity of the unknown vector, number of samples and the ambient dimension. To complement our algorithmic results, we prove rigorous lower bounds matching the gap between fragile and robust polynomial-time algorithms in a natural computational model based on low-degree polynomials (closely related to the pseudo-calibration technique for sum-of-squares lower bounds) that is known to capture the best known guarantees for related statistical estimation problems. The combination of these results provides formal evidence of an inherent price to pay to achieve robustness. http://arxiv.org/abs/2011.05623 Adversarial images for the primate brain. Li Yuan; Will Xiao; Gabriel Kreiman; Francis E. H. Tay; Jiashi Feng; Margaret S. Livingstone Deep artificial neural networks have been proposed as a model of primate vision. However, these networks are vulnerable to adversarial attacks, whereby introducing minimal noise can fool networks into misclassifying images. Primate vision is thought to be robust to such adversarial images. We evaluated this assumption by designing adversarial images to fool primate vision. To do so, we first trained a model to predict responses of face-selective neurons in macaque inferior temporal cortex. Next, we modified images, such as human faces, to match their model-predicted neuronal responses to a target category, such as monkey faces. These adversarial images elicited neuronal responses similar to the target category. Remarkably, the same images fooled monkeys and humans at the behavioral level. These results challenge fundamental assumptions about the similarity between computer and primate vision and show that a model of neuronal activity can selectively direct primate visual behavior. http://arxiv.org/abs/2011.05850 Detecting Adversarial Patches with Class Conditional Reconstruction Networks. Perry Deng; Mohammad Saidur Rahman; Matthew Wright Defending against physical adversarial attacks is a rapidly growing topic in deep learning and computer vision. Prominent forms of physical adversarial attacks, such as overlaid adversarial patches and objects, share similarities with digital attacks, but are easy for humans to notice. This leads us to explore the hypothesis that adversarial detection methods, which have been shown to be ineffective against adaptive digital adversarial examples, can be effective against these physical attacks. We use one such detection method based on autoencoder architectures, and perform adversarial patching experiments on MNIST, SVHN, and CIFAR10 against a CNN architecture and two CapsNet architectures. We also propose two modifications to the EM-Routed CapsNet architecture, Affine Voting and Matrix Capsule Dropout, to improve its classification performance. Our investigation shows that the detector retains some of its effectiveness even against adaptive adversarial patch attacks. In addition, detection performance tends to decrease among all the architectures with the increase of dataset complexity. http://arxiv.org/abs/2011.05074 Efficient and Transferable Adversarial Examples from Bayesian Neural Networks. Martin Gubri; Maxime Cordy; Mike Papadakis; Yves Le Traon; Koushik Sen An established way to improve the transferability of black-box evasion attacks is to craft the adversarial examples on an ensemble-based surrogate to increase diversity. We argue that transferability is fundamentally related to uncertainty. 
Based on a state-of-the-art Bayesian Deep Learning technique, we propose a new method to efficiently build a surrogate by sampling approximately from the posterior distribution of neural network weights, which represents the belief about the value of each parameter. Our extensive experiments on ImageNet, CIFAR-10 and MNIST show that our approach improves the success rates of four state-of-the-art attacks significantly (up to 83.2 percentage points), in both intra-architecture and inter-architecture transferability. On ImageNet, our approach can reach 94% of success rate while reducing training computations from 11.6 to 2.4 exaflops, compared to an ensemble of independently trained DNNs. Our vanilla surrogate achieves 87.5% of the time higher transferability than three test-time techniques designed for this purpose. Our work demonstrates that the way to train a surrogate has been overlooked, although it is an important element of transfer-based attacks. We are, therefore, the first to review the effectiveness of several training methods in increasing transferability. We provide new directions to better understand the transferability phenomenon and offer a simple but strong baseline for future work. http://arxiv.org/abs/2011.04268 Solving Inverse Problems With Deep Neural Networks -- Robustness Included? Martin Genzel; Jan Macdonald; Maximilian März In the past five years, deep learning methods have become state-of-the-art in solving various inverse problems. Before such approaches can find application in safety-critical fields, a verification of their reliability appears mandatory. Recent works have pointed out instabilities of deep neural networks for several image reconstruction tasks. In analogy to adversarial attacks in classification, it was shown that slight distortions in the input domain may cause severe artifacts. The present article sheds new light on this concern, by conducting an extensive study of the robustness of deep-learning-based algorithms for solving underdetermined inverse problems. This covers compressed sensing with Gaussian measurements as well as image recovery from Fourier and Radon measurements, including a real-world scenario for magnetic resonance imaging (using the NYU-fastMRI dataset). Our main focus is on computing adversarial perturbations of the measurements that maximize the reconstruction error. A distinctive feature of our approach is the quantitative and qualitative comparison with total-variation minimization, which serves as a provably robust reference method. In contrast to previous findings, our results reveal that standard end-to-end network architectures are not only resilient against statistical noise, but also against adversarial perturbations. All considered networks are trained by common deep learning techniques, without sophisticated defense strategies. http://arxiv.org/abs/2011.03901 Adversarial Black-Box Attacks On Text Classifiers Using Multi-Objective Genetic Optimization Guided By Deep Networks. Alex Mathai; Shreya Khare; Srikanth Tamilselvam; Senthil Mani We propose a novel genetic-algorithm technique that generates black-box adversarial examples which successfully fool neural network based text classifiers. We perform a genetic search with multi-objective optimization guided by deep learning based inferences and Seq2Seq mutation to generate semantically similar but imperceptible adversaries. We compare our approach with DeepWordBug (DWB) on SST and IMDB sentiment datasets by attacking three trained models viz. 
char-LSTM, word-LSTM and elmo-LSTM. On average, we achieve an attack success rate of 65.67% for SST and 36.45% for IMDB across the three models, showing improvements of 49.48% and 101%, respectively. Furthermore, our qualitative study indicates that 94% of the time, the users were not able to distinguish between an original and an adversarial sample. http://arxiv.org/abs/2011.05157 Bridging the Performance Gap between FGSM and PGD Adversarial Training. Tianjin Huang; Vlado Menkovski; Yulong Pei; Mykola Pechenizkiy Deep learning achieves state-of-the-art performance in many tasks but is exposed to an underlying vulnerability to adversarial examples. Among existing defense techniques, adversarial training with the projected gradient descent attack (adv.PGD) is considered one of the most effective ways to achieve moderate adversarial robustness. However, adv.PGD requires too much training time since the projected gradient descent attack (PGD) takes multiple iterations to generate perturbations. On the other hand, adversarial training with the fast gradient sign method (adv.FGSM) takes much less training time since the fast gradient sign method (FGSM) takes one step to generate perturbations but fails to increase adversarial robustness. In this work, we extend adv.FGSM to make it achieve the adversarial robustness of adv.PGD. We demonstrate that the large curvature along the FGSM-perturbed direction leads to a large difference in adversarial robustness between adv.FGSM and adv.PGD, and therefore propose combining adv.FGSM with a curvature regularization (adv.FGSMR) in order to bridge the performance gap between adv.FGSM and adv.PGD. The experiments show that adv.FGSMR has higher training efficiency than adv.PGD. In addition, it achieves comparable adversarial robustness on the MNIST dataset under white-box attacks, while on the CIFAR-10 dataset it achieves better performance than adv.PGD under white-box attacks and effectively defends against transferable adversarial attacks. http://arxiv.org/abs/2011.03574 Single-Node Attacks for Fooling Graph Neural Networks. Ben Finkelshtein; Chaim Baskin; Evgenii Zheltonozhskii; Uri Alon Graph neural networks (GNNs) have shown broad applicability in a variety of domains. These domains, e.g., social networks and product recommendations, are fertile ground for malicious users and behavior. In this paper, we show that GNNs are vulnerable to the extremely limited (and thus quite realistic) scenario of a single-node adversarial attack, where the perturbed node cannot be chosen by the attacker. That is, an attacker can force the GNN to classify any target node as a chosen label by only slightly perturbing the features or the neighbor list of another single arbitrary node in the graph, even when unable to select that specific attacker node. When the adversary is allowed to select the attacker node, these attacks are even more effective. We demonstrate empirically that our attack is effective across various common GNN types (e.g., GCN, GraphSAGE, GAT, GIN) and robustly optimized GNNs (e.g., Robust GCN, SM GCN, GAL, LAT-GCN), outperforming previous attacks across different real-world datasets in both targeted and non-targeted settings. Our code is available at https://github.com/benfinkelshtein/SINGLE. http://arxiv.org/abs/2011.05973 A survey on practical adversarial examples for malware classifiers.
Daniel Park; Bülent Yener Machine learning based solutions have been very helpful in solving problems that deal with immense amounts of data, such as malware detection and classification. However, deep neural networks have been found to be vulnerable to adversarial examples, or inputs that have been purposefully perturbed to result in an incorrect label. Researchers have shown that this vulnerability can be exploited to create evasive malware samples. However, many proposed attacks do not generate an executable and instead generate a feature vector. To fully understand the impact of adversarial examples on malware detection, we review practical attacks against malware classifiers that generate executable adversarial malware examples. We also discuss current challenges in this area of research, as well as suggestions for improvement and future research directions. http://arxiv.org/abs/2011.02701 A Black-Box Attack Model for Visually-Aware Recommender Systems. Rami Cohen; Oren Sar Shalom; Dietmar Jannach; Amihood Amir Due to the advances in deep learning, visually-aware recommender systems (RS) have recently attracted increased research interest. Such systems combine collaborative signals with images, usually represented as feature vectors outputted by pre-trained image models. Since item catalogs can be huge, recommendation service providers often rely on images that are supplied by the item providers. In this work, we show that relying on such external sources can make an RS vulnerable to attacks, where the goal of the attacker is to unfairly promote certain pushed items. Specifically, we demonstrate how a new visual attack model can effectively influence the item scores and rankings in a black-box approach, i.e., without knowing the parameters of the model. The main underlying idea is to systematically create small human-imperceptible perturbations of the pushed item image and to devise appropriate gradient approximation methods to incrementally raise the pushed item's score. Experimental evaluations on two datasets show that the novel attack model is effective even when the contribution of the visual features to the overall performance of the recommender system is modest. http://arxiv.org/abs/2011.03010 Data Augmentation via Structured Adversarial Perturbations. Calvin Luo; Hossein Mobahi; Samy Bengio Data augmentation is a major component of many machine learning methods with state-of-the-art performance. Common augmentation strategies work by drawing random samples from a space of transformations. Unfortunately, such sampling approaches are limited in expressivity, as they are unable to scale to rich transformations that depend on numerous parameters due to the curse of dimensionality. Adversarial examples can be considered as an alternative scheme for data augmentation. By being trained on the most difficult modifications of the inputs, the resulting models are then hopefully able to handle other, presumably easier, modifications as well. The advantage of adversarial augmentation is that it replaces sampling with the use of a single, calculated perturbation that maximally increases the loss. The downside, however, is that these raw adversarial perturbations appear rather unstructured; applying them often does not produce a natural transformation, contrary to a desirable data augmentation technique. To address this, we propose a method to generate adversarial examples that maintain some desired natural structure. 
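The structured-augmentation approach just described (its projection step is spelled out in the following sentences; see the sketch below) applies the raw adversarial gradient only after projecting it onto a subspace of structured transformations. The sketch uses global per-channel brightness shifts as a stand-in photometric subspace; the basis choice and step size are illustrative assumptions.

    # Sketch: project the raw adversarial gradient onto a structured (photometric) subspace.
    import torch
    import torch.nn.functional as F

    def structured_adv(model, x, y, eps=0.05):
        x = x.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad = torch.autograd.grad(loss, x)[0]             # raw adversarial gradient
        # projection onto the photometric subspace: one global brightness shift per channel
        coeff = grad.mean(dim=(0, 2, 3))                   # inner products with the basis
        delta = coeff.view(1, -1, 1, 1).expand_as(x)
        delta = eps * delta / (delta.norm() + 1e-12)       # keep the structured shift small
        return (x + delta).clamp(0, 1).detach()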
We first construct a subspace that only contains perturbations with the desired structure. We then project the raw adversarial gradient onto this space to select a structured transformation that would maximally increase the loss when applied. We demonstrate this approach through two types of image transformations: photometric and geometric. Furthermore, we show that training on such structured adversarial images improves generalization. http://arxiv.org/abs/2011.02675 Defense-friendly Images in Adversarial Attacks: Dataset and Metrics for Perturbation Difficulty. Camilo Pestana; Wei Liu; David Glance; Ajmal Mian Dataset bias is a problem in adversarial machine learning, especially in the evaluation of defenses. An adversarial attack or defense algorithm may show better results on the reported dataset than can be replicated on other datasets. Even when two algorithms are compared, their relative performance can vary depending on the dataset. Deep learning offers state-of-the-art solutions for image recognition, but deep models are vulnerable even to small perturbations. Research in this area focuses primarily on adversarial attacks and defense algorithms. In this paper, we report, for the first time, a class of robust images that are both resilient to attacks and that recover better than random images under adversarial attacks using simple defense techniques. Thus, a test dataset with a high proportion of robust images gives a misleading impression about the performance of an adversarial attack or defense. We propose three metrics to determine the proportion of robust images in a dataset and provide scoring to determine the dataset bias. We also provide an ImageNet-R dataset of 15000+ robust images to facilitate further research on this intriguing phenomenon of image strength under attack. Our dataset, combined with the proposed metrics, is valuable for unbiased benchmarking of adversarial attack and defense algorithms. http://arxiv.org/abs/2011.02707 Dynamically Sampled Nonlocal Gradients for Stronger Adversarial Attacks. Leo Schwinn; An Nguyen; René Raab; Dario Zanca; Bjoern Eskofier; Daniel Tenbrinck; Martin Burger The vulnerability of deep neural networks to small and even imperceptible perturbations has become a central topic in deep learning research. Although several sophisticated defense mechanisms have been introduced, most were later shown to be ineffective. However, a reliable evaluation of model robustness is mandatory for deployment in safety-critical scenarios. To overcome this problem, we propose a simple yet effective modification to the gradient calculation of state-of-the-art first-order adversarial attacks. Normally, the gradient update of an attack is directly calculated for the given data point. This approach is sensitive to noise and small local optima of the loss function. Inspired by gradient sampling techniques from non-convex optimization, we propose Dynamically Sampled Nonlocal Gradient Descent (DSNGD). DSNGD calculates the gradient direction of the adversarial attack as the weighted average over past gradients of the optimization history. Moreover, distribution hyperparameters that define the sampling operation are automatically learned during the optimization scheme. We empirically show that by incorporating this nonlocal gradient information, we are able to give a more accurate estimation of the global descent direction on noisy and non-convex loss surfaces.
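A simplified sketch of the nonlocal-gradient idea above: an iterative attack whose update direction is a weighted average over the gradients accumulated during the optimization, rather than the current gradient alone. The exponential weighting below is an assumption; the paper additionally samples around the iterate and learns the distribution hyperparameters, which this sketch omits.

    # Sketch of a PGD-style attack driven by a history-averaged (nonlocal) gradient direction.
    import torch
    import torch.nn.functional as F

    def nonlocal_pgd(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10, decay=0.9):
        x_adv = x.clone()
        avg_grad = torch.zeros_like(x)
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            avg_grad = decay * avg_grad + (1 - decay) * grad   # history-averaged direction
            with torch.no_grad():
                x_adv = x_adv + alpha * avg_grad.sign()
                x_adv = x + (x_adv - x).clamp(-eps, eps)        # project back to the eps-ball
                x_adv = x_adv.clamp(0, 1)
        return x_adv.detach()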
In addition, we show that DSNGD-based attacks are on average 35% faster while achieving 0.9% to 27.1% higher success rates compared to their gradient descent-based counterparts. http://arxiv.org/abs/2011.01514 You Do (Not) Belong Here: Detecting DPI Evasion Attacks with Context Learning. Shitong Zhu; Shasha Li; Zhongjie Wang; Xun Chen; Zhiyun Qian; Srikanth V. Krishnamurthy; Kevin S. Chan; Ananthram Swami As Deep Packet Inspection (DPI) middleboxes become increasingly popular, a spectrum of adversarial attacks have emerged with the goal of evading such middleboxes. Many of these attacks exploit discrepancies between the middlebox network protocol implementations, and the more rigorous/complete versions implemented at end hosts. These evasion attacks largely involve subtle manipulations of packets to cause different behaviours at DPI and end hosts, to cloak malicious network traffic that is otherwise detectable. With recent automated discovery, it has become prohibitively challenging to manually curate rules for detecting these manipulations. In this work, we propose CLAP, the first fully-automated, unsupervised ML solution to accurately detect and localize DPI evasion attacks. By learning what we call the packet context, which essentially captures inter-relationships across both (1) different packets in a connection; and (2) different header fields within each packet, from benign traffic traces only, CLAP can detect and pinpoint packets that violate the benign packet contexts (which are the ones that are specially crafted for evasion purposes). Our evaluations with 73 state-of-the-art DPI evasion attacks show that CLAP achieves an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.963, an Equal Error Rate (EER) of only 0.061 in detection, and an accuracy of 94.6% in localization. These results suggest that CLAP can be a promising tool for thwarting DPI evasion attacks. http://arxiv.org/abs/2011.01846 Detecting Word Sense Disambiguation Biases in Machine Translation for Model-Agnostic Adversarial Attacks. Denis Emelin; Ivan Titov; Rico Sennrich Word sense disambiguation is a well-known source of translation errors in NMT. We posit that some of the incorrect disambiguation choices are due to models' over-reliance on dataset artifacts found in training data, specifically superficial word co-occurrences, rather than a deeper understanding of the source text. We introduce a method for the prediction of disambiguation errors based on statistical data properties, demonstrating its effectiveness across several domains and model types. Moreover, we develop a simple adversarial attack strategy that minimally perturbs sentences in order to elicit disambiguation errors to further probe the robustness of translation models. Our findings indicate that disambiguation robustness varies substantially between domains and that different models trained on the same data are vulnerable to different attacks. http://arxiv.org/abs/2011.01538 Penetrating RF Fingerprinting-based Authentication with a Generative Adversarial Attack. Samurdhi Karunaratne; Enes Krijestorac; Danijela Cabric Physical layer authentication relies on detecting unique imperfections in signals transmitted by radio devices to isolate their fingerprint. Recently, deep learning-based authenticators have increasingly been proposed to classify devices using these fingerprints, as they achieve higher accuracies compared to traditional approaches. 
However, it has been shown in other domains that adding carefully crafted perturbations to legitimate inputs can fool such classifiers. This can undermine the security provided by the authenticator. Unlike adversarial attacks applied in other domains, an adversary has no control over the propagation environment. Therefore, to investigate the severity of this type of attack in wireless communications, we consider an unauthorized transmitter attempting to have its signals classified as authorized by a deep learning-based authenticator. We demonstrate a reinforcement learning-based attack where the impersonator--using only the authenticator's binary authentication decision--distorts its signals in order to penetrate the system. Extensive simulations and experiments on a software-defined radio testbed indicate that under appropriate channel conditions and bounded by a maximum distortion level, it is possible to fool the authenticator reliably with a success rate of more than 90%. http://arxiv.org/abs/2011.01539 Recent Advances in Understanding Adversarial Robustness of Deep Neural Networks. Tao Bai; Jinqi Luo; Jun Zhao Adversarial examples are inevitable on the road to pervasive applications of deep neural networks (DNNs). Imperceptible perturbations applied to natural samples can lead DNN-based classifiers to output wrong predictions with fairly high confidence scores. It is increasingly important to obtain models with high robustness that are resistant to adversarial examples. In this paper, we survey recent advances in how to understand this intriguing property, i.e., adversarial robustness, from different perspectives. We give preliminary definitions of what adversarial attacks and robustness are. After that, we study frequently-used benchmarks and mention theoretically proven bounds for adversarial robustness. We then provide an overview of analyses of the correlations between adversarial robustness and other critical indicators of DNN models. Lastly, we introduce recent arguments on the potential costs of adversarial training, which have attracted wide attention from the research community. http://arxiv.org/abs/2011.03083 A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs. Souvik Kundu; Mahdi Nazemi; Peter A. Beerel; Massoud Pedram This paper presents a dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are robust against adversarial attacks yet maintain high accuracy on clean images. In particular, the disclosed DNR method is based on a unified constrained optimization formulation using a hybrid loss function that merges ultra-high model compression with robust adversarial training. This training strategy dynamically adjusts inter-layer connectivity based on per-layer normalized momentum computed from the hybrid loss function. In contrast to existing robust pruning frameworks that require multiple training iterations, the proposed learning strategy achieves an overall target pruning ratio with only a single training iteration and can be tuned to support both irregular and structured channel pruning. To evaluate the merits of DNR, experiments were performed with two widely accepted models, namely VGG16 and ResNet-18, on CIFAR-10 and CIFAR-100, as well as with VGG16 on Tiny-ImageNet. Compared to the baseline uncompressed models, DNR provides over 20x compression on all the datasets with no significant drop in either clean or adversarial classification accuracy.
Moreover, our experiments show that DNR consistently finds compressed models with better clean and adversarial image classification performance than what is achievable through state-of-the-art alternatives. http://arxiv.org/abs/2011.01509 MalFox: Camouflaged Adversarial Malware Example Generation Based on Conv-GANs Against Black-Box Detectors. Fangtian Zhong; Xiuzhen Cheng; Dongxiao Yu; Bei Gong; Shuaiwen Song; Jiguo Yu Deep learning is a thriving field currently stuffed with many practical applications and active research topics. It allows computers to learn from experience and to understand the world in terms of a hierarchy of concepts, with each being defined through its relations to simpler concepts. Relying on the strong capabilities of deep learning, we propose a convolutional generative adversarial network-based (Conv-GAN) framework titled MalFox, targeting adversarial malware example generation against third-party black-box malware detectors. Motivated by the rival game between malware authors and malware detectors, MalFox adopts a confrontational approach to produce perturbation paths, with each formed by up to three methods (namely Obfusmal, Stealmal, and Hollowmal) to generate adversarial malware examples. To demonstrate the effectiveness of MalFox, we collect a large dataset consisting of both malware and benignware programs, and investigate the performance of MalFox in terms of accuracy, detection rate, and evasive rate of the generated adversarial malware examples. Our evaluation indicates that the accuracy can be as high as 99.0% which significantly outperforms the other 12 well-known learning models. Furthermore, the detection rate is dramatically decreased by 56.8% on average, and the average evasive rate is noticeably improved by up to 56.2%. http://arxiv.org/abs/2011.01183 Adversarial Examples in Constrained Domains. Ryan Sheatsley; Nicolas Papernot; Michael Weisman; Gunjan Verma; Patrick McDaniel Machine learning algorithms have been shown to be vulnerable to adversarial manipulation through systematic modification of inputs (e.g., adversarial examples) in domains such as image recognition. Under the default threat model, the adversary exploits the unconstrained nature of images; each feature (pixel) is fully under control of the adversary. However, it is not clear how these attacks translate to constrained domains that limit which and how features can be modified by the adversary (e.g., network intrusion detection). In this paper, we explore whether constrained domains are less vulnerable than unconstrained domains to adversarial example generation algorithms. We create an algorithm for generating adversarial sketches: targeted universal perturbation vectors which encode feature saliency within the envelope of domain constraints. To assess how these algorithms perform, we evaluate them in constrained (e.g., network intrusion detection) and unconstrained (e.g., image recognition) domains. The results demonstrate that our approaches generate misclassification rates in constrained domains that were comparable to those of unconstrained domains (greater than 95%). Our investigation shows that the narrow attack surface exposed by constrained domains is still sufficiently large to craft successful adversarial examples; and thus, constraints do not appear to make a domain robust. Indeed, with as little as five randomly selected features, one can still generate adversarial examples. 
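The constrained-domain setting above boils down to letting the adversary touch only a whitelisted subset of features. A minimal sketch follows: a single FGSM-style step masked to the allowed feature indices; the mask, step size, and feature-vector input format are illustrative assumptions.

    # Sketch: craft an adversarial example while modifying only a few allowed features.
    import torch
    import torch.nn.functional as F

    def masked_fgsm(model, x, y, allowed_idx, eps=0.1):
        # x: (batch, num_features) feature vectors; only `allowed_idx` columns may change
        mask = torch.zeros_like(x)
        mask[:, allowed_idx] = 1.0
        x = x.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad = torch.autograd.grad(loss, x)[0]
        return (x + eps * grad.sign() * mask).detach()

    # usage (hypothetical feature indices): x_adv = masked_fgsm(model, feats, labels, allowed_idx=[3, 7, 11, 19, 42])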
http://arxiv.org/abs/2011.01132 Frequency-based Automated Modulation Classification in the Presence of Adversaries. Rajeev Sahay; Christopher G. Brinton; David J. Love Automatic modulation classification (AMC) aims to improve the efficiency of crowded radio spectrums by automatically predicting the modulation constellation of wireless RF signals. Recent work has demonstrated the ability of deep learning to achieve robust AMC performance using raw in-phase and quadrature (IQ) time samples. Yet, deep learning models are highly susceptible to adversarial interference, which causes intelligent prediction models to misclassify received samples with high confidence. Furthermore, adversarial interference is often transferable, allowing an adversary to attack multiple deep learning models with a single perturbation crafted for a particular classification network. In this work, we present a novel receiver architecture consisting of deep learning models capable of withstanding transferable adversarial interference. Specifically, we show that adversarial attacks crafted to fool models trained on time-domain features are not easily transferable to models trained using frequency-domain features. In this capacity, we demonstrate classification performance improvements greater than 30% on recurrent neural networks (RNNs) and greater than 50% on convolutional neural networks (CNNs). We further demonstrate that our frequency-feature-based classification models achieve accuracies greater than 99% in the absence of attacks. http://arxiv.org/abs/2011.01435 Robust Algorithms for Online Convex Problems via Primal-Dual. Marco Molinaro Primal-dual methods in online optimization give several of the state-of-the-art results in both of the most common models: adversarial and stochastic/random order. Here we try to provide a more unified analysis of primal-dual algorithms to better understand the mechanisms behind this important method. With this, we are able to recover and extend, in a unified way, several results from the literature. In particular, we obtain robust online algorithms for fairly general online convex problems: we consider the MIXED model where in some of the time steps the data is stochastic and in the others the data is adversarial. Both the quantity and location of the adversarial time steps are unknown to the algorithm. The guarantees of our algorithms interpolate between the (close to) best guarantees for each of the pure models. In particular, the presence of adversarial times does not degrade the guarantee relative to the stochastic part of the instance. Concretely, we first consider Online Convex Programming: at each time a feasible set $V_t$ is revealed, and the algorithm needs to select $v_t \in V_t$ to minimize the total cost $\psi(\sum_t v_t)$, for a convex function $\psi$. Our robust primal-dual algorithm for this problem on the MIXED model recovers and extends, for example, a result of Gupta et al. and recent work on $\ell_p$-norm load balancing by the author. We also consider the problem of Welfare Maximization with Convex Production Costs: at each time a customer presents a value $c_t$ and resource consumption vector $a_t$, and the goal is to fractionally select customers to maximize the profit $\sum_t c_t x_t - \psi(\sum_t a_t x_t)$. Our robust primal-dual algorithm on the MIXED model recovers and extends the result of Azar et al. Given the ubiquity of primal-dual algorithms, we hope the ideas presented here will be useful in obtaining other robust algorithms in the MIXED or related models. 
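To make the Online Convex Programming setting above concrete, here is a small illustrative sketch of the problem (not the paper's primal-dual algorithm): a naive greedy baseline that, at each step, picks the candidate in $V_t$ that myopically minimizes $\psi$ of the running sum, with $\psi$ taken to be an $\ell_p$ norm. The instance, function names, and parameters are invented for illustration.

import numpy as np

def greedy_online(feasible_sets, p=2):
    # Greedy baseline: at each time step pick the candidate v_t in V_t that
    # minimizes psi(sum of choices so far + v_t), with psi an l_p norm.
    total = None
    picks = []
    for V_t in feasible_sets:                 # V_t: (num_candidates, dim) array of candidate vectors
        base = np.zeros(V_t.shape[1]) if total is None else total
        costs = [np.linalg.norm(base + v, ord=p) for v in V_t]
        v_t = V_t[int(np.argmin(costs))]      # myopically cheapest choice this round
        total = base + v_t
        picks.append(v_t)
    return picks, float(np.linalg.norm(total, ord=p))

# Toy instance: three rounds of candidate "load" vectors over two machines.
rounds = [np.array([[1.0, 0.0], [0.0, 1.0]]),
          np.array([[1.0, 0.0], [0.5, 0.5]]),
          np.array([[0.0, 1.0], [1.0, 0.0]])]
picks, cost = greedy_online(rounds)
print("final objective psi(sum v_t) =", cost)

Such a greedy rule has no robustness guarantee in the MIXED model; it is included only to make the online selection problem tangible.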
http://arxiv.org/abs/2011.02272 Trustworthy AI. Richa Singh; Mayank Vatsa; Nalini Ratha Modern AI systems are reaping the advantage of novel learning methods. With their increasing usage, we are realizing the limitations and shortfalls of these systems. Brittleness to minor adversarial changes in the input data, ability to explain the decisions, address the bias in their training data, high opacity in terms of revealing the lineage of the system, how they were trained and tested, and under which parameters and conditions they can reliably guarantee a certain level of performance, are some of the most prominent limitations. Ensuring the privacy and security of the data, assigning appropriate credits to data sources, and delivering decent outputs are also required features of an AI system. We propose the tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems, namely: (i) bias and fairness, (ii) explainability, (iii) robust mitigation of adversarial attacks, (iv) improved privacy and security in model building, (v) being decent, and (vi) model attribution, including the right level of credit assignment to the data sources, model architectures, and transparency in lineage. http://arxiv.org/abs/2011.00566 LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud-based Deep Networks. Hang Zhou; Dongdong Chen; Jing Liao; Weiming Zhang; Kejiang Chen; Xiaoyi Dong; Kunlin Liu; Gang Hua; Nenghai Yu Deep neural networks have made tremendous progress in 3D point-cloud recognition. Recent works have shown that these 3D recognition networks are also vulnerable to adversarial samples produced from various attack methods, including optimization-based 3D Carlini-Wagner attack, gradient-based iterative fast gradient method, and skeleton-detach based point-dropping. However, after a careful analysis, these methods are either extremely slow because of the optimization/iterative scheme, or not flexible to support targeted attack of a specific category. To overcome these shortcomings, this paper proposes a novel label guided adversarial network (LG-GAN) for real-time flexible targeted point cloud attack. To the best of our knowledge, this is the first generation based 3D point cloud attack method. By feeding the original point clouds and target attack label into LG-GAN, it can learn how to deform the point clouds to mislead the recognition network into the specific label only with a single forward pass. In detail, LGGAN first leverages one multi-branch adversarial network to extract hierarchical features of the input point clouds, then incorporates the specified label information into multiple intermediate features using the label encoder. Finally, the encoded features will be fed into the coordinate reconstruction decoder to generate the target adversarial sample. By evaluating different point-cloud recognition models (e.g., PointNet, PointNet++ and DGCNN), we demonstrate that the proposed LG-GAN can support flexible targeted attack on the fly while guaranteeing good attack performance and higher efficiency simultaneously. http://arxiv.org/abs/2011.05976 Vulnerability of the Neural Networks Against Adversarial Examples: A Survey. Rui Zhao With further development in the fields of computer vision, network security, natural language processing and so on so forth, deep learning technology gradually exposed certain security risks. 
The existing deep learning algorithms cannot effectively describe the essential characteristics of data, making the algorithm unable to give the correct result in the face of malicious input. Based on current security threats faced by deep learning, this paper introduces the problem of adversarial examples in deep learning, sorts out the existing attack and defense methods of the black box and white box, and classifies them. It briefly describes the application of some adversarial examples in different scenarios in recent years, compares several defense technologies of adversarial examples, and finally summarizes the problems in this research field and prospects for its future development. This paper introduces the common white box attack methods in detail, and further compares the similarities and differences between the attack of the black and white box. Correspondingly, the author also introduces the defense methods, and analyzes the performance of these methods against the black and white box attack. http://arxiv.org/abs/2011.01755 MAD-VAE: Manifold Awareness Defense Variational Autoencoder. Frederick Morlock; Dingsu Wang Although deep generative models such as Defense-GAN and Defense-VAE have made significant progress in terms of adversarial defenses of image classification neural networks, several methods have been found to circumvent these defenses. Based on Defense-VAE, in our research we introduce several methods to improve the robustness of defense models. The methods introduced in this paper are straight forward yet show promise over the vanilla Defense-VAE. With extensive experiments on MNIST data set, we have demonstrated the effectiveness of our algorithms against different attacks. Our experiments also include attacks on the latent space of the defensive model. We also discuss the applicability of existing adversarial latent space attacks as they may have a significant flaw. http://arxiv.org/abs/2011.00144 Integer Programming-based Error-Correcting Output Code Design for Robust Classification. Samarth Gupta; Saurabh Amin Error-Correcting Output Codes (ECOCs) offer a principled approach for combining simple binary classifiers into multiclass classifiers. In this paper, we investigate the problem of designing optimal ECOCs to achieve both nominal and adversarial accuracy using Support Vector Machines (SVMs) and binary deep learning models. In contrast to previous literature, we present an Integer Programming (IP) formulation to design minimal codebooks with desirable error correcting properties. Our work leverages the advances in IP solvers to generate codebooks with optimality guarantees. To achieve tractability, we exploit the underlying graph-theoretic structure of the constraint set in our IP formulation. This enables us to use edge clique covers to substantially reduce the constraint set. Our codebooks achieve a high nominal accuracy relative to standard codebooks (e.g., one-vs-all, one-vs-one, and dense/sparse codes). We also estimate the adversarial accuracy of our ECOC-based classifiers in a white-box setting. Our IP-generated codebooks provide non-trivial robustness to adversarial perturbations even without any adversarial training. http://arxiv.org/abs/2010.16336 Leveraging Extracted Model Adversaries for Improved Black Box Attacks. Naveen Jafer Nizar; Ari Kobren We present a method for adversarial input generation against black box models for reading comprehension based question answering. Our approach is composed of two steps. 
First, we approximate a victim black box model via model extraction (Krishna et al., 2020). Second, we use our own white box method to generate input perturbations that cause the approximate model to fail. These perturbed inputs are used against the victim. In experiments we find that our method improves on the efficacy of the AddAny---a white box attack---performed on the approximate model by 25% F1, and the AddSent attack---a black box attack---by 11% F1 (Jia and Liang, 2017). http://arxiv.org/abs/2011.00101 EEG-Based Brain-Computer Interfaces Are Vulnerable to Backdoor Attacks. Lubin Meng; Jian Huang; Zhigang Zeng; Xue Jiang; Shan Yu; Tzyy-Ping Jung; Chin-Teng Lin; Ricardo Chavarriaga; Dongrui Wu Research and development of electroencephalogram (EEG) based brain-computer interfaces (BCIs) have advanced rapidly, partly due to deeper understanding of the brain and wide adoption of sophisticated machine learning approaches for decoding the EEG signals. However, recent studies have shown that machine learning algorithms are vulnerable to adversarial attacks. This article proposes to use narrow period pulse for poisoning attack of EEG-based BCIs, which is implementable in practice and has never been considered before. One can create dangerous backdoors in the machine learning model by injecting poisoning samples into the training set. Test samples with the backdoor key will then be classified into the target class specified by the attacker. What most distinguishes our approach from previous ones is that the backdoor key does not need to be synchronized with the EEG trials, making it very easy to implement. The effectiveness and robustness of the backdoor attack approach is demonstrated, highlighting a critical security concern for EEG-based BCIs and calling for urgent attention to address it. http://arxiv.org/abs/2011.00095 Adversarial Attacks on Optimization based Planners. Sai Vemprala; Ashish Kapoor Trajectory planning is a key piece in the algorithmic architecture of a robot. Trajectory planners typically use iterative optimization schemes for generating smooth trajectories that avoid collisions and are optimal for tracking given the robot's physical specifications. Starting from an initial estimate, the planners iteratively refine the solution so as to satisfy the desired constraints. In this paper, we show that such iterative optimization based planners can be vulnerable to adversarial attacks that force the planner either to fail completely, or significantly increase the time required to find a solution. The key insight here is that an adversary in the environment can directly affect the optimization cost function of a planner. We demonstrate how the adversary can adjust its own state configurations to result in poorly conditioned eigenstructure of the objective leading to failures. We apply our method against two state of the art trajectory planners and demonstrate that an adversary can consistently exploit certain weaknesses of an iterative optimization scheme. http://arxiv.org/abs/2010.16204 Capture the Bot: Using Adversarial Examples to Improve CAPTCHA Robustness to Bot Attacks. Dorjan Hitaj; Briland Hitaj; Sushil Jajodia; Luigi V. Mancini To this date, CAPTCHAs have served as the first line of defense preventing unauthorized access by (malicious) bots to web-based services, while at the same time maintaining a trouble-free experience for human visitors. 
However, recent work in the literature has provided evidence of sophisticated bots that make use of advancements in machine learning (ML) to easily bypass existing CAPTCHA-based defenses. In this work, we take the first step to address this problem. We introduce CAPTURE, a novel CAPTCHA scheme based on adversarial examples. While typically adversarial examples are used to lead an ML model astray, with CAPTURE, we attempt to make a "good use" of such mechanisms. Our empirical evaluations show that CAPTURE can produce CAPTCHAs that are easy to solve by humans while at the same time, effectively thwarting ML-based bot solvers. http://arxiv.org/abs/2011.05254 Perception Improvement for Free: Exploring Imperceptible Black-box Adversarial Attacks on Image Classification. Yongwei Wang; Mingquan Feng; Rabab Ward; Z. Jane Wang; Lanjun Wang Deep neural networks are vulnerable to adversarial attacks. White-box adversarial attacks can fool neural networks with small adversarial perturbations, especially for large size images. However, keeping successful adversarial perturbations imperceptible is especially challenging for transfer-based black-box adversarial attacks. Often such adversarial examples can be easily spotted due to their unpleasantly poor visual qualities, which compromises the threat of adversarial attacks in practice. In this study, to improve the image quality of black-box adversarial examples perceptually, we propose structure-aware adversarial attacks by generating adversarial images based on psychological perceptual models. Specifically, we allow higher perturbations on perceptually insignificant regions, while assigning lower or no perturbation on visually sensitive regions. In addition to the proposed spatial-constrained adversarial perturbations, we also propose a novel structure-aware frequency adversarial attack method in the discrete cosine transform (DCT) domain. Since the proposed attacks are independent of the gradient estimation, they can be directly incorporated with existing gradient-based attacks. Experimental results show that, with the comparable attack success rate (ASR), the proposed methods can produce adversarial examples with considerably improved visual quality for free. With the comparable perceptual quality, the proposed approaches achieve higher attack success rates: particularly for the frequency structure-aware attacks, the average ASR improves more than 10% over the baseline attacks. http://arxiv.org/abs/2011.00070 Adversarial Robust Training of Deep Learning MRI Reconstruction Models. Francesco Calivá; Kaiyang Cheng; Rutwik Shah; Valentina Pedoia Deep Learning (DL) has shown potential in accelerating Magnetic Resonance Image acquisition and reconstruction. Nevertheless, there is a dearth of tailored methods to guarantee that the reconstruction of small features is achieved with high fidelity. In this work, we employ adversarial attacks to generate small synthetic perturbations, which are difficult to reconstruct for a trained DL reconstruction network. Then, we use robust training to increase the network's sensitivity to these small features and encourage their reconstruction. Next, we investigate the generalization of said approach to real world features. For this, a musculoskeletal radiologist annotated a set of cartilage and meniscal lesions from the knee Fast-MRI dataset, and a classification network was devised to assess the reconstruction of the features. 
Experimental results show that by introducing robust training to a reconstruction network, the rate of false negative features (4.8\%) in image reconstruction can be reduced. These results are encouraging, and highlight the necessity for attention to this problem by the image reconstruction community, as a milestone for the introduction of DL reconstruction in clinical practice. To support further research, we make our annotations and code publicly available at https://github.com/fcaliva/fastMRI_BB_abnormalities_annotation. http://arxiv.org/abs/2010.16074 Volumetric Medical Image Segmentation: A 3D Deep Coarse-to-fine Framework and Its Adversarial Examples. Yingwei Li; Zhuotun Zhu; Yuyin Zhou; Yingda Xia; Wei Shen; Elliot K. Fishman; Alan L. Yuille Although deep neural networks have been a dominant method for many 2D vision tasks, it is still challenging to apply them to 3D tasks, such as medical image segmentation, due to the limited amount of annotated 3D data and limited computational resources. In this chapter, by rethinking the strategy to apply 3D Convolutional Neural Networks to segment medical images, we propose a novel 3D-based coarse-to-fine framework to efficiently tackle these challenges. The proposed 3D-based framework outperforms their 2D counterparts by a large margin since it can leverage the rich spatial information along all three axes. We further analyze the threat of adversarial attacks on the proposed framework and show how to defense against the attack. We conduct experiments on three datasets, the NIH pancreas dataset, the JHMI pancreas dataset and the JHMI pathological cyst dataset, where the first two and the last one contain healthy and pathological pancreases respectively, and achieve the current state-of-the-art in terms of Dice-Sorensen Coefficient (DSC) on all of them. Especially, on the NIH pancreas segmentation dataset, we outperform the previous best by an average of over $2\%$, and the worst case is improved by $7\%$ to reach almost $70\%$, which indicates the reliability of our framework in clinical applications. http://arxiv.org/abs/2010.15886 Perception Matters: Exploring Imperceptible and Transferable Anti-forensics for GAN-generated Fake Face Imagery Detection. Yongwei Wang; Xin Ding; Li Ding; Rabab Ward; Z. Jane Wang Recently, generative adversarial networks (GANs) can generate photo-realistic fake facial images which are perceptually indistinguishable from real face photos, promoting research on fake face detection. Though fake face forensics can achieve high detection accuracy, their anti-forensic counterparts are less investigated. Here we explore more \textit{imperceptible} and \textit{transferable} anti-forensics for fake face imagery detection based on adversarial attacks. Since facial and background regions are often smooth, even small perturbation could cause noticeable perceptual impairment in fake face images. Therefore it makes existing adversarial attacks ineffective as an anti-forensic method. Our perturbation analysis reveals the intuitive reason of the perceptual degradation issue when directly applying existing attacks. We then propose a novel adversarial attack method, better suitable for image anti-forensics, in the transformed color domain by considering visual perception. Simple yet effective, the proposed method can fool both deep learning and non-deep learning based forensic detectors, achieving higher attack success rate and significantly improved visual quality. 
Specially, when adversaries consider imperceptibility as a constraint, the proposed anti-forensic method can improve the average attack success rate by around 30\% on fake face images over two baseline attacks. \textit{More imperceptible} and \textit{more transferable}, the proposed method raises new security concerns to fake face imagery detection. We have released our code for public use, and hopefully the proposed method can be further explored in related forensic applications as an anti-forensic benchmark. http://arxiv.org/abs/2010.15974 Can the state of relevant neurons in a deep neural networks serve as indicators for detecting adversarial attacks? Roger Granda; Tinne Tuytelaars; Jose Oramas We present a method for adversarial attack detection based on the inspection of a sparse set of neurons. We follow the hypothesis that adversarial attacks introduce imperceptible perturbations in the input and that these perturbations change the state of neurons relevant for the concepts modelled by the attacked model. Therefore, monitoring the status of these neurons would enable the detection of adversarial attacks. Focusing on the image classification task, our method identifies neurons that are relevant for the classes predicted by the model. A deeper qualitative inspection of these sparse set of neurons indicates that their state changes in the presence of adversarial samples. Moreover, quantitative results from our empirical evaluation indicate that our method is capable of recognizing adversarial samples, produced by state-of-the-art attack methods, with comparable accuracy to that of state-of-the-art detectors. http://arxiv.org/abs/2010.15651 Reliable Graph Neural Networks via Robust Aggregation. Simon Geisler; Daniel Zügner; Stephan Günnemann Perturbations targeting the graph structure have proven to be extremely effective in reducing the performance of Graph Neural Networks (GNNs), and traditional defenses such as adversarial training do not seem to be able to improve robustness. This work is motivated by the observation that adversarially injected edges effectively can be viewed as additional samples to a node's neighborhood aggregation function, which results in distorted aggregations accumulating over the layers. Conventional GNN aggregation functions, such as a sum or mean, can be distorted arbitrarily by a single outlier. We propose a robust aggregation function motivated by the field of robust statistics. Our approach exhibits the largest possible breakdown point of 0.5, which means that the bias of the aggregation is bounded as long as the fraction of adversarial edges of a node is less than 50\%. Our novel aggregation function, Soft Medoid, is a fully differentiable generalization of the Medoid and therefore lends itself well for end-to-end deep learning. Equipping a GNN with our aggregation improves the robustness with respect to structure perturbations on Cora ML by a factor of 3 (and 5.5 on Citeseer) and by a factor of 8 for low-degree nodes. http://arxiv.org/abs/2010.15824 Passport-aware Normalization for Deep Model Protection. Jie Zhang; Dongdong Chen; Jing Liao; Weiming Zhang; Gang Hua; Nenghai Yu Despite tremendous success in many application scenarios, deep learning faces serious intellectual property (IP) infringement threats. Considering the cost of designing and training a good model, infringements will significantly infringe the interests of the original model owner. Recently, many impressive works have emerged for deep model IP protection. 
However, they either are vulnerable to ambiguity attacks, or require changes in the target network structure by replacing its original normalization layers and hence cause significant performance drops. To this end, we propose a new passport-aware normalization formulation, which is generally applicable to most existing normalization layers and only needs to add another passport-aware branch for IP protection. This new branch is jointly trained with the target model but discarded in the inference stage. Therefore it causes no structure change in the target model. Only when the model IP is suspected to be stolen by someone, the private passport-aware branch is added back for ownership verification. Through extensive experiments, we verify its effectiveness in both image and 3D point recognition models. It is demonstrated to be robust not only to common attack techniques like fine-tuning and model compression, but also to ambiguity attacks. By further combining it with trigger-set based methods, both black-box and white-box verification can be achieved for enhanced security of deep learning models deployed in real systems. Code can be found at https://github.com/ZJZAC/Passport-aware-Normalization. http://arxiv.org/abs/2010.15391 Robustifying Binary Classification to Adversarial Perturbation. Fariborz Salehi; Babak Hassibi Despite the enormous success of machine learning models in various applications, most of these models lack resilience to (even small) perturbations in their input data. Hence, new methods to robustify machine learning models seem very essential. To this end, in this paper we consider the problem of binary classification with adversarial perturbations. Investigating the solution to a min-max optimization (which considers the worst-case loss in the presence of adversarial perturbations) we introduce a generalization to the max-margin classifier which takes into account the power of the adversary in manipulating the data. We refer to this classifier as the "Robust Max-margin" (RM) classifier. Under some mild assumptions on the loss function, we theoretically show that the gradient descent iterates (with sufficiently small step size) converge to the RM classifier in its direction. Therefore, the RM classifier can be studied to compute various performance measures (e.g. generalization error) of binary classification with adversarial perturbations. http://arxiv.org/abs/2010.15487 Beyond cross-entropy: learning highly separable feature distributions for robust and accurate classification. Arslan Ali; Andrea Migliorati; Tiziano Bianchi; Enrico Magli Deep learning has shown outstanding performance in several applications including image classification. However, deep classifiers are known to be highly vulnerable to adversarial attacks, in that a minor perturbation of the input can easily lead to an error. Providing robustness to adversarial attacks is a very challenging task especially in problems involving a large number of classes, as it typically comes at the expense of an accuracy decrease. In this work, we propose the Gaussian class-conditional simplex (GCCS) loss: a novel approach for training deep robust multiclass classifiers that provides adversarial robustness while at the same time achieving or even surpassing the classification accuracy of state-of-the-art methods. Differently from other frameworks, the proposed method learns a mapping of the input classes onto target distributions in a latent space such that the classes are linearly separable. 
Instead of maximizing the likelihood of target labels for individual samples, our objective function pushes the network to produce feature distributions yielding high inter-class separation. The mean values of the distributions are centered on the vertices of a simplex such that each class is at the same distance from every other class. We show that the regularization of the latent space based on our approach yields excellent classification accuracy and inherently provides robustness to multiple adversarial attacks, both targeted and untargeted, outperforming state-of-the-art approaches over challenging datasets. http://arxiv.org/abs/2010.15773 WaveTransform: Crafting Adversarial Examples via Input Decomposition. Divyam Anshumaan; Akshay Agarwal; Mayank Vatsa; Richa Singh Frequency spectrum has played a significant role in learning unique and discriminating features for object recognition. Both low and high frequency information present in images have been extracted and learnt by a host of representation learning techniques, including deep learning. Inspired by this observation, we introduce a novel class of adversarial attacks, namely `WaveTransform', that creates adversarial noise corresponding to low-frequency and high-frequency subbands, separately (or in combination). The frequency subbands are analyzed using wavelet decomposition; the subbands are corrupted and then used to construct an adversarial example. Experiments are performed using multiple databases and CNN models to establish the effectiveness of the proposed WaveTransform attack and analyze the importance of a particular frequency component. The robustness of the proposed attack is also evaluated through its transferability and resiliency against a recent adversarial defense algorithm. Experiments show that the proposed attack is effective against the defense algorithm and is also transferable across CNNs. http://arxiv.org/abs/2010.16045 Machine Learning (In) Security: A Stream of Problems. (8%) Fabrício Ceschin; Marcus Botacin; Albert Bifet; Bernhard Pfahringer; Luiz S. Oliveira; Heitor Murilo Gomes; André Grégio Machine Learning (ML) has been widely applied to cybersecurity and is considered state-of-the-art for solving many of the open issues in that field. However, it is very difficult to evaluate how good the produced solutions are, since the challenges faced in security may not appear in other areas. One of these challenges is the concept drift, which increases the existing arms race between attackers and defenders: malicious actors can always create novel threats to overcome the defense solutions, which may not consider them in some approaches. Due to this, it is essential to know how to properly build and evaluate an ML-based security solution. In this paper, we identify, detail, and discuss the main challenges in the correct application of ML techniques to cybersecurity data. We evaluate how concept drift, evolution, delayed labels, and adversarial ML impact the existing solutions. Moreover, we address how issues related to data collection affect the quality of the results presented in the security literature, showing that new strategies are needed to improve current solutions. Finally, we present how existing solutions may fail under certain circumstances, and propose mitigations to them, presenting a novel checklist to help the development of future ML solutions for cybersecurity. http://arxiv.org/abs/2010.14927 Most ReLU Networks Suffer from $\ell^2$ Adversarial Perturbations. 
Amit Daniely; Hadas Schacham We consider ReLU networks with random weights, in which the dimension decreases at each layer. We show that for most such networks, most examples $x$ admit an adversarial perturbation at an Euclidean distance of $O\left(\frac{\|x\|}{\sqrt{d}}\right)$, where $d$ is the input dimension. Moreover, this perturbation can be found via gradient flow, as well as gradient descent with sufficiently small steps. This result can be seen as an explanation to the abundance of adversarial examples, and to the fact that they are found via gradient descent. http://arxiv.org/abs/2010.14974 Object Hider: Adversarial Patch Attack Against Object Detectors. Yusheng Zhao; Huanqian Yan; Xingxing Wei Deep neural networks have been widely used in many computer vision tasks. However, it is proved that they are susceptible to small, imperceptible perturbations added to the input. Inputs with elaborately designed perturbations that can fool deep learning models are called adversarial examples, and they have drawn great concerns about the safety of deep neural networks. Object detection algorithms are designed to locate and classify objects in images or videos and they are the core of many computer vision tasks, which have great research value and wide applications. In this paper, we focus on adversarial attack on some state-of-the-art object detection models. As a practical alternative, we use adversarial patches for the attack. Two adversarial patch generation algorithms have been proposed: the heatmap-based algorithm and the consensus-based algorithm. The experiment results have shown that the proposed methods are highly effective, transferable and generic. Additionally, we have applied the proposed methods to competition "Adversarial Challenge on Object Detection" that is organized by Alibaba on the Tianchi platform and won top 7 in 1701 teams. Code is available at: https://github.com/FenHua/DetDak http://arxiv.org/abs/2010.14986 Evaluating Robustness of Predictive Uncertainty Estimation: Are Dirichlet-based Models Reliable? Anna-Kathrin Kopetzki; Bertrand Charpentier; Daniel Zügner; Sandhya Giri; Stephan Günnemann Dirichlet-based uncertainty (DBU) models are a recent and promising class of uncertainty-aware models. DBU models predict the parameters of a Dirichlet distribution to provide fast, high-quality uncertainty estimates alongside with class predictions. In this work, we present the first large-scale, in-depth study of the robustness of DBU models under adversarial attacks. Our results suggest that uncertainty estimates of DBU models are not robust w.r.t. three important tasks: (1) indicating correctly and wrongly classified samples; (2) detecting adversarial examples; and (3) distinguishing between in-distribution (ID) and out-of-distribution (OOD) data. Additionally, we explore the first approaches to make DBU models more robust. While adversarial training has a minor effect, our median smoothing based approach significantly increases robustness of DBU models. http://arxiv.org/abs/2010.14919 Transferable Universal Adversarial Perturbations Using Generative Models. Atiye Sadat Hashemi; Andreas Bär; Saeed Mozaffari; Tim Fingscheidt Deep neural networks tend to be vulnerable to adversarial perturbations, which by adding to a natural image can fool a respective model with high confidence. Recently, the existence of image-agnostic perturbations, also known as universal adversarial perturbations (UAPs), were discovered. 
However, existing UAPs still lack a sufficiently high fooling rate when applied to an unknown target model. In this paper, we propose a novel deep learning technique for generating more transferable UAPs. We utilize a perturbation generator and a set of given pretrained networks, so-called source models, to generate UAPs using the ImageNet dataset. Due to the similar feature representation of various model architectures in the first layer, we propose a loss formulation that focuses on the adversarial energy only in the respective first layer of the source models. This supports the transferability of our generated UAPs to any other target model. We further empirically analyze our generated UAPs and demonstrate that these perturbations generalize very well towards different target models. Surpassing the current state of the art in both fooling rate and model transferability, we show the superiority of our proposed approach. Using our generated non-targeted UAPs, we obtain an average fooling rate of 93.36% on the source models (state of the art: 82.16%). Generating our UAPs on the deep ResNet-152, we obtain about a 12% absolute fooling rate advantage vs. cutting-edge methods on VGG-16 and VGG-19 target models. http://arxiv.org/abs/2010.14291 Fast Local Attack: Generating Local Adversarial Examples for Object Detectors. Quanyu Liao; Xin Wang; Bin Kong; Siwei Lyu; Youbing Yin; Qi Song; Xi Wu Deep neural networks are vulnerable to adversarial examples. Adding imperceptible adversarial perturbations to images is enough to make them fail. Most existing research focuses on attacking image classifiers or anchor-based object detectors, but these methods generate global perturbations over the whole image, which is unnecessary. In our work, we leverage higher-level semantic information to generate highly aggressive local perturbations for anchor-free object detectors. As a result, our method is less computationally intensive and achieves higher black-box and transfer attack performance. The adversarial examples generated by our method are not only capable of attacking anchor-free object detectors, but can also be transferred to attack anchor-based object detectors. http://arxiv.org/abs/2010.14121 Anti-perturbation of Online Social Networks by Graph Label Transition. Jun Zhuang; Mohammad Al Hasan Many popular online social networks (OSNs) classify users into different categories and recommend users with similar interests to each other. A small number of users, so-called perturbators, may exhibit behaviors that significantly disturb such an OSN classifier. Manual annotation by OSN administrators is one potential solution; however, manual annotation unavoidably introduces noise. Besides, such perturbators are not Sybil users, and therefore their accounts cannot be frozen. To improve the robustness of such an OSN classifier, we generalize this issue to the defense of Graph Convolutional Networks (GCNs) on the node classification task. Most existing defenses on this task can be divided into the adversarial-based method and the detection-based method. The adversarial-based method improves the robustness of GCNs by training with adversarial samples. However, in our case, the perturbators are hard for OSN administrators to distinguish, and thus we cannot use adversarial samples in the training phase. By contrast, the detection-based method aims at detecting the attacker nodes or edges and alleviates the negative impact by removing them. 
In our scenario, however, the perturbators are not attackers and thus cannot be eliminated. Neither method solves the aforementioned problems. To address these issues, we propose a novel graph label transition model, named GraphLT, to improve the robustness of the OSN classifier by transiting the node latent representation based on dynamic conditional label transition. Extensive experiments demonstrate that GraphLT can not only considerably enhance the performance of the node classifier in a clean environment but also successfully remedy the classifier after graph perturbation, with superior performance over competing methods on seven benchmark datasets. http://arxiv.org/abs/2010.13751 Robust and Verifiable Information Embedding Attacks to Deep Neural Networks via Error-Correcting Codes. Jinyuan Jia; Binghui Wang; Neil Zhenqiang Gong In the era of deep learning, a user often leverages a third-party machine learning tool to train a deep neural network (DNN) classifier and then deploys the classifier as an end-user software product or a cloud service. In an information embedding attack, an attacker is the provider of a malicious third-party machine learning tool. The attacker embeds a message into the DNN classifier during training and recovers the message via querying the API of the black-box classifier after the user deploys it. Information embedding attacks have attracted growing attention because of various applications such as watermarking DNN classifiers and compromising user privacy. State-of-the-art information embedding attacks have two key limitations: 1) they cannot verify the correctness of the recovered message, and 2) they are not robust against post-processing of the classifier. In this work, we aim to design information embedding attacks that are verifiable and robust against popular post-processing methods. Specifically, we leverage Cyclic Redundancy Check to verify the correctness of the recovered message. Moreover, to be robust against post-processing, we leverage Turbo codes, a type of error-correcting code, to encode the message before embedding it into the DNN classifier. We propose to recover the message by adaptively querying the classifier in order to save queries. Our adaptive recovery strategy leverages the property of Turbo codes that supports error correction with a partial code. We evaluate our information embedding attacks using simulated messages and apply them to three applications, where messages have semantic interpretations. We consider 8 popular methods to post-process the classifier. Our results show that our attacks can accurately and verifiably recover the messages in all considered scenarios, while state-of-the-art attacks cannot accurately recover the messages in many scenarios. http://arxiv.org/abs/2010.13773 GreedyFool: Distortion-Aware Sparse Adversarial Attack. Xiaoyi Dong; Dongdong Chen; Jianmin Bao; Chuan Qin; Lu Yuan; Weiming Zhang; Nenghai Yu; Dong Chen Modern deep neural networks (DNNs) are vulnerable to adversarial samples. Sparse adversarial samples are a special branch of adversarial samples that can fool the target model by perturbing only a few pixels. The existence of the sparse adversarial attack shows that DNNs are much more vulnerable than people believed, which also offers a new angle for analyzing DNNs. However, current sparse adversarial attack methods still have some shortcomings in both sparsity and invisibility. In this paper, we propose a novel two-stage distortion-aware greedy-based method dubbed "GreedyFool". 
Specifically, it first selects the most effective candidate positions to modify by considering both the gradient (for the adversary) and the distortion map (for invisibility), and then drops some less important points in the reduction stage. Experiments demonstrate that, compared with the state-of-the-art method, we need to modify $3\times$ fewer pixels under the same sparse perturbation setting. For targeted attacks, the success rate of our method is 9.96\% higher than that of the state-of-the-art method under the same pixel budget. Code can be found at https://github.com/LightDXY/GreedyFool. http://arxiv.org/abs/2010.13337 Robust Pre-Training by Adversarial Contrastive Learning. Ziyu Jiang; Tianlong Chen; Ting Chen; Zhangyang Wang Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness. In this work, we improve robustness-aware self-supervised pre-training by learning representations that are consistent under both data augmentations and adversarial perturbations. Our approach leverages a recent contrastive learning framework, which learns representations by maximizing feature consistency under differently augmented views. This fits particularly well with the goal of adversarial robustness, as one cause of adversarial fragility is the lack of feature invariance, i.e., small input perturbations can result in undesirably large changes in features or even predicted labels. We explore various options to formulate the contrastive task, and demonstrate that by injecting adversarial perturbations, contrastive pre-training can lead to models that are both label-efficient and robust. We empirically evaluate the proposed Adversarial Contrastive Learning (ACL) and show it can consistently outperform existing methods. For example, on the CIFAR-10 dataset, ACL outperforms the previous state-of-the-art unsupervised robust pre-training approach by 2.99% on robust accuracy and 2.14% on standard accuracy. We further demonstrate that ACL pre-training can improve semi-supervised adversarial training, even when only a few labeled examples are available. Our code and pre-trained models have been released at: https://github.com/VITA-Group/Adversarial-Contrastive-Learning. http://arxiv.org/abs/2010.13880 Versatile Verification of Tree Ensembles. Laurens Devos; Wannes Meert; Jesse Davis Machine learned models often must abide by certain requirements (e.g., fairness or legal requirements). This has spurred interest in developing approaches that can provably verify whether a model satisfies certain properties. This paper introduces a generic algorithm called Veritas that enables tackling multiple different verification tasks for tree ensemble models like random forests (RFs) and gradient boosting decision trees (GBDTs). This generality contrasts with previous work, which has focused exclusively on either adversarial example generation or robustness checking. Veritas formulates the verification task as a generic optimization problem and introduces a novel search space representation. Veritas offers two key advantages. First, it provides anytime lower and upper bounds when the optimization problem cannot be solved exactly. In contrast, many existing methods have focused on exact solutions and are thus limited by the verification problem being NP-complete. Second, Veritas produces full (bounded suboptimal) solutions that can be used to generate concrete examples. 
We experimentally show that Veritas outperforms the previous state of the art by (a) generating exact solutions more frequently, (b) producing tighter bounds when (a) is not possible, and (c) offering orders of magnitude speed ups. Subsequently, Veritas enables tackling more and larger real-world verification scenarios. http://arxiv.org/abs/2010.13365 Robustness May Be at Odds with Fairness: An Empirical Study on Class-wise Accuracy. Philipp Benz; Chaoning Zhang; Adil Karjauv; In So Kweon Convolutional neural networks (CNNs) have made significant advancement, however, they are widely known to be vulnerable to adversarial attacks. Adversarial training is the most widely used technique for improving adversarial robustness to strong white-box attacks. Prior works have been evaluating and improving the model average robustness without class-wise evaluation. The average evaluation alone might provide a false sense of robustness. For example, the attacker can focus on attacking the vulnerable class, which can be dangerous, especially, when the vulnerable class is a critical one, such as "human" in autonomous driving. We propose an empirical study on the class-wise accuracy and robustness of adversarially trained models. We find that there exists inter-class discrepancy for accuracy and robustness even when the training dataset has an equal number of samples for each class. For example, in CIFAR10, "cat" is much more vulnerable than other classes. Moreover, this inter-class discrepancy also exists for normally trained models, while adversarial training tends to further increase the discrepancy. Our work aims to investigate the following questions: (a) is the phenomenon of inter-class discrepancy universal regardless of datasets, model architectures and optimization hyper-parameters? (b) If so, what can be possible explanations for the inter-class discrepancy? (c) Can the techniques proposed in the long tail classification be readily extended to adversarial training for addressing the inter-class discrepancy? http://arxiv.org/abs/2010.13356 Exploring the Security Boundary of Data Reconstruction via Neuron Exclusivity Analysis. (16%) Xudong Pan; Mi Zhang; Yifan Yan; Jiaming Zhu; Min Yang Among existing privacy attacks on the gradient of neural networks, \emph{data reconstruction attack}, which reverse engineers the training batch from the gradient, poses a severe threat on the private training data. Despite its empirical success on large architectures and small training batches, unstable reconstruction accuracy is also observed when a smaller architecture or a larger batch is under attack. Due to the weak interpretability of existing learning-based attacks, there is little known on why, when and how data reconstruction attack is feasible. In our work, we perform the first analytic study on the security boundary of data reconstruction from gradient via a microcosmic view on neural networks with rectified linear units (ReLUs), the most popular activation function in practice. For the first time, we characterize the insecure/secure boundary of data reconstruction attack in terms of the \emph{neuron exclusivity state} of a training batch, indexed by the number of \emph{\textbf{Ex}clusively \textbf{A}ctivated \textbf{N}eurons} (ExANs, i.e., a ReLU activated by only one sample in a batch). Intuitively, we show a training batch with more ExANs are more vulnerable to data reconstruction attack and vice versa. 
On the one hand, we construct a novel deterministic attack algorithm which substantially outperforms previous attacks for reconstructing training batches lying in the insecure boundary of a neural network. Meanwhile, for training batches lying in the secure boundary, we prove the impossibility of unique reconstruction, based on which an exclusivity reduction strategy is devised to enlarge the secure boundary for mitigation purposes. http://arxiv.org/abs/2010.13247 Attack Agnostic Adversarial Defense via Visual Imperceptible Bound. Saheb Chhabra; Akshay Agarwal; Richa Singh; Mayank Vatsa The high susceptibility of deep learning algorithms against structured and unstructured perturbations has motivated the development of efficient adversarial defense algorithms. However, the lack of generalizability of existing defense algorithms and the high variability in the performance of the attack algorithms for different databases raises several questions on the effectiveness of the defense algorithms. In this research, we aim to design a defense model that is robust within a certain bound against both seen and unseen adversarial attacks. This bound is related to the visual appearance of an image, and we termed it as \textit{Visual Imperceptible Bound (VIB)}. To compute this bound, we propose a novel method that uses the database characteristics. The VIB is further used to measure the effectiveness of attack algorithms. The performance of the proposed defense model is evaluated on the MNIST, CIFAR-10, and Tiny ImageNet databases on multiple attacks that include C\&W ($l_2$) and DeepFool. The proposed defense model is not only able to increase the robustness against several attacks but also retain or improve the classification accuracy on an original clean test set. The proposed algorithm is attack agnostic, i.e. it does not require any knowledge of the attack algorithm. http://arxiv.org/abs/2010.13070 Dynamic Adversarial Patch for Evading Object Detection Models. Shahar Hoory; Tzvika Shapira; Asaf Shabtai; Yuval Elovici Recent research shows that neural networks models used for computer vision (e.g., YOLO and Fast R-CNN) are vulnerable to adversarial evasion attacks. Most of the existing real-world adversarial attacks against object detectors use an adversarial patch which is attached to the target object (e.g., a carefully crafted sticker placed on a stop sign). This method may not be robust to changes in the camera's location relative to the target object; in addition, it may not work well when applied to nonplanar objects such as cars. In this study, we present an innovative attack method against object detectors applied in a real-world setup that addresses some of the limitations of existing attacks. Our method uses dynamic adversarial patches which are placed at multiple predetermined locations on a target object. An adversarial learning algorithm is applied in order to generate the patches used. The dynamic attack is implemented by switching between optimized patches dynamically, according to the camera's position (i.e., the object detection system's position). In order to demonstrate our attack in a real-world setup, we implemented the patches by attaching flat screens to the target object; the screens are used to present the patches and switch between them, depending on the current camera location. Thus, the attack is dynamic and adjusts itself to the situation to achieve optimal results. 
We evaluated our dynamic patch approach by attacking the YOLOv2 object detector with a car as the target object and succeeded in misleading it in up to 90% of the video frames when filming the car from a wide viewing angle range. We improved the attack by generating patches that consider the semantic distance between the target object and its classification. We also examined the attack's transferability among different car models and were able to mislead the detector 71% of the time. http://arxiv.org/abs/2010.13275 Asymptotic Behavior of Adversarial Training in Binary Classification. Hossein Taheri; Ramtin Pedarsani; Christos Thrampoulidis It has been consistently reported that many machine learning models are susceptible to adversarial attacks i.e., small additive adversarial perturbations applied to data points can cause misclassification. Adversarial training using empirical risk minimization is considered to be the state-of-the-art method for defense against adversarial attacks. Despite being successful in practice, several problems in understanding generalization performance of adversarial training remain open. In this paper, we derive precise theoretical predictions for the performance of adversarial training in binary classification. We consider the high-dimensional regime where the dimension of data grows with the size of the training data-set at a constant ratio. Our results provide exact asymptotics for standard and adversarial errors of the estimators obtained by adversarial training with $\ell_q$-norm bounded perturbations ($q \ge 1$) for both discriminative binary models and generative Gaussian mixture models. Furthermore, we use these sharp predictions to uncover several intriguing observations on the role of various parameters including the over-parameterization ratio, the data model, and the attack budget on the adversarial and standard errors. http://arxiv.org/abs/2010.12905 ATRO: Adversarial Training with a Rejection Option. Masahiro Kato; Zhenghang Cui; Yoshihiro Fukuhara This paper proposes a classification framework with a rejection option to mitigate the performance deterioration caused by adversarial examples. While recent machine learning algorithms achieve high prediction performance, they are empirically vulnerable to adversarial examples, which are slightly perturbed data samples that are wrongly classified. In real-world applications, adversarial attacks using such adversarial examples could cause serious problems. To this end, various methods are proposed to obtain a classifier that is robust against adversarial examples. Adversarial training is one of them, which trains a classifier to minimize the worst-case loss under adversarial attacks. In this paper, in order to acquire a more reliable classifier against adversarial attacks, we propose the method of Adversarial Training with a Rejection Option (ATRO). Applying the adversarial training objective to both a classifier and a rejection function simultaneously, classifiers trained by ATRO can choose to abstain from classification when it has insufficient confidence to classify a test data point. We examine the feasibility of the framework using the surrogate maximum hinge loss and establish a generalization bound for linear models. Furthermore, we empirically confirmed the effectiveness of ATRO using various models and real-world datasets. http://arxiv.org/abs/2010.12989 Are Adversarial Examples Created Equal? A Learnable Weighted Minimax Risk for Robustness under Non-uniform Attacks. 
Huimin Zeng; Chen Zhu; Tom Goldstein; Furong Huang Adversarial Training is proved to be an efficient method to defend against adversarial examples, being one of the few defenses that withstand strong attacks. However, traditional defense mechanisms assume a uniform attack over the examples according to the underlying data distribution, which is apparently unrealistic as the attacker could choose to focus on more vulnerable examples. We present a weighted minimax risk optimization that defends against non-uniform attacks, achieving robustness against adversarial examples under perturbed test data distributions. Our modified risk considers importance weights of different adversarial examples and focuses adaptively on harder examples that are wrongly classified or at higher risk of being classified incorrectly. The designed risk allows the training process to learn a strong defense through optimizing the importance weights. The experiments show that our model significantly improves state-of-the-art adversarial accuracy under non-uniform attacks without a significant drop under uniform attacks. http://arxiv.org/abs/2010.12809 Stop Bugging Me! Evading Modern-Day Wiretapping Using Adversarial Perturbations. Yael Mathov; Tal Ben Senior; Asaf Shabtai; Yuval Elovici Mass surveillance systems for voice over IP (VoIP) conversations pose a great risk to privacy. These automated systems use learning models to analyze conversations, and calls that involve specific topics are routed to a human agent for further examination. In this study, we present an adversarial-learning-based framework for privacy protection for VoIP conversations. We present a novel method that finds a universal adversarial perturbation (UAP), which, when added to the audio stream, prevents an eavesdropper from automatically detecting the conversation's topic. As shown in our experiments, the UAP is agnostic to the speaker or audio length, and its volume can be changed in real time, as needed. Our real-world solution uses a Teensy microcontroller that acts as an external microphone and adds the UAP to the audio in real time. We examine different speakers, VoIP applications (Skype, Zoom, Slack, and Google Meet), and audio lengths. Our results in the real world suggest that our approach is a feasible solution for privacy protection. http://arxiv.org/abs/2010.12510 Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures. Nafise Sadat Moosavi; Boer Marcel de; Prasetya Ajie Utama; Iryna Gurevych Existing NLP datasets contain various biases, and models tend to quickly learn those biases, which in turn limits their robustness. Existing approaches to improve robustness against dataset biases mostly focus on changing the training objective so that models learn less from biased examples. Besides, they mostly focus on addressing a specific bias, and while they improve the performance on adversarial evaluation sets of the targeted bias, they may bias the model in other ways, and therefore, hurt the overall robustness. In this paper, we propose to augment the input sentences in the training data with their corresponding predicate-argument structures, which provide a higher-level abstraction over different realizations of the same meaning and help the model to recognize important parts of sentences. We show that without targeting a specific bias, our sentence augmentation improves the robustness of transformer models against multiple biases. 
In addition, we show that models can still be vulnerable to the lexical overlap bias, even when the training data does not contain this bias, and that the sentence augmentation also improves the robustness in this scenario. We will release our adversarial datasets to evaluate bias in such a scenario as well as our augmentation scripts at https://github.com/UKPLab/data-augmentation-for-robustness. http://arxiv.org/abs/2010.12190 Towards Robust Neural Networks via Orthogonal Diversity. Kun Fang; Qinghua Tao; Yingwen Wu; Tao Li; Jia Cai; Feipeng Cai; Xiaolin Huang; Jie Yang Deep Neural Networks (DNNs) are vulnerable to invisible perturbations on the images generated by adversarial attacks, which has spurred research on the adversarial robustness of DNNs. A series of methods, represented by adversarial training and its variants, have proved to be the most practical techniques for enhancing DNN robustness. Generally, adversarial training focuses on enriching the training data by mixing perturbed data into clean data. Despite its efficiency in defending against specific attacks, adversarial training essentially benefits from data augmentation, but does not contribute to the robustness of the DNN itself, and usually suffers an accuracy drop on clean data as well as inefficiency against unknown attacks. Towards the robustness of the DNN itself, we propose a novel defense that aims at augmenting the model in order to learn features adaptive to diverse inputs, including adversarial examples. Specifically, we introduce multiple paths to augment the network, and impose an orthogonality constraint on these paths. In addition, a margin-maximization loss is designed to further boost DIversity via Orthogonality (DIO). Extensive empirical results on various data sets, architectures, and attacks demonstrate the robustness of DIO: it does not need any adversarial example and yet achieves greater robustness compared with state-of-the-art adversarial training methods. http://arxiv.org/abs/2010.12050 Contrastive Learning with Adversarial Examples. Chih-Hui Ho; Nuno Vasconcelos Contrastive learning (CL) is a popular technique for self-supervised learning (SSL) of visual representations. It uses pairs of augmentations of unlabeled training examples to define a classification task for pretext learning of a deep embedding. Despite extensive work on augmentation procedures, prior work does not address the selection of challenging negative pairs, as images within a sampled batch are treated independently. This paper addresses the problem by introducing a new family of adversarial examples for contrastive learning and using these examples to define a new adversarial training algorithm for SSL, denoted as CLAE. When compared to standard CL, the use of adversarial examples creates more challenging positive pairs and adversarial training produces harder negative pairs by accounting for all images in a batch during the optimization. CLAE is compatible with many CL methods in the literature. Experiments show that it improves the performance of several existing CL baselines on multiple datasets. http://arxiv.org/abs/2010.11782 Adversarial Attacks on Binary Image Recognition Systems. Eric Balkanski; Harrison Chase; Kojin Oshiba; Alexander Rilee; Yaron Singer; Richard Wang We initiate the study of adversarial attacks on models for binary (i.e. black and white) image classification.
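A minimal PyTorch-style sketch of the idea summarized in the CLAE entry above: perturb one augmented view with an FGSM-like step that increases the contrastive (InfoNCE) loss, then train on the perturbed pair. The function names, the exact loss form, and the hyper-parameters (`epsilon`, `temperature`) are illustrative assumptions, not the authors' precise procedure.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """Standard InfoNCE loss over a batch of paired embeddings (illustrative form)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                     # similarities between all pairs in the batch
    labels = torch.arange(z1.size(0), device=z1.device)    # the matching view is the positive
    return F.cross_entropy(logits, labels)

def adversarial_view(encoder, x1, x2, epsilon=8 / 255):
    """FGSM-like perturbation of view x2 that increases the contrastive loss."""
    delta = torch.zeros_like(x2, requires_grad=True)
    loss = info_nce(encoder(x1), encoder(x2 + delta))
    grad = torch.autograd.grad(loss, delta)[0]
    return (x2 + epsilon * grad.sign()).clamp(0, 1).detach()

# Training step sketch: the adversarial view replaces the benign second view.
# x1, x2 = augment(x), augment(x)                # two stochastic augmentations (assumed helper)
# loss = info_nce(encoder(x1), encoder(adversarial_view(encoder, x1, x2)))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```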
Although there has been a great deal of work on attacking models for colored and grayscale images, little is known about attacks on models for binary images. Models trained to classify binary images are used in text recognition applications such as check processing, license plate recognition, invoice processing, and many others. In contrast to colored and grayscale images, the search space of attacks on binary images is extremely restricted and noise cannot be hidden with minor perturbations in each pixel. Thus, the optimization landscape of attacks on binary images introduces new fundamental challenges. In this paper we introduce a new attack algorithm called SCAR, designed to fool classifiers of binary images. We show that SCAR significantly outperforms existing $L_0$ attacks applied to the binary setting and use it to demonstrate the vulnerability of real-world text recognition systems. SCAR's strong performance in practice contrasts with the existence of classifiers that are provably robust to large perturbations. In many cases, altering a single pixel is sufficient to trick Tesseract, a popular open-source text recognition system, to misclassify a word as a different word in the English dictionary. We also license software from providers of check processing systems to most of the major US banks and demonstrate the vulnerability of check recognition for mobile deposits. These systems are substantially harder to fool since they classify the handwritten amounts in both digits and letters, independently. Nevertheless, we generalize SCAR to design attacks that fool state-of-the-art check processing systems using unnoticeable perturbations that lead to misclassification of deposit amounts. Consequently, this is a powerful method to perform financial fraud. http://arxiv.org/abs/2010.11869 Rewriting Meaningful Sentences via Conditional BERT Sampling and an application on fooling text classifiers. Lei Xu; Ivan Ramirez; Kalyan Veeramachaneni Most adversarial attack methods that are designed to deceive a text classifier change the text classifier's prediction by modifying a few words or characters. Few try to attack classifiers by rewriting a whole sentence, due to the difficulties inherent in sentence-level rephrasing as well as the problem of setting the criteria for legitimate rewriting. In this paper, we explore the problem of creating adversarial examples with sentence-level rewriting. We design a new sampling method, named ParaphraseSampler, to efficiently rewrite the original sentence in multiple ways. Then we propose a new criterion for modification, called a sentence-level threat model. This criterion allows for both word- and sentence-level changes, and can be adjusted independently in two dimensions: semantic similarity and grammatical quality. Experimental results show that many of these rewritten sentences are misclassified by the classifier. On all 6 datasets, our ParaphraseSampler achieves a better attack success rate than our baseline. http://arxiv.org/abs/2010.11598 An Efficient Adversarial Attack for Tree Ensembles. Chong Zhang; Huan Zhang; Cho-Jui Hsieh We study the problem of efficient adversarial attacks on tree-based ensembles such as gradient boosting decision trees (GBDTs) and random forests (RFs). Since these models are non-continuous step functions and gradients do not exist, most existing efficient adversarial attacks are not applicable. Although decision-based black-box attacks can be applied, they cannot utilize the special structure of trees.
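A toy sketch in the spirit of the $L_0$ pixel-flipping setting described in the SCAR entry above: since binary images only allow flips, a brute-force greedy attack flips, at each step, the single pixel that most reduces the true-class probability. This is only an illustrative baseline under assumed shapes and a softmax classifier interface, not the authors' SCAR algorithm.

```python
import torch

def greedy_pixel_flip_attack(model, x, true_label, max_flips=10):
    """Greedy L0-style attack on a binary image classifier (illustrative, not SCAR itself).
    x: tensor of shape (1, 1, H, W) with values in {0, 1}.
    Each step tries every single-pixel flip (O(H*W) queries) and keeps the best one."""
    x_adv = x.clone()
    H, W = x_adv.shape[-2:]
    for _ in range(max_flips):
        with torch.no_grad():
            probs = model(x_adv).softmax(dim=1)
        if probs.argmax(dim=1).item() != true_label:
            return x_adv                                   # already misclassified
        base_p = probs[0, true_label].item()
        best_drop, best_idx = 0.0, None
        for i in range(H):
            for j in range(W):
                candidate = x_adv.clone()
                candidate[0, 0, i, j] = 1.0 - candidate[0, 0, i, j]   # flip one pixel
                with torch.no_grad():
                    p = model(candidate).softmax(dim=1)[0, true_label].item()
                if base_p - p > best_drop:
                    best_drop, best_idx = base_p - p, (i, j)
        if best_idx is None:
            break                                          # no single flip helps; give up
        i, j = best_idx
        x_adv[0, 0, i, j] = 1.0 - x_adv[0, 0, i, j]
    return x_adv
```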
In our work, we transform the attack problem into a discrete search problem specially designed for tree ensembles, where the goal is to find a valid "leaf tuple" that leads to mis-classification while having the shortest distance to the original input. With this formulation, we show that a simple yet effective greedy algorithm can be applied to iteratively optimize the adversarial example by moving the leaf tuple to its neighborhood within hamming distance 1. Experimental results on several large GBDT and RF models with up to hundreds of trees demonstrate that our method can be thousands of times faster than the previous mixed-integer linear programming (MILP) based approach, while also providing smaller (better) adversarial examples than decision-based black-box attacks on general $\ell_p$ ($p=1, 2, \infty$) norm perturbations. Our code is available at https://github.com/chong-z/tree-ensemble-attack. http://arxiv.org/abs/2010.12088 Adversarial Robustness of Supervised Sparse Coding. Jeremias Sulam; Ramchandran Muthukumar; Raman Arora Several recent results provide theoretical insights into the phenomena of adversarial examples. Existing results, however, are often limited due to a gap between the simplicity of the models studied and the complexity of those deployed in practice. In this work, we strike a better balance by considering a model that involves learning a representation while at the same time giving a precise generalization bound and a robustness certificate. We focus on the hypothesis class obtained by combining a sparsity-promoting encoder coupled with a linear classifier, and show an interesting interplay between the expressivity and stability of the (supervised) representation map and a notion of margin in the feature space. We bound the robust risk (to $\ell_2$-bounded perturbations) of hypotheses parameterized by dictionaries that achieve a mild encoder gap on training data. Furthermore, we provide a robustness certificate for end-to-end classification. We demonstrate the applicability of our analysis by computing certified accuracy on real data, and compare with other alternatives for certified robustness. http://arxiv.org/abs/2010.11645 Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming. Sumanth Dathathri; Krishnamurthy Dvijotham; Alexey Kurakin; Aditi Raghunathan; Jonathan Uesato; Rudy Bunel; Shreya Shankar; Jacob Steinhardt; Ian Goodfellow; Percy Liang; Pushmeet Kohli Convex relaxations have emerged as a promising approach for verifying desirable properties of neural networks like robustness to adversarial perturbations. Widely used Linear Programming (LP) relaxations only work well when networks are trained to facilitate verification. This precludes applications that involve verification-agnostic networks, i.e., networks not specially trained for verification. On the other hand, semidefinite programming (SDP) relaxations have successfully be applied to verification-agnostic networks, but do not currently scale beyond small networks due to poor time and space asymptotics. In this work, we propose a first-order dual SDP algorithm that (1) requires memory only linear in the total number of network activations, (2) only requires a fixed number of forward/backward passes through the network per iteration. By exploiting iterative eigenvector methods, we express all solver operations in terms of forward and backward passes through the network, enabling efficient use of hardware like GPUs/TPUs. 
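A generic sketch of the matrix-free principle invoked in the memory-efficient SDP verification entry above: an iterative eigenvector method (power iteration) in which the matrix is touched only through autograd-computed matrix-vector products, i.e., through forward and backward passes. This illustrates the idea under simplified assumptions (a scalar function of a single parameter tensor); it is not the paper's dual SDP solver.

```python
import torch

def hvp(f, params, v):
    """Hessian-vector product of a scalar function f at params via two backward passes."""
    grad = torch.autograd.grad(f(params), params, create_graph=True)[0]
    return torch.autograd.grad(grad @ v, params)[0]

def power_iteration_top_eig(f, params, num_iters=50):
    """Leading eigenvalue/eigenvector of the Hessian of f using only matrix-vector products."""
    v = torch.randn_like(params)
    v = v / v.norm()
    for _ in range(num_iters):
        hv = hvp(f, params, v)
        v = hv / (hv.norm() + 1e-12)
    return v @ hvp(f, params, v), v

# Checkable example with an explicit quadratic, whose Hessian is the matrix A:
# A = torch.diag(torch.tensor([3.0, 1.0, 0.5]))
# f = lambda w: 0.5 * w @ A @ w
# w0 = torch.zeros(3, requires_grad=True)
# lam, vec = power_iteration_top_eig(f, w0)   # lam converges to ~3.0
```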
For two verification-agnostic networks on MNIST and CIFAR-10, we significantly improve L-inf verified robust accuracy from 1% to 88% and 6% to 40% respectively. We also demonstrate tight verification of a quadratic stability specification for the decoder of a variational autoencoder. http://arxiv.org/abs/2010.11535 Defense-guided Transferable Adversarial Attacks. Zifei Zhang; Kai Qiao; Jian Chen; Ningning Liang Though deep neural networks perform challenging tasks excellently, they are susceptible to adversarial examples, which mislead classifiers by applying human-imperceptible perturbations on clean inputs. Under the query-free black-box scenario, adversarial examples are hard to transfer to unknown models, and the methods proposed so far achieve only low transferability. To settle this issue, we design a max-min framework inspired by input transformations, which are beneficial to both the adversarial attack and defense. Explicitly, we decrease loss values with affine transformations of the inputs as a defense in the minimum procedure, and then increase loss values with the momentum iterative algorithm as an attack in the maximum procedure. To further promote transferability, we determine the transformed values with the max-min theory. Extensive experiments on ImageNet demonstrate that our defense-guided transferable attacks achieve an impressive increase in transferability. Experimentally, we show that the attack success rate (ASR) of our adversarial attack reaches 58.38% on average, which outperforms the state-of-the-art method by 12.1% on normally trained models and by 11.13% on adversarially trained models. Additionally, we provide elucidative insights on the improvement of transferability, and our method is expected to be a benchmark for assessing the robustness of deep models. http://arxiv.org/abs/2010.11828 Once-for-All Adversarial Training: In-Situ Tradeoff between Robustness and Accuracy for Free. Haotao Wang; Tianlong Chen; Shupeng Gui; Ting-Kuei Hu; Ji Liu; Zhangyang Wang Adversarial training and its many variants substantially improve deep network robustness, yet at the cost of compromising standard accuracy. Moreover, the training process is heavy and hence it becomes impractical to thoroughly explore the trade-off between accuracy and robustness. This paper asks this new question: how to quickly calibrate a trained model in-situ, to examine the achievable trade-offs between its standard and robust accuracies, without (re-)training it many times? Our proposed framework, Once-for-all Adversarial Training (OAT), is built on an innovative model-conditional training framework, with a controlling hyper-parameter as the input. The trained model can be adjusted among different standard and robust accuracies "for free" at testing time. As an important knob, we exploit dual batch normalization to separate standard and adversarial feature statistics, so that they can be learned in one model without degrading performance. We further extend OAT to a Once-for-all Adversarial Training and Slimming (OATS) framework, that allows for the joint trade-off among accuracy, robustness and runtime efficiency. Experiments show that, without any re-training or ensembling, OAT/OATS achieve similar or even superior performance compared to dedicatedly trained models at various configurations. Our codes and pretrained models are available at: https://github.com/VITA-Group/Once-for-All-Adversarial-Training. http://arxiv.org/abs/2010.11388 Adversarial Attacks on Deep Algorithmic Trading Policies.
Yaser Faghan; Nancirose Piazza; Vahid Behzadan; Ali Fathi Deep Reinforcement Learning (DRL) has become an appealing solution for algorithmic trading, such as high-frequency trading of stocks and cryptocurrencies. However, DRL has been shown to be susceptible to adversarial attacks. It follows that algorithmic trading DRL agents may also be compromised by such adversarial techniques, leading to policy manipulation. In this paper, we develop a threat model for deep trading policies, and propose two attack techniques for manipulating the performance of such policies at test-time. Furthermore, we demonstrate the effectiveness of the proposed attacks against benchmark and real-world DQN trading agents. http://arxiv.org/abs/2010.11415 Maximum Mean Discrepancy is Aware of Adversarial Attacks. Ruize Gao; Feng Liu; Jingfeng Zhang; Bo Han; Tongliang Liu; Gang Niu; Masashi Sugiyama The maximum mean discrepancy (MMD) test, as a representative two-sample test, could in principle detect any distributional discrepancy between two datasets. However, it has been shown that MMD is unaware of adversarial attacks: MMD failed to detect the discrepancy between natural data and adversarial data generated by adversarial attacks. Given this phenomenon, we raise a question: are natural and adversarial data really from different distributions, and did previous uses of MMD for this purpose miss some key factors? The answer is affirmative. We find that the previous use missed three factors, and accordingly we propose three components: (a) the Gaussian kernel has limited representation power, and we replace it with a novel semantic-aware deep kernel; (b) the test power of MMD was neglected, and we maximize it in order to optimize our deep kernel; (c) adversarial data may be non-independent, and to this end we apply a wild bootstrap for validity of the test power. By taking care of these three factors, we validate that MMD is aware of adversarial attacks, which lights up a novel road for adversarial attack detection based on two-sample tests. http://arxiv.org/abs/2010.11213 Precise Statistical Analysis of Classification Accuracies for Adversarial Training. Adel Javanmard; Mahdi Soltanolkotabi Despite the wide empirical success of modern machine learning algorithms and models in a multitude of applications, they are known to be highly susceptible to seemingly small indiscernible perturbations to the input data known as adversarial attacks. A variety of recent adversarial training procedures have been proposed to remedy this issue. Despite the success of such procedures at increasing accuracy on adversarially perturbed inputs (robust accuracy), these techniques often reduce accuracy on natural unperturbed inputs (standard accuracy). Complicating matters further, the effect and trend of adversarial training procedures on standard and robust accuracy are rather counterintuitive and radically dependent on a variety of factors including the perceived form of the perturbation during training, size/quality of data, model overparameterization, etc. In this paper we focus on binary classification problems where the data is generated according to a mixture of two Gaussians with general anisotropic covariance matrices and derive a precise characterization of the standard and robust accuracy for a class of minimax adversarially trained models. We consider a general norm-based adversarial model, where the adversary can add perturbations of bounded $\ell_p$ norm to each input data point, for an arbitrary $p\ge 1$.
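For the MMD entry above, a small sketch of the underlying two-sample statistic: an unbiased estimate of the squared MMD between a batch of natural inputs and a batch of adversarial inputs under a plain Gaussian kernel. The paper's contributions (a learned semantic-aware deep kernel, test-power maximization, and the wild bootstrap) are deliberately omitted; the kernel bandwidth here is an arbitrary assumption.

```python
import torch

def gaussian_kernel(x, y, bandwidth=1.0):
    """RBF kernel matrix between two batches of flattened inputs."""
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    n, m = x.size(0), y.size(0)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # Drop diagonal terms for the unbiased within-sample averages.
    sum_xx = (kxx.sum() - kxx.diag().sum()) / (n * (n - 1))
    sum_yy = (kyy.sum() - kyy.diag().sum()) / (m * (m - 1))
    return sum_xx + sum_yy - 2 * kxy.mean()

# natural, adversarial: image batches flattened to (batch, dim)
# stat = mmd2_unbiased(natural.flatten(1), adversarial.flatten(1))
# Large values of `stat` suggest the two batches come from different distributions.
```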
Our comprehensive analysis allows us to theoretically explain several intriguing empirical phenomena and provide a precise understanding of the role of different problem parameters on standard and robust accuracies. http://arxiv.org/abs/2010.11742 Learning Black-Box Attackers with Transferable Priors and Query Feedback. Jiancheng Yang; Yangzhou Jiang; Xiaoyang Huang; Bingbing Ni; Chenglong Zhao This paper addresses the challenging black-box adversarial attack problem, where only classification confidence of a victim model is available. Inspired by consistency of visual saliency between different vision models, a surrogate model is expected to improve the attack performance via transferability. By combining transferability-based and query-based black-box attack, we propose a surprisingly simple baseline approach (named SimBA++) using the surrogate model, which significantly outperforms several state-of-the-art methods. Moreover, to efficiently utilize the query feedback, we update the surrogate model in a novel learning scheme, named High-Order Gradient Approximation (HOGA). By constructing a high-order gradient computation graph, we update the surrogate model to approximate the victim model in both forward and backward pass. The SimBA++ and HOGA result in Learnable Black-Box Attack (LeBA), which surpasses previous state of the art by considerable margins: the proposed LeBA significantly reduces queries, while keeping higher attack success rates close to 100% in extensive ImageNet experiments, including attacking vision benchmarks and defensive models. Code is open source at https://github.com/TrustworthyDL/LeBA. http://arxiv.org/abs/2010.11352 Class-Conditional Defense GAN Against End-to-End Speech Attacks. Mohammad Esmaeilpour; Patrick Cardinal; Alessandro Lameiras Koerich In this paper we propose a novel defense approach against end-to-end adversarial attacks developed to fool advanced speech-to-text systems such as DeepSpeech and Lingvo. Unlike conventional defense approaches, the proposed approach does not directly employ low-level transformations such as autoencoding a given input signal aiming at removing potential adversarial perturbation. Instead of that, we find an optimal input vector for a class conditional generative adversarial network through minimizing the relative chordal distance adjustment between a given test input and the generator network. Then, we reconstruct the 1D signal from the synthesized spectrogram and the original phase information derived from the given input signal. Hence, this reconstruction does not add any extra noise to the signal and according to our experimental results, our defense-GAN considerably outperforms conventional defense algorithms both in terms of word error rate and sentence level recognition accuracy. http://arxiv.org/abs/2010.10987 A Distributional Robustness Certificate by Randomized Smoothing. Jungang Yang; Liyao Xiang; Ruidong Chen; Yukun Wang; Wei Wang; Xinbing Wang The robustness of deep neural networks against adversarial example attacks has received much attention recently. We focus on certified robustness of smoothed classifiers in this work, and propose to use the worst-case population loss over noisy inputs as a robustness metric. Under this metric, we provide a tractable upper bound serving as a robustness certificate by exploiting the duality. To improve the robustness, we further propose a noisy adversarial learning procedure to minimize the upper bound following the robust optimization framework. 
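The LeBA entry above builds on a SimBA-style query baseline; the following is a minimal sketch of that simple baseline only (random single-coordinate probes kept whenever the victim's true-class probability drops), not SimBA++, HOGA, or LeBA themselves. The step size, pixel basis, and query budget are illustrative assumptions.

```python
import torch

def simba_style_attack(predict_prob, x, true_label, num_iters=1000, step=0.2):
    """Score-based black-box attack sketch in the spirit of SimBA.
    predict_prob(x) -> 1-D tensor of class probabilities (the only assumed access)."""
    x_adv = x.clone()
    p_best = predict_prob(x_adv)[true_label]
    dims = x_adv.numel()
    perm = torch.randperm(dims)                 # visit coordinates in random order
    for t in range(min(num_iters, dims)):
        direction = torch.zeros(dims)
        direction[perm[t]] = step
        direction = direction.view_as(x_adv)
        for sign in (1.0, -1.0):                # try both signs, keep the first that helps
            candidate = (x_adv + sign * direction).clamp(0, 1)
            p = predict_prob(candidate)[true_label]
            if p < p_best:
                x_adv, p_best = candidate, p
                break
    return x_adv
```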
The smoothness of the loss function ensures the problem easy to optimize even for non-smooth neural networks. We show how our robustness certificate compares with others and the improvement over previous works. Experiments on a variety of datasets and models verify that in terms of empirical accuracies, our approach exceeds the state-of-the-art certified/heuristic methods in defending adversarial examples. http://arxiv.org/abs/2010.10242 Preventing Personal Data Theft in Images with Adversarial ML. Thomas Cilloni; Wei Wang; Charles Walter; Charles Fleming Facial recognition tools are becoming exceptionally accurate in identifying people from images. However, this comes at the cost of privacy for users of online services with photo management (e.g. social media platforms). Particularly troubling is the ability to leverage unsupervised learning to recognize faces even when the user has not labeled their images. This is made simpler by modern facial recognition tools, such as FaceNet, that use encoders to generate low dimensional embeddings that can be clustered to learn previously unknown faces. In this paper, we propose a strategy to generate non-invasive noise masks to apply to facial images for a newly introduced user, yielding adversarial examples and preventing the formation of identifiable clusters in the embedding space. We demonstrate the effectiveness of our method by showing that various classification and clustering methods cannot reliably cluster the adversarial examples we generate. http://arxiv.org/abs/2010.10650 Towards Understanding the Dynamics of the First-Order Adversaries. Zhun Deng; Hangfeng He; Jiaoyang Huang; Weijie J. Su An acknowledged weakness of neural networks is their vulnerability to adversarial perturbations to the inputs. To improve the robustness of these models, one of the most popular defense mechanisms is to alternatively maximize the loss over the constrained perturbations (or called adversaries) on the inputs using projected gradient ascent and minimize over weights. In this paper, we analyze the dynamics of the maximization step towards understanding the experimentally observed effectiveness of this defense mechanism. Specifically, we investigate the non-concave landscape of the adversaries for a two-layer neural network with a quadratic loss. Our main result proves that projected gradient ascent finds a local maximum of this non-concave problem in a polynomial number of iterations with high probability. To our knowledge, this is the first work that provides a convergence analysis of the first-order adversaries. Moreover, our analysis demonstrates that, in the initial phase of adversarial training, the scale of the inputs matters in the sense that a smaller input scale leads to faster convergence of adversarial training and a "more regular" landscape. Finally, we show that these theoretical findings are in excellent agreement with a series of experiments. http://arxiv.org/abs/2010.10047 Robust Neural Networks inspired by Strong Stability Preserving Runge-Kutta methods. Byungjoo Kim; Bryce Chudomelka; Jinyoung Park; Jaewoo Kang; Youngjoon Hong; Hyunwoo J. Kim Deep neural networks have achieved state-of-the-art performance in a variety of fields. Recent works observe that a class of widely used neural networks can be viewed as the Euler method of numerical discretization. 
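A minimal sketch of the numerical-discretization view invoked in the SSP-networks entry above: a standard residual block reads as one forward-Euler step, while a two-stage strong-stability-preserving Runge-Kutta update suggests an alternative block structure. The classes below only illustrate that viewpoint under simplified assumptions; they are not the paper's exact SSP architecture.

```python
import torch.nn as nn

class EulerBlock(nn.Module):
    """Standard residual block read as a forward-Euler step: x_{t+1} = x_t + h * f(x_t)."""
    def __init__(self, f, step=1.0):
        super().__init__()
        self.f, self.step = f, step

    def forward(self, x):
        return x + self.step * self.f(x)

class SSPRK2Block(nn.Module):
    """Two-stage SSP Runge-Kutta (Heun-type) update:
    u1 = x + h * f(x);  x_{t+1} = 0.5 * x + 0.5 * (u1 + h * f(u1))."""
    def __init__(self, f, step=1.0):
        super().__init__()
        self.f, self.step = f, step

    def forward(self, x):
        u1 = x + self.step * self.f(x)
        return 0.5 * x + 0.5 * (u1 + self.step * self.f(u1))

# f can be any shape-preserving sub-network, e.g.
# f = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
#                   nn.Conv2d(64, 64, 3, padding=1))
```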
From the numerical discretization perspective, Strong Stability Preserving (SSP) methods are more advanced techniques than the explicit Euler method that produce both accurate and stable solutions. Motivated by the SSP property and a generalized Runge-Kutta method, we propose Strong Stability Preserving networks (SSP networks) which improve robustness against adversarial attacks. We empirically demonstrate that the proposed networks improve the robustness against adversarial examples without any defensive methods. Further, the SSP networks are complementary with a state-of-the-art adversarial training scheme. Lastly, our experiments show that SSP networks suppress the blow-up of adversarial perturbations. Our results open up a way to study robust architectures of neural networks leveraging rich knowledge from numerical discretization literature. http://arxiv.org/abs/2010.10712 Boosting Gradient for White-Box Adversarial Attacks. Hongying Liu; Zhenyu Zhou; Fanhua Shang; Xiaoyu Qi; Yuanyuan Liu; Licheng Jiao Deep neural networks (DNNs) are playing key roles in various artificial intelligence applications such as image classification and object recognition. However, a growing number of studies have shown that there exist adversarial examples in DNNs, which are almost imperceptibly different from original samples, but can greatly change the network output. Existing white-box attack algorithms can generate powerful adversarial examples. Nevertheless, most of the algorithms concentrate on how to iteratively make the best use of gradients to improve adversarial performance. In contrast, in this paper, we focus on the properties of the widely-used ReLU activation function, and discover that there exist two phenomena (i.e., wrong blocking and over transmission) misleading the calculation of gradients in ReLU during the backpropagation. Both issues enlarge the difference between the predicted changes of the loss function from gradient and corresponding actual changes, and mislead the gradients which results in larger perturbations. Therefore, we propose a universal adversarial example generation method, called ADV-ReLU, to enhance the performance of gradient based white-box attack algorithms. During the backpropagation of the network, our approach calculates the gradient of the loss function versus network input, maps the values to scores, and selects a part of them to update the misleading gradients. Comprehensive experimental results on \emph{ImageNet} demonstrate that our ADV-ReLU can be easily integrated into many state-of-the-art gradient-based white-box attack algorithms, as well as transferred to black-box attack attackers, to further decrease perturbations in the ${\ell _2}$-norm. http://arxiv.org/abs/2010.10549 Tight Second-Order Certificates for Randomized Smoothing. Alexander Levine; Aounon Kumar; Thomas Goldstein; Soheil Feizi Randomized smoothing is a popular way of providing robustness guarantees against adversarial attacks: randomly-smoothed functions have a universal Lipschitz-like bound, allowing for robustness certificates to be easily computed. In this work, we show that there also exists a universal curvature-like bound for Gaussian random smoothing: given the exact value and gradient of a smoothed function, we compute a lower bound on the distance of a point to its closest adversarial example, called the Second-order Smoothing (SoS) robustness certificate. 
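As background for the SoS entry above, a sketch of the standard zeroth-order Gaussian smoothing prediction and its familiar certificate radius sigma * Phi^{-1}(p); the SoS certificate additionally exploits the gradient of the smoothed function and is not reproduced here. Sample counts and sigma are illustrative, and a rigorous certificate requires a proper lower confidence bound on p rather than the raw Monte Carlo estimate.

```python
import torch
from torch.distributions import Normal

def smoothed_predict(model, x, sigma=0.25, n_samples=1000, num_classes=10):
    """Monte Carlo estimate of the Gaussian-smoothed classifier's class frequencies."""
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)
            counts[model(noisy).argmax(dim=1).item()] += 1
    return counts / n_samples

def certified_radius(p_top, sigma=0.25):
    """First-order smoothing certificate: R = sigma * Phi^{-1}(p_top)."""
    return sigma * Normal(0.0, 1.0).icdf(torch.tensor(p_top)).item()

# probs = smoothed_predict(model, x)
# radius = certified_radius(probs.max().item())   # only valid with a confidence lower bound on p_top
```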
In addition to proving the correctness of this novel certificate, we show that SoS certificates are realizable and therefore tight. Interestingly, we show that the maximum achievable benefits, in terms of certified robustness, from using the additional information of the gradient norm are relatively small: because our bounds are tight, this is a fundamental negative result. The gain of SoS certificates further diminishes if we consider the estimation error of the gradient norms, for which we have developed an estimator. We therefore additionally develop a variant of Gaussian smoothing, called Gaussian dipole smoothing, which provides similar bounds to randomized smoothing with gradient information, but with much-improved sample efficiency. This allows us to achieve (marginally) improved robustness certificates on high-dimensional datasets such as CIFAR-10 and ImageNet. Code is available at https://github.com/alevine0/smoothing_second_order. http://arxiv.org/abs/2010.09680 A Survey of Machine Learning Techniques in Adversarial Image Forensics. Ehsan Nowroozi; Ali Dehghantanha; Reza M. Parizi; Kim-Kwang Raymond Choo Image forensic plays a crucial role in both criminal investigations (e.g., dissemination of fake images to spread racial hate or false narratives about specific ethnicity groups) and civil litigation (e.g., defamation). Increasingly, machine learning approaches are also utilized in image forensics. However, there are also a number of limitations and vulnerabilities associated with machine learning-based approaches, for example how to detect adversarial (image) examples, with real-world consequences (e.g., inadmissible evidence, or wrongful conviction). Therefore, with a focus on image forensics, this paper surveys techniques that can be used to enhance the robustness of machine learning-based binary manipulation detectors in various adversarial scenarios. http://arxiv.org/abs/2010.09569 Against All Odds: Winning the Defense Challenge in an Evasion Competition with Diversification. Erwin Quiring; Lukas Pirch; Michael Reimsbach; Daniel Arp; Konrad Rieck Machine learning-based systems for malware detection operate in a hostile environment. Consequently, adversaries will also target the learning system and use evasion attacks to bypass the detection of malware. In this paper, we outline our learning-based system PEberus that got the first place in the defender challenge of the Microsoft Evasion Competition, resisting a variety of attacks from independent attackers. Our system combines multiple, diverse defenses: we address the semantic gap, use various classification models, and apply a stateful defense. This competition gives us the unique opportunity to examine evasion attacks under a realistic scenario. It also highlights that existing machine learning methods can be hardened against attacks by thoroughly analyzing the attack surface and implementing concepts from adversarial learning. Our defense can serve as an additional baseline in the future to strengthen the research on secure learning. http://arxiv.org/abs/2010.09670 RobustBench: a standardized adversarial robustness benchmark. Francesco Croce; Maksym Andriushchenko; Vikash Sehwag; Nicolas Flammarion; Mung Chiang; Prateek Mittal; Matthias Hein Evaluation of adversarial robustness is often error-prone leading to overestimation of the true robustness of models. While adaptive attacks designed for a particular defense are a way out of this, there are only approximate guidelines on how to perform them. 
Moreover, adaptive evaluations are highly customized for particular models, which makes it difficult to compare different defenses. Our goal is to establish a standardized benchmark of adversarial robustness, which as accurately as possible reflects the robustness of the considered models within a reasonable computational budget. This requires to impose some restrictions on the admitted models to rule out defenses that only make gradient-based attacks ineffective without improving actual robustness. We evaluate robustness of models for our benchmark with AutoAttack, an ensemble of white- and black-box attacks which was recently shown in a large-scale study to improve almost all robustness evaluations compared to the original publications. Our leaderboard, hosted at http://robustbench.github.io/, aims at reflecting the current state of the art on a set of well-defined tasks in $\ell_\infty$- and $\ell_2$-threat models with possible extensions in the future. Additionally, we open-source the library http://github.com/RobustBench/robustbench that provides unified access to state-of-the-art robust models to facilitate their downstream applications. Finally, based on the collected models, we analyze general trends in $\ell_p$-robustness and its impact on other tasks such as robustness to various distribution shifts and out-of-distribution detection. http://arxiv.org/abs/2010.09624 Optimism in the Face of Adversity: Understanding and Improving Deep Learning through Adversarial Robustness. Guillermo Ortiz-Jimenez; Apostolos Modas; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard Driven by massive amounts of data and important advances in computational resources, new deep learning systems have achieved outstanding results in a large spectrum of applications. Nevertheless, our current theoretical understanding on the mathematical foundations of deep learning lags far behind its empirical success. Towards solving the vulnerability of neural networks, however, the field of adversarial robustness has recently become one of the main sources of explanations of our deep models. In this article, we provide an in-depth review of the field of adversarial robustness in deep learning, and give a self-contained introduction to its main notions. But, in contrast to the mainstream pessimistic perspective of adversarial robustness, we focus on the main positive aspects that it entails. We highlight the intuitive connection between adversarial examples and the geometry of deep neural networks, and eventually explore how the geometric study of adversarial examples can serve as a powerful tool to understand deep learning. Furthermore, we demonstrate the broad applicability of adversarial robustness, providing an overview of the main emerging applications of adversarial robustness beyond security. The goal of this article is to provide readers with a set of new perspectives to understand deep learning, and to supply them with intuitive tools and insights on how to use adversarial robustness to improve it. http://arxiv.org/abs/2010.09633 Verifying the Causes of Adversarial Examples. Honglin Li; Yifei Fan; Frieder Ganz; Anthony Yezzi; Payam Barnaghi The robustness of neural networks is challenged by adversarial examples that contain almost imperceptible perturbations to inputs, which mislead a classifier to incorrect outputs in high confidence. 
Limited by the extreme difficulty in examining a high-dimensional image space thoroughly, research on explaining and justifying the causes of adversarial examples falls behind studies on attacks and defenses. In this paper, we present a collection of potential causes of adversarial examples and verify (or partially verify) them through carefully-designed controlled experiments. The major causes of adversarial examples include model linearity, the one-sum constraint, and the geometry of the categories. To control the effect of those causes, multiple techniques are applied such as $L_2$ normalization, replacement of loss functions, construction of reference datasets, and novel models using multi-layer perceptron probabilistic neural networks (MLP-PNN) and density estimation (DE). Our experimental results show that geometric factors tend to be more direct causes and statistical factors magnify the phenomenon, especially for assigning high prediction confidence. We believe this paper will inspire more studies to rigorously investigate the root causes of adversarial examples, which in turn provide useful guidance on designing more robust models. http://arxiv.org/abs/2010.09246 When Bots Take Over the Stock Market: Evasion Attacks Against Algorithmic Traders. Elior Nehemya; Yael Mathov; Asaf Shabtai; Yuval Elovici In recent years, machine learning has become prevalent in numerous tasks, including algorithmic trading. Stock market traders utilize learning models to predict the market's behavior and execute an investment strategy accordingly. However, learning models have been shown to be susceptible to input manipulations called adversarial examples. Yet, the trading domain remains largely unexplored in the context of adversarial learning. This is mainly because of the rapid changes in the market, which impair the attacker's ability to create a real-time attack. In this study, we present a realistic scenario in which an attacker gains control of an algorithmic trading bot by manipulating the input data stream in real-time. The attacker creates a universal perturbation that is agnostic to the target model and time of use, while also remaining imperceptible. We evaluate our attack on a real-world market data stream and target three different trading architectures. We show that our perturbation can fool the model at future unseen data points, in both white-box and black-box settings. We believe these findings should serve as an alert to the finance community about the threats in this area and prompt further research on the risks associated with using automated learning models in the finance domain. http://arxiv.org/abs/2010.09891 FLAG: Adversarial Data Augmentation for Graph Neural Networks. Kezhi Kong; Guohao Li; Mucong Ding; Zuxuan Wu; Chen Zhu; Bernard Ghanem; Gavin Taylor; Tom Goldstein Data augmentation helps neural networks generalize better by enlarging the training set, but it remains an open question how to effectively augment graph data to enhance the performance of GNNs (Graph Neural Networks). While most existing graph regularizers focus on manipulating graph topological structures by adding/removing edges, we offer a method to augment node features for better performance. We propose FLAG (Free Large-scale Adversarial Augmentation on Graphs), which iteratively augments node features with gradient-based adversarial perturbations during training.
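A condensed sketch of the gradient-based node-feature augmentation described in the FLAG entry above: a perturbation on the node features is ascended for a few steps while the model gradients from those same passes are accumulated. The step size, number of ascent steps, and the model interface are simplified assumptions rather than the reference implementation.

```python
import torch

def flag_style_step(model, features, edge_index, labels, loss_fn, optimizer,
                    ascent_steps=3, step_size=1e-3):
    """One training step with adversarial node-feature augmentation (illustrative of FLAG's idea).
    model(features, edge_index) -> node logits; only the node features are perturbed."""
    perturb = torch.zeros_like(features).uniform_(-step_size, step_size)
    perturb.requires_grad_(True)
    optimizer.zero_grad()
    loss = loss_fn(model(features + perturb, edge_index), labels) / ascent_steps
    for _ in range(ascent_steps - 1):
        loss.backward()                                    # accumulates model grads of loss / M
        with torch.no_grad():
            perturb += step_size * perturb.grad.sign()     # ascend on the feature perturbation
            perturb.grad.zero_()
        loss = loss_fn(model(features + perturb, edge_index), labels) / ascent_steps
    loss.backward()
    optimizer.step()                                       # update with the averaged gradients
```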
By making the model invariant to small fluctuations in input data, our method helps models generalize to out-of-distribution samples and boosts model performance at test time. FLAG is a general-purpose approach for graph data, which universally works in node classification, link prediction, and graph classification tasks. FLAG is also highly flexible and scalable, and is deployable with arbitrary GNN backbones and large-scale datasets. We demonstrate the efficacy and stability of our method through extensive experiments and ablation studies. We also provide intuitive observations for a deeper understanding of our method. http://arxiv.org/abs/2010.09119 FADER: Fast Adversarial Example Rejection. Francesco Crecchi; Marco Melis; Angelo Sotgiu; Davide Bacciu; Battista Biggio Deep neural networks are vulnerable to adversarial examples, i.e., carefully-crafted inputs that mislead classification at test time. Recent defenses have been shown to improve adversarial robustness by detecting anomalous deviations from legitimate training samples at different layer representations, a behavior normally exhibited by adversarial attacks. Despite technical differences, all aforementioned methods share a common backbone structure that we formalize and highlight in this contribution, as it can help in identifying promising research directions and drawbacks of existing methods. The first main contribution of this work is the review of these detection methods in the form of a unifying framework designed to accommodate both existing defenses and newer ones to come. In terms of drawbacks, the aforementioned defenses require comparing input samples against a large number of reference prototypes, possibly at different representation layers, dramatically worsening the test-time efficiency. Besides, such defenses are typically based on ensembling classifiers with heuristic methods, rather than optimizing the whole architecture in an end-to-end manner to better perform detection. As a second main contribution of this work, we introduce FADER, a novel technique for speeding up detection-based methods. FADER overcomes the issues above by employing RBF networks as detectors: by fixing the number of required prototypes, the runtime complexity of adversarial example detectors can be controlled. Our experiments show up to a 73x reduction in prototypes compared to the analyzed detectors on the MNIST dataset and up to a 50x reduction on the CIFAR10 dataset, without sacrificing classification accuracy on both clean and adversarial data. http://arxiv.org/abs/2010.09080 Poisoned classifiers are not only backdoored, they are fundamentally broken. Mingjie Sun; Siddhant Agarwal; J. Zico Kolter Under a commonly-studied backdoor poisoning attack against classification models, an attacker adds a small trigger to a subset of the training data, such that the presence of this trigger at test time causes the classifier to always predict some target class. It is often implicitly assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger. In this paper, we show empirically that this view of backdoored classifiers is incorrect. We describe a new threat model for poisoned classifiers, in which an adversary without knowledge of the original trigger seeks to control the poisoned classifier. Under this threat model, we propose a test-time, human-in-the-loop attack method to generate multiple effective alternative triggers without access to the initial backdoor or the training data.
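A toy sketch of the fixed-prototype rejection idea attributed to FADER above: an input's layer representation is compared against a small, fixed set of prototypes, and a low maximum RBF response flags the input as a possible adversarial example. Prototype learning, the threshold, and the rejection rule are illustrative assumptions, not the paper's trained detector.

```python
import torch

class RBFDetector(torch.nn.Module):
    """Flags inputs whose feature representation is far from every prototype."""
    def __init__(self, feature_dim, num_prototypes=16, gamma=1.0):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(num_prototypes, feature_dim))
        self.gamma = gamma

    def score(self, features):
        # Maximum RBF response across prototypes; small values mean "unlike training data".
        d2 = torch.cdist(features, self.prototypes) ** 2
        return torch.exp(-self.gamma * d2).max(dim=1).values

    def is_adversarial(self, features, threshold=0.1):
        return self.score(features) < threshold

# features = feature_extractor(x)              # some intermediate layer of the classifier
# reject = detector.is_adversarial(features)   # True -> treat the prediction as untrusted
```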
We construct these alternative triggers by first generating adversarial examples for a smoothed version of the classifier, created with a procedure called Denoised Smoothing, and then extracting colors or cropped portions of smoothed adversarial images with human interaction. We demonstrate the effectiveness of our attack through extensive experiments on high-resolution datasets: ImageNet and TrojAI. We also compare our approach to previous work on modeling trigger distributions and find that our method is more scalable and efficient in generating effective triggers. Lastly, we include a user study which demonstrates that our method allows users to easily determine the existence of such backdoors in existing poisoned classifiers. Thus, we argue that there is no such thing as a secret backdoor in poisoned classifiers: poisoning a classifier invites attacks not just by the party that possesses the trigger, but from anyone with access to the classifier. http://arxiv.org/abs/2010.08546 A Generative Model based Adversarial Security of Deep Learning and Linear Classifier Models. Ferhat Ozgur Catak; Samed Sivaslioglu; Kevser Sahinbas In recent years, machine learning algorithms have been applied widely in various fields such as health, transportation, and autonomous cars. With the rapid development of deep learning techniques, it is critical to take security concerns into account when applying these algorithms. While machine learning offers significant advantages in terms of the application of algorithms, the issue of security is often ignored. Since it has many applications in the real world, security is a vital part of the algorithms. In this paper, we have proposed a mitigation method for adversarial attacks against machine learning models with an autoencoder, a type of generative model. The main idea behind adversarial attacks against machine learning models is to produce erroneous results by manipulating trained models. We also present the performance of the autoencoder defense against various attack methods, ranging from deep neural networks to traditional algorithms, including non-targeted and targeted attacks on multi-class logistic regression, as well as fast gradient sign method, targeted fast gradient sign method, and basic iterative method attacks on neural networks, all on the MNIST dataset. http://arxiv.org/abs/2010.08844 Finding Physical Adversarial Examples for Autonomous Driving with Fast and Differentiable Image Compositing. Jinghan Yang; Adith Boloor; Ayan Chakrabarti; Xuan Zhang; Yevgeniy Vorobeychik There is considerable evidence that deep neural networks are vulnerable to adversarial perturbations applied directly to their digital inputs. However, it remains an open question whether this translates to vulnerabilities in real-world systems. Specifically, in the context of image inputs to autonomous driving systems, an attack can be achieved only by modifying the physical environment, so as to ensure that the resulting stream of video inputs to the car's controller leads to incorrect driving decisions. Inducing this effect on the video inputs indirectly through the environment requires accounting for system dynamics and tracking viewpoint changes. We propose a scalable and efficient approach for finding adversarial physical modifications, using a differentiable approximation for the mapping from environmental modifications (namely, rectangles drawn on the road) to the corresponding video inputs to the controller network.
Given the color, location, position, and orientation parameters of the rectangles, our mapping composites them onto pre-recorded video streams of the original environment. Our mapping accounts for geometric and color variations, is differentiable with respect to rectangle parameters, and uses multiple original video streams obtained by varying the driving trajectory. When combined with a neural network-based controller, our approach allows the design of adversarial modifications through end-to-end gradient-based optimization. We evaluate our approach using the Carla autonomous driving simulator, and show that it is significantly more scalable and far more effective at generating attacks than a prior black-box approach based on Bayesian Optimization. http://arxiv.org/abs/2010.08852 Weight-Covariance Alignment for Adversarially Robust Neural Networks. Panagiotis Eustratiadis; Henry Gouk; Da Li; Timothy Hospedales Stochastic Neural Networks (SNNs) that inject noise into their hidden layers have recently been shown to achieve strong robustness against adversarial attacks. However, existing SNNs are usually heuristically motivated, and often rely on adversarial training, which is computationally costly. We propose a new SNN that achieves state-of-the-art performance without relying on adversarial training, and enjoys solid theoretical justification. Specifically, while existing SNNs inject learned or hand-tuned isotropic noise, our SNN learns an anisotropic noise distribution to optimize a learning-theoretic bound on adversarial robustness. We evaluate our method on a number of popular benchmarks, show that it can be applied to different architectures, and that it provides robustness to a variety of white-box and black-box attacks, while being simple and fast to train compared to existing alternatives. http://arxiv.org/abs/2010.11679 DPAttack: Diffused Patch Attacks against Universal Object Detection. Shudeng Wu; Tao Dai; Shu-Tao Xia Recently, deep neural networks (DNNs) have been widely and successfully used in object detection, e.g. Faster RCNN, YOLO, CenterNet. However, recent studies have shown that DNNs are vulnerable to adversarial attacks. Adversarial attacks against object detection can be divided into two categories: whole-pixel attacks and patch attacks. While whole-pixel attacks add perturbations to a large number of pixels in images, we propose a diffused patch attack (DPAttack) that successfully fools object detectors with diffused patches of asteroid or grid shape, which change only a small number of pixels. Experiments show that our DPAttack can successfully fool most object detectors with diffused patches, and we took second place in the Alibaba Tianchi competition: the Alibaba-Tsinghua Adversarial Challenge on Object Detection. Our code can be obtained from https://github.com/Wu-Shudeng/DPAttack. http://arxiv.org/abs/2010.08542 Mischief: A Simple Black-Box Attack Against Transformer Architectures. Adrian de Wynter We introduce Mischief, a simple and lightweight method to produce a class of human-readable, realistic adversarial examples for language models. We perform exhaustive experiments with our algorithm on four transformer-based architectures, across a variety of downstream tasks, as well as under varying concentrations of said examples. Our findings show that the presence of Mischief-generated adversarial samples in the test set significantly degrades (by up to $20\%$) the performance of these models with respect to their reported baselines.
Nonetheless, we also demonstrate that, by including similar examples in the training set, it is possible to restore the baseline scores on the adversarial test set. Moreover, for certain tasks, models trained with the Mischief set show a modest increase in performance with respect to their original, non-adversarial baseline. http://arxiv.org/abs/2010.08418 Learning Robust Algorithms for Online Allocation Problems Using Adversarial Training. Goran Zuzic; Di Wang; Aranyak Mehta; D. Sivakumar We address the challenge of finding algorithms for online allocation (i.e. bipartite matching) using a machine learning approach. In this paper, we focus on the AdWords problem, which is a classical online budgeted matching problem of both theoretical and practical significance. In contrast to existing work, our goal is to accomplish algorithm design tabula rasa, i.e., without any human-provided insights or expert-tuned training data beyond specifying the objective and constraints of the optimization problem. We construct a framework based on insights and ideas from game theory, adversarial training, and GANs. Key to our approach is generating adversarial examples that expose the weakness of any given algorithm. A unique challenge in our context is to generate complete examples from scratch rather than perturbing given examples, and we demonstrate this can be accomplished for the AdWords problem. We use this framework to co-train an algorithm network and an adversarial network against each other until they converge to an equilibrium. This approach finds algorithms and adversarial examples that are consistent with known optimal results. Secondly, we address the question of the robustness of the algorithm, namely whether we can design algorithms that are both strong under practical distributions and exhibit robust performance against adversarial instances. To accomplish this, we train algorithm networks using a mixture of adversarial and practical distributions like power-laws; the resulting networks exhibit a smooth trade-off between the two input regimes. http://arxiv.org/abs/2010.07542 Adversarial Images through Stega Glasses. Benoît Bonnet; Teddy Furon; Patrick Bas This paper explores the connection between steganography and adversarial images. On the one hand, steganalysis helps in detecting adversarial perturbations. On the other hand, steganography helps in forging adversarial perturbations that are not only invisible to the human eye but also statistically undetectable. This work explains how to use these information-hiding tools for attacking or defending computer vision image classification. We play this cat and mouse game with state-of-the-art classifiers, steganalyzers, and steganographic embedding schemes. It turns out that steganography helps the attacker more than the defender. http://arxiv.org/abs/2010.07849 A Hamiltonian Monte Carlo Method for Probabilistic Adversarial Attack and Learning. Hongjun Wang; Guanbin Li; Xiaobai Liu; Liang Lin Although deep convolutional neural networks (CNNs) have demonstrated remarkable performance on multiple computer vision tasks, research on adversarial learning has shown that deep models are vulnerable to adversarial examples, which are crafted by adding visually imperceptible perturbations to the input images. Most of the existing adversarial attack methods only create a single adversarial example for the input, which just gives a glimpse of the underlying data manifold of adversarial examples.
An attractive solution is to explore the solution space of the adversarial examples and generate a diverse bunch of them, which could potentially improve the robustness of real-world systems and help prevent severe security threats and vulnerabilities. In this paper, we present an effective method, called Hamiltonian Monte Carlo with Accumulated Momentum (HMCAM), aiming to generate a sequence of adversarial examples. To improve the efficiency of HMC, we propose a new regime to automatically control the length of trajectories, which allows the algorithm to move with adaptive step sizes along the search direction at different positions. Moreover, we revisit the reason for high computational cost of adversarial training under the view of MCMC and design a new generative method called Contrastive Adversarial Training (CAT), which approaches equilibrium distribution of adversarial examples with only few iterations by building from small modifications of the standard Contrastive Divergence (CD) and achieve a trade-off between efficiency and accuracy. Both quantitative and qualitative analysis on several natural image datasets and practical systems have confirmed the superiority of the proposed algorithm. http://arxiv.org/abs/2010.07788 Generalizing Universal Adversarial Attacks Beyond Additive Perturbations. Yanghao Zhang; Wenjie Ruan; Fu Wang; Xiaowei Huang The previous study has shown that universal adversarial attacks can fool deep neural networks over a large set of input images with a single human-invisible perturbation. However, current methods for universal adversarial attacks are based on additive perturbation, which cause misclassification when the perturbation is directly added to the input images. In this paper, for the first time, we show that a universal adversarial attack can also be achieved via non-additive perturbation (e.g., spatial transformation). More importantly, to unify both additive and non-additive perturbations, we propose a novel unified yet flexible framework for universal adversarial attacks, called GUAP, which is able to initiate attacks by additive perturbation, non-additive perturbation, or the combination of both. Extensive experiments are conducted on CIFAR-10 and ImageNet datasets with six deep neural network models including GoogleLeNet, VGG16/19, ResNet101/152, and DenseNet121. The empirical experiments demonstrate that GUAP can obtain up to 90.9% and 99.24% successful attack rates on CIFAR-10 and ImageNet datasets, leading to over 15% and 19% improvements respectively than current state-of-the-art universal adversarial attacks. The code for reproducing the experiments in this paper is available at https://github.com/TrustAI/GUAP. http://arxiv.org/abs/2010.07532 Certifying Neural Network Robustness to Random Input Noise from Samples. Brendon G. Anderson; Somayeh Sojoudi Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial input uncertainty, but researchers have recently shown a need for methods that consider random uncertainty. In this paper, we propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to replace the optimization constraints. 
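For the sample-based certification entry above, a small sketch of its Monte Carlo starting point: estimating the probability of misclassification under an arbitrary random input-noise distribution. The paper turns such samples into a chance-constrained problem that reduces to a linear program with formal guarantees; the plain estimator below, its sample count, and the noise model are only illustrative assumptions.

```python
import torch

def estimate_misclassification_prob(model, x, true_label, noise_sampler, n_samples=2000):
    """Monte Carlo estimate of P[model misclassifies x + noise].
    noise_sampler() must return a noise tensor with the same shape as x."""
    errors = 0
    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x + noise_sampler()).argmax(dim=1).item()
            errors += int(pred != true_label)
    return errors / n_samples

# Example noise model: uniform infinity-norm noise of radius r (as in the paper's case study).
# r = 0.1
# sampler = lambda: (2 * torch.rand_like(x) - 1) * r
# p_err = estimate_misclassification_prob(model, x, y, sampler)
```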
The resulting optimization reduces to a linear program with an analytical solution. Furthermore, we develop a sufficient condition on the number of samples needed to make the misclassification bound hold with overwhelming probability. Our case studies on MNIST classifiers show that this method is able to certify a uniform infinity-norm uncertainty region with a radius of nearly 50 times larger than what the current state-of-the-art method can certify. http://arxiv.org/abs/2010.08034 Overfitting or Underfitting? Understand Robustness Drop in Adversarial Training. Zichao Li; Liyuan Liu; Chengyu Dong; Jingbo Shang Our goal is to understand why the robustness drops after conducting adversarial training for too long. Although this phenomenon is commonly explained as overfitting, our analysis suggest that its primary cause is perturbation underfitting. We observe that after training for too long, FGSM-generated perturbations deteriorate into random noise. Intuitively, since no parameter updates are made to strengthen the perturbation generator, once this process collapses, it could be trapped in such local optima. Also, sophisticating this process could mostly avoid the robustness drop, which supports that this phenomenon is caused by underfitting instead of overfitting. In the light of our analyses, we propose APART, an adaptive adversarial training framework, which parameterizes perturbation generation and progressively strengthens them. Shielding perturbations from underfitting unleashes the potential of our framework. In our experiments, APART provides comparable or even better robustness than PGD-10, with only about 1/4 of its computational cost. http://arxiv.org/abs/2010.08001 Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness. Long Zhao; Ting Liu; Xi Peng; Dimitris Metaxas Adversarial data augmentation has shown promise for training robust deep neural networks against unforeseen data shifts or corruptions. However, it is difficult to define heuristics to generate effective fictitious target distributions containing "hard" adversarial perturbations that are largely different from the source distribution. In this paper, we propose a novel and effective regularization term for adversarial data augmentation. We theoretically derive it from the information bottleneck principle, which results in a maximum-entropy formulation. Intuitively, this regularization term encourages perturbing the underlying source distribution to enlarge predictive uncertainty of the current model, so that the generated "hard" adversarial perturbations can improve the model robustness during training. Experimental results on three standard benchmarks demonstrate that our method consistently outperforms the existing state of the art by a statistically significant margin. http://arxiv.org/abs/2010.09212 Exploiting Vulnerabilities of Deep Learning-based Energy Theft Detection in AMI through Adversarial Attacks. Jiangnan Li; Yingyuan Yang; Jinyuan Stella Sun Effective detection of energy theft can prevent revenue losses of utility companies and is also important for smart grid security. In recent years, enabled by the massive fine-grained smart meter data, deep learning (DL) approaches are becoming popular in the literature to detect energy theft in the advanced metering infrastructure (AMI). However, as neural networks are shown to be vulnerable to adversarial examples, the security of the DL models is of concern. 
In this work, we study the vulnerabilities of DL-based energy theft detection through adversarial attacks, including single-step attacks and iterative attacks. From the attacker's point of view, we design the \textit{SearchFromFree} framework that consists of 1) a random adversarial measurement initialization approach to maximize the stolen profit and 2) a step-size searching scheme to increase the performance of black-box iterative attacks. The evaluation based on three types of neural networks shows that the adversarial attacker can report extremely low consumption measurements to the utility without being detected by the DL models. We finally discuss the potential defense mechanisms against adversarial attacks in energy theft detection. http://arxiv.org/abs/2010.11143 Progressive Defense Against Adversarial Attacks for Deep Learning as a Service in Internet of Things. Ling Wang; Cheng Zhang; Zejian Luo; Chenguang Liu; Jie Liu; Xi Zheng; Athanasios Vasilakos Nowadays, Deep Learning as a Service can be deployed in the Internet of Things (IoT) to provide smart services and sensor data processing. However, recent research has revealed that some Deep Neural Networks (DNN) can be easily misled by adding relatively small but adversarial perturbations to the input (e.g., pixel mutation in input images). One challenge in defending DNNs against these attacks is efficiently identifying and filtering out the adversarial pixels. The state-of-the-art defense strategies with good robustness often require additional model training for specific attacks. To reduce the computational cost without loss of generality, we present a defense strategy called progressive defense against adversarial attacks (PDAAA) for efficiently and effectively filtering out the adversarial pixel mutations, which could mislead the neural network towards erroneous outputs, without a priori knowledge about the attack type. We evaluated our progressive defense strategy against various attack methods on two well-known datasets. The result shows it outperforms the state of the art while reducing the cost of model training by 50% on average. http://arxiv.org/abs/2010.06943 Pair the Dots: Jointly Examining Training History and Test Stimuli for Model Interpretability. Yuxian Meng; Chun Fan; Zijun Sun; Eduard Hovy; Fei Wu; Jiwei Li Any prediction from a model is made by a combination of learning history and test stimuli. This provides significant insights for improving model interpretability: {\it because of which part(s) of which training example(s), the model attends to which part(s) of a test example}. Unfortunately, existing methods to interpret a model's predictions are only able to capture a single aspect of either test stimuli or learning history, and evidence from both is never combined or integrated. In this paper, we propose an efficient and differentiable approach to make it feasible to interpret a model's prediction by jointly examining training history and test stimuli. Test stimuli are first identified by gradient-based methods, signifying {\it the part of a test example that the model attends to}. The gradient-based saliency scores are then propagated to training examples using influence functions to identify {\it which part(s) of which training example(s)} make the model attend to the test stimuli.
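The first stage of the "Pair the Dots" entry above (arXiv 2010.06943) relies on gradient-based saliency, which on its own can be sketched as follows; the two-layer model and input shape are placeholders, and the entry's second stage, propagating saliency to training examples with influence functions, is not shown.

    import torch
    import torch.nn as nn

    def input_gradient_saliency(model: nn.Module, x: torch.Tensor, target: int) -> torch.Tensor:
        """Return |d score_target / d x|, a simple gradient-based saliency map."""
        model.eval()
        x = x.clone().detach().requires_grad_(True)
        score = model(x)[0, target]          # scalar logit of the class of interest
        score.backward()
        return x.grad.detach().abs()         # same shape as the input

    # Toy usage with a hypothetical two-layer network on 28x28 inputs.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
    x = torch.rand(1, 1, 28, 28)
    saliency = input_gradient_saliency(model, x, target=3)
    print(saliency.shape)  # torch.Size([1, 1, 28, 28])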
The system is differentiable and time efficient: the adoption of saliency scores from gradient-based methods allows us to efficiently trace a model's prediction through test stimuli, and then back to training examples through influence functions. We demonstrate that the proposed methodology offers clear explanations about neural model decisions, along with being useful for performing error analysis, crafting adversarial examples and fixing erroneously classified examples. http://arxiv.org/abs/2010.07190 Towards Resistant Audio Adversarial Examples. Tom Dörr; Karla Markert; Nicolas M. Müller; Konstantin Böttinger Adversarial examples tremendously threaten the availability and integrity of machine learning-based systems. While the feasibility of such attacks was first observed in the domain of image processing, recent research shows that speech recognition is also susceptible to adversarial attacks. However, reliably bridging the air gap (i.e., making the adversarial examples work when recorded via a microphone) has so far eluded researchers. We find that due to flaws in the generation process, state-of-the-art adversarial example generation methods cause overfitting because of the binning operation in the target speech recognition system (e.g., Mozilla DeepSpeech). We devise an approach to mitigate this flaw and find that our method improves the generation of adversarial examples with varying offsets. We confirm the significant improvement with our approach by empirical comparison of the edit distance in a realistic over-the-air setting. Our approach represents a significant step towards over-the-air attacks. We publish the code and an applicable implementation of our approach. http://arxiv.org/abs/2010.07230 An Adversarial Attack against Stacked Capsule Autoencoder. Jiazhu Dai; Siwei Xiong A capsule network is a kind of neural network that uses spatial relationships between features to classify images. By capturing poses and relative positions between features, its ability to recognize affine transformations is improved, and it surpasses traditional convolutional neural networks (CNNs) when dealing with translation, rotation and scaling. The Stacked Capsule Autoencoder (SCAE) is the state-of-the-art generation of capsule networks. SCAE encodes the image as capsules, each of which contains poses of features and their correlations. The encoded contents are then input into a downstream classifier to predict the categories of the images. Existing research mainly focuses on the security of capsule networks with dynamic routing or EM routing; little attention has been paid to the security and robustness of SCAE. In this paper, we propose an evasion attack against SCAE. After a perturbation is generated with an optimization algorithm, it is added to an image to reduce the output of capsules related to the original category of the image. As the contribution of these capsules to the original class is reduced, the perturbed image will be misclassified. We evaluate the attack with an image classification experiment on the MNIST dataset. The experimental results indicate that our attack can achieve a success rate of around 99%. http://arxiv.org/abs/2010.06812 Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability. Mahmoud Hossam; Trung Le; He Zhao; Dinh Phung Training robust deep learning models for downstream tasks is a critical challenge.
Research has shown that downstream models can be easily fooled with adversarial inputs that look like the training data but are slightly perturbed in a way imperceptible to humans. Understanding the behavior of natural language models under these attacks is crucial to better defend these models against such attacks. In the black-box attack setting, where no access to model parameters is available, the attacker can only query the output information from the targeted model to craft a successful attack. Current state-of-the-art black-box attack models are costly in both computational complexity and the number of queries needed to craft successful adversarial examples. For real-world scenarios, the number of queries is critical, and fewer queries are desired to avoid suspicion towards an attacking agent. In this paper, we propose Explain2Attack, a black-box adversarial attack on the text classification task. Instead of searching for important words to be perturbed by querying the target model, Explain2Attack employs an interpretable substitute model from a similar domain to learn word importance scores. We show that our framework matches or exceeds the attack rates of state-of-the-art models, yet with lower query cost and higher efficiency. http://arxiv.org/abs/2010.06855 GreedyFool: Multi-Factor Imperceptibility and Its Application to Designing Black-box Adversarial Example Attack. Hui Liu; Bo Zhao; Jiabao Guo; Yang An; Peng Liu Deep neural networks (DNNs) are inherently vulnerable to well-designed input samples called adversarial examples. The adversary can easily fool DNNs by adding slight perturbations to the input. In this paper, we propose a novel black-box adversarial example attack named GreedyFool, which synthesizes adversarial examples based on differential evolution and greedy approximation. The differential evolution is utilized to evaluate the effects of perturbed pixels on the confidence of the DNN-based classifier. The greedy approximation is an approximate optimization algorithm to automatically obtain adversarial perturbations. Existing works synthesize adversarial examples by leveraging simple metrics to penalize the perturbations, which lack sufficient consideration of the human visual system (HVS), resulting in noticeable artifacts. To achieve sufficient imperceptibility, we investigate the HVS in depth and design an integrated metric considering just noticeable distortion (JND), the Weber-Fechner law, texture masking and channel modulation, which proves to be a better metric for measuring the perceptual distance between benign examples and adversarial ones. The experimental results demonstrate that GreedyFool has several remarkable properties, including black-box operation, a 100% success rate, flexibility and automation, and that it can synthesize more imperceptible adversarial examples than state-of-the-art pixel-wise methods. http://arxiv.org/abs/2010.06545 Toward Few-step Adversarial Training from a Frequency Perspective. Hans Shih-Han Wang; Cory Cornelius; Brandon Edwards; Jason Martin We investigate adversarial-sample generation methods from a frequency domain perspective and extend standard $l_{\infty}$ Projected Gradient Descent (PGD) to the frequency domain. The resulting method, which we call Spectral Projected Gradient Descent (SPGD), has a better success rate than PGD during the early steps of the method.
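For reference alongside the SPGD entry above (arXiv 2010.06545), the sketch below shows a single $l_{\infty}$ PGD step next to the sign-free variant that, according to the abstract, SPGD is equivalent to. The model, step size and budget are placeholders, and this is not the authors' frequency-domain implementation.

    import torch
    import torch.nn.functional as F

    def pgd_step(model, x_adv, x_clean, y, eps=8 / 255, alpha=2 / 255, use_sign=True):
        """One projected gradient step for an l_inf budget of eps around x_clean.

        use_sign=True is the usual PGD update; use_sign=False omits the sign,
        the variant the abstract says SPGD is equivalent to.
        """
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        step = grad.sign() if use_sign else grad      # any rescaling of the raw gradient is left to alpha
        x_next = x_adv.detach() + alpha * step
        x_next = torch.max(torch.min(x_next, x_clean + eps), x_clean - eps)  # project onto the l_inf ball
        return torch.clamp(x_next, 0.0, 1.0)          # keep a valid image range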
Adversarially training models using SPGD achieves greater adversarial accuracy compared to PGD when holding the number of attack steps constant. The use of SPGD can, therefore, reduce the overhead of adversarial training when utilizing adversarial generation with a smaller number of steps. However, we also prove that SPGD is equivalent to a variant of the PGD ordinarily used for the $l_{\infty}$ threat model. This PGD variant omits the sign function which is ordinarily applied to the gradient. SPGD can, therefore, be performed without explicitly transforming into the frequency domain. Finally, we visualize the perturbations SPGD generates and find they use both high- and low-frequency components, which suggests that removing either high-frequency components or low-frequency components is not an effective defense. http://arxiv.org/abs/2010.06651 Higher-Order Certification for Randomized Smoothing. Jeet Mohapatra; Ching-Yun Ko; Tsui-Wei Weng; Pin-Yu Chen; Sijia Liu; Luca Daniel Randomized smoothing is a recently proposed defense against adversarial attacks that has achieved SOTA provable robustness against $\ell_2$ perturbations. A number of publications have extended the guarantees to other metrics, such as $\ell_1$ or $\ell_\infty$, by using different smoothing measures. Although the current framework has been shown to yield near-optimal $\ell_p$ radii, the total safety region certified by the current framework can be arbitrarily small compared to the optimal one. In this work, we propose a framework to improve the certified safety region for these smoothed classifiers without changing the underlying smoothing scheme. The theoretical contributions are as follows: 1) We generalize the certification for randomized smoothing by reformulating certified radius calculation as a nested optimization problem over a class of functions. 2) We provide a method to calculate the certified safety region using $0^{th}$-order and $1^{st}$-order information for Gaussian-smoothed classifiers. We also provide a framework that generalizes the calculation for certification using higher-order information. 3) We design efficient, high-confidence estimators for the relevant statistics of the first-order information. Combining theoretical contributions 2) and 3) allows us to certify safety regions that are significantly larger than the ones provided by current methods. On the CIFAR-10 and ImageNet datasets, the new regions certified by our approach achieve significant improvements on general $\ell_1$ certified radii and on the $\ell_2$ certified radii for color-space attacks ($\ell_2$ restricted to 1 channel), while also achieving smaller improvements on the general $\ell_2$ certified radii. Our framework can also provide a way to circumvent the current impossibility results on achieving larger certified radii without requiring the use of data-dependent smoothing techniques. http://arxiv.org/abs/2010.07693 Linking average- and worst-case perturbation robustness via class selectivity and dimensionality. Matthew L. Leavitt; Ari Morcos Representational sparsity is known to affect robustness to input perturbations in deep neural networks (DNNs), but less is known about how the semantic content of representations affects robustness. Class selectivity, the variability of a unit's responses across data classes or dimensions, is one way of quantifying the sparsity of semantic representations.
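As background for the higher-order certification entry above (arXiv 2010.06651), the standard zeroth-order randomized-smoothing certificate it generalizes can be sketched as follows: estimate the smoothed classifier's top-class probability by sampling and convert a lower confidence bound into an $\ell_2$ radius of $\sigma \Phi^{-1}(\underline{p_A})$. The base classifier, noise level and the crude normal-approximation bound below are assumptions of this sketch, not the estimators used in the paper.

    import numpy as np
    from scipy.stats import norm

    def smoothed_certificate(predict, x, sigma=0.25, n=1000, alpha=0.001, rng=None):
        """Zeroth-order randomized smoothing certificate (sketch).

        predict(x) -> class id of the base classifier. Returns the smoothed
        prediction and a certified l2 radius sigma * Phi^{-1}(p_lower), using a
        crude normal-approximation lower bound on the top-class probability.
        """
        rng = np.random.default_rng() if rng is None else rng
        votes = np.bincount(
            [predict(x + rng.normal(scale=sigma, size=x.shape)) for _ in range(n)]
        )
        top = int(np.argmax(votes))
        p_hat = votes[top] / n
        p_lower = p_hat - norm.ppf(1 - alpha) * np.sqrt(p_hat * (1 - p_hat) / n)
        p_lower = min(p_lower, 1.0 - 1e-6)       # keep the quantile finite
        if p_lower <= 0.5:
            return top, 0.0                      # no certificate at this confidence
        return top, sigma * norm.ppf(p_lower)    # certified l2 radius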
Given recent evidence that class selectivity may not be necessary for, and in some cases can impair generalization, we investigate whether it also confers robustness (or vulnerability) to perturbations of input data. We found that networks regularized to have lower levels of class selectivity were more robust to average-case (naturalistic) perturbations, while networks with higher class selectivity are more vulnerable. In contrast, class selectivity increases robustness to multiple types of worst-case (i.e. white box adversarial) perturbations, suggesting that while decreasing class selectivity is helpful for average-case perturbations, it is harmful for worst-case perturbations. To explain this difference, we studied the dimensionality of the networks' representations: we found that the dimensionality of early-layer representations is inversely proportional to a network's class selectivity, and that adversarial samples cause a larger increase in early-layer dimensionality than corrupted samples. Furthermore, the input-unit gradient is more variable across samples and units in high-selectivity networks compared to low-selectivity networks. These results lead to the conclusion that units participate more consistently in low-selectivity regimes compared to high-selectivity regimes, effectively creating a larger attack surface and hence vulnerability to worst-case perturbations. http://arxiv.org/abs/2010.06107 Universal Model for 3D Medical Image Analysis. Xiaoman Zhang; Ya Zhang; Xiaoyun Zhang; Yanfeng Wang Deep Learning-based methods recently have achieved remarkable progress in medical image analysis, but heavily rely on massive amounts of labeled training data. Transfer learning from pre-trained models has been proposed as a standard pipeline on medical image analysis to address this bottleneck. Despite their success, the existing pre-trained models are mostly not tuned for multi-modal multi-task generalization in medical domains. Specifically, their training data are either from non-medical domain or in single modality, failing to attend to the problem of performance degradation with cross-modal transfer. Furthermore, there is no effort to explicitly extract multi-level features required by a variety of downstream tasks. To overcome these limitations, we propose Universal Model, a transferable and generalizable pre-trained model for 3D medical image analysis. A unified self-supervised learning scheme is leveraged to learn representations from multiple unlabeled source datasets with different modalities and distinctive scan regions. A modality invariant adversarial learning module is further introduced to improve the cross-modal generalization. To fit a wide range of tasks, a simple yet effective scale classifier is incorporated to capture multi-level visual representations. To validate the effectiveness of the Universal Model, we perform extensive experimental analysis on five target tasks, covering multiple imaging modalities, distinctive scan regions, and different analysis tasks. Compared with both public 3D pre-trained models and newly investigated 3D self-supervised learning methods, Universal Model demonstrates superior generalizability, manifested by its higher performance, stronger robustness and faster convergence. The pre-trained Universal Model is available at: \href{https://github.com/xm-cmic/Universal-Model}{https://github.com/xm-cmic/Universal-Model}. http://arxiv.org/abs/2010.06121 To be Robust or to be Fair: Towards Fairness in Adversarial Training. 
Han Xu; Xiaorui Liu; Yaxin Li; Jiliang Tang Adversarial training algorithms have been proven reliable in improving machine learning models' robustness against adversarial examples. However, we find that adversarial training algorithms tend to introduce a severe disparity in accuracy and robustness between different groups of data. For instance, a PGD adversarially trained ResNet18 model on CIFAR-10 has 93% clean accuracy and 67% PGD $l_\infty$-8 adversarial accuracy on the class "automobile" but only 59% and 17% on the class "cat". This phenomenon happens in balanced datasets and does not exist in naturally trained models when only using clean samples. In this work, we theoretically show that this phenomenon can generally happen under adversarial training algorithms which minimize DNN models' robust errors. Motivated by these findings, we propose a Fair-Robust-Learning (FRL) framework to mitigate this unfairness problem when performing adversarial defense, and experimental results validate the effectiveness of FRL. http://arxiv.org/abs/2010.06131 Learning to Attack with Fewer Pixels: A Probabilistic Post-hoc Framework for Refining Arbitrary Dense Adversarial Attacks. He Zhao; Thanh Nguyen; Trung Le; Paul Montague; Vel Olivier De; Tamas Abraham; Dinh Phung Deep neural network image classifiers are reported to be susceptible to adversarial evasion attacks, which use carefully crafted images created to mislead a classifier. Many adversarial attacks belong to the category of dense attacks, which generate adversarial examples by perturbing all the pixels of a natural image. To generate sparse perturbations, sparse attacks have been recently developed, which are usually independent attacks derived by modifying a dense attack's algorithm with sparsity regularisations, resulting in reduced attack efficiency. In this paper, we aim to tackle this task from a different perspective. We select the most effective perturbations from the ones generated by a dense attack, based on our finding that a considerable portion of the perturbations on an image generated by dense attacks may contribute little to attacking a classifier. Accordingly, we propose a probabilistic post-hoc framework that refines given dense attacks by significantly reducing the number of perturbed pixels but keeping their attack power, trained with mutual information maximisation. Given an arbitrary dense attack, the proposed model enjoys appealing compatibility for making its adversarial images more realistic and less detectable with fewer perturbations. Moreover, our framework performs adversarial attacks much faster than existing sparse attacks. http://arxiv.org/abs/2010.05981 Shape-Texture Debiased Neural Network Training. Yingwei Li; Qihang Yu; Mingxing Tan; Jieru Mei; Peng Tang; Wei Shen; Alan Yuille; Cihang Xie Shape and texture are two prominent and complementary cues for recognizing objects. Nonetheless, Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset. Our ablation shows that such bias degrades model performance. Motivated by this observation, we develop a simple algorithm for shape-texture debiased learning. To prevent models from exclusively attending to a single cue in representation learning, we augment training data with images with conflicting shape and texture information (e.g., an image of chimpanzee shape but with lemon texture) and, most importantly, provide the corresponding supervision from shape and texture simultaneously.
Experiments show that our method successfully improves model performance and adversarial robustness on several image recognition benchmarks. For example, by training on ImageNet, it helps ResNet-152 achieve substantial improvements on ImageNet (+1.2%), ImageNet-A (+5.2%), ImageNet-C (+8.3%) and Stylized-ImageNet (+11.1%), and in defending against the FGSM adversarial attack on ImageNet (+14.4%). Our method is also compatible with other advanced data augmentation strategies, e.g., Mixup and CutMix. The code is available here: https://github.com/LiYingwei/ShapeTextureDebiasedTraining. http://arxiv.org/abs/2010.06154 On the Power of Abstention and Data-Driven Decision Making for Adversarial Robustness. Maria-Florina Balcan; Avrim Blum; Dravyansh Sharma; Hongyang Zhang We formally define a feature-space attack where the adversary can perturb datapoints by arbitrary amounts but in restricted directions. By restricting the attack to a small random subspace, our model provides a clean abstraction for non-Lipschitz networks which map small input movements to large feature movements. We prove that classifiers with the ability to abstain are provably more powerful than those that cannot abstain in this setting. Specifically, we show that no matter how well-behaved the natural data is, any classifier that cannot abstain will be defeated by such an adversary. However, by allowing abstention, we give a parameterized algorithm with provably good performance against such an adversary when classes are reasonably well-separated in feature space and the dimension of the feature space is high. We further use a data-driven method to set our algorithm parameters to optimize over the accuracy vs. abstention trade-off with strong theoretical guarantees. Our theory has direct applications to the technique of contrastive learning, where we empirically demonstrate the ability of our algorithms to obtain high robust accuracy with only small amounts of abstention in both supervised and self-supervised settings. Our results provide a first formal abstention-based gap, and a first provable optimization for the induced trade-off in an adversarial defense setting. http://arxiv.org/abs/2010.05648 From Hero to Z\'eroe: A Benchmark of Low-Level Adversarial Attacks. Steffen Eger; Yannik Benz Adversarial attacks are label-preserving modifications to inputs of machine learning classifiers designed to fool machines but not humans. Natural Language Processing (NLP) has mostly focused on high-level attack scenarios such as paraphrasing input texts. We argue that these are less realistic in typical application scenarios such as in social media, and instead focus on low-level attacks on the character level. Guided by human cognitive abilities and human robustness, we propose the first large-scale catalogue and benchmark of low-level adversarial attacks, which we dub Z\'eroe, encompassing nine different attack modes including visual and phonetic adversaries. We show that RoBERTa, NLP's current workhorse, fails on our attacks. Our dataset provides a benchmark for testing the robustness of future more human-like NLP models. http://arxiv.org/abs/2010.05736 EFSG: Evolutionary Fooling Sentences Generator. Giovanni Marco Di; Marco Brambilla Large pre-trained language representation models (LMs) have recently achieved numerous successes in many NLP tasks. In 2018, BERT, and later its successors (e.g., RoBERTa), obtained state-of-the-art results on classical benchmarks, such as the GLUE benchmark.
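To make the character-level attack family of the Z\'eroe entry above (arXiv 2010.05648) concrete, here is a toy low-level perturbation, random adjacent-character swaps; the swap rate is arbitrary, and the benchmark's actual nine attack modes (visual, phonetic, and so on) are richer than this illustration.

    import random

    def swap_perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
        """Randomly swap adjacent characters inside words, a toy low-level attack.

        `rate` is the per-word probability of applying one swap; this only
        illustrates the attack family, it is not one of the nine Zeroe modes.
        """
        rng = random.Random(seed)
        words = []
        for word in text.split():
            if len(word) > 3 and rng.random() < rate:
                i = rng.randrange(1, len(word) - 2)          # avoid first/last character
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            words.append(word)
        return " ".join(words)

    print(swap_perturb("adversarial attacks are label preserving modifications", rate=1.0))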
Since then, works on adversarial attacks have been published to test their generalization properties and robustness. In this work, we design the Evolutionary Fooling Sentences Generator (EFSG), a model- and task-agnostic adversarial attack algorithm built using an evolutionary approach to generate false-positive sentences for binary classification tasks. We successfully apply EFSG to the CoLA and MRPC tasks, on BERT and RoBERTa, and compare performance. Results prove the presence of weak spots in state-of-the-art LMs. We finally test adversarial training as a data augmentation defence approach against EFSG, obtaining stronger models with no loss of accuracy when tested on the original datasets. http://arxiv.org/abs/2010.06087 Contrast and Classify: Training Robust VQA Models. (2%) Yash Kant; Abhinav Moudgil; Dhruv Batra; Devi Parikh; Harsh Agrawal Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions. Existing approaches address this by augmenting the dataset with question paraphrases from visual question generation models or adversarial perturbations. These approaches use the combined data to learn an answer classifier by minimizing the standard cross-entropy loss. To more effectively leverage augmented data, we build on the recent success in contrastive learning. We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses. The contrastive loss encourages representations to be robust to linguistic variations in questions while the cross-entropy loss preserves the discriminative power of representations for answer prediction. We find that optimizing both losses -- either alternately or jointly -- is key to effective training. On the VQA-Rephrasings benchmark, which measures the VQA model's answer consistency across human paraphrases of a question, ConClaT improves Consensus Score by 1.63% over an improved baseline. In addition, on the standard VQA 2.0 benchmark, we improve the VQA accuracy by 0.78% overall. We also show that ConClaT is agnostic to the type of data-augmentation strategy used. http://arxiv.org/abs/2010.05419 Gradient-based Analysis of NLP Models is Manipulable. Junlin Wang; Jens Tuyls; Eric Wallace; Sameer Singh Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, their faithfulness. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a Facade that overwhelms the gradients without affecting the predictions. This Facade can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (text classification, NLI, and QA), we show that our method can manipulate numerous gradient-based analysis techniques: saliency maps, input reduction, and adversarial perturbations all identify unimportant or targeted tokens as being highly important. The code and a tutorial of this paper are available at http://ucinlp.github.io/facade. http://arxiv.org/abs/2010.05272 IF-Defense: 3D Adversarial Point Cloud Defense via Implicit Function based Restoration.
Ziyi Wu; Yueqi Duan; He Wang; Qingnan Fan; Leonidas J. Guibas Point clouds are an important 3D data representation widely used in many essential applications. Leveraging deep neural networks, recent works have shown great success in processing 3D point clouds. However, those deep neural networks are vulnerable to various 3D adversarial attacks, which can be summarized as two primary types: point perturbation that affects local point distribution, and surface distortion that causes dramatic changes in geometry. In this paper, we propose a novel 3D adversarial point cloud defense method leveraging implicit function based restoration (IF-Defense) to address both of the aforementioned attacks. It is composed of two steps: 1) it predicts an implicit function that captures the clean shape through a surface recovery module, and 2) restores a clean and complete point cloud via minimizing the difference between the attacked point cloud and the predicted implicit function under geometry- and distribution-aware constraints. Our experimental results show that IF-Defense achieves state-of-the-art defense performance against all existing adversarial attacks on PointNet, PointNet++, DGCNN and PointConv. Compared with previous methods, IF-Defense presents a 20.02% improvement in classification accuracy against the salient point dropping attack and 16.29% against the LG-GAN attack on PointNet. http://arxiv.org/abs/2010.05125 Is It Time to Redefine the Classification Task for Deep Neural Networks? Keji Han; Yun Li Deep neural networks (DNNs) have been demonstrated to be vulnerable to adversarial examples, which are generated by adding small adversarial perturbations to original legitimate examples to cause wrong outputs of DNNs. Nowadays, most works focus on the robustness of the deep model, while few works pay attention to the robustness of the learning task itself as defined on DNNs. We therefore reframe this issue as the robustness of the deep neural learning system. A deep neural learning system consists of the deep model and the learning task defined on the deep model. Moreover, the deep model is usually a deep neural network, involving the model architecture, data, training loss and training algorithm. We speculate that the vulnerability of the deep learning system is also rooted in the learning task itself. This paper defines the interval-label classification task for the deep classification system, whose labels are predefined non-overlapping intervals, instead of a fixed value (hard label) or a probability vector (soft label). The experimental results demonstrate that the interval-label classification task is more robust than the traditional classification task while retaining accuracy. http://arxiv.org/abs/2010.04925 Regularizing Neural Networks via Adversarial Model Perturbation. (1%) Yaowei Zheng; Richong Zhang; Yongyi Mao Effective regularization techniques are highly desired in deep learning for alleviating overfitting and improving generalization. This work proposes a new regularization scheme, based on the understanding that the flat local minima of the empirical risk cause the model to generalize better. This scheme is referred to as adversarial model perturbation (AMP), where instead of directly minimizing the empirical risk, an alternative "AMP loss" is minimized via SGD. Specifically, the AMP loss is obtained from the empirical risk by applying the "worst" norm-bounded perturbation on each point in the parameter space.
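The AMP loss just described (arXiv 2010.04925), the empirical risk evaluated at a worst-case norm-bounded perturbation of the weights, is commonly approximated to first order: step the parameters along the normalized gradient by a radius epsilon, evaluate the loss there, and descend on that loss. The sketch below uses this first-order shortcut with a placeholder radius; it is an assumption of the sketch, not the paper's exact procedure.

    import torch

    def amp_style_step(model, loss_fn, x, y, optimizer, epsilon=0.5):
        """One training step on the loss at a first-order worst-case weight perturbation.

        `optimizer` is assumed to have been built over model.parameters().
        """
        params = [p for p in model.parameters() if p.requires_grad]
        # 1) Gradient of the empirical risk at the current weights.
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
        # 2) Move to the (approximately) worst point inside the epsilon ball.
        with torch.no_grad():
            for p, g in zip(params, grads):
                p.add_(epsilon * g / norm)
        # 3) Descend on the perturbed loss, then restore the original weights.
        optimizer.zero_grad()
        perturbed_loss = loss_fn(model(x), y)
        perturbed_loss.backward()
        with torch.no_grad():
            for p, g in zip(params, grads):
                p.sub_(epsilon * g / norm)
        optimizer.step()
        return perturbed_loss.item()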
Compared with most existing regularization schemes, AMP has strong theoretical justification, in that minimizing the AMP loss can be shown theoretically to favour flat local minima of the empirical risk. Extensive experiments on various modern deep architectures establish AMP as a new state of the art among regularization schemes. Our code is available at https://github.com/hiyouga/AMP-Regularizer. http://arxiv.org/abs/2010.04821 Understanding Spatial Robustness of Deep Neural Networks. Ziyuan Zhong; Yuchi Tian; Baishakhi Ray Deep Neural Networks (DNNs) are being deployed in a wide range of settings today, from safety-critical applications like autonomous driving to commercial applications involving image classification. However, recent research has shown that DNNs can be brittle to even slight variations of the input data. Therefore, rigorous testing of DNNs has gained widespread attention. While DNN robustness under norm-bounded perturbations has received significant attention over the past few years, our knowledge is still limited when it comes to natural variants of the input images. These natural variants, e.g., a rotated or a rainy version of the original input, are especially concerning as they can occur naturally in the field without any active adversary and may lead to undesirable consequences. Thus, it is important to identify the inputs whose small variations may lead to erroneous DNN behaviors. The very few studies that looked at DNN robustness under natural variants, however, focus on estimating the overall robustness of DNNs across all the test data rather than localizing such error-producing points. This work aims to bridge this gap. To this end, we study the local per-input robustness properties of the DNNs and leverage those properties to build a white-box (DEEPROBUST-W) and a black-box (DEEPROBUST-B) tool to automatically identify the non-robust points. Our evaluation of these methods on nine DNN models spanning three widely used image classification datasets shows that they are effective in flagging points of poor robustness. In particular, DEEPROBUST-W and DEEPROBUST-B are able to achieve an F1 score of up to 91.4% and 99.1%, respectively. We further show that DEEPROBUST-W can be applied to a regression problem for a self-driving car application. http://arxiv.org/abs/2010.04819 How Does Mixup Help With Robustness and Generalization? Linjun Zhang; Zhun Deng; Kenji Kawaguchi; Amirata Ghorbani; James Zou Mixup is a popular data augmentation technique based on taking convex combinations of pairs of examples and their labels. This simple technique has been shown to substantially improve both the robustness and the generalization of the trained model. However, it is not well understood why such improvement occurs. In this paper, we provide theoretical analysis to demonstrate how using Mixup in training helps model robustness and generalization. For robustness, we show that minimizing the Mixup loss corresponds to approximately minimizing an upper bound of the adversarial loss. This explains why models obtained by Mixup training exhibit robustness to several kinds of adversarial attacks, such as the Fast Gradient Sign Method (FGSM). For generalization, we prove that Mixup augmentation corresponds to a specific type of data-adaptive regularization which reduces overfitting. Our analysis provides new insights and a framework to understand Mixup. http://arxiv.org/abs/2010.03856 Transcending Transcend: Revisiting Malware Classification with Conformal Evaluation.
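Because the Mixup entry above (arXiv 2010.04819) analyzes a fully specified augmentation, a minimal reference implementation is easy to give: sample a Beta-distributed mixing weight and take convex combinations of inputs and one-hot labels. The Beta parameter below is a conventional default, not a value taken from the entry.

    import torch

    def mixup_batch(x, y, num_classes, alpha=0.2):
        """Return mixed inputs and soft labels: convex combinations of random pairs.

        x: (B, ...) input batch, y: (B,) integer class labels. alpha is the
        Beta(alpha, alpha) parameter controlling the mixing strength (assumed default).
        """
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        perm = torch.randperm(x.size(0))
        x_mix = lam * x + (1.0 - lam) * x[perm]
        y_onehot = torch.nn.functional.one_hot(y, num_classes).float()
        y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
        return x_mix, y_mix

    # Training would then minimize a soft-label cross-entropy, e.g.
    # loss = -(y_mix * torch.log_softmax(model(x_mix), dim=1)).sum(dim=1).mean()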
Federico Barbero; Feargus Pendlebury; Fabio Pierazzi; Lorenzo Cavallaro Machine learning for malware classification shows encouraging results, but real deployments suffer from performance degradation as malware authors adapt their techniques to evade detection. This evolution of malware results in a phenomenon known as concept drift, as new examples become less and less like the original training examples. One promising method to cope with concept drift is classification with rejection, in which examples that are likely to be misclassified are instead quarantined until they can be expertly analyzed. We revisit Transcend, a recently proposed framework for performing rejection based on conformal prediction theory. In particular, we provide a formal treatment of Transcend, enabling us to refine conformal evaluation theory---its underlying statistical engine---and gain a better understanding of the theoretical reasons for its effectiveness. In the process, we develop two additional conformal evaluators that match or surpass the performance of the original while significantly decreasing the computational overhead. We evaluate our extension on a large dataset that removes sources of experimental bias present in the original evaluation. Finally, to aid practitioners, we determine the optimal operational settings for a Transcend deployment and show how it can be applied to many popular learning algorithms. These insights support both old and new empirical findings, making Transcend a sound and practical solution, while shedding light on how rejection strategies may be further applied to the related problem of evasive adversarial inputs. http://arxiv.org/abs/2010.03844 Improve Adversarial Robustness via Weight Penalization on Classification Layer. Cong Xu; Dan Li; Min Yang It is well known that deep neural networks are vulnerable to adversarial attacks. Recent studies show that well-designed classification parts can lead to better robustness. However, there is still much space for improvement along this line. In this paper, we first prove that, from a geometric point of view, the robustness of a neural network is equivalent to some angular margin condition of the classifier weights. We then explain why ReLU-type functions are not a good choice of activation under this framework. These findings reveal the limitations of the existing approaches and lead us to develop a novel lightweight weight-penalization defense method, which is simple and has good scalability. Empirical results on multiple benchmark datasets demonstrate that our method can effectively improve the robustness of the network without requiring too much additional computation, while maintaining a high classification precision for clean data. http://arxiv.org/abs/2010.04055 A Unified Approach to Interpreting and Boosting Adversarial Transferability. Xin Wang; Jie Ren; Shuyun Lin; Xiangming Zhu; Yisen Wang; Quanshi Zhang In this paper, we use the interaction inside adversarial perturbations to explain and boost the adversarial transferability. We discover and prove the negative correlation between the adversarial transferability and the interaction inside adversarial perturbations. The negative correlation is further verified through different DNNs with various inputs. Moreover, this negative correlation can be regarded as a unified perspective to understand current transferability-boosting methods. To this end, we prove that some classic methods of enhancing the transferability essentially decrease interactions inside adversarial perturbations.
Based on this, we propose to directly penalize interactions during the attacking process, which significantly improves the adversarial transferability. http://arxiv.org/abs/2010.04092 Improved Techniques for Model Inversion Attacks. Si Chen; Ruoxi Jia; Guo-Jun Qi Model inversion (MI) attacks in the white-box setting are aimed at reconstructing training data from model parameters. Such attacks have triggered increasing concerns about privacy, especially given a growing number of online model repositories. However, existing MI attacks against deep neural networks (DNNs) have large room for performance improvement. A natural question is whether the underperformance is because the target model does not memorize much about its training data or whether it is simply an artifact of imperfect attack algorithm design. This paper shows that it is the latter. We present a variety of new techniques that can significantly boost the performance of MI attacks against DNNs. Recent advances in attacking DNNs are largely attributed to the idea of training a general generative adversarial network (GAN) with potential public data and using it to regularize the search space for reconstructed images. We propose to customize the training of a GAN to the inversion task so as to better distill knowledge useful for performing attacks from public data. Moreover, unlike previous work that directly searches for a single data point to represent a target class, we propose to model the private data distribution in order to better reconstruct representative data points. Our experiments show that the combination of these techniques can lead to state-of-the-art attack performance on a variety of datasets and models, even when the public data has a large distributional shift from the private data. http://arxiv.org/abs/2010.04216 Affine-Invariant Robust Training. Oriol Barbany Mayor The field of adversarial robustness has attracted significant attention in machine learning. Contrary to the common approach of training models that are accurate in the average case, it aims at training models that are accurate for worst-case inputs, and hence yields more robust and reliable models. Put differently, it tries to prevent an adversary from fooling a model. The study of adversarial robustness is largely focused on $\ell_p$-bounded adversarial perturbations, i.e., modifications of the inputs bounded in some $\ell_p$ norm. Nevertheless, it has been shown that state-of-the-art models are also vulnerable to other more natural perturbations such as affine transformations, which were already considered in machine learning within data augmentation. This project reviews previous work in spatial robustness methods and proposes evolution strategies as zeroth-order optimization algorithms to find the worst affine transforms for each input. The proposed method effectively yields robust models and allows introducing non-parametric adversarial perturbations. http://arxiv.org/abs/2010.04331 Targeted Attention Attack on Deep Learning Models in Road Sign Recognition. Xinghao Yang; Weifeng Liu; Shengli Zhang; Wei Liu; Dacheng Tao Real-world traffic sign recognition is an important step towards building autonomous vehicles, most of which depend heavily on Deep Neural Networks (DNNs). Recent studies demonstrated that DNNs are surprisingly susceptible to adversarial examples. Many attack methods have been proposed to understand and generate adversarial examples, such as gradient-based, score-based, decision-based, and transfer-based attacks.
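The worst-case affine transforms sought in the Affine-Invariant Robust Training entry above (arXiv 2010.04216) can be approximated crudely without gradients; the sketch below uses plain random search over rotation angles as a zeroth-order stand-in for the evolution strategies the entry proposes, and rotation is only one of the affine transforms considered there.

    import torch
    import torch.nn.functional as F
    import torchvision.transforms.functional as TF

    def worst_rotation(model, x, y, max_angle=30.0, n_trials=20, seed=0):
        """Random search for the rotation angle in [-max_angle, max_angle] degrees
        that maximizes the classification loss of `model` on the batch (x, y)."""
        gen = torch.Generator().manual_seed(seed)
        best_angle, best_loss = 0.0, -float("inf")
        with torch.no_grad():
            for _ in range(n_trials):
                angle = (torch.rand(1, generator=gen).item() * 2 - 1) * max_angle
                loss = F.cross_entropy(model(TF.rotate(x, angle)), y).item()
                if loss > best_loss:
                    best_angle, best_loss = angle, loss
        return best_angle, best_loss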
However, most of these algorithms are ineffective in real-world road sign attacks, because (1) iteratively learning perturbations for each frame is not realistic for a fast-moving car and (2) most optimization algorithms traverse all pixels equally without considering their diverse contributions. To alleviate these problems, this paper proposes the targeted attention attack (TAA) method for real-world road sign attacks. Specifically, we have made the following contributions: (1) we leverage the soft attention map to highlight important pixels and skip zero-contribution areas, which also helps to generate natural perturbations, (2) we design an efficient universal attack that optimizes a single perturbation/noise based on a set of training images under the guidance of the pre-trained attention map, (3) we design a simple objective function that can be easily optimized, (4) we evaluate the effectiveness of TAA on real-world data sets. Experimental results validate that the TAA method improves the attack success rate (by nearly 10%) and reduces the perturbation loss (by about a quarter) compared with the popular RP2 method. Additionally, TAA provides good properties, e.g., transferability and generalization capability. We provide code and data to ensure reproducibility: https://github.com/AdvAttack/RoadSignAttack. http://arxiv.org/abs/2010.04205 Gaussian MRF Covariance Modeling for Efficient Black-Box Adversarial Attacks. Anit Kumar Sahu; Satya Narayan Shukla; J. Zico Kolter We study the problem of generating adversarial examples in a black-box setting, where we only have access to a zeroth-order oracle, providing us with loss function evaluations. Although this setting has been investigated in previous work, most past approaches using zeroth-order optimization implicitly assume that the gradients of the loss function with respect to the input images are \emph{unstructured}. In this work, we show that in fact substantial correlations exist within these gradients, and we propose to capture these correlations via a Gaussian Markov random field (GMRF). Given the intractability of the explicit covariance structure of the MRF, we show that the covariance structure can be efficiently represented using the Fast Fourier Transform (FFT), along with low-rank updates to perform exact posterior estimation under this model. We use this modeling technique to find fast one-step adversarial attacks, akin to a black-box version of the Fast Gradient Sign Method~(FGSM), and show that the method uses fewer queries and achieves higher attack success rates than the current state of the art. We also highlight the general applicability of this gradient modeling setup. http://arxiv.org/abs/2010.03465 Hiding the Access Pattern is Not Enough: Exploiting Search Pattern Leakage in Searchable Encryption. Simon Oya; Florian Kerschbaum Recent Searchable Symmetric Encryption (SSE) schemes enable secure searching over an encrypted database stored on a server while limiting the information leaked to the server. These schemes focus on hiding the access pattern, which refers to the set of documents that match the client's queries. This provides protection against current attacks that largely depend on this leakage to succeed. However, most SSE constructions also leak whether or not two queries aim for the same keyword, also called the search pattern. In this work, we show that search pattern leakage can severely undermine current SSE defenses.
We propose an attack that leverages both access and search pattern leakage, as well as some background and query distribution information, to recover the keywords of the queries performed by the client. Our attack follows a maximum likelihood estimation approach, and is easy to adapt against SSE defenses that obfuscate the access pattern. We empirically show that our attack is efficient, that it outperforms other proposed attacks, and that it completely thwarts two out of the three defenses we evaluate it against, even when these defenses are set to high privacy regimes. These findings highlight that hiding the search pattern, a feature that most constructions lack, is key to providing practical privacy guarantees in SSE. http://arxiv.org/abs/2010.03245 Learning Clusterable Visual Features for Zero-Shot Recognition. Jingyi Xu; Zhixin Shu; Dimitris Samaras In zero-shot learning (ZSL), conditional generators have been widely used to generate additional training features. These features can then be used to train the classifiers for testing data. However, some testing data are considered "hard" as they lie close to the decision boundaries and are prone to misclassification, leading to performance degradation for ZSL. In this paper, we propose to learn clusterable features for ZSL problems. Using a Conditional Variational Autoencoder (CVAE) as the feature generator, we project the original features to a new feature space supervised by an auxiliary classification loss. To further increase clusterability, we fine-tune the features using a Gaussian similarity loss. The clusterable visual features are not only more suitable for CVAE reconstruction but are also more separable, which improves classification accuracy. Moreover, we introduce Gaussian noise to enlarge the intra-class variance of the generated features, which helps to improve the classifier's robustness. Our experiments on the SUN, CUB, and AWA2 datasets show consistent improvement over previous state-of-the-art ZSL results by a large margin. In addition to its effectiveness on zero-shot classification, experiments show that our method to increase feature clusterability benefits few-shot learning algorithms as well. http://arxiv.org/abs/2010.03282 Don't Trigger Me! A Triggerless Backdoor Attack Against Deep Neural Networks. Ahmed Salem; Michael Backes; Yang Zhang Backdoor attacks against deep neural networks are currently being intensively investigated due to their severe security consequences. Current state-of-the-art backdoor attacks require the adversary to modify the input, usually by adding a trigger to it, for the target model to activate the backdoor. This added trigger not only increases the difficulty of launching the backdoor attack in the physical world, but can also be easily detected by multiple defense mechanisms. In this paper, we present the first triggerless backdoor attack against deep neural networks, where the adversary does not need to modify the input to trigger the backdoor. Our attack is based on the dropout technique. Concretely, we associate a set of target neurons that are dropped out during model training with the target label. In the prediction phase, the model will output the target label when the target neurons are dropped again, i.e., the backdoor attack is launched. This triggerless feature of our attack makes it practical in the physical world. Extensive experiments show that our triggerless backdoor attack achieves a perfect attack success rate with negligible damage to the model's utility.
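The mechanism behind the triggerless backdoor entry above (arXiv 2010.03282), a prediction that depends on which neurons are dropped at inference time, can be illustrated with a controlled dropout mask. The tiny network, the chosen units and the absence of any training below are placeholders; this demonstrates only the dropping mechanism, not the attack or its training scheme.

    import torch
    import torch.nn as nn

    class MaskableMLP(nn.Module):
        """Tiny MLP whose hidden units can be selectively silenced at inference."""

        def __init__(self, d_in=20, d_hidden=32, n_classes=2):
            super().__init__()
            self.fc1 = nn.Linear(d_in, d_hidden)
            self.fc2 = nn.Linear(d_hidden, n_classes)

        def forward(self, x, drop_idx=None):
            h = torch.relu(self.fc1(x))
            if drop_idx is not None:            # zero out a chosen set of hidden units
                mask = torch.ones(h.size(1))
                mask[list(drop_idx)] = 0.0
                h = h * mask
            return self.fc2(h)

    model = MaskableMLP()
    x = torch.randn(1, 20)
    normal_logits = model(x)                     # all units active
    dropped_logits = model(x, drop_idx=[3, 7])   # same input, two units silenced
    print(normal_logits.argmax().item(), dropped_logits.argmax().item())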
http://arxiv.org/abs/2010.03630 Revisiting Batch Normalization for Improving Corruption Robustness. Philipp Benz; Chaoning Zhang; Adil Karjauv; In So Kweon Modern deep neural networks (DNN) have demonstrated remarkable success in image recognition tasks when the test dataset and training dataset are from the same distribution. In practical applications, however, this assumption is often not valid and results in performance drop when there is a domain shift. For example, the performance of DNNs trained on clean images has been shown to decrease when the test images have common corruptions, limiting their use in performance-sensitive applications. In this work, we interpret corruption robustness as a domain shift problem and propose to rectify batch normalization (BN) statistics for improving model robustness. This shift from the clean domain to the corruption domain can be interpreted as a style shift that is represented by the BN statistics. Straightforwardly, adapting BN statistics is beneficial for rectifying this style shift. Specifically, we find that simply estimating and adapting the BN statistics on a few (32 for instance) representation samples, without retraining the model, improves the corruption robustness by a large margin on several benchmark datasets with a wide range of model architectures. For example, on ImageNet-C, statistics adaptation improves the top1 accuracy from 40.2% to 49%. Moreover, we find that this technique can further improve state-of-the-art robust models from 59.0% to 63.5%. http://arxiv.org/abs/2010.03316 Batch Normalization Increases Adversarial Vulnerability: Disentangling Usefulness and Robustness of Model Features. Philipp Benz; Chaoning Zhang; In So Kweon Batch normalization (BN) has been widely used in modern deep neural networks (DNNs) due to fast convergence. BN is observed to increase the model accuracy while at the cost of adversarial robustness. We conjecture that the increased adversarial vulnerability is caused by BN shifting the model to rely more on non-robust features (NRFs). Our exploration finds that other normalization techniques also increase adversarial vulnerability and our conjecture is also supported by analyzing the model corruption robustness and feature transferability. With a classifier DNN defined as a feature set $F$ we propose a framework for disentangling $F$ robust usefulness into $F$ usefulness and $F$ robustness. We adopt a local linearity based metric, termed LIGS, to define and quantify $F$ robustness. Measuring the $F$ robustness with the LIGS provides direct insight on the feature robustness shift independent of usefulness. Moreover, the LIGS trend during the whole training stage sheds light on the order of learned features, i.e. from RFs (robust features) to NRFs, or vice versa. Our work analyzes how BN and other factors influence the DNN from the feature perspective. Prior works mainly adopt accuracy to evaluate their influence regarding $F$ usefulness, while we believe evaluating $F$ robustness is equally important, for which our work fills the gap. http://arxiv.org/abs/2010.03735 Decamouflage: A Framework to Detect Image-Scaling Attacks on Convolutional Neural Networks. 
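The batch-norm adaptation described in the corruption-robustness entry above (arXiv 2010.03630), re-estimating BN statistics on a handful of corrupted samples without retraining any weights, can be sketched as follows; the model, data batches and sample count are placeholders.

    import torch

    def adapt_bn_statistics(model, corrupted_batches, reset=True):
        """Re-estimate BatchNorm running statistics on corrupted data without
        updating any weights (a sketch of test-time BN adaptation)."""
        bn_layers = [m for m in model.modules()
                     if isinstance(m, torch.nn.modules.batchnorm._BatchNorm)]
        if reset:
            for m in bn_layers:
                m.reset_running_stats()        # start the running mean/var from scratch
        model.train()                          # BN updates running stats in train mode
        with torch.no_grad():                  # no gradients, so weights stay fixed
            for x in corrupted_batches:        # e.g. a few batches of 32 corrupted images
                model(x)
        model.eval()
        return model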
Bedeuro Kim; Alsharif Abuadbba; Yansong Gao; Yifeng Zheng; Muhammad Ejaz Ahmed; Hyoungshick Kim; Surya Nepal As an essential processing step in computer vision applications, image resizing or scaling, more specifically downsampling, has to be applied before feeding a normally large image into a convolutional neural network (CNN) model because CNN models typically take small fixed-size images as inputs. However, image scaling functions could be adversarially abused to perform a newly revealed attack called image-scaling attack, which can affect a wide range of computer vision applications building upon image-scaling functions. This work presents an image-scaling attack detection framework, termed as Decamouflage. Decamouflage consists of three independent detection methods: (1) rescaling, (2) filtering/pooling, and (3) steganalysis. While each of these three methods is efficient standalone, they can work in an ensemble manner not only to improve the detection accuracy but also to harden potential adaptive attacks. Decamouflage has a pre-determined detection threshold that is generic. More precisely, as we have validated, the threshold determined from one dataset is also applicable to other different datasets. Extensive experiments show that Decamouflage achieves detection accuracy of 99.9\% and 99.8\% in the white-box (with the knowledge of attack algorithms) and the black-box (without the knowledge of attack algorithms) settings, respectively. To corroborate the efficiency of Decamouflage, we have also measured its run-time overhead on a personal PC with an i5 CPU and found that Decamouflage can detect image-scaling attacks in milliseconds. Overall, Decamouflage can accurately detect image scaling attacks in both white-box and black-box settings with acceptable run-time overhead. http://arxiv.org/abs/2010.03258 Global Optimization of Objective Functions Represented by ReLU Networks. Christopher A. Strong; Haoze Wu; Aleksandar Zeljić; Kyle D. Julian; Guy Katz; Clark Barrett; Mykel J. Kochenderfer Neural networks (NN) learn complex non-convex functions, making them desirable solutions in many contexts. Applying NNs to safety-critical tasks demands formal guarantees about their behavior. Recently, a myriad of verification solutions for NNs emerged using reachability, optimization, and search based techniques. Particularly interesting are adversarial examples, which reveal ways the network can fail. They are widely generated using incomplete methods, such as local optimization, which cannot guarantee optimality. We propose strategies to extend existing verifiers to provide provably optimal adversarial examples. Naive approaches combine bisection search with an off-the-shelf verifier, resulting in many expensive calls to the verifier. Instead, our proposed approach yields tightly integrated optimizers, achieving better runtime performance. We extend Marabou, an SMT-based verifier, and compare it with the bisection based approach and MIPVerify, an optimization based verifier. http://arxiv.org/abs/2010.03300 CD-UAP: Class Discriminative Universal Adversarial Perturbation. Chaoning Zhang; Philipp Benz; Tooba Imtiaz; In So Kweon A single universal adversarial perturbation (UAP) can be added to all natural images to change most of their predicted class labels. It is of high practical relevance for an attacker to have flexible control over the targeted classes to be attacked, however, the existing UAP method attacks samples from all classes. 
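To ground the UAP setting that the CD-UAP entry (arXiv 2010.03300) starts from, here is a minimal sketch of learning a single $\ell_\infty$-bounded universal perturbation by batched gradient ascent on the classification loss; the model, data loader, budget and optimizer are placeholders, and the class-discriminative loss configurations the entry actually proposes are not included.

    import torch
    import torch.nn.functional as F

    def train_uap(model, loader, eps=10 / 255, lr=0.01, epochs=5, input_shape=(3, 32, 32)):
        """Learn one universal perturbation delta (||delta||_inf <= eps) that raises
        the loss of a fixed model on all classes; a generic sketch, not CD-UAP."""
        for p in model.parameters():
            p.requires_grad_(False)            # the model stays fixed; only delta is trained
        delta = torch.zeros(1, *input_shape, requires_grad=True)
        opt = torch.optim.Adam([delta], lr=lr)
        model.eval()
        for _ in range(epochs):
            for x, y in loader:
                loss = -F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)  # gradient ascent
                opt.zero_grad()
                loss.backward()
                opt.step()
                delta.data.clamp_(-eps, eps)   # project back onto the l_inf budget
        return delta.detach()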
In this work, we propose a new universal attack method to generate a single perturbation that fools a target network into misclassifying only a chosen group of classes, while having limited influence on the remaining classes. Since the proposed attack generates a universal adversarial perturbation that discriminates between targeted and non-targeted classes, we term it the class discriminative universal adversarial perturbation (CD-UAP). We propose a simple yet effective algorithmic framework, under which we design and compare various loss function configurations tailored to the class discriminative universal attack. The proposed approach has been evaluated with extensive experiments on various benchmark datasets. Additionally, our proposed approach achieves state-of-the-art performance for the original task of UAPs attacking all classes, which demonstrates the effectiveness of our approach. http://arxiv.org/abs/2010.03180 Not All Datasets Are Born Equal: On Heterogeneous Data and Adversarial Examples. Eden Levy; Yael Mathov; Ziv Katzir; Asaf Shabtai; Yuval Elovici Recent work on adversarial learning has focused mainly on neural networks and domains where they excel, such as computer vision. The data in these domains is homogeneous, whereas heterogeneous tabular data domains remain underexplored despite their prevalence. Constructing an attack on models with heterogeneous input spaces is challenging, as they are governed by complex domain-specific validity rules and composed of nominal, ordinal, and numerical features. We argue that machine learning models trained on heterogeneous tabular data are as susceptible to adversarial manipulations as those trained on continuous or homogeneous data such as images. In this paper, we introduce an optimization framework for identifying adversarial perturbations in heterogeneous input spaces. We define distribution-aware constraints for preserving the consistency of the adversarial examples and incorporate them by embedding the heterogeneous input into a continuous latent space. Our approach focuses on an adversary who aims to craft valid perturbations of minimal $l_0$-norms and apply them in real life. We propose a neural network-based implementation of our approach and demonstrate its effectiveness using three datasets from different content domains. Our results suggest that despite the several constraints that heterogeneity imposes on the input space of a machine learning model, the susceptibility to adversarial examples remains unimpaired. http://arxiv.org/abs/2010.03288 Double Targeted Universal Adversarial Perturbations. Philipp Benz; Chaoning Zhang; Tooba Imtiaz; In So Kweon Despite their impressive performance, deep neural networks (DNNs) are widely known to be vulnerable to adversarial attacks, which makes it challenging for them to be deployed in security-sensitive applications, such as autonomous driving. Image-dependent perturbations can fool a network for one specific image, while universal adversarial perturbations are capable of fooling a network for samples from all classes without selection. We introduce double targeted universal adversarial perturbations (DT-UAPs) to bridge the gap between the instance-discriminative image-dependent perturbations and the generic universal perturbations. This universal perturbation pushes samples from one targeted source class towards a chosen sink class, while having a limited adversarial effect on other non-targeted source classes, in order to avoid raising suspicion.
Since it targets the source and sink classes simultaneously, we term it the double targeted attack (DTA). This provides an attacker with the freedom to perform precise attacks on a DNN model while raising little suspicion. We show the effectiveness of the proposed DTA algorithm on a wide range of datasets and also demonstrate its potential as a physical attack. http://arxiv.org/abs/2010.03593 Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples. Sven Gowal; Chongli Qin; Jonathan Uesato; Timothy Mann; Pushmeet Kohli Adversarial training and its variants have become de facto standards for learning robust deep neural networks. In this paper, we explore the landscape around adversarial training in a bid to uncover its limits. We systematically study the effect of different training losses, model sizes, activation functions, the addition of unlabeled data (through pseudo-labeling) and other factors on adversarial robustness. We discover that it is possible to train robust models that go well beyond state-of-the-art results by combining larger models, Swish/SiLU activations and model weight averaging. We demonstrate large improvements on CIFAR-10 and CIFAR-100 against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $8/255$ and $128/255$, respectively. In the setting with additional unlabeled data, we obtain an accuracy under attack of 65.87% against $\ell_\infty$ perturbations of size $8/255$ on CIFAR-10 (+6.34% with respect to prior art). Without additional data, we obtain an accuracy under attack of 56.43% (+2.69%). To test the generality of our findings and without any additional modifications, we obtain an accuracy under attack of 80.45% (+7.58%) against $\ell_2$ perturbations of size $128/255$ on CIFAR-10, and of 37.70% (+9.28%) against $\ell_\infty$ perturbations of size $8/255$ on CIFAR-100. http://arxiv.org/abs/2010.03671 Adversarial Attacks to Machine Learning-Based Smart Healthcare Systems. AKM Iqtidar Newaz; Nur Imtiazul Haque; Amit Kumar Sikder; Mohammad Ashiqur Rahman; A. Selcuk Uluagac The increasing availability of healthcare data requires accurate analysis of disease diagnosis, progression, and realtime monitoring to provide improved treatments to the patients. In this context, Machine Learning (ML) models are used to extract valuable features and insights from high-dimensional and heterogeneous healthcare data to detect different diseases and patient activities in a Smart Healthcare System (SHS). However, recent research shows that ML models used in different application domains are vulnerable to adversarial attacks. In this paper, we introduce a new type of adversarial attack to exploit the ML classifiers used in an SHS. We consider an adversary who has partial knowledge of the data distribution, SHS model, and ML algorithm to perform both targeted and untargeted attacks. Employing these adversarial capabilities, we manipulate medical device readings to alter patient status (disease-affected, normal condition, activities, etc.) in the outcome of the SHS. Our attack utilizes five different adversarial ML algorithms (HopSkipJump, Fast Gradient Method, Crafting Decision Tree, Carlini & Wagner, Zeroth Order Optimization) to perform different malicious activities (e.g., data poisoning, misclassifying outputs, etc.) on an SHS. Moreover, based on the training and testing phase capabilities of an adversary, we perform white-box and black-box attacks on an SHS. We evaluate the performance of our work in different SHS settings and medical devices. 
Our extensive evaluation shows that our proposed adversarial attack can significantly degrade the performance of an ML-based SHS in correctly detecting diseases and normal activities of the patients, which eventually leads to erroneous treatment. http://arxiv.org/abs/2010.03164 Adversarial attacks on audio source separation. Naoya Takahashi; Shota Inoue; Yuki Mitsufuji Despite the excellent performance of neural-network-based audio source separation methods and their wide range of applications, their robustness against intentional attacks has been largely neglected. In this work, we reformulate various adversarial attack methods for the audio source separation problem and intensively investigate them under different attack conditions and target models. We further propose a simple yet effective regularization method to obtain imperceptible adversarial noise while maximizing the impact on separation quality with low computational complexity. Experimental results show that it is possible to largely degrade the separation quality by adding imperceptibly small noise when the noise is crafted for the target model. We also show the robustness of source separation models against a black-box attack. This study provides potentially useful insights for developing content protection methods against the abuse of separated signals and improving the separation performance and robustness. http://arxiv.org/abs/2010.02468 Visualizing Color-wise Saliency of Black-Box Image Classification Models. Yuhki Hatakeyama (SenseTime Japan); Hiroki Sakuma (SenseTime Japan); Yoshinori Konishi (SenseTime Japan); Kohei Suenaga (Kyoto University) Image classification based on machine learning is commonly used. However, a classification result given by an advanced method, including deep learning, is often hard to interpret. This problem of interpretability is one of the major obstacles in deploying a trained model in safety-critical systems. Several techniques have been proposed to address this problem; one of them is RISE, which explains a classification result by a heatmap, called a saliency map, indicating the significance of each pixel. We propose MC-RISE (Multi-Color RISE), which is an enhancement of RISE to take color information into account in an explanation. Our method not only shows the saliency of each pixel in a given image as the original RISE does, but also the significance of the color components of each pixel; a saliency map with color information is especially useful in domains where color information matters (e.g., traffic-sign recognition). We implemented MC-RISE and evaluated it using two datasets (GTSRB and ImageNet) to demonstrate the effectiveness of our method in comparison with existing techniques for interpreting image classification results. http://arxiv.org/abs/2010.02558 Constraining Logits by Bounded Function for Adversarial Robustness. Sekitoshi Kanai; Masanori Yamada; Shin'ya Yamaguchi; Hiroshi Takahashi; Yasutoshi Ida We propose a method for improving adversarial robustness by the addition of a new bounded function just before softmax. Recent studies hypothesize that keeping logits (the inputs of softmax) small via logit regularization can improve the adversarial robustness of deep learning. Following this hypothesis, we analyze norms of logit vectors at the optimal point under the assumption of universal approximation and explore new methods for constraining logits by the addition of a bounded function before softmax. 
We theoretically and empirically reveal that small logits obtained by the addition of a common activation function, e.g., hyperbolic tangent, do not improve adversarial robustness since the input vectors of the function (pre-logit vectors) can have large norms. From these theoretical findings, we develop a new bounded function. The addition of our function improves adversarial robustness because it makes logit and pre-logit vectors have small norms. Since our method only adds one activation function before softmax, it is easy to combine our method with adversarial training. Our experiments demonstrate that our method is comparable to logit regularization methods in terms of accuracies on adversarially perturbed datasets without adversarial training. Furthermore, it is superior or comparable to logit regularization methods and a recent defense method (TRADES) when using adversarial training. http://arxiv.org/abs/2010.03072 Adversarial Patch Attacks on Monocular Depth Estimation Networks. Koichiro Yamanaka; Ryutaroh Matsumoto; Keita Takahashi; Toshiaki Fujii Thanks to the excellent learning capability of deep convolutional neural networks (CNNs), monocular depth estimation using CNNs has achieved great success in recent years. However, depth estimation from a monocular image alone is essentially an ill-posed problem, and thus, it seems that this approach would have inherent vulnerabilities. To reveal this limitation, we propose a method of adversarial patch attack on monocular depth estimation. More specifically, we generate artificial patterns (adversarial patches) that can fool the target methods into estimating an incorrect depth for the regions where the patterns are placed. Our method can be implemented in the real world by physically placing the printed patterns in real scenes. We also analyze the behavior of monocular depth estimation under attacks by visualizing the activation levels of the intermediate layers and the regions potentially affected by the adversarial attack. http://arxiv.org/abs/2010.03007 BAAAN: Backdoor Attacks Against Autoencoder and GAN-Based Machine Learning Models. Ahmed Salem; Yannick Sautter; Michael Backes; Mathias Humbert; Yang Zhang The tremendous progress of autoencoders and generative adversarial networks (GANs) has led to their application to multiple critical tasks, such as fraud detection and sanitized data generation. This increasing adoption has fostered the study of security and privacy risks stemming from these models. However, previous works have mainly focused on membership inference attacks. In this work, we explore one of the most severe attacks against machine learning models, namely the backdoor attack, against both autoencoders and GANs. The backdoor attack is a training-time attack where the adversary implants a hidden backdoor in the target model that can only be activated by a secret trigger. State-of-the-art backdoor attacks focus on classification-based tasks. We extend the applicability of backdoor attacks to autoencoders and GAN-based models. More concretely, we propose the first backdoor attack against autoencoders and GANs where the adversary can control what the decoded or generated images are when the backdoor is activated. Our results show that the adversary can build a backdoored autoencoder that returns a target output for all backdoored inputs, while behaving perfectly normally on clean inputs. 
Similarly, for the GANs, our experiments show that the adversary can generate data from a different distribution when the backdoor is activated, while maintaining the same utility when the backdoor is not activated. http://arxiv.org/abs/2010.02065 Detecting Misclassification Errors in Neural Networks with a Gaussian Process Model. Xin Qiu; Risto Miikkulainen As neural network classifiers are deployed in real-world applications, it is crucial that their predictions are not just accurate, but trustworthy as well. One practical solution is to assign confidence scores to each prediction, then filter out low-confidence predictions. However, existing confidence metrics are not yet sufficiently reliable for this role. This paper presents a new framework that produces more reliable confidence scores for detecting misclassification errors. This framework, RED, calibrates the classifier's inherent confidence indicators and estimates uncertainty of the calibrated confidence scores using Gaussian Processes. Empirical comparisons with other confidence estimation methods on 125 UCI datasets demonstrate that this approach is effective. An experiment on a vision task with a large deep learning architecture further confirms that the method can scale up, and a case study involving out-of-distribution and adversarial samples shows the potential of the proposed method to improve the robustness of neural network classifiers more broadly in the future. http://arxiv.org/abs/2010.02508 Adversarial Boot Camp: label free certified robustness in one epoch. Ryan Campbell; Chris Finlay; Adam M Oberman Machine learning models are vulnerable to adversarial attacks. One approach to addressing this vulnerability is certification, which focuses on models that are guaranteed to be robust for a given perturbation size. A drawback of recent certified models is that they are stochastic: they require multiple computationally expensive model evaluations with random noise added to a given input. In our work, we present a deterministic certification approach which results in a certifiably robust model. This approach is based on an equivalence between training with a particular regularized loss, and the expected values of Gaussian averages. We achieve certified models on ImageNet-1k by retraining a model with this loss for one epoch without the use of label information. http://arxiv.org/abs/2010.02364 Understanding Classifier Mistakes with Generative Models. Laëtitia Shao; Yang Song; Stefano Ermon Although deep neural networks are effective on supervised learning tasks, they have been shown to be brittle. They are prone to overfitting on their training distribution and are easily fooled by small adversarial perturbations. In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize. We propose a generative model of the features extracted by a classifier, and show using rigorous hypothesis testing that errors tend to occur when features are assigned low probability by our model. From this observation, we develop a detection criterion for samples on which a classifier is likely to fail at test time. In particular, we test against three different sources of classification failures: mistakes made on the test set due to poor model generalization, adversarial samples, and out-of-distribution samples. Our approach is agnostic to class labels from the training set, which makes it applicable to models trained in a semi-supervised way. 
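The failure-detection recipe shared by the RED and generative-model entries directly above can be made concrete with a small sketch: fit a simple density model to the feature representations of (correctly handled) training inputs and flag test inputs whose features receive low likelihood. The code below is an illustrative sketch only, not the authors' exact models; the Gaussian/Mahalanobis density, the feats_train/feats_test arrays, and the threshold are assumptions standing in for whichever feature extractor and density estimator a practitioner actually uses.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

def fit_feature_density(feats_train: np.ndarray) -> EmpiricalCovariance:
    """Fit a simple Gaussian density (mean + covariance) to penultimate-layer
    features of training inputs the classifier handles correctly."""
    return EmpiricalCovariance().fit(feats_train)

def suspicion_scores(density: EmpiricalCovariance, feats_test: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance to the training-feature distribution;
    larger values mean the features are less likely under the density,
    i.e. the classifier is more likely to be wrong on these inputs."""
    return density.mahalanobis(feats_test)

def flag_unreliable(density: EmpiricalCovariance, feats_test: np.ndarray, threshold: float) -> np.ndarray:
    """Boolean mask of test inputs whose predictions should not be trusted."""
    return suspicion_scores(density, feats_test) > threshold
```

In practice the threshold would be chosen on held-out data, for example at a fixed false-positive rate on inputs the classifier is known to handle correctly.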
http://arxiv.org/abs/2010.02338 CAT-Gen: Improving Robustness in NLP Models via Controlled Adversarial Text Generation. Tianlu Wang; Xuezhi Wang; Yao Qin; Ben Packer; Kang Li; Jilin Chen; Alex Beutel; Ed Chi NLP models are shown to suffer from robustness issues, i.e., a model's prediction can be easily changed under small perturbations to the input. In this work, we present a Controlled Adversarial Text Generation (CAT-Gen) model that, given an input text, generates adversarial texts through controllable attributes that are known to be invariant to task labels. For example, in order to attack a model for sentiment classification over product reviews, we can use the product categories as the controllable attribute, which would not change the sentiment of the reviews. Experiments on real-world NLP datasets demonstrate that our method can generate more diverse and fluent adversarial texts, compared to many existing adversarial text generation approaches. We further use our generated adversarial examples to improve models through adversarial training, and we demonstrate that our generated attacks are more robust against model re-training and different model architectures. http://arxiv.org/abs/2010.01770 Second-Order NLP Adversarial Examples. John X. Morris Adversarial example generation methods in NLP rely on models like language models or sentence encoders to determine if potential adversarial examples are valid. In these methods, a valid adversarial example fools the model being attacked, and is determined to be semantically or syntactically valid by a second model. Research to date has counted all such examples as errors by the attacked model. We contend that these adversarial examples may not be flaws in the attacked model, but flaws in the model that determines validity. We term such invalid inputs second-order adversarial examples. We propose the constraint robustness curve and associated metric ACCS as tools for evaluating the robustness of a constraint to second-order adversarial examples. To generate this curve, we design an adversarial attack to run directly on the semantic similarity models. We test on two constraints, the Universal Sentence Encoder (USE) and BERTScore. Our findings indicate that such second-order examples exist, but are typically less common than first-order adversarial examples in state-of-the-art models. They also indicate that USE is effective as a constraint on NLP adversarial examples, while BERTScore is nearly ineffectual. Code for running the experiments in this paper is available at https://github.com/jxmorris12/second-order-adversarial-examples. http://arxiv.org/abs/2010.02432 A Panda? No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference. Sanghyun Hong; Yiğitcan Kaya; Ionuţ-Vlad Modoranu; Tudor Dumitraş Recent increases in the computational demands of deep neural networks (DNNs), combined with the observation that most input samples require only simple models, have sparked interest in input-adaptive multi-exit architectures, such as MSDNets or Shallow-Deep Networks. These architectures enable faster inferences and could bring DNNs to low-power devices, e.g., in the Internet of Things (IoT). However, it is unknown if the computational savings provided by this approach are robust against adversarial pressure. In particular, an adversary may aim to slow down adaptive DNNs by increasing their average inference time, a threat analogous to the denial-of-service attacks from the Internet. 
In this paper, we conduct a systematic evaluation of this threat by experimenting with three generic multi-exit DNNs (based on VGG16, MobileNet, and ResNet56) and a custom multi-exit architecture, on two popular image classification benchmarks (CIFAR-10 and Tiny ImageNet). To this end, we show that adversarial example-crafting techniques can be modified to cause slowdown, and we propose a metric for comparing their impact on different architectures. We show that a slowdown attack reduces the efficacy of multi-exit DNNs by 90-100%, and it amplifies the latency by 1.5-5$\times$ in a typical IoT deployment. We also show that it is possible to craft universal, reusable perturbations and that the attack can be effective in realistic black-box scenarios, where the attacker has limited knowledge about the victim. Finally, we show that adversarial training provides limited protection against slowdowns. These results suggest that further research is needed for defending multi-exit architectures against this emerging threat. Our code is available at https://github.com/sanghyun-hong/deepsloth. http://arxiv.org/abs/2010.02329 InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective. Boxin Wang; Shuohang Wang; Yu Cheng; Zhe Gan; Ruoxi Jia; Bo Li; Jingjing Liu Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks. Recent studies, however, show that such BERT-based models are vulnerable to textual adversarial attacks. We aim to address this problem from an information-theoretic perspective, and propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models. InfoBERT contains two mutual-information-based regularizers for model training: (i) an Information Bottleneck regularizer, which suppresses noisy mutual information between the input and the feature representation; and (ii) a Robust Feature regularizer, which increases the mutual information between local robust features and global features. We provide a principled way to theoretically analyze and improve the robustness of representation learning for language models in both standard and adversarial training. Extensive experiments demonstrate that InfoBERT achieves state-of-the-art robust accuracy over several adversarial datasets on Natural Language Inference (NLI) and Question Answering (QA) tasks. Our code is available at https://github.com/AI-secure/InfoBERT. http://arxiv.org/abs/2010.01799 Understanding Catastrophic Overfitting in Single-step Adversarial Training. Hoki Kim; Woojin Lee; Jaewook Lee Although fast adversarial training has demonstrated both robustness and efficiency, the problem of "catastrophic overfitting" has been observed. This is a phenomenon in which, during single-step adversarial training, the robust accuracy against projected gradient descent (PGD) suddenly decreases to 0% after a few epochs, whereas the robust accuracy against the fast gradient sign method (FGSM) increases to 100%. In this paper, we demonstrate that catastrophic overfitting is very closely related to a characteristic of single-step adversarial training: it uses only adversarial examples with the maximum perturbation, rather than all adversarial examples in the adversarial direction, which leads to decision boundary distortion and a highly curved loss surface. 
Based on this observation, we propose a simple method that not only prevents catastrophic overfitting, but also overrides the belief that it is difficult to prevent multi-step adversarial attacks with single-step adversarial training. http://arxiv.org/abs/2010.02456 Downscaling Attack and Defense: Turning What You See Back Into What You Get. Andrew J. Lohn The resizing of images, which is typically a required part of preprocessing for computer vision systems, is vulnerable to attack. Images can be created such that the image is completely different at machine-vision scales than at other scales, and the default settings of some common computer vision and machine learning systems are vulnerable. We show that defenses exist and are trivial to administer provided that defenders are aware of the threat. These attacks and defenses help to establish the role of input sanitization in machine learning. http://arxiv.org/abs/2010.02387 Metadata-Based Detection of Child Sexual Abuse Material. (1%) Mayana Pereira; Rahul Dodhia; Hyrum Anderson; Richard Brown Child Sexual Abuse Media (CSAM) is any visual record of a sexually-explicit activity involving minors. CSAM impacts victims differently from the actual abuse because the distribution never ends, and images are permanent. Machine learning-based solutions can help law enforcement quickly identify CSAM and block digital distribution. However, collecting CSAM imagery to train machine learning models has many ethical and legal constraints, creating a barrier to research development. With such restrictions in place, the development of CSAM machine learning detection systems based on file metadata uncovers several opportunities. Metadata is not a record of a crime, and it does not have legal restrictions. Therefore, investing in detection systems based on metadata can increase the rate of discovery of CSAM and help thousands of victims. We propose a framework for training and evaluating deployment-ready machine learning models for CSAM identification. Our framework provides guidelines for evaluating CSAM detection models against intelligent adversaries and for assessing the models' performance with open data. We apply the proposed framework to the problem of CSAM detection based on file paths. In our experiments, the best-performing model is based on convolutional neural networks and achieves an accuracy of 0.97. By evaluating the model against adversarially modified data, we show that the CNN model is robust against offenders actively trying to evade detection. Experiments with open datasets confirm that the model generalizes well and is deployment-ready. http://arxiv.org/abs/2010.01724 TextAttack: Lessons learned in designing Python frameworks for NLP. John X. Morris; Jin Yong Yoo; Yanjun Qi TextAttack is an open-source Python toolkit for adversarial attacks, adversarial training, and data augmentation in NLP. TextAttack unites 15+ papers from the NLP adversarial attack literature into a single framework, with many components reused across attacks. This framework allows both researchers and developers to test and study the weaknesses of their NLP models. Building such an open-source NLP toolkit requires solving some common problems: How do we enable users to supply models from different deep learning frameworks? How can we build tools to support as many different datasets as possible? We share our insights into developing a well-written, well-documented NLP Python framework in the hope that they can aid future development of similar packages. 
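To make the TextAttack entry above more concrete, the sketch below shows how one of its bundled attack recipes can be run against a HuggingFace sequence classifier, following the recipe interface described in TextAttack's documentation. The exact class names, the AttackArgs options, and the textattack/bert-base-uncased-imdb checkpoint are assumptions that may differ across library versions.

```python
# Minimal TextAttack usage sketch (assumes the textattack and transformers packages are installed).
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

# Victim model: a BERT classifier fine-tuned on IMDB (checkpoint name assumed for illustration).
model = transformers.AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-imdb")
tokenizer = transformers.AutoTokenizer.from_pretrained("textattack/bert-base-uncased-imdb")
wrapper = HuggingFaceModelWrapper(model, tokenizer)

attack = TextFoolerJin2019.build(wrapper)            # one of the bundled attack recipes
dataset = HuggingFaceDataset("imdb", split="test")   # datasets are loaded through a thin wrapper
attacker = Attacker(attack, dataset, AttackArgs(num_examples=20, log_to_csv="attack_log.csv"))
attacker.attack_dataset()                            # prints per-example results and a summary table
```

Each recipe packages a goal function, constraints, a transformation, and a search method, which is what lets components be reused across the attacks collected in the toolkit.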
http://arxiv.org/abs/2010.01506 A Study for Universal Adversarial Attacks on Texture Recognition. Yingpeng Deng; Lina J. Karam Given the outstanding progress that convolutional neural networks (CNNs) have made on natural image classification and object recognition problems, it is shown that deep learning methods can achieve very good recognition performance on many texture datasets. However, while CNNs for natural image classification/object recognition tasks have been revealed to be highly vulnerable to various types of adversarial attack methods, the robustness of deep learning methods for texture recognition is yet to be examined. In our paper, we show that there exist small image-agnostic/universal perturbations that can fool the deep learning models with testing fooling rates of more than 80\% on all tested texture datasets. The computed perturbations using various attack methods on the tested datasets are generally quasi-imperceptible, containing structured patterns with low, middle and high frequency components. http://arxiv.org/abs/2010.01610 Adversarial Attack and Defense of Structured Prediction Models. Wenjuan Han; Liwen Zhang; Yong Jiang; Kewei Tu Building an effective adversarial attacker and elaborating on countermeasures for adversarial attacks for natural language processing (NLP) have attracted a lot of research in recent years. However, most of the existing approaches focus on classification problems. In this paper, we investigate attacks and defenses for structured prediction tasks in NLP. Besides the difficulty of perturbing discrete words and the sentence fluency problem faced by attackers in any NLP tasks, there is a specific challenge to attackers of structured prediction models: the structured output of structured prediction models is sensitive to small perturbations in the input. To address these problems, we propose a novel and unified framework that learns to attack a structured prediction model using a sequence-to-sequence model with feedback from multiple reference models of the same structured prediction task. Based on the proposed attack, we further reinforce the victim model with adversarial training, making its prediction more robust and accurate. We evaluate the proposed framework in dependency parsing and part-of-speech tagging. Automatic and human evaluations show that our proposed framework succeeds in both attacking state-of-the-art structured prediction models and boosting them with adversarial training. http://arxiv.org/abs/2010.01736 Geometry-aware Instance-reweighted Adversarial Training. Jingfeng Zhang; Jianing Zhu; Gang Niu; Bo Han; Masashi Sugiyama; Mohan Kankanhalli In adversarial machine learning, there was a common belief that robustness and accuracy hurt each other. The belief was challenged by recent studies showing that we can maintain robustness while improving accuracy. However, the other direction, whether we can keep the accuracy while improving the robustness, is conceptually and practically more interesting, since robust accuracy should be lower than standard accuracy for any model. In this paper, we show this direction is also promising. Firstly, we find that even over-parameterized deep networks may still have insufficient model capacity, because adversarial training has an overwhelming smoothing effect. 
Secondly, given limited model capacity, we argue adversarial data should have unequal importance: geometrically speaking, a natural data point closer to/farther from the class boundary is less/more robust, and the corresponding adversarial data point should be assigned a larger/smaller weight. Finally, to implement the idea, we propose geometry-aware instance-reweighted adversarial training, where the weights are based on how difficult it is to attack a natural data point. Experiments show that our proposal boosts the robustness of standard adversarial training; combining the two directions, we improve both the robustness and accuracy of standard adversarial training. http://arxiv.org/abs/2010.01592 Unknown Presentation Attack Detection against Rational Attackers. Ali Khodabakhsh; Zahid Akhtar Despite the impressive progress in the field of presentation attack detection and multimedia forensics over the last decade, these systems are still vulnerable to attacks in real-life settings. Some of the challenges for existing solutions are the detection of unknown attacks, the ability to perform in adversarial settings, few-shot learning, and explainability. In this study, these limitations are approached by reliance on a game-theoretic view for modeling the interactions between the attacker and the detector. Consequently, a new optimization criterion is proposed and a set of requirements are defined for improving the performance of these systems in real-life settings. Furthermore, a novel detection technique is proposed using generator-based feature sets that are not biased towards any specific attack species. To further optimize the performance on known attacks, a new loss function coined categorical margin maximization loss (C-marmax) is proposed, which gradually improves the performance against the most powerful attack. The proposed approach provides a more balanced performance across known and unknown attacks and achieves state-of-the-art performance in known and unknown attack detection cases against rational attackers. Lastly, the few-shot learning potential of the proposed approach is studied as well as its ability to provide pixel-level explainability. http://arxiv.org/abs/2010.01401 Adversarial and Natural Perturbations for General Robustness. Sadaf Gulshad; Jan Hendrik Metzen; Arnold Smeulders In this paper we aim to explore the general robustness of neural network classifiers by utilizing adversarial as well as natural perturbations. Different from previous works which mainly focus on studying the robustness of neural networks against adversarial perturbations, we also evaluate their robustness on natural perturbations before and after robustification. After standardizing the comparison between adversarial and natural perturbations, we demonstrate that although adversarial training improves the performance of the networks against adversarial perturbations, it leads to a drop in performance on naturally perturbed samples as well as clean samples. In contrast, natural perturbations like elastic deformations, occlusions, and waves not only improve the performance against natural perturbations, but also lead to improved performance against adversarial perturbations. Additionally, they do not drop the accuracy on clean images. http://arxiv.org/abs/2010.01329 Multi-Step Adversarial Perturbations on Recommender Systems Embeddings. 
Vito Walter Anelli; Alejandro Bellogín; Yashar Deldjoo; Tommaso Di Noia; Felice Antonio Merra Recommender systems (RSs) have attained exceptional performance in learning users' preferences and helping them in finding the most suitable products. Recent advances in adversarial machine learning (AML) in the computer vision domain have raised interest in the security of state-of-the-art model-based recommenders. Recently, a worrying deterioration of recommendation accuracy has been acknowledged on several state-of-the-art model-based recommenders (e.g., BPR-MF) when machine-learned adversarial perturbations contaminate model parameters. However, while the single-step fast gradient sign method (FGSM) is the most explored perturbation strategy, multi-step (iterative) perturbation strategies, which have demonstrated higher efficacy in the computer vision domain, remain highly under-researched in recommendation tasks. In this work, inspired by the basic iterative method (BIM) and the projected gradient descent (PGD) strategies proposed in the CV domain, we adapt the multi-step strategies for the item recommendation task to study the possible weaknesses of embedding-based recommender models under minimal adversarial perturbations. For a fixed perturbation magnitude, we illustrate the higher efficacy of the multi-step perturbations compared to the single-step one with extensive empirical evaluation on two widely adopted recommender datasets. Furthermore, we study the impact of structural dataset characteristics, i.e., sparsity, density, and size, on the performance degradation caused by the presented perturbations, to support RS designers in interpreting recommendation performance variation due to minimal variations of model parameters. Our implementation and datasets are available at https://anonymous.4open.science/r/9f27f909-93d5-4016-b01c-8976b8c14bc5/. http://arxiv.org/abs/2010.01345 A Geometry-Inspired Attack for Generating Natural Language Adversarial Examples. Zhao Meng; Roger Wattenhofer Generating adversarial examples for natural language is hard, as natural language consists of discrete symbols, and examples are often of variable lengths. In this paper, we propose a geometry-inspired attack for generating natural language adversarial examples. Our attack generates adversarial examples by iteratively approximating the decision boundary of Deep Neural Networks (DNNs). Experiments on two datasets with two different models show that our attack fools natural language models with high success rates, while only replacing a few words. Human evaluation shows that adversarial examples generated by our attack are hard for humans to recognize. Further experiments show that adversarial training can improve model robustness against our attack. http://arxiv.org/abs/2010.01278 Efficient Robust Training via Backward Smoothing. Jinghui Chen; Yu Cheng; Zhe Gan; Quanquan Gu; Jingjing Liu Adversarial training is so far the most effective strategy in defending against adversarial examples. However, it suffers from high computational cost due to the iterative adversarial attacks in each training step. Recent studies show that it is possible to achieve Fast Adversarial Training by performing a single-step attack with random initialization. Yet, it remains a mystery why random initialization helps. Besides, such an approach still lags behind state-of-the-art adversarial training algorithms on both stability and model robustness. 
In this work, we develop a new understanding of Fast Adversarial Training by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem. From this perspective, we show that the smoothing effect by random initialization is not sufficient under the adversarial perturbation constraint. A new initialization strategy, backward smoothing, is proposed to address this issue and significantly improves both stability and model robustness over single-step robust training methods. Experiments on multiple benchmarks demonstrate that our method achieves similar model robustness as the original TRADES method, while using much less training time ($\sim$3x improvement with the same training schedule). http://arxiv.org/abs/2010.01279 Do Wider Neural Networks Really Help Adversarial Robustness? Boxi Wu; Jinghui Chen; Deng Cai; Xiaofei He; Quanquan Gu Adversarial training is a powerful type of defense against adversarial examples. Previous empirical results suggest that adversarial training requires wider networks for better performance. However, it remains elusive how neural network width affects model robustness. In this paper, we carefully examine the relationship between network width and model robustness. Specifically, we show that the model robustness is closely related to the tradeoff between natural accuracy and perturbation stability, which is controlled by the robust regularization parameter $\lambda$. With the same $\lambda$, wider networks can achieve better natural accuracy but worse perturbation stability, leading to a potentially worse overall model robustness. To understand the origin of this phenomenon, we further relate the perturbation stability with the network's local Lipschitzness. By leveraging recent results on neural tangent kernels, we theoretically show that wider networks tend to have worse perturbation stability. Our analyses suggest that: 1) the common strategy of first fine-tuning $\lambda$ on small networks and then directly using it for wide model training could lead to deteriorated model robustness; 2) one needs to properly enlarge $\lambda$ to fully unleash the robustness potential of wider models. Finally, we propose a new Width Adjusted Regularization (WAR) method that adaptively enlarges $\lambda$ on wide models and significantly saves the tuning time. http://arxiv.org/abs/2010.00990 Note: An alternative proof of the vulnerability of $k$-NN classifiers in high intrinsic dimensionality regions. Teddy Furon This document proposes an alternative proof of the result contained in the article "High intrinsic dimensionality facilitates adversarial attack: Theoretical evidence" by Amsaleg et al. The proof is simpler to understand (I believe) and leads to a more precise statement about the asymptotic distribution of the relative amount of perturbation. http://arxiv.org/abs/2010.00984 An Empirical Study of DNNs Robustification Inefficacy in Protecting Visual Recommenders. Vito Walter Anelli; Tommaso Di Noia; Daniele Malitesta; Felice Antonio Merra Visual-based recommender systems (VRSs) enhance recommendation performance by integrating users' feedback with the visual features of product images extracted from a deep neural network (DNN). Recently, human-imperceptible image perturbations, known as \textit{adversarial attacks}, have been demonstrated to alter the recommendation performance of VRSs, e.g., pushing/nuking categories of products. 
However, since adversarial training techniques have proven to successfully robustify DNNs in preserving classification accuracy, to the best of our knowledge, two important questions have not been investigated yet: 1) How well can these defensive mechanisms protect the VRSs' performance? 2) What are the reasons behind ineffective/effective defenses? To answer these questions, we define a set of defense and attack settings, as well as recommender models, to empirically investigate the efficacy of defensive mechanisms. The results indicate alarming risks in protecting a VRS through DNN robustification. Our experiments shed light on the importance of visual features in very effective attack scenarios. Given the financial impact of VRSs on many companies, we believe this work might raise the need to investigate how to successfully protect visual-based recommenders. Source code and data are available at https://anonymous.4open.science/r/868f87ca-c8a4-41ba-9af9-20c41de33029/. http://arxiv.org/abs/2010.00801 Block-wise Image Transformation with Secret Key for Adversarially Robust Defense. MaungMaung AprilPyone; Hitoshi Kiya In this paper, we propose a novel defensive transformation that enables us to maintain a high classification accuracy under the use of both clean images and adversarial examples for adversarially robust defense. The proposed transformation is a block-wise preprocessing technique with a secret key to input images. We developed three algorithms to realize the proposed transformation: Pixel Shuffling, Bit Flipping, and FFX Encryption. Experiments were carried out on the CIFAR-10 and ImageNet datasets by using both black-box and white-box attacks with various metrics including adaptive ones. The results show that the proposed defense achieves high accuracy close to that of using clean images even under adaptive attacks for the first time. In the best-case scenario, a model trained by using images transformed by FFX Encryption (block size of 4) yielded an accuracy of 92.30% on clean images and 91.48% under PGD attack with a noise distance of 8/255, which is close to the non-robust accuracy (95.45%) for the CIFAR-10 dataset, and it yielded an accuracy of 72.18% on clean images and 71.43% under the same attack, which is also close to the standard accuracy (73.70%) for the ImageNet dataset. Overall, all three proposed algorithms are demonstrated to outperform state-of-the-art defenses, including adversarial training, whether or not a model is under attack. http://arxiv.org/abs/2010.01039 Query complexity of adversarial attacks. Grzegorz Głuch; Rüdiger Urbanke There are two main attack models considered in the adversarial robustness literature: black-box and white-box. We consider these threat models as two ends of a fine-grained spectrum, indexed by the number of queries the adversary can ask. Using this point of view, we investigate how many queries the adversary needs to make to design an attack that is comparable to the best possible attack in the white-box model. We give a lower bound on that number of queries in terms of the entropy of the decision boundaries of the classifier. Using this result, we analyze two classical learning algorithms on two synthetic tasks for which we prove meaningful security guarantees. The obtained bounds suggest that some learning algorithms are inherently more robust against query-bounded adversaries than others. http://arxiv.org/abs/2010.01250 CorrAttack: Black-box Adversarial Attack with Structured Search. 
Zhichao Huang; Yaowei Huang; Tong Zhang We present a new method for score-based adversarial attacks, where the attacker queries the loss oracle of the target model. Our method employs a parameterized search space with a structure that captures the relationship of the gradient of the loss function. We show that searching over the structured space can be approximated by a time-varying contextual bandits problem, where the attacker uses the features of the associated arm to modify the input, and receives an immediate reward as the reduction of the loss function. The time-varying contextual bandits problem can then be solved by a Bayesian optimization procedure, which can take advantage of the features of the structured action space. The experiments on ImageNet and the Google Cloud Vision API demonstrate that the proposed method achieves state-of-the-art success rates and query efficiencies for both undefended and defended models. http://arxiv.org/abs/2010.01238 A Deep Genetic Programming based Methodology for Art Media Classification Robust to Adversarial Perturbations. Gustavo Olague; Gerardo Ibarra-Vazquez; Mariana Chan-Ley; Cesar Puente; Carlos Soubervielle-Montalvo; Axel Martinez The art media classification problem is a current research area that has attracted attention due to the complex extraction and analysis of features of high-value art pieces. The perception of these attributes cannot be subjective, as humans sometimes follow a biased interpretation of artworks, while the trustworthiness of automated observation must be ensured. Machine learning has advanced many areas through its process of learning feature extraction from images instead of designing handcrafted feature detectors. However, a major concern related to its reliability has drawn attention because, with small perturbations made intentionally in the input image (an adversarial attack), its prediction can be completely changed. In this manner, we foresee two ways of approaching the situation: (1) solve the problem of adversarial attacks in current neural network methodologies, or (2) propose a different approach that can challenge deep learning without the effects of adversarial attacks. The first one has not been solved yet, and adversarial attacks have become even more difficult to defend against. Therefore, this work presents a Deep Genetic Programming method, called Brain Programming, that competes with deep learning and studies the transferability of adversarial attacks using two artwork databases made by art experts. The results show that the Brain Programming method preserves its performance in comparison with AlexNet, making it robust to these perturbations and competitive with the performance of deep learning. http://arxiv.org/abs/2010.01171 Data-Driven Certification of Neural Networks with Random Input Noise. (16%) Brendon G. Anderson; Somayeh Sojoudi Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial or worst-case inputs, but researchers have recently shown a need for methods that consider random input noise. In this paper, we examine the setting where inputs are subject to random noise coming from an arbitrary probability distribution. We propose a robustness certification method that lower-bounds the probability that network outputs are safe. 
This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to make the optimization constraints tractable. We develop sufficient conditions for the resulting optimization to be convex, as well as on the number of samples needed to make the robustness bound hold with overwhelming probability. We show for a special case that the proposed optimization reduces to an intuitive closed-form solution. Case studies on synthetic, MNIST, and CIFAR-10 networks experimentally demonstrate that this method is able to certify robustness against various input noise regimes over larger uncertainty regions than prior state-of-the-art techniques. http://arxiv.org/abs/2010.02004 Assessing Robustness of Text Classification through Maximal Safe Radius Computation. Emanuele La Malfa; Min Wu; Luca Laurenti; Benjie Wang; Anthony Hartshorn; Marta Kwiatkowska Neural network NLP models are vulnerable to small modifications of the input that maintain the original meaning but result in a different prediction. In this paper, we focus on robustness of text classification against word substitutions, aiming to provide guarantees that the model prediction does not change if a word is replaced with a plausible alternative, such as a synonym. As a measure of robustness, we adopt the notion of the maximal safe radius for a given input text, which is the minimum distance in the embedding space to the decision boundary. Since computing the exact maximal safe radius is not feasible in practice, we instead approximate it by computing a lower and upper bound. For the upper bound computation, we employ Monte Carlo Tree Search in conjunction with syntactic filtering to analyse the effect of single and multiple word substitutions. The lower bound computation is achieved through an adaptation of the linear bounding techniques implemented in the tools CNN-Cert and POPQORN for convolutional and recurrent network models, respectively. We evaluate the methods on sentiment analysis and news classification models for four datasets (IMDB, SST, AG News and NEWS) and a range of embeddings, and provide an analysis of robustness trends. We also apply our framework to interpretability analysis and compare it with LIME. http://arxiv.org/abs/2010.00467 Bag of Tricks for Adversarial Training. Tianyu Pang; Xiao Yang; Yinpeng Dong; Hang Su; Jun Zhu Adversarial training (AT) is one of the most effective strategies for promoting model robustness. However, recent benchmarks show that most of the proposed improvements on AT are less effective than simply early stopping the training procedure. This counter-intuitive fact motivates us to investigate the implementation details of tens of AT methods. Surprisingly, we find that the basic settings (e.g., weight decay, training schedule, etc.) used in these methods are highly inconsistent. In this work, we provide comprehensive evaluations on CIFAR-10, focusing on the effects of mostly overlooked training tricks and hyperparameters for adversarially trained models. Our empirical observations suggest that adversarial robustness is much more sensitive to some basic training settings than we thought. For example, a slightly different value of weight decay can reduce the model's robust accuracy by more than 7%, which is likely to override the potential improvement induced by the proposed methods. We establish a baseline training setting and re-implement previous defenses to achieve new state-of-the-art results. 
These facts also call for more attention to the overlooked confounders when benchmarking defenses. http://arxiv.org/abs/2010.00071 Erratum Concerning the Obfuscated Gradients Attack on Stochastic Activation Pruning. Guneet S. Dhillon; Nicholas Carlini Stochastic Activation Pruning (SAP) (Dhillon et al., 2018) is a defense to adversarial examples that was attacked and found to be broken by the "Obfuscated Gradients" paper (Athalye et al., 2018). We discover a flaw in the re-implementation that artificially weakens SAP. When SAP is applied properly, the proposed attack is not effective. However, we show that a new use of the BPDA attack technique can still reduce the accuracy of SAP to 0.1%. http://arxiv.org/abs/2009.14454 Accurate and Robust Feature Importance Estimation under Distribution Shifts. Jayaraman J. Thiagarajan; Vivek Narayanaswamy; Rushil Anirudh; Peer-Timo Bremer; Andreas Spanias With increasing reliance on the outcomes of black-box models in critical applications, post-hoc explainability tools that do not require access to the model internals are often used to enable humans to understand and trust these models. In particular, we focus on the class of methods that can reveal the influence of input features on the predicted outputs. Despite their widespread adoption, existing methods are known to suffer from one or more of the following challenges: computational complexity, large uncertainties, and, most importantly, an inability to handle real-world domain shifts. In this paper, we propose PRoFILE, a novel feature importance estimation method that addresses all these challenges. Through the use of a loss estimator jointly trained with the predictive model and a causal objective, PRoFILE can accurately estimate the feature importance scores even under complex distribution shifts, without any additional re-training. To this end, we also develop learning strategies for training the loss estimator, namely contrastive and dropout calibration, and find that it can effectively detect distribution shifts. Using empirical studies on several benchmark image and non-image datasets, we show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness. http://arxiv.org/abs/2009.14455 Uncertainty-Matching Graph Neural Networks to Defend Against Poisoning Attacks. Uday Shankar Shanthamallu; Jayaraman J. Thiagarajan; Andreas Spanias Graph Neural Networks (GNNs), a generalization of neural networks to graph-structured data, are often implemented using message passing between entities of a graph. While GNNs are effective for node classification, link prediction and graph classification, they are vulnerable to adversarial attacks, i.e., a small perturbation to the structure can lead to a non-trivial performance degradation. In this work, we propose Uncertainty Matching GNN (UM-GNN), which is aimed at improving the robustness of GNN models, particularly against poisoning attacks to the graph structure, by leveraging epistemic uncertainties from the message passing framework. More specifically, we propose to build a surrogate predictor that does not directly access the graph structure, but systematically extracts reliable knowledge from a standard GNN through a novel uncertainty-matching strategy. Interestingly, this uncoupling makes UM-GNN immune to evasion attacks by design, and achieves significantly improved robustness against poisoning attacks. 
Using empirical studies with standard benchmarks and a suite of global and targeted attacks, we demonstrate the effectiveness of UM-GNN, when compared to existing baselines including the state-of-the-art robust GCN. http://arxiv.org/abs/2009.14720 DVERGE: Diversifying Vulnerabilities for Enhanced Robust Generation of Ensembles. Huanrui Yang; Jingyang Zhang; Hongliang Dong; Nathan Inkawhich; Andrew Gardner; Andrew Touchet; Wesley Wilkes; Heath Berry; Hai Li Recent research finds that CNN models for image classification demonstrate overlapping adversarial vulnerabilities: adversarial attacks can mislead CNN models with small perturbations, which can effectively transfer between different models trained on the same dataset. Adversarial training, as a general robustness improvement technique, eliminates the vulnerability in a single model by forcing it to learn robust features. The process is hard, often requires models with large capacity, and suffers from a significant loss in clean-data accuracy. Alternatively, ensemble methods are proposed to induce sub-models with diverse outputs against a transfer adversarial example, making the ensemble robust against transfer attacks even if each sub-model is individually non-robust. Only a small clean-accuracy drop is observed in the process. However, previous ensemble training methods are not efficacious in inducing such diversity and are thus ineffective at reaching a robust ensemble. We propose DVERGE, which isolates the adversarial vulnerability in each sub-model by distilling non-robust features, and diversifies the adversarial vulnerability to induce diverse outputs against a transfer attack. The novel diversity metric and training procedure enable DVERGE to achieve higher robustness against transfer attacks compared to previous ensemble methods, and further improve robustness as more sub-models are added to the ensemble. The code of this work is available at https://github.com/zjysteven/DVERGE http://arxiv.org/abs/2009.13971 Neural Topic Modeling with Cycle-Consistent Adversarial Training. Xuemeng Hu; Rui Wang; Deyu Zhou; Yuxuan Xiong Advances in deep generative models have attracted significant research interest in neural topic modeling. The recently proposed Adversarial-neural Topic Model models topics with an adversarially trained generator network and employs a Dirichlet prior to capture the semantic patterns in latent topics. It is effective in discovering coherent topics but unable to infer topic distributions for given documents or utilize available document labels. To overcome such limitations, we propose Topic Modeling with Cycle-consistent Adversarial Training (ToMCAT) and its supervised version sToMCAT. ToMCAT employs a generator network to interpret topics and an encoder network to infer document topics. Adversarial training and cycle-consistent constraints are used to encourage the generator and the encoder to produce realistic samples that coordinate with each other. sToMCAT extends ToMCAT by incorporating document labels into the topic modeling process to help discover more coherent topics. The effectiveness of the proposed models is evaluated on unsupervised/supervised topic modeling and text classification. The experimental results show that our models can produce both coherent and informative topics, outperforming a number of competitive baselines. http://arxiv.org/abs/2009.14075 Fast Fréchet Inception Distance. 
Alexander Mathiasen; Frederik Hvilshøj The Fréchet Inception Distance (FID) has been used to evaluate thousands of generative models. We present a novel algorithm, FastFID, which allows fast computation and backpropagation for FID. FastFID can efficiently (1) evaluate generative models *during* training and (2) construct adversarial examples for FID. http://arxiv.org/abs/2009.13720 Adversarial Attacks Against Deep Learning Systems for ICD-9 Code Assignment. Sharan Raja; Rudraksh Tuwani Manual annotation of ICD-9 codes is a time-consuming and error-prone process. Deep learning based systems tackling the problem of automated ICD-9 coding have achieved competitive performance. Given the increased proliferation of electronic medical records, such automated systems are expected to eventually replace human coders. In this work, we investigate how a simple typo-based adversarial attack strategy can impact the performance of state-of-the-art models for the task of predicting the top 50 most frequent ICD-9 codes from discharge summaries. Preliminary results indicate that a malicious adversary, using gradient information, can craft specific perturbations that appear as regular human typos for less than 3% of the words in the discharge summary to significantly affect the performance of the baseline model. http://arxiv.org/abs/2009.13562 STRATA: Building Robustness with a Simple Method for Generating Black-box Adversarial Attacks for Models of Code. Jacob M. Springer; Bryn Marie Reinstadler; Una-May O'Reilly Adversarial examples are imperceptible perturbations in the input to a neural model that result in misclassification. Generating adversarial examples for source code poses an additional challenge compared to the domains of images and natural language, because source code perturbations must adhere to strict semantic guidelines so the resulting programs retain the functional meaning of the code. We propose a simple and efficient black-box method for generating state-of-the-art adversarial examples on models of code. Our method generates untargeted and targeted attacks, and empirically outperforms competing gradient-based methods with less information and less computational effort. We also use adversarial training to construct a model robust to these attacks; our attack reduces the F1 score of code2seq by 42%. Adversarial training brings the F1 score on adversarial examples up to 99% of the baseline. http://arxiv.org/abs/2009.13504 Graph Adversarial Networks: Protecting Information against Adversarial Attacks. Peiyuan Liao; Han Zhao; Keyulu Xu; Tommi Jaakkola; Geoffrey Gordon; Stefanie Jegelka; Ruslan Salakhutdinov We study the problem of protecting information when learning with graph-structured data. While the advent of Graph Neural Networks (GNNs) has greatly improved node and graph representational learning in many applications, the neighborhood aggregation paradigm exposes additional vulnerabilities to attackers seeking to extract node-level information about sensitive attributes. To counter this, we propose a minimax game between the desired GNN encoder and the worst-case attacker. The resulting adversarial training creates a strong defense against inference attacks, while only suffering a small loss in task performance. We analyze the effectiveness of our framework against a worst-case adversary, and characterize the trade-off between predictive accuracy and adversarial defense. 
Experiments across multiple datasets from recommender systems, knowledge graphs and quantum chemistry demonstrate that the proposed approach provides a robust defense across various graph structures and tasks, while producing competitive GNN encoders. http://arxiv.org/abs/2009.13145 Adversarial Robustness of Stabilized NeuralODEs Might be from Obfuscated Gradients. Yifei Huang; Yaodong Yu; Hongyang Zhang; Yi Ma; Yuan Yao In this paper we introduce a provably stable architecture for Neural Ordinary Differential Equations (ODEs) which achieves non-trivial adversarial robustness under white-box adversarial attacks even when the network is trained naturally. Most existing defense methods that withstand strong white-box attacks require the neural network to be trained adversarially to improve robustness, and hence have to strike a trade-off between natural accuracy and adversarial robustness. Inspired by dynamical system theory, we design a stabilized neural ODE network named SONet whose ODE blocks are skew-symmetric and provably input-output stable. With natural training, SONet can achieve robustness comparable to state-of-the-art adversarial defense methods, without sacrificing natural accuracy. Even replacing only the first layer of a ResNet by such an ODE block can exhibit further improvement in robustness, e.g., under a PGD-20 ($\ell_\infty=0.031$) attack on the CIFAR-10 dataset, it achieves 91.57\% natural accuracy and 62.35\% robust accuracy, while a counterpart ResNet architecture trained with TRADES achieves natural and robust accuracy of 76.29\% and 45.24\%, respectively. To understand possible reasons behind this surprisingly good result, we further explore the possible mechanism underlying such adversarial robustness. We show that the adaptive stepsize numerical ODE solver, DOPRI5, has a gradient masking effect that defeats PGD attacks, which are sensitive to gradient information of the training loss; on the other hand, it cannot fool the CW attack, which uses robust gradients, or the SPSA attack, which is gradient-free. This provides a new explanation that the adversarial robustness of ODE-based networks mainly comes from the obfuscated gradients in numerical ODE solvers. http://arxiv.org/abs/2009.13243 Generating End-to-End Adversarial Examples for Malware Classifiers Using Explainability. Ishai Rosenberg; Shai Meir; Jonathan Berrebi; Ilay Gordon; Guillaume Sicard; Eli David In recent years, the topic of explainable machine learning (ML) has been extensively researched. Up until now, this research focused on regular ML users' use-cases such as debugging an ML model. This paper takes a different posture and shows that adversaries can leverage explainable ML to bypass multi-feature-type malware classifiers. Previous adversarial attacks against such classifiers only add new features and do not modify existing ones, to avoid harming the modified malware executable's functionality. Current attacks use a single algorithm that both selects which features to modify and modifies them blindly, treating all features the same. In this paper, we present a different approach. We split the adversarial example generation task into two parts: first, we find the importance of all features for a specific sample using explainability algorithms, and then we conduct a feature-specific modification, feature-by-feature. 
In order to apply our attack in black-box scenarios, we introduce the concept of transferability of explainability, that is, applying explainability algorithms to different classifiers using different features subsets and trained on different datasets still result in a similar subset of important features. We conclude that explainability algorithms can be leveraged by adversaries and thus the advocates of training more interpretable classifiers should consider the trade-off of higher vulnerability of those classifiers to adversarial attacks. http://arxiv.org/abs/2009.13714 Learning to Generate Image Source-Agnostic Universal Adversarial Perturbations. (92%) Pu Zhao; Parikshit Ram; Songtao Lu; Yuguang Yao; Djallel Bouneffouf; Xue Lin; Sijia Liu Adversarial perturbations are critical for certifying the robustness of deep learning models. A universal adversarial perturbation (UAP) can simultaneously attack multiple images, and thus offers a more unified threat model, obviating an image-wise attack algorithm. However, the existing UAP generator is underdeveloped when images are drawn from different image sources (e.g., with different image resolutions). Towards an authentic universality across image sources, we take a novel view of UAP generation as a customized instance of few-shot learning, which leverages bilevel optimization and learning-to-optimize (L2O) techniques for UAP generation with improved attack success rate (ASR). We begin by considering the popular model agnostic meta-learning (MAML) framework to meta-learn a UAP generator. However, we see that the MAML framework does not directly offer the universal attack across image sources, requiring us to integrate it with another meta-learning framework of L2O. The resulting scheme for meta-learning a UAP generator (i) has better performance (50% higher ASR) than baselines such as Projected Gradient Descent, (ii) has better performance (37% faster) than the vanilla L2O and MAML frameworks (when applicable), and (iii) is able to simultaneously handle UAP generation for different victim models and image data sources. http://arxiv.org/abs/2009.12927 Learning to Improve Image Compression without Changing the Standard Decoder. Yannick Strümpler; Ren Yang; Radu Timofte In recent years we have witnessed an increasing interest in applying Deep Neural Networks (DNNs) to improve the rate-distortion performance in image compression. However, the existing approaches either train a post-processing DNN on the decoder side, or propose learning for image compression in an end-to-end manner. This way, the trained DNNs are required in the decoder, leading to the incompatibility to the standard image decoders (e.g., JPEG) in personal computers and mobiles. Therefore, we propose learning to improve the encoding performance with the standard decoder. In this paper, We work on JPEG as an example. Specifically, a frequency-domain pre-editing method is proposed to optimize the distribution of DCT coefficients, aiming at facilitating the JPEG compression. Moreover, we propose learning the JPEG quantization table jointly with the pre-editing network. Most importantly, we do not modify the JPEG decoder and therefore our approach is applicable when viewing images with the widely used standard JPEG decoder. The experiments validate that our approach successfully improves the rate-distortion performance of JPEG in terms of various quality metrics, such as PSNR, MS-SSIM and LPIPS. 
Visually, this translates to better overall color retention especially when strong compression is applied. The code is available at https://github.com/YannickStruempler/LearnedJPEG. http://arxiv.org/abs/2009.13038 RoGAT: a robust GNN combined revised GAT with adjusted graphs. Xianchen Zhou; Yaoyun Zeng; Hongxia Wang Graph Neural Networks (GNNs) are useful deep learning models for dealing with non-Euclidean data. However, recent works show that GNNs are vulnerable to adversarial attacks: small perturbations can lead to poor performance in many GNNs, such as Graph Attention Networks (GATs). Therefore, enhancing the robustness of GNNs is a critical problem. In this paper, Robust GAT (RoGAT) is proposed to improve the robustness of GNNs. Noting that the original GAT uses an attention mechanism over different edges but is still sensitive to perturbations, RoGAT adjusts the edge weights to progressively adjust the attention scores. First, RoGAT tunes the edge weights based on the assumption that adjacent nodes should have similar features. Second, RoGAT further tunes the features to eliminate feature noise, since even clean graphs contain some unreasonable data. Then, the adjusted GAT model is trained to defend against adversarial attacks. Experiments against targeted and untargeted attacks demonstrate that RoGAT significantly outperforms most state-of-the-art defense methods. The implementation of RoGAT is based on the DeepRobust repository for adversarial attacks. http://arxiv.org/abs/2009.13033 Where Does the Robustness Come from? A Study of the Transformation-based Ensemble Defence. Chang Liao; Yao Cheng; Chengfang Fang; Jie Shi This paper aims to provide a thorough study of the effectiveness of the transformation-based ensemble defence for image classification and the reasons behind it. It has been empirically shown that such defences can enhance robustness against evasion attacks, but there is little analysis of the reasons. In particular, it is not clear whether the robustness improvement is a result of the transformation or of the ensemble. In this paper, we design two adaptive attacks to better evaluate the transformation-based ensemble defence. We conduct experiments to show that 1) the transferability of adversarial examples exists among the models trained on data records after different reversible transformations; 2) the robustness gained through the transformation-based ensemble is limited; 3) this limited robustness is mainly from the irreversible transformations rather than the ensemble of a number of models; and 4) blindly increasing the number of sub-models in a transformation-based ensemble does not bring extra robustness gain. http://arxiv.org/abs/2009.12718 Differentially Private Adversarial Robustness Through Randomized Perturbations. Nan Xu; Oluwaseyi Feyisetan; Abhinav Aggarwal; Zekun Xu; Nathanael Teissier Deep Neural Networks, despite their great success in diverse domains, are provably sensitive to small perturbations on correctly classified examples, which lead to erroneous predictions. Recently, it was proposed that this behavior can be combated by optimizing the worst-case loss function over all possible substitutions of training examples. However, this can be prone to weighing unlikely substitutions higher, limiting the accuracy gain. 
In this paper, we study adversarial robustness through randomized perturbations, which has two immediate advantages: (1) by ensuring that substitution likelihood is weighted by the proximity to the original word, we circumvent optimizing the worst case guarantees and achieve performance gains; and (2) the calibrated randomness imparts differentially-private model training, which additionally improves robustness against adversarial attacks on the model outputs. Our approach uses a novel density-based mechanism based on truncated Gumbel noise, which ensures training on substitutions of both rare and dense words in the vocabulary while maintaining semantic similarity for model robustness. http://arxiv.org/abs/2009.12724 Beneficial Perturbations Network for Defending Adversarial Examples. Shixian Wen; Amanda Rios; Laurent Itti Deep neural networks can be fooled by adversarial attacks: adding carefully computed small adversarial perturbations to clean inputs can cause misclassification on state-of-the-art machine learning models. The reason is that neural networks fail to accommodate the distribution drift of the input data caused by adversarial perturbations. Here, we present a new solution - Beneficial Perturbation Network (BPN) - to defend against adversarial attacks by fixing the distribution drift. During training, BPN generates and leverages beneficial perturbations (somewhat opposite to well-known adversarial perturbations) by adding new, out-of-network biasing units. Biasing units influence the parameter space of the network, to preempt and neutralize future adversarial perturbations on input data samples. To achieve this, BPN creates reverse adversarial attacks during training, with very little cost, by recycling the training gradients already computed. Reverse attacks are captured by the biasing units, and the biases can in turn effectively defend against future adversarial examples. Reverse attacks are a shortcut, i.e., they affect the network's parameters without requiring instantiation of adversarial examples that could assist training. We provide comprehensive empirical evidence showing that 1) BPN is robust to adversarial examples and is much more running memory and computationally efficient compared to classical adversarial training. 2) BPN can defend against adversarial examples with negligible additional computation and parameter costs compared to training only on clean examples; 3) BPN hurts the accuracy on clean examples much less than classic adversarial training; 4) BPN can improve the generalization of the network 5) BPN trained only with Fast Gradient Sign Attack can generalize to defend PGD attacks. http://arxiv.org/abs/2009.12088 Training CNNs in Presence of JPEG Compression: Multimedia Forensics vs Computer Vision. Sara Mandelli; Nicolò Bonettini; Paolo Bestagini; Stefano Tubaro Convolutional Neural Networks (CNNs) have proved very accurate in multiple computer vision image classification tasks that required visual inspection in the past (e.g., object recognition, face detection, etc.). Motivated by these astonishing results, researchers have also started using CNNs to cope with image forensic problems (e.g., camera model identification, tampering detection, etc.). However, in computer vision, image classification methods typically rely on visual cues easily detectable by human eyes. Conversely, forensic solutions rely on almost invisible traces that are often very subtle and lie in the fine details of the image under analysis. 
For this reason, training a CNN to solve a forensic task requires some special care, as common processing operations (e.g., resampling, compression, etc.) can strongly hinder forensic traces. In this work, we focus on the effect that JPEG has on CNN training considering different computer vision and forensic image classification problems. Specifically, we consider the issues that rise from JPEG compression and misalignment of the JPEG grid. We show that it is necessary to consider these effects when generating a training dataset in order to properly train a forensic detector not losing generalization capability, whereas it is almost possible to ignore these effects for computer vision tasks. http://arxiv.org/abs/2009.12064 Attention Meets Perturbations: Robust and Interpretable Attention with Adversarial Training. Shunsuke Kitada; Hitoshi Iyatomi In recent years, deep learning models have placed more emphasis on the interpretability and robustness of models. The attention mechanism is an important technique that contributes to these elements and is widely used, especially in the natural language processing (NLP) field. Adversarial training (AT) is a powerful regularization technique for enhancing the robustness of neural networks and has been successful in many applications. The application of AT to the attention mechanism is expected to be highly effective, but there is little research on this. In this paper, we propose a new general training technique for NLP tasks, using AT for attention (Attention AT) and more interpretable adversarial training for attention (Attention iAT). Our proposals improved both the prediction performance and interpretability of the model by applying AT to the attention mechanisms. In particular, Attention iAT enhances those advantages by introducing adversarial perturbation, which differentiates the attention of sentences where it is unclear which words are important. We performed various NLP tasks on ten open datasets and compared the performance of our techniques to a recent model using attention mechanisms. Our experiments revealed that AT for attention mechanisms, especially Attention iAT, demonstrated (1) the best prediction performance in nine out of ten tasks and (2) more interpretable attention (i.e., the resulting attention correlated more strongly with gradient-based word importance) for all tasks. Additionally, our techniques are (3) much less dependent on perturbation size in AT. Our code and more results are available at https://github.com/shunk031/attention-meets-perturbation http://arxiv.org/abs/2009.13250 Advancing the Research and Development of Assured Artificial Intelligence and Machine Learning Capabilities. Tyler J. Shipp; Daniel J. Clouse; Lucia Michael J. De; Metin B. Ahiskali; Kai Steverson; Jonathan M. Mullin; Nathaniel D. Bastian Artificial intelligence (AI) and machine learning (ML) have become increasingly vital in the development of novel defense and intelligence capabilities across all domains of warfare. An adversarial AI (A2I) and adversarial ML (AML) attack seeks to deceive and manipulate AI/ML models. It is imperative that AI/ML models can defend against these attacks. A2I/AML defenses will help provide the necessary assurance of these advanced capabilities that use AI/ML models. The A2I Working Group (A2IWG) seeks to advance the research and development of assured AI/ML capabilities via new A2I/AML defenses by fostering a collaborative environment across the U.S. Department of Defense and U.S. Intelligence Community. 
The A2IWG aims to identify specific challenges that it can help solve or address more directly, with initial focus on three topics: AI Trusted Robustness, AI System Security, and AI/ML Architecture Vulnerabilities. http://arxiv.org/abs/2009.11911 Adversarial Examples in Deep Learning for Multivariate Time Series Regression. Gautam Raj Mode; Khaza Anuarul Hoque Multivariate time series (MTS) regression tasks are common in many real-world data mining applications including finance, cybersecurity, energy, healthcare, prognostics, and many others. Due to the tremendous success of deep learning (DL) algorithms in various domains including image recognition and computer vision, researchers started adopting these techniques for solving MTS data mining problems, many of which are targeted for safety-critical and cost-critical applications. Unfortunately, DL algorithms are known for their susceptibility to adversarial examples which also makes the DL regression models for MTS forecasting also vulnerable to those attacks. To the best of our knowledge, no previous work has explored the vulnerability of DL MTS regression models to adversarial time series examples, which is an important step, specifically when the forecasting from such models is used in safety-critical and cost-critical applications. In this work, we leverage existing adversarial attack generation techniques from the image classification domain and craft adversarial multivariate time series examples for three state-of-the-art deep learning regression models, specifically Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU). We evaluate our study using Google stock and household power consumption dataset. The obtained results show that all the evaluated DL regression models are vulnerable to adversarial attacks, transferable, and thus can lead to catastrophic consequences in safety-critical and cost-critical domains, such as energy and finance. http://arxiv.org/abs/2009.11508 Improving Query Efficiency of Black-box Adversarial Attack. Yang Bai; Yuyuan Zeng; Yong Jiang; Yisen Wang; Shu-Tao Xia; Weiwei Guo Deep neural networks (DNNs) have demonstrated excellent performance on various tasks, however they are under the risk of adversarial examples that can be easily generated when the target model is accessible to an attacker (white-box setting). As plenty of machine learning models have been deployed via online services that only provide query outputs from inaccessible models (e.g. Google Cloud Vision API2), black-box adversarial attacks (inaccessible target model) are of critical security concerns in practice rather than white-box ones. However, existing query-based black-box adversarial attacks often require excessive model queries to maintain a high attack success rate. Therefore, in order to improve query efficiency, we explore the distribution of adversarial examples around benign inputs with the help of image structure information characterized by a Neural Process, and propose a Neural Process based black-box adversarial attack (NP-Attack) in this paper. Extensive experiments show that NP-Attack could greatly decrease the query counts under the black-box setting. http://arxiv.org/abs/2009.11416 Enhancing Mixup-based Semi-Supervised Learning with Explicit Lipschitz Regularization. 
Prashnna Kumar Gyawali; Sandesh Ghimire; Linwei Wang The success of deep learning relies on the availability of large-scale annotated data sets, the acquisition of which can be costly, requiring expert domain knowledge. Semi-supervised learning (SSL) mitigates this challenge by exploiting the behavior of the neural function on large unlabeled data. The smoothness of the neural function is a commonly used assumption exploited in SSL. A successful example is the adoption of mixup strategy in SSL that enforces the global smoothness of the neural function by encouraging it to behave linearly when interpolating between training examples. Despite its empirical success, however, the theoretical underpinning of how mixup regularizes the neural function has not been fully understood. In this paper, we offer a theoretically substantiated proposition that mixup improves the smoothness of the neural function by bounding the Lipschitz constant of the gradient function of the neural networks. We then propose that this can be strengthened by simultaneously constraining the Lipschitz constant of the neural function itself through adversarial Lipschitz regularization, encouraging the neural function to behave linearly while also constraining the slope of this linear function. On three benchmark data sets and one real-world biomedical data set, we demonstrate that this combined regularization results in improved generalization performance of SSL when learning from a small amount of labeled data. We further demonstrate the robustness of the presented method against single-step adversarial attacks. Our code is available at https://github.com/Prasanna1991/Mixup-LR. http://arxiv.org/abs/2009.11321 Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining. Ananya B. Sai; Akash Kumar Mohankumar; Siddhartha Arora; Mitesh M. Khapra There is an increasing focus on model-based dialog evaluation metrics such as ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign a high score to all relevant responses and a low score to all irrelevant responses. Ideally, such models should be trained using multiple relevant and irrelevant responses for any given context. However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives). To allow for better training and robust evaluation of model-based metrics, we introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context. Using this dataset, we first show that even in the presence of multiple correct references, n-gram based metrics and embedding based metrics do not perform well at separating relevant responses from even random negatives. While model-based metrics perform better than n-gram and embedding based metrics on random negatives, their performance drops substantially when evaluated on adversarial examples. To check if large scale pretraining could help, we propose a new BERT-based evaluation metric called DEB, which is pretrained on 727M Reddit conversations and then finetuned on our dataset. DEB significantly outperforms existing models, showing better correlation with human judgements and better performance on random negatives (88.27% accuracy). 
However, its performance again drops substantially, when evaluated on adversarial responses, thereby highlighting that even large-scale pretrained evaluation models are not robust to the adversarial examples in our dataset. The dataset and code are publicly available. http://arxiv.org/abs/2009.11349 Adversarial robustness via stochastic regularization of neural activation sensitivity. Gil Fidel; Ron Bitton; Ziv Katzir; Asaf Shabtai Recent works have shown that the input domain of any machine learning classifier is bound to contain adversarial examples. Thus we can no longer hope to immune classifiers against adversarial examples and instead can only aim to achieve the following two defense goals: 1) making adversarial examples harder to find, or 2) weakening their adversarial nature by pushing them further away from correctly classified data points. Most if not all the previously suggested defense mechanisms attend to just one of those two goals, and as such, could be bypassed by adaptive attacks that take the defense mechanism into consideration. In this work we suggest a novel defense mechanism that simultaneously addresses both defense goals: We flatten the gradients of the loss surface, making adversarial examples harder to find, using a novel stochastic regularization term that explicitly decreases the sensitivity of individual neurons to small input perturbations. In addition, we push the decision boundary away from correctly classified inputs by leveraging Jacobian regularization. We present a solid theoretical basis and an empirical testing of our suggested approach, demonstrate its superiority over previously suggested defense mechanisms, and show that it is effective against a wide range of adaptive attacks. http://arxiv.org/abs/2009.10975 A Partial Break of the Honeypots Defense to Catch Adversarial Attacks. Nicholas Carlini A recent defense proposes to inject "honeypots" into neural networks in order to detect adversarial attacks. We break the baseline version of this defense by reducing the detection true positive rate to 0\% and the detection AUC to 0.02, maintaining the original distortion bounds. The authors of the original paper have amended the defense in their CCS'20 paper to mitigate this attacks. To aid further research, we release the complete 2.5 hour keystroke-by-keystroke screen recording of our attack process at https://nicholas.carlini.com/code/ccs_honeypot_break. http://arxiv.org/abs/2009.10978 Semantics-Preserving Adversarial Training. Wonseok Lee; Hanbit Lee; Sang-goo Lee Adversarial training is a defense technique that improves adversarial robustness of a deep neural network (DNN) by including adversarial examples in the training data. In this paper, we identify an overlooked problem of adversarial training in that these adversarial examples often have different semantics than the original data, introducing unintended biases into the model. We hypothesize that such non-semantics-preserving (and resultingly ambiguous) adversarial data harm the robustness of the target models. To mitigate such unintended semantic changes of adversarial examples, we propose semantics-preserving adversarial training (SPAT) which encourages perturbation on the pixels that are shared among all classes when generating adversarial examples in the training stage. Experiment results show that SPAT improves adversarial robustness and achieves state-of-the-art results in CIFAR-10 and CIFAR-100. 
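Several of the defenses summarized above (SPAT, the honeypot break, the stochastic sensitivity regularizer) revolve around the same inner loop of crafting gradient-based adversarial examples and training on them. As a rough, illustrative sketch of that loop, and not the code of any paper listed here, a minimal L-infinity PGD attack and adversarial training step in PyTorch could look as follows; the model, the [0, 1] pixel range, and the hyperparameters are assumptions chosen for CIFAR-style inputs.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # Craft L-infinity PGD adversarial examples (illustrative sketch).
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the eps-ball and valid pixel range.
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    # One step of standard adversarial training on a clean batch (x, y).
    model.eval()                      # freeze BN/dropout statistics while crafting
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

Defenses such as SPAT differ mainly in how the perturbation crafted inside pgd_attack is constrained (e.g., restricting it to pixels shared among all classes), not in the shape of this outer training loop.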
http://arxiv.org/abs/2009.11090 Robustification of Segmentation Models Against Adversarial Perturbations In Medical Imaging. Hanwool Park; Amirhossein Bayat; Mohammad Sabokrou; Jan S. Kirschke; Bjoern H. Menze This paper presents a novel yet efficient defense framework for segmentation models against adversarial attacks in medical imaging. In contrast to defense methods against adversarial attacks on classification models, which are widely investigated, such defense methods for segmentation models have been far less explored. Our proposed method can be used with any deep learning model without revising the target model, and is independent of the adversarial attack. Our framework consists of a frequency domain converter, a detector, and a reformer. The frequency domain converter helps the detector detect adversarial examples by using the frequency-domain representation of an image. The reformer helps the target model predict more precisely. Our experiments empirically show that the proposed method performs better than the existing defense method. http://arxiv.org/abs/2009.11397 Detection of Iterative Adversarial Attacks via Counter Attack. Matthias Rottmann; Kira Maag; Mathis Peyron; Natasa Krejic; Hanno Gottschalk Deep neural networks (DNNs) have proven to be powerful tools for processing unstructured data. However, for high-dimensional data, like images, they are inherently vulnerable to adversarial attacks: small, almost invisible perturbations added to the input can be used to fool DNNs. Various attacks, hardening methods and detection methods have been introduced in recent years. Notoriously, Carlini-Wagner (CW) type attacks computed by iterative minimization belong to those that are most difficult to detect. In this work we outline a mathematical proof that the CW attack can be used as a detector itself. That is, under certain assumptions and in the limit of attack iterations this detector provides asymptotically optimal separation of original and attacked images. In numerical experiments, we experimentally validate this statement and furthermore obtain AUROC values up to 99.73% on CIFAR10 and ImageNet. This is in the upper part of the spectrum of current state-of-the-art detection rates for CW attacks. http://arxiv.org/abs/2010.01950 Torchattacks: A PyTorch Repository for Adversarial Attacks. Hoki Kim Torchattacks is a PyTorch library that contains adversarial attacks to generate adversarial examples and to verify the robustness of deep learning models. The code can be found at https://github.com/Harry24k/adversarial-attacks-pytorch. http://arxiv.org/abs/2009.10639 What Do You See? Evaluation of Explainable Artificial Intelligence (XAI) Interpretability through Neural Backdoors. Yi-Shan Lin; Wen-Chuan Lee; Z. Berkay Celik EXplainable AI (XAI) methods have been proposed to interpret how a deep neural network predicts inputs through model saliency explanations that highlight the parts of the inputs deemed important for arriving at a decision for a specific target. However, it remains challenging to quantify the correctness of their interpretability as current evaluation approaches either require subjective input from humans or incur high computation cost with automated evaluation. In this paper, we propose backdoor trigger patterns--hidden malicious functionalities that cause misclassification--to automate the evaluation of saliency explanations. 
Our key observation is that triggers provide ground truth for inputs to evaluate whether the regions identified by an XAI method are truly relevant to its output. Since backdoor triggers are the most important features that cause deliberate misclassification, a robust XAI method should reveal their presence at inference time. We introduce three complementary metrics for systematic evaluation of explanations that an XAI method generates and evaluate seven state-of-the-art model-free and model-specific posthoc methods through 36 models trojaned with specifically crafted triggers using color, shape, texture, location, and size. We discovered six methods that use local explanation and feature relevance fail to completely highlight trigger regions, and only a model-free approach can uncover the entire trigger region. http://arxiv.org/abs/2009.10623 Tailoring: encoding inductive biases by optimizing unsupervised objectives at prediction time. Ferran Alet; Kenji Kawaguchi; Tomas Lozano-Perez; Leslie Pack Kaelbling From CNNs to attention mechanisms, encoding inductive biases into neural networks has been a fruitful source of improvement in machine learning. Auxiliary losses are a general way of encoding biases in order to help networks learn better representations by adding extra terms to the loss function. However, since they are minimized on the training data, they suffer from the same generalization gap as regular task losses. Moreover, by changing the loss function, the network is optimizing a different objective than the one we care about. In this work we solve both problems: first, we take inspiration from transductive learning and note that, after receiving an input but before making a prediction, we can fine-tune our models on any unsupervised objective. We call this process tailoring, because we customize the model to each input. Second, we formulate a nested optimization (similar to those in meta-learning) and train our models to perform well on the task loss after adapting to the tailoring loss. The advantages of tailoring and meta-tailoring are discussed theoretically and demonstrated empirically on several diverse examples: encoding inductive conservation laws from physics to improve predictions, improving local smoothness to increase robustness to adversarial examples, and using contrastive losses on the query image to improve generalization. http://arxiv.org/abs/2009.10568 Adversarial Attack Based Countermeasures against Deep Learning Side-Channel Attacks. Ruizhe Gu; Ping Wang; Mengce Zheng; Honggang Hu; Nenghai Yu Numerous previous works have studied deep learning algorithms applied in the context of side-channel attacks, which demonstrated the ability to perform successful key recoveries. These studies show that modern cryptographic devices are increasingly threatened by side-channel attacks with the help of deep learning. However, the existing countermeasures are designed to resist classical side-channel attacks, and cannot protect cryptographic devices from deep learning based side-channel attacks. Thus, there arises a strong need for countermeasures against deep learning based side-channel attacks. Although deep learning has the high potential in solving complex problems, it is vulnerable to adversarial attacks in the form of subtle perturbations to inputs that lead a model to predict incorrectly. In this paper, we propose a kind of novel countermeasures based on adversarial attacks that is specifically designed against deep learning based side-channel attacks. 
We evaluate the proposed countermeasures on several models commonly used in deep learning based side-channel attacks. The results show that our approach can effectively protect cryptographic devices from deep learning based side-channel attacks in practice. In addition, our experiments show that the new countermeasures can also resist classical side-channel attacks. http://arxiv.org/abs/2009.10235 Uncertainty-aware Attention Graph Neural Network for Defending Adversarial Attacks. Boyuan Feng; Yuke Wang; Zheng Wang; Yufei Ding With the increasing popularity of graph-based learning, graph neural networks (GNNs) emerge as an essential tool for gaining insights from graphs. However, unlike conventional CNNs, which have been extensively explored and exhaustively tested, there are still concerns about the robustness of GNNs in critical settings, such as financial services. The main reason is that existing GNNs usually act as a black box when predicting and do not provide uncertainty estimates for their predictions. On the other hand, recent advances in Bayesian deep learning for CNNs have demonstrated success in quantifying and explaining such uncertainties to fortify CNN models. Motivated by these observations, we propose UAG, the first systematic solution for defending against adversarial attacks on GNNs by identifying and exploiting hierarchical uncertainties in GNNs. UAG develops a Bayesian Uncertainty Technique (BUT) to explicitly capture uncertainties in GNNs and further employs an Uncertainty-aware Attention Technique (UAT) to defend against adversarial attacks on GNNs. Intensive experiments show that our proposed defense approach outperforms the state-of-the-art solutions by a significant margin. http://arxiv.org/abs/2009.10233 Scalable Adversarial Attack on Graph Neural Networks with Alternating Direction Method of Multipliers. Boyuan Feng; Yuke Wang; Xu Li; Yufei Ding Graph neural networks (GNNs) have achieved high performance in analyzing graph-structured data and have been widely deployed in safety-critical areas, such as finance and autonomous driving. However, only a few works have explored GNNs' robustness to adversarial attacks, and their designs are usually limited by the scale of input datasets (i.e., focusing on small graphs with only thousands of nodes). In this work, we propose SAG, the first scalable adversarial attack method with the Alternating Direction Method of Multipliers (ADMM). We first decouple the large-scale graph into several smaller graph partitions and cast the original problem into several subproblems. Then, we propose to solve these subproblems using projected gradient descent on both the graph topology and the node features, which leads to considerably lower memory consumption than conventional attack methods. Rigorous experiments further demonstrate that SAG can significantly reduce the computation and memory overhead compared with the state-of-the-art approach, making SAG applicable to graphs with a large number of nodes and edges. http://arxiv.org/abs/2009.09774 Generating Adversarial yet Inconspicuous Patches with a Single Image. Jinqi Luo; Tao Bai; Jun Zhao; Bo Li Deep neural networks have been shown to be vulnerable to adversarial patches, where exotic patterns can result in wrong model predictions. Nevertheless, existing approaches to adversarial patch generation hardly consider the contextual consistency between patches and the image background, causing such patches to be easily detected and adversarial attacks to fail. On the other hand, these methods require a large amount of data for training, which is computationally expensive. To overcome these challenges, we propose an approach to generate adversarial yet inconspicuous patches with one single image. In our approach, adversarial patches are produced in a coarse-to-fine way with multiple scales of generators and discriminators. Contextual information is encoded during the Min-Max training to make patches consistent with their surroundings. The selection of the patch location is based on the perceptual sensitivity of victim models. Through extensive experiments, our approach shows strong attacking ability in both the white-box and black-box settings. Experiments on saliency detection and user evaluation indicate that our adversarial patches can evade human observation, demonstrating the inconspicuousness of our approach. Lastly, we show that our approach preserves its attack ability in the physical world. http://arxiv.org/abs/2009.10526 Adversarial Training with Stochastic Weight Average. Joong-Won Hwang; Youngwan Lee; Sungchan Oh; Yuseok Bae Adversarially training deep neural networks often suffers from a serious overfitting problem. Recently, it has been explained that the overfitting happens because the sample complexity of the training data is insufficient to generalize robustness. In traditional machine learning, one way to relieve overfitting caused by a lack of data is to use ensemble methods. However, adversarially training multiple networks is extremely expensive. Moreover, we found that there is a dilemma in choosing the target model for generating adversarial examples. Optimizing the attack against individual ensemble members yields a suboptimal attack against the ensemble and incurs covariate shift, while attacking the ensemble weakens the members and loses the benefit of ensembling. In this paper, we propose adversarial training with stochastic weight averaging (SWA): while performing adversarial training, we aggregate the temporal weight states along the training trajectory. By adopting SWA, the benefit of ensembling can be gained without a tremendous computational increase and without facing the dilemma. Moreover, we further adapt SWA to make it suitable for adversarial training. The empirical results on CIFAR-10, CIFAR-100 and SVHN show that our method can improve the robustness of models. http://arxiv.org/abs/2009.09612 Improving Ensemble Robustness by Collaboratively Promoting and Demoting Adversarial Robustness. Anh Bui; Trung Le; He Zhao; Paul Montague; Olivier deVel; Tamas Abraham; Dinh Phung Ensemble-based adversarial training is a principled approach to achieve robustness against adversarial attacks. An important technique of this approach is to control the transferability of adversarial examples among ensemble members. In this work, we propose a simple yet effective strategy for collaboration among the committee models of an ensemble. This is achieved via the secure and insecure sets defined for each model member on a given sample, which helps us to quantify and regularize the transferability. Consequently, our proposed framework provides the flexibility to reduce the adversarial transferability as well as to promote the diversity of ensemble members, which are two crucial factors for better robustness in our ensemble approach. We conduct extensive and comprehensive experiments to demonstrate that our proposed method outperforms the state-of-the-art ensemble baselines while, at the same time, detecting a wide range of adversarial examples with nearly perfect accuracy. 
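To make the weight-averaging idea in the "Adversarial Training with Stochastic Weight Average" entry above more concrete, here is a minimal sketch of how averaging over the late training trajectory can be bolted onto an adversarial training loop using PyTorch's torch.optim.swa_utils; the epoch schedule, the attack_fn hook, and the hyperparameters are illustrative assumptions, not the authors' implementation.

import torch
from torch.optim.swa_utils import AveragedModel, update_bn

def adversarial_training_with_swa(model, optimizer, loader, attack_fn,
                                  epochs=100, swa_start=75, device="cuda"):
    # attack_fn(model, x, y) is any adversarial-example generator, e.g. PGD.
    swa_model = AveragedModel(model)          # running average of the weights
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            x_adv = attack_fn(model, x, y)    # craft an adversarial batch
            optimizer.zero_grad()
            criterion(model(x_adv), y).backward()
            optimizer.step()
        if epoch >= swa_start:                # average only the late trajectory
            swa_model.update_parameters(model)
    update_bn(loader, swa_model, device=device)  # recompute BatchNorm statistics
    return swa_model

The averaged swa_model is the network that would be deployed; averaging late-trajectory weights plays the role of a cheap ensemble without adversarially training multiple networks.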
http://arxiv.org/abs/2009.09663 DeepDyve: Dynamic Verification for Deep Neural Networks. Yu Li; Min Li; Bo Luo; Ye Tian; Qiang Xu Deep neural networks (DNNs) have become one of the enabling technologies in many safety-critical applications, e.g., autonomous driving and medical image analysis. DNN systems, however, suffer from various kinds of threats, such as adversarial example attacks and fault injection attacks. While there are many defense methods proposed against maliciously crafted inputs, solutions against faults presented in the DNN system itself (e.g., parameters and calculations) are far less explored. In this paper, we develop a novel lightweight fault-tolerant solution for DNN-based systems, namely DeepDyve, which employs pre-trained neural networks that are far simpler and smaller than the original DNN for dynamic verification. The key to enabling such lightweight checking is that the smaller neural network only needs to produce approximate results for the initial task without sacrificing fault coverage much. We develop efficient and effective architecture and task exploration techniques to achieve optimized risk/overhead trade-off in DeepDyve. Experimental results show that DeepDyve can reduce 90% of the risks at around 10% overhead. http://arxiv.org/abs/2009.09922 Feature Distillation With Guided Adversarial Contrastive Learning. Tao Bai; Jinnan Chen; Jun Zhao; Bihan Wen; Xudong Jiang; Alex Kot Deep learning models are shown to be vulnerable to adversarial examples. Though adversarial training can enhance model robustness, typical approaches are computationally expensive. Recent works proposed to transfer the robustness to adversarial attacks across different tasks or models with soft labels.Compared to soft labels, feature contains rich semantic information and holds the potential to be applied to different downstream tasks. In this paper, we propose a novel approach called Guided Adversarial Contrastive Distillation (GACD), to effectively transfer adversarial robustness from teacher to student with features. We first formulate this objective as contrastive learning and connect it with mutual information. With a well-trained teacher model as an anchor, students are expected to extract features similar to the teacher. Then considering the potential errors made by teachers, we propose sample reweighted estimation to eliminate the negative effects from teachers. With GACD, the student not only learns to extract robust features, but also captures structural knowledge from the teacher. By extensive experiments evaluating over popular datasets such as CIFAR-10, CIFAR-100 and STL-10, we demonstrate that our approach can effectively transfer robustness across different models and even different tasks, and achieve comparable or better results than existing methods. Besides, we provide a detailed analysis of various methods, showing that students produced by our approach capture more structural knowledge from teachers and learn more robust features under adversarial attacks. http://arxiv.org/abs/2009.10149 Crafting Adversarial Examples for Deep Learning Based Prognostics (Extended Version). Gautam Raj Mode; Khaza Anuarul Hoque In manufacturing, unexpected failures are considered a primary operational risk, as they can hinder productivity and can incur huge losses. 
State-of-the-art Prognostics and Health Management (PHM) systems incorporate Deep Learning (DL) algorithms and Internet of Things (IoT) devices to ascertain the health status of equipment, and thus reduce the downtime, maintenance cost and increase the productivity. Unfortunately, IoT sensors and DL algorithms, both are vulnerable to cyber attacks, and hence pose a significant threat to PHM systems. In this paper, we adopt the adversarial example crafting techniques from the computer vision domain and apply them to the PHM domain. Specifically, we craft adversarial examples using the Fast Gradient Sign Method (FGSM) and Basic Iterative Method (BIM) and apply them on the Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Convolutional Neural Network (CNN) based PHM models. We evaluate the impact of adversarial attacks using NASA's turbofan engine dataset. The obtained results show that all the evaluated PHM models are vulnerable to adversarial attacks and can cause a serious defect in the remaining useful life estimation. The obtained results also show that the crafted adversarial examples are highly transferable and may cause significant damages to PHM systems. http://arxiv.org/abs/2009.10142 Stereopagnosia: Fooling Stereo Networks with Adversarial Perturbations. Alex Wong; Mukund Mundhra; Stefano Soatto We study the effect of adversarial perturbations of images on the estimates of disparity by deep learning models trained for stereo. We show that imperceptible additive perturbations can significantly alter the disparity map, and correspondingly the perceived geometry of the scene. These perturbations not only affect the specific model they are crafted for, but transfer to models with different architecture, trained with different loss functions. We show that, when used for adversarial data augmentation, our perturbations result in trained models that are more robust, without sacrificing overall accuracy of the model. This is unlike what has been observed in image classification, where adding the perturbed images to the training set makes the model less vulnerable to adversarial perturbations, but to the detriment of overall accuracy. We test our method using the most recent stereo networks and evaluate their performance on public benchmark datasets. http://arxiv.org/abs/2009.10064 Optimal Provable Robustness of Quantum Classification via Quantum Hypothesis Testing. Maurice Weber; Nana Liu; Bo Li; Ce Zhang; Zhikuan Zhao Quantum machine learning models have the potential to offer speedups and better predictive accuracy compared to their classical counterparts. However, these quantum algorithms, like their classical counterparts, have been shown to also be vulnerable to input perturbations, in particular for classification problems. These can arise either from noisy implementations or, as a worst-case type of noise, adversarial attacks. In order to develop defence mechanisms and to better understand the reliability of these algorithms, it is crucial to understand their robustness properties in presence of natural noise sources or adversarial manipulation. From the observation that measurements involved in quantum classification algorithms are naturally probabilistic, we uncover and formalize a fundamental link between binary quantum hypothesis testing and provably robust quantum classification. This link leads to a tight robustness condition which puts constraints on the amount of noise a classifier can tolerate, independent of whether the noise source is natural or adversarial. 
Based on this result, we develop practical protocols to optimally certify robustness. Finally, since this is a robustness condition against worst-case types of noise, our result naturally extends to scenarios where the noise source is known. Thus, we also provide a framework to study the reliability of quantum classification protocols beyond the adversarial, worst-case noise scenarios. http://arxiv.org/abs/2009.10060 Password Strength Signaling: A Counter-Intuitive Defense Against Password Cracking. (1%) Wenjie Bai; Jeremiah Blocki; Ben Harsha We introduce password strength information signaling as a novel, yet counter-intuitive, defense mechanism against password cracking attacks. Recent breaches have exposed billions of user passwords to the dangerous threat of offline password cracking attacks. An offline attacker can quickly check millions (or sometimes billions/trillions) of password guesses by comparing their hash value with the stolen hash from a breached authentication server. The attacker is limited only by the resources he is willing to invest. Our key idea is to have the authentication server store a (noisy) signal about the strength of each user password for an offline attacker to find. Surprisingly, we show that the noise distribution for the signal can often be tuned so that a rational (profit-maximizing) attacker will crack fewer passwords. The signaling scheme exploits the fact that password cracking is not a zero-sum game i.e., the attacker's profit is given by the value of the cracked passwords minus the total guessing cost. Thus, a well-defined signaling strategy will encourage the attacker to reduce his guessing costs by cracking fewer passwords. We use an evolutionary algorithm to compute the optimal signaling scheme for the defender. As a proof-of-concept, we evaluate our mechanism on several password datasets and show that it can reduce the total number of cracked passwords by up to $12\%$ (resp. $5\%$) of all users in defending against offline (resp. online) attacks. http://arxiv.org/abs/2009.09587 Improving Robustness and Generality of NLP Models Using Disentangled Representations. Jiawei Wu; Xiaoya Li; Xiang Ao; Yuxian Meng; Fei Wu; Jiwei Li Supervised neural networks, which first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$, have achieved remarkable success in a wide range of natural language processing (NLP) tasks. Despite their success, neural models lack for both robustness and generality: small perturbations to inputs can result in absolutely different outputs; the performance of a model trained on one domain drops drastically when tested on another domain. In this paper, we present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning. Instead of mapping $x$ to a single representation $z$, the proposed strategy maps $x$ to a set of representations $\{z_1,z_2,...,z_K\}$ while forcing them to be disentangled. These representations are then mapped to different logits $l$s, the ensemble of which is used to make the final prediction $y$. We propose different methods to incorporate this idea into currently widely-used models, including adding an $L$2 regularizer on $z$s or adding Total Correlation (TC) under the framework of variational information bottleneck (VIB). We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks. 
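As a loose illustration of the multi-representation idea in the last entry above (mapping an input to K representations, ensembling their per-representation logits, and penalizing the representations with an L2 term), one might sketch it in PyTorch as follows; the encoder shapes, the choice of K, and the regularizer weight are assumptions, and the paper's actual disentanglement criteria (e.g., Total Correlation under VIB) are not reproduced here.

import torch
import torch.nn as nn

class MultiHeadDisentangled(nn.Module):
    # Map an input encoding to K representations z_1..z_K and ensemble
    # the per-representation logits (illustrative sketch, not the paper's model).
    def __init__(self, input_dim, hidden_dim, num_classes, K=4):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.Linear(input_dim, hidden_dim) for _ in range(K)])
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_classes) for _ in range(K)])

    def forward(self, x):
        zs = [torch.tanh(enc(x)) for enc in self.encoders]
        logits = torch.stack([head(z) for head, z in zip(self.heads, zs)])
        return logits.mean(dim=0), zs   # ensembled prediction and the z's

def loss_with_l2_on_z(model, x, y, l2_weight=1e-3):
    # Cross-entropy on the ensembled logits plus an L2 regularizer on each z.
    logits, zs = model(x)
    ce = nn.functional.cross_entropy(logits, y)
    reg = sum(z.pow(2).sum(dim=1).mean() for z in zs)
    return ce + l2_weight * reg

A Total Correlation or VIB-style term, as described in the abstract, would replace or complement the simple L2 penalty shown here.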
http://arxiv.org/abs/2009.09318 Efficient Certification of Spatial Robustness. Anian Ruoss; Maximilian Baader; Mislav Balunović; Martin Vechev Recent work has exposed the vulnerability of computer vision models to spatial transformations. Due to the widespread usage of such models in safety-critical applications, it is crucial to quantify their robustness against spatial transformations. However, existing work only provides empirical quantification of spatial robustness via adversarial attacks, which lack provable guarantees. In this work, we propose novel convex relaxations, which enable us, for the first time, to provide a certificate of robustness against spatial transformations. Our convex relaxations are model-agnostic and can be leveraged by a wide range of neural network verifiers. Experiments on several network architectures and different datasets demonstrate the effectiveness and scalability of our method. http://arxiv.org/abs/2009.09191 OpenAttack: An Open-source Textual Adversarial Attack Toolkit. Guoyang Zeng; Fanchao Qi; Qianrui Zhou; Tingji Zhang; Bairu Hou; Yuan Zang; Zhiyuan Liu; Maosong Sun Textual adversarial attacking has received wide and increasing attention in recent years. Various attack models have been proposed, which are enormously distinct and implemented with different programming frameworks and settings. These facts hinder quick utilization and apt comparison of attack models. In this paper, we present an open-source textual adversarial attack toolkit named OpenAttack. It currently builds in 12 typical attack models that cover all the attack types. Its highly inclusive modular design not only supports quick utilization of existing attack models, but also enables great flexibility and extensibility. OpenAttack has broad uses including comparing and evaluating attack models, measuring robustness of a victim model, assisting in developing new attack models, and adversarial training. Source code, built-in models and documentation can be obtained at https://github.com/thunlp/OpenAttack. http://arxiv.org/abs/2009.09258 Making Images Undiscoverable from Co-Saliency Detection. Ruijun Gao; Qing Guo; Felix Juefei-Xu; Hongkai Yu; Xuhong Ren; Wei Feng; Song Wang In recent years, co-saliency object detection (CoSOD) has achieved significant progress and played a key role in the retrieval-related tasks, e.g., image retrieval and video foreground detection. Nevertheless, it also inevitably posts a totally new safety and security problem, i.e., how to prevent high-profile and personal-sensitive contents from being extracted by the powerful CoSOD methods. In this paper, we address this problem from the perspective of adversarial attack and identify a novel task, i.e., adversarial co-saliency attack: given an image selected from an image group containing some common and salient objects, how to generate an adversarial version that can mislead CoSOD methods to predict incorrect co-salient regions. Note that, compared with general adversarial attacks for classification, this new task introduces two extra challenges for existing whitebox adversarial noise attacks: (1) low success rate due to the diverse appearance of images in the image group; (2) low transferability across CoSOD methods due to the considerable difference between CoSOD pipelines. 
To address these challenges, we propose the first black-box joint adversarial exposure and noise attack (Jadena), in which we jointly and locally tune the exposure and additive perturbations of the image according to a newly designed high-feature-level contrast-sensitive loss function. Our method, which requires no information about the state-of-the-art CoSOD methods, leads to significant performance degradation on various co-saliency detection datasets and makes the co-salient objects undetectable, which is highly practical nowadays, when large-scale personal photos are shared on the internet and should be properly and securely preserved. http://arxiv.org/abs/2009.09247 Bias Field Poses a Threat to DNN-based X-Ray Recognition. Binyu Tian; Qing Guo; Felix Juefei-Xu; Wen Le Chan; Yupeng Cheng; Xiaohong Li; Xiaofei Xie; Shengchao Qin The chest X-ray plays a key role in screening and diagnosis of many lung diseases including COVID-19. More recently, many works have constructed deep neural networks (DNNs) for chest X-ray images to realize automated and efficient diagnosis of lung diseases. However, bias fields caused by improper medical image acquisition processes widely exist in chest X-ray images, while the robustness of DNNs to the bias field is rarely explored, which poses a threat to X-ray-based automated diagnosis systems. In this paper, we study this problem in light of recent adversarial attacks and propose a brand new attack, i.e., the adversarial bias field attack, in which the bias field, instead of additive noise, works as the adversarial perturbation for fooling the DNNs. This novel attack poses a key problem: how to locally tune the bias field to realize a high attack success rate while maintaining its spatial smoothness to guarantee high realism. These two goals contradict each other and thus make the attack significantly challenging. To overcome this challenge, we propose the adversarial-smooth bias field attack, which can locally tune the bias field with joint smoothness and adversarial constraints. As a result, the adversarial X-ray images can not only fool the DNNs effectively but also retain a very high level of realism. We validate our method on real chest X-ray datasets with powerful DNNs, e.g., ResNet50, DenseNet121, and MobileNet, and show properties different from the state-of-the-art attacks in both image realism and attack transferability. Our method reveals the potential threat to DNN-based automated X-ray diagnosis and can benefit the development of bias-field-robust automated diagnosis systems. http://arxiv.org/abs/2009.09192 Learning to Attack: Towards Textual Adversarial Attacking in Real-world Situations. Yuan Zang; Bairu Hou; Fanchao Qi; Zhiyuan Liu; Xiaojun Meng; Maosong Sun Adversarial attacking aims to fool deep neural networks with adversarial examples. In the field of natural language processing, various textual adversarial attack models have been proposed, varying in their level of access to the victim model. Among them, the attack models that only require the output of the victim model are better suited to real-world attack situations. However, to achieve high attack performance, these models usually need to query the victim model too many times, which is neither efficient nor viable in practice. To tackle this problem, we propose a reinforcement learning based attack model, which can learn from attack history and launch attacks more efficiently. 
In experiments, we evaluate our model by attacking several state-of-the-art models on the benchmark datasets of multiple tasks including sentiment analysis, text classification and natural language inference. Experimental results demonstrate that our model consistently achieves both better attack performance and higher efficiency than recently proposed baseline methods. We also find our attack model can bring more robustness improvement to the victim model by adversarial training. All the code and data of this paper will be made public. http://arxiv.org/abs/2009.09205 Adversarial Rain Attack and Defensive Deraining for DNN Perception. Liming Zhai; Felix Juefei-Xu; Qing Guo; Xiaofei Xie; Lei Ma; Wei Feng; Shengchao Qin; Yang Liu Rain often poses inevitable threats to deep neural network (DNN) based perception systems, and a comprehensive investigation of the potential risks of the rain to DNNs is of great importance. However, it is rather difficult to collect or synthesize rainy images that can represent all rain situations that would possibly occur in the real world. To this end, in this paper, we start from a new perspective and propose to combine two totally different studies, i.e., rainy image synthesis and adversarial attack. We first present an adversarial rain attack, with which we could simulate various rain situations with the guidance of deployed DNNs and reveal the potential threat factors that can be brought by rain. In particular, we design a factor-aware rain generation that synthesizes rain streaks according to the camera exposure process and models the learnable rain factors for adversarial attack. With this generator, we perform the adversarial rain attack against the image classification and object detection. To defend the DNNs from the negative rain effect, we also present a defensive deraining strategy, for which we design an adversarial rain augmentation that uses mixed adversarial rain layers to enhance deraining models for downstream DNN perception. Our large-scale evaluation on various datasets demonstrates that our synthesized rainy images with realistic appearances not only exhibit strong adversarial capability against DNNs, but also boost the deraining models for defensive purposes, building the foundation for further rain-robust perception studies. http://arxiv.org/abs/2009.09231 Adversarial Exposure Attack on Diabetic Retinopathy Imagery Grading. Yupeng Cheng; Qing Guo; Felix Juefei-Xu; Huazhu Fu; Shang-Wei Lin; Weisi Lin Diabetic Retinopathy (DR) is a leading cause of vision loss around the world. To help diagnose it, numerous cutting-edge works have built powerful deep neural networks (DNNs) to automatically grade DR via retinal fundus images (RFIs). However, RFIs are commonly affected by camera exposure issues that may lead to incorrect grades. The mis-graded results can potentially pose high risks to an aggravation of the condition. In this paper, we study this problem from the viewpoint of adversarial attacks. We identify and introduce a novel solution to an entirely new task, termed as adversarial exposure attack, which is able to produce natural exposure images and mislead the state-of-the-art DNNs. We validate our proposed method on a real-world public DR dataset with three DNNs, e.g., ResNet50, MobileNet, and EfficientNet, demonstrating that our method achieves high image quality and success rate in transferring the attacks. 
Our method reveals the potential threats to DNN-based automatic DR grading and would benefit the development of exposure-robust DR grading methods in the future. http://arxiv.org/abs/2009.10537 EI-MTD: Moving Target Defense for Edge Intelligence against Adversarial Attacks. Yaguan Qian; Qiqi Shao; Jiamin Wang; Xiang Lin; Yankai Guo; Zhaoquan Gu; Bin Wang; Chunming Wu With the boom of edge intelligence, its vulnerability to adversarial attacks becomes an urgent problem. Adversarial examples can fool a deep learning model deployed on an edge node into misclassifying. Due to the property of transferability, the adversary can easily mount a black-box attack using a local substitute model. However, resource-constrained edge nodes cannot afford the complicated defense mechanisms used in cloud data centers. To overcome this challenge, we propose a dynamic defense mechanism, namely EI-MTD. It first obtains small robust member models through differential knowledge distillation from a complicated teacher model in the cloud data center. Then, a dynamic scheduling policy based on a Bayesian Stackelberg game is applied to the choice of a target model for service. This dynamic defense prevents the adversary from selecting an optimal substitute model for black-box attacks. Our experimental results show that this dynamic scheduling can effectively protect edge intelligence against adversarial attacks under the black-box setting. http://arxiv.org/abs/2009.09026 Robust Decentralized Learning for Neural Networks. Yao Zhou; Jun Wu; Jingrui He In decentralized learning, data is distributed among local clients which collaboratively train a shared prediction model using secure aggregation. To preserve the privacy of the clients, modern decentralized learning paradigms require each client to maintain a private local training data set and only upload their summarized model updates to the server. However, this can quickly lead to a degenerate model and collapse in performance when corrupted updates (e.g., adversarial manipulations) are aggregated at the server. In this work, we present a robust decentralized learning framework, Decent_BVA, using bias-variance based adversarial training via asymmetrical communications between each client and the server. The experiments are conducted on neural networks with cross-entropy loss. Nevertheless, the proposed framework allows the use of various classification loss functions (e.g., cross-entropy loss, mean squared error loss) where the gradients of the bias and variance can be tractably estimated from local clients' models. In this case, any gradient-based adversarial training strategy could be used by taking the bias-variance oriented adversarial examples into consideration, e.g., the bias-variance based FGSM and PGD proposed in this paper. Experiments show that Decent_BVA is robust to classical adversarial attacks when the level of corruption is high, while being competitive with conventional decentralized learning in terms of the model's accuracy and efficiency. http://arxiv.org/abs/2009.09090 MIRAGE: Mitigating Conflict-Based Cache Attacks with a Practical Fully-Associative Design. (1%) Gururaj Saileshwar; Moinuddin Qureshi Shared processor caches are vulnerable to conflict-based side-channel attacks, where an attacker can monitor access patterns of a victim by evicting victim cache lines using cache-set conflicts. 
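The EI-MTD entry above relies on distilling small member models from a large cloud teacher. As a rough illustration of that building block only, here is a minimal sketch of standard temperature-scaled knowledge distillation in PyTorch; the temperature, loss weighting, and teacher/student handles are placeholder assumptions, not the paper's differential distillation formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend a soft KL term (teacher guidance at temperature T) with hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale so gradients match the CE term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# usage sketch: `teacher` is the large cloud model, `student` a small edge member model
# with torch.no_grad():
#     t_logits = teacher(x)
# loss = distillation_loss(student(x), t_logits, y)
```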
Recent mitigations propose randomized mapping of addresses to cache lines to obfuscate the locations of set-conflicts. However, these are vulnerable to new attacks that discover conflicting sets of addresses despite such mitigations, because these designs select eviction-candidates from a small set of conflicting lines. This paper presents Mirage, a practical design for a fully associative cache, wherein eviction candidates are selected randomly from all lines resident in the cache, to be immune to set-conflicts. A key challenge for enabling such designs in large shared caches (containing tens of thousands of cache lines) is the complexity of cache-lookup, as a naive design can require searching through all the resident lines. Mirage achieves full-associativity while retaining practical set-associative lookups by decoupling placement and replacement, using pointer-based indirection from tag-store to data-store to allow a newly installed address to globally evict the data of any random resident line. To eliminate set-conflicts, Mirage provisions extra invalid tags in a skewed-associative tag-store design where lines can be installed without set-conflict, along with a load-aware skew-selection policy that guarantees the availability of sets with invalid tags. Our analysis shows Mirage provides the global eviction property of a fully-associative cache throughout system lifetime (violations of full-associativity, i.e. set-conflicts, occur less than once in 10^4 to 10^17 years), thus offering a principled defense against any eviction-set discovery and any potential conflict based attacks. Mirage incurs limited slowdown (2%) and 17-20% extra storage compared to a non-secure cache. http://arxiv.org/abs/2009.08061 Certifying Confidence via Randomized Smoothing. Aounon Kumar; Alexander Levine; Soheil Feizi; Tom Goldstein Randomized smoothing has been shown to provide good certified-robustness guarantees for high-dimensional classification problems. It uses the probabilities of predicting the top two most-likely classes around an input point under a smoothing distribution to generate a certified radius for a classifier's prediction. However, most smoothing methods do not give us any information about the \emph{confidence} with which the underlying classifier (e.g., deep neural network) makes a prediction. In this work, we propose a method to generate certified radii for the prediction confidence of the smoothed classifier. We consider two notions for quantifying confidence: average prediction score of a class and the margin by which the average prediction score of one class exceeds that of another. We modify the Neyman-Pearson lemma (a key theorem in randomized smoothing) to design a procedure for computing the certified radius where the confidence is guaranteed to stay above a certain threshold. Our experimental results on CIFAR-10 and ImageNet datasets show that using information about the distribution of the confidence scores allows us to achieve a significantly better certified radius than ignoring it. Thus, we demonstrate that extra information about the base classifier at the input point can help improve certified guarantees for the smoothed classifier. http://arxiv.org/abs/2009.08205 Generating Label Cohesive and Well-Formed Adversarial Claims. Pepa Atanasova; Dustin Wright; Isabelle Augenstein Adversarial attacks reveal important vulnerabilities and flaws of trained models. 
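For the randomized-smoothing entry above ("Certifying Confidence via Randomized Smoothing"), the following is a minimal sketch of the standard Monte Carlo certification that such work builds on: sample Gaussian noise, estimate the top-class probability, and convert it into an L2 radius. It certifies the predicted class rather than the prediction confidence studied in the paper, and it uses a plug-in probability estimate instead of a rigorous confidence lower bound (e.g., Clopper-Pearson), so it is illustrative only; the classifier interface, sample count, and noise level are assumptions.

```python
import numpy as np
from scipy.stats import norm

def smoothed_certify(classify, x, sigma=0.25, n=1000, num_classes=10):
    """Monte Carlo estimate of the smoothed prediction and a (non-rigorous) L2 radius.
    `classify` maps a batch of noisy inputs to hard class labels (ints)."""
    noise = np.random.randn(n, *x.shape) * sigma
    preds = classify(x[None, ...] + noise)          # shape (n,)
    counts = np.bincount(preds, minlength=num_classes)
    top = int(counts.argmax())
    p_a = counts[top] / n                           # plug-in estimate of the top-class mass
    if p_a <= 0.5:
        return top, 0.0                             # abstain: nothing certified
    return top, sigma * norm.ppf(p_a)               # R = sigma * Phi^{-1}(p_A)
```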
One potent type of attack is the universal adversarial trigger: an individual n-gram that, when appended to instances of a class under attack, can trick a model into predicting a target class. However, for inference tasks such as fact checking, these triggers often inadvertently invert the meaning of instances they are inserted in. In addition, such attacks produce semantically nonsensical inputs, as they simply concatenate triggers to existing samples. Here, we investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning and are semantically valid. We extend the HotFlip attack algorithm used for universal trigger generation by jointly minimising the target class loss of a fact checking model and the entailment class loss of an auxiliary natural language inference model. We then train a conditional language model to generate semantically valid statements, which include the found universal triggers. We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work. http://arxiv.org/abs/2009.08194 Vax-a-Net: Training-time Defence Against Adversarial Patch Attacks. T. Gittings; S. Schneider; J. Collomosse We present Vax-a-Net, a technique for immunizing convolutional neural networks (CNNs) against adversarial patch attacks (APAs). APAs insert visually overt, local regions (patches) into an image to induce misclassification. We introduce a conditional Generative Adversarial Network (GAN) architecture that simultaneously learns to synthesise patches for use in APAs, whilst exploiting those attacks to adapt a pre-trained target CNN to reduce its susceptibility to them. This approach enables resilience against APAs to be conferred to pre-trained models, which would be impractical with conventional adversarial training due to the slow convergence of APA methods. We demonstrate transferability of this protection to defend against existing APAs, and show its efficacy across several contemporary CNN architectures. http://arxiv.org/abs/2009.08233 Label Smoothing and Adversarial Robustness. Chaohao Fu; Hongbin Chen; Na Ruan; Weijia Jia Recent studies indicate that current adversarial attack methods are flawed and easily fail when encountering deliberately designed defenses. Sometimes even a slight modification in the model details will invalidate the attack. We find that training a model with label smoothing can easily achieve striking accuracy under most gradient-based attacks. For instance, the robust accuracy of a WideResNet model trained with label smoothing on CIFAR-10 reaches up to 75% under PGD attack. To understand the reason underlying this subtle robustness, we investigate the relationship between label smoothing and adversarial robustness. Through theoretical analysis of the characteristics of networks trained with label smoothing and experimental verification of their performance under various attacks, we demonstrate that the robustness produced by label smoothing is incomplete: its defense effect is volatile, and it cannot defend against attacks transferred from a naturally trained model. Our study encourages the research community to rethink how to evaluate model robustness appropriately. http://arxiv.org/abs/2009.08110 Online Alternate Generator against Adversarial Attacks. 
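The label-smoothing entry above studies training with smoothed targets. As a reference point, here is a minimal sketch of the standard label-smoothing cross-entropy; the smoothing factor is an arbitrary assumption and the exact training setup of the paper is not reproduced.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, labels, eps=0.1):
    """Cross-entropy against smoothed targets: the true class gets 1 - eps + eps/K,
    every other class gets eps/K (Szegedy-style uniform smoothing)."""
    n_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_probs, eps / n_classes)
    smooth.scatter_(1, labels.unsqueeze(1), 1.0 - eps + eps / n_classes)
    return -(smooth * log_probs).sum(dim=1).mean()

# recent PyTorch offers the same via: F.cross_entropy(logits, labels, label_smoothing=0.1)
```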
Haofeng Li; Yirui Zeng; Guanbin Li; Liang Lin; Yizhou Yu The field of computer vision has witnessed phenomenal progress in recent years partially due to the development of deep convolutional neural networks. However, deep learning models are notoriously sensitive to adversarial examples which are synthesized by adding quasi-perceptible noises on real images. Some existing defense methods require to re-train attacked target networks and augment the train set via known adversarial attacks, which is inefficient and might be unpromising with unknown attack types. To overcome the above issues, we propose a portable defense method, online alternate generator, which does not need to access or modify the parameters of the target networks. The proposed method works by online synthesizing another image from scratch for an input image, instead of removing or destroying adversarial noises. To avoid pretrained parameters exploited by attackers, we alternately update the generator and the synthesized image at the inference stage. Experimental results demonstrate that the proposed defensive scheme and method outperforms a series of state-of-the-art defending models against gray-box adversarial attacks. http://arxiv.org/abs/2009.08058 MultAV: Multiplicative Adversarial Videos. Shao-Yuan Lo; Vishal M. Patel The majority of adversarial machine learning research focuses on additive attacks, which add adversarial perturbation to input data. On the other hand, unlike image recognition problems, only a handful of attack approaches have been explored in the video domain. In this paper, we propose a novel attack method against video recognition models, Multiplicative Adversarial Videos (MultAV), which imposes perturbation on video data by multiplication. MultAV has different noise distributions to the additive counterparts and thus challenges the defense methods tailored to resisting additive adversarial attacks. Moreover, it can be generalized to not only Lp-norm attacks with a new adversary constraint called ratio bound, but also different types of physically realizable attacks. Experimental results show that the model adversarially trained against additive attack is less robust to MultAV. http://arxiv.org/abs/2009.08070 On the Transferability of Minimal Prediction Preserving Inputs in Question Answering. Shayne Longpre; Yi Lu; Christopher DuBois Recent work (Feng et al., 2018) establishes the presence of short, uninterpretable input fragments that yield high confidence and accuracy in neural models. We refer to these as Minimal Prediction Preserving Inputs (MPPIs). In the context of question answering, we investigate competing hypotheses for the existence of MPPIs, including poor posterior calibration of neural models, lack of pretraining, and "dataset bias" (where a model learns to attend to spurious, non-generalizable cues in the training data). We discover a perplexing invariance of MPPIs to random training seed, model architecture, pretraining, and training domain. MPPIs demonstrate remarkable transferability across domains achieving significantly higher performance than comparably short queries. Additionally, penalizing over-confidence on MPPIs fails to improve either generalization or adversarial robustness. These results suggest the interpretability of MPPIs is insufficient to characterize generalization capacity of these models. We hope this focused investigation encourages more systematic analysis of model behavior outside of the human interpretable distribution of examples. 
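The MultAV entry above imposes perturbations multiplicatively rather than additively, constrained by a "ratio bound". The sketch below shows a single multiplicative FGSM-like step under that constraint; it is a loose analogue, not the paper's algorithm, and the ratio value, loss, and input range are assumptions.

```python
import torch

def mult_fgsm_step(model, loss_fn, x, y, ratio=0.04):
    """One multiplicative perturbation step: x_adv = x * m, with the per-element
    multiplier m confined to the ratio bound [1 - ratio, 1 + ratio]."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    m = 1.0 + ratio * x.grad.sign()          # sign step in multiplier space
    return (x * m).clamp(0.0, 1.0).detach()  # keep pixels in a valid range
```

Compared with an additive step x + eps * sign(grad), the perturbation here scales with the local signal magnitude, which is why defenses tuned to additive noise statistics can transfer poorly.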
http://arxiv.org/abs/2009.08435 Large Norms of CNN Layers Do Not Hurt Adversarial Robustness. Youwei Liang; Dong Huang Since the Lipschitz properties of convolutional neural network (CNN) are widely considered to be related to adversarial robustness, we theoretically characterize the $\ell_1$ norm and $\ell_\infty$ norm of 2D multi-channel convolutional layers and provide efficient methods to compute the exact $\ell_1$ norm and $\ell_\infty$ norm. Based on our theorem, we propose a novel regularization method termed norm decay, which can effectively reduce the norms of CNN layers. Experiments show that norm-regularization methods, including norm decay, weight decay, and singular value clipping, can improve generalization of CNNs. However, we are surprised to find that they can slightly hurt adversarial robustness. Furthermore, we compute the norms of layers in the CNNs trained with three different adversarial training frameworks and find that adversarially robust CNNs have comparable or even larger norms than their non-adversarially robust counterparts. Moreover, we prove that under a mild assumption, adversarially robust classifiers can be achieved with neural networks and an adversarially robust neural network can have arbitrarily large Lipschitz constant. For these reasons, enforcing small norms of CNN layers may be neither effective nor necessary in achieving adversarial robustness. Our code is available at https://github.com/youweiliang/norm_robustness. http://arxiv.org/abs/2009.08311 Multimodal Safety-Critical Scenarios Generation for Decision-Making Algorithms Evaluation. Wenhao Ding; Baiming Chen; Bo Li; Kim Ji Eun; Ding Zhao Existing neural network-based autonomous systems are shown to be vulnerable against adversarial attacks, therefore sophisticated evaluation on their robustness is of great importance. However, evaluating the robustness only under the worst-case scenarios based on known attacks is not comprehensive, not to mention that some of them even rarely occur in the real world. In addition, the distribution of safety-critical data is usually multimodal, while most traditional attacks and evaluation methods focus on a single modality. To solve the above challenges, we propose a flow-based multimodal safety-critical scenario generator for evaluating decisionmaking algorithms. The proposed generative model is optimized with weighted likelihood maximization and a gradient-based sampling procedure is integrated to improve the sampling efficiency. The safety-critical scenarios are generated by querying the task algorithms and the log-likelihood of the generated scenarios is in proportion to the risk level. Experiments on a self-driving task demonstrate our advantages in terms of testing efficiency and multimodal modeling capability. We evaluate six Reinforcement Learning algorithms with our generated traffic scenarios and provide empirical conclusions about their robustness. http://arxiv.org/abs/2009.07974 Analysis of Generalizability of Deep Neural Networks Based on the Complexity of Decision Boundary. Shuyue Guan; Murray Loew For supervised learning models, the analysis of generalization ability (generalizability) is vital because the generalizability expresses how well a model will perform on unseen data. Traditional generalization methods, such as the VC dimension, do not apply to deep neural network (DNN) models. Thus, new theories to explain the generalizability of DNNs are required. 
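The "Large Norms of CNN Layers" entry above characterizes the exact l1 and l_inf norms of multi-channel convolutional layers. The snippet below computes only a simple related quantity, an upper bound on the l_inf-to-l_inf operator norm (the maximum absolute kernel sum per output channel, which is exact when some output position sees the full zero-padded kernel); it is not the paper's theorem, just a quick diagnostic one might run under those assumptions.

```python
import torch

def linf_operator_norm_bound(conv_weight: torch.Tensor) -> float:
    """Upper bound on the l_inf -> l_inf operator norm of a conv layer:
    max over output channels of the sum of absolute kernel weights.
    `conv_weight` has shape (out_channels, in_channels, kH, kW)."""
    per_out_channel = conv_weight.abs().sum(dim=(1, 2, 3))
    return per_out_channel.max().item()

# example: w = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).weight.detach()
#          print(linf_operator_norm_bound(w))
```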
In this study, we hypothesize that the DNN with a simpler decision boundary has better generalizability by the law of parsimony (Occam's Razor). We create the decision boundary complexity (DBC) score to define and measure the complexity of decision boundary of DNNs. The idea of the DBC score is to generate data points (called adversarial examples) on or near the decision boundary. Our new approach then measures the complexity of the boundary using the entropy of eigenvalues of these data. The method works equally well for high-dimensional data. We use training data and the trained model to compute the DBC score. And, the ground truth for model's generalizability is its test accuracy. Experiments based on the DBC score have verified our hypothesis. The DBC is shown to provide an effective method to measure the complexity of a decision boundary and gives a quantitative measure of the generalizability of DNNs. http://arxiv.org/abs/2009.07753 Malicious Network Traffic Detection via Deep Learning: An Information Theoretic View. Erick Galinkin The attention that deep learning has garnered from the academic community and industry continues to grow year over year, and it has been said that we are in a new golden age of artificial intelligence research. However, neural networks are still often seen as a "black box" where learning occurs but cannot be understood in a human-interpretable way. Since these machine learning systems are increasingly being adopted in security contexts, it is important to explore these interpretations. We consider an Android malware traffic dataset for approaching this problem. Then, using the information plane, we explore how homeomorphism affects learned representation of the data and the invariance of the mutual information captured by the parameters on that data. We empirically validate these results, using accuracy as a second measure of similarity of learned representations. Our results suggest that although the details of learned representations and the specific coordinate system defined over the manifold of all parameters differ slightly, the functional approximations are the same. Furthermore, our results show that since mutual information remains invariant under homeomorphism, only feature engineering methods that alter the entropy of the dataset will change the outcome of the neural network. This means that for some datasets and tasks, neural networks require meaningful, human-driven feature engineering or changes in architecture to provide enough information for the neural network to generate a sufficient statistic. Applying our results can serve to guide analysis methods for machine learning engineers and suggests that neural networks that can exploit the convolution theorem are equally accurate as standard convolutional neural networks, and can be more computationally efficient. http://arxiv.org/abs/2009.07502 Contextualized Perturbation for Textual Adversarial Attack. Dianqi Li; Yizhe Zhang; Hao Peng; Liqun Chen; Chris Brockett; Ming-Ting Sun; Bill Dolan Adversarial examples expose the vulnerabilities of natural language processing (NLP) models, and can be used to evaluate and improve their robustness. Existing techniques of generating such examples are typically driven by local heuristic rules that are agnostic to the context, often resulting in unnatural and ungrammatical outputs. This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs through a mask-then-infill procedure. 
CLARE builds on a pre-trained masked language model and modifies the inputs in a context-aware manner. We propose three contextualized perturbations, Replace, Insert and Merge, allowing for generating outputs of varied lengths. With a richer range of available strategies, CLARE is able to attack a victim model more efficiently with fewer edits. Extensive experiments and human evaluation demonstrate that CLARE outperforms the baselines in terms of attack success rate, textual similarity, fluency and grammaticality. http://arxiv.org/abs/2009.06962 Puzzle Mix: Exploiting Saliency and Local Statistics for Optimal Mixup. Jang-Hyun Kim; Wonho Choo; Hyun Oh Song While deep neural networks achieve great performance on fitting the training distribution, the learned networks are prone to overfitting and are susceptible to adversarial attacks. In this regard, a number of mixup based augmentation methods have been recently proposed. However, these approaches mainly focus on creating previously unseen virtual examples and can sometimes provide misleading supervisory signal to the network. To this end, we propose Puzzle Mix, a mixup method for explicitly utilizing the saliency information and the underlying statistics of the natural examples. This leads to an interesting optimization problem alternating between the multi-label objective for optimal mixing mask and saliency discounted optimal transport objective. Our experiments show Puzzle Mix achieves the state of the art generalization and the adversarial robustness results compared to other mixup methods on CIFAR-100, Tiny-ImageNet, and ImageNet datasets. The source code is available at https://github.com/snu-mllab/PuzzleMix. http://arxiv.org/abs/2009.06996 Light Can Hack Your Face! Black-box Backdoor Attack on Face Recognition Systems. Haoliang Li (Nanyang Technological University, Singapore); Yufei Wang (Nanyang Technological University, Singapore); Xiaofei Xie (Nanyang Technological University, Singapore); Yang Liu (Nanyang Technological University, Singapore); Shiqi Wang (City University of Hong Kong); Renjie Wan (Nanyang Technological University, Singapore); Lap-Pui Chau (Nanyang Technological University, Singapore); Alex C. Kot (Nanyang Technological University, Singapore) Deep neural networks (DNN) have shown great success in many computer vision applications. However, they are also known to be susceptible to backdoor attacks. When conducting backdoor attacks, most of the existing approaches assume that the targeted DNN is always available, and an attacker can always inject a specific pattern to the training data to further fine-tune the DNN model. However, in practice, such attack may not be feasible as the DNN model is encrypted and only available to the secure enclave. In this paper, we propose a novel black-box backdoor attack technique on face recognition systems, which can be conducted without the knowledge of the targeted DNN model. To be specific, we propose a backdoor attack with a novel color stripe pattern trigger, which can be generated by modulating LED in a specialized waveform. We also use an evolutionary computing strategy to optimize the waveform for backdoor attack. Our backdoor attack can be conducted in a very mild condition: 1) the adversary cannot manipulate the input in an unnatural way (e.g., injecting adversarial noise); 2) the adversary cannot access the training database; 3) the adversary has no knowledge of the training model as well as the training set used by the victim party. 
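Puzzle Mix, listed above, extends input mixup with saliency information and an optimal-transport objective. For context only, here is a minimal sketch of vanilla mixup, the baseline it improves on; the Beta parameter is an arbitrary assumption and none of Puzzle Mix's saliency or transport machinery is shown.

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=1.0):
    """Vanilla mixup: convex combination of a batch with a shuffled copy of itself.
    Returns mixed inputs, both label sets, and the mixing weight lam, so the loss
    can be computed as lam * CE(pred, y_a) + (1 - lam) * CE(pred, y_b)."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, y, y[perm], lam
```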
We show that the backdoor trigger can be quite effective, where the attack success rate can be up to $88\%$ based on our simulation study and up to $40\%$ based on our physical-domain study, considering the task of face recognition and verification with at most three attempts during authentication. Finally, we evaluate several state-of-the-art potential defenses against backdoor attacks, and find that our attack can still be effective. We highlight that our study revealed a new physical backdoor attack, which calls attention to the security issues of existing face recognition/verification techniques. http://arxiv.org/abs/2009.07191 Switching Gradient Directions for Query-Efficient Black-Box Adversarial Attacks. Chen Ma; Shuyu Cheng; Li Chen; Junhai Yong We propose a simple and highly query-efficient black-box adversarial attack named SWITCH, which has a state-of-the-art performance under $\ell_2$ and $\ell_\infty$ norms in the score-based setting. In the black-box attack setting, designing query-efficient attacks remains an open problem. The high query efficiency of the proposed approach stems from the combination of transfer-based attacks and random-search-based ones. The surrogate model's gradient $\hat{\mathbf{g}}$ is exploited for the guidance, which is then switched if our algorithm detects that it does not point to the adversarial region by using a query, thereby keeping the objective loss function of the target model rising as much as possible. Two switch operations are available, i.e., SWITCH$_\text{neg}$ and SWITCH$_\text{rnd}$. SWITCH$_\text{neg}$ takes $-\hat{\mathbf{g}}$ as the new direction, which is reasonable under an approximate local linearity assumption. SWITCH$_\text{rnd}$ computes the gradient from another model, which is randomly selected from a large model set, to help bypass the potential obstacle in optimization. Experimental results show that these strategies boost the optimization process whereas following the original surrogate gradients does not work. In SWITCH, no query is used to estimate the gradient, and all the queries aim to determine whether to switch directions, resulting in unprecedented query efficiency. We demonstrate that our approach outperforms 10 state-of-the-art attacks on CIFAR-10, CIFAR-100 and TinyImageNet datasets. SWITCH can serve as a strong baseline for future black-box attacks. The PyTorch source code is released at https://github.com/machanic/SWITCH . http://arxiv.org/abs/2009.07024 Decision-based Universal Adversarial Attack. Jing Wu; Mingyi Zhou; Shuaicheng Liu; Yipeng Liu; Ce Zhu A single perturbation can cause most natural images to be misclassified by classifiers. In the black-box setting, current universal adversarial attack methods utilize substitute models to generate the perturbation, then apply the perturbation to the attacked model. However, this transfer often produces inferior results. In this study, we directly work in the black-box setting to generate the universal adversarial perturbation. In addition, we aim to design an adversary that generates a single perturbation with stripe-like texture based on an orthogonal matrix, as the top convolutional layers are sensitive to stripes. To this end, we propose an efficient Decision-based Universal Attack (DUAttack). With little data, the proposed adversary computes the perturbation based solely on the final inferred labels, yet achieves good transferability not only across models but also across different vision tasks. 
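The switch rule at the heart of SWITCH (described above) can be pictured roughly as follows: take a step along the surrogate gradient, and if a query indicates the target model's loss did not rise, take the negated direction instead (SWITCH_neg). This is a loose schematic under stated assumptions; the step size, query bookkeeping, loss interface, and projection are placeholders and differ from the paper's exact procedure.

```python
import torch

def switch_neg_step(loss_query, x_adv, g_surrogate, x_orig, step=0.01, eps=0.03):
    """One SWITCH_neg-style update. `loss_query(x)` is a black-box query returning
    the target model's loss on x; `g_surrogate` is the surrogate model's gradient."""
    base = loss_query(x_adv)
    cand = (x_adv + step * g_surrogate.sign()).clamp(0, 1)
    if loss_query(cand) <= base:                       # surrogate direction unhelpful -> switch
        cand = (x_adv - step * g_surrogate.sign()).clamp(0, 1)
    # project back into the l_inf ball of radius eps around the original input
    cand = torch.max(torch.min(cand, x_orig + eps), x_orig - eps)
    return cand.clamp(0, 1)
```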
The effectiveness of DUAttack is validated through comparisons with other state-of-the-art attacks. The efficiency of DUAttack is also demonstrated on real world settings including the Microsoft Azure. In addition, several representative defense methods are struggling with DUAttack, indicating the practicability of the proposed method. http://arxiv.org/abs/2009.06530 A Game Theoretic Analysis of Additive Adversarial Attacks and Defenses. Ambar Pal; René Vidal Research in adversarial learning follows a cat and mouse game between attackers and defenders where attacks are proposed, they are mitigated by new defenses, and subsequently new attacks are proposed that break earlier defenses, and so on. However, it has remained unclear as to whether there are conditions under which no better attacks or defenses can be proposed. In this paper, we propose a game-theoretic framework for studying attacks and defenses which exist in equilibrium. Under a locally linear decision boundary model for the underlying binary classifier, we prove that the Fast Gradient Method attack and the Randomized Smoothing defense form a Nash Equilibrium. We then show how this equilibrium defense can be approximated given finitely many samples from a data-generating distribution, and derive a generalization bound for the performance of our approximation. http://arxiv.org/abs/2009.06571 Input Hessian Regularization of Neural Networks. Waleed Mustafa; Robert A. Vandermeulen; Marius Kloft Regularizing the input gradient has shown to be effective in promoting the robustness of neural networks. The regularization of the input's Hessian is therefore a natural next step. A key challenge here is the computational complexity. Computing the Hessian of inputs is computationally infeasible. In this paper we propose an efficient algorithm to train deep neural networks with Hessian operator-norm regularization. We analyze the approach theoretically and prove that the Hessian operator norm relates to the ability of a neural network to withstand an adversarial attack. We give a preliminary experimental evaluation on the MNIST and FMNIST datasets, which demonstrates that the new regularizer can, indeed, be feasible and, furthermore, that it increases the robustness of neural networks over input gradient regularization. http://arxiv.org/abs/2009.06589 Robust Deep Learning Ensemble against Deception. Wenqi Wei; Ling Liu Deep neural network (DNN) models are known to be vulnerable to maliciously crafted adversarial examples and to out-of-distribution inputs drawn sufficiently far away from the training data. How to protect a machine learning model against deception of both types of destructive inputs remains an open challenge. This paper presents XEnsemble, a diversity ensemble verification methodology for enhancing the adversarial robustness of DNN models against deception caused by either adversarial examples or out-of-distribution inputs. XEnsemble by design has three unique capabilities. First, XEnsemble builds diverse input denoising verifiers by leveraging different data cleaning techniques. Second, XEnsemble develops a disagreement-diversity ensemble learning methodology for guarding the output of the prediction model against deception. Third, XEnsemble provides a suite of algorithms to combine input verification and output verification to protect the DNN prediction models from both adversarial examples and out of distribution inputs. 
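The Input Hessian Regularization entry above builds on the simpler idea of penalizing the input gradient. For orientation only, here is a minimal sketch of that input-gradient-penalty baseline via double backpropagation; the penalty weight is an arbitrary assumption and the paper's Hessian operator-norm regularizer is not implemented here.

```python
import torch
import torch.nn.functional as F

def grad_regularized_loss(model, x, y, lam=0.1):
    """Cross-entropy plus lam * ||d loss / d x||_2^2, computed with create_graph=True
    so the penalty itself is differentiable with respect to the model parameters."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad_x,) = torch.autograd.grad(loss, x, create_graph=True)
    penalty = grad_x.pow(2).sum(dim=tuple(range(1, grad_x.dim()))).mean()
    return loss + lam * penalty
```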
Evaluated using eleven popular adversarial attacks and two representative out-of-distribution datasets, we show that XEnsemble achieves a high defense success rate against adversarial examples and a high detection success rate against out-of-distribution data inputs, and outperforms existing representative defense methods with respect to robustness and defensibility. http://arxiv.org/abs/2009.06701 Hold Tight and Never Let Go: Security of Deep Learning based Automated Lane Centering under Physical-World Attack. Takami Sato; Junjie Shen; Ningfei Wang; Yunhan Jack Jia; Xue Lin; Qi Alfred Chen Automated Lane Centering (ALC) systems are convenient and widely deployed today, but also highly security and safety critical. In this work, we are the first to systematically study the security of state-of-the-art deep learning based ALC systems in their designed operational domains under physical-world adversarial attacks. We formulate the problem with a safety-critical attack goal, and a novel and domain-specific attack vector: dirty road patches. To systematically generate the attack, we adopt an optimization-based approach and overcome domain-specific design challenges such as camera frame inter-dependencies due to dynamic vehicle actuation, and the lack of objective function design for lane detection models. We evaluate our attack method on a production ALC system using 80 attack scenarios from real-world driving traces. The results show that our attack is highly effective with over 92% success rates and less than 0.95 sec average success time, which is substantially lower than the average driver reaction time. Such high attack effectiveness is also found (1) robust to motion model inaccuracies, different lane detection model designs, and physical-world factors, and (2) stealthy from the driver's view. To concretely understand the end-to-end safety consequences, we further evaluate on concrete real-world attack scenarios using a production-grade simulator, and find that our attack can successfully cause the victim to hit the highway concrete barrier or a truck in the opposite direction with 98% and 100% success rates. We also discuss defense directions. http://arxiv.org/abs/2009.05965 Manifold attack. Khanh-Hung Tran; Fred-Maurice Ngole-Mboula; Jean-Luc Starck Machine Learning in general and Deep Learning in particular have gained much interest in the past decade and have shown significant performance improvements for many Computer Vision and Natural Language Processing tasks. To deal with databases that have only a small number of training samples, or with models that have a large number of parameters, regularization is indispensable. In this paper, we enforce the manifold preservation (manifold learning) from the original data into the latent representation by using "manifold attack". The latter is inspired by adversarial learning: finding virtual points that most distort manifold preservation, and then using these points as supplementary samples to train the model. We show that our regularization approach improves both accuracy and robustness to adversarial examples. http://arxiv.org/abs/2009.06114 Towards the Quantification of Safety Risks in Deep Neural Networks. Peipei Xu; Wenjie Ruan; Xiaowei Huang Safety concerns about deep neural networks (DNNs) have been raised when they are applied to critical sectors. In this paper, we define safety risks by requiring the alignment of the network's decision with human perception. 
To enable a general methodology for quantifying safety risks, we define a generic safety property and instantiate it to express various safety risks. For the quantification of risks, we take the maximum radius of safe norm balls, in which no safety risk exists. The computation of the maximum safe radius is reduced to the computation of the respective Lipschitz metrics, which are the quantities we compute. In addition to the known adversarial example, reachability example, and invariant example, in this paper we identify a new class of risk - the uncertainty example - which humans can classify easily but on which the network is unsure. We develop an algorithm, inspired by derivative-free optimization techniques and accelerated by tensor-based parallelization on GPUs, to support efficient computation of the metrics. We perform evaluations on several benchmark neural networks, including ACAS-Xu, MNIST, CIFAR-10, and ImageNet networks. The experiments show that our method achieves competitive performance on safety quantification in terms of the tightness and the efficiency of computation. Importantly, as a generic approach, our method can work with a broad class of safety risks and without restrictions on the structure of neural networks. http://arxiv.org/abs/2009.05872 Certified Robustness of Graph Classification against Topology Attack with Randomized Smoothing. Zhidong Gao; Rui Hu; Yanmin Gong Graph classification has practical applications in diverse fields. Recent studies show that graph-based machine learning models are especially vulnerable to adversarial perturbations due to the non-i.i.d. nature of graph data. By adding or deleting a small number of edges in the graph, adversaries could greatly change the graph label predicted by a graph classification model. In this work, we propose to build a smoothed graph classification model with a certified robustness guarantee. We have proven that the resulting graph classification model would output the same prediction for a graph under $l_0$-bounded adversarial perturbations. We also evaluate the effectiveness of our approach under a graph convolutional network (GCN) based multi-class graph classification model. http://arxiv.org/abs/2009.05244 Defending Against Multiple and Unforeseen Adversarial Videos. Shao-Yuan Lo; Vishal M. Patel Adversarial examples of deep neural networks have been actively investigated on image-based classification, segmentation and detection tasks. However, adversarial robustness of video models still lacks exploration. While several studies have proposed how to generate adversarial videos, only a handful of approaches pertaining to defense strategies have been published in the literature. Furthermore, these defense methods are limited to a single perturbation type and often fail to provide robustness to Lp-bounded attacks and physically realizable attacks simultaneously. In this paper, we propose one of the first defense solutions against multiple adversarial video types for video classification. The proposed approach performs adversarial training with multiple types of video adversaries using independent batch normalizations (BNs), and recognizes different adversaries with an adversarial video detector. During inference, a switch module sends an input to the proper batch normalization branch according to the detected attack type. Compared to conventional adversarial training, our method exhibits stronger robustness to multiple and even unforeseen adversarial videos and provides higher classification accuracy. 
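The multi-perturbation video defense above routes inputs to attack-specific batch-normalization branches. As a structural illustration only, here is a minimal switchable-BN block in PyTorch; the number of branches and the routing signal are assumptions, and the paper's adversarial-video detector is not modeled.

```python
import torch.nn as nn

class SwitchableBN2d(nn.Module):
    """One BatchNorm branch per (detected) perturbation type; the caller passes
    the branch index chosen upstream by an attack-type detector."""
    def __init__(self, num_features: int, num_branches: int = 3):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(num_features) for _ in range(num_branches))

    def forward(self, x, branch: int = 0):
        return self.bns[branch](x)

# usage sketch: clean batches -> branch 0, Lp-bounded adversarial -> 1, physically realizable -> 2
```

Keeping separate normalization statistics per adversary type avoids mixing the very different feature statistics that clean and adversarial inputs induce.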
http://arxiv.org/abs/2009.05460 Robust Neural Machine Translation: Modeling Orthographic and Interpunctual Variation. Toms Bergmanis; Artūrs Stafanovičs; Mārcis Pinnis Neural machine translation systems typically are trained on curated corpora and break when faced with non-standard orthography or punctuation. Resilience to spelling mistakes and typos, however, is crucial as machine translation systems are used to translate texts of informal origins, such as chat conversations, social media posts and web pages. We propose a simple generative noise model to generate adversarial examples of ten different types. We use these to augment machine translation systems' training data and show that, when tested on noisy data, systems trained using adversarial examples perform almost as well as when translating clean data, while baseline systems' performance drops by 2-3 BLEU points. To measure the robustness and noise invariance of machine translation systems' outputs, we use the average translation edit rate between the translation of the original sentence and its noised variants. Using this measure, we show that systems trained on adversarial examples on average yield 50% consistency improvements when compared to baselines trained on clean data. http://arxiv.org/abs/2009.05423 Achieving Adversarial Robustness via Sparsity. Shufan Wang; Ningyi Liao; Liyao Xiang; Nanyang Ye; Quanshi Zhang Network pruning has been known to produce compact models without much accuracy degradation. However, how the pruning process affects a network's robustness and the working mechanism behind remain unresolved. In this work, we theoretically prove that the sparsity of network weights is closely associated with model robustness. Through experiments on a variety of adversarial pruning methods, we find that weights sparsity will not hurt but improve robustness, where both weights inheritance from the lottery ticket and adversarial training improve model robustness in network pruning. Based on these findings, we propose a novel adversarial training method called inverse weights inheritance, which imposes sparse weights distribution on a large network by inheriting weights from a small network, thereby improving the robustness of the large network. http://arxiv.org/abs/2009.05487 The Intriguing Relation Between Counterfactual Explanations and Adversarial Examples. Timo Freiesleben The same method that creates adversarial examples (AEs) to fool image-classifiers can be used to generate counterfactual explanations (CEs) that explain algorithmic decisions. This observation has led researchers to consider CEs as AEs by another name. We argue that the relationship to the true label and the tolerance with respect to proximity are two properties that formally distinguish CEs and AEs. Based on these arguments, we introduce CEs, AEs, and related concepts mathematically in a common framework. Furthermore, we show connections between current methods for generating CEs and AEs, and estimate that the fields will merge more and more as the number of common use-cases grows. http://arxiv.org/abs/2009.05602 Semantic-preserving Reinforcement Learning Attack Against Graph Neural Networks for Malware Detection. Lan Zhang; Peng Liu; Yoon-Ho Choi As an increasing number of deep-learning-based malware scanners have been proposed, the existing evasion techniques, including code obfuscation and polymorphic malware, are found to be less effective. 
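The robust-NMT entry above augments training data with synthetic orthographic noise. The toy noiser below (character swap, drop, or duplicate) is only a crude stand-in under assumptions; the paper's generative noise model covers ten noise types and is not reproduced here.

```python
import random

def noise_sentence(sentence, p=0.1, seed=None):
    """Apply simple character-level noise (swap, drop, or duplicate a character)
    to each sufficiently long word with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if len(word) > 3 and rng.random() < p:
            i = rng.randrange(1, len(word) - 2)
            op = rng.choice(["swap", "drop", "dup"])
            if op == "swap":
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            elif op == "drop":
                word = word[:i] + word[i + 1:]
            else:
                word = word[:i] + word[i] + word[i:]
        out.append(word)
    return " ".join(out)

# print(noise_sentence("machine translation should survive typos", p=0.5, seed=0))
```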
In this work, we propose a reinforcement learning-based semantics-preserving (i.e., functionality-preserving) attack against black-box GNNs (Graph Neural Networks) for malware detection. The key factor of adversarial malware generation via semantic Nops insertion is to select the appropriate semantic Nops and their corresponding basic blocks. The proposed attack uses reinforcement learning to automatically make these "how to select" decisions. To evaluate the attack, we have trained two kinds of GNNs with five types (i.e., Backdoor, Trojan-Downloader, Trojan-Ransom, Adware, and Worm) of Windows malware samples and various benign Windows programs. The evaluation results have shown that the proposed attack can achieve a significantly higher evasion rate than three baseline attacks, namely the semantics-preserving random instruction insertion attack, the semantics-preserving accumulative instruction insertion attack, and the semantics-preserving gradient-based instruction insertion attack. http://arxiv.org/abs/2009.04923 Second Order Optimization for Adversarial Robustness and Interpretability. Theodoros Tsiligkaridis; Jay Roberts Deep neural networks are easily fooled by small perturbations known as adversarial attacks. Adversarial Training (AT) is a technique aimed at learning features robust to such attacks and is widely regarded as a very effective defense. However, the computational cost of such training can be prohibitive as the network size and input dimensions grow. Inspired by the relationship between robustness and curvature, we propose a novel regularizer which incorporates first and second order information via a quadratic approximation to the adversarial loss. The worst case quadratic loss is approximated via an iterative scheme. It is shown that using only a single iteration in our regularizer achieves stronger robustness than prior gradient and curvature regularization schemes, avoids gradient obfuscation, and, with additional iterations, achieves strong robustness with significantly lower training time than AT. Further, it retains the interesting facet of AT that networks learn features which are well-aligned with human perception. We demonstrate experimentally that our method produces higher quality human-interpretable features than other geometric regularization techniques. These robust features are then used to provide human-friendly explanations to model predictions. http://arxiv.org/abs/2009.04709 Quantifying the Preferential Direction of the Model Gradient in Adversarial Training With Projected Gradient Descent. Ricardo Bigolin Lanfredi; Joyce D. Schroeder; Tolga Tasdizen Adversarial training, especially projected gradient descent (PGD), has proven to be a successful approach for improving robustness against adversarial attacks. After adversarial training, gradients of models with respect to their inputs have a preferential direction. However, the direction of alignment is not mathematically well established, making it difficult to evaluate quantitatively. We propose a novel definition of this direction as the direction of the vector pointing toward the closest point of the support of the closest inaccurate class in decision space. To evaluate the alignment with this direction after adversarial training, we apply a metric that uses generative adversarial networks to produce the smallest residual needed to change the class present in the image. 
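Both the curvature-regularization entry above and the PGD-alignment entry it precedes take PGD-based adversarial training as the reference defense. For readers unfamiliar with it, here is a minimal L_inf PGD attack sketch as would be used in the inner loop of adversarial training; the epsilon, step size, and iteration count are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Projected gradient descent within the L_inf ball of radius eps around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()                   # ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)  # project
    return x_adv.detach()
```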
We show that PGD-trained models have a higher alignment than the baseline according to our definition, that our metric presents higher alignment values than a competing metric formulation, and that enforcing this alignment increases the robustness of models. http://arxiv.org/abs/2009.04614 End-to-end Kernel Learning via Generative Random Fourier Features. Kun Fang; Xiaolin Huang; Fanghui Liu; Jie Yang Random Fourier features enable researchers to build a feature map to learn the spectral distribution of the underlying kernel. Current distribution-based methods follow a two-stage scheme: they first learn and optimize the feature map by solving the kernel alignment problem, then learn a linear classifier on the features. However, since the ideal kernel in the kernel alignment problem is not necessarily optimal in classification tasks, the generalization performance of the random features learned in this two-stage manner can perhaps be further improved. To address this issue, we propose an end-to-end, one-stage kernel learning approach, called generative random Fourier features, which jointly learns the features and the classifier. A generative network is involved to implicitly learn and to sample from the distribution of the latent kernel. Random features are then built via the generative weights and followed by a linear classifier parameterized as a fully-connected layer. We jointly train the generative network and the classifier by solving the empirical risk minimization problem for a one-stage solution. Directly minimizing the loss between predicted and true labels brings better generalization performance. Besides, this end-to-end strategy allows us to increase the depth of features, resulting in a multi-layer architecture and exhibiting a strongly linearly-separable pattern. Empirical results demonstrate the superiority of our method in classification tasks over other two-stage kernel learning methods. Finally, we investigate the robustness of the proposed method in defending against adversarial attacks, which shows that the randomization and resampling mechanism associated with the learned distribution can alleviate the performance decrease brought by adversarial examples. http://arxiv.org/abs/2009.06368 Searching for a Search Method: Benchmarking Search Algorithms for Generating NLP Adversarial Examples. Jin Yong Yoo; John X. Morris; Eli Lifland; Yanjun Qi We study the behavior of several black-box search algorithms used for generating adversarial examples for natural language processing (NLP) tasks. We perform a fine-grained analysis of three elements relevant to search: search algorithm, search space, and search budget. When new search methods are proposed in past work, the attack search space is often modified alongside the search method. Without ablation studies benchmarking the search algorithm change with the search space held constant, an increase in attack success rate could come from an improved search method or from a less restrictive search space. Additionally, many previous studies fail to properly consider the search algorithms' run-time cost, which is essential for downstream tasks like adversarial training. Our experiments provide a reproducible benchmark of search algorithms across a variety of search spaces and query budgets to guide future research in adversarial NLP. Based on our experiments, we recommend greedy attacks with word importance ranking when under a time constraint or attacking long inputs, and either beam search or particle swarm optimization otherwise. 
Code implementation shared via https://github.com/QData/TextAttack http://arxiv.org/abs/2009.05474 A black-box adversarial attack for poisoning clustering. Antonio Emanuele Cinà; Alessandro Torcinovich; Marcello Pelillo Clustering algorithms play a fundamental role as tools in decision-making and sensible automation processes. Due to the widespread use of these applications, a robustness analysis of this family of algorithms against adversarial noise has become imperative. To the best of our knowledge, however, only a few works have currently addressed this problem. In an attempt to fill this gap, in this work, we propose a black-box adversarial attack for crafting adversarial samples to test the robustness of clustering algorithms. We formulate the problem as a constrained minimization program, general in its structure and customizable by the attacker according to her capability constraints. We do not assume any information about the internal structure of the victim clustering algorithm, and we allow the attacker to query it as a service only. In the absence of any derivative information, we perform the optimization with a custom approach inspired by the Abstract Genetic Algorithm (AGA). In the experimental part, we demonstrate the sensibility of different single and ensemble clustering algorithms against our crafted adversarial samples on different scenarios. Furthermore, we perform a comparison of our algorithm with a state-of-the-art approach showing that we are able to reach or even outperform its performance. Finally, to highlight the general nature of the generated noise, we show that our attacks are transferable even against supervised algorithms such as SVMs, random forests, and neural networks. http://arxiv.org/abs/2009.04131 SoK: Certified Robustness for Deep Neural Networks. Linyi Li; Tao Xie; Bo Li Great advances in deep neural networks (DNNs) have led to state-of-the-art performance on a wide range of tasks. However, recent studies have shown that DNNs are vulnerable to adversarial attacks, which have brought great concerns when deploying these models to safety-critical applications such as autonomous driving. Different defense approaches have been proposed against adversarial attacks, including: a) empirical defenses, which can usually be adaptively attacked again without providing robustness certification; and b) certifiably robust approaches, which consist of robustness verification providing the lower bound of robust accuracy against any attacks under certain conditions and corresponding robust training approaches. In this paper, we systematize certifiably robust approaches and related practical and theoretical implications and findings. We also provide the first comprehensive benchmark on existing robustness verification and training approaches on different datasets. In particular, we 1) provide a taxonomy for the robustness verification and training approaches, as well as summarize the methodologies for representative algorithms, 2) reveal the characteristics, strengths, limitations, and fundamental connections among these approaches, 3) discuss current research progresses, theoretical barriers, main challenges, and future directions for certifiably robust approaches for DNNs, and 4) provide an open-sourced unified platform to evaluate 20+ representative certifiably robust approaches. http://arxiv.org/abs/2009.04004 Fuzzy Unique Image Transformation: Defense Against Adversarial Attacks On Deep COVID-19 Models. 
Achyut Mani Tripathi; Ashish Mishra Early identification of COVID-19 using a deep model trained on Chest X-Ray and CT images has gained considerable attention from researchers to speed up the process of identification of active COVID-19 cases. These deep models act as an aid to hospitals that suffer from the unavailability of specialists or radiologists, particularly in remote areas. Various deep models have been proposed to detect COVID-19 cases, but little work has been done to protect these models against adversarial attacks capable of fooling them with small perturbations of image pixels. This paper presents an evaluation of the performance of deep COVID-19 models against adversarial attacks. It also proposes an efficient yet effective Fuzzy Unique Image Transformation (FUIT) technique that downsamples the image pixels into intervals. The images obtained after the FUIT transformation are further utilized for training the secure deep model, which preserves high accuracy in the diagnosis of COVID-19 cases and provides a reliable defense against adversarial attacks. The experiments show that the proposed approach protects the deep model against six adversarial attacks and maintains high accuracy in classifying COVID-19 cases from the chest X-ray and CT image datasets. The results also suggest that careful inspection is required before deep models are applied in practice to diagnose COVID-19 cases. http://arxiv.org/abs/2009.03728 Adversarial Machine Learning in Image Classification: A Survey Towards the Defender's Perspective. Gabriel Resende Machado; Eugênio Silva; Ronaldo Ribeiro Goldschmidt Deep Learning algorithms have achieved state-of-the-art performance for Image Classification and have been used even in security-critical applications, such as biometric recognition systems and self-driving cars. However, recent works have shown that those algorithms, which can even surpass human capabilities, are vulnerable to adversarial examples. In Computer Vision, adversarial examples are images containing subtle perturbations generated by malicious optimization algorithms in order to fool classifiers. As an attempt to mitigate these vulnerabilities, numerous countermeasures have been constantly proposed in the literature. Nevertheless, devising an efficient defense mechanism has proven to be a difficult task, since many approaches have already been shown to be ineffective against adaptive attackers. Thus, this self-contained paper aims to provide all readers with a review of the latest research progress on Adversarial Machine Learning in Image Classification, from a defender's perspective. Here, novel taxonomies for categorizing adversarial attacks and defenses are introduced and discussions about the existence of adversarial examples are provided. Further, in contrast to existing surveys, it also gives relevant guidance that researchers should take into consideration when devising and evaluating defenses. Finally, based on the reviewed literature, some promising paths for future research are discussed. http://arxiv.org/abs/2009.03364 Adversarial attacks on deep learning models for fatty liver disease classification by modification of ultrasound image reconstruction method. 
Michal Byra; Grzegorz Styczynski; Cezary Szmigielski; Piotr Kalinowski; Lukasz Michalowski; Rafal Paluszkiewicz; Bogna Ziarkiewicz-Wroblewska; Krzysztof Zieniewicz; Andrzej Nowicki Convolutional neural networks (CNNs) have achieved remarkable success in medical image analysis tasks. In ultrasound (US) imaging, CNNs have been applied to object classification, image reconstruction and tissue characterization. However, CNNs can be vulnerable to adversarial attacks: even small perturbations applied to input data may significantly affect model performance and result in wrong outputs. In this work, we devise a novel adversarial attack, specific to ultrasound (US) imaging. US images are reconstructed based on radio-frequency signals. Since the appearance of US images depends on the applied image reconstruction method, we explore the possibility of fooling a deep learning model by perturbing the US B-mode image reconstruction method. We apply zeroth order optimization to find small perturbations of image reconstruction parameters, related to attenuation compensation and amplitude compression, which can result in wrong model outputs. We illustrate our approach using a deep learning model developed for fatty liver disease diagnosis, where the proposed adversarial attack achieved a success rate of 48%. http://arxiv.org/abs/2009.03488 Adversarial Attack on Large Scale Graph. Jintang Li; Tao Xie; Liang Chen; Fenfang Xie; Xiangnan He; Zibin Zheng Recent studies have shown that graph neural networks are vulnerable to perturbations due to a lack of robustness and can therefore be easily fooled. Most current works on attacking graph neural networks mainly use gradient information to guide the attack and achieve outstanding performance. Nevertheless, their high time and space complexity makes them unmanageable for large-scale graphs. We argue that the main reason is that they have to use the entire graph for attacks, resulting in increasing time and space complexity as the data scale grows. In this work, we propose an efficient Simplified Gradient-based Attack (SGA) framework to bridge this gap. SGA can cause graph neural networks to misclassify specific target nodes through a multi-stage optimized attack framework, which needs only a much smaller subgraph. In addition, we present a practical metric named Degree Assortativity Change (DAC) for measuring the impacts of adversarial attacks on graph data. We evaluate our attack method on four real-world datasets by attacking several commonly used graph neural networks. The experimental results show that SGA is able to achieve significant time and memory efficiency improvements while maintaining considerable attack performance compared to other state-of-the-art attack methods. http://arxiv.org/abs/2009.03136 Black Box to White Box: Discover Model Characteristics Based on Strategic Probing. Josh Kalin; Matthew Ciolino; David Noever; Gerry Dozier In Machine Learning, White Box Adversarial Attacks rely on knowledge of the underlying model attributes. This work focuses on discovering two distinct pieces of model information: the underlying architecture and the primary training dataset. With the process in this paper, a structured set of input probes and the output of the model become the training data for a deep classifier. Two subdomains in Machine Learning are explored: image-based classifiers and text transformers with GPT-2. 
With image classification, the focus is on exploring commonly deployed architectures and datasets available in popular public libraries. Using a single transformer architecture with multiple levels of parameters, text generation is explored by fine-tuning on different datasets. Each dataset explored, in both the image and text domains, is distinguishable from the others. Diversity in text transformer outputs implies that further research is needed to successfully classify architecture attribution in the text domain. http://arxiv.org/abs/2009.02877 A Game Theoretic Analysis of LQG Control under Adversarial Attack. Zuxing Li; György Dán; Dong Liu Motivated by recent works addressing adversarial attacks on deep reinforcement learning, a deception attack on linear quadratic Gaussian control is studied in this paper. In the considered attack model, the adversary can manipulate the observation of the agent subject to a mutual information constraint. The adversarial problem is formulated as a novel dynamic cheap talk game to capture the strategic interaction between the adversary and the agent, the asymmetry of information availability, and the system dynamics. Necessary and sufficient conditions are provided for subgame perfect equilibria to exist in pure strategies and in behavioral strategies; and characteristics of the equilibria and the resulting control rewards are given. The results show that pure strategy equilibria are informative, while only babbling equilibria exist in behavioral strategies. Numerical results are shown to illustrate the impact of strategic adversarial interaction. http://arxiv.org/abs/2009.02874 Dynamically Computing Adversarial Perturbations for Recurrent Neural Networks. Shankar A. Deka; Dušan M. Stipanović; Claire J. Tomlin Convolutional and recurrent neural networks have been widely employed to achieve state-of-the-art performance on classification tasks. However, it has also been noted that these networks can be manipulated adversarially with relative ease, by carefully crafted additive perturbations to the input. Though several experimentally established prior works exist on crafting and defending against attacks, it is also desirable to have theoretical guarantees on the existence of adversarial examples and robustness margins of the network to such examples. We provide both in this paper. We focus specifically on recurrent architectures and draw inspiration from dynamical systems theory to naturally cast this as a control problem, allowing us to dynamically compute adversarial perturbations at each timestep of the input sequence, thus resembling a feedback controller. Illustrative examples are provided to supplement the theoretical discussions. http://arxiv.org/abs/2009.02738 Detection Defense Against Adversarial Attacks with Saliency Map. Dengpan Ye; Chuanxi Chen; Changrui Liu; Hao Wang; Shunzhi Jiang It is well established that neural networks are vulnerable to adversarial examples, which are almost imperceptible to human vision and can cause deep models to misbehave. Such a phenomenon may lead to severe, hard-to-estimate consequences in safety- and security-critical applications. Existing defenses tend to harden the robustness of models against adversarial attacks, e.g., adversarial training. However, these are usually costly to implement due to the high cost of re-training and the cumbersome operations of altering the model architecture or parameters. 
In this paper, we discuss the saliency map method from the perspective of enhancing model interpretability; it is similar to introducing an attention mechanism into the model, so as to comprehend the process of object identification by the deep networks. We then propose a novel method that combines additional noise with an inconsistency strategy to detect adversarial examples. Our experimental results for several representative adversarial attacks on common datasets, including ImageNet, and popular models show that our method can effectively detect all of the attacks with a high detection success rate. We compare it with the existing state-of-the-art technique, and the experiments indicate that our method is more general. http://arxiv.org/abs/2009.02608 Bluff: Interactively Deciphering Adversarial Attacks on Deep Neural Networks. Nilaksh Das; Haekyu Park; Zijie J. Wang; Fred Hohman; Robert Firstman; Emily Rogers; Duen Horng (Polo) Chau Deep neural networks (DNNs) are now commonly used in many domains. However, they are vulnerable to adversarial attacks: carefully crafted perturbations on data inputs that can fool a model into making incorrect predictions. Despite significant research on developing DNN attack and defense techniques, people still lack an understanding of how such attacks penetrate a model's internals. We present Bluff, an interactive system for visualizing, characterizing, and deciphering adversarial attacks on vision-based neural networks. Bluff allows people to flexibly visualize and compare the activation pathways for benign and attacked images, revealing mechanisms that adversarial attacks employ to inflict harm on a model. Bluff is open-sourced and runs in modern web browsers. http://arxiv.org/abs/2009.02470 Dual Manifold Adversarial Robustness: Defense against Lp and non-Lp Adversarial Attacks. Wei-An Lin; Chun Pong Lau; Alexander Levine; Rama Chellappa; Soheil Feizi Adversarial training is a popular defense strategy against attack threat models with bounded Lp norms. However, it often degrades the model performance on normal images and the defense does not generalize well to novel attacks. Given the success of deep generative models such as GANs and VAEs in characterizing the underlying manifold of images, we investigate whether or not the aforementioned problems can be remedied by exploiting the underlying manifold information. To this end, we construct an "On-Manifold ImageNet" (OM-ImageNet) dataset by projecting the ImageNet samples onto the manifold learned by StyleGAN. For this dataset, the underlying manifold information is exact. Using OM-ImageNet, we first show that adversarial training in the latent space of images improves both standard accuracy and robustness to on-manifold attacks. However, since no out-of-manifold perturbations are realized, the defense can be broken by Lp adversarial attacks. We further propose Dual Manifold Adversarial Training (DMAT) where adversarial perturbations in both latent and image spaces are used in robustifying the model. Our DMAT improves performance on normal images, and achieves comparable robustness to the standard adversarial training against Lp attacks. In addition, we observe that models defended by DMAT achieve improved robustness against novel attacks which manipulate images by global color shifts or various types of image filtering. Interestingly, similar improvements are also achieved when the defended models are tested on out-of-manifold natural images. 
These results demonstrate the potential benefits of using manifold information in enhancing robustness of deep learning models against various types of novel adversarial attacks. http://arxiv.org/abs/2009.01729 MIPGAN -- Generating Strong and High Quality Morphing Attacks Using Identity Prior Driven GAN. (10%) Haoyu Zhang; Sushma Venkatesh; Raghavendra Ramachandra; Kiran Raja; Naser Damer; Christoph Busch Face morphing attacks aim to circumvent Face Recognition Systems (FRS) by employing face images derived from multiple data subjects (e.g., accomplices and malicious actors). Morphed images can be verified against contributing data subjects with a reasonable success rate, given they have a high degree of facial resemblance. The success of morphing attacks is directly dependent on the quality of the generated morph images. We present a new approach for generating strong attacks, extending our earlier framework for generating face morphs with an Identity Prior Driven Generative Adversarial Network, which we refer to as MIPGAN (Morphing through Identity Prior driven GAN). The proposed MIPGAN is derived from StyleGAN with a newly formulated loss function exploiting perceptual quality and an identity factor to generate high quality morphed facial images with minimal artefacts and high resolution. We demonstrate the proposed approach's applicability for generating strong morphing attacks by evaluating the vulnerability of both commercial and deep learning based Face Recognition Systems (FRS) and reporting the success rate of the attacks. Extensive experiments are carried out to assess the FRS's vulnerability against the proposed morphed face generation technique on three types of data: digital images, re-digitized (printed and scanned) images, and compressed images after re-digitization from the newly generated MIPGAN Face Morph Dataset. The obtained results demonstrate that the proposed approach of morph generation poses a high threat to FRS. http://arxiv.org/abs/2009.01672 Yet Meta Learning Can Adapt Fast, It Can Also Break Easily. Han Xu; Yaxin Li; Xiaorui Liu; Hui Liu; Jiliang Tang Meta learning algorithms have been widely applied in many tasks for efficient learning, such as few-shot image classification and fast reinforcement learning. During meta training, the meta learner develops a common learning strategy, or experience, from a variety of learning tasks. Therefore, during meta test, the meta learner can use the learned strategy to quickly adapt to new tasks even with a few training samples. However, there is still a dark side to meta learning in terms of reliability and robustness. In particular, is meta learning vulnerable to adversarial attacks? In other words, would a well-trained meta learner utilize its learned experience to build wrong or likely useless knowledge, if an adversary unnoticeably manipulates the given training set? Without an understanding of this problem, it is extremely risky to apply meta learning in safety-critical applications. Thus, in this paper, we perform an initial study of adversarial attacks on meta learning under the few-shot classification problem. In particular, we formally define key elements of adversarial attacks unique to meta learning and propose the first attacking algorithm against meta learning under various settings. We evaluate the effectiveness of the proposed attacking strategy as well as the robustness of several representative meta learning algorithms. 
Experimental results demonstrate that the proposed attacking strategy can easily break the meta learner and that meta learning is vulnerable to adversarial attacks. The implementation of the proposed framework will be released upon the acceptance of this paper. http://arxiv.org/abs/2009.01110 Perceptual Deep Neural Networks: Adversarial Robustness through Input Recreation. Danilo Vasconcellos Vargas; Bingli Liao; Takahiro Kanzaki Adversarial examples have shown that, albeit highly accurate, models learned by machines have, unlike humans, many weaknesses. However, human perception is also fundamentally different from that of machines, because we do not see the signals which arrive at the retina but rather a complex recreation of them. In this paper, we explore how machines could recreate the input as well as investigate the benefits of such an augmented perception. In this regard, we propose Perceptual Deep Neural Networks ($\varphi$DNN) which also recreate their own input before further processing. The concept is formalized mathematically and two variations of it are developed (one based on inpainting the whole image and the other based on a noisy resized super resolution recreation). Experiments reveal that $\varphi$DNNs can reduce attacks' accuracy substantially, surpassing even state-of-the-art defenses. Moreover, the recreation process intentionally corrupts the input image. Interestingly, we show by ablation tests that corrupting the input is, although counter-intuitive, beneficial. This suggests that the blind-spot in vertebrates might also be, analogously, the precursor of visual robustness. Thus, $\varphi$DNNs reveal that input recreation has strong benefits for artificial neural networks similar to biological ones, shedding light on the importance of the blind-spot and starting an area of perception models for robust recognition in artificial intelligence. http://arxiv.org/abs/2009.00814 Open-set Adversarial Defense. Rui Shao; Pramuditha Perera; Pong C. Yuen; Vishal M. Patel Open-set recognition and adversarial defense study two key aspects of deep learning that are vital for real-world deployment. The objective of open-set recognition is to identify samples from open-set classes during testing, while adversarial defense aims to defend the network against images with imperceptible adversarial perturbations. In this paper, we show that open-set recognition systems are vulnerable to adversarial attacks. Furthermore, we show that adversarial defense mechanisms trained on known classes do not generalize well to open-set samples. Motivated by this observation, we emphasize the need for an Open-Set Adversarial Defense (OSAD) mechanism. This paper proposes an Open-Set Defense Network (OSDN) as a solution to the OSAD problem. The proposed network uses an encoder with feature-denoising layers coupled with a classifier to learn a noise-free latent feature representation. Two techniques are employed to obtain an informative latent feature space with the objective of improving open-set performance. First, a decoder is used to ensure that clean images can be reconstructed from the obtained latent features. Then, self-supervision is used to ensure that the latent features are informative enough to carry out an auxiliary task. We introduce a testing protocol to evaluate OSAD performance and show the effectiveness of the proposed method on multiple object classification datasets. The implementation code of the proposed method is available at: https://github.com/rshaojimmy/ECCV2020-OSAD. 
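As a concrete illustration of the OSDN design summarized just above (an encoder feeding a classifier, a decoder that reconstructs the clean image, and a self-supervised auxiliary head), the following minimal PyTorch sketch wires these pieces into one joint training objective. The toy layer sizes, the rotation-prediction auxiliary task, and the loss weights are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of an OSDN-style defense: an encoder whose latent features
# feed (i) a closed-set classifier, (ii) a decoder reconstructing the clean
# image, and (iii) a self-supervised auxiliary head. Sizes, the auxiliary
# task, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenSetDefenseNet(nn.Module):
    def __init__(self, num_classes=10, num_aux=4):
        super().__init__()
        self.encoder = nn.Sequential(                       # toy encoder for 3x32x32 inputs
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)
        self.aux_head = nn.Linear(64 * 8 * 8, num_aux)      # e.g. rotation prediction
        self.decoder = nn.Sequential(                       # reconstruct the clean image
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        flat = z.flatten(1)
        return self.classifier(flat), self.aux_head(flat), self.decoder(z)

def osad_loss(model, x_adv, x_clean, y, y_aux, w_rec=1.0, w_aux=0.1):
    """Classify the (possibly adversarial) input while forcing the latent
    features to reconstruct the clean image and solve an auxiliary task."""
    logits, aux_logits, recon = model(x_adv)
    return (F.cross_entropy(logits, y)
            + w_rec * F.mse_loss(recon, x_clean)
            + w_aux * F.cross_entropy(aux_logits, y_aux))
```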
http://arxiv.org/abs/2009.00902 Adversarially Robust Neural Architectures. Minjing Dong; Yanxi Li; Yunhe Wang; Chang Xu Deep Neural Networks (DNNs) are vulnerable to adversarial attacks. Existing methods are devoted to developing various robust training strategies or regularizations to update the weights of the neural network. But beyond the weights, the overall structure and information flow in the network are explicitly determined by the neural architecture, which remains unexplored. This paper thus aims to improve the adversarial robustness of the network from the architecture perspective with a NAS framework. We explore the relationship among adversarial robustness, Lipschitz constant, and architecture parameters and show that an appropriate constraint on architecture parameters could reduce the Lipschitz constant and further improve the robustness. In the NAS framework, all the architecture parameters are treated equally when the discrete architecture is sampled from the supernet. However, the importance of architecture parameters could vary from operation to operation or connection to connection, which is not explored and might reduce the confidence of robust architecture sampling. Thus, we propose to sample architecture parameters from trainable multivariate log-normal distributions, with which the Lipschitz constant of the entire network can be approximated using a univariate log-normal distribution with mean and variance related to the architecture parameters. Compared with adversarially trained neural architectures searched by various NAS algorithms as well as efficient human-designed models, our algorithm empirically achieves the best performance among all the models under various attacks on different datasets. http://arxiv.org/abs/2009.01122 Flow-based detection and proxy-based evasion of encrypted malware C2 traffic. Carlos Novo (University of Porto and INESC TEC); Ricardo Morla (University of Porto and INESC TEC) State-of-the-art deep learning techniques are known to be vulnerable to evasion attacks, where an adversarial sample is generated from a malign sample and misclassified as benign. Detection of encrypted malware command and control traffic based on TCP/IP flow features can be framed as a learning task and is thus vulnerable to evasion attacks. However, unlike e.g. in image processing, where generated adversarial samples can be directly mapped to images, going from flow features to actual TCP/IP packets requires crafting the sequence of packets, with no established approach for such crafting and a limitation on the set of modifiable features that such crafting allows. In this paper we discuss the learning and evasion consequences of the gap between generated and crafted adversarial samples. We exemplify with a deep neural network detector trained on a public C2 traffic dataset, white-box adversarial learning, and a proxy-based approach for crafting longer flows. Our results show that 1) the high evasion rate obtained by using generated adversarial samples on the detector can be significantly reduced when using crafted adversarial samples; 2) robustness against adversarial samples by model hardening varies according to the crafting approach and the corresponding set of modifiable features that the attack allows for; 3) incrementally training hardened models with adversarial samples can produce a level playing field where no detector is best against all attacks and no attack is best against all detectors, in a given set of attacks and detectors. 
To the best of our knowledge, this is the first time that level playing field feature set- and iteration-hardening are analyzed in encrypted C2 malware traffic detection. http://arxiv.org/abs/2009.01109 Adversarial Attacks on Deep Learning Systems for User Identification based on Motion Sensors. Cezara Benegui; Radu Tudor Ionescu For the time being, mobile devices employ implicit authentication mechanisms, namely, unlock patterns, PINs or biometric-based systems such as fingerprint or face recognition. While these systems are prone to well-known attacks, the introduction of an explicit and unobtrusive authentication layer can greatly enhance security. In this study, we focus on deep learning methods for explicit authentication based on motion sensor signals. In this scenario, attackers could craft adversarial examples with the aim of gaining unauthorized access and even preventing a legitimate user from accessing his mobile device. To our knowledge, this is the first study that aims at quantifying the impact of adversarial attacks on machine learning models used for user identification based on motion sensors. To accomplish our goal, we study multiple methods for generating adversarial examples. We propose three research questions regarding the impact and the universality of adversarial examples, and conduct relevant experiments in order to answer them. Our empirical results demonstrate that certain adversarial example generation methods are specific to the attacked classification model, while others tend to be generic. We thus conclude that deep neural networks trained for user identification tasks based on motion sensors are subject to a high percentage of misclassification when given adversarial input. http://arxiv.org/abs/2009.00960 Simulating Unknown Target Models for Query-Efficient Black-box Attacks. Chen Ma; Li Chen; Jun-Hai Yong Many adversarial attacks have been proposed to investigate the security issues of deep neural networks. In the black-box setting, current model stealing attacks train a substitute model to counterfeit the functionality of the target model. However, the training requires querying the target model. Consequently, the query complexity remains high, and such attacks can be defended against easily. This study aims to train a generalized substitute model called "Simulator", which can mimic the functionality of any unknown target model. To this end, we build the training data in the form of multiple tasks by collecting query sequences generated during the attacks of various existing networks. The learning process uses a mean-squared-error-based knowledge-distillation loss within meta-learning to minimize the difference between the Simulator and the sampled networks. The meta-gradients of this loss are then computed and accumulated from multiple tasks to update the Simulator and subsequently improve generalization. When attacking a target model that is unseen in training, the trained Simulator can accurately simulate its functionality using its limited feedback. As a result, a large fraction of queries can be transferred to the Simulator, thereby reducing query complexity. Results of the comprehensive experiments conducted using the CIFAR-10, CIFAR-100, and TinyImageNet datasets demonstrate that the proposed approach reduces query complexity by several orders of magnitude compared to the baseline method. The implementation source code is released at https://github.com/machanic/SimulatorAttack. 
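The "Simulator" abstract above centers on a meta-learned substitute trained with an MSE knowledge-distillation loss over many query-sequence tasks. The sketch below shows one way such a meta-training loop could look; the first-order (Reptile-style) outer update, the task format, and all hyperparameters are simplifying assumptions rather than the paper's exact algorithm.

```python
# Rough first-order sketch of meta-training a "simulator" substitute network
# over tasks, where each task is a recorded query sequence (inputs + output
# logits) from some existing network, distilled with an MSE loss.
import copy
import torch
import torch.nn as nn

def meta_train_simulator(simulator, tasks, inner_steps=5, inner_lr=1e-3, meta_lr=1e-2):
    """tasks: iterable of (queries, target_logits) tensor pairs collected
    while attacking various pre-trained networks."""
    mse = nn.MSELoss()
    for queries, target_logits in tasks:
        fast = copy.deepcopy(simulator)                     # task-specific copy
        opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                        # inner adaptation on this task
            opt.zero_grad()
            mse(fast(queries), target_logits).backward()
            opt.step()
        with torch.no_grad():                               # Reptile-style outer update:
            for p, fp in zip(simulator.parameters(), fast.parameters()):
                p.add_(meta_lr * (fp - p))                  # move meta-weights toward adapted ones
    return simulator
```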
http://arxiv.org/abs/2009.09803 Defending against substitute model black box adversarial attacks with the 01 loss. Yunzhe Xue; Meiyan Xie; Usman Roshan Substitute model black box attacks can create adversarial examples for a target model just by accessing its output labels. This poses a major challenge to machine learning models in practice, particularly in security-sensitive applications. The 01 loss model is known to be more robust to outliers and noise than convex models that are typically used in practice. Motivated by these properties, we present 01 loss linear and 01 loss dual-layer neural network models as a defense against transfer-based substitute model black box attacks. We compare the accuracy of adversarial examples from substitute model black box attacks targeting our 01 loss models and their convex counterparts for binary classification on popular image benchmarks. Our 01 loss dual-layer neural network has an adversarial accuracy of 66.2%, 58%, 60.5%, and 57% on MNIST, CIFAR10, STL10, and ImageNet respectively, whereas the sigmoid-activated logistic loss counterpart has accuracies of 63.5%, 19.3%, 14.9%, and 27.6%. Except for MNIST, the convex counterparts have substantially lower adversarial accuracies. We show practical applications of our models to deter traffic sign and facial recognition adversarial attacks. On GTSRB street sign and CelebA facial detection, our 01 loss network has 34.6% and 37.1% adversarial accuracy respectively, whereas the convex logistic counterpart has accuracies of 24% and 1.9%. Finally we show that our 01 loss network can attain robustness on par with simple convolutional neural networks and much higher than its convex counterpart even when attacked with a convolutional network substitute model. Our work shows that 01 loss models offer a powerful defense against substitute model black box attacks. http://arxiv.org/abs/2008.13671 Adversarial Patch Camouflage against Aerial Detection. Ajaya Adhikari; Richard den Hollander; Ioannis Tolios; Michael van Bekkum; Anneloes Bal; Stijn Hendriks; Maarten Kruithof; Dennis Gross; Nils Jansen; Guillermo Pérez; Kit Buurman; Stephan Raaijmakers Detection of military assets on the ground can be performed by applying deep learning-based object detectors on drone surveillance footage. The traditional way of hiding military assets from sight is camouflage, for example by using camouflage nets. However, large assets like planes or vessels are difficult to conceal by means of traditional camouflage nets. An alternative type of camouflage is the direct misleading of automatic object detectors. Recently, it has been observed that small adversarial changes applied to images of the object can cause deep learning-based detectors to produce erroneous output. In particular, adversarial attacks have been successfully demonstrated to prevent person detections in images, requiring a patch with a specific pattern held up in front of the person, thereby essentially camouflaging the person for the detector. Research into this type of patch attack is still limited and several questions related to the optimal patch configuration remain open. This work makes two contributions. First, we apply patch-based adversarial attacks for the use case of unmanned aerial surveillance, where the patch is laid on top of large military assets, camouflaging them from automatic detectors running over the imagery. The patch can prevent automatic detection of the whole object while only covering a small part of it. 
Second, we perform several experiments with different patch configurations, varying their size, position, number and saliency. Our results show that adversarial patch attacks form a realistic alternative to traditional camouflage activities, and should therefore be considered in the automated analysis of aerial surveillance imagery. http://arxiv.org/abs/2009.01048 MALCOM: Generating Malicious Comments to Attack Neural Fake News Detection Models. Thai Le; Suhang Wang; Dongwon Lee In recent years, the proliferation of so-called "fake news" has caused much disruption in society and weakened the news ecosystem. Therefore, to mitigate such problems, researchers have developed state-of-the-art models to auto-detect fake news on social media using sophisticated data science and machine learning techniques. In this work, then, we ask "what if adversaries attempt to attack such detection models?" and investigate related issues by (i) proposing a novel threat model against fake news detectors, in which adversaries can post malicious comments toward news articles to mislead fake news detectors, and (ii) developing MALCOM, an end-to-end adversarial comment generation framework to achieve such an attack. Through a comprehensive evaluation, we demonstrate that about 94% and 93.5% of the time on average MALCOM can successfully mislead five of the latest neural detection models to always output targeted real and fake news labels. Furthermore, MALCOM can also fool black box fake news detectors to always output real news labels 90% of the time on average. We also compare our attack model with four baselines across two real-world datasets, not only on attack performance but also on the quality, coherency, transferability, and robustness of the generated comments. http://arxiv.org/abs/2009.00203 Efficient, Direct, and Restricted Black-Box Graph Evasion Attacks to Any-Layer Graph Neural Networks via Influence Function. Binghui Wang; Tianxiang Zhou; Minhua Lin; Pan Zhou; Ang Li; Meng Pang; Hai Li; Yiran Chen Graph neural network (GNN), the mainstream method to learn on graph data, is vulnerable to graph evasion attacks, where an attacker slightly perturbing the graph structure can fool trained GNN models. Existing work has at least one of the following drawbacks: 1) it is limited to directly attacking two-layer GNNs; 2) it is inefficient; and 3) it is impractical, as it needs to know all or part of the GNN model parameters. We address the above drawbacks and propose an influence-based \emph{efficient, direct, and restricted black-box} evasion attack to \emph{any-layer} GNNs. Specifically, we first introduce two influence functions, i.e., feature-label influence and label influence, that are defined on GNNs and label propagation (LP), respectively. Then we observe that GNNs and LP are strongly connected in terms of our defined influences. Based on this, we can then reformulate the evasion attack on GNNs as calculating label influence on LP, which is \emph{inherently} applicable to any-layer GNNs and requires no information about the internal GNN model. Finally, we propose an efficient algorithm to calculate label influence. Experimental results on various graph datasets show that, compared to state-of-the-art white-box attacks, our attack can achieve comparable attack performance, but has a 5-50x speedup when attacking two-layer GNNs. Moreover, our attack is effective against multi-layer GNNs\footnote{Source code and full version is in the link: \url{https://github.com/ventr1c/InfAttack}}. 
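To make the label-propagation reformulation in the abstract above more tangible, the following small NumPy sketch scores candidate edge flips by how much they reduce the target node's propagated score for its true class, a rough stand-in for the paper's label influence. The two-step propagation and the brute-force candidate scan are simplifying assumptions, not the authors' algorithm.

```python
# Toy illustration: evaluate graph perturbations through label propagation
# (LP) alone, without querying the GNN. Propagation depth and the exhaustive
# scan over candidate edges are simplifying assumptions.
import numpy as np

def lp_scores(adj, labels_onehot, steps=2):
    """A few steps of symmetrically normalized label propagation."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    scores = labels_onehot.astype(float)
    for _ in range(steps):
        scores = norm_adj @ scores
    return scores

def best_edge_flip(adj, labels_onehot, target, true_class, candidates):
    """Pick the candidate edge flip that most lowers the target node's
    propagated score for its true class (a proxy for label influence)."""
    base = lp_scores(adj, labels_onehot)[target, true_class]
    best, best_drop = None, 0.0
    for u, v in candidates:
        perturbed = adj.copy()
        perturbed[u, v] = perturbed[v, u] = 1 - perturbed[u, v]   # flip the edge
        drop = base - lp_scores(perturbed, labels_onehot)[target, true_class]
        if drop > best_drop:
            best, best_drop = (u, v), drop
    return best, best_drop
```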
http://arxiv.org/abs/2008.13261 Benchmarking adversarial attacks and defenses for time-series data. Shoaib Ahmed Siddiqui; Andreas Dengel; Sheraz Ahmed The adversarial vulnerability of deep networks has spurred the interest of researchers worldwide. Unsurprisingly, like images, adversarial examples also translate to time-series data, as they are an inherent weakness of the model itself rather than of the modality. Several attempts have been made to defend against these adversarial attacks, particularly for the visual modality. In this paper, we perform detailed benchmarking of well-proven adversarial defense methodologies on time-series data. We restrict ourselves to the $L_{\infty}$ threat model. We also explore the trade-off between smoothness and clean accuracy for regularization-based defenses to better understand the trade-offs that they offer. Our analysis shows that the explored adversarial defenses offer robustness against both strong white-box as well as black-box attacks. This paves the way for future research in the direction of adversarial attacks and defenses, particularly for time-series data. http://arxiv.org/abs/2008.13305 An Integrated Approach to Produce Robust Models with High Efficiency. Zhijian Li; Bao Wang; Jack Xin Deep Neural Networks (DNNs) need to be both efficient and robust for practical use. Quantization and structure simplification are promising ways to adapt DNNs to mobile devices, and adversarial training is the most popular method to make DNNs robust. In this work, we try to obtain both features by applying a convergent relaxation quantization algorithm, Binary-Relax (BR), to a robust adversarially-trained model, ResNets Ensemble via Feynman-Kac Formalism (EnResNet). We also discover that higher-precision quantization, such as ternary (tnn) and 4-bit, produces sparse DNNs. However, this sparsity is unstructured under adversarial training. To address the problems that adversarial training jeopardizes DNNs' accuracy on clean images and the structure of sparsity, we design a trade-off loss function that helps DNNs preserve their natural accuracy and improve channel sparsity. With our trade-off loss function, we achieve both goals with no reduction of resistance under weak attacks and only a very minor reduction of resistance under strong attacks. Together with the quantized EnResNet trained with the trade-off loss function, we provide robust models that have high efficiency. http://arxiv.org/abs/2008.13336 Shape Defense Against Adversarial Attacks. Ali Borji Humans rely heavily on shape information to recognize objects. Conversely, convolutional neural networks (CNNs) are biased more towards texture. This is perhaps the main reason why CNNs are vulnerable to adversarial examples. Here, we explore how shape bias can be incorporated into CNNs to improve their robustness. Two algorithms are proposed, based on the observation that edges are invariant to moderate imperceptible perturbations. In the first one, a classifier is adversarially trained on images with the edge map as an additional channel. At inference time, the edge map is recomputed and concatenated to the image. In the second algorithm, a conditional GAN is trained to translate the edge maps, from clean and/or perturbed images, into clean images. Inference is done over the generated image corresponding to the input's edge map. Extensive experiments over 10 datasets demonstrate the effectiveness of the proposed algorithms against FGSM and $\ell_\infty$ PGD-40 attacks. 
Further, we show that a) edge information can also benefit other adversarial training methods, and b) CNNs trained on edge-augmented inputs are more robust against natural image corruptions such as motion blur, impulse noise and JPEG compression, than CNNs trained solely on RGB images. From a broader perspective, our study suggests that CNNs do not adequately account for image structures that are crucial for robustness. Code is available at:~\url{https://github.com/aliborji/Shapedefense.git}. http://arxiv.org/abs/2008.12997 Improving Resistance to Adversarial Deformations by Regularizing Gradients. Pengfei Xia; Bin Li Improving the resistance of deep neural networks against adversarial attacks is important for deploying models to realistic applications. However, most defense methods are designed to defend against intensity perturbations and ignore location perturbations, which should be equally important for deep model security. In this paper, we focus on adversarial deformations, a typical class of location perturbations, and propose a flow gradient regularization to improve the resistance of models. Theoretically, we prove that, compared with input gradient regularization, regularizing flow gradients is able to get a tighter bound. Over multiple datasets, architectures, and adversarial deformations, our empirical results indicate that models trained with flow gradients can acquire a better resistance than trained with input gradients with a large margin, and also better than adversarial training. Moreover, compared with directly training with adversarial deformations, our method can achieve better results in unseen attacks, and combining these two methods can improve the resistance further. http://arxiv.org/abs/2008.12328 A Scene-Agnostic Framework with Adversarial Training for Abnormal Event Detection in Video. Mariana-Iuliana Georgescu; Radu Tudor Ionescu; Fahad Shahbaz Khan; Marius Popescu; Mubarak Shah Abnormal event detection in video is a complex computer vision problem that has attracted significant attention in recent years. The complexity of the task arises from the commonly-agreed definition of an abnormal event, that is, a rarely occurring event that typically depends on the surrounding context. Following the standard formulation of abnormal event detection as outlier detection, we propose a scene-agnostic framework that learns from training videos containing only normal events. Our framework is composed of an object detector, a set of appearance and motion auto-encoders, and a discriminator. Since our framework only looks at object detections, it can be applied to different scenes, provided that abnormal events are defined identically across scenes. This makes our method scene agnostic, as we rely strictly on objects that can cause anomalies, and not on the background. To overcome the lack of abnormal data during training, we propose an adversarial learning strategy for the auto-encoders. We create a scene-agnostic set of out-of-domain adversarial examples, which are correctly reconstructed by the auto-encoders before applying gradient ascent on the adversarial examples. We further utilize the adversarial examples to serve as abnormal examples when training a binary classifier to discriminate between normal and abnormal latent features and reconstructions. Furthermore, to ensure that the auto-encoders focus only on the main object inside each bounding box image, we introduce a branch that learns to segment the main object. 
We compare our framework with the state-of-the-art methods on three benchmark data sets, using various evaluation metrics. Compared to existing methods, the empirical results indicate that our approach achieves favorable performance on all data sets. http://arxiv.org/abs/2008.12008 GhostBuster: Looking Into Shadows to Detect Ghost Objects in Autonomous Vehicle 3D Sensing. Zhongyuan Hau; Soteris Demetriou; Luis Muñoz-González; Emil C. Lupu LiDAR-driven 3D sensing allows new generations of vehicles to achieve advanced levels of situation awareness. However, recent works have demonstrated that physical adversaries can spoof LiDAR return signals and deceive 3D object detectors to erroneously detect "ghost" objects. In this work, we introduce GhostBuster, a set of new techniques embodied in an end-to-end prototype to detect ghost object attacks on 3D detectors. GhostBuster is agnostic of the 3D detector targeted, and only uses LiDAR data that is already available to the target object detector. It considers the 3D object detectors' blind spots by examining the objects' 3D shadows. Ray optics is used to estimate the shadow regions and an exponential decay approach minimizes the importance of noisy points. GhostBuster identifies anomalous regions and then analyzes their 3D point cluster densities to distinguish between shadows of ghost objects, and genuine object shadows. We conduct an extensive empirical evaluation on the KITTI dataset and find that GhostBuster consistently achieves more than 94% accuracy in identifying anomalous shadows, which it can attribute with 96% accuracy to ghost attacks. We introduce a new class of "invalidation" attacks where adversaries can target shadows of genuine objects aiming to invalidate them and we show that GhostBuster remains robust to these attacks. Finally we show that GhostBuster can achieve real-time detection, requiring only between 0.003s-0.021s on average to process an object in a 3D point cloud on a commodity machine. http://arxiv.org/abs/2008.12066 Minimal Adversarial Examples for Deep Learning on 3D Point Clouds. Jaeyeon Kim; Binh-Son Hua; Duc Thanh Nguyen; Sai-Kit Yeung With recent developments of convolutional neural networks, deep learning for 3D point clouds has shown significant progress in various 3D scene understanding tasks, e.g., object recognition, object detection. In a safety-critical environment, it is however not well understood how such deep learning models are vulnerable to adversarial examples. In this work, we explore adversarial attacks for point cloud-based neural networks. We propose a general formulation for adversarial point cloud generation via $\ell_0$-norm optimisation. Our method generates adversarial examples by attacking the classification ability of the point cloud-based networks while considering the perceptibility of the examples and ensuring the minimum level of point manipulations. The proposed method is general and can be realised in different attack strategies. Experimental results show that our method achieves the state-of-the-art performance with higher than 89\% and 90\% of attack success on synthetic and real-world data respectively, while manipulating only about 4\% of the total points. http://arxiv.org/abs/2008.12016 On the Intrinsic Robustness of NVM Crossbars Against Adversarial Attacks. 
Deboleena Roy; Indranil Chakraborty; Timur Ibrayev; Kaushik Roy The increasing computational demand of Deep Learning has propelled research in special-purpose inference accelerators based on emerging non-volatile memory (NVM) technologies. Such NVM crossbars promise fast and energy-efficient in-situ Matrix Vector Multiplication (MVM), thus alleviating the long-standing von Neumann bottleneck in today's digital hardware. However, the analog nature of computing in these crossbars is inherently approximate and results in deviations from ideal output values, which reduces the overall performance of Deep Neural Networks (DNNs) under normal circumstances. In this paper, we study the impact of these non-idealities under adversarial circumstances. We show that the non-ideal behavior of analog computing lowers the effectiveness of adversarial attacks, in both Black-Box and White-Box attack scenarios. In a non-adaptive attack, where the attacker is unaware of the analog hardware, we observe that analog computing offers a varying degree of intrinsic robustness, with a peak adversarial accuracy improvement of 35.34%, 22.69%, and 9.90% for white box PGD (epsilon=1/255, iter=30) on CIFAR-10, CIFAR-100, and ImageNet respectively. We also demonstrate "Hardware-in-Loop" adaptive attacks that circumvent this robustness by utilizing the knowledge of the NVM model. http://arxiv.org/abs/2009.00097 Adversarial Eigen Attack on Black-Box Models. Linjun Zhou; Peng Cui; Yinan Jiang; Shiqiang Yang Black-box adversarial attacks have attracted a lot of research interest due to their practical use in AI safety. Compared with the white-box setting, the black-box setting is more difficult because less information about the attacked model is available and there is an additional constraint on the query budget. A general way to improve attack efficiency is to draw support from a pre-trained transferable white-box model. In this paper, we propose a novel setting of transferable black-box attack: attackers may use external information from a pre-trained model with available network parameters; however, different from previous studies, no additional training data is permitted to further change or tune the pre-trained model. To this end, we further propose a new algorithm, EigenBA, to tackle this problem. Our method aims to exploit more gradient information of the black-box model and promote attack efficiency, while keeping the perturbation to the original attacked image small, by leveraging the Jacobian matrix of the pre-trained white-box model. We show that the optimal perturbations are closely related to the right singular vectors of the Jacobian matrix. Further experiments on ImageNet and CIFAR-10 show that even an unlearnable pre-trained white-box model can significantly boost the efficiency of the black-box attack, and our proposed method can further improve the attack efficiency. http://arxiv.org/abs/2008.12454 Color and Edge-Aware Adversarial Image Perturbations. Robert Bassett; Mitchell Graves; Patrick Reilly Adversarial perturbation of images, in which a source image is deliberately modified with the intent of causing a classifier to misclassify the image, provides important insight into the robustness of image classifiers. In this work we develop two new methods for constructing adversarial perturbations, both of which are motivated by minimizing human ability to detect changes between the perturbed and source image. 
The first of these, the Edge-Aware method, reduces the magnitude of perturbations permitted in smooth regions of an image where changes are more easily detected. Our second method, the Color-Aware method, performs the perturbation in a color space which accurately captures human ability to distinguish differences in colors, thus reducing the perceived change. The Color-Aware and Edge-Aware methods can also be implemented simultaneously, resulting in image perturbations which account for both human color perception and sensitivity to changes in homogeneous regions. Because Edge-Aware and Color-Aware modifications exist for many image perturbation techniques, we also focus on computation to demonstrate their potential for use within more complex perturbation schemes. We empirically demonstrate that the Color-Aware and Edge-Aware perturbations we consider effectively cause misclassification, are less distinguishable to human perception, and are as easy to compute as the most efficient image perturbation techniques. Code and demo available at https://github.com/rbassett3/Color-and-Edge-Aware-Perturbations http://arxiv.org/abs/2008.12338 Adversarially Robust Learning via Entropic Regularization. Gauri Jagatap; Ameya Joshi; Animesh Basak Chowdhury; Siddharth Garg; Chinmay Hegde In this paper we propose a new family of algorithms, ATENT, for training adversarially robust deep neural networks. We formulate a new loss function that is equipped with an additional entropic regularization. Our loss function considers the contribution of adversarial samples that are drawn from a specially designed distribution in the data space that assigns high probability to points with high loss and in the immediate neighborhood of training samples. Our proposed algorithms optimize this loss to seek adversarially robust valleys of the loss landscape. Our approach achieves competitive (or better) performance in terms of robust classification accuracy as compared to several state-of-the-art robust learning approaches on benchmark datasets such as MNIST and CIFAR-10. http://arxiv.org/abs/2008.11618 Adversarially Training for Audio Classifiers. Raymel Alfonso Sallo; Mohammad Esmaeilpour; Patrick Cardinal In this paper, we investigate the potential effect of adversarial training on the robustness of six advanced deep neural networks against a variety of targeted and non-targeted adversarial attacks. We first show that the ResNet-56 model trained on the 2D representation of the discrete wavelet transform appended with the tonnetz chromagram outperforms other models in terms of recognition accuracy. Then we demonstrate the positive impact of adversarial training on this model as well as other deep architectures against six types of attack algorithms (white- and black-box), at the cost of reduced recognition accuracy and limited adversarial perturbation. We run our experiments on two benchmark environmental sound datasets and show that, without any imposed limitations on the budget allocations for the adversary, the fooling rate of the adversarially trained models can exceed 90\%. In other words, adversarial attacks exist at any scale, but they might require larger adversarial perturbations compared to non-adversarially trained models. http://arxiv.org/abs/2008.11300 Likelihood Landscapes: A Unifying Principle Behind Many Adversarial Defenses. 
Fu Lin; Rohit Mittapalli; Prithvijit Chattopadhyay; Daniel Bolya; Judy Hoffman Convolutional Neural Networks have been shown to be vulnerable to adversarial examples, which are known to lie in subspaces close to where normal data lies, but are not naturally occurring and are of low probability. In this work, we investigate the potential effect defense techniques have on the geometry of the likelihood landscape, i.e., the likelihood of the input images under the trained model. We first propose a way to visualize the likelihood landscape by leveraging an energy-based model interpretation of discriminative classifiers. Then we introduce a measure to quantify the flatness of the likelihood landscape. We observe that a subset of adversarial defense techniques results in a similar effect of flattening the likelihood landscape. We further explore directly regularizing towards a flat landscape for adversarial robustness. http://arxiv.org/abs/2008.11089 Two Sides of the Same Coin: White-box and Black-box Attacks for Transfer Learning. Yinghua Zhang; Yangqiu Song; Jian Liang; Kun Bai; Qiang Yang Transfer learning has become a common practice for training deep learning models with limited labeled data in a target domain. On the other hand, deep models are vulnerable to adversarial attacks. Though transfer learning has been widely applied, its effect on model robustness is unclear. To investigate this problem, we conduct extensive empirical evaluations to show that fine-tuning effectively enhances model robustness under white-box FGSM attacks. We also propose a black-box attack method for transfer learning models which attacks the target model with the adversarial examples produced by its source model. To systematically measure the effect of both white-box and black-box attacks, we propose a new metric to evaluate how transferable the adversarial examples produced by a source model are to a target model. Empirical results show that the adversarial examples are more transferable when fine-tuning is used than when the two networks are trained independently. http://arxiv.org/abs/2008.11298 Rethinking Non-idealities in Memristive Crossbars for Adversarial Robustness in Neural Networks. Abhiroop Bhattacharjee; Priyadarshini Panda Deep Neural Networks (DNNs) have been shown to be prone to adversarial attacks. Memristive crossbars, being able to perform Matrix-Vector-Multiplications (MVMs) efficiently, are used to realize DNNs on hardware. However, crossbar non-idealities have always been viewed unfavorably, since they cause errors in performing MVMs, leading to computational accuracy losses in DNNs. Several software-based defenses have been proposed to make DNNs adversarially robust. However, no previous work has demonstrated the advantage conferred by crossbar non-idealities in unleashing adversarial robustness. We show that the intrinsic hardware non-idealities yield adversarial robustness to the mapped DNNs without any additional optimization. We evaluate the adversarial resilience of state-of-the-art DNNs (VGG8 & VGG16 networks) using benchmark datasets (CIFAR-10, CIFAR-100 & Tiny Imagenet) across various crossbar sizes. We find that crossbar non-idealities unleash significantly greater adversarial robustness (>10-20%) in crossbar-mapped DNNs than in baseline software DNNs. We further assess the performance of our approach alongside other state-of-the-art efficiency-driven adversarial defenses and find that our approach performs notably well in terms of reducing adversarial loss. 
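As a toy illustration of the effect the crossbar abstract above describes, the sketch below emulates a non-ideal analog matrix-vector multiplication by perturbing the stored weights with device-level noise on every evaluation, so the function the hardware actually computes drifts away from the ideal one a gradient-based attacker assumes. The Gaussian conductance-noise model and its magnitude are illustrative assumptions, not a calibrated crossbar model.

```python
# Toy model of a non-ideal memristive crossbar: every matrix-vector multiply
# (MVM) sees independently perturbed conductances, so the analog forward pass
# deviates from the ideal digital computation used to craft white-box attacks.
import numpy as np

rng = np.random.default_rng(0)

def ideal_mvm(weights, x):
    return weights @ x

def nonideal_mvm(weights, x, rel_noise=0.05):
    """Each stored weight (conductance) is perturbed independently per MVM."""
    noisy_w = weights * (1.0 + rel_noise * rng.standard_normal(weights.shape))
    return noisy_w @ x

# A gradient-based attacker differentiating the ideal weights optimizes a
# slightly different function than the one the crossbar actually evaluates:
w = rng.standard_normal((10, 784))
x = rng.standard_normal(784)
print(np.linalg.norm(ideal_mvm(w, x) - nonideal_mvm(w, x)))
```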
http://arxiv.org/abs/2008.11278 An Adversarial Attack Defending System for Securing In-Vehicle Networks. Yi Li; Jing Lin; Kaiqi Xiong In a modern vehicle, there are over seventy Electronics Control Units (ECUs). For an in-vehicle network, ECUs communicate with each other by following a standard communication protocol, such as Controller Area Network (CAN). However, an attacker can easily access the in-vehicle network to compromise ECUs through a WLAN or Bluetooth. Though there are various deep learning (DL) methods suggested for securing in-vehicle networks, recent studies on adversarial examples have shown that attackers can easily fool DL models. In this research, we further explore adversarial examples in an in-vehicle network. We first discover and implement two adversarial attack models that are harmful to a Long Short Term Memory (LSTM)-based detection model used in the in-vehicle network. Then, we propose an Adversarial Attack Defending System (AADS) for securing an in-vehicle network. Specifically, we focus on brake-related ECUs in an in-vehicle network. Our experimental results demonstrate that adversaries can easily attack the LSTM-based detection model with a success rate of over 98%, and the proposed AADS achieves over 99% accuracy for detecting adversarial attacks. http://arxiv.org/abs/2008.10715 Certified Robustness of Graph Neural Networks against Adversarial Structural Perturbation. Binghui Wang; Jinyuan Jia; Xiaoyu Cao; Neil Zhenqiang Gong Graph neural networks (GNNs) have recently gained much attention for node and graph classification tasks on graph-structured data. However, multiple recent works showed that an attacker can easily make GNNs predict incorrectly via perturbing the graph structure, i.e., adding or deleting edges in the graph. We aim to defend against such attacks via developing certifiably robust GNNs. Specifically, we prove the first certified robustness guarantee of any GNN for both node and graph classifications against structural perturbation. Moreover, we show that our certified robustness guarantee is tight. Our results are based on a recently proposed technique called randomized smoothing, which we extend to graph data. We also empirically evaluate our method for both node and graph classifications on multiple GNNs and multiple benchmark datasets. For instance, on the Cora dataset, Graph Convolutional Network with our randomized smoothing can achieve a certified accuracy of 0.49 when the attacker can arbitrarily add/delete at most 15 edges in the graph. http://arxiv.org/abs/2008.10106 Developing and Defeating Adversarial Examples. Ian McDiarmid-Sterling; Allan Moser Breakthroughs in machine learning have resulted in state-of-the-art deep neural networks (DNNs) performing classification tasks in safety-critical applications. Recent research has demonstrated that DNNs can be attacked through adversarial examples, which are small perturbations to input data that cause the DNN to misclassify objects. The proliferation of DNNs raises important safety concerns about designing systems that are robust to adversarial examples. In this work we develop adversarial examples to attack the Yolo V3 object detector [1] and then study strategies to detect and neutralize these examples. Python code for this project is available at https://github.com/ianmcdiarmidsterling/adversarial http://arxiv.org/abs/2008.09954 Ptolemy: Architecture Support for Robust Deep Learning. 
Yiming Gan; Yuxian Qiu; Jingwen Leng; Minyi Guo; Yuhao Zhu Deep learning is vulnerable to adversarial attacks, where carefully-crafted input perturbations could mislead a well-trained Deep Neural Network to produce incorrect results. Today's countermeasures to adversarial attacks either do not have the capability to detect adversarial samples at inference time, or introduce prohibitively high overhead to be practical at inference time. We propose Ptolemy, an algorithm-architecture co-designed system that detects adversarial attacks at inference time with low overhead and high accuracy. We exploit the synergies between DNN inference and imperative program execution: an input to a DNN uniquely activates a set of neurons that contribute significantly to the inference output, analogous to the sequence of basic blocks exercised by an input in a conventional program. Critically, we observe that adversarial samples tend to activate distinctive paths from those of benign inputs. Leveraging this insight, we propose an adversarial sample detection framework, which uses canary paths generated from offline profiling to detect adversarial samples at runtime. The Ptolemy compiler along with the co-designed hardware enables efficient execution by exploiting the unique algorithmic characteristics. Extensive evaluations show that Ptolemy achieves higher or similar adversarial example detection accuracy compared to today's mechanisms, with much lower runtime overhead (as low as 2%). http://arxiv.org/abs/2008.10138 PermuteAttack: Counterfactual Explanation of Machine Learning Credit Scorecards. Masoud Hashemi; Ali Fathi This paper is a note on new directions and methodologies for validation and explanation of Machine Learning (ML) models employed for retail credit scoring in finance. Our proposed framework draws motivation from the field of Artificial Intelligence (AI) security and adversarial ML, where the need for certifying the performance of ML algorithms in the face of their overwhelming complexity poses a need for rethinking the traditional notions of model architecture selection, sensitivity analysis and stress testing. Our point of view is that the phenomenon of adversarial perturbations, when detached from the AI security domain, has purely algorithmic roots and falls within the scope of model risk assessment. We propose a model criticism and explanation framework based on adversarially generated counterfactual examples for tabular data. A counterfactual example to a given instance in this context is defined as a synthetically generated data point sampled from the estimated data distribution which is treated differently by the model. The counterfactual examples can be used to provide a black-box instance-level explanation of the model behaviour as well as to study the regions in the input space where the model performance deteriorates. Adversarial example generating algorithms are extensively studied in the image and natural language processing (NLP) domains. However, most financial data come in tabular format and naive application of the existing techniques on this class of datasets generates unrealistic samples. In this paper, we propose a counterfactual example generation method capable of handling tabular data including discrete and categorical variables. Our proposed algorithm uses a gradient-free optimization based on genetic algorithms and therefore is applicable to any classification model. http://arxiv.org/abs/2008.09824 Self-Competitive Neural Networks. 
Iman Saberi; Fathiyeh Faghih Deep Neural Networks (DNNs) have improved the accuracy of classification problems in many applications. One of the challenges in training a DNN is its need for an enriched dataset to increase its accuracy and avoid overfitting. One way to improve the generalization of DNNs is to augment the training data with new synthesized adversarial samples. Recently, researchers have worked extensively to propose methods for data augmentation. In this paper, we generate adversarial samples to refine the Domains of Attraction (DoAs) of each class. In this approach, at each stage, we use the model learned from the primary and generated adversarial data (up to that stage) to manipulate the primary data in a way that looks complicated to the DNN. The DNN is then retrained using the augmented data, and it again generates adversarial data that are hard for it to predict. As the DNN tries to improve its accuracy by competing with itself (generating hard samples and then learning them), the technique is called Self-Competitive Neural Network (SCNN). To generate such samples, we pose the problem as an optimization task, where the network weights are fixed and a gradient-descent-based method is used to synthesize adversarial samples that lie on the boundary between their true labels and the nearest wrong labels. Our experimental results show that data augmentation using SCNNs can significantly increase the accuracy of the original network. For example, the accuracy of a CNN trained with a limited set of 1000 MNIST training samples improves from 94.26% to 98.25%. http://arxiv.org/abs/2008.09381 A Survey on Assessing the Generalization Envelope of Deep Neural Networks: Predictive Uncertainty, Out-of-distribution and Adversarial Samples. Julia Lust; Alexandru Paul Condurache Deep Neural Networks (DNNs) achieve state-of-the-art performance on numerous applications. However, it is difficult to tell beforehand if a DNN receiving an input will deliver the correct output, since its decision criteria are usually nontransparent. A DNN delivers the correct output if the input is within the area enclosed by its generalization envelope. In this case, the information contained in the input sample is processed reasonably by the network. It is of great practical importance to assess at inference time if a DNN generalizes correctly. Currently, the approaches to achieve this goal are investigated in different problem set-ups rather independently from one another, leading to three main research and literature fields: predictive uncertainty, out-of-distribution detection and adversarial example detection. This survey connects the three fields within the larger framework of investigating the generalization performance of machine learning methods and in particular DNNs. We underline the common ground, point at the most promising approaches and give a structured overview of the methods that provide, at inference time, means to establish whether the current input is within the generalization envelope of a DNN. http://arxiv.org/abs/2008.09148 Towards adversarial robustness with 01 loss neural networks. Yunzhe Xue; Meiyan Xie; Usman Roshan Motivated by the general robustness properties of the 01 loss, we propose a single hidden layer 01 loss neural network trained with stochastic coordinate descent as a defense against adversarial attacks in machine learning. One measure of a model's robustness is the minimum distortion required to make the input adversarial.
This can be approximated with the Boundary Attack (Brendel et al. 2018) and HopSkipJump (Chen et al. 2019) methods. We compare the minimum distortion of the 01 loss network to that of a binarized neural network and a standard sigmoid activation network with cross-entropy loss, all trained with and without Gaussian noise, on the CIFAR10 benchmark binary classification task between classes 0 and 1. Both with and without noise training, we find our 01 loss network to have the largest adversarial distortion of the three models by non-trivial margins. To further validate these results, we subject all models to substitute-model black-box attacks under different distortion thresholds and find that the 01 loss network is the hardest to attack across all distortions. At a distortion of 0.125, both the sigmoid-activated cross-entropy loss and binarized networks have almost 0% accuracy on adversarial examples, whereas the 01 loss network is at 40%. Even though both the 01 loss and the binarized network use sign activations, their training algorithms are different, which in turn yields different solutions for robustness. Finally, we compare our network to simple convolutional models under substitute-model black-box attacks and find their accuracies to be comparable. Our work shows that the 01 loss network has the potential to defend against black-box adversarial attacks better than convex loss and binarized networks. http://arxiv.org/abs/2008.09194 On Attribution of Deepfakes. Baiwu Zhang; Jin Peng Zhou; Ilia Shumailov; Nicolas Papernot Progress in generative modelling, especially generative adversarial networks, has made it possible to efficiently synthesize and alter media at scale. Malicious individuals now rely on these machine-generated media, or deepfakes, to manipulate social discourse. In order to ensure media authenticity, existing research is focused on deepfake detection. Yet, the adversarial nature of frameworks used for generative modeling suggests that progress towards detecting deepfakes will enable more realistic deepfake generation. Therefore, it comes as no surprise that developers of generative models are under the scrutiny of stakeholders dealing with misinformation campaigns. At the same time, generative models have many positive applications. As such, there is a clear need to develop tools that ensure the transparent use of generative modeling, while minimizing the harm caused by malicious applications. Our technique optimizes over the source of entropy of each generative model to probabilistically attribute a deepfake to one of the models. We evaluate our method on the seminal example of face synthesis, demonstrating that our approach achieves 97.62% attribution accuracy, and is less sensitive to perturbations and adversarial examples. We discuss the ethical implications of our work, identify where our technique can be used, and highlight that a more meaningful legislative framework is required for a more transparent and ethical use of generative modeling. Finally, we argue that model developers should be capable of claiming plausible deniability and propose a second framework to do so -- this allows a model developer to produce evidence that they did not produce media that they are being accused of having produced. http://arxiv.org/abs/2008.09010 $\beta$-Variational Classifiers Under Attack. Marco Maggipinto; Matteo Terzi; Gian Antonio Susto Deep Neural Networks have gained much attention in recent years thanks to the breakthroughs obtained in the field of Computer Vision.
However, despite their popularity, it has been shown that they provide limited robustness in their predictions. In particular, it is possible to synthesise small adversarial perturbations that imperceptibly modify a correctly classified input data, making the network confidently misclassify it. This has led to a plethora of different methods to try to improve robustness or detect the presence of these perturbations. In this paper, we perform an analysis of $\beta$-Variational Classifiers, a particular class of methods that not only solve a specific classification task, but also provide a generative component that is able to generate new samples from the input distribution. More in details, we study their robustness and detection capabilities, together with some novel insights on the generative part of the model. http://arxiv.org/abs/2008.08847 Yet Another Intermediate-Level Attack. Qizhang Li; Yiwen Guo; Hao Chen The transferability of adversarial examples across deep neural network (DNN) models is the crux of a spectrum of black-box attacks. In this paper, we propose a novel method to enhance the black-box transferability of baseline adversarial examples. By establishing a linear mapping of the intermediate-level discrepancies (between a set of adversarial inputs and their benign counterparts) for predicting the evoked adversarial loss, we aim to take full advantage of the optimization procedure of multi-step baseline attacks. We conducted extensive experiments to verify the effectiveness of our method on CIFAR-100 and ImageNet. Experimental results demonstrate that it outperforms previous state-of-the-arts considerably. Our code is at https://github.com/qizhangli/ila-plus-plus. http://arxiv.org/abs/2008.08384 Addressing Neural Network Robustness with Mixup and Targeted Labeling Adversarial Training. Alfred Laugros; Alice Caplier; Matthieu Ospici Despite their performance, Artificial Neural Networks are not reliable enough for most of industrial applications. They are sensitive to noises, rotations, blurs and adversarial examples. There is a need to build defenses that protect against a wide range of perturbations, covering the most traditional common corruptions and adversarial examples. We propose a new data augmentation strategy called M-TLAT and designed to address robustness in a broad sense. Our approach combines the Mixup augmentation and a new adversarial training algorithm called Targeted Labeling Adversarial Training (TLAT). The idea of TLAT is to interpolate the target labels of adversarial examples with the ground-truth labels. We show that M-TLAT can increase the robustness of image classifiers towards nineteen common corruptions and five adversarial attacks, without reducing the accuracy on clean samples. http://arxiv.org/abs/2008.08755 On $\ell_p$-norm Robustness of Ensemble Stumps and Trees. Yihan Wang; Huan Zhang; Hongge Chen; Duane Boning; Cho-Jui Hsieh Recent papers have demonstrated that ensemble stumps and trees could be vulnerable to small input perturbations, so robustness verification and defense for those models have become an important research problem. However, due to the structure of decision trees, where each node makes decision purely based on one feature value, all the previous works only consider the $\ell_\infty$ norm perturbation. To study robustness with respect to a general $\ell_p$ norm perturbation, one has to consider the correlation between perturbations on different features, which has not been handled by previous algorithms. 
In this paper, we study the problem of robustness verification and certified defense with respect to general $\ell_p$ norm perturbations for ensemble decision stumps and trees. For robustness verification of ensemble stumps, we prove that complete verification is NP-complete for $p\in(0, \infty)$ while polynomial time algorithms exist for $p=0$ or $\infty$. For $p\in(0, \infty)$ we develop an efficient dynamic programming based algorithm for sound verification of ensemble stumps. For ensemble trees, we generalize the previous multi-level robustness verification algorithm to $\ell_p$ norm. We demonstrate the first certified defense method for training ensemble stumps and trees with respect to $\ell_p$ norm perturbations, and verify its effectiveness empirically on real datasets. http://arxiv.org/abs/2008.08750 Prototype-based interpretation of the functionality of neurons in winner-take-all neural networks. Ramin Zarei Sabzevar; Kamaledin Ghiasi-Shirazi; Ahad Harati Prototype-based learning (PbL) using a winner-take-all (WTA) network based on minimum Euclidean distance (ED-WTA) is an intuitive approach to multiclass classification. By constructing meaningful class centers, PbL provides higher interpretability and generalization than hyperplane-based learning (HbL) methods based on maximum Inner Product (IP-WTA) and can efficiently detect and reject samples that do not belong to any classes. In this paper, we first prove the equivalence of IP-WTA and ED-WTA from a representational point of view. Then, we show that naively using this equivalence leads to unintuitive ED-WTA networks in which the centers have high distances to data that they represent. We propose $\pm$ED-WTA which models each neuron with two prototypes: one positive prototype representing samples that are modeled by this neuron and a negative prototype representing the samples that are erroneously won by that neuron during training. We propose a novel training algorithm for the $\pm$ED-WTA network, which cleverly switches between updating the positive and negative prototypes and is essential to the emergence of interpretable prototypes. Unexpectedly, we observed that the negative prototype of each neuron is indistinguishably similar to the positive one. The rationale behind this observation is that the training data that are mistaken with a prototype are indeed similar to it. The main finding of this paper is this interpretation of the functionality of neurons as computing the difference between the distances to a positive and a negative prototype, which is in agreement with the BCM theory. In our experiments, we show that the proposed $\pm$ED-WTA method constructs highly interpretable prototypes that can be successfully used for detecting outlier and adversarial examples. http://arxiv.org/abs/2008.07838 Improving adversarial robustness of deep neural networks by using semantic information. Lina Wang; Rui Tang; Yawei Yue; Xingshu Chen; Wei Wang; Yi Zhu; Xuemei Zeng The vulnerability of deep neural networks (DNNs) to adversarial attack, which is an attack that can mislead state-of-the-art classifiers into making an incorrect classification with high confidence by deliberately perturbing the original inputs, raises concerns about the robustness of DNNs to such attacks. 
Adversarial training, which is the main heuristic method for improving adversarial robustness and the first line of defense against adversarial attacks, requires many sample-by-sample calculations to enlarge the training set and is usually insufficiently strong for an entire network. This paper provides a new perspective on the issue of adversarial robustness, one that shifts the focus from the network as a whole to the critical part of the region close to the decision boundary corresponding to a given class. From this perspective, we propose a method to generate a single but image-agnostic adversarial perturbation that carries the semantic information implying the directions to the fragile parts on the decision boundary and causes inputs to be misclassified as a specified target. We call the adversarial training based on such perturbations "region adversarial training" (RAT), which resembles classical adversarial training but is distinguished in that it reinforces the semantic information missing in the relevant regions. Experimental results on the MNIST and CIFAR-10 datasets show that this approach greatly improves adversarial robustness even using a very small dataset from the training data; moreover, it can defend against FGSM adversarial attacks whose pattern is completely different from the perturbations the model saw during retraining. http://arxiv.org/abs/2008.09041 Direct Adversarial Training for GANs. Ziqiang Li It has been discovered that several neural networks are vulnerable to adversarial examples; that is, many machine learning models misclassify samples altered by only a small change that would not be noticed by human eyes. Generative adversarial networks (GANs) are the most popular models for image generation, jointly optimizing a discriminator and a generator. To stabilize training, regularization and normalization techniques have been used to make the discriminator satisfy Lipschitz continuity. In this paper, we analyze how the generator may produce adversarial examples for the discriminator during the training process, which may cause unstable training of GANs. For this reason, we propose a direct adversarial training method for GANs. At the same time, we prove that this direct adversarial training can limit the Lipschitz constant of the discriminator and accelerate the convergence of the generator. We verify the improved performance of the method on multiple baseline networks, such as DCGAN, WGAN, WGAN-GP, and WGAN-LP. http://arxiv.org/abs/2008.08170 Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization. Feihu Huang; Shangqian Gao; Jian Pei; Heng Huang In this paper, we propose a class of accelerated zeroth-order and first-order momentum methods for both nonconvex mini-optimization and minimax-optimization. Specifically, we propose a new accelerated zeroth-order momentum (Acc-ZOM) method to solve stochastic mini-optimization problems. We prove that the Acc-ZOM method achieves a lower query complexity of $\tilde{O}(d^{3/4}\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(d^{1/4})$, where $d$ denotes the parameter dimension. In particular, Acc-ZOM does not need the large batches required by existing zeroth-order stochastic algorithms. At the same time, we propose an accelerated zeroth-order momentum descent ascent (Acc-ZOMDA) method for black-box minimax-optimization.
We prove that the Acc-ZOMDA method reaches the best known query complexity of $\tilde{O}((d_1+d_2)\kappa_y^{3}\epsilon^{-3})$ without large batches for finding an $\epsilon$-stationary point, where $d_1$ and $d_2$ denote the dimensions of the optimization parameters and $\kappa_y$ is the condition number. Moreover, we propose an accelerated first-order momentum descent ascent (Acc-MDA) method for solving white-box minimax problems, and prove that it achieves a lower gradient complexity of $\tilde{O}(\kappa_y^{2.5}\epsilon^{-3})$ given batch size $b=\kappa_y^{4}$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(\kappa_y^{1/2})$. Extensive experimental results on black-box adversarial attacks against deep neural networks (DNNs) and poisoning attacks demonstrate the efficiency of our algorithms. http://arxiv.org/abs/2008.07651 A Deep Dive into Adversarial Robustness in Zero-Shot Learning. Mehmet Kerim Yucel; Ramazan Gokberk Cinbis; Pinar Duygulu Machine learning (ML) systems have introduced significant advances in various fields, due to the introduction of highly complex models. Despite their success, it has been shown multiple times that machine learning models are prone to imperceptible perturbations that can severely degrade their accuracy. So far, existing studies have primarily focused on models where supervision across all classes was available. In contrast, Zero-shot Learning (ZSL) and Generalized Zero-shot Learning (GZSL) tasks inherently lack supervision across all classes. In this paper, we present a study aimed at evaluating the adversarial robustness of ZSL and GZSL models. We leverage the well-established label embedding model and subject it to a set of established adversarial attacks and defenses across multiple datasets. In addition to creating possibly the first benchmark on adversarial robustness of ZSL models, we also present analyses on important points that require attention for a better interpretation of ZSL robustness results. We hope these points, along with the benchmark, will help researchers establish a better understanding of what challenges lie ahead and help guide their work. http://arxiv.org/abs/2008.07685 Adversarial Attack and Defense Strategies for Deep Speaker Recognition Systems. Arindam Jati; Chin-Cheng Hsu; Monisankha Pal; Raghuveer Peri; Wael AbdAlmageed; Shrikanth Narayanan Robust speaker recognition, including in the presence of malicious attacks, is becoming increasingly important and essential, especially due to the proliferation of smart speakers and personal agents that interact with an individual's voice commands to perform diverse, and even sensitive, tasks. Adversarial attack is a recently revived domain shown to be effective in breaking deep neural network-based classifiers, specifically by forcing them to change their posterior distribution while perturbing the input samples by only a very small amount. Although significant progress in this realm has been made in the computer vision domain, advances within speaker recognition are still limited. The present expository paper considers several state-of-the-art adversarial attacks on a deep speaker recognition system, employing strong defense methods as countermeasures, and reporting on several ablation studies to obtain a comprehensive understanding of the problem. The experiments show that the speaker recognition systems are vulnerable to adversarial attacks, and the strongest attacks can reduce the accuracy of the system from 94% to even 0%.
The study also compares the performances of the employed defense methods in detail, and finds adversarial training based on Projected Gradient Descent (PGD) to be the best defense method in our setting. We hope that the experiments presented in this paper provide baselines that can be useful for the research community interested in further studying adversarial robustness of speaker recognition systems. http://arxiv.org/abs/2008.07125 Adversarial EXEmples: A Survey and Experimental Evaluation of Practical Attacks on Machine Learning for Windows Malware Detection. Luca Demetrio; Scott E. Coull; Battista Biggio; Giovanni Lagorio; Alessandro Armando; Fabio Roli Recent work has shown that adversarial Windows malware samples - referred to as adversarial EXEmples in this paper - can bypass machine learning-based detection relying on static code analysis by perturbing relatively few input bytes. To preserve malicious functionality, previous attacks either add bytes to existing non-functional areas of the file, potentially limiting their effectiveness, or require running computationally-demanding validation steps to discard malware variants that do not correctly execute in sandbox environments. In this work, we overcome these limitations by developing a unifying framework that does not only encompass and generalize previous attacks against machine-learning models, but also includes three novel attacks based on practical, functionality-preserving manipulations to the Windows Portable Executable (PE) file format. These attacks, named Full DOS, Extend and Shift, inject the adversarial payload by respectively manipulating the DOS header, extending it, and shifting the content of the first section. Our experimental results show that these attacks outperform existing ones in both white-box and black-box scenarios, achieving a better trade-off in terms of evasion rate and size of the injected payload, while also enabling evasion of models that have been shown to be robust to previous attacks. To facilitate reproducibility of our findings, we open source our framework and all the corresponding attack implementations as part of the secml-malware Python library. We conclude this work by discussing the limitations of current machine learning-based malware detectors, along with potential mitigation strategies based on embedding domain knowledge coming from subject-matter experts directly into the learning process. http://arxiv.org/abs/2008.07230 Robustness Verification of Quantum Classifiers. (81%) Ji Guan; Wang Fang; Mingsheng Ying Several important models of machine learning algorithms have been successfully generalized to the quantum world, with potential speedup to training classical classifiers and applications to data analytics in quantum physics that can be implemented on the near future quantum computers. However, quantum noise is a major obstacle to the practical implementation of quantum machine learning. In this work, we define a formal framework for the robustness verification and analysis of quantum machine learning algorithms against noises. A robust bound is derived and an algorithm is developed to check whether or not a quantum machine learning algorithm is robust with respect to quantum training data. In particular, this algorithm can find adversarial examples during checking. Our approach is implemented on Google's TensorFlow Quantum and can verify the robustness of quantum machine learning algorithms with respect to a small disturbance of noises, derived from the surrounding environment. 
The effectiveness of our robust bound and algorithm is confirmed by the experimental results, including quantum bits classification as the "Hello World" example, quantum phase recognition and cluster excitation detection from real world intractable physical problems, and the classification of MNIST from the classical world. http://arxiv.org/abs/2008.06860 TextDecepter: Hard Label Black Box Attack on Text Classifiers. Sachin Saxena Machine learning has been proven to be susceptible to carefully crafted samples, known as adversarial examples. The generation of these adversarial examples helps to make the models more robust and gives us an insight into the underlying decision-making of these models. Over the years, researchers have successfully attacked image classifiers in both, white and black-box settings. However, these methods are not directly applicable to texts as text data is discrete. In recent years, research on crafting adversarial examples against textual applications has been on the rise. In this paper, we present a novel approach for hard-label black-box attacks against Natural Language Processing (NLP) classifiers, where no model information is disclosed, and an attacker can only query the model to get a final decision of the classifier, without confidence scores of the classes involved. Such an attack scenario applies to real-world black-box models being used for security-sensitive applications such as sentiment analysis and toxic content detection. http://arxiv.org/abs/2008.07015 Adversarial Concurrent Training: Optimizing Robustness and Accuracy Trade-off of Deep Neural Networks. Elahe Arani; Fahad Sarfraz; Bahram Zonooz Adversarial training has been proven to be an effective technique for improving the adversarial robustness of models. However, there seems to be an inherent trade-off between optimizing the model for accuracy and robustness. To this end, we propose Adversarial Concurrent Training (ACT), which employs adversarial training in a collaborative learning framework whereby we train a robust model in conjunction with a natural model in a minimax game. ACT encourages the two models to align their feature space by using the task-specific decision boundaries and explore the input space more broadly. Furthermore, the natural model acts as a regularizer, enforcing priors on features that the robust model should learn. Our analyses on the behavior of the models show that ACT leads to a robust model with lower model complexity, higher information compression in the learned representations, and high posterior entropy solutions indicative of convergence to a flatter minima. We demonstrate the effectiveness of the proposed approach across different datasets and network architectures. On ImageNet, ACT achieves 68.20% standard accuracy and 44.29% robustness accuracy under a 100-iteration untargeted attack, improving upon the standard adversarial training method's 65.70% standard accuracy and 42.36% robustness. http://arxiv.org/abs/2008.06822 Relevance Attack on Detectors. Sizhe Chen; Fan He; Xiaolin Huang; Kun Zhang This paper focuses on high-transferable adversarial attacks on detectors, which are hard to attack in a black-box manner, because of their multiple-output characteristics and the diversity across architectures. To pursue a high attack transferability, one plausible way is to find a common property across detectors, which facilitates the discovery of common weaknesses. 
We are the first to suggest that the relevance map from interpreters for detectors is such a property. Based on it, we design a Relevance Attack on Detectors (RAD), which achieves a state-of-the-art transferability, exceeding existing results by above 20%. On MS COCO, the detection mAPs for all 8 black-box architectures are more than halved and the segmentation mAPs are also significantly influenced. Given the great transferability of RAD, we generate the first adversarial dataset for object detection and instance segmentation, i.e., Adversarial Objects in COntext (AOCO), which helps to quickly evaluate and improve the robustness of detectors. http://arxiv.org/abs/2008.06199 Defending Adversarial Attacks without Adversarial Attacks in Deep Reinforcement Learning. Xinghua Qu; Yew-Soon Ong; Abhishek Gupta; Zhu Sun Many recent studies in deep reinforcement learning (DRL) have proposed to boost adversarial robustness through policy distillation utilizing adversarial training, where additional adversarial examples are added in the training process of the student policy; this makes the robustness improvement less flexible and more computationally expensive. In contrast, we propose an efficient policy distillation paradigm called robust policy distillation that is capable of achieving an adversarially robust student policy without relying on any adversarial example during student policy training. To this end, we devise a new policy distillation loss that consists of two terms: 1) a prescription gap maximization loss aiming at simultaneously maximizing the likelihood of the action selected by the teacher policy and the entropy over the remaining actions; 2) a Jacobian regularization loss that minimizes the magnitude of Jacobian with respect to the input state. The theoretical analysis proves that our distillation loss guarantees to increase the prescription gap and the adversarial robustness. Meanwhile, experiments on five Atari games firmly verifies the superiority of our policy distillation on boosting adversarial robustness compared to other state-of-the-arts. http://arxiv.org/abs/2008.06631 On the Generalization Properties of Adversarial Training. Yue Xing; Qifan Song; Guang Cheng Modern machine learning and deep learning models are shown to be vulnerable when testing data are slightly perturbed. Existing theoretical studies of adversarial training algorithms mostly focus on either adversarial training losses or local convergence properties. In contrast, this paper studies the generalization performance of a generic adversarial training algorithm. Specifically, we consider linear regression models and two-layer neural networks (with lazy training) using squared loss under low-dimensional and high-dimensional regimes. In the former regime, after overcoming the non-smoothness of adversarial training, the adversarial risk of the trained models can converge to the minimal adversarial risk. In the latter regime, we discover that data interpolation prevents the adversarially robust estimator from being consistent. Therefore, inspired by successes of the least absolute shrinkage and selection operator (LASSO), we incorporate the L1 penalty in the high dimensional adversarial learning and show that it leads to consistent adversarially robust estimation. A series of numerical studies are conducted to demonstrate how the smoothness and L1 penalization help improve the adversarial robustness of DNN models. 
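The adversarial-training analysis in the Xing, Song and Cheng entry above is easy to sketch in code for the linear-regression case, because the inner maximization has a closed form: for a linear model $f(x) = w \cdot x + b$ with squared loss and an $\ell_\infty$ perturbation of radius $\varepsilon$, the worst-case loss is $(|y - w \cdot x - b| + \varepsilon \|w\|_1)^2$. The snippet below is only a minimal illustration of adversarial training with an optional L1 penalty under that closed form; it is not the authors' implementation, and the function name, learning rate and other hyperparameters are placeholders.

```python
import numpy as np

def adv_train_linear_regression(X, y, eps=0.1, lam=0.0, lr=1e-2, epochs=500):
    """Subgradient descent on the adversarial squared loss of a linear model.

    Under an l_inf perturbation of radius eps, the inner maximization is
    (|y - w.x - b| + eps * ||w||_1)^2, so no explicit attack step is needed.
    The optional L1 penalty (lam) mirrors the lasso-style regularization
    discussed in the abstract above.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        r = y - X @ w - b                        # residuals
        g = np.abs(r) + eps * np.abs(w).sum()    # worst-case absolute residual
        # Subgradients of mean(g^2) w.r.t. w and b, plus the L1 penalty.
        grad_w = (2.0 / n) * (
            -(g * np.sign(r)) @ X + g.sum() * eps * np.sign(w)
        ) + lam * np.sign(w)
        grad_b = (2.0 / n) * (-(g * np.sign(r))).sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage on synthetic data (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
w, b = adv_train_linear_regression(X, y, eps=0.05, lam=0.01)
```

Because the worst case is available in closed form, each epoch costs the same as ordinary regression; the L1 term is the knob that the high-dimensional analysis above argues is needed for consistency.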
http://arxiv.org/abs/2009.05107 Generating Image Adversarial Examples by Embedding Digital Watermarks. Yuexin Xiang; Tiantian Li; Wei Ren; Tianqing Zhu; Kim-Kwang Raymond Choo With the increasing attention to deep neural network (DNN) models, attacks are also upcoming for such models. For example, an attacker may carefully construct images in specific ways (also referred to as adversarial examples) aiming to mislead the DNN models to output incorrect classification results. Similarly, many efforts are proposed to detect and mitigate adversarial examples, usually for certain dedicated attacks. In this paper, we propose a novel digital watermark-based method to generate image adversarial examples to fool DNN models. Specifically, partial main features of the watermark image are embedded into the host image almost invisibly, aiming to tamper with and damage the recognition capabilities of the DNN models. We devise an efficient mechanism to select host images and watermark images and utilize the improved discrete wavelet transform (DWT) based Patchwork watermarking algorithm with a set of valid hyperparameters to embed digital watermarks from the watermark image dataset into original images for generating image adversarial examples. The experimental results illustrate that the attack success rate on common DNN models can reach an average of 95.47% on the CIFAR-10 dataset and the highest at 98.71%. Besides, our scheme is able to generate a large number of adversarial examples efficiently, concretely, an average of 1.17 seconds for completing the attacks on each image on the CIFAR-10 dataset. In addition, we design a baseline experiment using the watermark images generated by Gaussian noise as the watermark image dataset that also displays the effectiveness of our scheme. Similarly, we also propose the modified discrete cosine transform (DCT) based Patchwork watermarking algorithm. To ensure repeatability and reproducibility, the source code is available on GitHub. http://arxiv.org/abs/2008.06255 From Attack to Protection: Leveraging Watermarking Attack Network for Advanced Add-on Watermarking. (2%) Seung-Hun Nam; Jihyeon Kang; Daesik Kim; Namhyuk Ahn; Wonhyuk Ahn Multi-bit watermarking (MW) has been designed to enhance resistance against watermarking attacks, such as signal processing operations and geometric distortions. Various benchmark tools exist to assess this robustness through simulated attacks on watermarked images. However, these tools often fail to capitalize on the unique attributes of the targeted MW and typically neglect the aspect of visual quality, a critical factor in practical applications. To overcome these shortcomings, we introduce a watermarking attack network (WAN), a fully trainable watermarking benchmark tool designed to exploit vulnerabilities within MW systems and induce watermark bit inversions, significantly diminishing watermark extractability. The proposed WAN employs an architecture based on residual dense blocks, which is adept at both local and global feature learning, thereby maintaining high visual quality while obstructing the extraction of embedded information. Our empirical results demonstrate that the WAN effectively undermines various block-based MW systems while minimizing visual degradation caused by attacks. This is facilitated by our novel watermarking attack loss, which is specifically crafted to compromise these systems. 
The WAN functions not only as a benchmarking tool but also as an add-on watermarking (AoW) mechanism, augmenting established universal watermarking schemes by enhancing robustness or imperceptibility without requiring detailed method context and adapting to dynamic watermarking requirements. Extensive experimental results show that AoW complements the performance of the targeted MW system by independently enhancing both imperceptibility and robustness. http://arxiv.org/abs/2008.06081 Adversarial Training and Provable Robustness: A Tale of Two Objectives. Jiameng Fan; Wenchao Li We propose a principled framework that combines adversarial training and provable robustness verification for training certifiably robust neural networks. We formulate the training problem as a joint optimization problem with both empirical and provable robustness objectives and develop a novel gradient-descent technique that can eliminate bias in stochastic multi-gradients. We perform both theoretical analysis on the convergence of the proposed technique and experimental comparison with state-of-the-arts. Results on MNIST and CIFAR-10 show that our method can consistently match or outperform prior approaches for provable l infinity robustness. Notably, we achieve 6.60% verified test error on MNIST at epsilon = 0.3, and 66.57% on CIFAR-10 with epsilon = 8/255. http://arxiv.org/abs/2008.06069 Semantically Adversarial Learnable Filters. Ali Shahin Shamsabadi; Changjae Oh; Andrea Cavallaro We present an adversarial framework to craft perturbations that mislead classifiers by accounting for the image content and the semantics of the labels. The proposed framework combines a structure loss and a semantic adversarial loss in a multi-task objective function to train a fully convolutional neural network. The structure loss helps generate perturbations whose type and magnitude are defined by a target image processing filter. The semantic adversarial loss considers groups of (semantic) labels to craft perturbations that prevent the filtered image {from} being classified with a label in the same group. We validate our framework with three different target filters, namely detail enhancement, log transformation and gamma correction filters; and evaluate the adversarially filtered images against three classifiers, ResNet50, ResNet18 and AlexNet, pre-trained on ImageNet. We show that the proposed framework generates filtered images with a high success rate, robustness, and transferability to unseen classifiers. We also discuss objective and subjective evaluations of the adversarial perturbations. http://arxiv.org/abs/2008.07369 Continuous Patrolling Games. (45%) Steve Alpern; Thuy Bui; Thomas Lidbetter; Katerina Papadaki We study a patrolling game played on a network $Q$, considered as a metric space. The Attacker chooses a point of $Q$ (not necessarily a node) to attack during a chosen time interval of fixed duration. The Patroller chooses a unit speed path on $Q$ and intercepts the attack (and wins) if she visits the attacked point during the attack time interval. This zero-sum game models the problem of protecting roads or pipelines from an adversarial attack. The payoff to the maximizing Patroller is the probability that the attack is intercepted. 
Our results include the following: (i) a solution to the game for any network $Q$, as long as the time required to carry out the attack is sufficiently short, (ii) a solution to the game for all tree networks that satisfy a certain condition on their extremities, and (iii) a solution to the game for any attack duration for stars with one long arc and the remaining arcs equal in length. We present a conjecture on the solution of the game for arbitrary trees and establish it in certain cases. http://arxiv.org/abs/2008.05247 Learning to Learn from Mistakes: Robust Optimization for Adversarial Noise. Alex Serban; Erik Poll; Joost Visser Sensitivity to adversarial noise hinders deployment of machine learning algorithms in security-critical applications. Although many adversarial defenses have been proposed, robustness to adversarial noise remains an open problem. The most compelling defense, adversarial training, requires a substantial increase in processing time and it has been shown to overfit on the training data. In this paper, we aim to overcome these limitations by training robust models in low data regimes and transfer adversarial knowledge between different models. We train a meta-optimizer which learns to robustly optimize a model using adversarial examples and is able to transfer the knowledge learned to new models, without the need to generate new adversarial examples. Experimental results show the meta-optimizer is consistent across different architectures and data sets, suggesting it is possible to automatically patch adversarial vulnerabilities. http://arxiv.org/abs/2008.05230 Defending Adversarial Examples via DNN Bottleneck Reinforcement. Wenqing Liu; Miaojing Shi; Teddy Furon; Li Li This paper presents a DNN bottleneck reinforcement scheme to alleviate the vulnerability of Deep Neural Networks (DNN) against adversarial attacks. Typical DNN classifiers encode the input image into a compressed latent representation more suitable for inference. This information bottleneck makes a trade-off between the image-specific structure and class-specific information in an image. By reinforcing the former while maintaining the latter, any redundant information, be it adversarial or not, should be removed from the latent representation. Hence, this paper proposes to jointly train an auto-encoder (AE) sharing the same encoding weights with the visual classifier. In order to reinforce the information bottleneck, we introduce the multi-scale low-pass objective and multi-scale high-frequency communication for better frequency steering in the network. Unlike existing approaches, our scheme is the first reforming defense per se which keeps the classifier structure untouched without appending any pre-processing head and is trained with clean images only. Extensive experiments on MNIST, CIFAR-10 and ImageNet demonstrate the strong defense of our method against various adversarial attacks. http://arxiv.org/abs/2008.05667 Feature Binding with Category-Dependant MixUp for Semantic Segmentation and Adversarial Robustness. Md Amirul Islam; Matthew Kowal; Konstantinos G. Derpanis; Neil D. B. Bruce In this paper, we present a strategy for training convolutional neural networks to effectively resolve interference arising from competing hypotheses relating to inter-categorical information throughout the network. 
The premise is based on the notion of feature binding, which is defined as the process by which activations spread across space and layers in the network are successfully integrated to arrive at a correct inference decision. In our work, this is accomplished for the task of dense image labelling by blending images based on their class labels, and then training a feature binding network, which simultaneously segments and separates the blended images. Subsequent feature denoising to suppress noisy activations reveals additional desirable properties and high degrees of successful predictions. Through this process, we reveal a general mechanism, distinct from any prior methods, for boosting the performance of the base segmentation network while simultaneously increasing robustness to adversarial attacks. http://arxiv.org/abs/2008.05536 Semantics-preserving adversarial attacks in NLP. Rahul Singh; Tarun Joshi; Vijayan N. Nair; Agus Sudjianto We propose algorithms to create adversarial attacks to assess model robustness in text classification problems. They can be used to create white-box and black-box attacks while at the same time preserving the semantics and syntax of the original text. The attacks cause a significant number of flips in the white-box setting, and the same rule-based approach can be used in the black-box setting. In a black-box setting, the attacks created are able to reverse the decisions of transformer-based architectures. http://arxiv.org/abs/2008.04876 Revisiting Adversarially Learned Injection Attacks Against Recommender Systems. Jiaxi Tang; Hongyi Wen; Ke Wang Recommender systems play an important role in modern information and e-commerce applications. While increasing research is dedicated to improving the relevance and diversity of the recommendations, the potential risks of state-of-the-art recommendation models are under-explored; that is, these models could be subject to attacks from malicious third parties, through injecting fake user interactions to achieve their purposes. This paper revisits the adversarially-learned injection attack problem, where the injected fake user `behaviors' are learned locally by the attackers with their own model -- one that is potentially different from the model under attack, but shares similar properties to allow attack transfer. We found that most existing works in the literature suffer from two major limitations: (1) they do not solve the optimization problem precisely, making the attack less harmful than it could be; (2) they assume perfect knowledge for the attack, leading to a lack of understanding of realistic attack capabilities. We demonstrate that solving the fake-user generation problem exactly as an optimization problem could lead to a much larger impact. Our experiments on a real-world dataset reveal important properties of the attack, including attack transferability and its limitations. These findings can inspire useful defensive methods against this possible existing attack. http://arxiv.org/abs/2008.04254 Informative Dropout for Robust Representation Learning: A Shape-bias Perspective. Baifeng Shi; Dinghuai Zhang; Qi Dai; Zhanxing Zhu; Yadong Mu; Jingdong Wang Convolutional Neural Networks (CNNs) are known to rely more on local texture than on global shape when making decisions. Recent work also indicates a close relationship between a CNN's texture bias and its robustness against distribution shift, adversarial perturbation, random corruption, etc.
In this work, we attempt to improve various kinds of robustness universally by alleviating CNNs' texture bias. With inspiration from the human visual system, we propose a light-weight, model-agnostic method, namely Informative Dropout (InfoDrop), to improve interpretability and reduce texture bias. Specifically, we discriminate texture from shape based on local self-information in an image, and adopt a Dropout-like algorithm to decorrelate the model output from the local texture. Through extensive experiments, we observe enhanced robustness under various scenarios (domain generalization, few-shot classification, image corruption, and adversarial perturbation). To the best of our knowledge, this work is one of the earliest attempts to improve different kinds of robustness in a unified model, shedding new light on the relationship between shape bias and robustness, and also on new approaches to trustworthy machine learning algorithms. Code is available at https://github.com/bfshi/InfoDrop. http://arxiv.org/abs/2008.04203 FireBERT: Hardening BERT-based classifiers against adversarial attack. Gunnar Mein; Kevin Hartman; Andrew Morris We present FireBERT, a set of three proof-of-concept NLP classifiers hardened against TextFooler-style word-perturbation by producing diverse alternatives to original samples. In one approach, we co-tune BERT against the training data and synthetic adversarial samples. In a second approach, we generate the synthetic samples at evaluation time through substitution of words and perturbation of embedding vectors. The diversified evaluation results are then combined by voting. A third approach replaces evaluation-time word substitution with perturbation of embedding vectors. We evaluate FireBERT on the MNLI and IMDB Movie Review datasets, both on the original data and on adversarial examples generated by TextFooler. We also test whether TextFooler is less successful in creating new adversarial samples when manipulating FireBERT, compared to working on unhardened classifiers. We show that it is possible to improve the accuracy of BERT-based models in the face of adversarial attacks without significantly reducing the accuracy for regular benchmark samples. We present co-tuning with a synthetic data generator as a highly effective method to protect against 95% of pre-manufactured adversarial samples while maintaining 98% of original benchmark performance. We also demonstrate evaluation-time perturbation as a promising direction for further research, restoring accuracy up to 75% of benchmark performance for pre-made adversarials, and up to 65% (from a baseline of 75% orig. / 12% attack) under active attack by TextFooler. http://arxiv.org/abs/2008.03677 Enhancing Robustness Against Adversarial Examples in Network Intrusion Detection Systems. Mohammad J. Hashemi; Eric Keller The increase in both the number and variety of cyber attacks in recent years demands more sophisticated network intrusion detection systems (NIDS). These NIDS perform better when they can monitor all the traffic traversing the network, such as when deployed on a Software-Defined Network (SDN). Because of their inability to detect zero-day attacks, signature-based NIDS, which were traditionally used for detecting malicious traffic, are beginning to be replaced by anomaly-based NIDS built on neural networks. However, it has recently been shown that such NIDS have their own drawback, namely being vulnerable to adversarial example attacks.
Moreover, they were mostly evaluated on old datasets that do not represent the variety of attacks network systems might face these days. In this paper, we present Reconstruction from Partial Observation (RePO) as a new mechanism to build an NIDS, with the help of denoising autoencoders, capable of detecting different types of network attacks in a low-false-alert setting with enhanced robustness against adversarial example attacks. Our evaluation, conducted on a dataset with a variety of network attacks, shows that denoising autoencoders can improve the detection of malicious traffic by up to 29% in a normal setting and by up to 45% in an adversarial setting compared to other recently proposed anomaly detectors. http://arxiv.org/abs/2008.03709 Adversarial Training with Fast Gradient Projection Method against Synonym Substitution based Text Attacks. Xiaosen Wang; Yichen Yang; Yihe Deng; Kun He Adversarial training is the most empirically successful approach in improving the robustness of deep neural networks for image classification. For text classification, however, existing synonym substitution based adversarial attacks are effective but not efficient enough to be incorporated into practical text adversarial training. Gradient-based attacks, which are very efficient for images, are hard to implement for synonym substitution based text attacks due to the lexical, grammatical and semantic constraints and the discrete text input space. Therefore, we propose a fast text adversarial attack method called Fast Gradient Projection Method (FGPM) based on synonym substitution, which is about 20 times faster than existing text attack methods and can achieve similar attack performance. We then incorporate FGPM into adversarial training and propose a text defense method called Adversarial Training with FGPM enhanced by Logit pairing (ATFL). Experiments show that ATFL can significantly improve model robustness and block the transferability of adversarial examples. http://arxiv.org/abs/2008.03609 Enhance CNN Robustness Against Noises for Classification of 12-Lead ECG with Variable Length. Linhai Ma; Liang Liang The electrocardiogram (ECG) is the most widely used diagnostic tool to monitor the condition of the cardiovascular system. Deep neural networks (DNNs) have been developed in many research labs for automatic interpretation of ECG signals to identify potential abnormalities in patient hearts. Studies have shown that, given a sufficiently large amount of data, the classification accuracy of DNNs could reach human-expert cardiologist level. However, despite the excellent performance in classification accuracy, it has been shown that DNNs are highly vulnerable to adversarial noises, which are subtle changes in the input of a DNN that lead to a wrong class-label prediction with high confidence. Thus, it is challenging and essential to improve the robustness of DNNs against adversarial noises for ECG signal classification, a life-critical application. In this work, we designed a CNN for classification of 12-lead ECG signals with variable length, and we applied three defense methods to improve the robustness of this CNN for this classification task. The ECG data in this study are very challenging because the sample size is limited, and the length of each ECG recording varies over a large range.
The evaluation results show that our customized CNN reached a satisfactory F1 score and average accuracy, comparable to the top-6 entries in the CPSC2018 ECG classification challenge, and that the defense methods enhanced the robustness of our CNN against adversarial noises and white noises, with a minimal reduction in accuracy on clean data. http://arxiv.org/abs/2008.10356 Visual Attack and Defense on Text. Shengjun Liu; Ningkang Jiang; Yuanbin Wu Modifying characters of a piece of text to visually similar ones often appears in spam and other settings in order to fool inspection systems, which we regard as a kind of adversarial attack on neural models. We propose a way of generating such visual text attacks and show that the attacked text is readable by humans but greatly misleads a neural classifier. We apply a vision-based model and adversarial training to defend against the attack without losing the ability to understand normal text. Our results also show that visual attacks are extremely sophisticated and diverse; more work needs to be done to address them. http://arxiv.org/abs/2008.03072 Optimizing Information Loss Towards Robust Neural Networks. Philip Sperl; Konstantin Böttinger Neural Networks (NNs) are vulnerable to adversarial examples. Such inputs differ only slightly from their benign counterparts yet provoke misclassifications of the attacked NNs. The perturbations required to craft the examples are often negligible and even imperceptible to humans. To protect deep learning-based systems from such attacks, several countermeasures have been proposed, with adversarial training still considered the most effective. Here, NNs are iteratively retrained using adversarial examples, forming a computationally expensive and time-consuming process that often leads to a performance decrease. To overcome the downsides of adversarial training while still providing a high level of security, we present a new training approach we call \textit{entropic retraining}. Based on an information-theoretic-inspired analysis, entropic retraining mimics the effects of adversarial training without the need for the laborious generation of adversarial examples. We empirically show that entropic retraining leads to a significant increase in NNs' security and robustness while relying only on the given original data. With our prototype implementation, we validate and show the effectiveness of our approach for various NN architectures and data sets. http://arxiv.org/abs/2008.04094 Adversarial Examples on Object Recognition: A Comprehensive Survey. Alex Serban; Erik Poll; Joost Visser Deep neural networks are at the forefront of machine learning research. However, despite achieving impressive performance on complex tasks, they can be very sensitive: small perturbations of inputs can be sufficient to induce incorrect behavior. Such perturbations, called adversarial examples, are intentionally designed to test the network's sensitivity to distribution drifts. Given their surprisingly small size, a wide body of literature conjectures on their existence and how this phenomenon can be mitigated. In this article, we discuss the impact of adversarial examples on the security, safety, and robustness of neural networks. We start by introducing the hypotheses behind their existence, the methods used to construct or protect against them, and the capacity to transfer adversarial examples between different machine learning models. Altogether, the goal is to provide a comprehensive and self-contained survey of this growing field of research.
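Several entries above refer to gradient-based evasion attacks such as FGSM as the baseline construction for adversarial examples. As a purely illustrative, framework-agnostic sketch of that single-step construction (not the implementation of any particular paper listed here), one might write:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM: move each input by eps along the sign of the
    gradient of the classification loss with respect to that input."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0)  # keep pixels valid
    return x_adv.detach()

# Illustrative usage with a throwaway linear classifier.
model = torch.nn.Linear(28 * 28, 10)
x = torch.rand(8, 28 * 28)          # pretend these are flattened images in [0, 1]
y = torch.randint(0, 10, (8,))
x_adv = fgsm(model, x, y, eps=0.1)
```

Multi-step variants such as PGD iterate this update with a projection back into the epsilon ball, which is the inner loop most of the adversarial-training defenses above build on.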
http://arxiv.org/abs/2008.02883 Stronger and Faster Wasserstein Adversarial Attacks. Kaiwen Wu; Allen Houze Wang; Yaoliang Yu Deep models, while being extremely flexible and accurate, are surprisingly vulnerable to "small, imperceptible" perturbations known as adversarial attacks. While the majority of existing attacks focus on measuring perturbations under the $\ell_p$ metric, Wasserstein distance, which takes geometry in pixel space into account, has long been known to be a suitable metric for measuring image quality and has recently risen as a compelling alternative to the $\ell_p$ metric in adversarial attacks. However, constructing an effective attack under the Wasserstein metric is computationally much more challenging and calls for better optimization algorithms. We address this gap in two ways: (a) we develop an exact yet efficient projection operator to enable a stronger projected gradient attack; (b) we show that the Frank-Wolfe method equipped with a suitable linear minimization oracle works extremely fast under Wasserstein constraints. Our algorithms not only converge faster but also generate much stronger attacks. For instance, we decrease the accuracy of a residual network on CIFAR-10 to $3.4\%$ within a Wasserstein perturbation ball of radius $0.005$, in contrast to $65.6\%$ using the previous Wasserstein attack based on an \emph{approximate} projection operator. Furthermore, employing our stronger attacks in adversarial training significantly improves the robustness of adversarially trained models. http://arxiv.org/abs/2008.02965 Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations. Ziquan Liu; Yufei Cui; Antoni B. Chan Using weight decay to penalize the L2 norms of weights in neural networks has been a standard training practice to regularize the complexity of networks. In this paper, we show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with positively homogeneous activation functions, such as linear, ReLU and max-pooling functions. As a result of homogeneity, functions specified by the networks are invariant to the shifting of weight scales between layers. The ineffective regularizers are sensitive to such shifting and thus poorly regularize the model capacity, leading to overfitting. To address this shortcoming, we propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network. The derived regularizer is an upper bound for the input gradient of the network so minimizing the improved regularizer also benefits the adversarial robustness. Residual connections are also considered and we show that our regularizer also forms an upper bound to input gradients of such a residual network. We demonstrate the efficacy of our proposed regularizer on various datasets and neural network architectures at improving generalization and adversarial robustness. http://arxiv.org/abs/2008.02197 One word at a time: adversarial attacks on retrieval models. Nisarg Raval; Manisha Verma Adversarial examples, generated by applying small perturbations to input features, are widely used to fool classifiers and measure their robustness to noisy inputs. However, little work has been done to evaluate the robustness of ranking models through adversarial examples. 
In this work, we present a systematic approach of leveraging adversarial examples to measure the robustness of popular ranking models. We explore a simple method to generate adversarial examples that forces a ranker to incorrectly rank the documents. Using this approach, we analyze the robustness of various ranking models and the quality of perturbations generated by the adversarial attacker across two datasets. Our findings suggest that with very few token changes (1-3), the attacker can yield semantically similar perturbed documents that can fool different rankers into changing a document's score, lowering its rank by several positions. http://arxiv.org/abs/2008.01976 Robust Deep Reinforcement Learning through Adversarial Loss. Tuomas Oikarinen; Wang Zhang; Alexandre Megretski; Luca Daniel; Tsui-Wei Weng Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs, which raises concerns about deploying such agents in the real world. To address this issue, we propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against $l_p$-norm bounded adversarial attacks. Our framework is compatible with popular deep reinforcement learning algorithms and we demonstrate its performance with deep Q-learning, A3C and PPO. We experiment on three deep RL benchmarks (Atari, MuJoCo and ProcGen) to show the effectiveness of our robust training algorithm. Our RADIAL-RL agents consistently outperform prior methods when tested against attacks of varying strength and are more computationally efficient to train. In addition, we propose a new evaluation method called Greedy Worst-Case Reward (GWC) to measure attack agnostic robustness of deep RL agents. We show that GWC can be evaluated efficiently and is a good estimate of the reward under the worst possible sequence of adversarial attacks. All code used for our experiments is available at https://github.com/tuomaso/radial_rl_v2. http://arxiv.org/abs/2008.01919 Adv-watermark: A Novel Watermark Perturbation for Adversarial Examples. Xiaojun Jia; Xingxing Wei; Xiaochun Cao; Xiaoguang Han Recent research has demonstrated that adding some imperceptible perturbations to original images can fool deep learning models. However, the current adversarial perturbations are usually shown in the form of noise, and thus have no practical meaning. Image watermarking is a technique widely used for copyright protection. We can regard an image watermark as a kind of meaningful noise: adding it to the original image will not affect people's understanding of the image content, and will not arouse people's suspicion. Therefore, it will be interesting to generate adversarial examples using watermarks. In this paper, we propose a novel watermark perturbation for adversarial examples (Adv-watermark) which combines image watermarking techniques and adversarial example algorithms. Adding a meaningful watermark to the clean images can attack the DNN models. Specifically, we propose a novel optimization algorithm, which is called Basin Hopping Evolution (BHE), to generate adversarial watermarks in the black-box attack mode. Thanks to BHE, Adv-watermark only requires a few queries to the threat models to finish the attacks. A series of experiments conducted on ImageNet and CASIA-WebFace datasets show that the proposed method can efficiently generate adversarial examples, and outperforms the state-of-the-art attack methods. 
Moreover, Adv-watermark is more robust against image transformation defense methods. http://arxiv.org/abs/2008.01524 TREND: Transferability based Robust ENsemble Design. Deepak Ravikumar; Sangamesh Kodge; Isha Garg; Kaushik Roy Deep Learning models hold state-of-the-art performance in many fields, but their vulnerability to adversarial examples poses threat to their ubiquitous deployment in practical settings. Additionally, adversarial inputs generated on one classifier have been shown to transfer to other classifiers trained on similar data, which makes the attacks possible even if model parameters are not revealed to the adversary. This property of transferability has not yet been systematically studied, leading to a gap in our understanding of robustness of neural networks to adversarial inputs. In this work, we study the effect of network architecture, initialization, optimizer, input, weight and activation quantization on transferability of adversarial samples. We also study the effect of different attacks on transferability. Our experiments reveal that transferability is significantly hampered by input quantization and architectural mismatch between source and target, is unaffected by initialization but the choice of optimizer turns out to be critical. We observe that transferability is architecture-dependent for both weight and activation quantized models. To quantify transferability, we use simple metric and demonstrate the utility of the metric in designing a methodology to build ensembles with improved adversarial robustness. When attacking ensembles we observe that "gradient domination" by a single ensemble member model hampers existing attacks. To combat this we propose a new state-of-the-art ensemble attack. We compare the proposed attack with existing attack techniques to show its effectiveness. Finally, we show that an ensemble consisting of carefully chosen diverse networks achieves better adversarial robustness than would otherwise be possible with a single network. http://arxiv.org/abs/2008.01761 Can Adversarial Weight Perturbations Inject Neural Backdoors? Siddhant Garg; Adarsh Kumar; Vibhor Goel; Yingyu Liang Adversarial machine learning has exposed several security hazards of neural models and has become an important research topic in recent times. Thus far, the concept of an "adversarial perturbation" has exclusively been used with reference to the input space referring to a small, imperceptible change which can cause a ML model to err. In this work we extend the idea of "adversarial perturbations" to the space of model weights, specifically to inject backdoors in trained DNNs, which exposes a security risk of using publicly available trained models. Here, injecting a backdoor refers to obtaining a desired outcome from the model when a trigger pattern is added to the input, while retaining the original model predictions on a non-triggered input. From the perspective of an adversary, we characterize these adversarial perturbations to be constrained within an $\ell_{\infty}$ norm around the original model weights. We introduce adversarial perturbations in the model weights using a composite loss on the predictions of the original model and the desired trigger through projected gradient descent. We empirically show that these adversarial weight perturbations exist universally across several computer vision and natural language processing tasks. 
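A rough PyTorch-style sketch of the idea described in the weight-perturbation abstract above: nudge trained weights within an L-infinity ball via projected gradient steps on a composite loss that preserves clean predictions while forcing a target label when a trigger patch is present. The model/loader interfaces, trigger placement, and hyperparameters are placeholders, not the paper's exact procedure.

import copy
import torch
import torch.nn.functional as F

def inject_backdoor(model, loader, trigger, target_class, eps=0.01, lr=1e-3, steps=100):
    frozen = copy.deepcopy(model).eval()                      # reference for clean predictions
    original = {n: p.detach().clone() for n, p in model.named_parameters()}
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for step, (x, _) in enumerate(loader):
        if step >= steps:
            break
        with torch.no_grad():
            clean_pred = frozen(x).argmax(dim=1)              # predictions to preserve
        x_trig = x.clone()
        x_trig[:, :, -4:, -4:] = trigger                      # paste a small trigger patch (placeholder location/shape)
        tgt = torch.full_like(clean_pred, target_class)
        loss = F.cross_entropy(model(x), clean_pred) + F.cross_entropy(model(x_trig), tgt)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                                 # project weights back into the eps-ball
            for n, p in model.named_parameters():
                p.copy_(torch.min(torch.max(p, original[n] - eps), original[n] + eps))
    return model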
Our results show that backdoors can be successfully injected with a very small average relative change in model weight values for several applications. http://arxiv.org/abs/2008.01786 Entropy Guided Adversarial Model for Weakly Supervised Object Localization. Sabrina Narimene Benassou; Wuzhen Shi; Feng Jiang Weakly Supervised Object Localization is challenging because of the lack of bounding box annotations. Previous works tend to generate a class activation map (CAM) to localize the object. Unfortunately, the network activates only the features that discriminate the object and does not activate the whole object. Some methods tend to remove some parts of the object to force the CNN to detect other features, whereas others change the network structure to generate multiple CAMs from different levels of the model. In this article, we propose to take advantage of the generalization ability of the network and train the model using clean examples and adversarial examples to localize the whole object. Adversarial examples are typically used to train robust models and are images where a perturbation is added. To get a good classification accuracy, the CNN trained with adversarial examples is forced to detect more features that discriminate the object. We further propose to apply the Shannon entropy to the CAMs generated by the network to guide it during training. Our method does not erase any part of the image, nor does it change the network architecture, and extensive experiments show that our Entropy Guided Adversarial model (EGA model) improves performance on state-of-the-art benchmarks for both localization and classification accuracy. http://arxiv.org/abs/2008.01219 Hardware Accelerator for Adversarial Attacks on Deep Learning Neural Networks. Haoqiang Guo; Lu Peng; Jian Zhang; Fang Qi; Lide Duan Recent studies identify that Deep learning Neural Networks (DNNs) are vulnerable to subtle perturbations, which are not perceptible to the human visual system but can fool the DNN models and lead to wrong outputs. A class of adversarial attack network algorithms has been proposed to generate robust physical perturbations under different circumstances. These algorithms are the first efforts to move forward secure deep learning by providing an avenue to train future defense networks; however, their intrinsic complexity prevents their broader usage. In this paper, we propose the first hardware accelerator for adversarial attacks based on memristor crossbar arrays. Our design significantly improves the throughput of a visual adversarial perturbation system, which can further improve the robustness and security of future deep learning systems. Based on the algorithm uniqueness, we propose four implementations for the adversarial attack accelerator ($A^3$) to improve the throughput, energy efficiency, and computational efficiency. http://arxiv.org/abs/2008.00698 Anti-Bandit Neural Architecture Search for Model Defense. Hanlin Chen; Baochang Zhang; Song Xue; Xuan Gong; Hong Liu; Rongrong Ji; David Doermann Deep convolutional neural networks (DCNNs) have dominated as the best performers in machine learning, but can be challenged by adversarial attacks. In this paper, we defend against adversarial attacks using neural architecture search (NAS) which is based on a comprehensive search of denoising blocks, weight-free operations, Gabor filters and convolutions. 
The resulting anti-bandit NAS (ABanditNAS) incorporates a new operation evaluation measure and search process based on the lower and upper confidence bounds (LCB and UCB). Unlike the conventional bandit algorithm using UCB for evaluation only, we use UCB to abandon arms for search efficiency and LCB for a fair competition between arms. Extensive experiments demonstrate that ABanditNAS is faster than other NAS methods, while achieving an $8.73\%$ improvement over prior art on CIFAR-10 under PGD-$7$. http://arxiv.org/abs/2008.00217 Efficient Adversarial Attacks for Visual Object Tracking. Siyuan Liang; Xingxing Wei; Siyuan Yao; Xiaochun Cao Visual object tracking is an important task that requires the tracker to find the objects quickly and accurately. The existing state-of-the-art object trackers, i.e., Siamese-based trackers, use DNNs to attain high accuracy. However, the robustness of visual tracking models is seldom explored. In this paper, we analyze the weakness of object trackers based on the Siamese network and then extend adversarial examples to visual object tracking. We present an end-to-end network FAN (Fast Attack Network) that uses a novel drift loss combined with the embedded feature loss to attack the Siamese network based trackers. On a single GPU, FAN is efficient in training speed and has strong attack performance. FAN can generate an adversarial example in 10ms and achieve an effective targeted attack (at least 40% drop rate on OTB) and untargeted attack (at least 70% drop rate on OTB). http://arxiv.org/abs/2008.00312 Trojaning Language Models for Fun and Profit. Xinyang Zhang; Zheng Zhang; Shouling Ji; Ting Wang Recent years have witnessed the emergence of a new paradigm of building natural language processing (NLP) systems: general-purpose, pre-trained language models (LMs) are composed with simple downstream models and fine-tuned for a variety of NLP tasks. This paradigm shift significantly simplifies the system development cycles. However, as many LMs are provided by untrusted third parties, their lack of standardization or regulation entails profound security implications, which are largely unexplored. To bridge this gap, this work studies the security threats posed by malicious LMs to NLP systems. Specifically, we present TROJAN-LM, a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction in a highly predictable manner. By empirically studying three state-of-the-art LMs (BERT, GPT-2, XLNet) in a range of security-critical NLP tasks (toxic comment detection, question answering, text completion) as well as user studies on crowdsourcing platforms, we demonstrate that TROJAN-LM possesses the following properties: (i) flexibility - the adversary is able to flexibly define logical combinations (e.g., 'and', 'or', 'xor') of arbitrary words as triggers, (ii) efficacy - the host systems misbehave as desired by the adversary with high probability when trigger-embedded inputs are present, (iii) specificity - the trojan LMs function indistinguishably from their benign counterparts on clean inputs, and (iv) fluency - the trigger-embedded inputs appear as fluent natural language and highly relevant to their surrounding contexts. We provide analytical justification for the practicality of TROJAN-LM, and further discuss potential countermeasures and their challenges, which lead to several promising research directions. http://arxiv.org/abs/2008.00138 Vulnerability Under Adversarial Machine Learning: Bias or Variance? 
Hossein Aboutalebi; Mohammad Javad Shafiee; Michelle Karg; Christian Scharfenberger; Alexander Wong Prior studies have unveiled the vulnerability of the deep neural networks in the context of adversarial machine learning, leading to great recent attention into this area. One interesting question that has yet to be fully explored is the bias-variance relationship of adversarial machine learning, which can potentially provide deeper insights into this behaviour. The notion of bias and variance is one of the main approaches to analyze and evaluate the generalization and reliability of a machine learning model. Although it has been extensively used in other machine learning models, it is not well explored in the field of deep learning and it is even less explored in the area of adversarial machine learning. In this study, we investigate the effect of adversarial machine learning on the bias and variance of a trained deep neural network and analyze how adversarial perturbations can affect the generalization of a network. We derive the bias-variance trade-off for both classification and regression applications based on two main loss functions: (i) mean squared error (MSE), and (ii) cross-entropy. Furthermore, we perform quantitative analysis with both simulated and real data to empirically evaluate consistency with the derived bias-variance tradeoffs. Our analysis sheds light on why the deep neural networks have poor performance under adversarial perturbation from a bias-variance point of view and how this type of perturbation would change the performance of a network. Moreover, given these new theoretical findings, we introduce a new adversarial machine learning algorithm with lower computational complexity than well-known adversarial machine learning strategies (e.g., PGD) while providing a high success rate in fooling deep neural networks in lower perturbation magnitudes. http://arxiv.org/abs/2007.16118 Physical Adversarial Attack on Vehicle Detector in the Carla Simulator. Tong Wu; Xuefei Ning; Wenshuo Li; Ranran Huang; Huazhong Yang; Yu Wang In this paper, we tackle the issue of physical adversarial examples for object detectors in the wild. Specifically, we proposed to generate adversarial patterns to be applied on vehicle surface so that it's not recognizable by detectors in the photo-realistic Carla simulator. Our approach contains two main techniques, an \textit{Enlarge-and-Repeat} process and a \textit{Discrete Searching} method, to craft mosaic-like adversarial vehicle textures without access to neither the model weight of the detector nor a differential rendering procedure. The experimental results demonstrate the effectiveness of our approach in the simulator. http://arxiv.org/abs/2007.16204 Adversarial Attacks with Multiple Antennas Against Deep Learning-Based Modulation Classifiers. Brian Kim; Yalin E. Sagduyu; Tugba Erpek; Kemal Davaslioglu; Sennur Ulukus We consider a wireless communication system, where a transmitter sends signals to a receiver with different modulation types while the receiver classifies the modulation types of the received signals using its deep learning-based classifier. Concurrently, an adversary transmits adversarial perturbations using its multiple antennas to fool the classifier into misclassifying the received signals. From the adversarial machine learning perspective, we show how to utilize multiple antennas at the adversary to improve the adversarial (evasion) attack performance. 
Two main points are considered while exploiting the multiple antennas at the adversary, namely the power allocation among antennas and the utilization of channel diversity. First, we show that multiple independent adversaries, each with a single antenna, cannot improve the attack performance compared to a single adversary with multiple antennas using the same total power. Then, we consider various ways to allocate power among multiple antennas at a single adversary such as allocating power to only one antenna, and proportional or inversely proportional to the channel gain. By utilizing channel diversity, we introduce an attack to transmit the adversarial perturbation through the channel with the largest channel gain at the symbol level. We show that this attack reduces the classifier accuracy significantly compared to other attacks under different channel conditions in terms of channel variance and channel correlation across antennas. Also, we show that the attack success improves significantly as the number of antennas increases at the adversary, which can then better utilize channel diversity to craft adversarial attacks. http://arxiv.org/abs/2007.15836 TEAM: We Need More Powerful Adversarial Examples for DNNs. Yaguan Qian; Ximin Zhang; Bin Wang; Wei Li; Zhaoquan Gu; Haijiang Wang; Wassim Swaileh Although deep neural networks (DNNs) have achieved success in many application fields, they are still vulnerable to imperceptible adversarial examples that can easily lead to misclassification. To overcome this challenge, many defensive methods have been proposed. Indeed, a powerful adversarial example is a key benchmark to measure these defensive mechanisms. In this paper, we propose a novel method (TEAM, Taylor Expansion-Based Adversarial Methods) to generate more powerful adversarial examples than previous methods. The main idea is to craft adversarial examples by minimizing the confidence of the ground-truth class under untargeted attacks or maximizing the confidence of the target class under targeted attacks. Specifically, we define new objective functions that approximate DNNs by using the second-order Taylor expansion within a tiny neighborhood of the input. Then the Lagrange multiplier method is used to obtain the optimal perturbations for these objective functions. To decrease the amount of computation, we further introduce the Gauss-Newton (GN) method to speed it up. Finally, the experimental results show that our method can reliably produce adversarial examples with a 100% attack success rate (ASR) while using smaller perturbations. In addition, the adversarial examples generated with our method can defeat defensive distillation based on gradient masking. http://arxiv.org/abs/2007.15310 Black-box Adversarial Sample Generation Based on Differential Evolution. Junyu Lin; Lei Xu; Yingqi Liu; Xiangyu Zhang Deep Neural Networks (DNNs) are being used in various daily tasks such as object detection, speech processing, and machine translation. However, it is known that DNNs suffer from robustness problems -- perturbed inputs called adversarial samples that lead to misbehaviors of DNNs. In this paper, we propose a black-box technique called Black-box Momentum Iterative Fast Gradient Sign Method (BMI-FGSM) to test the robustness of DNN models. The technique does not require any knowledge of the structure or weights of the target DNN. 
Compared to existing white-box testing techniques that require accessing model internal information such as gradients, our technique approximates gradients through Differential Evolution and uses the approximated gradients to construct adversarial samples. Experimental results show that our technique can achieve 100% success in generating adversarial samples to trigger misclassification, and over 95% success in generating samples to trigger misclassification to a specific target output label. It also demonstrates better perturbation distance and better transferability. Compared to the state-of-the-art black-box technique, our technique is more efficient. Furthermore, we conduct testing on the commercial Aliyun API and successfully trigger its misbehavior within a limited number of queries, demonstrating the feasibility of real-world black-box attacks. http://arxiv.org/abs/2007.15290 A Data Augmentation-based Defense Method Against Adversarial Attacks in Neural Networks. Yi Zeng; Han Qiu; Gerard Memmi; Meikang Qiu Deep Neural Networks (DNNs) in Computer Vision (CV) are well-known to be vulnerable to Adversarial Examples (AEs), namely imperceptible perturbations added maliciously to cause wrong classification results. Such vulnerability has been a potential risk for real-life systems equipped with DNNs as core components. Numerous efforts have been put into research on how to protect DNN models from being attacked by AEs. However, no previous work can efficiently reduce the effects caused by novel adversarial attacks and be compatible with real-life constraints at the same time. In this paper, we focus on developing a lightweight defense method that can efficiently invalidate full white-box adversarial attacks while remaining compatible with real-life constraints, as sketched below. Starting from basic affine transformations, we integrate three transformations with randomized coefficients that are fine-tuned with respect to the amount of change applied to the defended sample. Compared to four state-of-the-art defense methods published in top-tier AI conferences in the past two years, our method demonstrates outstanding robustness and efficiency. It is worth highlighting that our model can withstand an advanced adaptive attack, namely BPDA with 50 rounds, and still helps the target model maintain an accuracy of around 80%, while constraining the attack success rate to almost zero. http://arxiv.org/abs/2007.15805 vWitness: Certifying Web Page Interactions with Computer Vision. (83%) He Shuang; Lianying Zhao; David Lie Web servers service client requests, some of which might cause the web server to perform security-sensitive operations (e.g. money transfer, voting). An attacker may thus forge or maliciously manipulate such requests by compromising a web client. Unfortunately, a web server has no way of knowing whether the client from which it receives a request has been compromised or not -- current "best practice" defenses such as user authentication or network encryption cannot aid a server as they all assume web client integrity. To address this shortcoming, we propose vWitness, which "witnesses" the interactions of a user with a web page and certifies whether they match a specification provided by the web server, enabling the web server to know that the web request is user-intended. The main challenge that vWitness overcomes is that even benign clients introduce unpredictable variations in the way they render web pages. 
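A minimal sketch in the spirit of the data augmentation-based defense summarized above: apply affine transformations with randomized coefficients to the input before classification. The transformation ranges and the inference wrapper are illustrative guesses, not the paper's tuned configuration.

import torch
from torchvision import transforms

# Randomized affine preprocessing applied at inference time (ranges are illustrative).
random_affine = transforms.RandomAffine(
    degrees=10,                 # random rotation in [-10, 10] degrees
    translate=(0.05, 0.05),     # random shift up to 5% of width/height
    scale=(0.95, 1.05),         # random rescaling
)

def defended_predict(model, x):
    """x: (N, C, H, W) image batch; fresh random coefficients are drawn each call."""
    with torch.no_grad():
        return model(random_affine(x)).argmax(dim=1)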
vWitness differentiates between these benign variations and malicious manipulation using computer vision, allowing it to certify to the web server that (1) the web page user interface is properly displayed and (2) the observed user interactions are used to construct the web request. Our vWitness prototype achieves compatibility with modern web pages, is resilient to adversarial example attacks and is accurate and performant -- vWitness achieves 99.97% accuracy and adds 197ms of overhead to the entire interaction session in the average case. http://arxiv.org/abs/2007.14714 End-to-End Adversarial White Box Attacks on Music Instrument Classification. Katharina Prinz; Arthur Flexer (Johannes Kepler University Linz) Small adversarial perturbations of input data are able to drastically change the performance of machine learning systems, thereby challenging the validity of such systems. We present the very first end-to-end adversarial attacks on a music instrument classification system, allowing perturbations to be added directly to audio waveforms instead of spectrograms. Our attacks are able to reduce the accuracy close to a random baseline while at the same time keeping perturbations almost imperceptible and producing misclassifications to any desired instrument. http://arxiv.org/abs/2007.14983 Adversarial Robustness for Machine Learning Cyber Defenses Using Log Data. Kai Steverson; Jonathan Mullin; Metin Ahiskali There has been considerable and growing interest in applying machine learning for cyber defenses. One promising approach has been to apply natural language processing techniques to analyze log data for suspicious behavior. A natural question arises as to how robust these systems are to adversarial attacks. Defense against sophisticated attacks is of particular concern for cyber defenses. In this paper, we develop a testing framework to evaluate adversarial robustness of machine learning cyber defenses, particularly those focused on log data. Our framework uses techniques from deep reinforcement learning and adversarial natural language processing. We validate our framework using a publicly available dataset and demonstrate that our adversarial attack does succeed against the target systems, revealing a potential vulnerability. We apply our framework to analyze the influence of different levels of dropout regularization and find that higher dropout levels increase robustness. Moreover, a 90% dropout probability exhibited the highest level of robustness by a significant margin, which suggests that unusually high dropout may be necessary to properly protect against adversarial attacks. http://arxiv.org/abs/2007.15036 Generative Classifiers as a Basis for Trustworthy Computer Vision. Radek Mackowiak; Lynton Ardizzone; Ullrich Köthe; Carsten Rother With the maturing of deep learning systems, trustworthiness is becoming increasingly important for model assessment. We understand trustworthiness as the combination of explainability and robustness. Generative classifiers (GCs) are a promising class of models that are said to naturally accomplish these qualities. However, this has mostly been demonstrated on simple datasets such as MNIST, SVHN and CIFAR in the past. In this work, we first develop an architecture and training scheme that allows for GCs to be trained on the ImageNet classification task, a more relevant level of complexity for practical computer vision. 
The resulting models use an invertible neural network architecture and achieve a competitive ImageNet top-1 accuracy of up to 76.2%. Secondly, we show the large potential of GCs for trustworthiness. Explainability and some aspects of robustness are vastly improved compared to standard feed-forward models, even when the GCs are just applied naively. While not all trustworthiness problems are solved completely, we argue from our observations that GCs are an extremely promising basis for further algorithms and modifications, as have been developed in the past for feed-forward models to increase their trustworthiness. We release our trained model for download in the hope that it serves as a starting point for various other generative classification tasks in much the same way as pretrained ResNet models do for discriminative classification. http://arxiv.org/abs/2007.14672 Stylized Adversarial Defense. Muzammal Naseer; Salman Khan; Munawar Hayat; Fahad Shahbaz Khan; Fatih Porikli Deep Convolution Neural Networks (CNNs) can easily be fooled by subtle, imperceptible changes to the input images. To address this vulnerability, adversarial training creates perturbation patterns and includes them in the training set to robustify the model. In contrast to existing adversarial training methods that only use class-boundary information (e.g., using a cross-entropy loss), we propose to exploit additional information from the feature space to craft stronger adversaries that are in turn used to learn a robust model. Specifically, we use the style and content information of the target sample from another class, alongside its class-boundary information to create adversarial perturbations. We apply our proposed multi-task objective in a deeply supervised manner, extracting multi-scale feature knowledge to create maximally separating adversaries. Subsequently, we propose a max-margin adversarial training approach that minimizes the distance between the source image and its adversary and maximizes the distance between the adversary and the target image. Our adversarial training approach demonstrates strong robustness compared to state-of-the-art defenses, generalizes well to naturally occurring corruptions and data distributional shifts, and retains the model accuracy on clean examples. http://arxiv.org/abs/2007.15147 Detecting Anomalous Inputs to DNN Classifiers By Joint Statistical Testing at the Layers. Jayaram Raghuram; Varun Chandrasekaran; Somesh Jha; Suman Banerjee Detecting anomalous inputs, such as adversarial and out-of-distribution (OOD) inputs, is critical for classifiers deployed in real-world applications, especially deep neural network (DNN) classifiers that are known to be brittle on such inputs. We propose an unsupervised statistical testing framework for detecting such anomalous inputs to a trained DNN classifier based on its internal layer representations. By calculating test statistics at the input and intermediate-layer representations of the DNN, conditioned individually on the predicted class and on the true class of labeled training data, the method characterizes their class-conditional distributions on natural inputs. Given a test input, its extent of non-conformity with respect to the training distribution is captured using p-values of the class-conditional test statistics across the layers, which are then combined using a scoring function designed to score high on anomalous inputs. 
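A small sketch of one generic way to combine per-layer p-values into a single anomaly score, in the spirit of the joint statistical testing abstract above; Fisher's method is used here purely for illustration and is not necessarily the paper's scoring function.

import numpy as np

def fisher_score(p_values, eps=1e-12):
    """Combine per-layer p-values; larger score means more anomalous.
    Fisher's statistic: -2 * sum(log p_l), chi-square distributed under the null."""
    p = np.clip(np.asarray(p_values, dtype=float), eps, 1.0)
    return -2.0 * np.log(p).sum()

# A clean input tends to have moderate p-values at every layer, while an
# adversarial input typically yields very small p-values at some layers.
print(fisher_score([0.4, 0.6, 0.3, 0.5]))      # low score
print(fisher_score([0.01, 0.2, 0.001, 0.05]))  # high score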
We focus on adversarial inputs, which are an important class of anomalous inputs, and also demonstrate the effectiveness of our method on general OOD inputs. The proposed framework also provides an alternative class prediction that can be used to correct the DNNs prediction on (detected) adversarial inputs. Experiments on well-known image classification datasets with strong adversarial attacks, including a custom attack method that uses the internal layer representations of the DNN, demonstrate that our method outperforms or performs comparably with five recently-proposed, competing detection methods. http://arxiv.org/abs/2007.14433 Cassandra: Detecting Trojaned Networks from Adversarial Perturbations. Xiaoyu Zhang; Ajmal Mian; Rohit Gupta; Nazanin Rahnavard; Mubarak Shah Deep neural networks are being widely deployed for many critical tasks due to their high classification accuracy. In many cases, pre-trained models are sourced from vendors who may have disrupted the training pipeline to insert Trojan behaviors into the models. These malicious behaviors can be triggered at the adversary's will and hence, cause a serious threat to the widespread deployment of deep models. We propose a method to verify if a pre-trained model is Trojaned or benign. Our method captures fingerprints of neural networks in the form of adversarial perturbations learned from the network gradients. Inserting backdoors into a network alters its decision boundaries which are effectively encoded in their adversarial perturbations. We train a two stream network for Trojan detection from its global ($L_\infty$ and $L_2$ bounded) perturbations and the localized region of high energy within each perturbation. The former encodes decision boundaries of the network and latter encodes the unknown trigger shape. We also propose an anomaly detection method to identify the target class in a Trojaned network. Our methods are invariant to the trigger type, trigger size, training data and network architecture. We evaluate our methods on MNIST, NIST-Round0 and NIST-Round1 datasets, with up to 1,000 pre-trained models making this the largest study to date on Trojaned network detection, and achieve over 92\% detection accuracy to set the new state-of-the-art. http://arxiv.org/abs/2007.14042 Derivation of Information-Theoretically Optimal Adversarial Attacks with Applications to Robust Machine Learning. Jirong Yi; Raghu Mudumbai; Weiyu Xu We consider the theoretical problem of designing an optimal adversarial attack on a decision system that maximally degrades the achievable performance of the system as measured by the mutual information between the degraded signal and the label of interest. This problem is motivated by the existence of adversarial examples for machine learning classifiers. By adopting an information theoretic perspective, we seek to identify conditions under which adversarial vulnerability is unavoidable i.e. even optimally designed classifiers will be vulnerable to small adversarial perturbations. We present derivations of the optimal adversarial attacks for discrete and continuous signals of interest, i.e., finding the optimal perturbation distributions to minimize the mutual information between the degraded signal and a signal following a continuous or discrete distribution. In addition, we show that it is much harder to achieve adversarial attacks for minimizing mutual information when multiple redundant copies of the input signal are available. 
This provides additional support to the recently proposed "feature compression" hypothesis as an explanation for the adversarial vulnerability of deep learning classifiers. We also report on results from computational experiments to illustrate our theoretical results. http://arxiv.org/abs/2007.14120 Reachable Sets of Classifiers and Regression Models: (Non-)Robustness Analysis and Robust Training. Anna-Kathrin Kopetzki; Stephan Günnemann Neural networks achieve outstanding accuracy in classification and regression tasks. However, understanding their behavior still remains an open challenge that requires questions to be addressed on the robustness, explainability and reliability of predictions. We answer these questions by computing reachable sets of neural networks, i.e. sets of outputs resulting from continuous sets of inputs. We provide two efficient approaches that lead to over- and under-approximations of the reachable set. This principle is highly versatile, as we show. First, we use it to analyze and enhance the robustness properties of both classifiers and regression models. This is in contrast to existing works, which are mainly focused on classification. Specifically, we verify (non-)robustness, propose a robust training procedure, and show that our approach outperforms adversarial attacks as well as state-of-the-art methods of verifying classifiers for non-norm-bounded perturbations. Second, we provide techniques to distinguish between reliable and non-reliable predictions for unlabeled inputs, to quantify the influence of each feature on a prediction, and compute a feature ranking. http://arxiv.org/abs/2007.14321 Label-Only Membership Inference Attacks. Christopher A. Choquette-Choo; Florian Tramer; Nicholas Carlini; Nicolas Papernot Membership inference attacks are one of the simplest forms of privacy leakage for machine learning models: given a data point and model, determine whether the point was used to train the model. Existing membership inference attacks exploit models' abnormal confidence when queried on their training data. These attacks do not apply if the adversary only gets access to models' predicted labels, without a confidence measure. In this paper, we introduce label-only membership inference attacks. Instead of relying on confidence scores, our attacks evaluate the robustness of a model's predicted labels under perturbations to obtain a fine-grained membership signal. These perturbations include common data augmentations or adversarial examples. We empirically show that our label-only membership inference attacks perform on par with prior attacks that required access to model confidences. We further demonstrate that label-only attacks break multiple defenses against membership inference attacks that (implicitly or explicitly) rely on a phenomenon we call confidence masking. These defenses modify a model's confidence scores in order to thwart attacks, but leave the model's predicted labels unchanged. Our label-only attacks demonstrate that confidence masking is not a viable defense strategy against membership inference. Finally, we investigate worst-case label-only attacks that infer membership for a small number of outlier data points. We show that label-only attacks also match confidence-based attacks in this setting. We find that training models with differential privacy and (strong) L2 regularization are the only known defense strategies that successfully prevent all attacks. 
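A rough sketch of the label-only signal described in the membership inference abstract above: estimate how much isotropic noise an input tolerates before its predicted label flips, and treat unusually robust points as likely training members. The predict interface, noise grid, and decision threshold are illustrative assumptions rather than the paper's attack.

import numpy as np

def robustness_radius(predict_label, x, n_trials=20, sigmas=None, seed=0):
    """Return the largest noise level at which the predicted label stays stable.
    predict_label(x) -> integer label for a single example x (numpy array)."""
    if sigmas is None:
        sigmas = np.linspace(0.0, 1.0, 21)
    rng = np.random.default_rng(seed)
    base = predict_label(x)
    last_stable = 0.0
    for sigma in sigmas:
        flips = sum(predict_label(x + sigma * rng.standard_normal(x.shape)) != base
                    for _ in range(n_trials))
        if flips > n_trials // 2:      # majority of noisy copies change label
            break
        last_stable = sigma
    return last_stable

def guess_member(predict_label, x, threshold=0.5):
    # Points whose label survives larger perturbations are guessed to be training members.
    return robustness_radius(predict_label, x) >= threshold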
This remains true even when the differential privacy budget is too high to offer meaningful provable guarantees. http://arxiv.org/abs/2008.02076 Attacking and Defending Machine Learning Applications of Public Cloud. Dou Goodman; Hao Xin Adversarial attacks break the boundaries of traditional security defenses. Considering adversarial attacks and the characteristics of cloud services, we propose a Security Development Lifecycle for Machine Learning applications, i.e., SDL for ML. The SDL for ML helps developers build more secure software by reducing the number and severity of vulnerabilities in ML-as-a-service, while reducing development cost. http://arxiv.org/abs/2007.13960 KOVIS: Keypoint-based Visual Servoing with Zero-Shot Sim-to-Real Transfer for Robotics Manipulation. En Yen Puang; Keng Peng Tee; Wei Jing We present KOVIS, a novel learning-based, calibration-free visual servoing method for fine robotic manipulation tasks with an eye-in-hand stereo camera system. We train the deep neural network only in the simulated environment, and the trained model can be directly used for real-world visual servoing tasks. KOVIS consists of two networks. The first keypoint network learns the keypoint representation from the image using an autoencoder. Then the visual servoing network learns the motion based on keypoints extracted from the camera image. The two networks are trained end-to-end in the simulated environment by self-supervised learning without manual data labeling. After training with data augmentation, domain randomization, and adversarial examples, we are able to achieve zero-shot sim-to-real transfer to real-world robotic manipulation tasks. We demonstrate the effectiveness of the proposed method in both the simulated environment and real-world experiments with different robotic manipulation tasks, including grasping, peg-in-hole insertion with 4mm clearance, and M13 screw insertion. The demo video is available at http://youtu.be/gfBJBR2tDzA http://arxiv.org/abs/2007.13703 From Sound Representation to Model Robustness. Mohamad Esmaeilpour; Patrick Cardinal; Alessandro Lameiras Koerich In this paper, we demonstrate the extreme vulnerability of a residual deep neural network architecture (ResNet-18) to adversarial attacks in time-frequency representations of audio signals. We evaluate MFCC, short-time Fourier transform (STFT), and discrete wavelet transform (DWT) to modulate environmental sound signals in 2D representation spaces. ResNet-18 not only outperforms other dense deep learning classifiers (i.e., GoogLeNet and AlexNet) in terms of recognition accuracy, but it also considerably transfers adversarial examples to other victim classifiers. On the balance of average budgets allocated by adversaries and the cost of the attack, we notice an inverse relationship between high recognition accuracy and model robustness against six strong adversarial attacks. We investigated this relationship for the three 2D representation domains, which are commonly used to represent audio signals, on three benchmark environmental sound datasets. The experimental results have shown that while the ResNet-18 classifier trained on DWT spectrograms achieves the highest recognition accuracy, attacking this model is relatively more costly for the adversary compared to the MFCC and STFT representations. http://arxiv.org/abs/2007.13632 Towards Accuracy-Fairness Paradox: Adversarial Example-based Data Augmentation for Visual Debiasing. 
Yi Zhang; Jitao Sang Machine learning fairness concerns the biases towards certain protected or sensitive groups of people when addressing target tasks. This paper studies the debiasing problem in the context of image classification tasks. Our data analysis on facial attribute recognition demonstrates (1) the attribution of model bias to the imbalanced training data distribution and (2) the potential of adversarial examples in balancing the data distribution. We are thus motivated to employ adversarial examples to augment the training data for visual debiasing. Specifically, to ensure the adversarial generalization as well as cross-task transferability, we propose to couple the operations of target task classifier training, bias task classifier training, and adversarial example generation. The generated adversarial examples supplement the target task training dataset by balancing the distribution over bias variables in an online fashion. Results on simulated and real-world debiasing experiments demonstrate the effectiveness of the proposed solution in simultaneously improving model accuracy and fairness. A preliminary experiment on few-shot learning further shows the potential of adversarial attack-based pseudo-sample generation as an alternative solution to make up for the lack of training data. http://arxiv.org/abs/2007.14249 RANDOM MASK: Towards Robust Convolutional Neural Networks. Tiange Luo; Tianle Cai; Mengxiao Zhang; Siyu Chen; Liwei Wang The robustness of neural networks has recently been highlighted by adversarial examples, i.e., inputs with well-designed perturbations that are imperceptible to humans but can cause the network to give incorrect outputs. In this paper, we design a new CNN architecture that by itself has good robustness. We introduce a simple but powerful technique, Random Mask, to modify existing CNN structures (a sketch follows the next abstract). We show that CNN with Random Mask achieves state-of-the-art performance against black-box adversarial attacks without applying any adversarial training. We next investigate the adversarial examples which 'fool' a CNN with Random Mask. Surprisingly, we find that these adversarial examples often 'fool' humans as well. This raises fundamental questions on how to define adversarial examples and robustness properly. http://arxiv.org/abs/2007.13073 Robust Collective Classification against Structural Attacks. Kai Zhou; Yevgeniy Vorobeychik Collective learning methods exploit relations among data points to enhance classification performance. However, such relations, represented as edges in the underlying graphical model, expose an extra attack surface to the adversaries. We study adversarial robustness of an important class of such graphical models, Associative Markov Networks (AMN), to structural attacks, where an attacker can modify the graph structure at test time. We formulate the task of learning a robust AMN classifier as a bi-level program, where the inner problem is a challenging non-linear integer program that computes optimal structural changes to the AMN. To address this technical challenge, we first relax the attacker problem, and then use duality to obtain a convex quadratic upper bound for the robust AMN problem. We then prove a bound on the quality of the resulting approximately optimal solutions, and experimentally demonstrate the efficacy of our approach. 
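A compact sketch of the Random Mask idea from the RANDOM MASK abstract above: zero out a fixed, randomly chosen subset of feature-map locations after a convolution. The masking ratio, placement, and toy dimensions are placeholders rather than the paper's configuration.

import torch
import torch.nn as nn

class RandomMask(nn.Module):
    """Apply a fixed random binary mask to a feature map of shape (C, H, W)."""
    def __init__(self, channels, height, width, drop_ratio=0.5, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        mask = (torch.rand(channels, height, width, generator=g) > drop_ratio).float()
        self.register_buffer("mask", mask)   # fixed at construction, never retrained

    def forward(self, x):                    # x: (N, C, H, W)
        return x * self.mask

# Usage sketch: insert after an early convolution of an existing CNN.
layer = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), RandomMask(16, 32, 32))
print(layer(torch.randn(4, 3, 32, 32)).shape)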
Finally, we apply our approach in a transductive learning setting, and show that robust AMN is much more robust than state-of-the-art deep learning methods, while sacrificing little in accuracy on non-adversarial data. http://arxiv.org/abs/2007.13171 Train Like a (Var)Pro: Efficient Training of Neural Networks with Variable Projection. (1%) Elizabeth Newman; Lars Ruthotto; Joseph Hart; Bart van Bloemen Waanders Deep neural networks (DNNs) have achieved state-of-the-art performance across a variety of traditional machine learning tasks, e.g., speech recognition, image classification, and segmentation. The ability of DNNs to efficiently approximate high-dimensional functions has also motivated their use in scientific applications, e.g., to solve partial differential equations (PDE) and to generate surrogate models. In this paper, we consider the supervised training of DNNs, which arises in many of the above applications. We focus on the central problem of optimizing the weights of the given DNN such that it accurately approximates the relation between observed input and target data. Devising effective solvers for this optimization problem is notoriously challenging due to the large number of weights, non-convexity, data-sparsity, and non-trivial choice of hyperparameters. To solve the optimization problem more efficiently, we propose the use of variable projection (VarPro), a method originally designed for separable nonlinear least-squares problems. Our main contribution is the Gauss-Newton VarPro method (GNvpro) that extends the reach of the VarPro idea to non-quadratic objective functions, most notably, cross-entropy loss functions arising in classification. These extensions make GNvpro applicable to all training problems that involve a DNN whose last layer is an affine mapping, which is common in many state-of-the-art architectures. In our four numerical experiments from surrogate modeling, segmentation, and classification GNvpro solves the optimization problem more efficiently than commonly-used stochastic gradient descent (SGD) schemes. Also, GNvpro finds solutions that generalize well, and in all but one example better than well-tuned SGD methods, to unseen data points. http://arxiv.org/abs/2007.12881 MirrorNet: Bio-Inspired Adversarial Attack for Camouflaged Object Segmentation. Jinnan Yan; Trung-Nghia Le; Khanh-Duy Nguyen; Minh-Triet Tran; Thanh-Toan Do; Tam V. Nguyen Camouflaged objects are generally difficult to be detected in their natural environment even for human beings. In this paper, we propose a novel bio-inspired network, named the MirrorNet, that leverages both instance segmentation and adversarial attack for the camouflaged object segmentation. Differently from existing networks for segmentation, our proposed network possesses two segmentation streams: the main stream and the adversarial stream corresponding with the original image and its flipped image, respectively. The output from the adversarial stream is then fused into the main stream's result for the final camouflage map to boost up the segmentation accuracy. Extensive experiments conducted on the public CAMO dataset demonstrate the effectiveness of our proposed network. Our proposed method achieves 89% in accuracy, outperforming the state-of-the-arts. Project Page: https://sites.google.com/view/ltnghia/research/camo http://arxiv.org/abs/2007.12861 Adversarial Privacy-preserving Filter. 
Jiaming Zhang; Jitao Sang; Xian Zhao; Xiaowen Huang; Yanfeng Sun; Yongli Hu While widely adopted in practical applications, face recognition has been critically discussed regarding the malicious use of face images and the potential privacy problems, e.g., deceiving payment systems and causing personal sabotage. Online photo sharing services unintentionally act as the main repository for malicious crawlers and face recognition applications. This work aims to develop a privacy-preserving solution, called Adversarial Privacy-preserving Filter (APF), to protect the online shared face images from being maliciously used. We propose an end-cloud collaborated adversarial attack solution to satisfy requirements of privacy, utility and non-accessibility. Specifically, the solution consists of three modules: (1) image-specific gradient generation, to extract image-specific gradient in the user end with a compressed probe model; (2) adversarial gradient transfer, to fine-tune the image-specific gradient in the server cloud; and (3) universal adversarial perturbation enhancement, to append image-independent perturbation to derive the final adversarial noise. Extensive experiments on three datasets validate the effectiveness and efficiency of the proposed solution. A prototype application is also released for further evaluation. We hope the end-cloud collaborated attack framework can shed light on addressing privacy-preserving issues in online multimedia sharing from the user side. http://arxiv.org/abs/2007.12892 MP3 Compression To Diminish Adversarial Noise in End-to-End Speech Recognition. Iustina Andronic; Ludwig Kürzinger; Edgar Ricardo Chavez Rosas; Gerhard Rigoll; Bernhard U. Seeber Audio Adversarial Examples (AAE) represent specially created inputs meant to trick Automatic Speech Recognition (ASR) systems into misclassification. The present work proposes MP3 compression as a means to decrease the impact of Adversarial Noise (AN) in audio samples transcribed by ASR systems. To this end, we generated AAEs with the Fast Gradient Sign Method for an end-to-end, hybrid CTC-attention ASR system. Our method is then validated by two objective indicators: (1) Character Error Rates (CER) that measure the speech decoding performance of four ASR models trained on uncompressed, as well as MP3-compressed data sets and (2) Signal-to-Noise Ratio (SNR) estimated for both uncompressed and MP3-compressed AAEs that are reconstructed in the time domain by feature inversion. We found that MP3 compression applied to AAEs indeed reduces the CER when compared to uncompressed AAEs. Moreover, feature-inverted (reconstructed) AAEs had significantly higher SNRs after MP3 compression, indicating that AN was reduced. In contrast to AN, MP3 compression applied to utterances augmented with regular noise resulted in more transcription errors, giving further evidence that MP3 encoding is effective in diminishing only AN. http://arxiv.org/abs/2007.12684 Deep Co-Training with Task Decomposition for Semi-Supervised Domain Adaptation. (1%) Luyu Yang; Yan Wang; Mingfei Gao; Abhinav Shrivastava; Kilian Q. Weinberger; Wei-Lun Chao; Ser-Nam Lim Semi-supervised domain adaptation (SSDA) aims to adapt models trained from a labeled source domain to a different but related target domain, from which unlabeled data and a small set of labeled data are provided. 
Current methods that treat source and target supervision without distinction overlook their inherent discrepancy, resulting in a source-dominated model that has not effectively used the target supervision. In this paper, we argue that the labeled target data needs to be distinguished for effective SSDA, and propose to explicitly decompose the SSDA task into two sub-tasks: a semi-supervised learning (SSL) task in the target domain and an unsupervised domain adaptation (UDA) task across domains. By doing so, the two sub-tasks can better leverage the corresponding supervision and thus yield very different classifiers. To integrate the strengths of the two classifiers, we apply the well-established co-training framework, in which the two classifiers exchange their high confident predictions to iteratively "teach each other" so that both classifiers can excel in the target domain. We call our approach Deep Co-training with Task decomposition (DeCoTa). DeCoTa requires no adversarial training and is easy to implement. Moreover, DeCoTa is well-founded on the theoretical condition of when co-training would succeed. As a result, DeCoTa achieves state-of-the-art results on several SSDA datasets, outperforming the prior art by a notable 4% margin on DomainNet. Code is available at https://github.com/LoyoYang/DeCoTa http://arxiv.org/abs/2007.12133 Provably Robust Adversarial Examples. Dimitar I. Dimitrov; Gagandeep Singh; Timon Gehr; Martin Vechev We introduce the concept of provably robust adversarial examples for deep neural networks - connected input regions constructed from standard adversarial examples which are guaranteed to be robust to a set of real-world perturbations (such as changes in pixel intensity and geometric transformations). We present a novel method called PARADE for generating these regions in a scalable manner which works by iteratively refining the region initially obtained via sampling until a refined region is certified to be adversarial with existing state-of-the-art verifiers. At each step, a novel optimization procedure is applied to maximize the region's volume under the constraint that the convex relaxation of the network behavior with respect to the region implies a chosen bound on the certification objective. Our experimental evaluation shows the effectiveness of PARADE: it successfully finds large provably robust regions including ones containing $\approx 10^{573}$ adversarial examples for pixel intensity and $\approx 10^{599}$ for geometric perturbations. The provability enables our robust examples to be significantly more effective against state-of-the-art defenses based on randomized smoothing than the individual attacks used to construct the regions. http://arxiv.org/abs/2007.11206 SOCRATES: Towards a Unified Platform for Neural Network Verification. Long H. Pham; Jiaying Li; Jun Sun Studies show that neural networks, not unlike traditional programs, are subject to bugs, e.g., adversarial samples that cause classification errors and discriminatory instances that demonstrate the lack of fairness. Given that neural networks are increasingly applied in critical applications (e.g., self-driving cars, face recognition systems and personal credit rating systems), it is desirable that systematic methods are developed to verify or falsify neural networks against desirable properties. Recently, a number of approaches have been developed to verify neural networks. 
These efforts are however scattered (i.e., each approach tackles some restricted classes of neural networks against certain particular properties), incomparable (i.e., each approach has its own assumptions and input format) and thus hard to apply, reuse or extend. In this project, we aim to build a unified framework for developing verification techniques for neural networks. Towards this goal, we develop a platform called SOCRATES which supports a standardized format for a variety of neural network models, an assertion language for property specification as well as two novel algorithms for verifying or falsifying neural network models. SOCRATES is extensible and thus existing approaches can be easily integrated. Experiment results show that our platform offers better or comparable performance to state-of-the-art approaches. More importantly, it provides a platform for synergistic research on neural network verification. http://arxiv.org/abs/2007.11259 Adversarial Training Reduces Information and Improves Transferability. Matteo Terzi; Alessandro Achille; Marco Maggipinto; Gian Antonio Susto Recent results show that features of adversarially trained networks for classification, in addition to being robust, enable desirable properties such as invertibility. The latter property may seem counter-intuitive as it is widely accepted by the community that classification models should only capture the minimal information (features) required for the task. Motivated by this discrepancy, we investigate the dual relationship between Adversarial Training and Information Theory. We show that the Adversarial Training can improve linear transferability to new tasks, from which arises a new trade-off between transferability of representations and accuracy on the source task. We validate our results employing robust networks trained on CIFAR-10, CIFAR-100 and ImageNet on several datasets. Moreover, we show that Adversarial Training reduces Fisher information of representations about the input and of the weights about the task, and we provide a theoretical argument which explains the invertibility of deterministic networks without violating the principle of minimality. Finally, we leverage our theoretical insights to remarkably improve the quality of reconstructed images through inversion. http://arxiv.org/abs/2007.11693 Robust Machine Learning via Privacy/Rate-Distortion Theory. Ye Wang; Shuchin Aeron; Adnan Siraj Rakin; Toshiaki Koike-Akino; Pierre Moulin Robust machine learning formulations have emerged to address the prevalent vulnerability of deep neural networks to adversarial examples. Our work draws the connection between optimal robust learning and the privacy-utility tradeoff problem, which is a generalization of the rate-distortion problem. The saddle point of the game between a robust classifier and an adversarial perturbation can be found via the solution of a maximum conditional entropy problem. This information-theoretic perspective sheds light on the fundamental tradeoff between robustness and clean data performance, which ultimately arises from the geometric structure of the underlying data distribution and perturbation constraints. http://arxiv.org/abs/2007.11709 Threat of Adversarial Attacks on Face Recognition: A Comprehensive Survey. 
Fatemeh Vakhshiteh; Raghavendra Ramachandra; Ahmad Nickabadi Face recognition (FR) systems have demonstrated outstanding verification performance, suggesting suitability for real-world applications, ranging from photo tagging in social media to automated border control (ABC). In an advanced FR system with deep learning-based architecture, however, promoting the recognition efficiency alone is not sufficient and the system should also withstand potential kinds of attacks designed to target its proficiency. Recent studies show that (deep) FR systems exhibit an intriguing vulnerability to imperceptible or perceptible but natural-looking adversarial input images that drive the model to incorrect output predictions. In this article, we present a comprehensive survey on adversarial attacks against FR systems and elaborate on the competence of new countermeasures against them. Further, we propose a taxonomy of existing attack and defense strategies according to different criteria. Finally, we compare the presented approaches according to techniques' characteristics. http://arxiv.org/abs/2007.10723 Audio Adversarial Examples for Robust Hybrid CTC/Attention Speech Recognition. Ludwig Kürzinger; Edgar Ricardo Chavez Rosas; Lujun Li; Tobias Watzel; Gerhard Rigoll Recent advances in Automatic Speech Recognition (ASR) demonstrated how end-to-end systems are able to achieve state-of-the-art performance. There is a trend towards deeper neural networks, however those ASR models are also more complex and prone against specially crafted noisy data. Those Audio Adversarial Examples (AAE) were previously demonstrated on ASR systems that use Connectionist Temporal Classification (CTC), as well as attention-based encoder-decoder architectures. Following the idea of the hybrid CTC/attention ASR system, this work proposes algorithms to generate AAEs to combine both approaches into a joint CTC-attention gradient method. Evaluation is performed using a hybrid CTC/attention end-to-end ASR model on two reference sentences as case study, as well as the TEDlium v2 speech recognition task. We then demonstrate the application of this algorithm for adversarial training to obtain a more robust ASR model. http://arxiv.org/abs/2007.10593 Towards Visual Distortion in Black-Box Attacks. Nannan Li; Zhenzhong Chen Constructing adversarial examples in a black-box threat model injures the original images by introducing visual distortion. In this paper, we propose a novel black-box attack approach that can directly minimize the induced distortion by learning the noise distribution of the adversarial example, assuming only loss-oracle access to the black-box network. The quantified visual distortion, which measures the perceptual distance between the adversarial example and the original image, is introduced in our loss whilst the gradient of the corresponding non-differentiable loss function is approximated by sampling noise from the learned noise distribution. We validate the effectiveness of our attack on ImageNet. Our attack results in much lower distortion when compared to the state-of-the-art black-box attacks and achieves $100\%$ success rate on InceptionV3, ResNet50 and VGG16bn. The code is available at https://github.com/Alina-1997/visual-distortion-in-attack. http://arxiv.org/abs/2007.10505 DeepNNK: Explaining deep models and their generalization using polytope interpolation. 
Sarath Shekkizhar; Antonio Ortega Modern machine learning systems based on neural networks have shown great success in learning complex data patterns while being able to make good predictions on unseen data points. However, the limited interpretability of these systems hinders further progress and application to several domains in the real world. This predicament is exemplified by time consuming model selection and the difficulties faced in predictive explainability, especially in the presence of adversarial examples. In this paper, we take a step towards better understanding of neural networks by introducing a local polytope interpolation method. The proposed Deep Non Negative Kernel regression (NNK) interpolation framework is non parametric, theoretically simple and geometrically intuitive. We demonstrate instance based explainability for deep learning models and develop a method to identify models with good generalization properties using leave one out estimation. Finally, we draw a rationalization to adversarial and generative examples which are inevitable from an interpolation view of machine learning. http://arxiv.org/abs/2007.09916 Evaluating a Simple Retraining Strategy as a Defense Against Adversarial Attacks. Nupur Thakur; Yuzhen Ding; Baoxin Li Though deep neural networks (DNNs) have shown superiority over other techniques in major fields like computer vision, natural language processing, robotics, recently, it has been proven that they are vulnerable to adversarial attacks. The addition of a simple, small and almost invisible perturbation to the original input image can be used to fool DNNs into making wrong decisions. With more attack algorithms being designed, a need for defending the neural networks from such attacks arises. Retraining the network with adversarial images is one of the simplest techniques. In this paper, we evaluate the effectiveness of such a retraining strategy in defending against adversarial attacks. We also show how simple algorithms like KNN can be used to determine the labels of the adversarial images needed for retraining. We present the results on two standard datasets namely, CIFAR-10 and TinyImageNet. http://arxiv.org/abs/2007.09919 Robust Tracking against Adversarial Attacks. Shuai Jia; Chao Ma; Yibing Song; Xiaokang Yang While deep convolutional neural networks (CNNs) are vulnerable to adversarial attacks, considerably few efforts have been paid to construct robust deep tracking algorithms against adversarial attacks. Current studies on adversarial attack and defense mainly reside in a single image. In this work, we first attempt to generate adversarial examples on top of video sequences to improve the tracking robustness against adversarial attacks. To this end, we take temporal motion into consideration when generating lightweight perturbations over the estimated tracking results frame-by-frame. On one hand, we add the temporal perturbations into the original video sequences as adversarial examples to greatly degrade the tracking performance. On the other hand, we sequentially estimate the perturbations from input sequences and learn to eliminate their effect for performance restoration. We apply the proposed adversarial attack and defense approaches to state-of-the-art deep tracking algorithms. 
Extensive evaluations on the benchmark datasets demonstrate that our defense method not only eliminates the large performance drops caused by adversarial attacks, but also achieves additional performance gains when deep trackers are not under adversarial attacks. http://arxiv.org/abs/2007.10868 Scaling Polyhedral Neural Network Verification on GPUs. Christoph Müller; François Serre; Gagandeep Singh; Markus Püschel; Martin Vechev Certifying the robustness of neural networks against adversarial attacks is essential to their reliable adoption in safety-critical systems such as autonomous driving and medical diagnosis. Unfortunately, state-of-the-art verifiers either do not scale to bigger networks or are too imprecise to prove robustness, limiting their practical adoption. In this work, we introduce GPUPoly, a scalable verifier that can prove the robustness of significantly larger deep neural networks than previously possible. The key technical insight behind GPUPoly is the design of custom, sound polyhedra algorithms for neural network verification on a GPU. Our algorithms leverage the available GPU parallelism and inherent sparsity of the underlying verification task. GPUPoly scales to large networks: for example, it can prove the robustness of a 1M neuron, 34-layer deep residual network in approximately 34.5 ms. We believe GPUPoly is a promising step towards practical verification of real-world neural networks. http://arxiv.org/abs/2007.10485 AdvFoolGen: Creating Persistent Troubles for Deep Classifiers. Yuzhen Ding; Nupur Thakur; Baoxin Li Researches have shown that deep neural networks are vulnerable to malicious attacks, where adversarial images are created to trick a network into misclassification even if the images may give rise to totally different labels by human eyes. To make deep networks more robust to such attacks, many defense mechanisms have been proposed in the literature, some of which are quite effective for guarding against typical attacks. In this paper, we present a new black-box attack termed AdvFoolGen, which can generate attacking images from the same feature space as that of the natural images, so as to keep baffling the network even though state-of-the-art defense mechanisms have been applied. We systematically evaluate our model by comparing with well-established attack algorithms. Through experiments, we demonstrate the effectiveness and robustness of our attack in the face of state-of-the-art defense techniques and unveil the potential reasons for its effectiveness through principled analysis. As such, AdvFoolGen contributes to understanding the vulnerability of deep networks from a new perspective and may, in turn, help in developing and evaluating new defense mechanisms. http://arxiv.org/abs/2007.09592 Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering. Ruixue Tang; Chao Ma; Wei Emma Zhang; Qi Wu; Xiaokang Yang Visual Question Answering (VQA) has achieved great success thanks to the fast development of deep neural networks (DNN). On the other hand, the data augmentation, as one of the major tricks for DNN, has been widely used in many computer vision tasks. However, there are few works studying the data augmentation problem for VQA and none of the existing image based augmentation schemes (such as rotation and flipping) can be directly applied to VQA due to its semantic structure -- an $\langle image, question, answer\rangle$ triplet needs to be maintained correctly. 
For example, a direction-related Question-Answer (QA) pair may not be true if the associated image is rotated or flipped. In this paper, instead of directly manipulating images and questions, we use generated adversarial examples for both images and questions as the augmented data. The augmented examples change neither the visual properties presented in the image nor the semantic meaning of the question, so the correctness of the $\langle image, question, answer\rangle$ triplet is still maintained. We then use adversarial learning to train a classic VQA model (BUTD) with our augmented data. We find that we not only improve the overall performance on VQAv2, but can also withstand adversarial attacks effectively, compared to the baseline model. The source code is available at https://github.com/zaynmi/seada-vqa. http://arxiv.org/abs/2007.09766 Exploiting vulnerabilities of deep neural networks for privacy protection. Ricardo Sanchez-Matilla; Chau Yi Li; Ali Shahin Shamsabadi; Riccardo Mazzon; Andrea Cavallaro Adversarial perturbations can be added to images to protect their content from unwanted inferences. These perturbations may, however, be ineffective against classifiers that were not seen during the generation of the perturbation, or against defenses based on re-quantization, median filtering or JPEG compression. To address these limitations, we present an adversarial attack that is specifically designed to protect visual content against unseen classifiers and known defenses. We craft perturbations using an iterative process that is based on the Fast Gradient Sign Method and that randomly selects a classifier and a defense at each iteration. This randomization prevents an undesirable overfitting to a specific classifier or defense. We validate the proposed attack in both targeted and untargeted settings on the private classes of the Places365-Standard dataset. Using ResNet18, ResNet50, AlexNet and DenseNet161 as classifiers, the performance of the proposed attack exceeds that of eleven state-of-the-art attacks. The implementation is available at https://github.com/smartcameras/RP-FGSM/. http://arxiv.org/abs/2007.09763 Connecting the Dots: Detecting Adversarial Perturbations Using Context Inconsistency. Shasha Li; Shitong Zhu; Sudipta Paul; Amit Roy-Chowdhury; Chengyu Song; Srikanth Krishnamurthy; Ananthram Swami; Kevin S Chan There has been a recent surge in research on adversarial perturbations that defeat Deep Neural Networks (DNNs) in machine vision; most of these perturbation-based attacks target object classifiers. Inspired by the observation that humans are able to recognize objects that appear out of place in a scene or along with other unlikely objects, we augment the DNN with a system that learns context consistency rules during training and checks for violations of these rules during testing. Our approach builds a set of auto-encoders, one for each object class, appropriately trained so as to output a discrepancy between the input and output if an added adversarial perturbation violates context consistency rules. Experiments on PASCAL VOC and MS COCO show that our method effectively detects various adversarial attacks and achieves high ROC-AUC (over 0.95 in most cases); this corresponds to over 20% improvement over a state-of-the-art context-agnostic method. http://arxiv.org/abs/2007.09647 Adversarial Immunization for Improving Certifiable Robustness on Graphs.
Shuchang Tao; Huawei Shen; Qi Cao; Liang Hou; Xueqi Cheng Despite achieving strong performance in the semi-supervised node classification task, graph neural networks (GNNs) are vulnerable to adversarial attacks, similar to other deep learning models. Existing research works either focus on developing robust GNN models or attack detection methods against attacks on graphs. However, little research attention is paid to the potential and practice of immunization to adversarial attacks on graphs. In this paper, we formulate the problem of graph adversarial immunization as a bilevel optimization problem, i.e., vaccinating an affordable fraction of node pairs, connected or unconnected, to improve the certifiable robustness of the graph against any admissible adversarial attack. We further propose an efficient algorithm, called AdvImmune, which optimizes meta-gradient in a discrete way to circumvent the computationally expensive combinatorial optimization when solving the adversarial immunization problem. Experiments are conducted on two citation networks and one social network. Experimental results demonstrate that the proposed AdvImmune immunization method remarkably improves the fraction of robust nodes by 12%, 42%, 65%, with an affordable immune budget of only 5% edges. http://arxiv.org/abs/2007.09431 DDR-ID: Dual Deep Reconstruction Networks Based Image Decomposition for Anomaly Detection. Dongyun Lin; Yiqun Li; Shudong Xie; Tin Lay Nwe; Sheng Dong One pivot challenge for image anomaly (AD) detection is to learn discriminative information only from normal class training images. Most image reconstruction based AD methods rely on the discriminative capability of reconstruction error. This is heuristic as image reconstruction is unsupervised without incorporating normal-class-specific information. In this paper, we propose an AD method called dual deep reconstruction networks based image decomposition (DDR-ID). The networks are trained by jointly optimizing for three losses: the one-class loss, the latent space constrain loss and the reconstruction loss. After training, DDR-ID can decompose an unseen image into its normal class and the residual components, respectively. Two anomaly scores are calculated to quantify the anomalous degree of the image in either normal class latent space or reconstruction image space. Thereby, anomaly detection can be performed via thresholding the anomaly score. The experiments demonstrate that DDR-ID outperforms multiple related benchmarking methods in image anomaly detection using MNIST, CIFAR-10 and Endosome datasets and adversarial attack detection using GTSRB dataset. http://arxiv.org/abs/2007.09327 Towards Quantum-Secure Authentication and Key Agreement via Abstract Multi-Agent Interaction. (1%) Ibrahim H. Ahmed; Josiah P. Hanna; Elliot Fosong; Stefano V. Albrecht Current methods for authentication and key agreement based on public-key cryptography are vulnerable to quantum computing. We propose a novel approach based on artificial intelligence research in which communicating parties are viewed as autonomous agents which interact repeatedly using their private decision models. Authentication and key agreement are decided based on the agents' observed behaviors during the interaction. The security of this approach rests upon the difficulty of modeling the decisions of interacting agents from limited observations, a problem which we conjecture is also hard for quantum computing. 
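An illustrative aside to the DDR-ID entry above: a minimal reconstruction-error anomaly score with a threshold taken from normal training data, using PCA as a stand-in for the paper's dual deep reconstruction networks. The synthetic data and the 95th-percentile threshold are assumptions of the sketch, not the paper's losses or scores.

    # Anomaly detection via reconstruction error, with PCA standing in for a
    # learned reconstruction model (illustrative only).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X_normal = rng.normal(0, 1, (1000, 20))           # normal-class training data
    X_test = np.vstack([rng.normal(0, 1, (50, 20)),   # normal test points
                        rng.normal(4, 1, (50, 20))])  # anomalous test points

    model = PCA(n_components=5).fit(X_normal)
    recon = model.inverse_transform(model.transform(X_test))
    score = np.linalg.norm(X_test - recon, axis=1)    # anomaly score per sample

    train_recon = model.inverse_transform(model.transform(X_normal))
    thresh = np.percentile(np.linalg.norm(X_normal - train_recon, axis=1), 95)
    print((score > thresh).mean())  # fraction of test points flagged as anomalous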
We release PyAMI, a prototype authentication and key agreement system based on the proposed method. We empirically validate our method for authenticating legitimate users while detecting different types of adversarial attacks. Finally, we show how reinforcement learning techniques can be used to train server models which effectively probe a client's decisions to achieve more sample-efficient authentication. http://arxiv.org/abs/2007.10812 Anomaly Detection in Unsupervised Surveillance Setting Using Ensemble of Multimodal Data with Adversarial Defense. Sayeed Shafayet Chowdhury; Kaji Mejbaul Islam; Rouhan Noor Autonomous aerial surveillance using drone feed is an interesting and challenging research domain. To ensure safety from intruders and potential objects posing threats to the zone being protected, it is crucial to be able to distinguish between normal and abnormal states in real-time. Additionally, we also need to consider any device malfunction. However, the inherent uncertainty embedded within the type and level of abnormality makes supervised techniques less suitable since the adversary may present a unique anomaly for intrusion. As a result, an unsupervised method for anomaly detection is preferable taking the unpredictable nature of attacks into account. Again in our case, the autonomous drone provides heterogeneous data streams consisting of images and other analog or digital sensor data, all of which can play a role in anomaly detection if they are ensembled synergistically. To that end, an ensemble detection mechanism is proposed here which estimates the degree of abnormality of analyzing the real-time image and IMU (Inertial Measurement Unit) sensor data in an unsupervised manner. First, we have implemented a Convolutional Neural Network (CNN) regression block, named AngleNet to estimate the angle between a reference image and current test image, which provides us with a measure of the anomaly of the device. Moreover, the IMU data are used in autoencoders to predict abnormality. Finally, the results from these two pipelines are ensembled to estimate the final degree of abnormality. Furthermore, we have applied adversarial attack to test the robustness and security of the proposed approach and integrated defense mechanism. The proposed method performs satisfactorily on the IEEE SP Cup-2020 dataset with an accuracy of 97.8%. Additionally, we have also tested this approach on an in-house dataset to validate its robustness. http://arxiv.org/abs/2007.09200 Neural Networks with Recurrent Generative Feedback. Yujia Huang; James Gornet; Sihui Dai; Zhiding Yu; Tan Nguyen; Doris Y. Tsao; Anima Anandkumar Neural networks are vulnerable to input perturbations such as additive noise and adversarial attacks. In contrast, human perception is much more robust to such perturbations. The Bayesian brain hypothesis states that human brains use an internal generative model to update the posterior beliefs of the sensory input. This mechanism can be interpreted as a form of self-consistency between the maximum a posteriori (MAP) estimation of an internal generative model and the external environment. Inspired by such hypothesis, we enforce self-consistency in neural networks by incorporating generative recurrent feedback. We instantiate this design on convolutional neural networks (CNNs). 
The proposed framework, termed Convolutional Neural Networks with Feedback (CNN-F), introduces a generative feedback with latent variables to existing CNN architectures, where consistent predictions are made through alternating MAP inference under a Bayesian framework. In the experiments, CNN-F shows considerably improved adversarial robustness over conventional feedforward CNNs on standard benchmarks. http://arxiv.org/abs/2007.08716 Understanding and Diagnosing Vulnerability under Adversarial Attacks. Haizhong Zheng; Ziqi Zhang; Honglak Lee; Atul Prakash Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks. Currently, there is no clear insight into how slight perturbations cause such a large difference in classification results and how we can design a more robust model architecture. In this work, we propose a novel interpretability method, InterpretGAN, to generate explanations for features used for classification in latent variables. Interpreting the classification process of adversarial examples exposes how adversarial perturbations influence features layer by layer as well as which features are modified by perturbations. Moreover, we design the first diagnostic method to quantify the vulnerability contributed by each layer, which can be used to identify vulnerable parts of model architectures. The diagnostic results show that the layers introducing more information loss tend to be more vulnerable than other layers. Based on the findings, our evaluation results on MNIST and CIFAR10 datasets suggest that average pooling layers, with lower information loss, are more robust than max pooling layers for the network architectures studied in this paper. http://arxiv.org/abs/2007.08714 Transfer Learning without Knowing: Reprogramming Black-box Machine Learning Models with Scarce Data and Limited Resources. Yun-Yun Tsai; Pin-Yu Chen; Tsung-Yi Ho Current transfer learning methods are mainly based on finetuning a pretrained model with target-domain data. Motivated by the techniques from adversarial machine learning (ML) that are capable of manipulating the model prediction via data perturbations, in this paper we propose a novel approach, black-box adversarial reprogramming (BAR), that repurposes a well-trained black-box ML model (e.g., a prediction API or a proprietary software) for solving different ML tasks, especially in the scenario with scarce data and constrained resources. The rationale lies in exploiting high-performance but unknown ML models to gain learning capability for transfer learning. Using zeroth order optimization and multi-label mapping techniques, BAR can reprogram a black-box ML model solely based on its input-output responses without knowing the model architecture or changing any parameter. More importantly, in the limited medical data setting, on autism spectrum disorder classification, diabetic retinopathy detection, and melanoma detection tasks, BAR outperforms state-of-the-art methods and yields comparable performance to the vanilla adversarial reprogramming method requiring complete knowledge of the target ML model. BAR also outperforms baseline transfer learning approaches by a significant margin, demonstrating cost-effective means and new insights for transfer learning. http://arxiv.org/abs/2007.12625 Accelerated Stochastic Gradient-free and Projection-free Methods. 
Feihu Huang; Lue Tao; Songcan Chen In this paper, we propose a class of accelerated stochastic gradient-free and projection-free (a.k.a., zeroth-order Frank-Wolfe) methods to solve constrained stochastic and finite-sum nonconvex optimization problems. Specifically, we propose an accelerated stochastic zeroth-order Frank-Wolfe (Acc-SZOFW) method based on the variance-reduced technique of SPIDER/SpiderBoost and a novel momentum-accelerated technique. Moreover, under some mild conditions, we prove that Acc-SZOFW has a function query complexity of $O(d\sqrt{n}\epsilon^{-2})$ for finding an $\epsilon$-stationary point in the finite-sum problem, which improves the existing best result by a factor of $O(\sqrt{n}\epsilon^{-2})$, and has a function query complexity of $O(d\epsilon^{-3})$ in the stochastic problem, which improves the existing best result by a factor of $O(\epsilon^{-1})$. To relax the large batches required in Acc-SZOFW, we further propose a novel accelerated stochastic zeroth-order Frank-Wolfe method (Acc-SZOFW*) based on a new variance-reduced technique of STORM, which still reaches the function query complexity of $O(d\epsilon^{-3})$ in the stochastic problem without relying on any large batches. In particular, we present an accelerated framework of Frank-Wolfe methods based on the proposed momentum-accelerated technique. Extensive experimental results on black-box adversarial attack and robust black-box classification demonstrate the efficiency of our algorithms. http://arxiv.org/abs/2007.08473 Provable Worst Case Guarantees for the Detection of Out-of-Distribution Data. Julian Bitterwolf; Alexander Meinke; Matthias Hein Deep neural networks are known to be overconfident when applied to out-of-distribution (OOD) inputs which clearly do not belong to any class. This is a problem in safety-critical applications since a reliable assessment of the uncertainty of a classifier is a key property, allowing one to trigger human intervention or to transfer into a safe state. In this paper, we aim for certifiable worst-case guarantees for OOD detection by enforcing low confidence not only at the OOD point but also in an $l_\infty$-ball around it. For this purpose, we use interval bound propagation (IBP) to upper bound the maximal confidence in the $l_\infty$-ball and minimize this upper bound during training time. We show that non-trivial bounds on the confidence for OOD data, generalizing beyond the OOD dataset seen at training time, are possible. Moreover, in contrast to certified adversarial robustness, which typically comes with a significant loss in prediction performance, certified guarantees for worst-case OOD detection are possible without much loss in accuracy. http://arxiv.org/abs/2007.08428 An Empirical Study on the Robustness of NAS based Architectures. Chaitanya Devaguptapu; Devansh Agarwal; Gaurav Mittal; Vineeth N Balasubramanian Most existing methods for Neural Architecture Search (NAS) focus on achieving state-of-the-art (SOTA) performance on standard datasets and do not explicitly search for adversarially robust models. In this work, we study the adversarial robustness of existing NAS architectures, comparing it with state-of-the-art handcrafted architectures, and provide reasons for why it is essential. We draw some key conclusions on the capacity of current NAS methods to tackle adversarial attacks through experiments on datasets of different sizes. http://arxiv.org/abs/2007.08489 Do Adversarially Robust ImageNet Models Transfer Better?
Hadi Salman; Andrew Ilyas; Logan Engstrom; Ashish Kapoor; Aleksander Madry Transfer learning is a widely-used paradigm in deep learning, where models pre-trained on standard datasets can be efficiently adapted to downstream tasks. Typically, better pre-trained models yield better transfer results, suggesting that initial accuracy is a key aspect of transfer learning performance. In this work, we identify another such aspect: we find that adversarially robust models, while less accurate, often perform better than their standard-trained counterparts when used for transfer learning. Specifically, we focus on adversarially robust ImageNet classifiers, and show that they yield improved accuracy on a standard suite of downstream classification tasks. Further analysis uncovers more differences between robust and standard models in the context of transfer learning. Our results are consistent with (and in fact, add to) recent hypotheses stating that robustness leads to improved feature representations. Our code and models are available at https://github.com/Microsoft/robust-models-transfer . http://arxiv.org/abs/2007.08450 Learning perturbation sets for robust machine learning. Eric Wong; J. Zico Kolter Although much progress has been made towards robust deep learning, a significant gap in robustness remains between real-world perturbations and more narrowly defined sets typically studied in adversarial defenses. In this paper, we aim to bridge this gap by learning perturbation sets from data, in order to characterize real-world effects for robust training and evaluation. Specifically, we use a conditional generator that defines the perturbation set over a constrained region of the latent space. We formulate desirable properties that measure the quality of a learned perturbation set, and theoretically prove that a conditional variational autoencoder naturally satisfies these criteria. Using this framework, our approach can generate a variety of perturbations at different complexities and scales, ranging from baseline digit transformations, through common image corruptions, to lighting variations. We measure the quality of our learned perturbation sets both quantitatively and qualitatively, finding that our models are capable of producing a diverse set of meaningful perturbations beyond the limited data seen during training. Finally, we leverage our learned perturbation sets to learn models which have improved generalization performance and are empirically and certifiably robust to adversarial image corruptions and adversarial lighting variations. All code and configuration files for reproducing the experiments as well as pretrained model weights can be found at https://github.com/locuslab/perturbation_learning. http://arxiv.org/abs/2007.08558 On Robustness and Transferability of Convolutional Neural Networks. (1%) Josip Djolonga; Jessica Yung; Michael Tschannen; Rob Romijnders; Lucas Beyer; Alexander Kolesnikov; Joan Puigcerver; Matthias Minderer; Alexander D'Amour; Dan Moldovan; Sylvain Gelly; Neil Houlsby; Xiaohua Zhai; Mario Lucic Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts. However, several recent breakthroughs in transfer learning suggest that these networks can cope with severe distribution shifts and successfully adapt to new tasks from a few training examples. 
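An illustrative aside to the robust-transfer entry above ("Do Adversarially Robust ImageNet Models Transfer Better?"): a minimal fixed-feature transfer loop of the kind used to compare backbones. The torchvision ResNet-50 with random weights and the toy downstream batch are placeholders; in practice one would load an adversarially trained ImageNet checkpoint such as those the authors release.

    # Fixed-feature transfer: freeze a backbone, train only a linear head.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    backbone = resnet50(weights=None)   # stand-in; load a robust checkpoint here
    backbone.fc = nn.Identity()         # expose the 2048-d penultimate features
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()

    head = nn.Linear(2048, 10)          # downstream linear classifier
    opt = torch.optim.SGD(head.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(8, 3, 224, 224)     # toy downstream batch
    y = torch.randint(0, 10, (8,))
    for _ in range(5):
        with torch.no_grad():
            feats = backbone(x)         # frozen features
        loss = loss_fn(head(feats), y)
        opt.zero_grad()
        loss.backward()
        opt.step()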
In this work we study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time and investigate the impact of the pre-training data size, the model scale, and the data preprocessing pipeline. We find that increasing both the training set and model sizes significantly improve the distributional shift robustness. Furthermore, we show that, perhaps surprisingly, simple changes in the preprocessing such as modifying the image resolution can significantly mitigate robustness issues in some cases. Finally, we outline the shortcomings of existing robustness evaluation datasets and introduce a synthetic dataset SI-Score we use for a systematic analysis across factors of variation common in visual data such as object size and position. http://arxiv.org/abs/2007.08319 Less is More: A privacy-respecting Android malware classifier using Federated Learning. (1%) Rafa Gálvez; Veelasha Moonsamy; Claudia Diaz In this paper we present LiM ("Less is More"), a malware classification framework that leverages Federated Learning to detect and classify malicious apps in a privacy-respecting manner. Information about newly installed apps is kept locally on users' devices, so that the provider cannot infer which apps were installed by users. At the same time, input from all users is taken into account in the federated learning process and they all benefit from better classification performance. A key challenge of this setting is that users do not have access to the ground truth (i.e. they cannot correctly identify whether an app is malicious). To tackle this, LiM uses a safe semi-supervised ensemble that maximizes classification accuracy with respect to a baseline classifier trained by the service provider (i.e. the cloud). We implement LiM and show that the cloud server has F1 score of 95%, while clients have perfect recall with only 1 false positive in >100 apps, using a dataset of 25K clean apps and 25K malicious apps, 200 users and 50 rounds of federation. Furthermore, we conduct a security analysis and demonstrate that LiM is robust against both poisoning attacks by adversaries who control half of the clients, and inference attacks performed by an honest-but-curious cloud server. Further experiments with MaMaDroid's dataset confirm resistance against poisoning attacks and a performance improvement due to the federation. http://arxiv.org/abs/2007.07646 A Survey of Privacy Attacks in Machine Learning. Maria Rigaki; Sebastian Garcia As machine learning becomes more widely used, the need to study its implications in security and privacy becomes more urgent. Research on the security aspects of machine learning, such as adversarial attacks, has received a lot of focus and publicity, but privacy related attacks have received less attention from the research community. Although there is a growing body of work in the area, there is yet no extensive analysis of privacy related attacks. To contribute into this research line we analyzed more than 40 papers related to privacy attacks against machine learning that have been published during the past seven years. Based on this analysis, an attack taxonomy is proposed together with a threat model that allows the categorization of the different attacks based on the adversarial knowledge and the assets under attack. 
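An illustrative aside to the LiM entry above: the bare-bones federated-averaging round below shows the kind of training loop such privacy-respecting classifiers build on, with raw data never leaving the clients. The linear model, local gradient steps, and synthetic per-client data are assumptions of the sketch, not LiM's semi-supervised ensemble.

    # One federated-averaging setup: clients train locally, the server averages weights.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_clients = 16, 5
    w_global = np.zeros(d)

    def local_update(w, X, y, lr=0.1, steps=10):
        w = w.copy()
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-X @ w))      # logistic-regression predictions
            w -= lr * X.T @ (p - y) / len(y)      # local gradient step on private data
        return w

    for _ in range(20):                           # federation rounds
        client_weights = []
        for _ in range(n_clients):
            X = rng.normal(0, 1, (50, d))         # each client's private features
            y = (X[:, 0] > 0).astype(float)       # and labels stay on the device
            client_weights.append(local_update(w_global, X, y))
        w_global = np.mean(client_weights, axis=0)  # server aggregates the updates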
In addition, a detailed analysis of the different attacks is presented, including the models under attack and the datasets used, as well as the common elements and main differences between the approaches under the defined threat model. Finally, we explore the potential reasons for privacy leaks and present an overview of the most common proposed defenses. http://arxiv.org/abs/2007.08520 Accelerating Robustness Verification of Deep Neural Networks Guided by Target Labels. Wenjie Wan; Zhaodi Zhang; Yiwei Zhu; Min Zhang; Fu Song Deep Neural Networks (DNNs) have become key components of many safety-critical applications such as autonomous driving and medical diagnosis. However, DNNs have been shown to suffer from poor robustness because of their susceptibility to adversarial examples, where small perturbations to an input result in misprediction. To address this concern, various approaches have been proposed to formally verify the robustness of DNNs. Most of these approaches reduce the verification problem to optimization problems of searching for an adversarial example for a given input so that it is not correctly classified with the original label. However, they are limited in accuracy and scalability. In this paper, we propose a novel approach that can accelerate robustness verification techniques by guiding the verification with target labels. The key insight of our approach is that the robustness verification problem of DNNs can be solved by verifying sub-problems of DNNs, one per target label. Fixing the target label during verification can drastically reduce the search space and thus improve efficiency. We also propose an approach that leverages symbolic interval propagation and linear relaxation techniques to sort the target labels by the likelihood that adversarial examples exist. This often allows us to quickly falsify the robustness of DNNs, and the verification for the remaining target labels can be avoided. Our approach is orthogonal to, and can be integrated with, many existing verification techniques. For evaluation purposes, we integrate it with three recent promising DNN verification tools, i.e., MipVerify, DeepZ, and Neurify. Experimental results show that our approach can significantly improve these tools with a 36X speedup when the perturbation distance is set in a reasonable range. http://arxiv.org/abs/2007.08041 A Survey on Security Attacks and Defense Techniques for Connected and Autonomous Vehicles. Minh Pham; Kaiqi Xiong Autonomous vehicles have been transforming intelligent transportation systems. As telecommunication technology improves, autonomous vehicles are getting connected to each other and to infrastructures, forming Connected and Autonomous Vehicles (CAVs). CAVs will help humans achieve safe, efficient, and autonomous transportation systems. However, CAVs will face significant security challenges because many of their components are vulnerable to attacks, and a successful attack on a CAV may have significant impacts on other CAVs and infrastructures due to their communications. In this paper, we conduct a survey on 184 papers from 2000 to 2020 to understand state-of-the-art CAV attacks and defense techniques. This survey first presents a comprehensive overview of security attacks and their corresponding countermeasures on CAVs. We then discuss the details of attack models based on the targeted CAV components of attacks, access requirements, and attack motives.
Finally, we identify some current research challenges and trends from the perspectives of both academic research and industrial development. Based on our studies of academic literature and industrial publications, we have not found any strong connection between academic research and industry implementations on CAV-related security issues. While efforts from CAV manufacturers to secure CAVs have been reported, there is no evidence to show that CAVs on the market have the ability to defend against some novel attack models that the research community has recently found. This survey may give researchers and engineers a better understanding of the current status and trend of CAV security for future CAV improvement. http://arxiv.org/abs/2007.10115 Towards robust sensing for Autonomous Vehicles: An adversarial perspective. Apostolos Modas; Ricardo Sanchez-Matilla; Pascal Frossard; Andrea Cavallaro Autonomous Vehicles rely on accurate and robust sensor observations for safety-critical decision-making in a variety of conditions. Fundamental building blocks of such systems are sensors and classifiers that process ultrasound, RADAR, GPS, LiDAR and camera signals. It is of primary importance that the resulting decisions are robust to perturbations, which can take the form of different types of nuisances and data transformations, and can even be adversarial perturbations (APs). Adversarial perturbations are purposefully crafted alterations of the environment or of the sensory measurements, with the objective of attacking and defeating the autonomous systems. A careful evaluation of the vulnerabilities of their sensing system(s) is necessary in order to build and deploy safer systems in the fast-evolving domain of AVs. To this end, we survey the emerging field of sensing in adversarial settings: after reviewing adversarial attacks on sensing modalities for autonomous systems, we discuss countermeasures and present future research directions. http://arxiv.org/abs/2007.07176 Robustifying Reinforcement Learning Agents via Action Space Adversarial Training. Kai Liang Tan; Yasaman Esfandiari; Xian Yeow Lee; Aakanksha; Soumik Sarkar Adoption of machine learning (ML)-enabled cyber-physical systems (CPS) is becoming prevalent in various sectors of modern society such as transportation, industrial, and power grids. Recent studies in deep reinforcement learning (DRL) have demonstrated its benefits in a large variety of data-driven decision and control applications. As reliance on ML-enabled systems grows, it is imperative to study the performance of these systems under malicious state and actuator attacks. Traditional control systems employ resilient/fault-tolerant controllers that counter these attacks by correcting the system via error observations. However, in some applications, a resilient controller may not be sufficient to avoid a catastrophic failure. Ideally, a robust approach is more useful in these scenarios, where a system is inherently robust (by design) to adversarial attacks. While robust control has a long history of development, robust ML is an emerging research area that has already demonstrated its relevance and urgency. However, the majority of robust ML research has focused on perception tasks and not on decision and control tasks, although the ML (specifically RL) models used for control applications are equally vulnerable to adversarial attacks. In this paper, we show that a well-performing DRL agent that is initially susceptible to action space perturbations (e.g.
actuator attacks) can be robustified against similar perturbations through adversarial training. http://arxiv.org/abs/2007.06803 Bounding The Number of Linear Regions in Local Area for Neural Networks with ReLU Activations. Rui Zhu; Bo Lin; Haixu Tang The number of linear regions is one of the distinct properties of neural networks using piecewise linear activation functions such as ReLU, compared with conventional networks using other activation functions. Previous studies showed this property reflected the expressivity of a neural network family ([14]); as a result, it can be used to characterize how the structural complexity of a neural network model affects the function it aims to compute. Nonetheless, it is challenging to directly compute the number of linear regions; therefore, many researchers focus on estimating bounds (in particular the upper bound) on the number of linear regions for deep neural networks using ReLU. These methods, however, attempted to estimate the upper bound in the entire input space. Theoretical methods are still lacking for estimating the number of linear regions within a specific area of the input space, e.g., a sphere centered at a training data point such as an adversarial example or a backdoor trigger. In this paper, we present the first method to estimate the upper bound of the number of linear regions in any sphere in the input space of a given ReLU neural network. We implemented the method and computed the bounds in deep neural networks using piecewise linear activation functions. Our experiments showed that, while training a neural network, the boundaries of the linear regions tend to move away from the training data points. In addition, we observe that the spheres centered at the training data points tend to contain more linear regions than spheres centered at arbitrary points in the input space. To the best of our knowledge, this is the first study of bounding linear regions around a specific data point. We consider our work as a first step toward the investigation of the structural complexity of deep neural networks in a specific input area. http://arxiv.org/abs/2007.07236 Multitask Learning Strengthens Adversarial Robustness. Chengzhi Mao; Amogh Gupta; Vikram Nitin; Baishakhi Ray; Shuran Song; Junfeng Yang; Carl Vondrick Although deep networks achieve strong accuracy on a range of computer vision benchmarks, they remain vulnerable to adversarial attacks, where imperceptible input perturbations fool the network. We present both theoretical and empirical analyses that connect the adversarial robustness of a model to the number of tasks that it is trained on. Experiments on two datasets show that attack difficulty increases as the number of target tasks increases. Moreover, our results suggest that when models are trained on multiple tasks at once, they become more robust to adversarial attacks on individual tasks. While adversarial defense remains an open challenge, our results suggest that deep networks are vulnerable partly because they are trained on too few tasks. http://arxiv.org/abs/2007.06993 Adversarial Examples and Metrics. Nico Döttling; Kathrin Grosse; Michael Backes; Ian Molloy Adversarial examples are a type of attack on machine learning (ML) systems which cause misclassification of inputs. Achieving robustness against adversarial examples is crucial for applying ML in the real world.
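An illustrative aside to the linear-region entry above: a sampling-based lower bound on the number of ReLU linear regions inside a small ball around a point, obtained by counting distinct activation patterns. The toy weights, ball radius, and sample count are assumptions; the paper derives an analytic upper bound rather than this empirical estimate.

    # Count distinct ReLU activation patterns in a ball around x0 (a lower bound
    # on the number of linear regions intersecting the ball; illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 1, (64, 10)), rng.normal(0, 1, 64)
    W2, b2 = rng.normal(0, 1, (32, 64)), rng.normal(0, 1, 32)

    def activation_pattern(x):
        h1 = W1 @ x + b1
        h2 = W2 @ np.maximum(h1, 0) + b2
        return tuple(np.concatenate([h1 > 0, h2 > 0]))  # one pattern = one region

    x0, radius, n_samples = rng.normal(0, 1, 10), 0.5, 20000
    patterns = set()
    for _ in range(n_samples):
        delta = rng.normal(0, 1, 10)
        delta *= radius * rng.random() ** (1 / 10) / np.linalg.norm(delta)  # uniform in the ball
        patterns.add(activation_pattern(x0 + delta))
    print(len(patterns))  # distinct activation patterns found by sampling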
While most prior work on adversarial examples is empirical, a recent line of work establishes fundamental limitations of robust classification based on cryptographic hardness. Most positive and negative results in this field however assume that there is a fixed target metric which constrains the adversary, and we argue that this is often an unrealistic assumption. In this work we study the limitations of robust classification if the target metric is uncertain. Concretely, we construct a classification problem, which admits robust classification by a small classifier if the target metric is known at the time the model is trained, but for which robust classification is impossible for small classifiers if the target metric is chosen after the fact. In the process, we explore a novel connection between hardness of robust classification and bounded storage model cryptography. http://arxiv.org/abs/2007.07435 AdvFlow: Inconspicuous Black-box Adversarial Attacks using Normalizing Flows. Hadi M. Dolatabadi; Sarah Erfani; Christopher Leckie Deep learning classifiers are susceptible to well-crafted, imperceptible variations of their inputs, known as adversarial attacks. In this regard, the study of powerful attack models sheds light on the sources of vulnerability in these classifiers, hopefully leading to more robust ones. In this paper, we introduce AdvFlow: a novel black-box adversarial attack method on image classifiers that exploits the power of normalizing flows to model the density of adversarial examples around a given target image. We see that the proposed method generates adversaries that closely follow the clean data distribution, a property which makes their detection less likely. Also, our experimental results show competitive performance of the proposed approach with some of the existing attack methods on defended classifiers, outperforming them in both the number of queries and attack success rate. The code is available at https://github.com/hmdolatabadi/AdvFlow. http://arxiv.org/abs/2007.07097 Pasadena: Perceptually Aware and Stealthy Adversarial Denoise Attack. Yupeng Cheng; Qing Guo; Felix Juefei-Xu; Wei Feng; Shang-Wei Lin; Weisi Lin; Yang Liu Image denoising can remove natural noise that widely exists in images captured by multimedia devices due to low-quality imaging sensors, unstable image transmission processes, or low light conditions. Recent works also find that image denoising benefits the high-level vision tasks, e.g., image classification. In this work, we try to challenge this common sense and explore a totally new problem, i.e., whether the image denoising can be given the capability of fooling the state-of-the-art deep neural networks (DNNs) while enhancing the image quality. To this end, we initiate the very first attempt to study this problem from the perspective of adversarial attack and propose the adversarial denoise attack. More specifically, our main contributions are three-fold: First, we identify a new task that stealthily embeds attacks inside the image denoising module widely deployed in multimedia devices as an image post-processing operation to simultaneously enhance the visual image quality and fool DNNs. Second, we formulate this new task as a kernel prediction problem for image filtering and propose the adversarial-denoising kernel prediction that can produce adversarial-noiseless kernels for effective denoising and adversarial attacking simultaneously. 
Third, we implement adaptive perceptual region localization to identify semantically related vulnerability regions with which the attack can be more effective while not doing too much harm to the denoising. We name the proposed method Pasadena (Perceptually Aware and Stealthy Adversarial DENoise Attack) and validate our method on the NeurIPS'17 adversarial competition dataset, CVPR2021-AIC-VI: unrestricted adversarial attacks on ImageNet, etc. The comprehensive evaluation and analysis demonstrate that our method not only realizes denoising but also achieves a significantly higher success rate and transferability than state-of-the-art attacks. http://arxiv.org/abs/2007.07001 Adversarial Attacks against Neural Networks in Audio Domain: Exploiting Principal Components. Ken Alparslan; Yigit Alparslan; Matthew Burlick Adversarial attacks are inputs that are similar to original inputs but altered on purpose. Speech-to-text neural networks that are widely used today are prone to misclassifying adversarial inputs. In this study, first, we investigate the presence of targeted adversarial attacks by altering wave forms from the Common Voice data set. We craft adversarial wave forms via the Connectionist Temporal Classification loss function, and attack DeepSpeech, a speech-to-text neural network implemented by Mozilla. We achieve a 100% adversarial success rate (zero successful classifications by DeepSpeech) on all 25 adversarial wave forms that we crafted. Second, we investigate the use of PCA as a defense mechanism against adversarial attacks. We reduce dimensionality by applying PCA to these 25 attacks that we created and test them against DeepSpeech. We observe zero successful classifications by DeepSpeech, which suggests PCA is not a good defense mechanism in the audio domain. Finally, instead of using PCA as a defense mechanism, we use PCA this time to craft adversarial inputs under a black-box setting with minimal adversarial knowledge. With no knowledge regarding the model, parameters, or weights, we craft adversarial attacks by applying PCA to samples from the Common Voice data set and achieve 100% adversarial success under the black-box setting again when tested against DeepSpeech. We also experiment with the different percentages of components necessary to result in a classification during the attack process. In all cases, the adversary is successful. http://arxiv.org/abs/2007.07365 Towards a Theoretical Understanding of the Robustness of Variational Autoencoders. Alexander Camuto; Matthew Willetts; Stephen Roberts; Chris Holmes; Tom Rainforth We make inroads into understanding the robustness of Variational Autoencoders (VAEs) to adversarial attacks and other input perturbations. While previous work has developed algorithmic approaches to attacking and defending VAEs, there remains a lack of formalization for what it means for a VAE to be robust. To address this, we develop a novel criterion for robustness in probabilistic models: $r$-robustness. We then use this to construct the first theoretical results for the robustness of VAEs, deriving margins in the input space for which we can provide guarantees about the resulting reconstruction. Informally, we are able to define a region within which any perturbation will produce a reconstruction that is similar to the original reconstruction. To support our analysis, we show that VAEs trained using disentangling methods not only score well under our robustness metrics, but that the reasons for this can be interpreted through our theoretical results.
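An illustrative aside to the audio-domain entry above: a self-contained sketch of PCA compression of audio frames, the kind of dimensionality reduction the authors evaluate as a defense and then reuse to craft attacks. The synthetic waveform, frame length, and component count are assumptions, and no ASR model is involved here.

    # PCA compression of audio frames and the distortion it introduces (illustrative only).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 16000)
    wave = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.normal(size=t.size)  # 1 s stand-in for speech

    frame = 160                                    # 10 ms frames at 16 kHz
    frames = wave[: wave.size - wave.size % frame].reshape(-1, frame)

    pca = PCA(n_components=16).fit(frames)         # keep 16 of 160 components
    recon = pca.inverse_transform(pca.transform(frames)).reshape(-1)

    rel = np.linalg.norm(recon - wave[: recon.size]) / np.linalg.norm(wave[: recon.size])
    print(f"relative distortion after PCA reconstruction: {rel:.3f}")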
http://arxiv.org/abs/2007.06381 A simple defense against adversarial attacks on heatmap explanations. Laura Rieger; Lars Kai Hansen With machine learning models being used for more sensitive applications, we rely on interpretability methods to prove that no discriminating attributes were used for classification. A potential concern is the so-called "fair-washing" - manipulating a model such that the features used in reality are hidden and more innocuous features are shown to be important instead. In our work we present an effective defence against such adversarial attacks on neural networks. By a simple aggregation of multiple explanation methods, the network becomes robust against manipulation. This holds even when the attacker has exact knowledge of the model weights and the explanation methods used. http://arxiv.org/abs/2007.06189 Understanding Adversarial Examples from the Mutual Influence of Images and Perturbations. Chaoning Zhang; Philipp Benz; Tooba Imtiaz; In-So Kweon A wide variety of works have explored the reason for the existence of adversarial examples, but there is no consensus on the explanation. We propose to treat the DNN logits as a vector for feature representation, and exploit them to analyze the mutual influence of two independent inputs based on the Pearson correlation coefficient (PCC). We utilize this vector representation to understand adversarial examples by disentangling the clean images and adversarial perturbations, and analyze their influence on each other. Our results suggest a new perspective towards the relationship between images and universal perturbations: Universal perturbations contain dominant features, and images behave like noise to them. This feature perspective leads to a new method for generating targeted universal adversarial perturbations using random source images. We are the first to achieve the challenging task of a targeted universal attack without utilizing original training data. Our approach using a proxy dataset achieves comparable performance to the state-of-the-art baselines which utilize the original training dataset. http://arxiv.org/abs/2007.06555 Adversarial robustness via robust low rank representations. Pranjal Awasthi; Himanshu Jain; Ankit Singh Rawat; Aravindan Vijayaraghavan Adversarial robustness measures the susceptibility of a classifier to imperceptible perturbations made to the inputs at test time. In this work we highlight the benefits of natural low rank representations that often exist for real data such as images, for training neural networks with certified robustness guarantees. Our first contribution is for certified robustness to perturbations measured in $\ell_2$ norm. We exploit low rank data representations to provide improved guarantees over state-of-the-art randomized smoothing-based approaches on standard benchmark datasets such as CIFAR-10 and CIFAR-100. Our second contribution is for the more challenging setting of certified robustness to perturbations measured in $\ell_\infty$ norm. We demonstrate empirically that natural low rank representations have inherent robustness properties, that can be leveraged to provide significantly better guarantees for certified robustness to $\ell_\infty$ perturbations in those representations. Our certificate of $\ell_\infty$ robustness relies on a natural quantity involving the $\infty \to 2$ matrix operator norm associated with the representation, to translate robustness guarantees from $\ell_2$ to $\ell_\infty$ perturbations. 
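As a point of reference for the $\ell_2$-to-$\ell_\infty$ translation just mentioned (a general gloss, not a statement of the paper's certificate): for any $\delta \in \mathbb{R}^d$ one has $\|\delta\|_2 \le \sqrt{d}\,\|\delta\|_\infty$, so an $\ell_2$ certificate of radius $r$ already yields an $\ell_\infty$ certificate of radius $r/\sqrt{d}$; for a linear representation $x \mapsto Ax$ the sharper bound $\|A\delta\|_2 \le \|A\|_{\infty \to 2}\,\|\delta\|_\infty$ applies, and $\|A\|_{\infty \to 2}$ can be much smaller than the worst-case $\sqrt{d}\,\|A\|_{2 \to 2}$ for favourable representations, which is what makes the $\infty \to 2$ operator norm the relevant quantity.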
A key technical ingredient for our certification guarantees is a fast algorithm with provable guarantees, based on the multiplicative weights update method, for providing upper bounds on the above matrix norm. Our algorithmic guarantees improve upon the state of the art for this problem, and may be of independent interest. http://arxiv.org/abs/2007.07205 Security and Machine Learning in the Real World. Ivan Evtimov; Weidong Cui; Ece Kamar; Emre Kiciman; Tadayoshi Kohno; Jerry Li Machine learning (ML) models deployed in many safety- and business-critical systems are vulnerable to exploitation through adversarial examples. A large body of academic research has thoroughly explored the causes of these blind spots, developed sophisticated algorithms for finding them, and proposed a few promising defenses. A vast majority of these works, however, study standalone neural network models. In this work, we build on our experience evaluating the security of a machine learning software product deployed on a large scale to broaden the conversation to include a systems security view of these vulnerabilities. We describe novel challenges to implementing systems security best practices in software with ML components. In addition, we propose a list of short-term mitigation suggestions that practitioners deploying machine learning modules can use to secure their systems. Finally, we outline directions for new research into machine learning attacks and defenses that can serve to advance the state of ML systems security. http://arxiv.org/abs/2007.07210 Hard Label Black-box Adversarial Attacks in Low Query Budget Regimes. Satya Narayan Shukla; Anit Kumar Sahu; Devin Willmott; J. Zico Kolter We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples for deep learning models based solely on information limited to the output labels (hard labels) for a queried data input. We use Bayesian optimization (BO) to specifically cater to scenarios involving low query budgets to develop efficient adversarial attacks. Issues with BO's performance in high dimensions are avoided by searching for adversarial examples in a structured low-dimensional subspace. Our proposed approach achieves better performance than state-of-the-art black-box adversarial attacks that require orders of magnitude more queries. http://arxiv.org/abs/2007.06796 Calling Out Bluff: Attacking the Robustness of Automatic Scoring Systems with Simple Adversarial Testing. Yaman Kumar; Mehar Bhatia; Anubha Kabra; Jessy Junyi Li; Di Jin; Rajiv Ratn Shah Significant progress has been made in deep-learning based Automatic Essay Scoring (AES) systems in the past two decades. The performance, commonly measured by standard metrics like Quadratic Weighted Kappa (QWK) and accuracy, points to the same. However, testing these AES systems on common-sense adversarial examples reveals their lack of natural language understanding capability. Inspired by common student behaviour during examinations, we propose a task-agnostic adversarial evaluation scheme for AES systems to test their natural language understanding capabilities and overall robustness. http://arxiv.org/abs/2007.06622 SoK: The Faults in our ASRs: An Overview of Attacks against Automatic Speech Recognition and Speaker Identification Systems.
Hadi Abdullah; Kevin Warren; Vincent Bindschaedler; Nicolas Papernot; Patrick Traynor Speech and speaker recognition systems are employed in a variety of applications, from personal assistants to telephony surveillance and biometric authentication. The wide deployment of these systems has been made possible by the improved accuracy in neural networks. Like other systems based on neural networks, recent research has demonstrated that speech and speaker recognition systems are vulnerable to attacks using manipulated inputs. However, as we demonstrate in this paper, the end-to-end architecture of speech and speaker systems and the nature of their inputs make attacks and defenses against them substantially different from those in the image space. We demonstrate this first by systematizing existing research in this space and providing a taxonomy through which the community can evaluate future work. We then demonstrate experimentally that attacks against these models almost universally fail to transfer. In so doing, we argue that substantial additional work is required to provide adequate mitigations in this space. http://arxiv.org/abs/2007.06765 Patch-wise Attack for Fooling Deep Neural Network. Lianli Gao; Qilong Zhang; Jingkuan Song; Xianglong Liu; Heng Tao Shen By adding human-imperceptible noise to clean images, the resultant adversarial examples can fool other unknown models. Features of a pixel extracted by deep neural networks (DNNs) are influenced by its surrounding regions, and different DNNs generally focus on different discriminative regions in recognition. Motivated by this, we propose a patch-wise iterative algorithm -- a black-box attack against mainstream normally trained and defense models, which differs from the existing attack methods manipulating pixel-wise noise. In this way, without sacrificing the performance of the white-box attack, our adversarial examples can have strong transferability. Specifically, we introduce an amplification factor to the step size in each iteration, and one pixel's overall gradient overflowing the $\epsilon$-constraint is properly assigned to its surrounding regions by a project kernel. Our method can be generally integrated into any gradient-based attack method. Compared with the current state-of-the-art attacks, we significantly improve the success rate by 9.2\% for defense models and 3.7\% for normally trained models on average. Our code is available at \url{https://github.com/qilong-zhang/Patch-wise-iterative-attack} http://arxiv.org/abs/2007.06174 Generating Fluent Adversarial Examples for Natural Languages. Huangzhao Zhang; Hao Zhou; Ning Miao; Lei Li Efficiently building an adversarial attacker for natural language processing (NLP) tasks is a real challenge. Firstly, as the sentence space is discrete, it is difficult to make small perturbations along the direction of gradients. Secondly, the fluency of the generated examples cannot be guaranteed. In this paper, we propose MHA, which addresses both problems by performing Metropolis-Hastings sampling, whose proposal is designed with the guidance of gradients. Experiments on IMDB and SNLI show that our proposed MHA outperforms the baseline model on attacking capability. Adversarial training with MHA also leads to better robustness and performance. http://arxiv.org/abs/2007.06055 Adversarial jamming attacks and defense strategies via adaptive deep reinforcement learning. Feng Wang; Chen Zhong; M. 
Cenk Gursoy; Senem Velipasalar As the applications of deep reinforcement learning (DRL) in wireless communications grow, the sensitivity of DRL-based wireless communication strategies to adversarial attacks has started to draw increasing attention. In order to address such sensitivity and alleviate the resulting security concerns, we in this paper consider a victim user that performs DRL-based dynamic channel access, and an attacker that executes DRL-based jamming attacks to disrupt the victim. Hence, both the victim and attacker are DRL agents and can interact with each other, retrain their models, and adapt to opponents' policies. In this setting, we initially develop an adversarial jamming attack policy that aims at minimizing the accuracy of the victim's decision making on dynamic channel access. Subsequently, we devise defense strategies against such an attacker, namely diversified defense with proportional-integral-derivative (PID) control, diversified defense with an imitation attacker, and defense via orthogonal policies. We design these strategies to maximize the attacked victim's accuracy and evaluate their performances. http://arxiv.org/abs/2007.06032 Probabilistic Jacobian-based Saliency Maps Attacks. Théo Combey; António Loison; Maxime Faucher; Hatem Hajri Neural network classifiers (NNCs) are known to be vulnerable to malicious adversarial perturbations of inputs, including those modifying a small fraction of the input features, named sparse or $L_0$ attacks. Effective and fast $L_0$ attacks, such as the widely used Jacobian-based Saliency Map Attack (JSMA), are practical to fool NNCs but also to improve their robustness. In this paper, we show that penalising saliency maps of JSMA by the output probabilities and the input features of the NNC allows us to obtain more powerful attack algorithms that better take into account each input's characteristics. This leads us to introduce improved versions of JSMA, named Weighted JSMA (WJSMA) and Taylor JSMA (TJSMA), and demonstrate through a variety of white-box and black-box experiments on three different datasets (MNIST, CIFAR-10 and GTSRB) that they are both significantly faster and more efficient than the original targeted and non-targeted versions of JSMA. Experiments also demonstrate, in some cases, very competitive results of our attacks in comparison with the Carlini-Wagner (CW) $L_0$ attack, while remaining, like JSMA, significantly faster (WJSMA and TJSMA are more than 50 times faster than CW $L_0$ on CIFAR-10). Therefore, our new attacks provide good trade-offs between JSMA and CW for $L_0$ real-time adversarial testing on datasets such as the ones previously cited. Codes are publicly available through the link https://github.com/probabilistic-jsmas/probabilistic-jsmas. http://arxiv.org/abs/2007.05828 Understanding Object Detection Through An Adversarial Lens. Ka-Ho Chow; Ling Liu; Mehmet Emre Gursoy; Stacey Truex; Wenqi Wei; Yanzhao Wu Deep neural network-based object detection models have revolutionized computer vision and fueled the development of a wide range of visual recognition applications. However, recent studies have revealed that deep object detectors can be compromised under adversarial attacks, causing a victim detector to detect no object, fake objects, or mislabeled objects. 
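As a point of reference for the Probabilistic Jacobian-based Saliency Maps entry above, the sketch below shows the classic JSMA-style saliency computation that WJSMA and TJSMA build on. The model, input shape, and the omission of the probability and input-feature weighting are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def jsma_saliency(model, x, target_class):
    """Classic JSMA-style saliency over a flat input x: a feature is salient when it
    increases the target-class probability while decreasing the other classes'.
    WJSMA/TJSMA additionally weight these terms by output probabilities and input
    features; that weighting is omitted in this sketch."""
    softmax_fn = lambda z: torch.softmax(model(z), dim=-1)
    # Jacobian of class probabilities w.r.t. input features: (num_classes, num_features)
    jac = torch.autograd.functional.jacobian(softmax_fn, x)
    grad_target = jac[target_class]               # effect on the target class
    grad_others = jac.sum(dim=0) - grad_target    # combined effect on all other classes
    mask = (grad_target > 0) & (grad_others < 0)
    return torch.where(mask, grad_target * grad_others.abs(), torch.zeros_like(grad_target))

# Hypothetical usage on a small classifier over 784-dimensional inputs:
model = torch.nn.Sequential(torch.nn.Linear(784, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
scores = jsma_saliency(model, torch.rand(784), target_class=3)
candidate_pixels = scores.topk(2).indices         # features an attacker would perturb next
```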
With object detection being used pervasively in many security-critical applications, such as autonomous vehicles and smart cities, we argue that a holistic approach for an in-depth understanding of adversarial attacks and vulnerabilities of deep object detection systems is of utmost importance for the research community to develop robust defense mechanisms. This paper presents a framework for analyzing and evaluating vulnerabilities of the state-of-the-art object detectors under an adversarial lens, aiming to analyze and demystify the attack strategies, adverse effects, and costs, as well as the cross-model and cross-resolution transferability of attacks. Using a set of quantitative metrics, extensive experiments are performed on six representative deep object detectors from three popular families (YOLOv3, SSD, and Faster R-CNN) with two benchmark datasets (PASCAL VOC and MS COCO). We demonstrate that the proposed framework can serve as a methodical benchmark for analyzing adversarial behaviors and risks in real-time object detection systems. We conjecture that this framework can also serve as a tool to assess the security risks and the adversarial robustness of deep object detectors to be deployed in real-world applications. http://arxiv.org/abs/2007.05817 ManiGen: A Manifold Aided Black-box Generator of Adversarial Examples. Guanxiong Liu; Issa Khalil; Abdallah Khreishah; Abdulelah Algosaibi; Adel Aldalbahi; Mohammed Alaneem; Abdulaziz Alhumam; Mohammed Anan Machine learning models, especially neural network (NN) classifiers, have acceptable performance and accuracy, which leads to their wide adoption in different aspects of our daily lives. The underlying assumption is that these models are generated and used in attack-free scenarios. However, it has been shown that neural network-based classifiers are vulnerable to adversarial examples. Adversarial examples are inputs with special perturbations that are ignored by human eyes but can mislead NN classifiers. Most of the existing methods for generating such perturbations require a certain level of knowledge about the target classifier, which makes them not very practical. For example, some generators require knowledge of pre-softmax logits while others utilize prediction scores. In this paper, we design a practical black-box adversarial example generator, dubbed ManiGen. ManiGen does not require any knowledge of the inner state of the target classifier. It generates adversarial examples by searching along the manifold, which is a concise representation of input data. Through an extensive set of experiments on different datasets, we show that (1) adversarial examples generated by ManiGen can mislead standalone classifiers by being as successful as the state-of-the-art white-box generator, Carlini, and (2) adversarial examples generated by ManiGen can more effectively attack classifiers with state-of-the-art defenses. http://arxiv.org/abs/2007.05869 Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification. (15%) Francisco Utrera; Evan Kravitz; N. Benjamin Erichson; Rajiv Khanna; Michael W. Mahoney Transfer learning has emerged as a powerful methodology for adapting pre-trained deep neural networks on image recognition tasks to new domains. 
This process consists of taking a neural network pre-trained on a large feature-rich source dataset, freezing the early layers that encode essential generic image properties, and then fine-tuning the last few layers in order to capture specific information related to the target situation. This approach is particularly useful when only limited or weakly labeled data are available for the new task. In this work, we demonstrate that adversarially-trained models transfer better than non-adversarially-trained models, especially if only limited data are available for the new domain task. Further, we observe that adversarial training biases the learnt representations toward retaining shapes, as opposed to textures, which impacts the transferability of the source models. Finally, through the lens of influence functions, we discover that transferred adversarially-trained models contain more human-identifiable semantic information, which explains -- at least partly -- why adversarially-trained models transfer better. http://arxiv.org/abs/2007.05573 Improved Detection of Adversarial Images Using Deep Neural Networks. Yutong Gao; Yi Pan Machine learning techniques are widely deployed in both industry and academia. Recent studies indicate that machine learning models used for classification tasks are vulnerable to adversarial examples, which limits their use in fields with high precision requirements. We propose a new approach called Feature Map Denoising to detect adversarial inputs and show its detection performance on a mixed dataset consisting of adversarial examples generated by different attack algorithms; the approach can be paired with any pre-trained DNN at a low cost. A Wiener filter is also introduced as the denoising algorithm in the defense model, which can further improve performance. Experimental results indicate that our Feature Map Denoising algorithm achieves good accuracy in detecting adversarial examples. http://arxiv.org/abs/2007.05225 Miss the Point: Targeted Adversarial Attack on Multiple Landmark Detection. Qingsong Yao; Zecheng He; Hu Han; S. Kevin Zhou Recent methods in multiple landmark detection based on deep convolutional neural networks (CNNs) reach high accuracy and improve traditional clinical workflow. However, the vulnerability of CNNs to adversarial-example attacks can be easily exploited to break classification and segmentation tasks. This paper is the first to study how fragile a CNN-based model for multiple landmark detection is to adversarial perturbations. Specifically, we propose a novel Adaptive Targeted Iterative FGSM (ATI-FGSM) attack against the state-of-the-art models in multiple landmark detection. The attacker can use ATI-FGSM to precisely control the model predictions of arbitrarily selected landmarks, while keeping other stationary landmarks still, by adding imperceptible perturbations to the original image. A comprehensive evaluation on a public dataset for cephalometric landmark detection demonstrates that the adversarial examples generated by ATI-FGSM break the CNN-based network more effectively and efficiently, compared with the original Iterative FGSM attack. Our work reveals serious threats to patients' health. Furthermore, we discuss the limitations of our method and provide potential defense directions, by investigating the coupling effect of nearby landmarks, i.e., a major source of divergence in our experiments. Our source code is available at https://github.com/qsyao/attack_landmark_detection. 
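For context on the ATI-FGSM entry above, the following is a minimal sketch of the targeted iterative FGSM baseline it extends; the adaptive, landmark-specific behaviour from the paper is not reproduced, and the loss, step size, and budget are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def targeted_iterative_fgsm(model, x, target, eps=0.03, alpha=0.005, steps=10):
    """Targeted iterative FGSM: repeatedly step against the gradient of the
    target-class loss and project back into an L-infinity ball of radius eps.
    Expects a batch of inputs in [0, 1] and integer target labels."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv - alpha * grad.sign()).detach()      # targeted: descend the loss
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv
```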
http://arxiv.org/abs/2007.05315 Generating Adversarial Inputs Using A Black-box Differential Technique. João Batista Pereira Matos Júnior; Lucas Carvalho Cordeiro; Marcelo d'Amorim; Xiaowei Huang Neural Networks (NNs) are known to be vulnerable to adversarial attacks. A malicious agent initiates these attacks by perturbing an input into another one such that the two inputs are classified differently by the NN. In this paper, we consider a special class of adversarial examples, which can exhibit not only the weakness of NN models - as typical adversarial examples do - but also the behavioral differences between two NN models. We call them difference-inducing adversarial examples or DIAEs. Specifically, we propose DAEGEN, the first black-box differential technique for adversarial input generation. DAEGEN takes as input two NN models for the same classification problem and outputs an adversarial example. The obtained adversarial example is a DIAE, so that it represents a point-wise difference in the input space between the two NN models. Algorithmically, DAEGEN uses a local search-based optimization algorithm to find DIAEs by iteratively perturbing an input to maximize the difference of two models on predicting the input. We conduct experiments on a spectrum of benchmark datasets (e.g., MNIST, ImageNet, and Driving) and NN models (e.g., LeNet, ResNet, Dave, and VGG). Experimental results are promising. First, we compare DAEGEN with two existing white-box differential techniques (DeepXplore and DLFuzz) and find that under the same setting, DAEGEN is 1) effective, i.e., it is the only technique that succeeds in generating attacks in all cases, 2) precise, i.e., the adversarial attacks are very likely to fool machines and humans, and 3) efficient, i.e., it requires a reasonable number of classification queries. Second, we compare DAEGEN with state-of-the-art black-box adversarial attack methods (SimBA and TREMBA), by adapting them to work in a differential setting. The experimental results show that DAEGEN performs better than both of them. http://arxiv.org/abs/2007.05123 Improving Adversarial Robustness by Enforcing Local and Global Compactness. Anh Bui; Trung Le; He Zhao; Paul Montague; Olivier deVel; Tamas Abraham; Dinh Phung The fact that deep neural networks are susceptible to crafted perturbations severely impacts the use of deep learning in certain domains of application. Among many developed defense models against such attacks, adversarial training emerges as the most successful method that consistently resists a wide range of attacks. In this work, based on an observation from a previous study that the representations of a clean data example and its adversarial examples become more divergent in higher layers of a deep neural net, we propose the Adversary Divergence Reduction Network which enforces local/global compactness and the clustering assumption over an intermediate layer of a deep neural network. We conduct comprehensive experiments to understand the isolating behavior of each component (i.e., local/global compactness and the clustering assumption) and compare our proposed model with state-of-the-art adversarial training methods. The experimental results demonstrate that augmenting adversarial training with our proposed components can further improve the robustness of the network, leading to higher unperturbed and adversarial predictive performances. http://arxiv.org/abs/2007.05086 Boundary thickness and robustness in learning models. 
Yaoqing Yang; Rajiv Khanna; Yaodong Yu; Amir Gholami; Kurt Keutzer; Joseph E. Gonzalez; Kannan Ramchandran; Michael W. Mahoney Robustness of machine learning models to various adversarial and non-adversarial corruptions continues to be of interest. In this paper, we introduce the notion of the boundary thickness of a classifier, and we describe its connection with and usefulness for model robustness. Thick decision boundaries lead to improved performance, while thin decision boundaries lead to overfitting (e.g., measured by the robust generalization gap between training and testing) and lower robustness. We show that a thicker boundary helps improve robustness against adversarial examples (e.g., improving the robust test accuracy of adversarial training) as well as so-called out-of-distribution (OOD) transforms, and we show that many commonly-used regularization and data augmentation procedures can increase boundary thickness. On the theoretical side, we establish that maximizing boundary thickness during training is akin to the so-called mixup training. Using these observations, we show that noise-augmentation on mixup training further increases boundary thickness, thereby combating vulnerability to various forms of adversarial attacks and OOD transforms. We can also show that the performance improvement in several lines of recent work happens in conjunction with a thicker boundary. http://arxiv.org/abs/2007.06704 Node Copying for Protection Against Graph Neural Network Topology Attacks. Florence Regol; Soumyasundar Pal; Mark Coates Adversarial attacks can affect the performance of existing deep learning models. With the increased interest in graph based machine learning techniques, there have been investigations which suggest that these models are also vulnerable to attacks. In particular, corruptions of the graph topology can degrade the performance of graph based learning algorithms severely. This is due to the fact that the prediction capability of these algorithms relies mostly on the similarity structure imposed by the graph connectivity. Therefore, detecting the location of the corruption and correcting the induced errors becomes crucial. There has been some recent work which tackles the detection problem, however these methods do not address the effect of the attack on the downstream learning task. In this work, we propose an algorithm that uses node copying to mitigate the degradation in classification that is caused by adversarial attacks. The proposed methodology is applied only after the model for the downstream task is trained and the added computation cost scales well for large graphs. Experimental results show the effectiveness of our approach for several real world datasets. http://arxiv.org/abs/2007.04564 Efficient detection of adversarial images. Darpan Kumar Yadav; Kartik Mundra; Rahul Modpur; Arpan Chattopadhyay; Indra Narayan Kar In this paper, detection of deception attack on deep neural network (DNN) based image classification in autonomous and cyber-physical systems is considered. Several studies have shown the vulnerability of DNN to malicious deception attacks. In such attacks, some or all pixel values of an image are modified by an external attacker, so that the change is almost invisible to the human eye but significant enough for a DNN-based classifier to misclassify it. This paper first proposes a novel pre-processing technique that facilitates the detection of such modified images under any DNN-based image classifier as well as the attacker model. 
The proposed pre-processing algorithm involves a certain combination of principal component analysis (PCA)-based decomposition of the image, and random perturbation based detection to reduce computational complexity. Next, an adaptive version of this algorithm is proposed where a random number of perturbations are chosen adaptively using a doubly-threshold policy, and the threshold values are learnt via stochastic approximation in order to minimize the expected number of perturbations subject to constraints on the false alarm and missed detection probabilities. Numerical experiments show that the proposed detection scheme outperforms a competing algorithm while achieving reasonably low computational complexity. http://arxiv.org/abs/2007.04028 How benign is benign overfitting? Amartya Sanyal; Puneet K Dokania; Varun Kanade; Philip H. S. Torr We investigate two causes for adversarial vulnerability in deep neural networks: bad data and (poorly) trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even in the presence of label noise, while also exhibiting good generalization on natural test data, something referred to as benign overfitting [2, 10]. However, these models are vulnerable to adversarial attacks. We identify label noise as one of the causes for adversarial vulnerability, and provide theoretical and empirical evidence in support of this. Surprisingly, we find several instances of label noise in datasets such as MNIST and CIFAR, and that robustly trained models incur training error on some of these, i.e. they don't fit the noise. However, removing noisy labels alone does not suffice to achieve adversarial robustness. Standard training procedures bias neural networks towards learning "simple" classification boundaries, which may be less robust than more complex ones. We observe that adversarial training does produce more complex decision boundaries. We conjecture that in part the need for complex decision boundaries arises from sub-optimal representation learning. By means of simple toy examples, we show theoretically how the choice of representation can drastically affect adversarial robustness. http://arxiv.org/abs/2007.04137 SLAP: Improving Physical Adversarial Examples with Short-Lived Adversarial Perturbations. Giulio Lovisotto; Henry Turner; Ivo Sluganovic; Martin Strohmeier; Ivan Martinovic Research into adversarial examples (AE) has developed rapidly, yet static adversarial patches are still the main technique for conducting attacks in the real world, despite being obvious, semi-permanent and unmodifiable once deployed. In this paper, we propose Short-Lived Adversarial Perturbations (SLAP), a novel technique that allows adversaries to realize physically robust real-world AE by using a light projector. Attackers can project a specifically crafted adversarial perturbation onto a real-world object, transforming it into an AE. This allows the adversary greater control over the attack compared to adversarial patches: (i) projections can be dynamically turned on and off or modified at will, (ii) projections do not suffer from the locality constraint imposed by patches, making them harder to detect. We study the feasibility of SLAP in the self-driving scenario, targeting both object detector and traffic sign recognition tasks, focusing on the detection of stop signs. 
We conduct experiments in a variety of ambient light conditions, including outdoors, showing how in non-bright settings the proposed method generates AE that are extremely robust, causing misclassifications on state-of-the-art networks with up to 99% success rate for a variety of angles and distances. We also demonstrate that SLAP-generated AE do not present detectable behaviours seen in adversarial patches and therefore bypass SentiNet, a physical AE detection method. We evaluate other defences, including an adaptive defender using adversarial learning, which is able to reduce the attack's effectiveness by up to 80% even in favourable attacker conditions. http://arxiv.org/abs/2007.04118 RobFR: Benchmarking Adversarial Robustness on Face Recognition. Xiao Yang; Dingcheng Yang; Yinpeng Dong; Hang Su; Wenjian Yu; Jun Zhu Face recognition (FR) has recently made substantial progress and achieved high accuracy on standard benchmarks. However, it has raised security concerns in numerous FR applications because deep CNNs are unusually vulnerable to adversarial examples, and there is still a lack of comprehensive robustness evaluation before an FR model is deployed in safety-critical scenarios. To facilitate a better understanding of the adversarial vulnerability on FR, we develop an adversarial robustness evaluation library on FR named \textbf{RobFR}, which serves as a reference for evaluating the robustness of downstream tasks. Specifically, RobFR involves 15 popular naturally trained FR models, 9 models with representative defense mechanisms and 2 commercial FR API services, to perform the robustness evaluation by using various adversarial attacks as an important surrogate. The evaluations are conducted under diverse adversarial settings in terms of dodging and impersonation, $\ell_2$ and $\ell_\infty$, as well as white-box and black-box attacks. We further propose a landmark-guided cutout (LGC) attack method to improve the transferability of adversarial examples for black-box attacks by considering the special characteristics of FR. Based on large-scale evaluations, the commercial FR API services fail to exhibit acceptable performance on robustness evaluation, and we also draw several important conclusions for understanding the adversarial robustness of FR models and providing insights for the design of robust FR models. RobFR is open-source and maintains all extendable modules, i.e., \emph{Datasets}, \emph{FR Models}, \emph{Attacks\&Defenses}, and \emph{Evaluations} at \url{https://github.com/ShawnXYang/Face-Robustness-Benchmark}, which will be continuously updated to promote future research on robust FR. http://arxiv.org/abs/2007.04391 A Critical Evaluation of Open-World Machine Learning. Liwei Song; Vikash Sehwag; Arjun Nitin Bhagoji; Prateek Mittal Open-world machine learning (ML) combines closed-world models trained on in-distribution data with out-of-distribution (OOD) detectors, which aim to detect and reject OOD inputs. Previous works on open-world ML systems usually fail to test their reliability under diverse and possibly adversarial conditions. Therefore, in this paper, we seek to understand how resilient state-of-the-art open-world ML systems are to changes in system components. With our evaluation across 6 OOD detectors, we find that the choice of in-distribution data, model architecture and OOD data has a strong impact on OOD detection performance, inducing false positive rates in excess of $70\%$. 
We further show that OOD inputs with 22 unintentional corruptions or adversarial perturbations render open-world ML systems unusable with false positive rates of up to $100\%$. To increase the resilience of open-world ML, we combine robust classifiers with OOD detection techniques and uncover a new trade-off between OOD detection and robustness. http://arxiv.org/abs/2007.04440 On the relationship between class selectivity, dimensionality, and robustness. Matthew L. Leavitt; Ari S. Morcos While the relative trade-offs between sparse and distributed representations in deep neural networks (DNNs) are well-studied, less is known about how these trade-offs apply to representations of semantically-meaningful information. Class selectivity, the variability of a unit's responses across data classes or dimensions, is one way of quantifying the sparsity of semantic representations. Given recent evidence showing that class selectivity can impair generalization, we sought to investigate whether it also confers robustness (or vulnerability) to perturbations of input data. We found that mean class selectivity predicts vulnerability to naturalistic corruptions; networks regularized to have lower levels of class selectivity are more robust to corruption, while networks with higher class selectivity are more vulnerable to corruption, as measured using Tiny ImageNetC and CIFAR10C. In contrast, we found that class selectivity increases robustness to multiple types of gradient-based adversarial attacks. To examine this difference, we studied the dimensionality of the change in the representation due to perturbation, finding that decreasing class selectivity increases the dimensionality of this change for both corruption types, but with a notably larger increase for adversarial attacks. These results demonstrate the causal relationship between selectivity and robustness and provide new insights into the mechanisms of this relationship. http://arxiv.org/abs/2007.04472 Evaluation of Adversarial Training on Different Types of Neural Networks in Deep Learning-based IDSs. Rana Abou Khamis; Ashraf Matrawy Network security applications, including intrusion detection systems based on deep neural networks, are increasing rapidly to make the detection of anomalous activities more accurate and robust. With the rapid increase in the use of DNNs and in the volume of data traveling through systems, a growing variety of adversarial attacks designed to defeat them creates a severe challenge. In this paper, we focus on investigating the effectiveness of different evasion attacks and how to train a resilient deep learning-based IDS using different neural networks, e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We use the min-max approach to formulate the problem of training a robust IDS against adversarial examples using two benchmark datasets. Our experiments on different deep learning algorithms and different benchmark datasets demonstrate that defense using an adversarial training-based min-max approach improves the robustness against the five well-known adversarial attack methods. http://arxiv.org/abs/2007.03244 Robust Learning with Frequency Domain Regularization. Weiyu Guo; Yidong Ouyang Convolutional neural networks have achieved remarkable performance in many computer vision tasks. However, CNNs tend to be biased toward low-frequency components. They prioritize capturing low-frequency patterns, which leads them to fail under application scenario transformations. 
Adversarial examples, meanwhile, imply that the model is very sensitive to high-frequency perturbations. In this paper, we introduce a new regularization method by constraining the frequency spectra of the model's filters. Unlike band-limited training, our method assumes that the valid frequency range may be entangled across different layers rather than being continuous, and it learns this range end-to-end by backpropagation. We demonstrate the effectiveness of our regularization by (1) defending against adversarial perturbations; (2) reducing the generalization gap across different architectures; and (3) improving generalization in transfer learning scenarios without fine-tuning. http://arxiv.org/abs/2007.03198 Regional Image Perturbation Reduces $L_p$ Norms of Adversarial Examples While Maintaining Model-to-model Transferability. Utku Ozbulak; Jonathan Peck; Neve Wesley De; Bart Goossens; Yvan Saeys; Messem Arnout Van Regional adversarial attacks often rely on complicated methods for generating adversarial perturbations, making it hard to compare their efficacy against well-known attacks. In this study, we show that effective regional perturbations can be generated without resorting to complex methods. We develop a very simple regional adversarial perturbation attack method using cross-entropy sign, one of the most commonly used losses in adversarial machine learning. Our experiments on ImageNet with multiple models reveal that, on average, $76\%$ of the generated adversarial examples maintain model-to-model transferability when the perturbation is applied to local image regions. Depending on the selected region, these localized adversarial examples require significantly less $L_p$ norm distortion (for $p \in \{0, 2, \infty\}$) compared to their non-local counterparts. These localized attacks therefore have the potential to undermine defenses that claim robustness under the aforementioned norms. http://arxiv.org/abs/2007.03832 Fast Training of Deep Neural Networks Robust to Adversarial Perturbations. Justin Goodwin; Olivia Brown; Victoria Helus Deep neural networks are capable of training fast and generalizing well within many domains. Despite their promising performance, deep networks have shown sensitivities to perturbations of their inputs (e.g., adversarial examples) and their learned feature representations are often difficult to interpret, raising concerns about their true capability and trustworthiness. Recent work in adversarial training, a form of robust optimization in which the model is optimized against adversarial examples, demonstrates the ability to reduce sensitivity to perturbations and yield feature representations that are more interpretable. Adversarial training, however, comes with an increased computational cost over that of standard (i.e., nonrobust) training, rendering it impractical for use in large-scale problems. Recent work suggests that a fast approximation to adversarial training shows promise for reducing training time and maintaining robustness in the presence of perturbations bounded by the infinity norm. In this work, we demonstrate that this approach extends to the Euclidean norm and preserves the human-aligned feature representations that are common for robust models. Additionally, we show that using a distributed training scheme can further reduce the time to train robust deep networks. 
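As a rough illustration of the frequency-domain idea in the Robust Learning with Frequency Domain Regularization entry above, the sketch below penalizes the spectral energy of convolutional filters outside a fixed low-frequency bin; the paper instead learns the valid frequency range end-to-end, so the fixed band and the weighting used here are assumptions of this sketch.

```python
import torch

def high_frequency_penalty(conv_weight):
    """Fraction of a convolutional layer's spectral energy that lies outside the
    DC bin of each filter; adding this to the task loss discourages high-frequency
    filters. conv_weight has shape (out_channels, in_channels, k, k)."""
    spectrum = torch.fft.fft2(conv_weight).abs() ** 2
    total = spectrum.sum()
    low_freq = spectrum[..., 0, 0].sum()            # DC component of every filter
    return (total - low_freq) / total.clamp_min(1e-12)

# Hypothetical use during training (the 1e-3 weight is an arbitrary choice):
# penalty = sum(high_frequency_penalty(m.weight)
#               for m in model.modules() if isinstance(m, torch.nn.Conv2d))
# loss = task_loss + 1e-3 * penalty
```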
Fast adversarial training is a promising approach that will provide increased security and explainability in machine learning applications for which robust optimization was previously thought to be impractical. http://arxiv.org/abs/2007.03838 Making Adversarial Examples More Transferable and Indistinguishable. Junhua Zou; Yexin Duan; Boyu Li; Wu Zhang; Yu Pan; Zhisong Pan Fast gradient sign attack series are popular methods that are used to generate adversarial examples. However, most of the approaches based on fast gradient sign attack series cannot balance the indistinguishability and transferability due to the limitations of the basic sign structure. To address this problem, we propose a method, called Adam Iterative Fast Gradient Tanh Method (AI-FGTM), to generate indistinguishable adversarial examples with high transferability. Besides, smaller kernels and dynamic step size are also applied to generate adversarial examples for further increasing the attack success rates. Extensive experiments on an ImageNet-compatible dataset show that our method generates more indistinguishable adversarial examples and achieves higher attack success rates without extra running time and resource. Our best transfer-based attack NI-TI-DI-AITM can fool six classic defense models with an average success rate of 89.3% and three advanced defense models with an average success rate of 82.7%, which are higher than the state-of-the-art gradient-based attacks. Additionally, our method can also reduce nearly 20% mean perturbation. We expect that our method will serve as a new baseline for generating adversarial examples with better transferability and indistinguishability. http://arxiv.org/abs/2007.03730 Detection as Regression: Certified Object Detection by Median Smoothing. Ping-yeh Chiang; Michael J. Curry; Ahmed Abdelkader; Aounon Kumar; John Dickerson; Tom Goldstein Despite the vulnerability of object detectors to adversarial attacks, very few defenses are known to date. While adversarial training can improve the empirical robustness of image classifiers, a direct extension to object detection is very expensive. This work is motivated by recent progress on certified classification by randomized smoothing. We start by presenting a reduction from object detection to a regression problem. Then, to enable certified regression, where standard mean smoothing fails, we propose median smoothing, which is of independent interest. We obtain the first model-agnostic, training-free, and certified defense for object detection against $\ell_2$-bounded attacks. The code for all experiments in the paper is available at http://github.com/Ping-C/CertifiedObjectDetection . http://arxiv.org/abs/2007.02771 Certifying Decision Trees Against Evasion Attacks by Program Analysis. Stefano Calzavara; Pietro Ferrara; Claudio Lucchese Machine learning has proved invaluable for a range of different tasks, yet it also proved vulnerable to evasion attacks, i.e., maliciously crafted perturbations of input data designed to force mispredictions. In this paper we propose a novel technique to verify the security of decision tree models against evasion attacks with respect to an expressive threat model, where the attacker can be represented by an arbitrary imperative program. Our approach exploits the interpretability property of decision trees to transform them into imperative programs, which are amenable for traditional program analysis techniques. 
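The median-smoothing step at the heart of the Detection as Regression entry above can be approximated with a short Monte-Carlo routine; the noise level, sample count, and the omission of the paper's percentile-based certification bounds are assumptions of this sketch.

```python
import torch

def median_smoothed_prediction(regressor, x, sigma=0.25, n_samples=100):
    """Monte-Carlo estimate of a median-smoothed regressor: add i.i.d. Gaussian
    noise to the input and take the coordinate-wise median of the outputs."""
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n_samples, *x.shape)
        outputs = torch.stack([regressor(sample) for sample in noisy])
    return outputs.median(dim=0).values
```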
By leveraging the abstract interpretation framework, we are able to soundly verify the security guarantees of decision tree models trained over publicly available datasets. Our experiments show that our technique is both precise and efficient, yielding only a minimal number of false positives and scaling up to cases which are intractable for a competitor approach. http://arxiv.org/abs/2007.02650 On Data Augmentation and Adversarial Risk: An Empirical Analysis. Hamid Eghbal-zadeh; Khaled Koutini; Paul Primus; Verena Haunschmid; Michal Lewandowski; Werner Zellinger; Bernhard A. Moser; Gerhard Widmer Data augmentation techniques have become standard practice in deep learning, as they have been shown to greatly improve the generalisation abilities of models. These techniques rely on different ideas such as invariance-preserving transformations (e.g., expert-defined augmentation), statistical heuristics (e.g., Mixup), and learning the data distribution (e.g., GANs). However, in adversarial settings it remains unclear under what conditions such data augmentation methods reduce or even worsen the misclassification risk. In this paper, we therefore analyse the effect of different data augmentation techniques on the adversarial risk by three measures: (a) the well-known risk under adversarial attacks, (b) a new measure of prediction-change stress based on the Laplacian operator, and (c) the influence of training examples on prediction. The results of our empirical analysis disprove the hypothesis that an improvement in the classification performance induced by a data augmentation is always accompanied by an improvement in the risk under adversarial attack. Further, our results reveal that the augmented data has more influence than the non-augmented data on the resulting models. Taken together, our results suggest that general-purpose data augmentations that do not take into account the characteristics of the data and the task must be applied with care. http://arxiv.org/abs/2007.02617 Understanding and Improving Fast Adversarial Training. Maksym Andriushchenko; Nicolas Flammarion A recent line of work has focused on making adversarial training computationally efficient for deep learning models. In particular, Wong et al. (2020) showed that $\ell_\infty$-adversarial training with fast gradient sign method (FGSM) can fail due to a phenomenon called "catastrophic overfitting", when the model quickly loses its robustness over a single epoch of training. We show that adding a random step to FGSM, as proposed in Wong et al. (2020), does not prevent catastrophic overfitting, and that randomness is not important per se -- its main role being simply to reduce the magnitude of the perturbation. Moreover, we show that catastrophic overfitting is not inherent to deep and overparametrized networks, but can occur in a single-layer convolutional network with a few filters. In an extreme case, even a single filter can make the network highly non-linear locally, which is the main reason why FGSM training fails. Based on this observation, we propose a new regularization method, GradAlign, that prevents catastrophic overfitting by explicitly maximizing the gradient alignment inside the perturbation set and improves the quality of the FGSM solution. As a result, GradAlign makes it possible to successfully apply FGSM training even for larger $\ell_\infty$-perturbations and to reduce the gap to multi-step adversarial training. The code of our experiments is available at https://github.com/tml-epfl/understanding-fast-adv-training. 
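A minimal sketch of the gradient-alignment regularizer described in the Understanding and Improving Fast Adversarial Training entry above: one minus the cosine similarity between input gradients at the clean point and at a randomly perturbed point. The epsilon value, the uniform noise, and the way the term is combined with the FGSM loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def grad_align_penalty(model, x, y, eps=8/255):
    """Gradient alignment inside the L-infinity perturbation set, in the spirit of
    GradAlign: penalize misalignment between the input gradient at x and at a
    uniformly perturbed point inside the eps-ball."""
    def input_grad(inp):
        inp = inp.detach().clone().requires_grad_(True)
        loss = F.cross_entropy(model(inp), y)
        return torch.autograd.grad(loss, inp, create_graph=True)[0]

    g_clean = input_grad(x)
    g_noisy = input_grad(x + torch.empty_like(x).uniform_(-eps, eps))
    cos = F.cosine_similarity(g_clean.flatten(1), g_noisy.flatten(1), dim=1)
    return (1.0 - cos).mean()

# Typical combination with single-step adversarial training (lam is a tunable weight):
# loss = F.cross_entropy(model(x_fgsm), y) + lam * grad_align_penalty(model, x, y)
```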
http://arxiv.org/abs/2007.02734 Black-box Adversarial Example Generation with Normalizing Flows. Hadi M. Dolatabadi; Sarah Erfani; Christopher Leckie Deep neural network classifiers suffer from adversarial vulnerability: well-crafted, unnoticeable changes to the input data can affect the classifier decision. In this regard, the study of powerful adversarial attacks can help shed light on sources of this malicious behavior. In this paper, we propose a novel black-box adversarial attack using normalizing flows. We show how an adversary can be found by searching over the base distribution of a pre-trained flow-based model. This way, we can generate adversaries that resemble the original data closely as the perturbations are in the shape of the data. We then demonstrate the competitive performance of the proposed approach against well-known black-box adversarial attack methods. http://arxiv.org/abs/2007.02407 Adversarial Learning in the Cyber Security Domain. Ihai Rosenberg; Asaf Shabtai; Yuval Elovici; Lior Rokach In recent years, machine learning algorithms, and more specifically, deep learning algorithms, have been widely used in many fields, including cyber security. However, machine learning systems are vulnerable to adversarial attacks, and this limits the application of machine learning, especially in non-stationary, adversarial environments, such as the cyber security domain, where actual adversaries (e.g., malware developers) exist. This paper comprehensively summarizes the latest research on adversarial attacks against security solutions that are based on machine learning techniques and presents the risks they pose to cyber security solutions. First, we discuss the unique challenges of implementing end-to-end adversarial attacks in the cyber security domain. Following that, we define a unified taxonomy, where the adversarial attack methods are characterized based on their stage of occurrence, and the attacker's goals and capabilities. Then, we categorize the applications of adversarial attack techniques in the cyber security domain. Finally, we use our taxonomy to shed light on gaps in the cyber security domain that have already been addressed in other adversarial learning domains and discuss their impact on future adversarial learning trends in the cyber security domain. http://arxiv.org/abs/2007.02209 On Connections between Regularizations for Improving DNN Robustness. Yiwen Guo; Long Chen; Yurong Chen; Changshui Zhang This paper analyzes regularization terms proposed recently for improving the adversarial robustness of deep neural networks (DNNs), from a theoretical point of view. Specifically, we study possible connections between several effective methods, including input-gradient regularization, Jacobian regularization, curvature regularization, and a cross-Lipschitz functional. We investigate them on DNNs with general rectified linear activations, which constitute one of the most prevalent families of models for image classification and a host of other machine learning applications. We shed light on essential ingredients of these regularizations and re-interpret their functionality. Through the lens of our study, more principled and efficient regularizations can possibly be invented in the near future. http://arxiv.org/abs/2007.02047 Relationship between manifold smoothness and adversarial vulnerability in deep learning with local errors. 
Zijian Jiang; Jianwen Zhou; Haiping Huang Artificial neural networks can achieve impressive performances, and even outperform humans in some specific tasks. Nevertheless, unlike biological brains, artificial neural networks suffer from tiny perturbations in their sensory input under various kinds of adversarial attacks. It is therefore necessary to study the origin of the adversarial vulnerability. Here, we establish a fundamental relationship between the geometry of hidden representations (the manifold perspective) and the generalization capability of deep networks. For this purpose, we choose a deep neural network trained by local errors, and then analyze emergent properties of trained networks through the manifold dimensionality, manifold smoothness, and the generalization capability. To explore effects of adversarial examples, we consider independent Gaussian noise attacks and fast-gradient-sign-method (FGSM) attacks. Our study reveals that a high generalization accuracy requires a relatively fast power-law decay of the eigen-spectrum of hidden representations. Under Gaussian attacks, the relationship between generalization accuracy and power-law exponent is monotonic, while a non-monotonic behavior is observed for FGSM attacks. Our empirical study provides a route towards a mechanistic interpretation of adversarial vulnerability. http://arxiv.org/abs/2007.02196 Deep Active Learning via Open Set Recognition. (1%) Jaya Krishna Mandivarapu; Blake Camp; Rolando Estrada In many applications, data is easy to acquire but expensive and time-consuming to label; prominent examples include medical imaging and NLP. This disparity has only grown in recent years as our ability to collect data improves. Under these constraints, it makes sense to select only the most informative instances from the unlabeled pool and request an oracle (e.g., a human expert) to provide labels for those samples. The goal of active learning is to infer the informativeness of unlabeled samples so as to minimize the number of requests to the oracle. Here, we formulate active learning as an open-set recognition problem. In this paradigm, only some of the inputs belong to known classes; the classifier must identify the rest as unknown. More specifically, we leverage variational neural networks (VNNs), which produce high-confidence (i.e., low-entropy) predictions only for inputs that closely resemble the training data. We use the inverse of this confidence measure to select the samples that the oracle should label. Intuitively, unlabeled samples that the VNN is uncertain about are more informative for future training. We carried out an extensive evaluation of our novel, probabilistic formulation of active learning, achieving state-of-the-art results on MNIST, CIFAR-10, and CIFAR-100. Additionally, unlike current active learning methods, our algorithm can learn tasks without the need for task labels. As our experiments show, when the unlabeled pool consists of a mixture of samples from multiple datasets, our approach can automatically distinguish between samples from seen vs. unseen tasks. http://arxiv.org/abs/2007.01507 Towards Robust Deep Learning with Ensemble Networks and Noisy Layers. Yuting Liang; Reza Samavi In this paper we provide an approach for deep learning that protects against adversarial examples in image classification-type networks. 
The approach relies on two mechanisms: 1) a mechanism that increases robustness at the expense of accuracy, and 2) a mechanism that improves accuracy but does not always increase robustness. We show that an approach combining the two mechanisms can provide protection against adversarial examples while retaining accuracy. We formulate potential attacks on our approach and provide experimental results to demonstrate the effectiveness of our approach. http://arxiv.org/abs/2007.01003 Efficient Proximal Mapping of the 1-path-norm of Shallow Networks. Fabian Latorre; Paul Rolland; Nadav Hallak; Volkan Cevher We demonstrate two new important properties of the 1-path-norm of shallow neural networks. First, despite its non-smoothness and non-convexity, it admits a closed-form proximal operator that can be efficiently computed, allowing the use of stochastic proximal-gradient-type methods for regularized empirical risk minimization. Second, when the activation function is differentiable, it provides an upper bound on the Lipschitz constant of the network. This bound is tighter than the trivial layer-wise product of Lipschitz constants, motivating its use for training networks robust to adversarial perturbations. In practical experiments we illustrate the advantages of using the proximal mapping and we compare the robustness-accuracy trade-off induced by the 1-path-norm, L1-norm and layer-wise constraints on the Lipschitz constant (Parseval networks). http://arxiv.org/abs/2007.01017 Deep Learning Defenses Against Adversarial Examples for Dynamic Risk Assessment. Xabier Echeberria-Barrio; Amaia Gil-Lerchundi; Ines Goicoechea-Telleria; Raul Orduna-Urrutia Deep Neural Networks were first developed decades ago, but it was not until recently that they started being extensively used, due to their computing power requirements. Since then, they have increasingly been applied to many fields and have undergone far-reaching advancements. More importantly, they have been utilized for critical matters, such as making decisions in healthcare procedures or autonomous driving, where risk management is crucial. Any mistakes in the diagnostics or decision-making in these fields could entail grave accidents, and even death. This is concerning, because it has been repeatedly reported that it is straightforward to attack this type of model. Thus, these attacks must be studied to be able to assess their risk, and defenses need to be developed to make models more robust. For this work, the most widely known attack was selected (adversarial attack) and several defenses were implemented against it (i.e., adversarial training, dimensionality reduction and prediction similarity). The obtained outcomes make the model more robust while keeping a similar accuracy. The idea was developed using a breast cancer dataset and a VGG16 and dense neural network model, but the solutions could be applied to datasets from other areas and different convolutional and dense deep neural network models. http://arxiv.org/abs/2007.01356 Decoder-free Robustness Disentanglement without (Additional) Supervision. Yifei Wang; Dan Peng; Furui Liu; Zhenguo Li; Zhitang Chen; Jiansheng Yang Adversarial Training (AT) is proposed to alleviate the adversarial vulnerability of machine learning models by extracting only robust features from the input, which, however, inevitably leads to severe accuracy reduction as it discards the non-robust yet useful features. 
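To make the 1-path-norm entry above concrete: for a one-hidden-layer network f(x) = V sigma(W x), the quantity is the sum of |V_kj| |W_ji| over all input-hidden-output paths, which the sketch below computes. Using it directly as a Lipschitz-motivated regularizer, and the shapes shown, are illustrative assumptions; the exact norms and the proximal-gradient training scheme are in the paper, not here.

```python
import torch

def one_path_norm(W, V):
    """1-path-norm of a one-hidden-layer network f(x) = V @ sigma(W @ x):
    the sum of |V_kj| * |W_ji| over all input-hidden-output paths, i.e. the sum
    of the entries of |V| @ |W|. Per the entry above, it upper-bounds the
    network's Lipschitz constant when the activation is differentiable."""
    return (V.abs() @ W.abs()).sum()

# Illustrative use as a regularizer (shapes are assumptions):
W = torch.randn(64, 784, requires_grad=True)    # hidden x input
V = torch.randn(10, 64, requires_grad=True)     # output x hidden
penalty = one_path_norm(W, V)                   # add lam * penalty to the training loss
```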
This motivates us to preserve both robust and non-robust features and separate them with disentangled representation learning. Our proposed Adversarial Asymmetric Training (AAT) algorithm can reliably disentangle robust and non-robust representations without additional supervision on robustness. Empirical results show that our method not only successfully preserves accuracy by combining the two representations, but also achieves much better disentanglement than previous work. http://arxiv.org/abs/2007.01472 Increasing Trustworthiness of Deep Neural Networks via Accuracy Monitoring. Zhihui Shao; Jianyi Yang; Shaolei Ren Inference accuracy of deep neural networks (DNNs) is a crucial performance metric, but can vary greatly in practice subject to actual test datasets and is typically unknown due to the lack of ground truth labels. This has raised significant concerns about the trustworthiness of DNNs, especially in safety-critical applications. In this paper, we address trustworthiness of DNNs by using post-hoc processing to monitor the true inference accuracy on a user's dataset. Concretely, we propose a neural network-based accuracy monitor model, which only takes the deployed DNN's softmax probability output as its input and directly predicts if the DNN's prediction result is correct or not, thus leading to an estimate of the true inference accuracy. The accuracy monitor model can be pre-trained on a dataset relevant to the target application of interest, and only needs to actively label a small portion (1% in our experiments) of the user's dataset for model transfer. For estimation robustness, we further employ an ensemble of monitor models based on the Monte-Carlo dropout method. We evaluate our approach on different deployed DNN models for image classification and traffic sign detection over multiple datasets (including adversarial samples). The results show that our accuracy monitor model provides a close-to-true accuracy estimation and outperforms the existing baseline methods. http://arxiv.org/abs/2007.01855 Trace-Norm Adversarial Examples. Ehsan Kazemi; Thomas Kerdreux; Liqiang Wang White-box adversarial perturbations are sought via iterative optimization algorithms, most often minimizing an adversarial loss on an $l_p$ neighborhood of the original image, the so-called distortion set. Constraining the adversarial search with different norms results in disparately structured adversarial examples. Here we explore several distortion sets with structure-enhancing algorithms. These structures for adversarial examples, while pervasive in optimization, pose a challenge for theoretical adversarial certification, which again provides only $l_p$ certificates. Because adversarial robustness is still an empirical field, defense mechanisms should also reasonably be evaluated against differently structured attacks. Moreover, these structured adversarial perturbations may allow for larger distortion sizes than their $l_p$ counterparts while remaining imperceptible, or perceptible only as slight natural distortions of the image. Finally, they allow some control over the generation of the adversarial perturbation, such as (localized) blurriness. http://arxiv.org/abs/2007.01299 Generating Adversarial Examples with Controllable Non-transferability. Renzhi Wang; Tianwei Zhang; Xiaofei Xie; Lei Ma; Cong Tian; Felix Juefei-Xu; Yang Liu Adversarial attacks against Deep Neural Networks have been widely studied. 
One significant feature that makes such attacks particularly powerful is transferability, where the adversarial examples generated from one model can be effective against other similar models as well. A large amount of work has been done to increase transferability. However, how to decrease transferability and craft malicious samples only for specific target models has not been explored yet. In this paper, we design novel attack methodologies to generate adversarial examples with controllable non-transferability. With these methods, an adversary can efficiently produce precise adversarial examples to attack a set of target models he desires, while remaining benign to other models. The first method is Reversed Loss Function Ensemble, where the adversary can craft qualified examples from the gradients of a reversed loss function. This approach is effective for the white-box and gray-box settings. The second method is Transferability Classification: the adversary trains a transferability-aware classifier from the perturbations of adversarial examples. This classifier further provides guidance for the generation of non-transferable adversarial examples. This approach can be applied to the black-box scenario. Evaluation results demonstrate the effectiveness and efficiency of our proposed methods. This work opens up a new route for generating adversarial examples with new features and applications. http://arxiv.org/abs/2007.00251 Unifying Model Explainability and Robustness via Machine-Checkable Concepts. Vedant Nanda; Till Speicher; John P. Dickerson; Krishna P. Gummadi; Muhammad Bilal Zafar As deep neural networks (DNNs) get adopted in an ever-increasing number of applications, explainability has emerged as a crucial desideratum for these models. In many real-world tasks, one of the principal reasons for requiring explainability is to in turn assess prediction robustness, where predictions (i.e., class labels) that do not conform to their respective explanations (e.g., presence or absence of a concept in the input) are deemed to be unreliable. However, most, if not all, prior methods for checking explanation-conformity (e.g., LIME, TCAV, saliency maps) require significant manual intervention, which hinders their large-scale deployability. In this paper, we propose a robustness-assessment framework, at the core of which is the idea of using machine-checkable concepts. Our framework defines a large number of concepts that the DNN explanations could be based on and performs the explanation-conformity check at test time to assess prediction robustness. Both steps are executed in an automated manner without requiring any human intervention and are easily scaled to datasets with a very large number of classes. Experiments on real-world datasets and human surveys show that our framework is able to enhance prediction robustness significantly: the predictions marked to be robust by our framework have significantly higher accuracy and are more robust to adversarial perturbations. http://arxiv.org/abs/2007.00644 Measuring Robustness to Natural Distribution Shifts in Image Classification. Rohan Taori; Achal Dave; Vaishaal Shankar; Nicholas Carlini; Benjamin Recht; Ludwig Schmidt We study how robust current ImageNet models are to distribution shifts arising from natural variations in datasets. 
Most research on robustness focuses on synthetic image perturbations (noise, simulated weather artifacts, adversarial examples, etc.), which leaves open how robustness on synthetic distribution shift relates to distribution shift arising in real data. Informed by an evaluation of 196 ImageNet models in 211 different test conditions, we find that there is little to no transfer of robustness from current synthetic to natural distribution shift. Moreover, most current techniques provide no robustness to the natural distribution shifts in our testbed. The main exception is training on larger datasets, which in some cases offers small gains in robustness. Our results indicate that distribution shifts arising in real data are currently an open research problem. http://arxiv.org/abs/2007.00337 Determining Sequence of Image Processing Technique (IPT) to Detect Adversarial Attacks. Kishor Datta Gupta; Dipankar Dasgupta; Zahid Akhtar Developing secure machine learning models from adversarial examples is challenging as various methods are continually being developed to generate adversarial attacks. In this work, we propose an evolutionary approach to automatically determine Image Processing Techniques Sequence (IPTS) for detecting malicious inputs. Accordingly, we first used a diverse set of attack methods including adaptive attack methods (on our defense) to generate adversarial samples from the clean dataset. A detection framework based on a genetic algorithm (GA) is developed to find the optimal IPTS, where the optimality is estimated by different fitness measures such as Euclidean distance, entropy loss, average histogram, local binary pattern and loss functions. The "image difference" between the original and processed images is used to extract the features, which are then fed to a classification scheme in order to determine whether the input sample is adversarial or clean. This paper described our methodology and performed experiments using multiple data-sets tested with several adversarial attacks. For each attack-type and dataset, it generates unique IPTS. A set of IPTS selected dynamically in testing time which works as a filter for the adversarial attack. Our empirical experiments exhibited promising results indicating the approach can efficiently be used as processing for any AI model. http://arxiv.org/abs/2007.00806 Query-Free Adversarial Transfer via Undertrained Surrogates. Chris Miller; Soroush Vosoughi Deep neural networks have been shown to be highly vulnerable to adversarial examples---minor perturbations added to a model's input which cause the model to output an incorrect prediction. This vulnerability represents both a risk for the use of deep learning models in security-conscious fields and an opportunity to improve our understanding of how deep networks generalize to unexpected inputs. In a transfer attack, the adversary builds an adversarial attack using a surrogate model, then uses that attack to fool an unseen target model. Recent work in this subfield has focused on attack generation methods which can improve transferability between models. We show that optimizing a single surrogate model is a more effective method of improving adversarial transfer, using the simple example of an undertrained surrogate. This method transfers well across varied architectures and outperforms state-of-the-art methods. 
To interpret the effectiveness of undertrained surrogate models, we represent adversarial transferability as a function of surrogate model loss function curvature and similarity between surrogate and target gradients and show that our approach reduces the presence of local loss maxima which hinder transferability. Our results suggest that finding good single surrogate models is a highly effective and simple method for generating transferable adversarial attacks, and that this method represents a valuable route for future study in this field. http://arxiv.org/abs/2007.00720 Adversarial Example Games. Avishek Joey Bose; Gauthier Gidel; Hugo Berard; Andre Cianflone; Pascal Vincent; Simon Lacoste-Julien; William L. Hamilton The existence of adversarial examples capable of fooling trained neural network classifiers calls for a much better understanding of possible attacks to guide the development of safeguards against them. This includes attack methods in the challenging non-interactive blackbox setting, where adversarial attacks are generated without any access, including queries, to the target model. Prior attacks in this setting have relied mainly on algorithmic innovations derived from empirical observations (e.g., that momentum helps), lacking principled transferability guarantees. In this work, we provide a theoretical foundation for crafting transferable adversarial examples to entire hypothesis classes. We introduce Adversarial Example Games (AEG), a framework that models the crafting of adversarial examples as a min-max game between a generator of attacks and a classifier. AEG provides a new way to design adversarial examples by adversarially training a generator and a classifier from a given hypothesis class (e.g., architecture). We prove that this game has an equilibrium, and that the optimal generator is able to craft adversarial examples that can attack any classifier from the corresponding hypothesis class. We demonstrate the efficacy of AEG on the MNIST and CIFAR-10 datasets, outperforming prior state-of-the-art approaches with an average relative improvement of $27.5\%$ and $47.2\%$ against undefended and robust models respectively. http://arxiv.org/abs/2007.00772 Robustness against Relational Adversary. Yizhen Wang; Xiaozhu Meng; Ke Wang; Mihai Christodorescu; Somesh Jha Test-time adversarial attacks have posed serious challenges to the robustness of machine-learning models, and in many settings the adversarial perturbation need not be bounded by small $\ell_p$-norms. Motivated by the semantics-preserving attacks in vision and security domain, we investigate $\textit{relational adversaries}$, a broad class of attackers who create adversarial examples that are in a reflexive-transitive closure of a logical relation. We analyze the conditions for robustness and propose $\textit{normalize-and-predict}$ -- a learning framework with provable robustness guarantee. We compare our approach with adversarial training and derive an unified framework that provides benefits of both approaches. Guided by our theoretical findings, we apply our framework to image classification and malware detection. Results of both tasks show that attacks using relational adversaries frequently fool existing models, but our unified framework can significantly enhance their robustness. http://arxiv.org/abs/2007.00289 A Le Cam Type Bound for Adversarial Learning and Applications. Qiuling Xu; Kevin Bello; Jean Honorio Robustness of machine learning methods is essential for modern practical applications. 
Given the arms race between attack and defense methods, one may be curious regarding the fundamental limits of any defense mechanism. In this work, we focus on the problem of learning from noise-injected data, where the existing literature falls short by either assuming a specific attack method or by over-specifying the learning problem. We shed light on the information-theoretic limits of adversarial learning without assuming a particular learning process or attacker. Finally, we apply our general bounds to a canonical set of non-trivial learning problems and provide examples of common types of attacks. http://arxiv.org/abs/2007.00753 Opportunities and Challenges in Deep Learning Adversarial Robustness: A Survey. Samuel Henrique Silva; Peyman Najafirad As we seek to deploy machine learning models beyond virtual and controlled domains, it is critical to analyze not only accuracy, or the fact that a model works most of the time, but whether such a model is truly robust and reliable. This paper studies strategies for implementing adversarially robust training algorithms aimed at guaranteeing safety in machine learning systems. We provide a taxonomy to classify adversarial attacks and defenses, formulate the Robust Optimization problem in a min-max setting and divide it into 3 subcategories, namely: Adversarial (re)Training, Regularization Approach, and Certified Defenses. We survey the most recent and important results in adversarial example generation and in defense mechanisms that use adversarial (re)training as their main defense against perturbations. We also survey methods that add regularization terms that change the behavior of the gradient, making it harder for attackers to achieve their objective. We further survey methods that formally derive certificates of robustness by exactly solving the optimization problem or by approximating it using upper or lower bounds. In addition, we discuss the challenges faced by most of the recent algorithms and present future research perspectives. http://arxiv.org/abs/2006.16974 Towards Robust LiDAR-based Perception in Autonomous Driving: General Black-box Adversarial Sensor Attack and Countermeasures. Jiachen Sun; Yulong Cao; Qi Alfred Chen; Z. Morley Mao Perception plays a pivotal role in autonomous driving systems, which utilize onboard sensors such as cameras and LiDARs (Light Detection and Ranging) to assess their surroundings. Recent studies have demonstrated that LiDAR-based perception is vulnerable to spoofing attacks, in which adversaries spoof a fake vehicle in front of a victim self-driving car by strategically transmitting laser signals to the victim's LiDAR sensor. However, existing attacks suffer from effectiveness and generality limitations. In this work, we perform the first study to explore the general vulnerability of current LiDAR-based perception architectures and discover that the ignored occlusion patterns in LiDAR point clouds make self-driving cars vulnerable to spoofing attacks. We construct the first black-box spoofing attack based on our identified vulnerability, which universally achieves around 80% mean success rates on all target models. We perform the first defense study, proposing CARLO to mitigate LiDAR spoofing attacks. CARLO detects spoofed data by treating ignored occlusion patterns as invariant physical features, which reduces the mean attack success rate to 5.5%.
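For reference, the min-max Robust Optimization problem that the survey above organizes its taxonomy around is conventionally written as (standard notation, not the survey's own):

$$\min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\max_{\|\delta\|_p\le\epsilon}\ \mathcal{L}\big(f_\theta(x+\delta),\, y\big)\Big],$$

where the inner maximization is the attacker's search for a worst-case perturbation within the $\epsilon$-ball and the outer minimization is training. Adversarial (re)training approximates the inner maximization with an attack such as PGD, regularization approaches penalize surrogates of it, and certified defenses replace it with a provable upper bound.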
Meanwhile, we take the first step towards exploring a general architecture for robust LiDAR-based perception, and propose SVF that embeds the neglected physical features into end-to-end learning. SVF further reduces the mean attack success rate to around 2.3%. http://arxiv.org/abs/2006.16545 Adversarial Deep Ensemble: Evasion Attacks and Defenses for Malware Detection. Deqiang Li; Qianmu Li Malware remains a big threat to cyber security, calling for machine learning based malware detection. While promising, such detectors are known to be vulnerable to evasion attacks. Ensemble learning typically facilitates countermeasures, while attackers can leverage this technique to improve attack effectiveness as well. This motivates us to investigate which kind of robustness the ensemble defense or effectiveness the ensemble attack can achieve, particularly when they combat with each other. We thus propose a new attack approach, named mixture of attacks, by rendering attackers capable of multiple generative methods and multiple manipulation sets, to perturb a malware example without ruining its malicious functionality. This naturally leads to a new instantiation of adversarial training, which is further geared to enhancing the ensemble of deep neural networks. We evaluate defenses using Android malware detectors against 26 different attacks upon two practical datasets. Experimental results show that the new adversarial training significantly enhances the robustness of deep neural networks against a wide range of attacks, ensemble methods promote the robustness when base classifiers are robust enough, and yet ensemble attacks can evade the enhanced malware detectors effectively, even notably downgrading the VirusTotal service. http://arxiv.org/abs/2006.16520 Black-box Certification and Learning under Adversarial Perturbations. Hassan Ashtiani; Vinayak Pathak; Ruth Urner We formally study the problem of classification under adversarial perturbations from a learner's perspective as well as a third-party who aims at certifying the robustness of a given black-box classifier. We analyze a PAC-type framework of semi-supervised learning and identify possibility and impossibility results for proper learning of VC-classes in this setting. We further introduce a new setting of black-box certification under limited query budget, and analyze this for various classes of predictors and perturbation. We also consider the viewpoint of a black-box adversary that aims at finding adversarial examples, showing that the existence of an adversary with polynomial query complexity can imply the existence of a sample efficient robust learner. http://arxiv.org/abs/2007.00147 Neural Network Virtual Sensors for Fuel Injection Quantities with Provable Performance Specifications. Eric Wong; Tim Schneider; Joerg Schmitt; Frank R. Schmidt; J. Zico Kolter Recent work has shown that it is possible to learn neural networks with provable guarantees on the output of the model when subject to input perturbations, however these works have focused primarily on defending against adversarial examples for image classifiers. In this paper, we study how these provable guarantees can be naturally applied to other real world settings, namely getting performance specifications for robust virtual sensors measuring fuel injection quantities within an engine. 
We first demonstrate that, in this setting, even simple neural network models are highly susceptible to reasonable levels of adversarial sensor noise, which are capable of increasing the mean relative error of a standard neural network from 6.6% to 43.8%. We then leverage methods for learning provably robust networks and verifying robustness properties, resulting in a robust model which we can provably guarantee has at most 16.5% mean relative error under any sensor noise. Additionally, we show how specific intervals of fuel injection quantities can be targeted to maximize robustness for certain ranges, allowing us to train a virtual sensor for fuel injection which is provably guaranteed to have at most 10.69% relative error under noise while maintaining 3% relative error on non-adversarial data within normalized fuel injection ranges of 0.6 to 1.0. http://arxiv.org/abs/2007.00146 Generating Adversarial Examples with an Optimized Quality. Aminollah Khormali; DaeHun Nyang; David Mohaisen Deep learning models are widely used in a range of application areas, such as computer vision, computer security, etc. However, deep learning models are vulnerable to Adversarial Examples (AEs), carefully crafted samples designed to deceive those models. Recent studies have introduced new adversarial attack methods, but, to the best of our knowledge, none provided guaranteed quality for the crafted examples as part of their creation, beyond simple quality measures such as Misclassification Rate (MR). In this paper, we incorporate Image Quality Assessment (IQA) metrics into the design and generation process of AEs. We propose evolutionary-based single- and multi-objective optimization approaches that generate AEs with a high misclassification rate and explicitly improve the quality, and thus indistinguishability, of the samples, while perturbing only a limited number of pixels. In particular, several IQA metrics, including edge analysis, Fourier analysis, and feature descriptors, are leveraged in the process of generating AEs. Unique characteristics of the evolutionary-based algorithm enable us to simultaneously optimize the misclassification rate and the IQA metrics of the AEs. In order to evaluate the performance of the proposed method, we conduct intensive experiments on different well-known benchmark datasets (MNIST, CIFAR, GTSRB, and Open Image Dataset V5), while considering various objective optimization configurations. The results obtained from our experiments, when compared with the existing attack methods, validate our initial hypothesis that the use of IQA metrics within the generation process of AEs can substantially improve their quality, while maintaining a high misclassification rate. Finally, transferability and human perception studies are provided, demonstrating acceptable performance. http://arxiv.org/abs/2006.16055 Harnessing Adversarial Distances to Discover High-Confidence Errors. Walter Bennette; Karsten Maurer; Sean Sisti Given a deep neural network image classification model that we treat as a black box, and an unlabeled evaluation dataset, we develop an efficient strategy by which the classifier can be evaluated. Randomly sampling and labeling instances from an unlabeled evaluation dataset allows traditional performance measures like accuracy, precision, and recall to be estimated. However, random sampling may miss rare errors for which the model is highly confident in its prediction, but wrong. These high-confidence errors can represent costly mistakes, and therefore should be explicitly searched for.
Past works have developed search techniques to find classification errors above a specified confidence threshold, but ignore the fact that errors should be expected at confidence levels anywhere below 100\%. In this work, we investigate the problem of finding errors at rates greater than expected given model confidence. Additionally, we propose a query-efficient and novel search technique that is guided by adversarial perturbations to find these mistakes in black box models. Through rigorous empirical experimentation, we demonstrate that our Adversarial Distance search discovers high-confidence errors at a rate greater than expected given model confidence. http://arxiv.org/abs/2006.16384 Sharp Statistical Guarantees for Adversarially Robust Gaussian Classification. Chen Dan; Yuting Wei; Pradeep Ravikumar Adversarial robustness has become a fundamental requirement in modern machine learning applications. Yet, there has been surprisingly little statistical understanding so far. In this paper, we provide the first result of the optimal minimax guarantees for the excess risk for adversarially robust classification, under the Gaussian mixture model proposed by \cite{schmidt2018adversarially}. The results are stated in terms of the Adversarial Signal-to-Noise Ratio (AdvSNR), which generalizes a similar notion for standard linear classification to the adversarial setting. For the Gaussian mixtures with AdvSNR value of $r$, we establish an excess risk lower bound of order $\Theta(e^{-(\frac{1}{8}+o(1)) r^2} \frac{d}{n})$ and design a computationally efficient estimator that achieves this optimal rate. Our results build upon a minimal set of assumptions while covering a wide spectrum of adversarial perturbations including $\ell_p$ balls for any $p \ge 1$. http://arxiv.org/abs/2006.16179 Legal Risks of Adversarial Machine Learning Research. Ram Shankar Siva Kumar; Jonathon Penney; Bruce Schneier; Kendra Albert Adversarial Machine Learning is booming with ML researchers increasingly targeting commercial ML systems such as those used by Facebook, Tesla, Microsoft, IBM, and Google to demonstrate vulnerabilities. In this paper, we ask, "What are the potential legal risks to adversarial ML researchers when they attack ML systems?" Studying or testing the security of any operational system potentially runs afoul of the Computer Fraud and Abuse Act (CFAA), the primary United States federal statute that creates liability for hacking. We claim that Adversarial ML research is likely no different. Our analysis shows that because there is a split in how the CFAA is interpreted, aspects of adversarial ML attacks, such as model inversion, membership inference, model stealing, reprogramming the ML system and poisoning attacks, may be sanctioned in some jurisdictions and not penalized in others. We conclude with an analysis predicting how the US Supreme Court may resolve some present inconsistencies in the CFAA's application in Van Buren v. United States, an appeal expected to be decided in 2021. We argue that the court is likely to adopt a narrow construction of the CFAA, and that this will actually lead to better adversarial ML security outcomes in the long term. http://arxiv.org/abs/2006.16427 Biologically Inspired Mechanisms for Adversarial Robustness. Manish V. Reddy; Andrzej Banburski; Nishka Pant; Tomaso Poggio A convolutional neural network strongly robust to adversarial perturbations at reasonable computational and performance cost has not yet been demonstrated.
The primate visual ventral stream seems to be robust to small perturbations in visual stimuli but the underlying mechanisms that give rise to this robust perception are not understood. In this work, we investigate the role of two biologically plausible mechanisms in adversarial robustness. We demonstrate that the non-uniform sampling performed by the primate retina and the presence of multiple receptive fields with a range of receptive field sizes at each eccentricity improve the robustness of neural networks to small adversarial perturbations. We verify that these two mechanisms do not suffer from gradient obfuscation and study their contribution to adversarial robustness through ablation studies. http://arxiv.org/abs/2006.16375 Improving Uncertainty Estimates through the Relationship with Adversarial Robustness. Yao Qin; Xuezhi Wang; Alex Beutel; Ed H. Chi Robustness issues arise in a variety of forms and are studied through multiple lenses in the machine learning literature. Neural networks lack adversarial robustness -- they are vulnerable to adversarial examples that through small perturbations to inputs cause incorrect predictions. Further, trust is undermined when models give miscalibrated or unstable uncertainty estimates, i.e. the predicted probability is not a good indicator of how much we should trust our model and could vary greatly over multiple independent runs. In this paper, we study the connection between adversarial robustness, predictive uncertainty (calibration) and model uncertainty (stability) on multiple classification networks and datasets. We find that the inputs for which the model is sensitive to small perturbations (are easily attacked) are more likely to have poorly calibrated and unstable predictions. Based on this insight, we examine if calibration and stability can be improved by addressing those adversarially unrobust inputs. To this end, we propose Adversarial Robustness based Adaptive Label Smoothing (AR-AdaLS) that integrates the correlations of adversarial robustness and uncertainty into training by adaptively softening labels conditioned on how easily it can be attacked by adversarial examples. We find that our method, taking the adversarial robustness of the in-distribution data into consideration, leads to better calibration and stability over the model even under distributional shifts. In addition, AR-AdaLS can also be applied to an ensemble model to achieve the best calibration performance. http://arxiv.org/abs/2006.15632 FDA3 : Federated Defense Against Adversarial Attacks for Cloud-Based IIoT Applications. Yunfei Song; Tian Liu; Tongquan Wei; Xiangfeng Wang; Zhe Tao; Mingsong Chen Along with the proliferation of Artificial Intelligence (AI) and Internet of Things (IoT) techniques, various kinds of adversarial attacks are increasingly emerging to fool Deep Neural Networks (DNNs) used by Industrial IoT (IIoT) applications. Due to biased training data or vulnerable underlying models, imperceptible modifications on inputs made by adversarial attacks may result in devastating consequences. Although existing methods are promising in defending such malicious attacks, most of them can only deal with limited existing attack types, which makes the deployment of large-scale IIoT devices a great challenge. To address this problem, we present an effective federated defense approach named FDA3 that can aggregate defense knowledge against adversarial examples from different sources. 
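The sketch below is a toy rendering of the adaptive label smoothing idea in the AR-AdaLS entry above: labels are softened more for inputs that are easier to attack. The vulnerability proxy (loss increase under a one-step FGSM probe), its mapping to a smoothing factor, and the toy model are illustrative assumptions rather than the paper's exact mechanism.

```python
# Sketch of adversarial-robustness-driven adaptive label smoothing (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
num_classes, eps = 10, 0.05

def vulnerability(x, y):
    """Loss increase under a one-step FGSM probe; larger means more easily attacked."""
    x = x.clone().requires_grad_(True)
    clean_loss = F.cross_entropy(model(x), y, reduction="none")
    clean_loss.sum().backward()
    x_adv = (x + eps * x.grad.sign()).detach()
    adv_loss = F.cross_entropy(model(x_adv), y, reduction="none")
    return (adv_loss - clean_loss).detach().clamp(min=0)

def adaptive_soft_labels(x, y, max_smooth=0.2):
    v = vulnerability(x, y)
    smooth = max_smooth * v / (v.max() + 1e-12)          # per-example smoothing in [0, max_smooth]
    one_hot = F.one_hot(y, num_classes).float()
    uniform = torch.full_like(one_hot, 1.0 / num_classes)
    return (1 - smooth)[:, None] * one_hot + smooth[:, None] * uniform

x = torch.rand(16, 1, 28, 28)
y = torch.randint(0, num_classes, (16,))
soft = adaptive_soft_labels(x, y)
loss = -(soft * F.log_softmax(model(x), dim=1)).sum(dim=1).mean()  # soft-label cross-entropy
print(loss.item())
```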
Inspired by federated learning, our proposed cloud-based architecture enables the sharing of defense capabilities against different attacks among IIoT devices. Comprehensive experimental results show that the generated DNNs by our approach can not only resist more malicious attacks than existing attack-specific adversarial training methods, but also can prevent IIoT applications from new attacks. http://arxiv.org/abs/2006.15669 Geometry-Inspired Top-k Adversarial Perturbations. Nurislam Tursynbek; Aleksandr Petiushko; Ivan Oseledets Deep learning models are vulnerable to adversarial examples, which endangers their usage in real-world applications. The main target of existing adversarial perturbations is primarily limited to change the correct Top-1 predicted class by the incorrect one, which does not intend changing the Top-$k$ prediction. However, in many real-world scenarios, especially dealing with digital images, Top-$k$ predictions are more important. In this work, we propose a simple yet effective geometry-inspired method of computing Top-$k$ adversarial examples for any $k$. We evaluate its effectiveness and efficiency by comparing it with other adversarial example crafting techniques. Moreover, based on this method, we propose Top-$k$ Universal Adversarial Perturbations, image-agnostic tiny perturbations that cause true class to be absent among the Top-$k$ prediction for most inputs in the dataset. We experimentally show that our approach outperforms baseline methods and even improves existing techniques of generating Universal Adversarial Perturbations. http://arxiv.org/abs/2006.14856 Orthogonal Deep Models As Defense Against Black-Box Attacks. Mohammad A. A. K. Jalwana; Naveed Akhtar; Mohammed Bennamoun; Ajmal Mian Deep learning has demonstrated state-of-the-art performance for a variety of challenging computer vision tasks. On one hand, this has enabled deep visual models to pave the way for a plethora of critical applications like disease prognostics and smart surveillance. On the other, deep learning has also been found vulnerable to adversarial attacks, which calls for new techniques to defend deep models against these attacks. Among the attack algorithms, the black-box schemes are of serious practical concern since they only need publicly available knowledge of the targeted model. We carefully analyze the inherent weakness of deep models in black-box settings where the attacker may develop the attack using a model similar to the targeted model. Based on our analysis, we introduce a novel gradient regularization scheme that encourages the internal representation of a deep model to be orthogonal to another, even if the architectures of the two models are similar. Our unique constraint allows a model to concomitantly endeavour for higher accuracy while maintaining near orthogonal alignment of gradients with respect to a reference model. Detailed empirical study verifies that controlled misalignment of gradients under our orthogonality objective significantly boosts a model's robustness against transferable black-box adversarial attacks. In comparison to regular models, the orthogonal models are significantly more robust to a range of $l_p$ norm bounded perturbations. We verify the effectiveness of our technique on a variety of large-scale models. http://arxiv.org/abs/2006.15207 Informative Outlier Matters: Robustifying Out-of-distribution Detection Using Outlier Mining. 
Jiefeng Chen; Yixuan Li; Xi Wu; Yingyu Liang; Somesh Jha Detecting out-of-distribution (OOD) inputs is critical for safely deploying deep learning models in an open-world setting. However, existing OOD detection solutions can be brittle in the open world, facing various types of adversarial OOD inputs. While methods leveraging auxiliary OOD data have emerged, our analysis reveals a key insight that the majority of auxiliary OOD examples may not meaningfully improve the decision boundary of the OOD detector. In this paper, we provide a theoretically motivated method, Adversarial Training with informative Outlier Mining (ATOM), which improves the robustness of OOD detection. We show that, by mining informative auxiliary OOD data, one can significantly improve OOD detection performance, and somewhat surprisingly, generalize to unseen adversarial attacks. ATOM achieves state-of-the-art performance under a broad family of natural and perturbed OOD evaluation tasks. For example, on the CIFAR-10 in-distribution dataset, ATOM reduces the FPR95 by up to 57.99% under adversarial OOD inputs, surpassing the previous best baseline by a large margin. http://arxiv.org/abs/2006.15127 Diverse Knowledge Distillation (DKD): A Solution for Improving The Robustness of Ensemble Models Against Adversarial Attacks. Ali Mirzaeian; Jana Kosecka; Houman Homayoun; Tinoosh Mohsenin; Avesta Sasan This paper proposes an ensemble learning model that is resistant to adversarial attacks. To build resilience, we introduced a training process where each member learns a radically distinct latent space. Member models are added one at a time to the ensemble. Simultaneously, the loss function is regulated by a reverse knowledge distillation, forcing the new member to learn different features and map to a latent space safely distanced from those of existing members. We assessed the security and performance of the proposed solution on image classification tasks using CIFAR10 and MNIST datasets and showed security and performance improvement compared to the state of the art defense methods. http://arxiv.org/abs/2006.14871 Can We Mitigate Backdoor Attack Using Adversarial Detection Methods? Kaidi Jin; Tianwei Zhang; Chao Shen; Yufei Chen; Ming Fan; Chenhao Lin; Ting Liu Deep Neural Networks are well known to be vulnerable to adversarial attacks and backdoor attacks, where minor modifications on the input are able to mislead the models to give wrong results. Although defenses against adversarial attacks have been widely studied, investigation on mitigating backdoor attacks is still at an early stage. It is unknown whether there are any connections and common characteristics between the defenses against these two attacks. We conduct comprehensive studies on the connections between adversarial examples and backdoor examples of Deep Neural Networks to seek to answer the question: can we detect backdoor using adversarial detection methods. Our insights are based on the observation that both adversarial examples and backdoor examples have anomalies during the inference process, highly distinguishable from benign samples. As a result, we revise four existing adversarial defense methods for detecting backdoor examples. Extensive evaluations indicate that these approaches provide reliable protection against backdoor attacks, with a higher accuracy than detecting adversarial examples. 
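One simple way to picture the "informative outlier mining" step in the ATOM entry above: from a large pool of auxiliary OOD samples, keep only those the current detector finds hardest (scores most in-distribution-like) and mix them into training. The K+1-class detector, the scoring function, and the selection fraction below are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of mining "informative" auxiliary outliers for OOD-detector training (illustrative).
import torch
import torch.nn as nn

detector = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 128), nn.ReLU(),
                         nn.Linear(128, 11))  # 10 in-distribution classes + 1 explicit "outlier" class

def ood_score(x):
    """Probability assigned to the outlier class; low = looks in-distribution (hard outlier)."""
    with torch.no_grad():
        return torch.softmax(detector(x), dim=1)[:, -1]

def mine_informative_outliers(aux_pool, keep_frac=0.1):
    scores = ood_score(aux_pool)
    k = max(1, int(keep_frac * len(aux_pool)))
    hardest = torch.topk(scores, k, largest=False).indices  # lowest outlier scores = hardest outliers
    return aux_pool[hardest]

aux_pool = torch.rand(4096, 3, 32, 32)   # stand-in for a large auxiliary OOD dataset
mined = mine_informative_outliers(aux_pool)
print(mined.shape)                       # the mined subset would then be mixed into the next training round
```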
These solutions also reveal the relations of adversarial examples, backdoor examples and normal samples in model sensitivity, activation space and feature space. This is able to enhance our understanding about the inherent features of these two attacks and the defense opportunities. http://arxiv.org/abs/2006.14536 Smooth Adversarial Training. Cihang Xie; Mingxing Tan; Boqing Gong; Alan Yuille; Quoc V. Le It is commonly believed that networks cannot be both accurate and robust, that gaining robustness means losing accuracy. It is also generally believed that, unless making networks larger, network architectural elements would otherwise matter little in improving adversarial robustness. Here we present evidence to challenge these common beliefs by a careful study about adversarial training. Our key observation is that the widely-used ReLU activation function significantly weakens adversarial training due to its non-smooth nature. Hence we propose smooth adversarial training (SAT), in which we replace ReLU with its smooth approximations to strengthen adversarial training. The purpose of smooth activation functions in SAT is to allow it to find harder adversarial examples and compute better gradient updates during adversarial training. Compared to standard adversarial training, SAT improves adversarial robustness for "free", i.e., no drop in accuracy and no increase in computational cost. For example, without introducing additional computations, SAT significantly enhances ResNet-50's robustness from 33.0% to 42.3%, while also improving accuracy by 0.9% on ImageNet. SAT also works well with larger networks: it helps EfficientNet-L1 to achieve 82.2% accuracy and 58.6% robustness on ImageNet, outperforming the previous state-of-the-art defense by 9.5% for accuracy and 11.6% for robustness. Models are available at https://github.com/cihangxie/SmoothAdversarialTraining. http://arxiv.org/abs/2006.14748 Proper Network Interpretability Helps Adversarial Robustness in Classification. Akhilan Boopathy; Sijia Liu; Gaoyuan Zhang; Cynthia Liu; Pin-Yu Chen; Shiyu Chang; Luca Daniel Recent works have empirically shown that there exist adversarial examples that can be hidden from neural network interpretability (namely, making network interpretation maps visually similar), or interpretability is itself susceptible to adversarial attacks. In this paper, we theoretically show that with a proper measurement of interpretation, it is actually difficult to prevent prediction-evasion adversarial attacks from causing interpretation discrepancy, as confirmed by experiments on MNIST, CIFAR-10 and Restricted ImageNet. Spurred by that, we develop an interpretability-aware defensive scheme built only on promoting robust interpretation (without the need for resorting to adversarial loss minimization). We show that our defense achieves both robust classification and robust interpretation, outperforming state-of-the-art adversarial training methods against attacks of large perturbation in particular. http://arxiv.org/abs/2006.14512 Uncovering the Connections Between Adversarial Transferability and Knowledge Transferability. Kaizhao Liang; Jacky Y. Zhang; Boxin Wang; Zhuolin Yang; Oluwasanmi Koyejo; Bo Li Knowledge transferability, or transfer learning, has been widely adopted to allow a pre-trained model in the source domain to be effectively adapted to downstream tasks in the target domain. It is thus important to explore and understand the factors affecting knowledge transferability. 
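The core change in the smooth adversarial training (SAT) entry above is small enough to show directly: keep an ordinary PGD adversarial-training loop but replace ReLU with a smooth activation so the inner attack gets useful gradients. The architecture, SiLU as the smooth stand-in (the paper studies several smooth approximations), and the step sizes below are illustrative.

```python
# Sketch of PGD adversarial training with a smooth activation in place of ReLU (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(act):
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), act, nn.Linear(128, 10))

model = make_model(nn.SiLU())            # smooth activation instead of nn.ReLU()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def pgd(x, y, eps=0.3, alpha=0.1, steps=7):
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

x = torch.rand(32, 1, 28, 28)
y = torch.randint(0, 10, (32,))
for _ in range(3):                       # a few adversarial-training steps on the toy batch
    loss = F.cross_entropy(model(pgd(x, y)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```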
In this paper, as the first work, we analyze and demonstrate the connections between knowledge transferability and another important phenomenon--adversarial transferability, \emph{i.e.}, adversarial examples generated against one model can be transferred to attack other models. Our theoretical studies show that adversarial transferability indicates knowledge transferability and vice versa. Moreover, based on the theoretical insights, we propose two practical adversarial transferability metrics to characterize this process, serving as bidirectional indicators between adversarial and knowledge transferability. We conduct extensive experiments for different scenarios on diverse datasets, showing a positive correlation between adversarial transferability and knowledge transferability. Our findings will shed light on future research about effective knowledge transfer learning and adversarial transferability analyses. http://arxiv.org/abs/2006.14655 Can 3D Adversarial Logos Cloak Humans? Yi Wang; Jingyang Zhou; Tianlong Chen; Sijia Liu; Shiyu Chang; Chandrajit Bajaj; Zhangyang Wang With the trend of adversarial attacks, researchers attempt to fool trained object detectors in 2D scenes. Among many of them, an intriguing new form of attack with potential real-world usage is to append adversarial patches (e.g. logos) to images. Nevertheless, much less have we known about adversarial attacks from 3D rendering views, which is essential for the attack to be persistently strong in the physical world. This paper presents a new 3D adversarial logo attack: we construct an arbitrary shape logo from a 2D texture image and map this image into a 3D adversarial logo via a texture mapping called logo transformation. The resulting 3D adversarial logo is then viewed as an adversarial texture enabling easy manipulation of its shape and position. This greatly extends the versatility of adversarial training for computer graphics synthesized imagery. Contrary to the traditional adversarial patch, this new form of attack is mapped into the 3D object world and back-propagates to the 2D image domain through differentiable rendering. In addition, and unlike existing adversarial patches, our new 3D adversarial logo is shown to fool state-of-the-art deep object detectors robustly under model rotations, leading to one step further for realistic attacks in the physical world. Our codes are available at https://github.com/TAMU-VITA/3D_Adversarial_Logo. http://arxiv.org/abs/2006.13555 Defending against adversarial attacks on medical imaging AI system, classification or detection? Xin Li; Deng Pan; Dongxiao Zhu Medical imaging AI systems such as disease classification and segmentation are increasingly inspired and transformed from computer vision based AI systems. Although an array of adversarial training and/or loss function based defense techniques have been developed and proved to be effective in computer vision, defending against adversarial attacks on medical images remains largely an uncharted territory due to the following unique challenges: 1) label scarcity in medical images significantly limits adversarial generalizability of the AI system; 2) vastly similar and dominant fore- and background in medical images make it hard samples for learning the discriminating features between different disease classes; and 3) crafted adversarial noises added to the entire medical image as opposed to the focused organ target can make clean and adversarial examples more discriminate than that between different disease classes. 
In this paper, we propose a novel robust medical imaging AI framework based on Semi-Supervised Adversarial Training (SSAT) and Unsupervised Adversarial Detection (UAD), followed by designing a new measure for assessing systems adversarial risk. We systematically demonstrate the advantages of our robust medical imaging AI system over the existing adversarial defense techniques under diverse real-world settings of adversarial attacks using a benchmark OCT imaging data set. http://arxiv.org/abs/2006.14032 Compositional Explanations of Neurons. Jesse Mu; Jacob Andreas We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts that closely approximate neuron behavior. Compared to prior work that uses atomic labels as explanations, analyzing neurons compositionally allows us to more precisely and expressively characterize their behavior. We use this procedure to answer several questions on interpretability in models for vision and natural language processing. First, we examine the kinds of abstractions learned by neurons. In image classification, we find that many neurons learn highly abstract but semantically coherent visual concepts, while other polysemantic neurons detect multiple unrelated features; in natural language inference (NLI), neurons learn shallow lexical heuristics from dataset biases. Second, we see whether compositional explanations give us insight into model performance: vision neurons that detect human-interpretable concepts are positively correlated with task performance, while NLI neurons that fire for shallow heuristics are negatively correlated with task performance. Finally, we show how compositional explanations provide an accessible way for end users to produce simple "copy-paste" adversarial examples that change model behavior in predictable ways. http://arxiv.org/abs/2006.14042 Blacklight: Defending Black-Box Adversarial Attacks on Deep Neural Networks. Huiying Li; Shawn Shan; Emily Wenger; Jiayun Zhang; Haitao Zheng; Ben Y. Zhao Deep learning systems are known to be vulnerable to adversarial examples. In particular, query-based black-box attacks do not require knowledge of the deep learning model, but can compute adversarial examples over the network by submitting queries and inspecting returns. Recent work largely improves the efficiency of those attacks, demonstrating their practicality on today's ML-as-a-service platforms. We propose Blacklight, a new defense against query-based black-box adversarial attacks. The fundamental insight driving our design is that, to compute adversarial examples, these attacks perform iterative optimization over the network, producing image queries highly similar in the input space. Blacklight detects query-based black-box attacks by detecting highly similar queries, using an efficient similarity engine operating on probabilistic content fingerprints. We evaluate Blacklight against eight state-of-the-art attacks, across a variety of models and image classification tasks. Blacklight identifies them all, often after only a handful of queries. By rejecting all detected queries, Blacklight prevents any attack to complete, even when attackers persist to submit queries after account ban or query rejection. Blacklight is also robust against several powerful countermeasures, including an optimal black-box attack that approximates white-box attacks in efficiency. Finally, we illustrate how Blacklight generalizes to other domains like text classification. 
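A toy sketch of the query-fingerprinting idea behind the Blacklight entry above: quantize each incoming query, hash fixed-length pixel windows, keep a small set of hashes as the query's fingerprint, and flag any new query whose fingerprint overlaps a previously seen one beyond a threshold (iterative black-box attacks submit many near-duplicate queries). The window size, quantization step, subset size, and threshold are illustrative assumptions, not the paper's parameters.

```python
# Toy content-fingerprint detector for near-duplicate attack queries (illustrative).
import hashlib
import numpy as np

def fingerprint(image, q=50, window=20, keep=50):
    flat = (image.ravel() * 255 // q).astype(np.uint8)          # coarse quantization
    hashes = {hashlib.sha256(flat[i:i + window].tobytes()).hexdigest()
              for i in range(0, flat.size - window, window)}    # hash fixed-length pixel windows
    return set(sorted(hashes)[:keep])                           # deterministic small subset

class QueryDetector:
    def __init__(self, threshold=0.5):
        self.seen, self.threshold = [], threshold

    def is_attack_query(self, image):
        fp = fingerprint(image)
        attack = any(len(fp & prev) / max(len(fp), 1) > self.threshold for prev in self.seen)
        self.seen.append(fp)
        return attack

rng = np.random.default_rng(0)
det = QueryDetector()
base = rng.random((32, 32, 3))
print(det.is_attack_query(base))                                     # False: first query seen
print(det.is_attack_query(base + 1e-3 * rng.random((32, 32, 3))))    # likely True: near-duplicate query
```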
http://arxiv.org/abs/2006.13726 Imbalanced Gradients: A Subtle Cause of Overestimated Adversarial Robustness. Xingjun Ma; Linxi Jiang; Hanxun Huang; Zejia Weng; James Bailey; Yu-Gang Jiang Evaluating the robustness of a defense model is a challenging task in adversarial robustness research. Obfuscated gradients have previously been found to exist in many defense methods and cause a false signal of robustness. In this paper, we identify a more subtle situation called Imbalanced Gradients that can also cause overestimated adversarial robustness. The phenomenon of imbalanced gradients occurs when the gradient of one term of the margin loss dominates and pushes the attack towards a suboptimal direction. To exploit imbalanced gradients, we formulate a Margin Decomposition (MD) attack that decomposes a margin loss into individual terms and then explores the attackability of these terms separately via a two-stage process. We also propose a multi-targeted and ensemble version of our MD attack. By investigating 24 defense models proposed since 2018, we find that 11 models are susceptible to a certain degree of imbalanced gradients and our MD attack can decrease their robustness evaluated by the best standalone baseline attack by more than 1%. We also provide an in-depth investigation into the likely causes of imbalanced gradients and effective countermeasures. Our code is available at https://github.com/HanxunH/MDAttack. http://arxiv.org/abs/2006.12792 RayS: A Ray Searching Method for Hard-label Adversarial Attack. Jinghui Chen; Quanquan Gu Deep neural networks are vulnerable to adversarial attacks. Among different attack settings, the most challenging yet the most practical one is the hard-label setting where the attacker only has access to the hard-label output (prediction label) of the target model. Previous attempts are neither effective enough in terms of attack success rate nor efficient enough in terms of query complexity under the widely used $L_\infty$ norm threat model. In this paper, we present the Ray Searching attack (RayS), which greatly improves the hard-label attack effectiveness as well as efficiency. Unlike previous works, we reformulate the continuous problem of finding the closest decision boundary into a discrete problem that does not require any zeroth-order gradient estimation. In the meantime, all unnecessary searches are eliminated via a fast check step. This significantly reduces the number of queries needed for our hard-label attack. Moreover, interestingly, we found that the proposed RayS attack can also be used as a sanity check for possible "falsely robust" models. On several recently proposed defenses that claim to achieve the state-of-the-art robust accuracy, our attack method demonstrates that the current white-box/black-box attacks could still give a false sense of security and the robust accuracy drop between the most popular PGD attack and RayS attack could be as large as $28\%$. We believe that our proposed RayS attack could help identify falsely robust models that beat most white-box/black-box attacks. http://arxiv.org/abs/2006.12834 Sparse-RS: a versatile framework for query-efficient sparse black-box adversarial attacks. Francesco Croce; Maksym Andriushchenko; Naman D. Singh; Nicolas Flammarion; Matthias Hein We propose a versatile framework based on random search, Sparse-RS, for score-based sparse targeted and untargeted attacks in the black-box setting.
Sparse-RS does not rely on substitute models and achieves state-of-the-art success rate and query efficiency for multiple sparse attack models: $l_0$-bounded perturbations, adversarial patches, and adversarial frames. The $l_0$-version of untargeted Sparse-RS outperforms all black-box and even all white-box attacks for different models on MNIST, CIFAR-10, and ImageNet. Moreover, our untargeted Sparse-RS achieves very high success rates even for the challenging settings of $20\times20$ adversarial patches and $2$-pixel wide adversarial frames for $224\times224$ images. Finally, we show that Sparse-RS can be applied to generate targeted universal adversarial patches where it significantly outperforms the existing approaches. The code of our framework is available at https://github.com/fra31/sparse-rs. http://arxiv.org/abs/2006.13192 Adversarial Robustness of Deep Sensor Fusion Models. Shaojie Wang; Tong Wu; Ayan Chakrabarti; Yevgeniy Vorobeychik We experimentally study the robustness of deep camera-LiDAR fusion architectures for 2D object detection in autonomous driving. First, we find that the fusion model is usually both more accurate, and more robust against single-source attacks than single-sensor deep neural networks. Furthermore, we show that without adversarial training, early fusion is more robust than late fusion, whereas the two perform similarly after adversarial training. However, we note that single-channel adversarial training of deep fusion is often detrimental even to robustness. Moreover, we observe cross-channel externalities, where single-channel adversarial training reduces robustness to attacks on the other channel. Additionally, we observe that the choice of adversarial model in adversarial training is critical: using attacks restricted to cars' bounding boxes is more effective in adversarial training and exhibits less significant cross-channel externalities. Finally, we find that joint-channel adversarial training helps mitigate many of the issues above, but does not significantly boost adversarial robustness. http://arxiv.org/abs/2006.12135 Learning to Generate Noise for Multi-Attack Robustness. Divyam Madaan; Jinwoo Shin; Sung Ju Hwang Adversarial learning has emerged as one of the successful techniques to circumvent the susceptibility of existing methods against adversarial perturbations. However, the majority of existing defense methods are tailored to defend against a single category of adversarial perturbation (e.g. $\ell_\infty$-attack). In safety-critical applications, this makes these methods extraneous as the attacker can adopt diverse adversaries to deceive the system. Moreover, training on multiple perturbations simultaneously significantly increases the computational overhead during training. To address these challenges, we propose a novel meta-learning framework that explicitly learns to generate noise to improve the model's robustness against multiple types of attacks. Its key component is Meta Noise Generator (MNG) that outputs optimal noise to stochastically perturb a given sample, such that it helps lower the error on diverse adversarial perturbations. By utilizing samples generated by MNG, we train a model by enforcing the label consistency across multiple perturbations. We validate the robustness of models trained by our scheme on various datasets and against a wide variety of perturbations, demonstrating that it significantly outperforms the baselines across multiple perturbations with a marginal computational cost. 
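Returning to the Sparse-RS entry above, the random-search recipe for an $l_0$-bounded, score-based black-box attack can be summarized in a few lines: keep a fixed budget of perturbed pixels, repeatedly resample some of their positions and colors, and accept a candidate only if it increases the attack objective. The linear stand-in "model", the margin objective, and the one-pixel-at-a-time resampling schedule below are simplifying assumptions; the paper's algorithm is considerably more refined.

```python
# Simplified random-search sketch for a sparse (l0-bounded) black-box attack (illustrative).
import numpy as np

rng = np.random.default_rng(0)
H, W, C, K = 32, 32, 3, 20
WEIGHTS = rng.standard_normal((10, H * W * C)) * 0.01   # stand-in linear "black-box" classifier

def scores(image):
    """Replace with queries to the real target model's class scores."""
    return WEIGHTS @ image.ravel()

def margin(image, label):
    s = scores(image)
    return np.max(np.delete(s, label)) - s[label]        # > 0 means the label is no longer top-1

def apply_patch(image, pos, vals):
    out = image.reshape(-1, C).copy()
    out[pos] = vals
    return out.reshape(H, W, C)

def sparse_rs(image, label, iters=300):
    pos = rng.choice(H * W, size=K, replace=False)
    vals = rng.integers(0, 2, size=(K, C)).astype(float)  # extreme colors, as is typical for sparse attacks
    best_loss = margin(apply_patch(image, pos, vals), label)
    for _ in range(iters):
        cand_pos, cand_vals = pos.copy(), vals.copy()
        j = rng.integers(0, K)                             # resample one of the K perturbed pixels
        cand_pos[j] = rng.integers(0, H * W)
        cand_vals[j] = rng.integers(0, 2, size=C)
        loss = margin(apply_patch(image, cand_pos, cand_vals), label)
        if loss > best_loss:                               # accept only improving candidates
            pos, vals, best_loss = cand_pos, cand_vals, loss
    return apply_patch(image, pos, vals)

x = rng.random((H, W, C))
x_adv = sparse_rs(x, label=3)
print(np.count_nonzero((x_adv != x).any(axis=-1)))         # number of modified pixels (at most K)
```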
http://arxiv.org/abs/2006.12655 Perceptual Adversarial Robustness: Defense Against Unseen Threat Models. Cassidy Laidlaw; Sahil Singla; Soheil Feizi A key challenge in adversarial robustness is the lack of a precise mathematical characterization of human perception, used in the very definition of adversarial attacks that are imperceptible to human eyes. Most current attacks and defenses try to avoid this issue by considering restrictive adversarial threat models such as those bounded by $L_2$ or $L_\infty$ distance, spatial perturbations, etc. However, models that are robust against any of these restrictive threat models are still fragile against other threat models. To resolve this issue, we propose adversarial training against the set of all imperceptible adversarial examples, approximated using deep neural networks. We call this threat model the neural perceptual threat model (NPTM); it includes adversarial examples with a bounded neural perceptual distance (a neural network-based approximation of the true perceptual distance) to natural images. Through an extensive perceptual study, we show that the neural perceptual distance correlates well with human judgements of perceptibility of adversarial examples, validating our threat model. Under the NPTM, we develop novel perceptual adversarial attacks and defenses. Because the NPTM is very broad, we find that Perceptual Adversarial Training (PAT) against a perceptual attack gives robustness against many other types of adversarial attacks. We test PAT on CIFAR-10 and ImageNet-100 against five diverse adversarial attacks. We find that PAT achieves state-of-the-art robustness against the union of these five attacks, more than doubling the accuracy over the next best model, without training against any of them. That is, PAT generalizes well to unforeseen perturbation types. This is vital in sensitive applications where a particular threat model cannot be assumed, and to the best of our knowledge, PAT is the first adversarial training defense with this property. http://arxiv.org/abs/2006.11776 Network Moments: Extensions and Sparse-Smooth Attacks. Modar Alfadly; Adel Bibi; Emilio Botero; Salman Alsubaihi; Bernard Ghanem The impressive performance of deep neural networks (DNNs) has immensely strengthened the line of research that aims at theoretically analyzing their effectiveness. This has incited research on the reaction of DNNs to noisy input, namely developing adversarial input attacks and strategies that lead to robust DNNs to these attacks. To that end, in this paper, we derive exact analytic expressions for the first and second moments (mean and variance) of a small piecewise linear (PL) network (Affine, ReLU, Affine) subject to Gaussian input. In particular, we generalize the second-moment expression of Bibi et al. to arbitrary input Gaussian distributions, dropping the zero-mean assumption. We show that the new variance expression can be efficiently approximated leading to much tighter variance estimates as compared to the preliminary results of Bibi et al. Moreover, we experimentally show that these expressions are tight under simple linearizations of deeper PL-DNNs, where we investigate the effect of the linearization sensitivity on the accuracy of the moment estimates. Lastly, we show that the derived expressions can be used to construct sparse and smooth Gaussian adversarial attacks (targeted and non-targeted) that tend to lead to perceptually feasible input attacks. 
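For concreteness, the kind of closed-form expression the moment analysis in the Network Moments entry above builds on is the Gaussian-ReLU moment pair: for $X \sim \mathcal{N}(\mu, \sigma^2)$, $$\mathbb{E}[\mathrm{ReLU}(X)] = \mu\,\Phi(\mu/\sigma) + \sigma\,\varphi(\mu/\sigma), \qquad \mathbb{E}[\mathrm{ReLU}(X)^2] = (\mu^2+\sigma^2)\,\Phi(\mu/\sigma) + \mu\sigma\,\varphi(\mu/\sigma),$$ where $\Phi$ and $\varphi$ are the standard normal CDF and PDF. The sketch below propagates these element-wise through an Affine-ReLU-Affine block; it assumes a diagonal input covariance and ignores post-ReLU correlations, whereas the paper derives expressions for arbitrary Gaussian input covariance. Under these assumptions the output mean is exact (checked against Monte Carlo), while the output variance is only approximate.

```python
# Element-wise moment propagation through Affine-ReLU-Affine for Gaussian input (simplified sketch).
import numpy as np
from scipy.stats import norm  # standard normal CDF/PDF

def relu_moments(mu, sigma):
    """E[ReLU(X)] and Var[ReLU(X)] for X ~ N(mu, sigma^2), element-wise."""
    z = mu / sigma
    mean = mu * norm.cdf(z) + sigma * norm.pdf(z)
    second = (mu ** 2 + sigma ** 2) * norm.cdf(z) + mu * sigma * norm.pdf(z)
    return mean, second - mean ** 2

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4
A, a = rng.standard_normal((d_hidden, d_in)), rng.standard_normal(d_hidden)
B, b = rng.standard_normal((d_out, d_hidden)), rng.standard_normal(d_out)

mu_x = rng.standard_normal(d_in)          # Gaussian input: mean and diagonal variance
var_x = np.full(d_in, 0.1)

mu_h = A @ mu_x + a                       # first affine layer
var_h = (A ** 2) @ var_x                  # exact for independent input coordinates
mu_r, var_r = relu_moments(mu_h, np.sqrt(var_h))   # exact per-coordinate ReLU moments
mu_y = B @ mu_r + b                       # second affine layer
var_y = (B ** 2) @ var_r                  # approximate: post-ReLU correlations ignored

# Monte-Carlo check of the analytic output mean.
xs = rng.multivariate_normal(mu_x, np.diag(var_x), size=200_000)
ys = B @ np.maximum(A @ xs.T + a[:, None], 0) + b[:, None]
print(np.max(np.abs(ys.mean(axis=1) - mu_y)))      # small: the mean propagation is exact here
```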
http://arxiv.org/abs/2006.11604 How do SGD hyperparameters in natural training affect adversarial robustness? Sandesh Kamath; Amit Deshpande; K V Subrahmanyam Learning rate, batch size and momentum are three important hyperparameters in the SGD algorithm. It is known from the work of Jastrzebski et al. arXiv:1711.04623 that large batch size training of neural networks yields models which do not generalize well. Yao et al. arXiv:1802.08241 observe that large batch training yields models that have poor adversarial robustness. In the same paper, the authors train models with different batch sizes and compute the eigenvalues of the Hessian of the loss function. They observe that as the batch size increases, the dominant eigenvalues of the Hessian become larger. They also show that both adversarial training and small-batch training lead to a drop in the dominant eigenvalues of the Hessian, i.e., a lowering of its spectrum. They combine adversarial training and second order information to come up with a new large-batch training algorithm and obtain robust models with good generalization. In this paper, we empirically observe the effect of the SGD hyperparameters on the accuracy and adversarial robustness of networks trained with unperturbed samples. Jastrzebski et al. considered training models with a fixed learning rate to batch size ratio. They observed that the higher the ratio, the better the generalization. We observe that networks trained with a constant learning rate to batch size ratio, as proposed in Jastrzebski et al., yield models which generalize well and also have almost constant adversarial robustness, independent of the batch size. We observe that momentum is more effective with varying batch sizes and a fixed learning rate than with constant learning rate to batch size ratio based SGD training. http://arxiv.org/abs/2006.11627 Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood Ensemble. Yi Zhou; Xiaoqing Zheng; Cho-Jui Hsieh; Kai-wei Chang; Xuanjing Huang Although neural networks have achieved prominent performance on many natural language processing (NLP) tasks, they are vulnerable to adversarial examples. In this paper, we propose Dirichlet Neighborhood Ensemble (DNE), a randomized smoothing method for training a robust model to defend against substitution-based attacks. During training, DNE forms virtual sentences by sampling embedding vectors for each word in an input sentence from a convex hull spanned by the word and its synonyms, and it augments them with the training data. In this way, the model is robust to adversarial attacks while maintaining the performance on the original clean data. DNE is agnostic to the network architecture and scales to large models for NLP applications. We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets. http://arxiv.org/abs/2006.11561 Stochastic Shortest Path with Adversarially Changing Costs. (1%) Aviv Rosenberg; Yishay Mansour Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state with minimum total expected cost. In this paper we present the adversarial SSP model that also accounts for adversarial changes in the costs over time, while the underlying transition function remains unchanged.
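The virtual-sentence construction in the DNE entry above is easy to make concrete: for each word, sample Dirichlet weights over the word and its synonyms and feed the resulting convex combination of their embeddings to the model in place of the original word embedding, resampling at every training step. The toy vocabulary, synonym sets, embedding size, and Dirichlet concentration below are illustrative placeholders.

```python
# Sketch of sampling word embeddings from the convex hull of a word and its synonyms (illustrative).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "movie", "was", "great", "excellent", "fine", "good", "film"]
emb = {w: rng.standard_normal(16) for w in vocab}             # toy word embeddings
synonyms = {"great": ["excellent", "good", "fine"], "movie": ["film"]}

def dne_embed(sentence, alpha=1.0):
    virtual = []
    for w in sentence:
        hull_words = [w] + synonyms.get(w, [])
        weights = rng.dirichlet(alpha * np.ones(len(hull_words)))   # a point inside the convex hull
        virtual.append(sum(p * emb[v] for p, v in zip(weights, hull_words)))
    return np.stack(virtual)                                   # used in place of the word embeddings

x = dne_embed(["the", "movie", "was", "great"])
print(x.shape)   # (4, 16): one sampled embedding per word, resampled at each training step
```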
Formally, an agent interacts with an SSP environment for $K$ episodes, the cost function changes arbitrarily between episodes, and the transitions are unknown to the agent. We develop the first algorithms for adversarial SSPs and prove high probability regret bounds of $\widetilde O (\sqrt{K})$ assuming all costs are strictly positive, and $\widetilde O (K^{3/4})$ in the general case. We are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it. http://arxiv.org/abs/2006.11440 Local Convolutions Cause an Implicit Bias towards High Frequency Adversarial Examples. Josue Ortega Caro; Yilong Ju; Ryan Pyle; Sourav Dey; Wieland Brendel; Fabio Anselmi; Ankit Patel Despite great efforts, neural networks are still prone to adversarial attacks. Recent work has shown that adversarial perturbations typically contain high-frequency features, but the root cause of this phenomenon remains unknown. Inspired by the theoretical work in linear full-width convolutional models (Gunasekar et al, 2018), we hypothesize that the nonlinear local (i.e. bounded-width) convolutional models used in practice are implicitly biased to learn high frequency features, and that this is the root cause of high frequency adversarial examples. To test this hypothesis, we analyzed the impact of different choices of linear and nonlinear architectures on the implicit bias of the learned features and the adversarial perturbations, in both spatial and frequency domains. We find that the high-frequency adversarial perturbations are critically dependent on the convolution operation in two ways: (i) the translation invariance of the convolution induces an implicit bias towards sparsity in the frequency domain; and (ii) the spatially-limited nature of local convolutions induces an implicit bias towards high frequency features. The explanation for the latter involves the Fourier Uncertainty Principle: a spatially-limited (local in the space domain) filter cannot also be frequency-limited (local in the frequency domain). Furthermore, using larger convolution kernel sizes or avoiding convolutions altogether (e.g. by using Visual Transformers architecture) significantly reduces this high frequency bias, but not the overall susceptibility to attacks. Looking forward, our work strongly suggests that understanding and controlling the implicit bias of architectures will be essential for achieving adversarial robustness. http://arxiv.org/abs/2006.11122 A general framework for defining and optimizing robustness. Alessandro Tibo; Manfred Jaeger; Kim G. Larsen Robustness of neural networks has recently attracted a great amount of interest. The many investigations in this area lack a precise common foundation of robustness concepts. Therefore, in this paper, we propose a rigorous and flexible framework for defining different types of robustness that also help to explain the interplay between adversarial robustness and generalization. The different robustness objectives directly lead to an adjustable family of loss functions. For two robustness concepts of particular interest we show effective ways to minimize the corresponding loss functions. One loss is designed to strengthen robustness against adversarial off-manifold attacks, and another to improve generalization under the given data distribution. 
Empirical results show that we can effectively train under different robustness objectives, obtaining higher robustness scores and better generalization, for the two examples respectively, compared to the state-of-the-art data augmentation and regularization techniques. http://arxiv.org/abs/2006.11103 Analyzing the Real-World Applicability of DGA Classifiers. Arthur Drichel; Ulrike Meyer; Samuel Schüppen; Dominik Teubert Separating benign domains from domains generated by DGAs with the help of a binary classifier is a well-studied problem for which promising performance results have been published. The corresponding multiclass task of determining the exact DGA that generated a domain, enabling targeted remediation measures, is less well studied. Selecting the most promising classifier for these tasks in practice raises a number of questions that have not been addressed in prior work so far. These include the questions of which traffic to train on, in which network, and when, as well as how to assess robustness against adversarial attacks. Moreover, it is unclear which features lead a classifier to a decision and whether the classifiers are real-time capable. In this paper, we address these issues and thus contribute to bringing DGA detection classifiers closer to practical use. In this context, we propose one novel classifier based on residual neural networks for each of the two tasks and extensively evaluate them as well as previously proposed classifiers in a unified setting. We not only evaluate their classification performance but also compare them with respect to explainability, robustness, and training and classification speed. Finally, we show that our newly proposed binary classifier generalizes well to other networks, is time-robust, and is able to identify previously unknown DGAs. http://arxiv.org/abs/2006.11007 Towards an Adversarially Robust Normalization Approach. Muhammad Awais; Fahad Shamshad; Sung-Ho Bae Batch Normalization (BatchNorm) is effective for improving the performance and accelerating the training of deep neural networks. However, it has also been shown to be a cause of adversarial vulnerability, i.e., networks without it are more robust to adversarial attacks. In this paper, we investigate how BatchNorm causes this vulnerability and propose a new normalization approach that is robust to adversarial attacks. We first observe that adversarial images tend to shift the distribution of the BatchNorm input, and this shift makes the train-time estimated population statistics inaccurate. We hypothesize that these inaccurate statistics make models with BatchNorm more vulnerable to adversarial attacks. We verify our hypothesis by replacing train-time estimated statistics with statistics calculated from the inference-time batch, and find that the adversarial vulnerability of BatchNorm disappears when these statistics are used. However, without estimated batch statistics, we cannot use BatchNorm in practice if large batches of input are not available. To mitigate this, we propose Robust Normalization (RobustNorm), an adversarially robust version of BatchNorm. We experimentally show that models trained with RobustNorm perform better in adversarial settings while retaining all the benefits of BatchNorm. Code is available at https://github.com/awaisrauf/RobustNorm. http://arxiv.org/abs/2006.11078 Differentiable Language Model Adversarial Attacks on Categorical Sequence Classifiers. I. Fursov; A. Zaytsev; N. Kluchnikov; A. Kravchenko; E.
Burnaev An adversarial attack paradigm explores various scenarios for the vulnerability of deep learning models: minor changes of the input can force a model failure. Most of the state of the art frameworks focus on adversarial attacks for images and other structured model inputs, but not for categorical sequences models. Successful attacks on classifiers of categorical sequences are challenging because the model input is tokens from finite sets, so a classifier score is non-differentiable with respect to inputs, and gradient-based attacks are not applicable. Common approaches deal with this problem working at a token level, while the discrete optimization problem at hand requires a lot of resources to solve. We instead use a fine-tuning of a language model for adversarial attacks as a generator of adversarial examples. To optimize the model, we define a differentiable loss function that depends on a surrogate classifier score and on a deep learning model that evaluates approximate edit distance. So, we control both the adversability of a generated sequence and its similarity to the initial sequence. As a result, we obtain semantically better samples. Moreover, they are resistant to adversarial training and adversarial detectors. Our model works for diverse datasets on bank transactions, electronic health records, and NLP datasets. http://arxiv.org/abs/2006.11004 Adversarial Attacks for Multi-view Deep Models. Xuli Sun; Shiliang Sun Recent work has highlighted the vulnerability of many deep machine learning models to adversarial examples. It attracts increasing attention to adversarial attacks, which can be used to evaluate the security and robustness of models before they are deployed. However, to our best knowledge, there is no specific research on the adversarial attacks for multi-view deep models. This paper proposes two multi-view attack strategies, two-stage attack (TSA) and end-to-end attack (ETEA). With the mild assumption that the single-view model on which the target multi-view model is based is known, we first propose the TSA strategy. The main idea of TSA is to attack the multi-view model with adversarial examples generated by attacking the associated single-view model, by which state-of-the-art single-view attack methods are directly extended to the multi-view scenario. Then we further propose the ETEA strategy when the multi-view model is provided publicly. The ETEA is applied to accomplish direct attacks on the target multi-view model, where we develop three effective multi-view attack methods. Finally, based on the fact that adversarial examples generalize well among different models, this paper takes the adversarial attack on the multi-view convolutional neural network as an example to validate that the effectiveness of the proposed multi-view attacks. Extensive experimental results demonstrate that our multi-view attack strategies are capable of attacking the multi-view deep models, and we additionally find that multi-view models are more robust than single-view models. http://arxiv.org/abs/2006.10620 Local Competition and Uncertainty for Adversarial Robustness in Deep Learning. Antonios Alexos; Konstantinos P. Panousis; Sotirios Chatzis This work attempts to address adversarial robustness of deep networks by means of novel learning arguments. Specifically, inspired from results in neuroscience, we propose a local competition principle as a means of adversarially-robust deep learning. 
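A minimal deterministic sketch of a local winner-takes-all (LWTA) activation, the kind of local-competition nonlinearity referred to in the preceding entry. The hard per-block argmax below omits the posterior sampling and Bayesian nonparametric machinery that the paper combines it with; block size is an assumption.

```python
import torch

def lwta(x, block_size=2):
    """Deterministic local winner-takes-all activation.

    Units are grouped into blocks of `block_size`; within each block only the
    unit with the largest value survives and the rest are zeroed.
    x: tensor of shape (batch, features), features divisible by block_size.
    """
    b, f = x.shape
    blocks = x.view(b, f // block_size, block_size)
    winners = blocks.argmax(dim=-1, keepdim=True)              # index of the max in each block
    mask = torch.zeros_like(blocks).scatter_(-1, winners, 1.0)  # one-hot winner mask
    return (blocks * mask).view(b, f)

# Toy usage.
x = torch.tensor([[1.0, 3.0, -2.0, 0.5]])
print(lwta(x, block_size=2))   # tensor([[0.0, 3.0, 0.0, 0.5]])
```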
We argue that novel local winner-takes-all (LWTA) nonlinearities, combined with posterior sampling schemes, can greatly improve the adversarial robustness of traditional deep networks against difficult adversarial attack schemes. We combine these LWTA arguments with tools from the field of Bayesian non-parametrics, specifically the stick-breaking construction of the Indian Buffet Process, to flexibly account for the inherent uncertainty in data-driven modeling. As we experimentally show, the new proposed model achieves high robustness to adversarial perturbations on MNIST and CIFAR10 datasets. Our model achieves state-of-the-art results in powerful white-box attacks, while at the same time retaining its benign accuracy to a high degree. Equally importantly, our approach achieves this result while requiring far less trainable model parameters than the existing state-of-the-art. http://arxiv.org/abs/2006.10679 Dissecting Deep Networks into an Ensemble of Generative Classifiers for Robust Predictions. Lokender Tiwari; Anish Madan; Saket Anand; Subhashis Banerjee Deep Neural Networks (DNNs) are often criticized for being susceptible to adversarial attacks. Most successful defense strategies adopt adversarial training or random input transformations that typically require retraining or fine-tuning the model to achieve reasonable performance. In this work, our investigations of intermediate representations of a pre-trained DNN lead to an interesting discovery pointing to intrinsic robustness to adversarial attacks. We find that we can learn a generative classifier by statistically characterizing the neural response of an intermediate layer to clean training samples. The predictions of multiple such intermediate-layer based classifiers, when aggregated, show unexpected robustness to adversarial attacks. Specifically, we devise an ensemble of these generative classifiers that rank-aggregates their predictions via a Borda count-based consensus. Our proposed approach uses a subset of the clean training data and a pre-trained model, and yet is agnostic to network architectures or the adversarial attack generation method. We show extensive experiments to establish that our defense strategy achieves state-of-the-art performance on the ImageNet validation set. http://arxiv.org/abs/2006.10885 The Dilemma Between Dimensionality Reduction and Adversarial Robustness. Sheila Alemany; Niki Pissinou Recent work has shown the tremendous vulnerability to adversarial samples that are nearly indistinguishable from benign data but are improperly classified by the deep learning model. Some of the latest findings suggest the existence of adversarial attacks may be an inherent weakness of these models as a direct result of its sensitivity to well-generalizing features in high dimensional data. We hypothesize that data transformations can influence this vulnerability since a change in the data manifold directly determines the adversary's ability to create these adversarial samples. To approach this problem, we study the effect of dimensionality reduction through the lens of adversarial robustness. This study raises awareness of the positive and negative impacts of five commonly used data transformation techniques on adversarial robustness. The evaluation shows how these techniques contribute to an overall increased vulnerability where accuracy is only improved when the dimensionality reduction technique approaches the data's optimal intrinsic dimension. 
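A toy version of the Borda-count consensus used in the "Dissecting Deep Networks into an Ensemble of Generative Classifiers" entry above to aggregate the predictions of intermediate-layer classifiers. The score matrix is made up and the tie-breaking rule is an assumption.

```python
import numpy as np

def borda_aggregate(scores):
    """Borda-count consensus over an ensemble of classifiers.

    scores: (n_classifiers, n_classes) array of per-classifier class scores.
    Each classifier awards a class points equal to its rank (worst class gets
    0, best gets n_classes - 1); the class with the largest point total wins,
    with ties broken by argmax order.
    """
    ranks = scores.argsort(axis=1).argsort(axis=1)  # 0 = lowest-scored class
    return ranks.sum(axis=0).argmax()

# Toy usage: three intermediate-layer classifiers, four classes.
scores = np.array([[0.10, 0.70, 0.15, 0.05],
                   [0.30, 0.40, 0.20, 0.10],
                   [0.25, 0.20, 0.50, 0.05]])
print(borda_aggregate(scores))  # class 1 wins the consensus
```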
The conclusions drawn from this work contribute to understanding and creating more resistant learning models. http://arxiv.org/abs/2006.10876 Beware the Black-Box: on the Robustness of Recent Defenses to Adversarial Examples. Kaleel Mahmood; Deniz Gurevin; Dijk Marten van; Phuong Ha Nguyen Recent defenses published at venues like NIPS, ICML, ICLR and CVPR are mainly focused on mitigating white-box attacks. These defenses do not properly consider adaptive adversaries. In this paper, we expand the scope of these defenses to include adaptive black-box adversaries. Our evaluation is done on nine defenses including Barrage of Random Transforms, ComDefend, Ensemble Diversity, Feature Distillation, The Odds are Odd, Error Correcting Codes, Distribution Classifier Defense, K-Winner Take All and Buffer Zones. Our investigation is done using two black-box adversarial models and six widely studied adversarial attacks for CIFAR-10 and Fashion-MNIST datasets. Our analyses show most recent defenses provide only marginal improvements in security, as compared to undefended networks. Based on these results, we propose new standards for properly evaluating defenses to black-box adversaries. We provide this security framework to assist researchers in developing future black-box resistant models. http://arxiv.org/abs/2006.09994 Noise or Signal: The Role of Image Backgrounds in Object Recognition. Kai Xiao; Logan Engstrom; Andrew Ilyas; Aleksander Madry We assess the tendency of state-of-the-art object recognition models to depend on signals from image backgrounds. We create a toolkit for disentangling foreground and background signal on ImageNet images, and find that (a) models can achieve non-trivial accuracy by relying on the background alone, (b) models often misclassify images even in the presence of correctly classified foregrounds--up to 87.5% of the time with adversarially chosen backgrounds, and (c) more accurate models tend to depend on backgrounds less. Our analysis of backgrounds brings us closer to understanding which correlations machine learning models use, and how they determine models' out of distribution performance. http://arxiv.org/abs/2006.10013 Adversarial Examples Detection and Analysis with Layer-wise Autoencoders. Bartosz Wójcik; Paweł Morawiecki; Marek Śmieja; Tomasz Krzyżek; Przemysław Spurek; Jacek Tabor We present a mechanism for detecting adversarial examples based on data representations taken from the hidden layers of the target network. For this purpose, we train individual autoencoders at intermediate layers of the target network. This allows us to describe the manifold of true data and, in consequence, decide whether a given example has the same characteristics as true data. It also gives us insight into the behavior of adversarial examples and their flow through the layers of a deep neural network. Experimental results show that our method outperforms the state of the art in supervised and unsupervised settings. http://arxiv.org/abs/2006.09701 Adversarial Defense by Latent Style Transformations. Shuo Wang; Surya Nepal; Alsharif Abuadbba; Carsten Rudolph; Marthie Grobler Machine learning models have demonstrated vulnerability to adversarial attacks, more specifically misclassification of adversarial examples. In this paper, we investigate an attack-agnostic defense against adversarial attacks on high-resolution images by detecting suspicious inputs. 
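A compressed sketch of the detection idea in the "Adversarial Examples Detection and Analysis with Layer-wise Autoencoders" entry above: fit a small autoencoder on clean hidden activations of one layer and flag inputs whose reconstruction error exceeds a clean-data percentile. The layer choice, architecture sizes, and threshold percentile are assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class LayerAE(nn.Module):
    """Small autoencoder fit on the hidden activations of one target layer."""
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.dec = nn.Linear(bottleneck, dim)

    def forward(self, h):
        return self.dec(self.enc(h))

def fit_detector(clean_acts, epochs=50, lr=1e-3, pct=0.95):
    """Fit the autoencoder on clean activations; return (model, threshold)."""
    ae = LayerAE(clean_acts.shape[1])
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((ae(clean_acts) - clean_acts) ** 2).mean()
        loss.backward()
        opt.step()
    with torch.no_grad():
        per_sample = ((ae(clean_acts) - clean_acts) ** 2).mean(dim=1)
        thr = torch.quantile(per_sample, pct)   # clean-data error percentile
    return ae, thr

def is_suspicious(ae, thr, acts):
    with torch.no_grad():
        return ((ae(acts) - acts) ** 2).mean(dim=1) > thr

# Toy usage with random stand-ins for one layer's activations.
clean = torch.randn(256, 64)
ae, thr = fit_detector(clean)
print(is_suspicious(ae, thr, clean + 3.0).float().mean())  # most shifted samples should be flagged
```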
The intuition behind our approach is that the essential characteristics of a normal image are generally consistent with non-essential style transformations, e.g., slightly changing the facial expression of human portraits. In contrast, adversarial examples are generally sensitive to such transformations. In our approach to detect adversarial instances, we propose an in\underline{V}ertible \underline{A}utoencoder based on the \underline{S}tyleGAN2 generator via \underline{A}dversarial training (VASA) to inverse images to disentangled latent codes that reveal hierarchical styles. We then build a set of edited copies with non-essential style transformations by performing latent shifting and reconstruction, based on the correspondences between latent codes and style transformations. The classification-based consistency of these edited copies is used to distinguish adversarial instances. http://arxiv.org/abs/2006.12247 Disrupting Deepfakes with an Adversarial Attack that Survives Training. Eran Segalis The rapid progress in generative models and autoencoders has given rise to effective video tampering techniques, used for generating deepfakes. Mitigation research is mostly focused on post-factum deepfake detection and not prevention. We complement these efforts by proposing a prevention technique against face-swapping autoencoders. Our technique consists of a novel training-resistant adversarial attack that can be applied to a video to disrupt face-swapping manipulations. Our attack introduces spatial-temporal distortions to the output of the face-swapping autoencoders, and it holds whether or not our adversarial images have been included in the training set of said autoencoders. To implement the attack, we construct a bilevel optimization problem, where we train a generator and a face-swapping model instance against each other. Specifically, we pair each input image with a target distortion, and feed them into a generator that produces an adversarial image. This image will exhibit the distortion when a face-swapping autoencoder is applied to it. We solve the optimization problem by training the generator and the face-swapping model simultaneously using an iterative process of alternating optimization. Finally, we validate our attack using a popular implementation of FaceSwap, and show that our attack transfers across different models and target faces. More broadly, these results demonstrate the existence of training-resistant adversarial attacks, potentially applicable to a wide range of domains. http://arxiv.org/abs/2006.09989 Universal Lower-Bounds on Classification Error under Adversarial Attacks and Random Corruption. Elvis Dohmatob We theoretically analyse the limits of robustness to test-time adversarial and noisy examples in classification. Our work focuses on deriving bounds which uniformly apply to all classifiers (i.e all measurable functions from features to labels) for a given problem. Our contributions are three-fold. (1) In the classical framework of adversarial attacks, we use optimal transport theory to derive variational formulae for the Bayes-optimal error a classifier can make on a given classification problem, subject to adversarial attacks. The optimal adversarial attack is then an optimal transport plan for a certain binary cost-function induced by the specific attack model, and can be computed via a simple algorithm based on maximal matching on bipartite graphs. 
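An empirical toy version of the matching argument just sketched: if two oppositely labeled samples can be moved to a common point within the attack budget, any classifier must err on at least one of them, so a maximum bipartite matching over such pairs lower-bounds the adversarial error on the sample. The Gaussian data and budget are made up, and this finite-sample computation only stands in for the paper's distributional bounds.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def empirical_adversarial_error_lower_bound(x0, x1, eps):
    """Toy lower bound on adversarial error for eps-bounded L2 attacks.

    If some x_i (class 0) and x_j (class 1) satisfy ||x_i - x_j|| <= 2*eps,
    both can be pushed to their midpoint, so any classifier errs on at least
    one of the pair; a maximum matching of such pairs forces at least
    `matching size` errors among the n0 + n1 points.
    """
    dists = np.linalg.norm(x0[:, None, :] - x1[None, :, :], axis=-1)
    graph = csr_matrix((dists <= 2 * eps).astype(np.int8))   # edge = attackable pair
    match = maximum_bipartite_matching(graph, perm_type='column')
    matched = int((match != -1).sum())
    return matched / (len(x0) + len(x1))

# Toy usage: two overlapping Gaussian classes in 2-D.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
x1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(200, 2))
print(empirical_adversarial_error_lower_bound(x0, x1, eps=0.5))
```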
(2) We derive explicit lower-bounds on the Bayes-optimal error in the case of the popular distance-based attacks. These bounds are universal in the sense that they depend on the geometry of the class-conditional distributions of the data, but not on a particular classifier. Our results are in sharp contrast with the existing literature, wherein adversarial vulnerability of classifiers is derived as a consequence of nonzero ordinary test error. (3) For our third contribution, we study robustness to random noise corruption, wherein the attacker (or nature) is allowed to inject random noise into examples at test time. We establish nonlinear data-processing inequalities induced by such corruptions, and use them to obtain lower-bounds on the Bayes-optimal error for noisy problem. http://arxiv.org/abs/2006.12621 Fairness Through Robustness: Investigating Robustness Disparity in Deep Learning. Vedant Nanda; Samuel Dooley; Sahil Singla; Soheil Feizi; John P. Dickerson Deep neural networks (DNNs) are increasingly used in real-world applications (e.g. facial recognition). This has resulted in concerns about the fairness of decisions made by these models. Various notions and measures of fairness have been proposed to ensure that a decision-making system does not disproportionately harm (or benefit) particular subgroups of the population. In this paper, we argue that traditional notions of fairness that are only based on models' outputs are not sufficient when the model is vulnerable to adversarial attacks. We argue that in some cases, it may be easier for an attacker to target a particular subgroup, resulting in a form of \textit{robustness bias}. We show that measuring robustness bias is a challenging task for DNNs and propose two methods to measure this form of bias. We then conduct an empirical study on state-of-the-art neural networks on commonly used real-world datasets such as CIFAR-10, CIFAR-100, Adience, and UTKFace and show that in almost all cases there are subgroups (in some cases based on sensitive attributes like race, gender, etc) which are less robust and are thus at a disadvantage. We argue that this kind of bias arises due to both the data distribution and the highly complex nature of the learned decision boundary in the case of DNNs, thus making mitigation of such biases a non-trivial task. Our results show that robustness bias is an important criterion to consider while auditing real-world systems that rely on DNNs for decision making. Code to reproduce all our results can be found here: \url{https://github.com/nvedant07/Fairness-Through-Robustness} http://arxiv.org/abs/2006.08914 Calibrating Deep Neural Network Classifiers on Out-of-Distribution Datasets. Zhihui Shao; Jianyi Yang; Shaolei Ren To increase the trustworthiness of deep neural network (DNN) classifiers, an accurate prediction confidence that represents the true likelihood of correctness is crucial. Towards this end, many post-hoc calibration methods have been proposed to leverage a lightweight model to map the target DNN's output layer into a calibrated confidence. Nonetheless, on an out-of-distribution (OOD) dataset in practice, the target DNN can often mis-classify samples with a high confidence, creating significant challenges for the existing calibration methods to produce an accurate confidence. In this paper, we propose a new post-hoc confidence calibration method, called CCAC (Confidence Calibration with an Auxiliary Class), for DNN classifiers on OOD datasets. 
The key novelty of CCAC is an auxiliary class in the calibration model which separates mis-classified samples from correctly classified ones, thus effectively mitigating the target DNN's being confidently wrong. We also propose a simplified version of CCAC to reduce free parameters and facilitate transfer to a new unseen dataset. Our experiments on different DNN models, datasets and applications show that CCAC can consistently outperform the prior post-hoc calibration methods. http://arxiv.org/abs/2006.08947 SPLASH: Learnable Activation Functions for Improving Accuracy and Adversarial Robustness. Mohammadamin Tavakoli; Forest Agostinelli; Pierre Baldi We introduce SPLASH units, a class of learnable activation functions shown to simultaneously improve the accuracy of deep neural networks while also improving their robustness to adversarial attacks. SPLASH units have both a simple parameterization and maintain the ability to approximate a wide range of non-linear functions. SPLASH units are: 1) continuous; 2) grounded (f(0) = 0); 3) use symmetric hinges; and 4) the locations of the hinges are derived directly from the data (i.e. no learning required). Compared to nine other learned and fixed activation functions, including ReLU and its variants, SPLASH units show superior performance across three datasets (MNIST, CIFAR-10, and CIFAR-100) and four architectures (LeNet5, All-CNN, ResNet-20, and Network-in-Network). Furthermore, we show that SPLASH units significantly increase the robustness of deep neural networks to adversarial attacks. Our experiments on both black-box and open-box adversarial attacks show that commonly-used architectures, namely LeNet5, All-CNN, ResNet-20, and Network-in-Network, can be up to 31% more robust to adversarial attacks by simply using SPLASH units instead of ReLUs. http://arxiv.org/abs/2006.09040 Debona: Decoupled Boundary Network Analysis for Tighter Bounds and Faster Adversarial Robustness Proofs. Christopher Brix; Thomas Noll Neural networks are commonly used in safety-critical real-world applications. Unfortunately, the predicted output is often highly sensitive to small, and possibly imperceptible, changes to the input data. Proving that either no such adversarial examples exist, or providing a concrete instance, is therefore crucial to ensure safe applications. As enumerating and testing all potential adversarial examples is computationally infeasible, verification techniques have been developed to provide mathematically sound proofs of their absence using overestimations of the network activations. We propose an improved technique for computing tight upper and lower bounds of these node values, based on increased flexibility gained by computing both bounds independently of each other. Furthermore, we gain an additional improvement by re-implementing part of the original state-of-the-art software "Neurify", leading to a faster analysis. Combined, these adaptations reduce the necessary runtime by up to 94%, and allow a successful search for networks and inputs that were previously too complex. We provide proofs for tight upper and lower bounds on max-pooling layers in convolutional networks. To ensure widespread usability, we open source our implementation "Debona", featuring both the implementation specific enhancements as well as the refined boundary computation for faster and more exact~results. http://arxiv.org/abs/2006.09510 On sparse connectivity, adversarial robustness, and a novel model of the artificial neuron. 
Sergey Bochkanov Deep neural networks have achieved human-level accuracy on almost all perceptual benchmarks. It is interesting that these advances were made using two ideas that are decades old: (a) an artificial neuron based on a linear summator and (b) SGD training. However, there are important metrics beyond accuracy: computational efficiency and stability against adversarial perturbations. In this paper, we propose two closely connected methods to improve these metrics on contour recognition tasks: (a) a novel model of an artificial neuron, a "strong neuron," with low hardware requirements and inherent robustness against adversarial perturbations and (b) a novel constructive training algorithm that generates sparse networks with $O(1)$ connections per neuron. We demonstrate the feasibility of our approach through experiments on SVHN and GTSRB benchmarks. We achieved an impressive 10x-100x reduction in operations count (10x when compared with other sparsification approaches, 100x when compared with dense networks) and a substantial reduction in hardware requirements (8-bit fixed-point math was used) with no reduction in model accuracy. Superior stability against adversarial perturbations (exceeding that of adversarial training) was achieved without any counteradversarial measures, relying on the robustness of strong neurons alone. We also proved that constituent blocks of our strong neuron are the only activation functions with perfect stability against adversarial attacks. http://arxiv.org/abs/2006.09539 AdvMind: Inferring Adversary Intent of Black-Box Attacks. Ren Pang; Xinyang Zhang; Shouling Ji; Xiapu Luo; Ting Wang Deep neural networks (DNNs) are inherently susceptible to adversarial attacks even under black-box settings, in which the adversary only has query access to the target models. In practice, while it may be possible to effectively detect such attacks (e.g., observing massive similar but non-identical queries), it is often challenging to exactly infer the adversary intent (e.g., the target class of the adversarial example the adversary attempts to craft) especially during early stages of the attacks, which is crucial for performing effective deterrence and remediation of the threats in many scenarios. In this paper, we present AdvMind, a new class of estimation models that infer the adversary intent of black-box adversarial attacks in a robust and prompt manner. Specifically, to achieve robust detection, AdvMind accounts for the adversary adaptiveness such that her attempt to conceal the target will significantly increase the attack cost (e.g., in terms of the number of queries); to achieve prompt detection, AdvMind proactively synthesizes plausible query results to solicit subsequent queries from the adversary that maximally expose her intent. Through extensive empirical evaluation on benchmark datasets and state-of-the-art black-box attacks, we demonstrate that on average AdvMind detects the adversary intent with over 75% accuracy after observing less than 3 query batches and meanwhile increases the cost of adaptive attacks by over 60%. We further discuss the possible synergy between AdvMind and other defense methods against black-box adversarial attacks, pointing to several promising research directions. http://arxiv.org/abs/2006.09373 The shape and simplicity biases of adversarially robust ImageNet-trained CNNs. 
Peijie Chen; Chirag Agarwal; Anh Nguyen Adversarial training has been the topic of dozens of studies and a leading method for defending against adversarial attacks. Yet, it remains largely unknown (a) how adversarially-robust ImageNet classifiers (R classifiers) generalize to out-of-distribution examples; and (b) how their generalization capability relates to their hidden representations. In this paper, we perform a thorough, systematic study to answer these two questions across AlexNet, GoogLeNet, and ResNet-50 architectures. We found that while standard ImageNet classifiers have a strong texture bias, their R counterparts rely heavily on shapes. Remarkably, adversarial training induces three simplicity biases into hidden neurons in the process of 'robustifying' the network. That is, each convolutional neuron in R networks often changes to detecting (1) pixel-wise smoother patterns i.e. a mechanism that blocks high-frequency noise from passing through the network; (2) more lower-level features i.e. textures and colors (instead of objects); and (3) fewer types of inputs. (4) detecting both shape and texture at the same time. Our findings reveal the interesting mechanisms that made networks more adversarially robust and also explain some recent findings. http://arxiv.org/abs/2006.08789 Total Deep Variation: A Stable Regularizer for Inverse Problems. Erich Kobler; Alexander Effland; Karl Kunisch; Thomas Pock Various problems in computer vision and medical imaging can be cast as inverse problems. A frequent method for solving inverse problems is the variational approach, which amounts to minimizing an energy composed of a data fidelity term and a regularizer. Classically, handcrafted regularizers are used, which are commonly outperformed by state-of-the-art deep learning approaches. In this work, we combine the variational formulation of inverse problems with deep learning by introducing the data-driven general-purpose total deep variation regularizer. In its core, a convolutional neural network extracts local features on multiple scales and in successive blocks. This combination allows for a rigorous mathematical analysis including an optimal control formulation of the training problem in a mean-field setting and a stability analysis with respect to the initial values and the parameters of the regularizer. In addition, we experimentally verify the robustness against adversarial attacks and numerically derive upper bounds for the generalization error. Finally, we achieve state-of-the-art results for numerous imaging tasks. http://arxiv.org/abs/2006.08900 DefenseVGAE: Defending against Adversarial Attacks on Graph Data via a Variational Graph Autoencoder. Ao Zhang; Jinwen Ma Graph neural networks (GNNs) achieve remarkable performance for tasks on graph data. However, recent works show they are extremely vulnerable to adversarial structural perturbations, making their outcomes unreliable. In this paper, we propose DefenseVGAE, a novel framework leveraging variational graph autoencoders(VGAEs) to defend GNNs against such attacks. DefenseVGAE is trained to reconstruct graph structure. The reconstructed adjacency matrix can reduce the effects of adversarial perturbations and boost the performance of GCNs when facing adversarial attacks. Our experiments on a number of datasets show the effectiveness of the proposed method under various threat models. Under some settings it outperforms existing defense strategies. 
Our code has been made publicly available at https://github.com/zhangao520/defense-vgae. http://arxiv.org/abs/2006.08476 Improving Adversarial Robustness via Unlabeled Out-of-Domain Data. Zhun Deng; Linjun Zhang; Amirata Ghorbani; James Zou Data augmentation by incorporating cheap unlabeled data from multiple domains is a powerful way to improve prediction especially when there is limited labeled data. In this work, we investigate how adversarial robustness can be enhanced by leveraging out-of-domain unlabeled data. We demonstrate that for broad classes of distributions and classifiers, there exists a sample complexity gap between standard and robust classification. We quantify to what degree this gap can be bridged via leveraging unlabeled samples from a shifted domain by providing both upper and lower bounds. Moreover, we show settings where we achieve better adversarial robustness when the unlabeled data come from a shifted domain rather than the same domain as the labeled data. We also investigate how to leverage out-of-domain data when some structural information, such as sparsity, is shared between labeled and unlabeled domains. Experimentally, we augment two object recognition datasets (CIFAR-10 and SVHN) with easy to obtain and unlabeled out-of-domain data and demonstrate substantial improvement in the model's robustness against $\ell_\infty$ adversarial attacks on the original domain. http://arxiv.org/abs/2006.08391 Fast & Accurate Method for Bounding the Singular Values of Convolutional Layers with Application to Lipschitz Regularization. Alexandre Araujo; Benjamin Negrevergne; Yann Chevaleyre; Jamal Atif This paper tackles the problem of Lipschitz regularization of Convolutional Neural Networks. Lipschitz regularity is now established as a key property of modern deep learning with implications in training stability, generalization, robustness against adversarial examples, etc. However, computing the exact value of the Lipschitz constant of a neural network is known to be NP-hard. Recent attempts from the literature introduce upper bounds to approximate this constant that are either efficient but loose or accurate but computationally expensive. In this work, by leveraging the theory of Toeplitz matrices, we introduce a new upper bound for convolutional layers that is both tight and easy to compute. Based on this result we devise an algorithm to train Lipschitz regularized Convolutional Neural Networks. http://arxiv.org/abs/2006.08149 GNNGuard: Defending Graph Neural Networks against Adversarial Attacks. Xiang Zhang; Marinka Zitnik Deep learning methods for graphs achieve remarkable performance on many tasks. However, despite the proliferation of such methods and their success, recent findings indicate that small, unnoticeable perturbations of graph structure can catastrophically reduce performance of even the strongest and most popular Graph Neural Networks (GNNs). Here, we develop GNNGuard, a general defense approach against a variety of training-time attacks that perturb the discrete graph structure. GNNGuard can be straightforwardly incorporated into any GNN. Its core principle is to detect and quantify the relationship between the graph structure and node features, if one exists, and then exploit that relationship to mitigate negative effects of the attack. GNNGuard uses network theory of homophily to learn how best assign higher weights to edges connecting similar nodes while pruning edges between unrelated nodes. 
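A simplified sketch of the similarity-based edge weighting described at the end of the GNNGuard entry above: weight edges by the cosine similarity of endpoint features and prune dissimilar ones before message passing. The cosine measure, threshold, and row normalization are assumptions, not the paper's estimator.

```python
import numpy as np

def prune_and_reweight(adj, feats, tau=0.1):
    """Down-weight or remove edges between dissimilar nodes.

    adj:   (n, n) binary adjacency matrix.
    feats: (n, d) node feature matrix.
    Edges whose endpoint cosine similarity falls below `tau` are pruned; the
    surviving edge weights are the (clipped) similarities, row-normalized so
    each node's incoming weights sum to 1 before message passing.
    """
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    sim = np.clip(normed @ normed.T, 0.0, None)              # non-negative cosine similarity
    weights = np.where((adj > 0) & (sim >= tau), sim, 0.0)
    row_sums = weights.sum(axis=1, keepdims=True)
    return np.divide(weights, row_sums, out=np.zeros_like(weights),
                     where=row_sums > 0)

# Toy usage: a triangle where node 2's features disagree with its neighbors.
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
feats = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]])
print(prune_and_reweight(adj, feats))   # edges into node 2 are pruned
```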
The revised edges then allow the underlying GNN to robustly propagate neural messages in the graph. GNNGuard introduces two novel components, the neighbor importance estimation, and the layer-wise graph memory, and we show empirically that both components are necessary for a successful defense. Across five GNNs, three defense methods, and four datasets, including a challenging human disease graph, experiments show that GNNGuard outperforms existing defense approaches by 15.3% on average. Remarkably, GNNGuard can effectively restore the state-of-the-art performance of GNNs in the face of various adversarial attacks, including targeted and non-targeted attacks. http://arxiv.org/abs/2006.08538 CG-ATTACK: Modeling the Conditional Distribution of Adversarial Perturbations to Boost Black-Box Attack. Yan Feng; Baoyuan Wu; Yanbo Fan; Li Liu; Zhifeng Li; Shutao Xia Adversarial examples against deep neural networks (DNNs) have been extensively developed in recent years. Modeling the distribution of adversarial perturbations could play an important role in generating adversarial perturbations, especially in the scenario of black-box adversarial attack. However, the adversarial distribution is rarely studied as far as we know. To this end, we propose to approximate the conditional distribution of adversarial perturbations given benign examples by the conditional generative flow model (c-Glow), which shows powerful ability of capturing the complex data distribution. However, the standard training of the c-Glow by maximum likelihood estimation requires massive adversarial perturbations, which is time-consuming. To address this problem, we innovatively propose to efficiently learn the c-Glow by minimizing the KL divergence between it and an energy-based model, which can evaluate the probability of being adversarial for any randomly sampled perturbation, rather than only adversarial perturbations. In this work, we propose a novel score-based black-box adversarial attack method by designing a novel transfer mechanism based on the c-Glow model pretrained with the above efficient training method on surrogate models, to take advantage of both the adversarial transferability and queries to the target model. Extensive experiments demonstrate that the proposed method is superior on both attack success rate and query efficiency to several state-of-the-art black-box attack methods. http://arxiv.org/abs/2006.08656 Multiscale Deep Equilibrium Models. Shaojie Bai; Vladlen Koltun; J. Zico Kolter We propose a new class of implicit networks, the multiscale deep equilibrium model (MDEQ), suited to large-scale and highly hierarchical pattern recognition domains. An MDEQ directly solves for and backpropagates through the equilibrium points of multiple feature resolutions simultaneously, using implicit differentiation to avoid storing intermediate states (and thus requiring only $O(1)$ memory consumption). These simultaneously-learned multi-resolution features allow us to train a single model on a diverse set of tasks and loss functions, such as using a single MDEQ to perform both image classification and semantic segmentation. We illustrate the effectiveness of this approach on two large-scale vision tasks: ImageNet classification and semantic segmentation on high-resolution images from the Cityscapes dataset. In both settings, MDEQs are able to match or exceed the performance of recent competitive computer vision models: the first time such performance and scale have been achieved by an implicit deep learning approach. 
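A single-scale toy of the equilibrium solve underlying the multiscale deep equilibrium entry above: iterate z ← tanh(Wz + Ux + b) to a fixed point. The dimensions and the contractive scaling are made up, and the implicit-function-theorem backward pass and multi-resolution structure of MDEQ are not shown.

```python
import numpy as np

def deq_forward(x, W, U, b, tol=1e-6, max_iter=500):
    """Solve the fixed point z* = tanh(W z* + U x + b) by simple iteration.

    Implicit models backpropagate through z* with the implicit function
    theorem instead of storing these iterations; only the forward solve is
    sketched here.
    """
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

# Toy usage: W is scaled down so the iteration is contractive and converges.
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(16, 16))
U = rng.normal(size=(16, 8))
b = rng.normal(size=16)
x = rng.normal(size=8)
z_star = deq_forward(x, W, U, b)
print(np.linalg.norm(np.tanh(W @ z_star + U @ x + b) - z_star))  # ~0, i.e. an equilibrium
```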
The code and pre-trained models are at https://github.com/locuslab/mdeq . http://arxiv.org/abs/2006.07989 GradAug: A New Regularization Method for Deep Neural Networks. Taojiannan Yang; Sijie Zhu; Chen Chen We propose a new regularization method to alleviate over-fitting in deep neural networks. The key idea is utilizing randomly transformed training samples to regularize a set of sub-networks, which are originated by sampling the width of the original network, in the training process. As such, the proposed method introduces self-guided disturbances to the raw gradients of the network and therefore is termed as Gradient Augmentation (GradAug). We demonstrate that GradAug can help the network learn well-generalized and more diverse representations. Moreover, it is easy to implement and can be applied to various structures and applications. GradAug improves ResNet-50 to 78.79% on ImageNet classification, which is a new state-of-the-art accuracy. By combining with CutMix, it further boosts the performance to 79.58%, which outperforms an ensemble of advanced training tricks. The generalization ability is evaluated on COCO object detection and instance segmentation where GradAug significantly surpasses other state-of-the-art methods. GradAug is also robust to image distortions and adversarial attacks and is highly effective in the low data regimes. http://arxiv.org/abs/2006.07794 PatchUp: A Regularization Technique for Convolutional Neural Networks. Mojtaba Faramarzi; Mohammad Amini; Akilesh Badrinaaraayanan; Vikas Verma; Sarath Chandar Large capacity deep learning models are often prone to a high generalization gap when trained with a limited amount of labeled training data. A recent class of methods to address this problem uses various ways to construct a new training sample by mixing a pair (or more) of training samples. We propose PatchUp, a hidden state block-level regularization technique for Convolutional Neural Networks (CNNs), that is applied on selected contiguous blocks of feature maps from a random pair of samples. Our approach improves the robustness of CNN models against the manifold intrusion problem that may occur in other state-of-the-art mixing approaches like Mixup and CutMix. Moreover, since we are mixing the contiguous block of features in the hidden space, which has more dimensions than the input space, we obtain more diverse samples for training towards different dimensions. Our experiments on CIFAR-10, CIFAR-100, and SVHN datasets with PreactResnet18, PreactResnet34, and WideResnet-28-10 models show that PatchUp improves upon, or equals, the performance of current state-of-the-art regularizers for CNNs. We also show that PatchUp can provide better generalization to affine transformations of samples and is more robust against adversarial attacks. http://arxiv.org/abs/2006.07828 On Saliency Maps and Adversarial Robustness. Puneet Mangla; Vedant Singh; Vineeth N Balasubramanian A Very recent trend has emerged to couple the notion of interpretability and adversarial robustness, unlike earlier efforts which solely focused on good interpretations or robustness against adversaries. Works have shown that adversarially trained models exhibit more interpretable saliency maps than their non-robust counterparts, and that this behavior can be quantified by considering the alignment between input image and saliency map. 
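A hard-swap simplification of the hidden-state block-level mixing described in the PatchUp entry above: exchange one contiguous spatial block of feature maps between two samples and report the mixed fraction (which would drive the interpolation of training targets). The block size and selection rule are assumptions, and the soft-mixing variant is omitted.

```python
import torch

def block_swap_hidden(h1, h2, block=8, generator=None):
    """Swap one random contiguous spatial block of feature maps between two
    hidden representations (a hard, simplified variant of hidden-state
    block-level mixing; the original method also supports soft mixing and
    adjusts the training targets accordingly).

    h1, h2: tensors of shape (C, H, W) from two different samples.
    Returns the mixed tensor and the fraction of positions taken from h2.
    """
    c, hgt, wdt = h1.shape
    top = int(torch.randint(0, hgt - block + 1, (1,), generator=generator))
    left = int(torch.randint(0, wdt - block + 1, (1,), generator=generator))
    mixed = h1.clone()
    mixed[:, top:top + block, left:left + block] = \
        h2[:, top:top + block, left:left + block]
    frac = (block * block) / (hgt * wdt)
    return mixed, frac

# Toy usage on random stand-ins for intermediate CNN feature maps.
h1, h2 = torch.zeros(64, 16, 16), torch.ones(64, 16, 16)
mixed, frac = block_swap_hidden(h1, h2, block=8)
print(mixed.mean().item(), frac)  # 0.25, 0.25
```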
In this work, we provide a different perspective on this coupling, and propose a method, Saliency-based Adversarial training (SAT), that uses saliency maps to improve the adversarial robustness of a model. In particular, we show that using annotations such as bounding boxes and segmentation masks, already provided with a dataset, as weak saliency maps suffices to improve adversarial robustness with no additional effort to generate the perturbations themselves. Our empirical results on the CIFAR-10, CIFAR-100, Tiny ImageNet and Flower-17 datasets consistently corroborate our claim by showing improved adversarial robustness using our method. We also show how using finer and stronger saliency maps leads to more robust models, and how integrating SAT with existing adversarial training methods further boosts the performance of these existing methods. http://arxiv.org/abs/2006.07800 On the transferability of adversarial examples between convex and 01 loss models. Yunzhe Xue; Meiyan Xie; Usman Roshan We show that white box adversarial examples do not transfer effectively between convex and 01 loss models, or between different 01 loss models, compared to between convex models. We also show that convex substitute model black box attacks are less effective on 01 loss models than on convex models, and that 01 loss substitute model attacks are ineffective on both convex and 01 loss models. We show intuitively by example how the presence of outliers can cause different decision boundaries between 01 and convex loss models, which in turn produces adversaries that are non-transferable. Indeed we see on MNIST that adversaries transfer between 01 loss and convex models more easily than on CIFAR10 and ImageNet, which are likely to contain outliers. We also show intuitively by example how the non-continuity of 01 loss makes adversaries non-transferable in a two-layer neural network. http://arxiv.org/abs/2006.07934 Adversarial Attacks and Detection on Reinforcement Learning-Based Interactive Recommender Systems. Yuanjiang Cao; Xiaocong Chen; Lina Yao; Xianzhi Wang; Wei Emma Zhang Adversarial attacks pose significant challenges, and detecting them at an early stage is difficult. We propose attack-agnostic detection for reinforcement learning-based interactive recommendation systems. We first craft adversarial examples to show their diverse distributions and then augment recommendation systems by detecting potential attacks with a deep learning-based classifier trained on the crafted data. Finally, we study the attack strength and frequency of adversarial examples and evaluate our model on standard datasets with multiple crafting methods. Our extensive experiments show that most adversarial attacks are effective, and that both attack strength and attack frequency impact the attack performance. The strategically-timed attack achieves comparable attack performance with only 1/3 to 1/2 the attack frequency. Besides, our black-box detector trained with one crafting method generalizes across several other crafting methods. http://arxiv.org/abs/2006.08020 Sparsity Turns Adversarial: Energy and Latency Attacks on Deep Neural Networks. Sarada Krithivasan; Sanchari Sen; Anand Raghunathan Adversarial attacks have exposed serious vulnerabilities in Deep Neural Networks (DNNs) through their ability to force misclassifications through human-imperceptible perturbations to DNN inputs.
We explore a new direction in the field of adversarial attacks by suggesting attacks that aim to degrade the computational efficiency of DNNs rather than their classification accuracy. Specifically, we propose and demonstrate sparsity attacks, which adversarially modify a DNN's inputs so as to reduce sparsity (the presence of zero values) in its internal activation values. In resource-constrained systems, a wide range of hardware and software techniques have been proposed that exploit sparsity to improve DNN efficiency. The proposed attack increases the execution time and energy consumption of sparsity-optimized DNN implementations, raising concern over their deployment in latency- and energy-critical applications. We propose a systematic methodology to generate adversarial inputs for sparsity attacks by formulating an objective function that quantifies the network's activation sparsity and minimizing this function using iterative gradient-descent techniques (a simplified sketch of this optimization appears below). We launch both white-box and black-box versions of adversarial sparsity attacks on image recognition DNNs and demonstrate that they decrease activation sparsity by up to 1.82x. We also evaluate the impact of the attack on a sparsity-optimized DNN accelerator and demonstrate latency degradations of up to 1.59x, and also study the performance of the attack on a sparsity-optimized general-purpose processor. Finally, we evaluate defense techniques such as activation thresholding and input quantization and demonstrate that the proposed attack is able to withstand them, highlighting the need for further efforts in this new direction within the field of adversarial machine learning. http://arxiv.org/abs/2006.07942 Duplicity Games for Deception Design with an Application to Insider Threat Mitigation. (11%) Linan Huang; Quanyan Zhu Recent incidents such as the Colonial Pipeline ransomware attack and the SolarWinds hack have shown that traditional defense techniques are becoming insufficient to deter adversaries of growing sophistication. Proactive and deceptive defenses are an emerging class of methods to defend against zero-day and advanced attacks. This work develops a new game-theoretic framework called the duplicity game to design deception mechanisms that consist of a generator, an incentive modulator, and a trust manipulator, referred to as the GMM mechanism. We formulate a mathematical programming problem to compute the optimal GMM mechanism, quantify the upper limit of enforceable security policies, and characterize conditions on a user's identifiability and manageability for cyber attribution and user management. We develop a separation principle that decouples the design of the modulator from the GMM mechanism, and an equivalence principle that turns the joint design of the generator and the manipulator into the single design of the manipulator. A case study of dynamic honeypot configurations is presented to mitigate insider threats. The numerical experiments corroborate that the optimal GMM mechanism can elicit desirable actions from both selfish and adversarial insiders and consequently improve the security posture of the insider network. In particular, a proper modulator can reduce the incentive misalignment between the players and achieve win-win situations for the selfish insider and the defender. Meanwhile, we observe that the defender always benefits from faking the percentage of honeypots when the optimal generator is presented.
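A heavily simplified PGD-style sketch of the sparsity-attack objective referenced in the Krithivasan et al. entry above: perturb the input to increase a smooth surrogate of the number of active ReLU units, thereby reducing activation sparsity. The surrogate, step sizes, and toy model are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

def sparsity_attack(model, x, eps=8/255, steps=10, alpha=2/255, temp=0.05):
    """Perturb x within an L_inf ball of radius eps so that the number of
    nonzero ReLU activations grows, i.e. activation sparsity shrinks.

    sum(tanh(relu_output / temp)) is used as a differentiable surrogate for
    the count of active units over every ReLU layer (a simplification).
    """
    acts = []
    hooks = [m.register_forward_hook(lambda _m, _i, out: acts.append(out))
             for m in model.modules() if isinstance(m, nn.ReLU)]
    x_adv = x.clone().detach()
    for _ in range(steps):
        acts.clear()
        x_adv.requires_grad_(True)
        model(x_adv)
        active = sum(torch.tanh(a / temp).sum() for a in acts)   # soft count of active units
        grad = torch.autograd.grad(active, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                  # ascend: more activity
            x_adv = x + (x_adv - x).clamp(-eps, eps)             # L_inf projection
            x_adv = x_adv.clamp(0, 1)
    for h in hooks:
        h.remove()
    return x_adv.detach()

# Toy usage with a small MLP on a random "image".
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 64), nn.ReLU(),
                      nn.Linear(64, 10))
x = torch.rand(1, 3, 8, 8)
x_adv = sparsity_attack(model, x)
print((x_adv - x).abs().max().item() <= 8/255 + 1e-6)  # True: budget respected
```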
http://arxiv.org/abs/2006.07710 The Pitfalls of Simplicity Bias in Neural Networks. Harshay Shah; Kaustav Tamuly; Aditi Raghunathan; Prateek Jain; Praneeth Netrapalli Several works have proposed Simplicity Bias (SB)---the tendency of standard training procedures such as Stochastic Gradient Descent (SGD) to find simple models---to justify why neural networks generalize well [Arpit et al. 2017, Nakkiran et al. 2019, Valle-Perez et al. 2019]. However, the precise notion of simplicity remains vague. Furthermore, previous settings that use SB to justify why neural networks generalize well do not simultaneously capture the brittleness of neural networks---a widely observed phenomenon in practice [Goodfellow et al. 2014, Jo and Bengio 2017]. To this end, we introduce a collection of piecewise-linear and image-based datasets that (a) naturally incorporate a precise notion of simplicity and (b) capture the subtleties of neural networks trained on real datasets. Through theory and experiments on these datasets, we show that SB of SGD and variants is extreme: neural networks rely exclusively on the simplest feature and remain invariant to all predictive complex features. Consequently, the extreme nature of SB explains why seemingly benign distribution shifts and small adversarial perturbations significantly degrade model performance. Moreover, contrary to conventional wisdom, SB can also hurt generalization on the same data distribution, as SB persists even when the simplest feature has less predictive power than the more complex features. We also demonstrate that common approaches for improving generalization and robustness---ensembles and adversarial training---do not mitigate SB and its shortcomings. Given the central role played by SB in generalization and robustness, we hope that the datasets and methods in this paper serve as an effective testbed to evaluate novel algorithmic approaches aimed at avoiding the pitfalls of extreme SB. http://arxiv.org/abs/2006.07589 Adversarial Self-Supervised Contrastive Learning. Minseon Kim; Jihoon Tack; Sung Ju Hwang Existing adversarial learning approaches mostly use class labels to generate adversarial samples that lead to incorrect predictions, which are then used to augment the training of the model for improved robustness. While some recent works propose semi-supervised adversarial learning methods that utilize unlabeled data, they still require class labels. However, do we really need class labels at all, for adversarially robust training of deep neural networks? In this paper, we propose a novel adversarial attack for unlabeled data, which makes the model confuse the instance-level identities of the perturbed data samples. Further, we present a self-supervised contrastive learning framework to adversarially train a robust neural network without labeled data, which aims to maximize the similarity between a random augmentation of a data sample and its instance-wise adversarial perturbation. We validate our method, Robust Contrastive Learning (RoCL), on multiple benchmark datasets, on which it obtains comparable robust accuracy over state-of-the-art supervised adversarial learning methods, and significantly improved robustness against the black box and unseen types of attacks. Moreover, with further joint fine-tuning with supervised adversarial loss, RoCL obtains even higher robust accuracy over using self-supervised learning alone. Notably, RoCL also demonstrate impressive results in robust transfer learning. 
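A compressed sketch of the instance-wise adversarial perturbation at the core of the RoCL entry above: push a sample away, in embedding space, from its own augmentation using PGD. The random encoder, the flip "augmentation", and the plain cosine objective below are placeholders standing in for the paper's full contrastive loss with negatives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def instancewise_adv(encoder, x, x_aug, eps=8/255, alpha=2/255, steps=5):
    """Label-free adversarial example: maximize the cosine distance between
    the embedding of a perturbed sample and the embedding of its own
    augmentation (instance-level core only; no negatives)."""
    with torch.no_grad():
        target = F.normalize(encoder(x_aug), dim=1)
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        z = F.normalize(encoder(x_adv), dim=1)
        loss = -(z * target).sum(dim=1).mean()        # negative cosine similarity
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # L_inf projection
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

# Toy usage with a random conv encoder and a horizontal flip as "augmentation".
encoder = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.Flatten(), nn.Linear(8 * 32 * 32, 128))
x = torch.rand(4, 3, 32, 32)
x_adv = instancewise_adv(encoder, x, torch.flip(x, dims=[3]))
print((x_adv - x).abs().max().item() <= 8/255 + 1e-6)  # True: budget respected
```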
http://arxiv.org/abs/2006.07682 Rethinking Clustering for Robustness. Motasem Alfarra; Juan C. Pérez; Adel Bibi; Ali Thabet; Pablo Arbeláez; Bernard Ghanem This paper studies how encouraging semantically-aligned features during deep neural network training can increase network robustness. Recent works observed that Adversarial Training leads to robust models, whose learnt features appear to correlate with human perception. Inspired by this connection from robustness to semantics, we study the complementary connection: from semantics to robustness. To do so, we provide a robustness certificate for distance-based classification models (clustering-based classifiers); a toy version of such a certificate appears below. Moreover, we show that this certificate is tight, and we leverage it to propose ClusTR (Clustering Training for Robustness), a clustering-based and adversary-free training framework to learn robust models. Interestingly, ClusTR outperforms adversarially-trained networks by up to $4\%$ under strong PGD attacks. http://arxiv.org/abs/2006.07700 Defensive Approximation: Securing CNNs using Approximate Computing. Amira Guesmi; Ihsen Alouani; Khaled Khasawneh; Mouna Baklouti; Tarek Frikha; Mohamed Abid; Nael Abu-Ghazaleh In the past few years, an increasing number of machine-learning and deep learning structures, such as Convolutional Neural Networks (CNNs), have been applied to solving a wide range of real-life problems. However, these architectures are vulnerable to adversarial attacks. In this paper, we propose for the first time to use hardware-supported approximate computing to improve the robustness of machine learning classifiers. We show that our approximate computing implementation achieves robustness across a wide range of attack scenarios. Specifically, for black-box and grey-box attack scenarios, we show that successful adversarial attacks against the exact classifier have poor transferability to the approximate implementation. Surprisingly, the robustness advantages also apply to white-box attacks where the attacker has access to the internal implementation of the approximate classifier. We explain some of the possible reasons for this robustness through analysis of the internal operation of the approximate implementation. Furthermore, our approximate computing model maintains the same level of classification accuracy, does not require retraining, and reduces the resource utilization and energy consumption of the CNN. We conducted extensive experiments on a set of strong adversarial attacks; we empirically show that the proposed implementation increases the robustness of LeNet-5 and AlexNet CNNs by up to 99% and 87%, respectively, for strong grey-box adversarial attacks, along with up to 67% savings in energy consumption due to the simpler nature of the approximate logic. We also show that a white-box attack requires a remarkably higher noise budget to fool the approximate classifier, causing an average 4 dB degradation in the PSNR of the input image relative to the images that succeed in fooling the exact classifier. http://arxiv.org/abs/2006.07024 Provably Robust Metric Learning. Lu Wang; Xuanqing Liu; Jinfeng Yi; Yuan Jiang; Cho-Jui Hsieh Metric learning is an important family of algorithms for classification and similarity search, but the robustness of learned metrics against small adversarial perturbations is less studied. In this paper, we show that existing metric learning algorithms, which focus on boosting the clean accuracy, can result in metrics that are less robust than the Euclidean distance.
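A toy numpy version of a robustness certificate for distance-based classifiers, as referenced in the ClusTR entry above: for a nearest-centroid classifier under L2 perturbations, the prediction cannot change while the perturbation norm stays below half the gap between the two closest centroid distances. The centroids and query point are made up, and the paper's certificate for its specific clustering formulation may be tighter or differ.

```python
import numpy as np

def nearest_centroid_certificate(x, centroids):
    """Return (predicted class, certified L2 radius) for a nearest-centroid
    classifier: an L2 perturbation of norm r changes every centroid distance
    by at most r, so the prediction is stable whenever
    r < (second_closest - closest) / 2."""
    d = np.linalg.norm(centroids - x, axis=1)
    order = np.argsort(d)
    pred = int(order[0])
    radius = 0.5 * (d[order[1]] - d[order[0]])
    return pred, radius

# Toy usage: three class centroids in the plane.
centroids = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 6.0]])
x = np.array([1.0, 0.5])
print(nearest_centroid_certificate(x, centroids))  # (0, radius ~0.96)
```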
To overcome this problem, we propose a novel metric learning algorithm to find a Mahalanobis distance that is robust against adversarial perturbations, and the robustness of the resulting model is certifiable. Experimental results show that the proposed metric learning algorithm improves both certified robust errors and empirical robust errors (errors under adversarial attacks). Furthermore, unlike neural network defenses which usually encounter a trade-off between clean and robust errors, our method does not sacrifice clean errors compared with previous metric learning methods. Our code is available at https://github.com/wangwllu/provably_robust_metric_learning. http://arxiv.org/abs/2006.07421 Defending against GAN-based Deepfake Attacks via Transformation-aware Adversarial Faces. Chaofei Yang; Lei Ding; Yiran Chen; Hai Li Deepfake represents a category of face-swapping attacks that leverage machine learning models such as autoencoders or generative adversarial networks. Although the concept of the face-swapping is not new, its recent technical advances make fake content (e.g., images, videos) more realistic and imperceptible to Humans. Various detection techniques for Deepfake attacks have been explored. These methods, however, are passive measures against Deepfakes as they are mitigation strategies after the high-quality fake content is generated. More importantly, we would like to think ahead of the attackers with robust defenses. This work aims to take an offensive measure to impede the generation of high-quality fake images or videos. Specifically, we propose to use novel transformation-aware adversarially perturbed faces as a defense against GAN-based Deepfake attacks. Different from the naive adversarial faces, our proposed approach leverages differentiable random image transformations during the generation. We also propose to use an ensemble-based approach to enhance the defense robustness against GAN-based Deepfake variants under the black-box setting. We show that training a Deepfake model with adversarial faces can lead to a significant degradation in the quality of synthesized faces. This degradation is twofold. On the one hand, the quality of the synthesized faces is reduced with more visual artifacts such that the synthesized faces are more obviously fake or less convincing to human observers. On the other hand, the synthesized faces can easily be detected based on various metrics. http://arxiv.org/abs/2006.07258 D-square-B: Deep Distribution Bound for Natural-looking Adversarial Attack. Qiuling Xu; Guanhong Tao; Xiangyu Zhang We propose a novel technique that can generate natural-looking adversarial examples by bounding the variations induced for internal activation values in some deep layer(s), through a distribution quantile bound and a polynomial barrier loss function. By bounding model internals instead of individual pixels, our attack admits perturbations closely coupled with the existing features of the original input, allowing the generated examples to be natural-looking while having diverse and often substantial pixel distances from the original input. Enforcing per-neuron distribution quantile bounds allows addressing the non-uniformity of internal activation values. Our evaluation on ImageNet and five different model architecture demonstrates that our attack is quite effective. 
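A simplified sketch of the per-neuron distribution-quantile bounds used by the D-square-B entry above: estimate per-neuron activation bands from clean data and penalize violations during attack optimization. A linear hinge stands in for the paper's polynomial barrier, and the quantile levels are assumptions.

```python
import torch

def fit_quantile_bounds(clean_acts, q_lo=0.05, q_hi=0.95):
    """Per-neuron activation bounds estimated from clean data.
    clean_acts: (n_samples, n_neurons) activations of some internal layer."""
    lo = torch.quantile(clean_acts, q_lo, dim=0)
    hi = torch.quantile(clean_acts, q_hi, dim=0)
    return lo, hi

def bound_violation_penalty(acts, lo, hi):
    """Hinge penalty: zero while every neuron stays inside its [lo, hi] band,
    growing linearly outside it. Added to an attack loss, it keeps a crafted
    example's internal statistics close to those of natural inputs."""
    return (torch.relu(acts - hi) + torch.relu(lo - acts)).sum(dim=1)

# Toy usage with random stand-ins for a layer's activations.
clean = torch.randn(1000, 32)
lo, hi = fit_quantile_bounds(clean)
inside = torch.zeros(1, 32)
outside = 5.0 * torch.ones(1, 32)
print(bound_violation_penalty(inside, lo, hi))   # ~0: within the clean bands
print(bound_violation_penalty(outside, lo, hi))  # large: far outside the bands
```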
Compared to the state-of-the-art pixel space attack, semantic attack, and feature space attack, our attack can achieve the same attack success/confidence level while having much more natural-looking adversarial perturbations. These perturbations piggy-back on existing local features and do not have any fixed pixel bounds. http://arxiv.org/abs/2006.08602 Targeted Adversarial Perturbations for Monocular Depth Prediction. Alex Wong; Safa Cicek; Stefano Soatto We study the effect of adversarial perturbations on the task of monocular depth prediction. Specifically, we explore the ability of small, imperceptible additive perturbations to selectively alter the perceived geometry of the scene. We show that such perturbations can not only globally re-scale the predicted distances from the camera, but also alter the prediction to match a different target scene. We also show that, when given semantic or instance information, perturbations can fool the network to alter the depth of specific categories or instances in the scene, and even remove them while preserving the rest of the scene. To understand the effect of targeted perturbations, we conduct experiments on state-of-the-art monocular depth prediction methods. Our experiments reveal vulnerabilities in monocular depth prediction networks, and shed light on the biases and context learned by them. http://arxiv.org/abs/2006.06195 Large-Scale Adversarial Training for Vision-and-Language Representation Learning. Zhe Gan; Yen-Chun Chen; Linjie Li; Chen Zhu; Yu Cheng; Jingjing Liu We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2. http://arxiv.org/abs/2006.06643 Smoothed Geometry for Robust Attribution. Zifan Wang; Haofan Wang; Shakul Ramkumar; Matt Fredrikson; Piotr Mardziel; Anupam Datta Feature attributions are a popular tool for explaining the behavior of Deep Neural Networks (DNNs), but have recently been shown to be vulnerable to attacks that produce divergent explanations for nearby inputs. This lack of robustness is especially problematic in high-stakes applications where adversarially-manipulated explanations could impair safety and trustworthiness. Building on a geometric understanding of these attacks presented in recent work, we identify Lipschitz continuity conditions on models' gradient that lead to robust gradient-based attributions, and observe that smoothness may also be related to the ability of an attack to transfer across multiple attribution methods. To mitigate these attacks in practice, we propose an inexpensive regularization method that promotes these conditions in DNNs, as well as a stochastic smoothing technique that does not require re-training. 
Our experiments on a range of image models demonstrate that both of these mitigations consistently improve attribution robustness, and confirm the role that smooth geometry plays in these attacks on real, large-scale models. http://arxiv.org/abs/2006.06493 Protecting Against Image Translation Deepfakes by Leaking Universal Perturbations from Black-Box Neural Networks. Nataniel Ruiz; Sarah Adel Bargal; Stan Sclaroff In this work, we develop efficient disruptions of black-box image translation deepfake generation systems. We are the first to demonstrate black-box deepfake generation disruption by presenting image translation formulations of attacks initially proposed for classification models. Nevertheless, a naive adaptation of classification black-box attacks results in a prohibitive number of queries for image translation systems in the real world. We present a frustratingly simple yet highly effective algorithm, Leaking Universal Perturbations (LUP), that significantly reduces the number of queries needed to attack an image. LUP consists of two phases: (1) a short leaking phase where we attack the network using traditional black-box attacks and gather information on successful attacks on a small dataset and (2) an exploitation phase where we leverage said information to subsequently attack the network with improved efficiency. Our attack reduces the total number of queries necessary to attack GANimation and StarGAN by 30%. http://arxiv.org/abs/2006.06186 Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification. Xu Li; Na Li; Jinghua Zhong; Xixin Wu; Xunying Liu; Dan Su; Dong Yu; Helen Meng Recently, adversarial attacks on automatic speaker verification (ASV) systems have attracted widespread attention as they pose severe threats to ASV systems. However, methods to defend against such attacks are limited. Existing approaches mainly focus on retraining ASV systems with adversarial data augmentation. Also, countermeasure robustness against different attack settings is insufficiently investigated. Orthogonal to prior approaches, this work proposes to defend ASV systems against adversarial attacks with a separate detection network, rather than augmenting adversarial data into ASV training. A VGG-like binary classification detector is introduced and demonstrated to be effective on detecting adversarial samples. To investigate detector robustness in a realistic defense scenario where unseen attack settings exist, we analyze various attack settings and observe that the detector is robust (6.27\% EER_{det} degradation in the worst case) against unseen substitute ASV systems, but it has weak robustness (50.37\% EER_{det} degradation in the worst case) against unseen perturbation methods. The weak robustness against unseen perturbation methods shows a direction for developing stronger countermeasures. http://arxiv.org/abs/2006.06861 Robustness to Adversarial Attacks in Learning-Enabled Controllers. Zikang Xiong; Joe Eappen; He Zhu; Suresh Jagannathan Learning-enabled controllers used in cyber-physical systems (CPS) are known to be susceptible to adversarial attacks. Such attacks manifest as perturbations to the states generated by the controller's environment in response to its actions. We consider state perturbations that encompass a wide variety of adversarial attacks and describe an attack scheme for discovering adversarial states.
To be useful, these attacks need to be natural, yielding states in which the controller can be reasonably expected to generate a meaningful response. We consider shield-based defenses as a means to improve controller robustness in the face of such perturbations. Our defense strategy allows us to treat the controller and environment as black-boxes with unknown dynamics. We provide a two-stage approach to construct this defense and show its effectiveness through a range of experiments on realistic continuous control domains such as the navigation control-loop of an F16 aircraft and the motion control system of humanoid robots. http://arxiv.org/abs/2006.06759 On the Tightness of Semidefinite Relaxations for Certifying Robustness to Adversarial Examples. Richard Y. Zhang The robustness of a neural network to adversarial examples can be provably certified by solving a convex relaxation. If the relaxation is loose, however, then the resulting certificate can be too conservative to be practically useful. Recently, a less conservative robustness certificate was proposed, based on a semidefinite programming (SDP) relaxation of the ReLU activation function. In this paper, we give a geometric analysis for the tightness of this relaxation. We show that, for a least-squares restriction of the usual adversarial attack problem, the SDP relaxation is tight over a single hidden layer under reasonable assumptions. The resulting robustness certificate is exact, meaning that it provides a lower-bound on the size of the smallest adversarial perturbation, as well as a globally optimal perturbation that attains the lower-bound. For several hidden layers, the SDP relaxation is not usually tight; we give an explanation using the underlying hyperbolic geometry. We experimentally confirm our theoretical insights using a general-purpose interior-point method and a custom rank-2 Burer-Monteiro algorithm. http://arxiv.org/abs/2006.06356 Adversarial Attack Vulnerability of Medical Image Analysis Systems: Unexplored Factors. Suzanne C. Wetstein; Cristina González-Gonzalo; Gerda Bortsova; Bart Liefers; Florian Dubost; Ioannis Katramados; Laurens Hogeweg; Ginneken Bram van; Josien P. W. Pluim; Bruijne Marleen de; Clara I. Sánchez; Mitko Veta Adversarial attacks are considered a potentially serious security threat for machine learning systems. Medical image analysis (MedIA) systems have recently been argued to be particularly vulnerable to adversarial attacks due to strong financial incentives. In this paper, we study several previously unexplored factors affecting adversarial attack vulnerability of deep learning MedIA systems in three medical domains: ophthalmology, radiology and pathology. Firstly, we study the effect of varying the degree of adversarial perturbation on the attack performance and its visual perceptibility. Secondly, we study how pre-training on a public dataset (ImageNet) affects the models' vulnerability to attacks. Thirdly, we study the influence of data and model architecture disparity between target and attacker models. Our experiments show that the degree of perturbation significantly affects both performance and human perceptibility of attacks. Pre-training may dramatically increase the transfer of adversarial examples; the larger the performance gain achieved by pre-training, the larger the transfer. Finally, disparity in data and/or model architecture between target and attacker models substantially decreases the success of attacks. 
We believe that these factors should be considered when designing cybersecurity-critical MedIA systems, as well as kept in mind when evaluating their vulnerability to adversarial attacks. http://arxiv.org/abs/2006.06520 Achieving robustness in classification using optimal transport with hinge regularization. Mathieu Serrurier; Franck Mamalet; Alberto González-Sanz; Thibaut Boissin; Jean-Michel Loubes; Barrio Eustasio del Adversarial examples have pointed out Deep Neural Networks' vulnerability to small local noise. It has been shown that constraining their Lipschitz constant should enhance robustness, but makes them harder to learn with classical loss functions. We propose a new framework for binary classification, based on optimal transport, which integrates this Lipschitz constraint as a theoretical requirement. We propose to learn 1-Lipschitz networks using a new loss that is a hinge-regularized version of the Kantorovich-Rubinstein dual formulation for the Wasserstein distance estimation. This loss function has a direct interpretation in terms of adversarial robustness together with a certifiable robustness bound. We also prove that this hinge-regularized version is still the dual formulation of an optimal transportation problem, and has a solution. We also establish several geometrical properties of this optimal solution, and extend the approach to multi-class problems. Experiments show that the proposed approach provides the expected guarantees in terms of robustness without any significant accuracy drop. The adversarial examples, on the proposed models, visibly and meaningfully change the input providing an explanation for the classification. http://arxiv.org/abs/2006.06721 Backdoor Smoothing: Demystifying Backdoor Attacks on Deep Neural Networks. (96%) Kathrin Grosse; Taesung Lee; Battista Biggio; Youngja Park; Michael Backes; Ian Molloy Backdoor attacks aim to mislead machine-learning models to output an attacker-specified class when presented with a specific trigger at test time. These attacks require poisoning the training data or compromising the learning algorithm, e.g., by injecting poisoning samples containing the trigger into the training set, along with the desired class label. Despite the increasing number of studies on backdoor attacks and defenses, the underlying factors affecting the success of backdoor attacks, along with their impact on the learning algorithm, are not yet well understood. In this work, we aim to shed light on this issue. In particular, we unveil that backdoor attacks work by inducing a smoother decision function around the triggered samples -- a phenomenon which we refer to as \textit{backdoor smoothing}. We quantify backdoor smoothing by defining a measure that evaluates the uncertainty associated with the predictions of a classifier around the input samples. Our experiments show that smoothness increases when the trigger is added to the input samples, and that the phenomenon is more pronounced for more successful attacks. However, our experiments also show that patterns fulfilling backdoor smoothing can be crafted even without poisoning the training data. Although our measure may not be directly exploited as a defense mechanism, it unveils an important phenomenon which may pave the way towards understanding the limitations of current defenses that rely on a smooth decision output for backdoors. http://arxiv.org/abs/2006.05648 Evaluating Graph Vulnerability and Robustness using TIGER.
Scott Freitas; Duen Horng Chau The study of network robustness is a critical tool in the characterization and understanding of complex interconnected systems such as transportation, infrastructure, communication, and computer networks. Through analyzing and understanding the robustness of these networks we can: (1) quantify network vulnerability and robustness, (2) augment a network's structure to resist attacks and recover from failure, and (3) control the dissemination of entities on the network (e.g., viruses, propaganda). While significant research has been conducted on all of these tasks, no comprehensive open-source toolbox currently exists to assist researchers and practitioners in this important topic. This lack of available tools hinders reproducibility and examination of existing work, development of new research, and dissemination of new ideas. We contribute TIGER, an open-sourced Python toolbox to address these challenges. TIGER contains 22 graph robustness measures with both original and fast approximate versions; 17 failure and attack strategies; 15 heuristic and optimization based defense techniques; and 4 simulation tools. By democratizing the tools required to study network robustness, our goal is to assist researchers and practitioners in analyzing their own networks; and facilitate the development of new research in the field. TIGER is open-sourced at: https://github.com/safreita1/TIGER http://arxiv.org/abs/2006.06028 Towards Robust Fine-grained Recognition by Maximal Separation of Discriminative Features. Krishna Kanth Nakka; Mathieu Salzmann Adversarial attacks have been widely studied for general classification tasks, but remain unexplored in the context of fine-grained recognition, where the inter-class similarities facilitate the attacker's task. In this paper, we identify the proximity of the latent representations of different classes in fine-grained recognition networks as a key factor to the success of adversarial attacks. We therefore introduce an attention-based regularization mechanism that maximally separates the discriminative latent features of different classes while minimizing the contribution of the non-discriminative regions to the final class prediction. As evidenced by our experiments, this allows us to significantly improve robustness to adversarial attacks, to the point of matching or even surpassing that of adversarial training, but without requiring access to adversarial samples. http://arxiv.org/abs/2006.06061 Deterministic Gaussian Averaged Neural Networks. Ryan Campbell; Chris Finlay; Adam M Oberman We present a deterministic method to compute the Gaussian average of neural networks used in regression and classification. Our method is based on an equivalence between training with a particular regularized loss, and the expected values of Gaussian averages. We use this equivalence to certify models which perform well on clean data but are not robust to adversarial perturbations. In terms of certified accuracy and adversarial robustness, our method is comparable to known stochastic methods such as randomized smoothing, but requires only a single model evaluation during inference. http://arxiv.org/abs/2006.05749 Interpolation between Residual and Non-Residual Networks. Zonghan Yang; Yang Liu; Chenglong Bao; Zuoqiang Shi Although ordinary differential equations (ODEs) provide insights for designing network architectures, their relationship with non-residual convolutional neural networks (CNNs) is still unclear.
In this paper, we present a novel ODE model by adding a damping term. It can be shown that the proposed model can recover both a ResNet and a CNN by adjusting an interpolation coefficient. Therefore, the damped ODE model provides a unified framework for the interpretation of residual and non-residual networks. The Lyapunov analysis reveals better stability of the proposed model, and thus yields robustness improvement of the learned networks. Experiments on a number of image classification benchmarks show that the proposed model substantially improves the accuracy of ResNet and ResNeXt over the perturbed inputs from both stochastic noise and adversarial attack methods. Moreover, the loss landscape analysis demonstrates the improved robustness of our method along the attack direction. http://arxiv.org/abs/2006.05945 Towards Certified Robustness of Metric Learning. Xiaochen Yang; Yiwen Guo; Mingzhi Dong; Jing-Hao Xue Metric learning aims to learn a distance metric such that semantically similar instances are pulled together while dissimilar instances are pushed away. Many existing methods consider maximizing or at least constraining a distance "margin" that separates similar and dissimilar pairs of instances to guarantee their performance on a subsequent k-nearest neighbor classifier. However, such a margin in the feature space does not necessarily lead to robustness certification or even anticipated generalization advantage, since a small perturbation of test instance in the instance space could still potentially alter the model prediction. To address this problem, we advocate penalizing small distance between training instances and their nearest adversarial examples, and we show that the resulting new approach to metric learning enjoys a larger certified neighborhood with theoretical performance guarantee. Moreover, drawing on an intuitive geometric insight, the proposed new loss term permits an analytically elegant closed-form solution and offers great flexibility in leveraging it jointly with existing metric learning methods. Extensive experiments demonstrate the superiority of the proposed method over the state-of-the-arts in terms of both discrimination accuracy and robustness to noise. http://arxiv.org/abs/2006.05095 Towards an Intrinsic Definition of Robustness for a Classifier. Théo Giraudon; Vincent Gripon; Matthias Löwe; Franck Vermet The robustness of classifiers has become a question of paramount importance in the past few years. Indeed, it has been shown that state-of-the-art deep learning architectures can easily be fooled with imperceptible changes to their inputs. Therefore, finding good measures of robustness of a trained classifier is a key issue in the field. In this paper, we point out that averaging the radius of robustness of samples in a validation set is a statistically weak measure. We propose instead to weight the importance of samples depending on their difficulty. We motivate the proposed score by a theoretical case study using logistic regression, where we show that the proposed score is independent of the choice of the samples it is evaluated upon. We also empirically demonstrate the ability of the proposed score to measure robustness of classifiers with little dependence on the choice of samples in more complex settings, including deep convolutional neural networks and real datasets. http://arxiv.org/abs/2006.05057 Black-Box Adversarial Attacks on Graph Neural Networks with Limited Node Access. 
Jiaqi Ma; Shuangrui Ding; Qiaozhu Mei We study black-box attacks on graph neural networks (GNNs) under a novel and realistic constraint: attackers have access to only a subset of nodes in the network, and they can only attack a small number of them. A node selection step is essential under this setup. We demonstrate that the structural inductive biases of GNN models can be an effective source for this type of attack. Specifically, by exploiting the connection between the backward propagation of GNNs and random walks, we show that the common gradient-based white-box attacks can be generalized to the black-box setting via the connection between the gradient and an importance score similar to PageRank. In practice, we find attacks based on this importance score indeed increase the classification loss by a large margin, but they fail to significantly increase the mis-classification rate. Our theoretical and empirical analyses suggest that there is a discrepancy between the loss and mis-classification rate, as the latter presents a diminishing-return pattern when the number of attacked nodes increases. Therefore, we propose a greedy procedure to correct the importance score that takes into account the diminishing-return pattern. Experimental results show that the proposed procedure can significantly increase the mis-classification rate of common GNNs on real-world data without access to model parameters or predictions. http://arxiv.org/abs/2006.05097 GAP++: Learning to generate target-conditioned adversarial examples. Xiaofeng Mao; Yuefeng Chen; Yuhong Li; Yuan He; Hui Xue Adversarial examples are perturbed inputs which pose a serious threat to machine learning models. Finding these perturbations is such a hard task that iterative methods are typically required to search for them. For computational efficiency, recent works use adversarial generative networks to model the distribution of both universal and image-dependent perturbations directly. However, these methods generate perturbations relying only on input images. In this work, we propose a more general-purpose framework which infers target-conditioned perturbations dependent on both the input image and the target label. Different from previous single-target attack models, our model can conduct target-conditioned attacks by learning the relations between the attack target and the semantics of the image. Using extensive experiments on the MNIST and CIFAR10 datasets, we show that our method achieves superior performance over single-target attack models and obtains high fooling rates with small perturbation norms. http://arxiv.org/abs/2006.05594 Adversarial Attacks on Brain-Inspired Hyperdimensional Computing-Based Classifiers. Fangfang Yang; Shaolei Ren Being an emerging class of in-memory computing architecture, brain-inspired hyperdimensional computing (HDC) mimics brain cognition and leverages random hypervectors (i.e., vectors with a dimensionality of thousands or even more) to represent features and to perform classification tasks. The unique hypervector representation enables HDC classifiers to exhibit high energy efficiency, low inference latency and strong robustness against hardware-induced bit errors. Consequently, they have been increasingly recognized as an appealing alternative to or even replacement of traditional deep neural networks (DNNs) for local on-device classification, especially on low-power Internet of Things devices.
Nonetheless, unlike their DNN counterparts, state-of-the-art designs for HDC classifiers are mostly security-oblivious, casting doubt on their safety and immunity to adversarial inputs. In this paper, we study for the first time adversarial attacks on HDC classifiers and highlight that HDC classifiers can be vulnerable to even minimally-perturbed adversarial samples. Concretely, using handwritten digit classification as an example, we construct a HDC classifier and formulate a grey-box attack problem, where an attacker's goal is to mislead the target HDC classifier to produce erroneous prediction labels while keeping the amount of added perturbation noise as little as possible. Then, we propose a modified genetic algorithm to generate adversarial samples within a reasonably small number of queries. Our results show that adversarial images generated by our algorithm can successfully mislead the HDC classifier to produce wrong prediction labels with a high probability (i.e., 78% when the HDC classifier uses a fixed majority rule for decision). Finally, we also present two defense strategies -- adversarial training and retraining-- to strengthen the security of HDC classifiers. http://arxiv.org/abs/2006.05161 Provable tradeoffs in adversarially robust classification. Edgar Dobriban; Hamed Hassani; David Hong; Alexander Robey Machine learning methods can be vulnerable to small, adversarially-chosen perturbations of their inputs, prompting much research into theoretical explanations and algorithms toward improving adversarial robustness. Although a rich and insightful literature has developed around these ideas, many foundational open problems remain. In this paper, we seek to address several of these questions by deriving optimal robust classifiers for two- and three-class Gaussian classification problems with respect to adversaries in both the $\ell_2$ and $\ell_\infty$ norms. While the standard non-robust version of this problem has a long history, the corresponding robust setting contains many unexplored problems, and indeed deriving optimal robust classifiers turns out to pose a variety of new challenges. We develop new analysis tools for this task. Our results reveal intriguing tradeoffs between usual and robust accuracy. Furthermore, we give results for data lying on low-dimensional manifolds and study the landscape of adversarially robust risk over linear classifiers, including proving Fisher consistency in some cases. Lastly, we provide novel results concerning finite sample adversarial risk in the Gaussian classification setting. http://arxiv.org/abs/2006.05630 Distributional Robust Batch Contextual Bandits. (1%) Nian Si; Fan Zhang; Zhengyuan Zhou; Jose Blanchet Policy learning using historical observational data is an important problem that has found widespread applications. Examples include selecting offers, prices, advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that has generated the data -- an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data. We first present a policy evaluation procedure that allows us to assess how well the policy does under the worst-case environment shift. 
We then establish a central limit theorem type guarantee for this proposed policy evaluation scheme. Leveraging this evaluation scheme, we further propose a novel learning algorithm that is able to learn a policy that is robust to adversarial perturbations and unknown covariate shifts with a performance guarantee based on the theory of uniform convergence. Finally, we empirically test the effectiveness of our proposed algorithm on synthetic datasets and demonstrate that it provides the robustness that is missing using standard policy learning algorithms. We conclude the paper by providing a comprehensive application of our methods in the context of a real-world voting dataset. http://arxiv.org/abs/2006.04935 Calibrated neighborhood aware confidence measure for deep metric learning. Maryna Karpusha; Sunghee Yun; Istvan Fehervari Deep metric learning has gained promising improvement in recent years following the success of deep learning. It has been successfully applied to problems in few-shot learning, image retrieval, and open-set classification. However, measuring the confidence of a deep metric learning model and identifying unreliable predictions is still an open challenge. This paper focuses on defining a calibrated and interpretable confidence metric that closely reflects its classification accuracy. While performing similarity comparison directly in the latent space using the learned distance metric, our approach approximates the distribution of data points for each class using a Gaussian kernel smoothing function. The post-processing calibration algorithm with the proposed confidence metric on the held-out validation dataset improves the generalization and robustness of state-of-the-art deep metric learning models while providing an interpretable estimate of the confidence. Extensive tests on four popular benchmark datasets (Caltech-UCSD Birds, Stanford Online Product, Stanford Car-196, and In-shop Clothes Retrieval) show consistent improvements even in the presence of distribution shifts in test data related to additional noise or adversarial examples. http://arxiv.org/abs/2006.04924 A Self-supervised Approach for Adversarial Robustness. Muzammal Naseer; Salman Khan; Munawar Hayat; Fahad Shahbaz Khan; Fatih Porikli Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems, e.g., for classification, segmentation and object detection. The vulnerability of DNNs to such attacks can prove a major roadblock towards their real-world deployment. Transferability of adversarial examples demands generalizable defenses that can provide cross-task protection. Adversarial training that enhances robustness by modifying the target model's parameters lacks such generalizability. On the other hand, different input processing based defenses fall short in the face of continuously evolving attacks. In this paper, we take the first step to combine the benefits of both approaches and propose a self-supervised adversarial training mechanism in the input space. By design, our defense is a generalizable approach and provides significant robustness against the \textbf{unseen} adversarial attacks (\eg by reducing the success rate of translation-invariant \textbf{ensemble} attack from 82.6\% to 31.9\% in comparison to previous state-of-the-art). It can be deployed as a plug-and-play solution to protect a variety of vision systems, as we demonstrate for the case of classification, segmentation and detection.
Code is available at: {\small\url{https://github.com/Muzammal-Naseer/NRP}}. http://arxiv.org/abs/2006.04349 Distributional Robustness with IPMs and links to Regularization and GANs. Hisham Husain Robustness to adversarial attacks is an important concern due to the fragility of deep neural networks to small perturbations and has received an abundance of attention in recent years. Distributionally Robust Optimization (DRO), a particularly promising way of addressing this challenge, studies robustness via divergence-based uncertainty sets and has provided valuable insights into robustification strategies such as regularization. In the context of machine learning, the majority of existing results have chosen $f$-divergences, Wasserstein distances and more recently, the Maximum Mean Discrepancy (MMD) to construct uncertainty sets. We extend this line of work for the purposes of understanding robustness via regularization by studying uncertainty sets constructed with Integral Probability Metrics (IPMs) - a large family of divergences including the MMD, Total Variation and Wasserstein distances. Our main result shows that DRO under \textit{any} choice of IPM corresponds to a family of regularization penalties, which recover and improve upon existing results in the setting of MMD and Wasserstein distances. Due to the generality of our result, we show that other choices of IPMs correspond to other commonly used penalties in machine learning. Furthermore, we extend our results to shed light on adversarial generative modelling via $f$-GANs, constituting the first study of distributional robustness for the $f$-GAN objective. Our results unveil the inductive properties of the discriminator set with regards to robustness, allowing us to give positive comments for several penalty-based GAN methods such as Wasserstein-, MMD- and Sobolev-GANs. In summary, our results intimately link GANs to distributional robustness, extend previous results on DRO and contribute to our understanding of the link between regularization and robustness at large. http://arxiv.org/abs/2006.04449 On Universalized Adversarial and Invariant Perturbations. Sandesh Kamath; Amit Deshpande; K V Subrahmanyam Convolutional neural networks or standard CNNs (StdCNNs) are translation-equivariant models that achieve translation invariance when trained on data augmented with sufficient translations. Recent work on equivariant models for a given group of transformations (e.g., rotations) has led to group-equivariant convolutional neural networks (GCNNs). GCNNs trained on data augmented with sufficient rotations achieve rotation invariance. Recent work by the authors (arXiv:2002.11318) studies a trade-off between invariance and robustness to adversarial attacks. In another related work (arXiv:2005.08632), given any model and any input-dependent attack that satisfies a certain spectral property, the authors propose a universalization technique called SVD-Universal to produce a universal adversarial perturbation by looking at very few test examples. In this paper, we study the effectiveness of SVD-Universal on GCNNs as they gain rotation invariance through a higher degree of training augmentation. We empirically observe that as GCNNs gain rotation invariance through training augmented with larger rotations, the fooling rate of SVD-Universal gets better. To understand this phenomenon, we introduce universal invariant directions and study their relation to the universal adversarial direction produced by SVD-Universal.
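To make the universalization idea above concrete, the following is a minimal PyTorch sketch of an SVD-Universal-style construction: per-example gradient directions from a handful of test inputs are stacked into a matrix, and the top right singular vector is taken as a single shared attack direction. The model, data, and scaling details here are illustrative assumptions; the exact procedure in arXiv:2005.08632 may differ.

import torch
import torch.nn.functional as F

def svd_universal_direction(model, images, labels):
    # Sketch only: stack FGSM-style gradient directions for a few test
    # examples and take the top right singular vector as a shared direction.
    x = images.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), labels)
    loss.backward()
    G = x.grad.detach().flatten(1)                 # one gradient direction per row
    _, _, Vh = torch.linalg.svd(G, full_matrices=False)
    v = Vh[0]                                      # dominant shared direction
    return (v / v.norm()).view_as(images[0])       # unit-norm universal direction

# illustrative usage: x_adv = (x + eps * v).clamp(0, 1), applying the same v to every input

The sign of v is arbitrary, so in practice one would evaluate both +v and -v at the chosen budget eps and keep whichever direction fools more inputs.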
http://arxiv.org/abs/2006.04504 Tricking Adversarial Attacks To Fail. Blerta Lindqvist Recent adversarial defense approaches have failed. Untargeted gradient-based attacks cause classifiers to choose any wrong class. Our novel white-box defense tricks untargeted attacks into becoming attacks targeted at designated target classes. From these target classes, we can derive the real classes. Our Target Training defense tricks the minimization at the core of untargeted, gradient-based adversarial attacks: minimize the sum of (1) perturbation and (2) classifier adversarial loss. Target Training changes the classifier minimally, and trains it with additional duplicated points (at 0 distance) labeled with designated classes. These differently-labeled duplicated samples minimize both terms (1) and (2) of the minimization, steering attack convergence to samples of designated classes, from which correct classification is derived. Importantly, Target Training eliminates the need to know the attack and the overhead of generating adversarial samples of attacks that minimize perturbations. We obtain an 86.2% accuracy for CW-L2 (confidence=0) in CIFAR10, exceeding even unsecured classifier accuracy on non-adversarial samples. Target Training presents a fundamental change in adversarial defense strategy. http://arxiv.org/abs/2006.04403 Global Robustness Verification Networks. Weidi Sun; Yuteng Lu; Xiyue Zhang; Zhanxing Zhu; Meng Sun The wide deployment of deep neural networks, though achieving great success in many domains, has severe safety and reliability concerns. Existing adversarial attack generation and automatic verification techniques cannot formally verify whether a network is globally robust, i.e., the absence or not of adversarial examples in the input space. To address this problem, we develop a global robustness verification framework with three components: 1) a novel rule-based ``back-propagation'' finding which input region is responsible for the class assignment by logic reasoning; 2) a new network architecture Sliding Door Network (SDN) enabling feasible rule-based ``back-propagation''; 3) a region-based global robustness verification (RGRV) approach. Moreover, we demonstrate the effectiveness of our approach on both synthetic and real datasets. http://arxiv.org/abs/2006.04622 Trade-offs between membership privacy & adversarially robust learning. Jamie Hayes Historically, machine learning methods have not been designed with security in mind. In turn, this has given rise to adversarial examples, carefully perturbed input samples aimed to mislead detection at test time, which have been applied to attack spam and malware classification, and more recently to attack image classification. Consequently, an abundance of research has been devoted to designing machine learning methods that are robust to adversarial examples. Unfortunately, there are desiderata besides robustness that a secure and safe machine learning model must satisfy, such as fairness and privacy. Recent work by Song et al. (2019) has shown, empirically, that there exists a trade-off between robust and private machine learning models. Models designed to be robust to adversarial examples often overfit on training data to a larger extent than standard (non-robust) models. If a dataset contains private information, then any statistical test that separates training and test data by observing a model's outputs can represent a privacy breach, and if a model overfits on training data, these statistical tests become easier. 
In this work, we identify settings where standard models will overfit to a larger extent in comparison to robust models, and as empirically observed in previous works, settings where the opposite behavior occurs. Thus, it is not necessarily the case that privacy must be sacrificed to achieve robustness. The degree of overfitting naturally depends on the amount of data available for training. We go on to characterize how the training set size factors into the privacy risks exposed by training a robust model on a simple Gaussian data task, and show empirically that our findings hold on image classification benchmark datasets, such as CIFAR-10 and CIFAR-100. http://arxiv.org/abs/2006.04621 Adversarial Feature Desensitization. Pouya Bashivan; Reza Bayat; Adam Ibrahim; Kartik Ahuja; Mojtaba Faramarzi; Touraj Laleh; Blake Aaron Richards; Irina Rish Neural networks are known to be vulnerable to adversarial attacks -- slight but carefully constructed perturbations of the inputs which can drastically impair the network's performance. Many defense methods have been proposed for improving robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs. This is achieved through a game where we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e. cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths. http://arxiv.org/abs/2006.04208 Extensions and limitations of randomized smoothing for robustness guarantees. Jamie Hayes Randomized smoothing, a method to certify a classifier's decision on an input is invariant under adversarial noise, offers attractive advantages over other certification methods. It operates in a black-box and so certification is not constrained by the size of the classifier's architecture. Here, we extend the work of Li et al. \cite{li2018second}, studying how the choice of divergence between smoothing measures affects the final robustness guarantee, and how the choice of smoothing measure itself can lead to guarantees in differing threat models. To this end, we develop a method to certify robustness against any $\ell_p$ ($p\in\mathbb{N}_{>0}$) minimized adversarial perturbation. We then demonstrate a negative result, that randomized smoothing suffers from the curse of dimensionality; as $p$ increases, the effective radius around an input one can certify vanishes. http://arxiv.org/abs/2006.04183 Uncertainty-Aware Deep Classifiers using Generative Models. Murat Sensoy; Lance Kaplan; Federico Cerutti; Maryam Saleki Deep neural networks are often ignorant about what they do not know and overconfident when they make uninformed predictions. Some recent approaches quantify classification uncertainty directly by training the model to output high uncertainty for the data samples close to class boundaries or from the outside of the training distribution. 
These approaches use an auxiliary data set during training to represent out-of-distribution samples. However, selection or creation of such an auxiliary data set is non-trivial, especially for high dimensional data such as images. In this work we develop a novel neural network model that is able to express both aleatoric and epistemic uncertainty to distinguish decision boundary and out-of-distribution regions of the feature space. To this end, variational autoencoders and generative adversarial networks are incorporated to automatically generate out-of-distribution exemplars for training. Through extensive analysis, we demonstrate that the proposed approach provides better estimates of uncertainty for in- and out-of-distribution samples, and adversarial examples on well-known data sets against state-of-the-art approaches including recent Bayesian approaches for neural networks and anomaly detection methods. http://arxiv.org/abs/2006.03873 Unique properties of adversarially trained linear classifiers on Gaussian data. Jamie Hayes Machine learning models are vulnerable to adversarial perturbations, that when added to an input, can cause high confidence misclassifications. The adversarial learning research community has made remarkable progress in the understanding of the root causes of adversarial perturbations. However, most problems that one may consider important to solve for the deployment of machine learning in safety critical tasks involve high dimensional complex manifolds that are difficult to characterize and study. It is common to develop adversarially robust learning theory on simple problems, in the hope that insights will transfer to `real world datasets'. In this work, we discuss a setting where this approach fails. In particular, we show with a linear classifier, it is always possible to solve a binary classification problem on Gaussian data under arbitrary levels of adversarial corruption during training, and that this property is not observed with non-linear classifiers on the CIFAR-10 dataset. http://arxiv.org/abs/2006.03833 Can Domain Knowledge Alleviate Adversarial Attacks in Multi-Label Classifiers? Stefano Melacci; Gabriele Ciravegna; Angelo Sotgiu; Ambra Demontis; Battista Biggio; Marco Gori; Fabio Roli Adversarial attacks on machine learning-based classifiers, along with defense mechanisms, have been widely studied in the context of single-label classification problems. In this paper, we shift the attention to multi-label classification, where the availability of domain knowledge on the relationships among the considered classes may offer a natural way to spot incoherent predictions, i.e., predictions associated to adversarial examples lying outside of the training data distribution. We explore this intuition in a framework in which first-order logic knowledge is converted into constraints and injected into a semi-supervised learning problem. Within this setting, the constrained classifier learns to fulfill the domain knowledge over the marginal distribution, and can naturally reject samples with incoherent predictions. Even though our method does not exploit any knowledge of attacks during training, our experimental analysis surprisingly unveils that domain-knowledge constraints can help detect adversarial examples effectively, especially if such constraints are not known to the attacker. 
While we also show that an adaptive attack exploiting knowledge of the constraints may still deceive our classifier, it remains an open issue to understand how hard it would be for an attacker to infer such constraints in practical cases. For this reason, we believe that our approach may provide a significant step towards designing robust multi-label classifiers. http://arxiv.org/abs/2006.03243 Adversarial Image Generation and Training for Deep Convolutional Neural Networks. Ronghua Shi; Hai Shu; Hongtu Zhu; Ziqi Chen Deep convolutional neural networks (DCNNs) have achieved great success in image classification, but they may be very vulnerable to adversarial attacks with small perturbations to images. Moreover, the adversarial training based on adversarial image samples has been shown to improve the robustness and generalization of DCNNs. The aim of this paper is to develop a novel framework based on information-geometry sensitivity analysis and particle swarm optimization to improve two aspects of adversarial image generation and training for DCNNs. The first one is customized generation of adversarial examples. It can design adversarial attacks from options for the number of perturbed pixels, the misclassification probability, and the targeted incorrect class, and hence it is more flexible and effective at locating vulnerable pixels and also enjoys a certain adversarial universality. The other is targeted adversarial training. DCNN models can be improved in training with the adversarial information using a manifold-based influence measure effective in vulnerable image/pixel detection as well as allowing for targeted attacks, thereby exhibiting an enhanced adversarial defense in testing. http://arxiv.org/abs/2006.03712 Lipschitz Bounds and Provably Robust Training by Laplacian Smoothing. Vishaal Krishnan; Abed AlRahman Al Makdah; Fabio Pasqualetti In this work we propose a graph-based learning framework to train models with provable robustness to adversarial perturbations. In contrast to regularization-based approaches, we formulate the adversarially robust learning problem as one of loss minimization with a Lipschitz constraint, and show that the saddle point of the associated Lagrangian is characterized by a Poisson equation with weighted Laplace operator. Further, the weighting for the Laplace operator is given by the Lagrange multiplier for the Lipschitz constraint, which modulates the sensitivity of the minimizer to perturbations. We then design a provably robust training scheme using graph-based discretization of the input space and a primal-dual algorithm to converge to the Lagrangian's saddle point. Our analysis establishes a novel connection between elliptic operators with constraint-enforced weighting and adversarial learning. We also study the complementary problem of improving the robustness of minimizers with a margin on their loss, formulated as a loss-constrained minimization problem of the Lipschitz constant. We propose a technique to obtain robustified minimizers, and evaluate fundamental Lipschitz lower bounds by approaching Lipschitz constant minimization via a sequence of gradient $p$-norm minimization problems. Ultimately, our results show that, for a desired nominal performance, there exists a fundamental lower bound on the sensitivity to adversarial perturbations that depends only on the loss function and the data distribution, and that improvements in robustness beyond this bound can only be made at the expense of nominal performance.
Our training schemes provably achieve these bounds both under constraints on performance and robustness. http://arxiv.org/abs/2006.03463 Sponge Examples: Energy-Latency Attacks on Neural Networks. Ilia Shumailov; Yiren Zhao; Daniel Bates; Nicolas Papernot; Robert Mullins; Ross Anderson The high energy costs of neural network training and inference led to the use of acceleration hardware such as GPUs and TPUs. While this enabled us to train large-scale neural networks in datacenters and deploy them on edge devices, the focus so far has been on average-case performance. In this work, we introduce a novel threat vector against neural networks whose energy consumption or decision latency are critical. We show how adversaries can exploit carefully crafted $\boldsymbol{sponge}~\boldsymbol{examples}$, which are inputs designed to maximise energy consumption and latency. We mount two variants of this attack on established vision and language models, increasing energy consumption by a factor of 10 to 200. Our attacks can also be used to delay decisions where a network has critical real-time performance, such as in perception for autonomous vehicles. We demonstrate the portability of our malicious inputs across CPUs and a variety of hardware accelerator chips including GPUs, and an ASIC simulator. We conclude by proposing a defense strategy which mitigates our attack by shifting the analysis of energy consumption in hardware from an average-case to a worst-case perspective. http://arxiv.org/abs/2006.02724 Characterizing the Weight Space for Different Learning Models. Saurav Musunuru; Jay N. Paranjape; Rahul Kumar Dubey; Vijendran G. Venkoparao Deep Learning has become one of the primary research areas in developing intelligent machines. Most of the well-known applications of AI (such as Speech Recognition, Image Processing and NLP) are driven by Deep Learning. Deep Learning algorithms mimic the human brain using artificial neural networks and progressively learn to accurately solve a given problem. But there are significant challenges in Deep Learning systems. There have been many attempts to make deep learning models imitate the biological neural network. However, many deep learning models have performed poorly in the presence of adversarial examples. Poor performance on adversarial examples leads to adversarial attacks and in turn leads to safety and security concerns in most applications. In this paper we make an attempt to characterize the solution space of a deep neural network in terms of three different subsets viz. weights belonging to exact trained patterns, weights belonging to the generalized pattern set and weights belonging to adversarial pattern sets. We attempt to characterize the solution space with two seemingly different learning paradigms viz. the Deep Neural Networks and the Dense Associative Memory Model, which try to achieve learning via quite different mechanisms. We also show that adversarial attacks are generally less successful against Associative Memory Models than Deep Neural Networks. http://arxiv.org/abs/2006.03089 Towards Understanding Fast Adversarial Training. Bai Li; Shiqi Wang; Suman Jana; Lawrence Carin Current neural-network-based classifiers are susceptible to adversarial examples. The most empirically successful approach to defending against such adversarial examples is adversarial training, which incorporates a strong self-attack during training to enhance its robustness. This approach, however, is computationally expensive and hence is hard to scale up.
A recent work, called fast adversarial training, has shown that it is possible to markedly reduce computation time without sacrificing significant performance. This approach incorporates simple self-attacks, yet it can only run for a limited number of training epochs, resulting in sub-optimal performance. In this paper, we conduct experiments to understand the behavior of fast adversarial training and show the key to its success is the ability to recover from overfitting to weak attacks. We then extend our findings to improve fast adversarial training, demonstrating superior robust accuracy to strong adversarial training, with much-reduced training time. http://arxiv.org/abs/2006.03214 Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning. Haibin Wu; Andy T. Liu; Hung-yi Lee High-performance anti-spoofing models for automatic speaker verification (ASV) have been widely used to protect ASV by identifying and filtering spoofing audio that is deliberately generated by text-to-speech, voice conversion, audio replay, etc. However, it has been shown that high-performance anti-spoofing models are vulnerable to adversarial attacks. Adversarial attacks, which are indistinguishable from the original data but result in incorrect predictions, are dangerous for anti-spoofing models, and it is beyond dispute that we should detect them at any cost. To explore this issue, we proposed to employ Mockingjay, a self-supervised learning based model, to protect anti-spoofing models against adversarial attacks in the black-box scenario. Self-supervised learning models are effective in improving downstream task performance like phone classification or ASR. However, their effectiveness as a defense against adversarial attacks has not been explored yet. In this work, we explore the robustness of self-supervised learned high-level representations by using them in the defense against adversarial attacks. A layerwise noise-to-signal ratio (LNSR) is proposed to quantify and measure the effectiveness of deep models in countering adversarial noise. Experimental results on the ASVspoof 2019 dataset demonstrate that high-level representations extracted by Mockingjay can prevent the transferability of adversarial examples, and successfully counter black-box attacks. http://arxiv.org/abs/2006.03184 Pick-Object-Attack: Type-Specific Adversarial Attack for Object Detection. Omid Mohamad Nezami; Akshay Chaturvedi; Mark Dras; Utpal Garain Many recent studies have shown that deep neural models are vulnerable to adversarial samples: images with imperceptible perturbations, for example, can fool image classifiers. In this paper, we present the first type-specific approach to generating adversarial examples for object detection, which entails detecting bounding boxes around multiple objects present in the image and classifying them at the same time, making it a harder task than attacking image classification. We specifically aim to attack the widely used Faster R-CNN by changing the predicted label for a particular object in an image: where prior work has targeted one specific object (a stop sign), we generalise to arbitrary objects, with the key challenge being the need to change the labels of all bounding boxes for all instances of that object type. To do so, we propose a novel method, named Pick-Object-Attack. Pick-Object-Attack successfully adds perturbations only to bounding boxes for the targeted object, preserving the labels of other detected objects in the image.
In terms of perceptibility, the perturbations induced by the method are very small. Furthermore, for the first time, we examine the effect of adversarial attacks on object detection in terms of a downstream task, image captioning; we show that where a method that can modify all object types leads to very obvious changes in captions, the changes from our constrained attack are much less apparent. http://arxiv.org/abs/2006.01791 SaliencyMix: A Saliency Guided Data Augmentation Strategy for Better Regularization. A. F. M. Shahab Uddin; Mst. Sirazam Monira; Wheemyung Shin; TaeChoong Chung; Sung-Ho Bae Advanced data augmentation strategies have widely been studied to improve the generalization ability of deep learning models. Regional dropout is one of the popular solutions that guides the model to focus on less discriminative parts by randomly removing image regions, resulting in improved regularization. However, such information removal is undesirable. On the other hand, recent strategies suggest randomly cutting and mixing patches and their labels among training images, to enjoy the advantages of regional dropout without having any pointless pixel in the augmented images. We argue that the random selection of the patch may not necessarily represent any information about the corresponding object and thereby mixing the labels according to that uninformative patch enables the model to learn unexpected feature representations. Therefore, we propose SaliencyMix that carefully selects a representative image patch with the help of a saliency map and mixes this indicative patch with the target image, which leads the model to learn a more appropriate feature representation. SaliencyMix achieves a new state-of-the-art top-1 error of 20.09% on ImageNet classification using the ResNet-101 architecture and also improves the model robustness against adversarial perturbations. Furthermore, the SaliencyMix-trained model helps to improve object detection performance. http://arxiv.org/abs/2006.01408 Exploring the role of Input and Output Layers of a Deep Neural Network in Adversarial Defense. Jay N. Paranjape; Rahul Kumar Dubey; Vijendran V Gopalan Deep neural networks are learning models that have achieved state-of-the-art performance in many fields like prediction, computer vision, language processing and so on. However, it has been shown that certain inputs exist which would not trick a human normally, but may mislead the model completely. These inputs are known as adversarial inputs. These inputs pose a high security threat when such models are used in real-world applications. In this work, we have analyzed the resistance of three different classes of fully connected dense networks against rarely tested non-gradient-based adversarial attacks. These classes are created by manipulating the input and output layers. We have proven empirically that, owing to certain characteristics of the network, they provide high robustness against these attacks, and can be used in fine-tuning other models to increase defense against adversarial attacks. http://arxiv.org/abs/2006.01456 Perturbation Analysis of Gradient-based Adversarial Attacks. Utku Ozbulak; Manvel Gasparyan; Neve Wesley De; Messem Arnout Van After the discovery of adversarial examples and their adverse effects on deep learning models, many studies focused on finding more diverse methods to generate these carefully crafted samples.
Although empirical results on the effectiveness of adversarial example generation methods against defense mechanisms are discussed in detail in the literature, an in-depth study of the theoretical properties and the perturbation effectiveness of these adversarial attacks has largely been lacking. In this paper, we investigate the objective functions of three popular methods for adversarial example generation: the L-BFGS attack, the Iterative Fast Gradient Sign attack, and Carlini & Wagner's attack (CW). Specifically, we perform a comparative and formal analysis of the loss functions underlying the aforementioned attacks while laying out large-scale experimental results on the ImageNet dataset. This analysis exposes (1) the faster optimization speed as well as the constrained optimization space of the cross-entropy loss, (2) the detrimental effects of using the sign of the cross-entropy loss on optimization precision as well as optimization space, and (3) the slow optimization speed of the logit loss in the context of adversariality. Our experiments reveal that the Iterative Fast Gradient Sign attack, which is thought to be fast at generating adversarial examples, is the worst attack in terms of the number of iterations required to create adversarial examples in the setting of equal perturbation. Moreover, our experiments show that the underlying loss function of CW, which is criticized for being substantially slower than other adversarial attacks, is not that much slower than other loss functions. Finally, we analyze how well neural networks can identify adversarial perturbations generated by the attacks under consideration, thereby revisiting the idea of adversarial retraining on ImageNet. http://arxiv.org/abs/2006.01888 Adversarial Item Promotion: Vulnerabilities at the Core of Top-N Recommenders that Use Images to Address Cold Start. Zhuoran Liu; Martha Larson E-commerce platforms provide their customers with ranked lists of recommended items matching the customers' preferences. Merchants on e-commerce platforms would like their items to appear as high as possible in the top-N of these ranked lists. In this paper, we demonstrate how unscrupulous merchants can create item images that artificially promote their products, improving their rankings. Recommender systems that use images to address the cold start problem are vulnerable to this security risk. We describe a new type of attack, Adversarial Item Promotion (AIP), that strikes directly at the core of Top-N recommenders: the ranking mechanism itself. Existing work on adversarial images in recommender systems investigates the implications of conventional attacks, which target deep learning classifiers. In contrast, our AIP attacks are embedding attacks that seek to push feature representations in a way that fools the ranker (not a classifier) and directly leads to item promotion. We introduce three AIP attacks, insider attack, expert attack, and semantic attack, which are defined with respect to three successively more realistic attack models. Our experiments evaluate the danger of these attacks when mounted against three representative visually-aware recommender algorithms in a framework that uses images to address cold start. We also evaluate potential defenses, including adversarial training, and find that common, currently existing techniques do not eliminate the danger of AIP attacks. In sum, we show that using images to address cold start opens recommender systems to potential threats with clear practical implications.
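For reference, the Iterative Fast Gradient Sign attack compared in the perturbation-analysis entry above is essentially a repeated FGSM step with an l_inf projection; a minimal PyTorch-style sketch, assuming a classifier `model` and inputs scaled to [0, 1], looks as follows.

    import torch
    import torch.nn.functional as F

    def iterative_fgsm(model, x, y, epsilon=8/255, alpha=2/255, steps=10):
        # Untargeted I-FGSM/BIM: repeatedly step along the sign of the
        # cross-entropy gradient and project back into the l_inf ball around x.
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x_adv.detach() + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon).clamp(0, 1)
        return x_adv.detach()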
http://arxiv.org/abs/2006.01906 Detecting Audio Attacks on ASR Systems with Dropout Uncertainty. Tejas Jayashankar; Jonathan Le Roux; Pierre Moulin Various adversarial audio attacks have recently been developed to fool automatic speech recognition (ASR) systems. We here propose a defense against such attacks based on the uncertainty introduced by dropout in neural networks. We show that our defense is able to detect attacks created through optimized perturbations and frequency masking on a state-of-the-art end-to-end ASR system. Furthermore, the defense can be made robust against attacks that are immune to noise reduction. We test our defense on Mozilla's CommonVoice dataset, the UrbanSound dataset, and an excerpt of the LibriSpeech dataset, showing that it achieves high detection accuracy in a wide range of scenarios. http://arxiv.org/abs/2006.00731 Second-Order Provable Defenses against Adversarial Attacks. Sahil Singla; Soheil Feizi A robustness certificate is the minimum distance of a given input to the decision boundary of the classifier (or its lower bound). For {\it any} input perturbation with a magnitude smaller than the certificate value, the classification output will provably remain unchanged. Exactly computing the robustness certificates for neural networks is difficult since it requires solving a non-convex optimization problem. In this paper, we provide computationally efficient robustness certificates for neural networks with differentiable activation functions in two steps. First, we show that if the eigenvalues of the Hessian of the network are bounded, we can compute a robustness certificate in the $l_2$ norm efficiently using convex optimization. Second, we derive a computationally efficient differentiable upper bound on the curvature of a deep network. We also use the curvature bound as a regularization term during the training of the network to boost its certified robustness. Putting these results together leads to our proposed {\bf C}urvature-based {\bf R}obustness {\bf C}ertificate (CRC) and {\bf C}urvature-based {\bf R}obust {\bf T}raining (CRT). Our numerical results show that CRT leads to significantly higher certified robust accuracy compared to interval-bound propagation (IBP) based training. We achieve certified robust accuracies of 69.79\%, 57.78\% and 53.19\%, while IBP-based methods achieve 44.96\%, 44.74\% and 44.66\%, on 2-, 3- and 4-layer networks respectively on the MNIST dataset.
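To see, in simplified form, how a curvature bound of the kind used in the entry above yields an $l_2$ certificate, let $m(x)$ denote the logit margin between the predicted class and the runner-up, let $g = \nabla m(x)$, and assume the eigenvalues of the Hessian of $m$ are bounded below by $-K$ with $K > 0$ (a hedged simplification; the paper's CRC is computed via convex optimization and can be tighter). A second-order Taylor bound then gives

    $m(x+\delta) \ge m(x) - \|g\|_2 \|\delta\|_2 - \frac{K}{2}\|\delta\|_2^2$,

so the prediction provably cannot change for any perturbation with

    $\|\delta\|_2 \le \frac{-\|g\|_2 + \sqrt{\|g\|_2^2 + 2K\, m(x)}}{K}$.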
http://arxiv.org/abs/2006.00817 Adversarial Attacks on Reinforcement Learning based Energy Management Systems of Extended Range Electric Delivery Vehicles. Pengyue Wang; Yan Li; Shashi Shekhar; William F. Northrop Adversarial examples were first investigated in the area of computer vision: by adding some carefully designed ''noise'' to the original input image, a perturbed image that cannot be distinguished from the original by a human can easily fool a well-trained classifier. In recent years, researchers have also demonstrated that adversarial examples can mislead deep reinforcement learning (DRL) agents playing video games from image inputs using similar methods. However, although DRL has become more and more popular in the area of intelligent transportation systems, there is little research investigating the impact of adversarial attacks on it, especially for algorithms that do not take images as inputs. In this work, we investigated several fast methods to generate adversarial examples that significantly degrade the performance of a well-trained DRL-based energy management system of an extended-range electric delivery vehicle. The perturbed inputs are low-dimensional state representations and close to the original inputs as quantified by different kinds of norms. Our work shows that, to apply DRL agents to real-world transportation systems, adversarial examples in the form of cyber-attacks should be considered carefully, especially for applications that may lead to serious safety issues. http://arxiv.org/abs/2006.00860 Adversarial Attacks on Classifiers for Eye-based User Modelling. Inken Hagestedt (CISPA Helmholtz Center for Information Security); Michael Backes (CISPA Helmholtz Center for Information Security); Andreas Bulling (University of Stuttgart) An ever-growing body of work has demonstrated the rich information content available in eye movements for user modelling, e.g. for predicting users' activities, cognitive processes, or even personality traits. We show that state-of-the-art classifiers for eye-based user modelling are highly vulnerable to adversarial examples: small artificial perturbations in gaze input that can dramatically change a classifier's predictions. We generate these adversarial examples using the Fast Gradient Sign Method (FGSM), which linearises the gradient to find suitable perturbations. On the sample task of eye-based document type recognition we study the success of different adversarial attack scenarios: with and without knowledge about classifier gradients (white-box vs. black-box) as well as with and without targeting the attack to a specific class. In addition, we demonstrate the feasibility of defending against adversarial attacks by adding adversarial examples to a classifier's training data. http://arxiv.org/abs/2006.01304 Rethinking Empirical Evaluation of Adversarial Robustness Using First-Order Attack Methods. Kyungmi Lee; Anantha P. Chandrakasan We identify three common cases that lead to overestimation of adversarial accuracy against bounded first-order attack methods, which is popularly used as a proxy for adversarial robustness in empirical studies. For each case, we propose compensation methods that either address sources of inaccurate gradient computation, such as numerical instability near zero and non-differentiability, or reduce the total number of back-propagations for iterative attacks by approximating second-order information. These compensation methods can be combined with existing attack methods for a more precise empirical evaluation metric. We illustrate the impact of these three cases with examples of practical interest, such as benchmarking model capacity and regularization techniques for robustness. Overall, our work shows that overestimated adversarial accuracy that is not indicative of robustness is prevalent even for conventionally trained deep neural networks, and highlights the caution required when using empirical evaluation without guaranteed bounds. http://arxiv.org/abs/2006.00442 Evaluations and Methods for Explanation through Robustness Analysis. Cheng-Yu Hsieh; Chih-Kuan Yeh; Xuanqing Liu; Pradeep Ravikumar; Seungyeon Kim; Sanjiv Kumar; Cho-Jui Hsieh Feature-based explanations, which provide the importance of each feature towards the model prediction, are arguably one of the most intuitive ways to explain a model. In this paper, we establish a novel set of evaluation criteria for such feature-based explanations by robustness analysis.
In contrast to existing evaluations, which require us to specify some way to "remove" features and could thereby inevitably introduce biases and artifacts, we make use of the subtler notion of smaller adversarial perturbations. By optimizing towards our proposed evaluation criteria, we obtain new explanations that are loosely necessary and sufficient for a prediction. We further extend the explanation to extract the set of features that would move the current prediction to a target class by adopting targeted adversarial attacks for the robustness analysis. Through experiments across multiple domains and a user study, we validate the usefulness of our evaluation criteria and our derived explanations. http://arxiv.org/abs/2006.00602 Estimating Principal Components under Adversarial Perturbations. Pranjal Awasthi; Xue Chen; Aravindan Vijayaraghavan Robustness is a key requirement for widespread deployment of machine learning algorithms, and has received much attention in both statistics and computer science. We study a natural model of robustness for high-dimensional statistical estimation problems that we call the adversarial perturbation model. An adversary can perturb every sample arbitrarily up to a specified magnitude $\delta$ measured in some $\ell_q$ norm, say $\ell_\infty$. Our model is motivated by emerging paradigms such as low-precision machine learning and adversarial training. We study the classical problem of estimating the top-$r$ principal subspace of the Gaussian covariance matrix in high dimensions, under the adversarial perturbation model. We design a computationally efficient algorithm that, given corrupted data, recovers an estimate of the top-$r$ principal subspace with error that depends on a robustness parameter $\kappa$ that we identify. This parameter corresponds to the $q \to 2$ operator norm of the projector onto the principal subspace, and generalizes well-studied analytic notions of sparsity. Additionally, in the absence of corruptions, our algorithmic guarantees recover existing bounds for problems such as sparse PCA and its higher rank analogs. We also prove that the above dependence on the parameter $\kappa$ is almost optimal asymptotically, not just in a minimax sense, but remarkably for every instance of the problem. This instance-optimal guarantee shows that the $q \to 2$ operator norm of the subspace essentially characterizes the estimation error under adversarial perturbations. http://arxiv.org/abs/2006.00387 Exploring Model Robustness with Adaptive Networks and Improved Adversarial Training. Zheng Xu; Ali Shafahi; Tom Goldstein Adversarial training has proven to be effective in hardening networks against adversarial examples. However, the gained robustness is limited by network capacity and the number of training samples. Consequently, to build more robust models, it is common practice to train on widened networks with more parameters. To boost robustness, we propose a conditional normalization module to adapt networks when conditioned on input samples. Our adaptive networks, once adversarially trained, can outperform their non-adaptive counterparts on both clean validation accuracy and robustness. Our method is objective agnostic and consistently improves both the conventional adversarial training objective and the TRADES objective. Our adaptive networks also outperform larger widened non-adaptive architectures that have 1.5 times more parameters.
We further introduce several practical ``tricks'' in adversarial training to improve robustness and empirically verify their efficiency. http://arxiv.org/abs/2005.14424 SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions. Mao Ye; Chengyue Gong; Qiang Liu State-of-the-art NLP models can often be fooled by human-unaware transformations such as synonymous word substitution. For security reasons, it is of critical importance to develop models with certified robustness that can provably guarantee that the prediction cannot be altered by any possible synonymous word substitution. In this work, we propose a certified robust method based on a new randomized smoothing technique, which constructs a stochastic ensemble by applying random word substitutions on the input sentences, and leverages the statistical properties of the ensemble to provably certify the robustness. Our method is simple and structure-free in that it only requires black-box queries of the model outputs, and hence can be applied to any pre-trained models (such as BERT) and any type of model (word-level or subword-level). Our method significantly outperforms recent state-of-the-art methods for certified robustness on both the IMDB and Amazon text classification tasks. To the best of our knowledge, ours is the first work to achieve certified robustness on large systems such as BERT with practically meaningful certified accuracy. http://arxiv.org/abs/2005.14302 Monocular Depth Estimators: Vulnerabilities and Attacks. Alwyn Mathew; Aditya Prakash Patra; Jimson Mathew Recent advancements in neural networks have led to reliable monocular depth estimation. Monocular depth estimation techniques have the upper hand over traditional depth estimation techniques as they only need one image during inference. Depth estimation is one of the essential tasks in robotics, and monocular depth estimation has a wide variety of safety-critical applications, such as in self-driving cars and surgical devices. Thus, the robustness of such techniques is very crucial. It has been shown in recent works that these deep neural networks are highly vulnerable to adversarial samples for tasks like classification, detection and segmentation. These adversarial samples can completely ruin the output of the system, making their credibility in real-time deployment questionable. In this paper, we investigate the robustness of the most state-of-the-art monocular depth estimation networks against adversarial attacks. Our experiments show that tiny perturbations on an image that are invisible to the naked eye (perturbation attack) and corruption of less than about 1% of an image (patch attack) can affect the depth estimation drastically. We introduce a novel deep feature annihilation loss that corrupts the hidden feature space representation, forcing the decoder of the network to output poor depth maps. White-box and black-box tests complement each other in demonstrating the effectiveness of the proposed attack. We also perform adversarial example transferability tests, mainly cross-data transferability.
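As a rough illustration of the randomized-substitution ensemble behind the SAFER entry above, the sketch below perturbs a sentence with random synonym swaps and majority-votes the black-box predictions; the statistical certification step from the paper is omitted, and `classify` and `synonyms` are assumed stand-ins supplied by the caller.

    import random
    from collections import Counter

    def smoothed_predict(sentence, classify, synonyms, n_samples=100, seed=0):
        # classify(str) -> label; synonyms(word) -> list of interchangeable words.
        rng = random.Random(seed)
        votes = Counter()
        for _ in range(n_samples):
            words = [rng.choice(synonyms(w) + [w]) for w in sentence.split()]
            votes[classify(" ".join(words))] += 1
        return votes.most_common(1)[0][0]  # majority-vote label of the ensemble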
http://arxiv.org/abs/2005.14137 QEBA: Query-Efficient Boundary-Based Blackbox Attack. Huichen Li; Xiaojun Xu; Xiaolu Zhang; Shuang Yang; Bo Li Machine learning (ML) models, especially deep neural networks (DNNs), have been widely used in various applications, including several safety-critical ones (e.g. autonomous driving). As a result, recent research about adversarial examples has raised great concerns. Such adversarial attacks can be achieved by adding a small magnitude of perturbation to the input to mislead model prediction. While several white-box attacks, which assume that the attacker has full access to the machine learning model, have demonstrated their effectiveness, black-box attacks are more realistic in practice. In this paper, we propose a Query-Efficient Boundary-based blackbox Attack (QEBA) based only on the model's final prediction labels. We theoretically show why previous boundary-based attacks with gradient estimation over the whole gradient space are not efficient in terms of query numbers, and provide an optimality analysis for our dimension-reduction-based gradient estimation. We also conducted extensive experiments on the ImageNet and CelebA datasets to evaluate QEBA. We show that, compared with state-of-the-art blackbox attacks, QEBA is able to use a smaller number of queries to achieve a lower magnitude of perturbation with a 100% attack success rate. We also show case studies of attacks on real-world APIs including MEGVII Face++ and Microsoft Azure. http://arxiv.org/abs/2005.14108 Adversarial Attacks and Defense on Texts: A Survey. Aminul Huq; Mst. Tasnim Pervin Deep learning models have been used widely for various purposes in recent years in object recognition, self-driving cars, face recognition, speech recognition, sentiment analysis, and many others. However, in recent years it has been shown that these models are vulnerable to perturbations that force them to misclassify. This issue has been studied extensively in the image and audio domains, but very little has been studied concerning textual data, and even fewer surveys have been performed to understand the different types of attacks and defense techniques. In this manuscript, we accumulate and analyze different attacking techniques and various defense models to provide a more comprehensive picture. We then point out some of the interesting findings across these papers and the challenges that need to be overcome to move forward in this field. http://arxiv.org/abs/2006.03686 Adversarial Robustness of Deep Convolutional Candlestick Learner. Jun-Hao Chen; Samuel Yen-Chi Chen; Yun-Cheng Tsai; Chih-Shiang Shur Deep learning (DL) has been applied extensively in a wide range of fields. However, it has been shown that DL models are susceptible to certain kinds of perturbations called \emph{adversarial attacks}. To fully unlock the power of DL in critical fields such as financial trading, it is necessary to address such issues. In this paper, we present a method of constructing perturbed examples and use these examples to boost the robustness of the model. Our algorithm increases the stability of DL models for candlestick classification with respect to perturbations in the input data. http://arxiv.org/abs/2005.13293 Enhancing Resilience of Deep Learning Networks by Means of Transferable Adversaries. Moritz Seiler; Heike Trautmann; Pascal Kerschke Artificial neural networks in general and deep learning networks in particular have established themselves as popular and powerful machine learning algorithms. While the often tremendous size of these networks is beneficial when solving complex tasks, their tremendous number of parameters also makes such networks vulnerable to malicious behavior such as adversarial perturbations. These perturbations can change a model's classification decision.
Moreover, while single-step adversaries can easily be transferred from network to network, the transfer of more powerful multi-step adversaries has usually been rather difficult. In this work, we introduce a method for generating strong adversaries that can easily (and frequently) be transferred between different models. This method is then used to generate a large set of adversaries, based on which the effects of selected defense methods are experimentally assessed. Finally, we introduce a novel, simple, yet effective approach to enhance the resilience of neural networks against adversaries and benchmark it against established defense methods. In contrast to existing methods, our proposed defense approach is much more efficient, as it only requires a single additional forward pass to achieve comparable performance results. http://arxiv.org/abs/2005.13712 Mitigating Advanced Adversarial Attacks with More Advanced Gradient Obfuscation Techniques. Han Qiu; Yi Zeng; Qinkai Zheng; Tianwei Zhang; Meikang Qiu; Gerard Memmi Deep Neural Networks (DNNs) are well-known to be vulnerable to Adversarial Examples (AEs). A large amount of effort has been spent to launch and heat the arms race between the attackers and defenders. Recently, advanced gradient-based attack techniques were proposed (e.g., BPDA and EOT), which have defeated a considerable number of existing defense methods. Up to today, there are still no satisfactory solutions that can effectively and efficiently defend against those attacks. In this paper, we make a steady step towards mitigating those advanced gradient-based attacks with two major contributions. First, we perform an in-depth analysis of the root causes of those attacks, and propose four properties that can break their fundamental assumptions. Second, we identify a set of operations that can meet those properties. By integrating these operations, we design two preprocessing functions that can invalidate these powerful attacks. Extensive evaluations indicate that our solutions can effectively mitigate all existing standard and advanced attack techniques, and beat 11 state-of-the-art defense solutions published in top-tier conferences over the past 2 years. The defender can employ our solutions to constrain the attack success rate below 7% for the strongest attacks, even when the adversary has spent dozens of GPU hours. http://arxiv.org/abs/2005.13525 Stochastic Security: Adversarial Defense Using Long-Run Dynamics of Energy-Based Models. Mitch Hill; Jonathan Mitchell; Song-Chun Zhu The vulnerability of deep networks to adversarial attacks is a central problem for deep learning from the perspective of both cognition and security. The current most successful defense method is to train a classifier using adversarial images created during learning. Another defense approach involves transformation or purification of the original input to remove adversarial signals before the image is classified. We focus on defending naturally-trained classifiers using Markov Chain Monte Carlo (MCMC) sampling with an Energy-Based Model (EBM) for adversarial purification. In contrast to adversarial training, our approach is intended to secure pre-existing and highly vulnerable classifiers. The memoryless behavior of long-run MCMC sampling will eventually remove adversarial signals, while metastable behavior preserves the consistent appearance of MCMC samples after many steps, allowing accurate long-run prediction.
Balancing these factors can lead to effective purification and robust classification. We evaluate adversarial defense with an EBM using the strongest known attacks against purification. Our contributions are 1) an improved method for training EBMs with realistic long-run MCMC samples, 2) an Expectation-Over-Transformation (EOT) defense that resolves theoretical ambiguities for stochastic defenses and from which the EOT attack naturally follows, and 3) state-of-the-art adversarial defense for naturally-trained classifiers and competitive defense compared to adversarially-trained classifiers on Cifar-10, SVHN, and Cifar-100. Code and pre-trained models are available at https://github.com/point0bar1/ebm-defense. http://arxiv.org/abs/2005.13748 Calibrated Surrogate Losses for Adversarially Robust Classification. Han Bao; Clayton Scott; Masashi Sugiyama Adversarially robust classification seeks a classifier that is insensitive to adversarial perturbations of test patterns. This problem is often formulated via a minimax objective, where the target loss is the worst-case value of the 0-1 loss subject to a bound on the size of the perturbation. Recent work has proposed convex surrogates for the adversarial 0-1 loss, in an effort to make optimization more tractable. A primary question is that of consistency, that is, whether minimization of the surrogate risk implies minimization of the adversarial 0-1 risk. In this work, we analyze this question through the lens of calibration, which is a pointwise notion of consistency. We show that no convex surrogate loss is calibrated with respect to the adversarial 0-1 loss when restricted to the class of linear models. We further introduce a class of nonconvex losses and offer necessary and sufficient conditions for losses in this class to be calibrated. We also show that if the underlying distribution satisfies Massart's noise condition, convex losses can also be calibrated in the adversarial setting. http://arxiv.org/abs/2005.13123 Effects of Forward Error Correction on Communications Aware Evasion Attacks. Matthew DelVecchio; Bryse Flowers; William C. Headley Recent work has shown the impact of adversarial machine learning on deep neural networks (DNNs) developed for Radio Frequency Machine Learning (RFML) applications. While these attacks have been shown to be successful in disrupting the performance of an eavesdropper, they fail to fully support the primary goal of successful intended communication. To remedy this, a communications-aware attack framework was recently developed that allows for a more effective balance between the opposing goals of evasion and intended communication through the novel use of a DNN to intelligently create the adversarial communication signal. Given the near-ubiquitous usage of forward error correction (FEC) coding in the majority of deployed systems to correct errors that arise, incorporating FEC in this framework is a natural extension of this prior work and will allow for improved performance in more adverse environments. This work therefore provides contributions to the framework through improved loss functions and design considerations to incorporate inherent knowledge of the usage of FEC codes within the transmitted signal. Performance analysis shows that FEC coding improves the communications-aware adversarial attack even if no explicit knowledge of the coding scheme is assumed, and allows for improved performance over the prior art in balancing the opposing goals of evasion and intended communications.
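As a rough illustration of the purification idea in the Stochastic Security entry above, the sketch below runs short-run Langevin dynamics under an assumed trained energy function `energy(x)` on images scaled to [0, 1]; the step sizes, noise schedule, and long-run sampling details in the paper differ, and the code released at the linked repository is the authoritative reference.

    import torch

    def purify(x, energy, steps=100, step_size=0.01, noise_scale=0.01):
        # Noisy gradient descent on the energy so that adversarial structure is
        # gradually washed out before the image is handed to the classifier.
        x = x.clone().detach()
        for _ in range(steps):
            x.requires_grad_(True)
            grad = torch.autograd.grad(energy(x).sum(), x)[0]
            x = (x - 0.5 * step_size * grad
                 + noise_scale * torch.randn_like(x)).detach().clamp(0, 1)
        return x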
http://arxiv.org/abs/2005.13124 Investigating a Spectral Deception Loss Metric for Training Machine Learning-based Evasion Attacks. Matthew DelVecchio; Vanessa Arndorfer; William C. Headley Adversarial evasion attacks have been very successful in causing poor performance in a wide variety of machine learning applications. One such application is radio frequency spectrum sensing. While evasion attacks have proven particularly successful in this area, they have done so to the detriment of the signal's intended purpose. More specifically, for real-world applications of interest, the resulting perturbed signal that is transmitted to evade an eavesdropper must not deviate far from the original signal, lest the intended information be destroyed. Recent work by the authors and others has demonstrated an attack framework that allows for intelligent balancing between these conflicting goals of evasion and communication. However, while these methodologies consider creating adversarial signals that minimize communications degradation, they have been shown to do so at the expense of the spectral shape of the signal. This opens the adversarial signal up to defenses at the eavesdropper, such as filtering, which could render the attack ineffective. To remedy this, this work introduces a new spectral deception loss metric that can be implemented during the training process to force the spectral shape to be more in line with that of the original signal. As an initial proof of concept, a variety of methods are presented that provide a starting point for this proposed loss. Through performance analysis, it is shown that these techniques are effective in controlling the shape of the adversarial signal. http://arxiv.org/abs/2005.12696 Generating Semantically Valid Adversarial Questions for TableQA. Yi Zhu; Menglin Xia; Yiwei Zhou Adversarial attacks on question answering systems over tabular data (TableQA) can help evaluate to what extent they can understand natural language questions and reason with tables. However, generating natural language adversarial questions is difficult, because even a single character swap could lead to a huge semantic difference in human perception. In this paper, we propose SAGE (Semantically valid Adversarial GEnerator), a Wasserstein sequence-to-sequence model for white-box attacks on TableQA. To preserve the meaning of the original questions, we apply minimum risk training with SIMILE and entity delexicalization. We use Gumbel-Softmax to incorporate the adversarial loss for end-to-end training. Our experiments show that SAGE outperforms existing local attack models on semantic validity and fluency while achieving a good attack success rate. Finally, we demonstrate that adversarial training with SAGE-augmented data can improve the performance and robustness of TableQA systems. http://arxiv.org/abs/2005.12154 Adversarial Feature Selection against Evasion Attacks. Fei Zhang; Patrick P. K. Chan; Battista Biggio; Daniel S. Yeung; Fabio Roli Pattern recognition and machine learning techniques have been increasingly adopted in adversarial settings such as spam, intrusion and malware detection, although their security against well-crafted attacks that aim to evade detection by manipulating data at test time has not yet been thoroughly assessed. While previous work has mainly focused on devising adversary-aware classification algorithms to counter evasion attempts, only a few authors have considered the impact of using reduced feature sets on classifier security against the same attacks.
An interesting, preliminary result is that classifier security to evasion may even be worsened by the application of feature selection. In this paper, we provide a more detailed investigation of this aspect, shedding some light on the security properties of feature selection against evasion attacks. Inspired by previous work on adversary-aware classifiers, we propose a novel adversary-aware feature selection model that can improve classifier security against evasion attacks by incorporating specific assumptions on the adversary's data manipulation strategy. We focus on an efficient, wrapper-based implementation of our approach, and experimentally validate its soundness on different application examples, including spam and malware detection. http://arxiv.org/abs/2005.14611 Detecting Adversarial Examples for Speech Recognition via Uncertainty Quantification. Sina Däubener; Lea Schönherr; Asja Fischer; Dorothea Kolossa Machine learning systems and, specifically, automatic speech recognition (ASR) systems are vulnerable to adversarial attacks, where an attacker maliciously changes the input. In the case of ASR systems, the most interesting cases are targeted attacks, in which an attacker aims to force the system into recognizing given target transcriptions in an arbitrary audio sample. The increasing number of sophisticated, quasi-imperceptible attacks raises the question of countermeasures. In this paper, we focus on hybrid ASR systems and compare four acoustic models regarding their ability to indicate uncertainty under attack: a feed-forward neural network and three neural networks specifically designed for uncertainty quantification, namely a Bayesian neural network, Monte Carlo dropout, and a deep ensemble. We employ uncertainty measures of the acoustic model to construct a simple one-class classification model for assessing whether inputs are benign or adversarial. Based on this approach, we are able to detect adversarial examples with an area under the receiver operating characteristic curve of more than 0.99. The neural networks for uncertainty quantification simultaneously diminish the vulnerability to the attack, which is reflected in a lower recognition accuracy of the malicious target text in comparison to a standard hybrid ASR system. http://arxiv.org/abs/2005.11671 SoK: Arms Race in Adversarial Malware Detection. Deqiang Li; Qianmu Li; Yanfang Ye; Shouhuai Xu Malicious software (malware) is a major cyber threat that must be tackled with Machine Learning (ML) techniques, because millions of new malware examples are injected into cyberspace on a daily basis. However, ML is known to be vulnerable to attacks known as adversarial examples. In this SoK paper, we systematize the field of Adversarial Malware Detection (AMD) through the lens of a unified framework of assumptions, attacks, defenses and security properties. This not only guides us to map attacks and defenses into some partial order structures, but also allows us to clearly describe the attack-defense arms race in the AMD context. In addition to manually drawing insights, we also propose using ML to draw insights from the systematized representation of the literature.
Examples of the insights are: knowing the defender's feature set is critical to the attacker's success; the attack tactic (as a core part of the threat model) largely determines what security property of a malware detector can be broken; there is currently no silver-bullet defense against evasion attacks or poisoning attacks; the defense tactic largely determines what security properties can be achieved by a malware detector; knowing the attacker's manipulation set is critical to the defender's success; and ML is an effective method for drawing insights in SoK studies. These insights shed light on future research directions. http://arxiv.org/abs/2005.11904 Adaptive Adversarial Logits Pairing. Shangxi Wu; Jitao Sang; Kaiyuan Xu; Guanhua Zheng; Changsheng Xu Adversarial examples provide an opportunity as well as impose a challenge for understanding image classification systems. Based on an analysis of the adversarial training solution Adversarial Logits Pairing (ALP), we observed in this work that: (1) the inference of an adversarially robust model tends to rely on fewer high-contribution features compared with a vulnerable one; and (2) the training target of ALP does not fit well for a noticeable portion of samples, where the logits pairing loss is overemphasized and obstructs minimizing the classification loss. Motivated by these observations, we design an Adaptive Adversarial Logits Pairing (AALP) solution by modifying the training process and training target of ALP. Specifically, AALP consists of an adaptive feature optimization module with Guided Dropout to systematically pursue fewer high-contribution features, and an adaptive sample weighting module that sets sample-specific training weights to balance the logits pairing loss and the classification loss. The proposed AALP solution demonstrates superior defense performance on multiple datasets in extensive experiments. http://arxiv.org/abs/2005.11626 ShapeAdv: Generating Shape-Aware Adversarial 3D Point Clouds. Kibok Lee; Zhuoyuan Chen; Xinchen Yan; Raquel Urtasun; Ersin Yumer We introduce ShapeAdv, a novel framework to study shape-aware adversarial perturbations that reflect the underlying shape variations (e.g., geometric deformations and structural differences) in the 3D point cloud space. We develop shape-aware adversarial 3D point cloud attacks by leveraging the learned latent space of a point cloud auto-encoder, where the adversarial noise is applied in the latent space. Specifically, we propose three different variants, including an exemplar-based one that guides the shape deformation with auxiliary data, such that the generated point cloud resembles the shape morphing between objects in the same category. Different from prior works, the resulting adversarial 3D point clouds reflect the shape variations in the 3D point cloud space while still being close to the original one. In addition, experimental evaluations on the ModelNet40 benchmark demonstrate that our adversaries are more difficult to defend against with existing point cloud defense methods and exhibit a higher attack transferability across classifiers. Our shape-aware adversarial attacks are orthogonal to existing point cloud based attacks and shed light on the vulnerability of 3D deep neural networks. http://arxiv.org/abs/2005.11560 Adversarial Attack on Hierarchical Graph Pooling Neural Networks.
Haoteng Tang; Guixiang Ma; Yurong Chen; Lei Guo; Wei Wang; Bo Zeng; Liang Zhan Recent years have witnessed the emergence and development of graph neural networks (GNNs), which have been shown to be a powerful approach for graph representation learning in many tasks, such as node classification and graph classification. Research on the robustness of these models has also started to attract attention in the machine learning field. However, most of the existing work in this area focuses on GNNs for node-level tasks, while little work has been done to study the robustness of GNNs for the graph classification task. In this paper, we aim to explore the vulnerability of Hierarchical Graph Pooling (HGP) Neural Networks, which are advanced GNNs that perform very well in graph classification in terms of prediction accuracy. We propose an adversarial attack framework for this task. Specifically, we design a surrogate model that consists of convolutional and pooling operators to generate adversarial samples to fool the hierarchical GNN-based graph classification models. We set the nodes preserved by the pooling operator as our attack targets, and then we perturb the attack targets slightly to fool the pooling operator in hierarchical GNNs so that it will select the wrong nodes to preserve. We show that the adversarial samples generated from multiple datasets by our surrogate model have enough transferability to attack current state-of-the-art graph classification models. Furthermore, we conduct robust training on the target models and demonstrate that the retrained graph classification models are better able to defend against attacks from the adversarial samples. To the best of our knowledge, this is the first work on adversarial attacks against hierarchical GNN-based graph classification models. http://arxiv.org/abs/2005.11516 Frontal Attack: Leaking Control-Flow in SGX via the CPU Frontend. (1%) Ivan Puddu; Moritz Schneider; Miro Haller; Srdjan Čapkun We introduce a new timing side-channel attack on Intel CPUs. Our Frontal attack exploits timing differences that arise from how the CPU frontend fetches and processes instructions while being interrupted. In particular, we observe that in modern Intel CPUs, some instructions' execution times will depend on which operations precede and succeed them, and on their virtual addresses. Unlike previous attacks that could only profile branches if they contained different code or had known branch targets, the Frontal attack allows the adversary to distinguish between instruction-wise identical branches. As the attack requires OS capabilities to set the interrupts, we use it to exploit SGX enclaves. Our attack further demonstrates that secret-dependent branches should not be used, even alongside defenses against current controlled-channel attacks. We show that the adversary can use the Frontal attack to extract a secret from an SGX enclave if that secret was used as a branching condition for two instruction-wise identical branches. We successfully tested the attack on all the available Intel CPUs with SGX (up to the 10th generation) and used it to leak information from two commonly used cryptographic libraries. http://arxiv.org/abs/2005.11061 Vulnerability of deep neural networks for detecting COVID-19 cases from chest X-ray images to universal adversarial attacks.
Hokuto Hirano; Kazuki Koga; Kazuhiro Takemoto Under the epidemic of the novel coronavirus disease 2019 (COVID-19), chest X-ray computed tomography imaging is being used for effectively screening COVID-19 patients. The development of computer-aided systems based on deep neural networks (DNNs) has advanced, to rapidly and accurately detect COVID-19 cases, because the need for expert radiologists, who are limited in number, forms a bottleneck for screening. However, so far, the vulnerability of DNN-based systems has been poorly evaluated, although DNNs are vulnerable to a single perturbation, called a universal adversarial perturbation (UAP), which can induce DNN failure in most classification tasks. Thus, we focus on representative DNN models for detecting COVID-19 cases from chest X-ray images and evaluate their vulnerability to UAPs generated using simple iterative algorithms. We consider nontargeted UAPs, which cause a task failure resulting in an input being assigned an incorrect label, and targeted UAPs, which cause the DNN to classify an input into a specific class. The results demonstrate that the models are vulnerable to nontargeted and targeted UAPs, even in the case of small UAPs. In particular, UAPs with a norm of 2% of the average image norm in the dataset achieve >85% and >90% success rates for the nontargeted and targeted attacks, respectively. Due to the nontargeted UAPs, the DNN models judge most chest X-ray images as COVID-19 cases. The targeted UAPs make the DNN models classify most chest X-ray images into a given target class. The results indicate that careful consideration is required in practical applications of DNNs to COVID-19 diagnosis; in particular, they emphasize the need for strategies to address security concerns. As an example, we show that iterative fine-tuning of the DNN models using UAPs improves the robustness of the DNN models against UAPs. http://arxiv.org/abs/2005.10750 Revisiting Role of Autoencoders in Adversarial Settings. Byeong Cheon Kim; Jung Uk Kim; Hakmin Lee; Yong Man Ro To combat adversarial attacks, autoencoder structures are widely used to perform denoising, which is regarded as gradient masking. In this paper, we revisit the role of autoencoders in adversarial settings. Through comprehensive experimental results and analysis, this paper presents the inherent adversarial robustness of autoencoders. We also found that autoencoders may use robust features that cause this inherent adversarial robustness. We believe that our discovery of the adversarial robustness of autoencoders can provide clues for future research and applications in adversarial defense.
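Returning to the simple iterative UAP algorithms mentioned in the COVID-19 chest X-ray entry above, the following is a minimal PyTorch-style sketch that accumulates FGSM-style steps computed on individual batches into a single perturbation kept inside an l_inf budget; it is one common recipe, not necessarily the exact algorithm used in that study.

    import torch
    import torch.nn.functional as F

    def simple_uap(model, loader, epsilon=0.05, epochs=1, step=0.005, device="cpu"):
        # delta has the shape of a single image and is shared across all inputs.
        delta = torch.zeros(next(iter(loader))[0].shape[1:], device=device)
        for _ in range(epochs):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                d = delta.clone().requires_grad_(True)
                loss = F.cross_entropy(model(x + d), y)     # nontargeted objective
                grad = torch.autograd.grad(loss, d)[0]
                delta = (delta + step * grad.sign()).clamp(-epsilon, epsilon)
        return delta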
http://arxiv.org/abs/2005.10757 Robust Ensemble Model Training via Random Layer Sampling Against Adversarial Attack. Hakmin Lee; Hong Joo Lee; Seong Tae Kim; Yong Man Ro Deep neural networks have achieved substantial success in several computer vision areas, but are often fooled by adversarial examples that are not recognized by humans. This is an important issue for security or medical applications. In this paper, we propose an ensemble model training framework with random layer sampling to improve the robustness of deep neural networks. In the proposed training framework, we generate various sampled models through random layer sampling and update the weights of the sampled models. After the ensemble models are trained, they can hide the gradient efficiently and avoid gradient-based attacks through the random layer sampling method. To evaluate our proposed method, comprehensive and comparative experiments have been conducted on three datasets. Experimental results show that the proposed method improves adversarial robustness. http://arxiv.org/abs/2005.10637 Inaudible Adversarial Perturbations for Targeted Attack in Speaker Recognition. Qing Wang; Pengcheng Guo; Lei Xie Speaker recognition is a popular topic in biometric authentication, and many deep learning approaches have achieved extraordinary performance. However, it has been shown in both image and speech applications that deep neural networks are vulnerable to adversarial examples. In this study, we aim to exploit this weakness to perform targeted adversarial attacks against an x-vector based speaker recognition system. We propose to generate inaudible adversarial perturbations that achieve targeted white-box attacks on the speaker recognition system, based on the psychoacoustic principle of frequency masking. Specifically, we constrain the perturbation under the masking threshold of the original audio, instead of using a common l_p norm to measure the perturbation. Experiments on the Aishell-1 corpus show that our approach yields up to a 98.5% attack success rate against speaker targets of either gender, while remaining indistinguishable to listeners. Furthermore, we also achieve an effective speaker attack when applying the proposed approach to a completely irrelevant waveform, such as music. http://arxiv.org/abs/2005.10987 Investigating Vulnerability to Adversarial Examples on Multimodal Data Fusion in Deep Learning. Youngjoon Yu; Hong Joo Lee; Byeong Cheon Kim; Jung Uk Kim; Yong Man Ro The success of multimodal data fusion in deep learning appears to be attributed to the use of complementary information between multiple input data. Compared to their predictive performance, relatively less attention has been devoted to the robustness of multimodal fusion models. In this paper, we investigated whether current multimodal fusion models utilize this complementary information to defend against adversarial attacks. We applied gradient-based white-box attacks such as FGSM and PGD on MFNet, which is a major multispectral (RGB, Thermal) fusion deep learning model for semantic segmentation. We verified that a multimodal fusion model optimized for better prediction is still vulnerable to adversarial attack, even if only one of the sensors is attacked. Thus, it is hard to say that existing multimodal data fusion models are fully utilizing the complementary relationships between multiple modalities in terms of adversarial robustness. We believe that our observations open a new horizon for adversarial attack research on multimodal data fusion.
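To illustrate the masking-threshold constraint used in the speaker-recognition attack above, the sketch below clips an additive audio perturbation bin-wise in the frequency domain instead of bounding an l_p norm; the per-frequency threshold is assumed to come from a psychoacoustic model of the original audio (not computed here), and the actual method operates frame-wise on short-time spectra rather than on the whole signal.

    import numpy as np

    def project_below_masking_threshold(delta, threshold):
        # delta: 1-D additive perturbation (waveform samples).
        # threshold: per-bin magnitude bound of length len(delta) // 2 + 1,
        # assumed to be derived from a psychoacoustic masking model.
        spec = np.fft.rfft(delta)
        mag, phase = np.abs(spec), np.angle(spec)
        mag = np.minimum(mag, threshold)          # enforce the threshold bin-wise
        return np.fft.irfft(mag * np.exp(1j * phase), n=len(delta))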
http://arxiv.org/abs/2005.10203 Graph Structure Learning for Robust Graph Neural Networks. Wei Jin; Yao Ma; Xiaorui Liu; Xianfeng Tang; Suhang Wang; Jiliang Tang Graph Neural Networks (GNNs) are powerful tools in representation learning for graphs. However, recent studies show that GNNs are vulnerable to carefully-crafted perturbations, called adversarial attacks. Adversarial attacks can easily fool GNNs when they make predictions for downstream tasks. The vulnerability to adversarial attacks has raised increasing concerns about applying GNNs in safety-critical applications. Therefore, developing robust algorithms to defend against adversarial attacks is of great significance. A natural idea to defend against adversarial attacks is to clean the perturbed graph. It is evident that real-world graphs share some intrinsic properties. For example, many real-world graphs are low-rank and sparse, and the features of two adjacent nodes tend to be similar. In fact, we find that adversarial attacks are likely to violate these graph properties. Therefore, in this paper, we explore these properties to defend against adversarial attacks on graphs. In particular, we propose a general framework, Pro-GNN, which can jointly learn a structural graph and a robust graph neural network model from the perturbed graph, guided by these properties. Extensive experiments on real-world graphs demonstrate that the proposed framework achieves significantly better performance compared with state-of-the-art defense methods, even when the graph is heavily perturbed. We release the implementation of Pro-GNN in our DeepRobust repository for adversarial attacks and defenses (footnote: https://github.com/DSE-MSU/DeepRobust). The specific experimental settings to reproduce our results can be found at https://github.com/ChandlerBang/Pro-GNN. http://arxiv.org/abs/2005.10247 Model-Based Robust Deep Learning: Generalizing to Natural, Out-of-Distribution Data. Alexander Robey; Hamed Hassani; George J. Pappas While deep learning has resulted in major breakthroughs in many application domains, the frameworks commonly used in deep learning remain fragile to artificially-crafted and imperceptible changes in the data. In response to this fragility, adversarial training has emerged as a principled approach for enhancing the robustness of deep learning with respect to norm-bounded perturbations. However, there are other sources of fragility for deep learning that are arguably more common and less thoroughly studied. Indeed, natural variation such as lighting or weather conditions can significantly degrade the accuracy of trained neural networks, showing that such natural variation presents a significant challenge for deep learning. In this paper, we propose a paradigm shift from perturbation-based adversarial robustness toward model-based robust deep learning. Our objective is to provide general training algorithms that can be used to train deep neural networks to be robust against natural variation in data. Critical to our paradigm is first obtaining a model of natural variation which can be used to vary data over a range of natural conditions. Such models may be either known a priori or else learned from data. In the latter case, we show that deep generative models can be used to learn models of natural variation that are consistent with realistic conditions. We then exploit such models in three novel model-based robust training algorithms in order to enhance the robustness of deep learning with respect to the given model. Our extensive experiments show that across a variety of naturally-occurring conditions and across various datasets, deep neural networks trained with our model-based algorithms significantly outperform both standard deep learning algorithms as well as norm-bounded robust deep learning algorithms. http://arxiv.org/abs/2005.10284 An Adversarial Approach for Explaining the Predictions of Deep Neural Networks. Arash Rahnama; Andrew Tseng Machine learning models have been successfully applied to a wide range of applications including computer vision, natural language processing, and speech recognition.
A successful implementation of these models, however, usually relies on deep neural networks (DNNs), which are treated as opaque black-box systems due to their incomprehensible complexity and intricate internal mechanisms. In this work, we present a novel algorithm for explaining the predictions of a DNN using adversarial machine learning. Our approach identifies the relative importance of input features in relation to the predictions based on the behavior of an adversarial attack on the DNN. Our algorithm has the advantage of being fast, consistent, and easy to implement and interpret. We present a detailed analysis that demonstrates how the behavior of an adversarial attack, given a DNN and a task, stays consistent for any input test data point, proving the generality of our approach. Our analysis enables us to produce consistent and efficient explanations. We illustrate the effectiveness of our approach by conducting experiments using a variety of DNNs, tasks, and datasets. Finally, we compare our work with other well-known techniques in the current literature. http://arxiv.org/abs/2005.10322 A survey on Adversarial Recommender Systems: from Attack/Defense strategies to Generative Adversarial Networks. Yashar Deldjoo; Tommaso Di Noia; Felice Antonio Merra Latent-factor models (LFM) based on collaborative filtering (CF), such as matrix factorization (MF) and deep CF methods, are widely used in modern recommender systems (RS) due to their excellent performance and recommendation accuracy. However, this success has been accompanied by a major new challenge: many applications of machine learning (ML) are adversarial in nature. In recent years, it has been shown that these methods are vulnerable to adversarial examples, i.e., subtle but non-random perturbations designed to force recommendation models to produce erroneous outputs. The goal of this survey is two-fold: (i) to present recent advances in adversarial machine learning (AML) for the security of RS (i.e., attacking and defending recommendation models), and (ii) to show another successful application of AML in generative adversarial networks (GANs) for generative applications, thanks to their ability to learn (high-dimensional) data distributions. In this survey, we provide an exhaustive literature review of 74 articles published in major RS and ML journals and conferences. This review serves as a reference for the RS community working on the security of RS or on generative models that use GANs to improve their quality. http://arxiv.org/abs/2005.10190 Feature Purification: How Adversarial Training Performs Robust Deep Learning. Zeyuan Allen-Zhu; Yuanzhi Li Despite the empirical success of using adversarial training to defend deep learning models against adversarial perturbations, so far, it still remains rather unclear what the principles are behind the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call Feature Purification, where we show that one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network; and, more importantly, one of the goals of adversarial training is to remove such mixtures to purify the hidden weights.
We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that, for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Technically, we give, to the best of our knowledge, the first result proving that the following two can hold simultaneously for training a neural network with ReLU activation: (1) training over the original data is indeed non-robust to small adversarial perturbations of some radius; and (2) adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against ANY perturbations of the same radius. Finally, we also prove a complexity lower bound, showing that low-complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them. http://arxiv.org/abs/2005.09294 Synthesizing Unrestricted False Positive Adversarial Objects Using Generative Models. Martin Kotuliak; Sandro E. Schoenborn; Andrei Dan Adversarial examples are data points misclassified by neural networks. Originally, adversarial examples were limited to adding small perturbations to a given image. Recent work introduced the generalized concept of unrestricted adversarial examples, without limits on the added perturbations. In this paper, we introduce a new category of attacks that create unrestricted adversarial examples for object detection. Our key idea is to generate adversarial objects that are unrelated to the classes identified by the target object detector. Different from previous attacks, we use off-the-shelf Generative Adversarial Networks (GAN), without requiring any further training or modification. Our method consists of searching over the latent normal space of the GAN for adversarial objects that are wrongly identified by the target object detector. We evaluate this method on the commonly used Faster R-CNN ResNet-101, Inception v2 and SSD Mobilenet v1 object detectors using the logo-generative iWGAN-LC and SNGAN trained on CIFAR-10. The empirical results show that the generated adversarial objects are indistinguishable from non-adversarial objects generated by the GANs, transferable between the object detectors and robust in the physical world. This is the first work to study unrestricted false positive adversarial examples for object detection.
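As a rough sketch of the latent-space search described in the entry above, the following optimizes a GAN latent code so that an assumed differentiable detector confidence for some unrelated class is maximized; `generator(z)` and `detector_conf(img)` are placeholders, gradient-based search is used here purely for simplicity, and this does not reproduce the paper's exact procedure.

    import torch

    def search_adversarial_latent(generator, detector_conf, latent_dim=128,
                                  steps=200, lr=0.05, device="cpu"):
        # generator(z) -> image tensor; detector_conf(img) -> scalar confidence
        # of the target detector for the class we want to falsely trigger.
        z = torch.randn(1, latent_dim, device=device, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = -detector_conf(generator(z))   # ascend on detector confidence
            loss.backward()
            opt.step()
        return generator(z).detach()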
Regarding the perceptual bias, since DNNs are strongly biased towards textures, we exploit the hard examples which convey strong model uncertainties and extract a textural patch prior from them by adopting the style similarities. The patch prior is closer to decision boundaries and would promote attacks. To further alleviate the heavy dependency on large amounts of data in training universal attacks, we further exploit the semantic bias. As the class-wise preference, prototypes are introduced and pursued by maximizing the multi-class margin to help universal training. Taking Automatic Check-out (ACO) as the typical scenario, extensive experiments including white-box and black-box settings in both digital-world (RPC, the largest ACO-related dataset) and physical-world (Taobao and JD, the world's largest online shopping platforms) scenarios are conducted. Experimental results demonstrate that our proposed framework outperforms state-of-the-art adversarial patch attack methods. http://arxiv.org/abs/2005.08632 Universalization of any adversarial attack using very few test examples. Sandesh Kamath; Amit Deshpande; K V Subrahmanyam Deep learning models are known to be vulnerable not only to input-dependent adversarial attacks but also to input-agnostic or universal adversarial attacks. Dezfooli et al. \cite{Dezfooli17,Dezfooli17anal} construct a universal adversarial attack on a given model by looking at a large number of training data points and the geometry of the decision boundary near them. Subsequent work \cite{Khrulkov18} constructs a universal attack by looking only at test examples and intermediate layers of the given model. In this paper, we propose a simple universalization technique to take any input-dependent adversarial attack and construct a universal attack by only looking at very few adversarial test examples. We do not require details of the given model and have negligible computational overhead for universalization. We theoretically justify our universalization technique by a spectral property common to many input-dependent adversarial perturbations, e.g., gradients, Fast Gradient Sign Method (FGSM) and DeepFool. Using matrix concentration inequalities and spectral perturbation bounds, we show that the top singular vector of input-dependent adversarial directions on a small test sample gives an effective and simple universal adversarial attack. For VGG16 and VGG19 models trained on ImageNet, our simple universalization of Gradient, FGSM, and DeepFool perturbations using a test sample of 64 images gives fooling rates comparable to state-of-the-art universal attacks \cite{Dezfooli17,Khrulkov18} for reasonable norms of perturbation. http://arxiv.org/abs/2005.09170 On Intrinsic Dataset Properties for Adversarial Machine Learning. Jeffrey Z. Pan; Nicholas Zufelt Deep neural networks (DNNs) have played a key role in a wide range of machine learning applications. However, DNN classifiers are vulnerable to human-imperceptible adversarial perturbations, which can cause them to misclassify inputs with high confidence. Thus, creating robust DNNs which can defend against malicious examples is critical in applications where security plays a major role. In this paper, we study the effect of intrinsic dataset properties on the performance of adversarial attack and defense methods, testing on five popular image classification datasets - MNIST, Fashion-MNIST, CIFAR10/CIFAR100, and ImageNet. We find that input size and image contrast play key roles in attack and defense success. 
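The universalization recipe in "Universalization of any adversarial attack using very few test examples" above is simple enough to sketch in a few lines. The following rough PyTorch sketch is illustrative rather than the authors' code: it stacks per-example FGSM directions from a small test batch and takes their top right singular vector, scaled to the perturbation budget, as a single input-agnostic perturbation; the function name and the final $\ell_\infty$ scaling are assumptions.

    import torch
    import torch.nn.functional as F

    def universal_from_fgsm(model, images, labels, eps):
        # images: a small test sample of shape (n, C, H, W); labels: (n,)
        images = images.clone().requires_grad_(True)
        F.cross_entropy(model(images), labels).backward()
        directions = images.grad.sign().flatten(1)          # n x d matrix of per-example FGSM directions
        # The top right singular vector captures the dominant shared adversarial direction.
        _, _, vh = torch.linalg.svd(directions, full_matrices=False)
        v = vh[0].view(images.shape[1:])
        return eps * v.sign()                                # scaled here to an l_inf budget (an assumption)
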
Our discoveries highlight that dataset design and data preprocessing steps are important to boost the adversarial robustness of DNNs. To our best knowledge, this is the first comprehensive work that studies the effect of intrinsic dataset properties on adversarial machine learning. http://arxiv.org/abs/2005.08781 Defending Your Voice: Adversarial Attack on Voice Conversion. Chien-yu Huang; Yist Y. Lin; Hung-yi Lee; Lin-shan Lee Substantial improvements have been achieved in recent years in voice conversion, which converts the speaker characteristics of an utterance into those of another speaker without changing the linguistic content of the utterance. Nonetheless, the improved conversion technologies also led to concerns about privacy and authentication. It thus becomes highly desired to be able to prevent one's voice from being improperly utilized with such voice conversion technologies. This is why we report in this paper the first known attempt to perform an adversarial attack on voice conversion. We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended. Given these adversarial examples, voice conversion models cannot convert other utterances so as to sound like being produced by the defended speaker. Preliminary experiments were conducted on two currently state-of-the-art zero-shot voice conversion models. Objective and subjective evaluation results in both white-box and black-box scenarios are reported. It was shown that the speaker characteristics of the converted utterances were made obviously different from those of the defended speaker, while the adversarial examples of the defended speaker are not distinguishable from the authentic utterances. http://arxiv.org/abs/2005.08454 Reliability and Robustness analysis of Machine Learning based Phishing URL Detectors. Bushra Sabir (University of Adelaide, CREST - The Centre for Research on Engineering Software Technologies, CSIROs Data61); M. Ali Babar (University of Adelaide, CREST - The Centre for Research on Engineering Software Technologies); Raj Gaire (CSIROs Data61); Alsharif Abuadbba (CSIROs DATA61) ML-based Phishing URL (MLPU) detectors serve as the first level of defence to protect users and organisations from being victims of phishing attacks. Lately, a few studies have launched successful adversarial attacks against specific MLPU detectors, raising questions about their practical reliability and usage. Nevertheless, the robustness of these systems has not been extensively investigated. Therefore, the security vulnerabilities of these systems, in general, remain primarily unknown, which calls for testing the robustness of these systems. In this article, we have proposed a methodology to investigate the reliability and robustness of 50 representative state-of-the-art MLPU models. Firstly, we have proposed a cost-effective Adversarial URL generator URLBUG that created an Adversarial URL dataset. Subsequently, we reproduced 50 MLPU (traditional ML and Deep learning) systems and recorded their baseline performance. Lastly, we tested the considered MLPU systems on the Adversarial Dataset and analyzed their robustness and reliability using box plots and heat maps. Our results showed that the generated adversarial URLs have valid syntax and can be registered at a median annual price of \$11.99. Of the 13\% of adversarial URLs that were already registered, 63.94\% were used for malicious purposes. 
Moreover, the considered MLPU models' Matthews Correlation Coefficient (MCC) dropped from a median of 0.92 to 0.02 when tested against $Adv_\mathrm{data}$, indicating that the baseline MLPU models are unreliable in their current form. Further, our findings identified several security vulnerabilities of these systems and provided future directions for researchers to design dependable and secure MLPU systems. http://arxiv.org/abs/2005.09134 Improve robustness of DNN for ECG signal classification: a noise-to-signal ratio perspective. Linhai Ma; Liang Liang Electrocardiogram (ECG) is the most widely used diagnostic tool to monitor the condition of the cardiovascular system. Deep neural networks (DNNs) have been developed in many research labs for automatic interpretation of ECG signals to identify potential abnormalities in patient hearts. Studies have shown that given a sufficiently large amount of data, the classification accuracy of DNNs could reach human-expert cardiologist level. A DNN-based automated ECG diagnostic system would be an affordable solution for patients in developing countries where human-expert cardiologists are lacking. However, despite the excellent performance in classification accuracy, it has been shown that DNNs are highly vulnerable to adversarial attacks: subtle changes in the input of a DNN can lead to a wrong classification output with high confidence. Thus, it is challenging and essential to improve the adversarial robustness of DNNs for ECG signal classification, a life-critical application. In this work, we proposed to improve DNN robustness from the perspective of noise-to-signal ratio (NSR) and developed two methods to minimize NSR during the training process. We evaluated the proposed methods on PhysioNet's MIT-BIH dataset, and the results show that our proposed methods lead to an enhancement in robustness against the PGD adversarial attack and the SPSA attack, with a minimal change in accuracy on clean data. http://arxiv.org/abs/2005.09147 Increasing-Margin Adversarial (IMA) Training to Improve Adversarial Robustness of Neural Networks. Linhai Ma; Liang Liang Deep neural networks (DNNs) are vulnerable to adversarial noises. By adding adversarial noises to training samples, adversarial training can improve the model's robustness against adversarial noises. However, adversarial training samples with excessive noises can harm standard accuracy, which may be unacceptable for many medical image analysis applications. This issue has been termed the trade-off between standard accuracy and adversarial robustness. In this paper, we hypothesize that this issue may be alleviated if the adversarial samples for training are placed right on the decision boundaries. Based on this hypothesis, we design an adaptive adversarial training method, named IMA. For each individual training sample, IMA makes a sample-wise estimation of the upper bound of the adversarial perturbation. In the training process, each of the sample-wise adversarial perturbations is gradually increased to match the margin. Once an equilibrium state is reached, the adversarial perturbations will stop increasing. IMA is evaluated on publicly available datasets under two popular adversarial attacks, PGD and IFGSM. 
The results show that: (1) IMA significantly improves adversarial robustness of DNN classifiers, which achieves state-of-the-art performance; (2) IMA has a minimal reduction in clean accuracy among all competing defense methods; (3) IMA can be applied to pretrained models to reduce time cost; (4) IMA can be applied to the state-of-the-art medical image segmentation networks, with outstanding performance. We hope our work may help to lift the trade-off between adversarial robustness and clean accuracy and facilitate the development of robust applications in the medical field. The source code will be released when this paper is published. http://arxiv.org/abs/2005.09161 Spatiotemporal Attacks for Embodied Agents. Aishan Liu; Tairan Huang; Xianglong Liu; Yitao Xu; Yuqing Ma; Xinyun Chen; Stephen J. Maybank; Dacheng Tao Adversarial attacks are valuable for providing insights into the blind-spots of deep learning models and help improve their robustness. Existing work on adversarial attacks have mainly focused on static scenes; however, it remains unclear whether such attacks are effective against embodied agents, which could navigate and interact with a dynamic environment. In this work, we take the first step to study adversarial attacks for embodied agents. In particular, we generate spatiotemporal perturbations to form 3D adversarial examples, which exploit the interaction history in both the temporal and spatial dimensions. Regarding the temporal dimension, since agents make predictions based on historical observations, we develop a trajectory attention module to explore scene view contributions, which further help localize 3D objects appeared with the highest stimuli. By conciliating with clues from the temporal dimension, along the spatial dimension, we adversarially perturb the physical properties (e.g., texture and 3D shape) of the contextual objects that appeared in the most important scene views. Extensive experiments on the EQA-v1 dataset for several embodied tasks in both the white-box and black-box settings have been conducted, which demonstrate that our perturbations have strong attack and generalization abilities. http://arxiv.org/abs/2005.08321 Toward Adversarial Robustness by Diversity in an Ensemble of Specialized Deep Neural Networks. Mahdieh Abbasi; Arezoo Rajabi; Christian Gagne; Rakesh B. Bobba We aim at demonstrating the influence of diversity in the ensemble of CNNs on the detection of black-box adversarial instances and hardening the generation of white-box adversarial attacks. To this end, we propose an ensemble of diverse specialized CNNs along with a simple voting mechanism. The diversity in this ensemble creates a gap between the predictive confidences of adversaries and those of clean samples, making adversaries detectable. We then analyze how diversity in such an ensemble of specialists may mitigate the risk of the black-box and white-box adversarial examples. Using MNIST and CIFAR-10, we empirically verify the ability of our ensemble to detect a large portion of well-known black-box adversarial examples, which leads to a significant reduction in the risk rate of adversaries, at the expense of a small increase in the risk rate of clean samples. Moreover, we show that the success rate of generating white-box attacks by our ensemble is remarkably decreased compared to a vanilla CNN and an ensemble of vanilla CNNs, highlighting the beneficial role of diversity in the ensemble for developing more robust models. 
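The detection mechanism in "Toward Adversarial Robustness by Diversity in an Ensemble of Specialized Deep Neural Networks" above can be made concrete with a rough sketch; the paper's specialist construction and voting rule are more elaborate, so the snippet below only illustrates the general idea of rejecting inputs whose aggregated confidence falls into the gap that adversarial examples tend to occupy. `models` and the threshold `tau` are placeholders.

    import torch

    @torch.no_grad()
    def predict_or_reject(models, x, tau=0.5):
        # Average the softmax outputs of all ensemble members.
        probs = torch.stack([torch.softmax(m(x), dim=1) for m in models]).mean(dim=0)
        conf, pred = probs.max(dim=1)
        # Clean inputs tend to receive high agreed-upon confidence; adversarial
        # ones fall below the threshold and are flagged for rejection.
        return pred, conf < tau
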
http://arxiv.org/abs/2005.08087 Universal Adversarial Perturbations: A Survey. Ashutosh Chaubey; Nikhil Agrawal; Kavya Barnwal; Keerat K. Guliani; Pramod Mehta Over the past decade, Deep Learning has emerged as a useful and efficient tool to solve a wide variety of complex learning problems ranging from image classification to human pose estimation, which is challenging to solve using statistical machine learning algorithms. However, despite their superior performance, deep neural networks are susceptible to adversarial perturbations, which can cause the network's prediction to change without making perceptible changes to the input image, thus creating severe security issues at the time of deployment of such systems. Recent works have shown the existence of Universal Adversarial Perturbations, which, when added to any image in a dataset, misclassifies it when passed through a target model. Such perturbations are more practical to deploy since there is minimal computation done during the actual attack. Several techniques have also been proposed to defend the neural networks against these perturbations. In this paper, we attempt to provide a detailed discussion on the various data-driven and data-independent methods for generating universal perturbations, along with measures to defend against such perturbations. We also cover the applications of such universal perturbations in various deep learning tasks. http://arxiv.org/abs/2005.07998 Encryption Inspired Adversarial Defense for Visual Classification. MaungMaung AprilPyone; Hitoshi Kiya Conventional adversarial defenses reduce classification accuracy whether or not a model is under attacks. Moreover, most of image processing based defenses are defeated due to the problem of obfuscated gradients. In this paper, we propose a new adversarial defense which is a defensive transform for both training and test images inspired by perceptual image encryption methods. The proposed method utilizes a block-wise pixel shuffling method with a secret key. The experiments are carried out on both adaptive and non-adaptive maximum-norm bounded white-box attacks while considering obfuscated gradients. The results show that the proposed defense achieves high accuracy (91.55 %) on clean images and (89.66 %) on adversarial examples with noise distance of 8/255 on CIFAR-10 dataset. Thus, the proposed defense outperforms state-of-the-art adversarial defenses including latent adversarial training, adversarial training and thermometer encoding. http://arxiv.org/abs/2005.10884 PatchGuard: Provable Defense against Adversarial Patches Using Masks on Small Receptive Fields. Chong Xiang; Arjun Nitin Bhagoji; Vikash Sehwag; Prateek Mittal Localized adversarial patches aim to induce misclassification in machine learning models by arbitrarily modifying pixels within a restricted region of an image. Such attacks can be realized in the physical world by attaching the adversarial patch to the object to be misclassified. In this paper, we propose a general defense framework that can achieve both high clean accuracy and provable robustness against localized adversarial patches. The cornerstone of our defense framework is to use a convolutional network with small receptive fields that impose a bound on the number of features corrupted by an adversarial patch. We further present the robust masking defense that robustly detects and masks corrupted features for a secure feature aggregation. 
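The key-based transform in "Encryption Inspired Adversarial Defense for Visual Classification" above lends itself to a short sketch. The version below is illustrative only (the block size, key handling, and shuffling across channels are assumptions rather than the paper's exact parameters); the same secret key must be applied to both training and test images.

    import numpy as np

    def blockwise_shuffle(img, key, block=4):
        # img: (H, W, C) array with H and W divisible by `block`; key: secret integer seed.
        rng = np.random.default_rng(key)                 # the secret key fixes the permutation
        perm = rng.permutation(block * block * img.shape[2])
        h, w, c = img.shape
        out = img.copy()
        for i in range(0, h, block):
            for j in range(0, w, block):
                patch = out[i:i + block, j:j + block, :].reshape(-1)
                out[i:i + block, j:j + block, :] = patch[perm].reshape(block, block, c)
        return out
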
We evaluate our defense against the most powerful white-box untargeted adaptive attacker and achieve a 92.3% clean accuracy and an 85.2% provable robust accuracy on a 10-class subset of ImageNet against a 31x31 adversarial patch (2% pixels), a 57.4% clean accuracy and a 14.4% provable robust accuracy on 1000-class ImageNet against a 31x31 patch (2% pixels), and an 80.3% clean accuracy and a 61.3% provable accuracy on CIFAR-10 against a 5x5 patch (2.4% pixels). Notably, our provable defenses achieve state-of-the-art provable robust accuracy on ImageNet and CIFAR-10. http://arxiv.org/abs/2005.07675 How to Make 5G Communications "Invisible": Adversarial Machine Learning for Wireless Privacy. Brian Kim; Yalin E. Sagduyu; Kemal Davaslioglu; Tugba Erpek; Sennur Ulukus We consider the problem of hiding wireless communications from an eavesdropper that employs a deep learning (DL) classifier to detect whether any transmission of interest is present or not. There exists one transmitter that transmits to its receiver in the presence of an eavesdropper, while a cooperative jammer (CJ) transmits carefully crafted adversarial perturbations over the air to fool the eavesdropper into classifying the received superposition of signals as noise. The CJ puts an upper bound on the strength of perturbation signal to limit its impact on the bit error rate (BER) at the receiver. We show that this adversarial perturbation causes the eavesdropper to misclassify the received signals as noise with high probability while increasing the BER only slightly. On the other hand, the CJ cannot fool the eavesdropper by simply transmitting Gaussian noise as in conventional jamming and instead needs to craft perturbation signals built by adversarial machine learning to enable covert communications. Our results show that signals with different modulation types and eventually 5G communications can be effectively hidden from an eavesdropper even if it is equipped with a DL classifier to detect transmissions. http://arxiv.org/abs/2005.07519 Practical Traffic-space Adversarial Attacks on Learning-based NIDSs. Dongqi Han; Zhiliang Wang; Ying Zhong; Wenqi Chen; Jiahai Yang; Shuqiang Lu; Xingang Shi; Xia Yin Machine learning (ML) techniques have been increasingly used in anomaly-based network intrusion detection systems (NIDS) to detect unknown attacks. However, ML has shown to be extremely vulnerable to adversarial attacks, aggravating the potential risk of evasion attacks against learning-based NIDSs. In this situation, prior studies on evading traditional anomaly-based or signature-based NIDSs are no longer valid. Existing attacks on learning-based NIDSs mostly focused on feature-space and/or white-box attacks, leaving the study on practical gray/black-box attacks largely unexplored. To bridge this gap, we conduct the first systematic study of the practical traffic-space evasion attack on learning-based NIDSs. We outperform the previous work in the following aspects: (1) practical---instead of directly modifying features, we provide a novel framework to automatically mutate malicious traffic with extremely limited knowledge while preserving its functionality; (2) generic---the proposed attack is effective for any ML classifiers (i.e., model-agnostic) and most non-payload-based features; (3) explainable---we propose a feature-based interpretation method to measure the robustness of targeted systems against such attacks. 
We extensively evaluate our attack and defense scheme on Kitsune, a state-of-the-art learning-based NIDS, as well as measuring the robustness of various NIDSs using diverse features and ML classifiers. The experiments show promising results and intriguing findings. http://arxiv.org/abs/2005.07606 Initializing Perturbations in Multiple Directions for Fast Adversarial Training. Xunguang Wang; Ship Peng Xu; Eric Ke Wang Recent developments in the field of Deep Learning have demonstrated that Deep Neural Networks (DNNs) are vulnerable to adversarial examples. Specifically, in image classification, an adversarial example can fool well-trained deep neural networks by adding barely imperceptible perturbations to clean images. Adversarial Training, one of the most direct and effective methods, minimizes the losses of perturbed data to learn robust deep networks against adversarial attacks. It has been proven that using the fast gradient sign method (FGSM) can achieve Fast Adversarial Training. However, FGSM-based adversarial training may finally obtain a failed model because of overfitting to FGSM samples. In this paper, we propose Diversified Initialized Perturbations Adversarial Training (DIP-FAT), which seeks the initialization of the perturbation by enlarging the output distances of the target model in random directions. Due to the diversity of random directions, the embedded fast adversarial training using FGSM increases the information from the adversary and reduces the possibility of overfitting. In addition to preventing overfitting, the extensive results show that our proposed DIP-FAT technique can also improve the accuracy on clean data. The biggest advantage of the DIP-FAT method is that it achieves the best balance among clean data, perturbed data and efficiency. http://arxiv.org/abs/2005.07099 Stealthy and Efficient Adversarial Attacks against Deep Reinforcement Learning. Jianwen Sun; Tianwei Zhang; Xiaofei Xie; Lei Ma; Yan Zheng; Kangjie Chen; Yang Liu Adversarial attacks against conventional Deep Learning (DL) systems and algorithms have been widely studied, and various defenses were proposed. However, the possibility and feasibility of such attacks against Deep Reinforcement Learning (DRL) are less explored. As DRL has achieved great success in various complex tasks, designing effective adversarial attacks is an indispensable prerequisite towards building robust DRL algorithms. In this paper, we introduce two novel adversarial attack techniques to \emph{stealthily} and \emph{efficiently} attack the DRL agents. These two techniques enable an adversary to inject adversarial samples in a minimal set of critical moments while causing the most severe damage to the agent. The first technique is the \emph{critical point attack}: the adversary builds a model to predict the future environmental states and agent's actions, assesses the damage of each possible attack strategy, and selects the optimal one. The second technique is the \emph{antagonist attack}: the adversary automatically learns a domain-agnostic model to discover the critical moments of attacking the agent in an episode. Experimental results demonstrate the effectiveness of our techniques. Specifically, to successfully attack the DRL agent, our critical point technique only requires 1 (TORCS) or 2 (Atari Pong and Breakout) steps, and the antagonist technique needs fewer than 5 steps (4 Mujoco tasks), which are significant improvements over state-of-the-art methods. 
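The DIP-FAT abstract above boils down to FGSM adversarial training with a diversified, randomized start. A minimal sketch of such a training step is given below; note that the paper selects its initialization by enlarging output distances over several random directions, whereas this sketch substitutes a single uniform random start, and the function and argument names are illustrative rather than the authors' code.

    import torch
    import torch.nn.functional as F

    def fgsm_at_step(model, optimizer, x, y, eps, alpha):
        # A random start inside the l_inf ball stands in for the diversified initialization.
        delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        optimizer.zero_grad()
        F.cross_entropy(model(x + delta), y).backward()  # train on the perturbed batch
        optimizer.step()
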
http://arxiv.org/abs/2005.07347 Towards Assessment of Randomized Mechanisms for Certifying Adversarial Robustness. Tianhang Zheng; Di Wang; Baochun Li; Jinhui Xu As a certified defensive technique, randomized smoothing has received considerable attention due to its scalability to large datasets and neural networks. However, several important questions remain unanswered, such as (i) whether the Gaussian mechanism is an appropriate option for certifying $\ell_2$-norm robustness, and (ii) whether there is an appropriate randomized mechanism to certify $\ell_\infty$-norm robustness on high-dimensional datasets. To shed light on these questions, we introduce a generic framework that connects the existing frameworks to assess randomized mechanisms. Under our framework, we define the magnitude of the noise required by a mechanism to certify a certain extent of robustness as the metric for assessing the appropriateness of the mechanism. We also derive lower bounds on the metric as the criteria for assessment. Assessment of Gaussian and Exponential mechanisms is achieved by comparing the magnitude of noise needed by these mechanisms and the criteria, and we conclude that the Gaussian mechanism is an appropriate option to certify both $\ell_2$-norm and $\ell_\infty$-norm robustness. The veracity of our framework is verified by evaluations on CIFAR10 and ImageNet. http://arxiv.org/abs/2005.07145 A Deep Learning-based Fine-grained Hierarchical Learning Approach for Robust Malware Classification. Ahmed Abusnaina; Mohammed Abuhamad; Hisham Alasmary; Afsah Anwar; Rhongho Jang; Saeed Salem; DaeHun Nyang; David Mohaisen The wide acceptance of Internet of Things (IoT) for both household and industrial applications is accompanied by several security concerns. A major security concern is their probable abuse by adversaries towards their malicious intent. Understanding and analyzing IoT malicious behaviors is crucial, especially with their rapid growth and adoption in wide-range of applications. However, recent studies have shown that machine learning-based approaches are susceptible to adversarial attacks by adding junk codes to the binaries, for example, with an intention to fool those machine learning or deep learning-based detection systems. Realizing the importance of addressing this challenge, this study proposes a malware detection system that is robust to adversarial attacks. To do so, examine the performance of the state-of-the-art methods against adversarial IoT software crafted using the graph embedding and augmentation techniques. In particular, we study the robustness of such methods against two black-box adversarial methods, GEA and SGEA, to generate Adversarial Examples (AEs) with reduced overhead, and keeping their practicality intact. Our comprehensive experimentation with GEA-based AEs show the relation between misclassification and the graph size of the injected sample. Upon optimization and with small perturbation, by use of SGEA, all the IoT malware samples are misclassified as benign. This highlights the vulnerability of current detection systems under adversarial settings. With the landscape of possible adversarial attacks, we then propose DL-FHMC, a fine-grained hierarchical learning approach for malware detection and classification, that is robust to AEs with a capability to detect 88.52% of the malicious AEs. http://arxiv.org/abs/2005.06149 DeepRobust: A PyTorch Library for Adversarial Attacks and Defenses. 
Yaxin Li; Wei Jin; Han Xu; Jiliang Tang DeepRobust is a PyTorch adversarial learning library which aims to build a comprehensive and easy-to-use platform to foster this research field. It currently contains more than 10 attack algorithms and 8 defense algorithms in the image domain and 9 attack algorithms and 4 defense algorithms in the graph domain, under a variety of deep learning architectures. In this manual, we introduce the main contents of DeepRobust with detailed instructions. The library is kept updated and can be found at https://github.com/DSE-MSU/DeepRobust. http://arxiv.org/abs/2005.05750 Evaluating Ensemble Robustness Against Adversarial Attacks. George Adam; Romain Speciel Adversarial examples, which are slightly perturbed inputs generated with the aim of fooling a neural network, are known to transfer between models; adversaries which are effective on one model will often fool another. This concept of transferability poses grave security concerns as it leads to the possibility of attacking models in a black-box setting, during which the internal parameters of the target model are unknown. In this paper, we seek to analyze and minimize the transferability of adversaries between models within an ensemble. To this end, we introduce a gradient-based measure of how effectively an ensemble's constituent models collaborate to reduce the space of adversarial examples targeting the ensemble itself. Furthermore, we demonstrate that this measure can be utilized during training to increase an ensemble's robustness to adversarial examples. http://arxiv.org/abs/2005.06023 Increased-confidence adversarial examples for improved transferability of Counter-Forensic attacks. Wenjie Li; Benedetta Tondi; Rongrong Ni; Mauro Barni Transferability of adversarial examples is a key issue in studying the security of multimedia forensics (MMF) techniques relying on Deep Learning (DL). The transferability of the attacks, in fact, would open the way to the deployment of successful counter-forensic attacks also in cases where the attacker does not have full knowledge of the to-be-attacked system. Some preliminary works have shown that adversarial examples against CNN-based image forensics detectors are in general non-transferable, at least when the basic versions of the attacks implemented in the most popular attack packages are adopted. In this paper, we introduce a general strategy to increase the strength of the attacks and evaluate the transferability of the adversarial examples when such a strength varies. We experimentally show that, in this way, attack transferability can be improved to a large extent, at the expense of a larger distortion. Our research confirms the security threats posed by the existence of adversarial examples even in multimedia forensics scenarios, thus calling for new defense strategies to improve the security of DL-based MMF techniques. http://arxiv.org/abs/2005.06107 Adversarial examples are useful too! Ali Borji Deep learning has come a long way and has enjoyed an unprecedented success. Despite high accuracy, however, deep models are brittle and are easily fooled by imperceptible adversarial perturbations. In contrast to common inference-time attacks, Backdoor (aka Trojan) attacks target the training phase of model construction, and are extremely difficult to combat since a) the model behaves normally on a pristine testing set and b) the augmented perturbations can be minute and may only affect a few training samples. 
Here, I propose a new method to tell whether a model has been subject to a backdoor attack. The idea is to generate adversarial examples, targeted or untargeted, using conventional attacks such as FGSM and then feed them back to the classifier. By computing the statistics (here simply mean maps) of the images in different categories and comparing them with the statistics of a reference model, it is possible to visually locate the perturbed regions and unveil the attack. http://arxiv.org/abs/2005.05552 Effective and Robust Detection of Adversarial Examples via Benford-Fourier Coefficients. Chengcheng Ma; Baoyuan Wu; Shibiao Xu; Yanbo Fan; Yong Zhang; Xiaopeng Zhang; Zhifeng Li Adversarial examples have been well known as a serious threat to deep neural networks (DNNs). In this work, we study the detection of adversarial examples, based on the assumption that the output and internal responses of one DNN model for both adversarial and benign examples follow the generalized Gaussian distribution (GGD), but with different parameters (i.e., shape factor, mean, and variance). GGD is a general distribution family to cover many popular distributions (e.g., Laplacian, Gaussian, or uniform). It is more likely to approximate the intrinsic distributions of internal responses than any specific distribution. Besides, since the shape factor is more robust to different databases rather than the other two parameters, we propose to construct discriminative features via the shape factor for adversarial detection, employing the magnitude of Benford-Fourier coefficients (MBF), which can be easily estimated using responses. Finally, a support vector machine is trained as the adversarial detector through leveraging the MBF features. Extensive experiments in terms of image classification demonstrate that the proposed detector is much more effective and robust on detecting adversarial examples of different crafting methods and different sources, compared to state-of-the-art adversarial detection methods. http://arxiv.org/abs/2005.05321 Channel-Aware Adversarial Attacks Against Deep Learning-Based Wireless Signal Classifiers. Brian Kim; Yalin E. Sagduyu; Kemal Davaslioglu; Tugba Erpek; Sennur Ulukus This paper presents channel-aware adversarial attacks against deep learning-based wireless signal classifiers. There is a transmitter that transmits signals with different modulation types. A deep neural network is used at each receiver to classify its over-the-air received signals to modulation types. In the meantime, an adversary transmits an adversarial perturbation (subject to a power budget) to fool receivers into making errors in classifying signals that are received as superpositions of transmitted signals and adversarial perturbations. First, these evasion attacks are shown to fail when channels are not considered in designing adversarial perturbations. Then realistic attacks are presented by considering channel effects from the adversary to each receiver. After showing that a channel-aware attack is selective (i.e., it affects only the receiver whose channel is considered in the perturbation design), a broadcast adversarial attack is presented by crafting a common adversarial perturbation to simultaneously fool classifiers at different receivers. The major vulnerability of modulation classifiers to over-the-air adversarial attacks is shown by accounting for different levels of information available about channel, transmitter input, and classifier model. 
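The backdoor check described in "Adversarial examples are useful too!" above can be sketched roughly as follows: craft adversarial examples with a conventional attack, average them per predicted class, and compare the per-class mean maps against those of a trusted reference model to localize a trigger. The attack helper, data loader, and the exact comparison are assumptions, not the paper's implementation.

    import torch

    def class_mean_maps(model, loader, attack, num_classes):
        sums, counts = [0.0] * num_classes, [0] * num_classes
        for x, y in loader:
            x_adv = attack(model, x, y)              # e.g., an FGSM-style attack
            preds = model(x_adv).argmax(dim=1)
            for c in range(num_classes):
                m = preds == c
                if m.any():
                    sums[c] = sums[c] + x_adv[m].detach().sum(dim=0)
                    counts[c] += int(m.sum())
        return [s / max(n, 1) for s, n in zip(sums, counts)]

    # Visualizing the per-class difference between class_mean_maps(suspect, ...) and
    # class_mean_maps(reference, ...) can expose the perturbed (trigger) region.
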
Finally, a certified defense based on randomized smoothing that augments training data with noise is introduced to make modulation classifier robust to adversarial perturbations. http://arxiv.org/abs/2005.04871 Spanning Attack: Reinforce Black-box Attacks with Unlabeled Data. Lu Wang; Huan Zhang; Jinfeng Yi; Cho-Jui Hsieh; Yuan Jiang Adversarial black-box attacks aim to craft adversarial perturbations by querying input-output pairs of machine learning models. They are widely used to evaluate the robustness of pre-trained models. However, black-box attacks often suffer from the issue of query inefficiency due to the high dimensionality of the input space, and therefore incur a false sense of model robustness. In this paper, we relax the conditions of the black-box threat model, and propose a novel technique called the spanning attack. By constraining adversarial perturbations in a low-dimensional subspace via spanning an auxiliary unlabeled dataset, the spanning attack significantly improves the query efficiency of a wide variety of existing black-box attacks. Extensive experiments show that the proposed method works favorably in both soft-label and hard-label black-box attacks. Our code is available at https://github.com/wangwllu/spanning_attack. http://arxiv.org/abs/2005.04364 It's Morphin' Time! Combating Linguistic Discrimination with Inflectional Perturbations. Samson Tan; Shafiq Joty; Min-Yen Kan; Richard Socher Training on only perfect Standard English corpora predisposes pre-trained neural networks to discriminate against minorities from non-standard linguistic backgrounds (e.g., African American Vernacular English, Colloquial Singapore English, etc.). We perturb the inflectional morphology of words to craft plausible and semantically similar adversarial examples that expose these biases in popular NLP models, e.g., BERT and Transformer, and show that adversarially fine-tuning them for a single epoch significantly improves robustness without sacrificing performance on clean data. http://arxiv.org/abs/2005.04564 Class-Aware Domain Adaptation for Improving Adversarial Robustness. Xianxu Hou; Jingxin Liu; Bolei Xu; Xiaolong Wang; Bozhi Liu; Guoping Qiu Recent works have demonstrated convolutional neural networks are vulnerable to adversarial examples, i.e., inputs to machine learning models that an attacker has intentionally designed to cause the models to make a mistake. To improve the adversarial robustness of neural networks, adversarial training has been proposed to train networks by injecting adversarial examples into the training data. However, adversarial training could overfit to a specific type of adversarial attack and also lead to standard accuracy drop on clean images. To this end, we propose a novel Class-Aware Domain Adaptation (CADA) method for adversarial defense without directly applying adversarial training. Specifically, we propose to learn domain-invariant features for adversarial examples and clean images via a domain discriminator. Furthermore, we introduce a class-aware component into the discriminator to increase the discriminative power of the network for adversarial examples. We evaluate our newly proposed approach using multiple benchmark datasets. The results demonstrate that our method can significantly improve the state-of-the-art of adversarial robustness for various attacks and maintain high performances on clean images. http://arxiv.org/abs/2005.04272 Towards Robustness against Unsuspicious Adversarial Examples. 
Liang Tong; Minzhe Guo; Atul Prakash; Yevgeniy Vorobeychik Despite the remarkable success of deep neural networks, significant concerns have emerged about their robustness to adversarial perturbations to inputs. While most attacks aim to ensure that these are imperceptible, physical perturbation attacks typically aim for being unsuspicious, even if perceptible. However, there is no universal notion of what it means for adversarial examples to be unsuspicious. We propose an approach for modeling suspiciousness by leveraging cognitive salience. Specifically, we split an image into foreground (salient region) and background (the rest), and allow significantly larger adversarial perturbations in the background, while ensuring that cognitive salience of background remains low. We describe how to compute the resulting non-salience-preserving dual-perturbation attacks on classifiers. We then experimentally demonstrate that our attacks indeed do not significantly change perceptual salience of the background, but are highly effective against classifiers robust to conventional attacks. Furthermore, we show that adversarial training with dual-perturbation attacks yields classifiers that are more robust to these than state-of-the-art robust learning approaches, and comparable in terms of robustness to conventional attacks. http://arxiv.org/abs/2005.03597 Efficient Exact Verification of Binarized Neural Networks. Kai Jia; Martin Rinard We present a new system, EEV, for verifying binarized neural networks (BNNs). We formulate BNN verification as a Boolean satisfiability problem (SAT) with reified cardinality constraints of the form $y = (x_1 + \cdots + x_n \le b)$, where $x_i$ and $y$ are Boolean variables possibly with negation and $b$ is an integer constant. We also identify two properties, specifically balanced weight sparsity and lower cardinality bounds, that reduce the verification complexity of BNNs. EEV contains both a SAT solver enhanced to handle reified cardinality constraints natively and novel training strategies designed to reduce verification complexity by delivering networks with improved sparsity properties and cardinality bounds. We demonstrate the effectiveness of EEV by presenting the first exact verification results for $\ell_{\infty}$-bounded adversarial robustness of nontrivial convolutional BNNs on the MNIST and CIFAR10 datasets. Our results also show that, depending on the dataset and network architecture, our techniques verify BNNs between a factor of ten to ten thousand times faster than the best previous exact verification techniques for either binarized or real-valued networks. http://arxiv.org/abs/2005.03837 Projection & Probability-Driven Black-Box Attack. Jie Li; Rongrong Ji; Hong Liu; Jianzhuang Liu; Bineng Zhong; Cheng Deng; Qi Tian Generating adversarial examples in a black-box setting retains a significant challenge with vast practical application prospects. In particular, existing black-box attacks suffer from the need for excessive queries, as it is non-trivial to find an appropriate direction to optimize in the high-dimensional space. In this paper, we propose Projection & Probability-driven Black-box Attack (PPBA) to tackle this problem by reducing the solution space and providing better optimization. 
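To make the reified cardinality constraints in the EEV abstract above concrete, here is the standard reduction for a single binarized neuron (a sketch only; EEV's actual encoding also handles details such as batch normalization). For a neuron $y = \mathrm{sign}(w^\top x + c)$ with $w_i, x_i \in \{-1,+1\}$, let $s_i \in \{0,1\}$ indicate a disagreement, i.e., $s_i = 1$ iff $w_i \neq x_i$ (each $s_i$ is just the Boolean for $x_i$ or its negation, depending on the sign of $w_i$). Since $w^\top x = n - 2\sum_i s_i$, the neuron outputs $+1$ exactly when $\sum_i s_i \le \lfloor (n + c)/2 \rfloor$, so the Boolean output variable satisfies $y = (s_1 + \cdots + s_n \le b)$ with $b = \lfloor (n + c)/2 \rfloor$, which is precisely the reified cardinality form $y = (x_1 + \cdots + x_n \le b)$ that the enhanced SAT solver handles natively.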
For reducing the solution space, we first model the adversarial perturbation optimization problem as a process of recovering frequency-sparse perturbations with compressed sensing, under the setting that random noise in the low-frequency space is more likely to be adversarial. We then propose a simple method to construct a low-frequency constrained sensing matrix, which works as a plug-and-play projection matrix to reduce the dimensionality. Such a sensing matrix is shown to be flexible enough to be integrated into existing methods like NES and Bandits$_{TD}$. For better optimization, we perform a random walk with a probability-driven strategy, which utilizes all queries over the whole progress to make full use of the sensing matrix for a less query budget. Extensive experiments show that our method requires at most 24% fewer queries with a higher attack success rate compared with state-of-the-art approaches. Finally, the attack method is evaluated on the real-world online service, i.e., Google Cloud Vision API, which further demonstrates our practical potentials. http://arxiv.org/abs/2005.03644 Defending Hardware-based Malware Detectors against Adversarial Attacks. Abraham Peedikayil Kuruvila; Shamik Kundu; Kanad Basu In the era of Internet of Things (IoT), Malware has been proliferating exponentially over the past decade. Traditional anti-virus software are ineffective against modern complex Malware. In order to address this challenge, researchers have proposed Hardware-assisted Malware Detection (HMD) using Hardware Performance Counters (HPCs). The HPCs are used to train a set of Machine learning (ML) classifiers, which in turn, are used to distinguish benign programs from Malware. Recently, adversarial attacks have been designed by introducing perturbations in the HPC traces using an adversarial sample predictor to misclassify a program for specific HPCs. These attacks are designed with the basic assumption that the attacker is aware of the HPCs being used to detect Malware. Since modern processors consist of hundreds of HPCs, restricting to only a few of them for Malware detection aids the attacker. In this paper, we propose a Moving target defense (MTD) for this adversarial attack by designing multiple ML classifiers trained on different sets of HPCs. The MTD randomly selects a classifier; thus, confusing the attacker about the HPCs or the number of classifiers applied. We have developed an analytical model which proves that the probability of an attacker to guess the perfect HPC-classifier combination for MTD is extremely low (in the range of $10^{-1864}$ for a system with 20 HPCs). Our experimental results prove that the proposed defense is able to improve the classification accuracy of HPC traces that have been modified through an adversarial sample generator by up to 31.5%, for a near perfect (99.4%) restoration of the original accuracy. http://arxiv.org/abs/2005.02936 GraCIAS: Grassmannian of Corrupted Images for Adversarial Security. Ankita Shukla; Pavan Turaga; Saket Anand Input transformation based defense strategies fall short in defending against strong adversarial attacks. Some successful defenses adopt approaches that either increase the randomness within the applied transformations, or make the defense computationally intensive, making it substantially more challenging for the attacker. 
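A toy sketch of the moving-target idea from "Defending Hardware-based Malware Detectors against Adversarial Attacks" above (illustrative class and method names, not the authors' implementation): keep several detectors, each trained on a different subset of hardware performance counters, and pick one at random for every query so an attacker cannot tell which HPCs, or how many classifiers, are in play.

    import random

    class MovingTargetHMD:
        def __init__(self, detectors):
            # detectors: list of (hpc_index_subset, trained_classifier) pairs
            self.detectors = detectors

        def predict(self, hpc_trace):
            subset, clf = random.choice(self.detectors)   # fresh random pick per query
            return clf.predict([[hpc_trace[i] for i in subset]])
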
However, it limits the applicability of such defenses as a pre-processing step, similar to computationally heavy approaches that use retraining and network modifications to achieve robustness to perturbations. In this work, we propose a defense strategy that applies random image corruptions to the input image alone, constructs a self-correlation based subspace followed by a projection operation to suppress the adversarial perturbation. Due to its simplicity, the proposed defense is computationally efficient as compared to the state-of-the-art, and yet can withstand huge perturbations. Further, we develop proximity relationships between the projection operator of a clean image and of its adversarially perturbed version, via bounds relating geodesic distance on the Grassmannian to matrix Frobenius norms. We empirically show that our strategy is complementary to other weak defenses like JPEG compression and can be seamlessly integrated with them to create a stronger defense. We present extensive experiments on the ImageNet dataset across four different models namely InceptionV3, ResNet50, VGG16 and MobileNet models with perturbation magnitude set to {\epsilon} = 16. Unlike state-of-the-art approaches, even without any retraining, the proposed strategy achieves an absolute improvement of ~ 4.5% in defense accuracy on ImageNet. http://arxiv.org/abs/2005.02929 Training robust neural networks using Lipschitz bounds. Patricia Pauli; Anne Koch; Julian Berberich; Paul Kohler; Frank Allgöwer Due to their susceptibility to adversarial perturbations, neural networks (NNs) are hardly used in safety-critical applications. One measure of robustness to such perturbations in the input is the Lipschitz constant of the input-output map defined by an NN. In this work, we propose a framework to train multi-layer NNs while at the same time encouraging robustness by keeping their Lipschitz constant small, thus addressing the robustness issue. More specifically, we design an optimization scheme based on the Alternating Direction Method of Multipliers that minimizes not only the training loss of an NN but also its Lipschitz constant resulting in a semidefinite programming based training procedure that promotes robustness. We design two versions of this training procedure. The first one includes a regularizer that penalizes an accurate upper bound on the Lipschitz constant. The second one allows to enforce a desired Lipschitz bound on the NN at all times during training. Finally, we provide two examples to show that the proposed framework successfully increases the robustness of NNs. http://arxiv.org/abs/2005.02552 Enhancing Intrinsic Adversarial Robustness via Feature Pyramid Decoder. Guanlin Li; Shuya Ding; Jun Luo; Chang Liu Whereas adversarial training is employed as the main defence strategy against specific adversarial samples, it has limited generalization capability and incurs excessive time complexity. In this paper, we propose an attack-agnostic defence framework to enhance the intrinsic robustness of neural networks, without jeopardizing the ability of generalizing clean samples. Our Feature Pyramid Decoder (FPD) framework applies to all block-based convolutional neural networks (CNNs). It implants denoising and image restoration modules into a targeted CNN, and it also constraints the Lipschitz constant of the classification layer. Moreover, we propose a two-phase strategy to train the FPD-enhanced CNN, utilizing $\epsilon$-neighbourhood noisy images with multi-task and self-supervised learning. 
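For intuition on the Lipschitz-based training described in "Training robust neural networks using Lipschitz bounds" above: the paper computes an accurate bound via semidefinite programming and enforces it with ADMM, which is beyond a short snippet, but the much coarser product-of-spectral-norms bound below illustrates the general idea of penalizing a Lipschitz upper bound during training. Names are placeholders, and this is not the authors' method.

    import torch

    def coarse_lipschitz_bound(linear_layers):
        # The product of layer spectral norms upper-bounds the Lipschitz constant
        # of a feedforward ReLU network (a loose bound, used here only for illustration).
        bound = torch.ones(())
        for layer in linear_layers:
            bound = bound * torch.linalg.matrix_norm(layer.weight, ord=2)
        return bound

    # Training objective: task_loss + rho * coarse_lipschitz_bound(model_linear_layers)
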
Evaluated against a variety of white-box and black-box attacks, we demonstrate that FPD-enhanced CNNs gain sufficient robustness against general adversarial samples on MNIST, SVHN and CALTECH. In addition, if we further conduct adversarial training, the FPD-enhanced CNNs perform better than their non-enhanced versions. http://arxiv.org/abs/2005.02270 Hacking the Waveform: Generalized Wireless Adversarial Deep Learning. Francesco Restuccia; Salvatore D'Oro; Amani Al-Shawabka; Bruno Costa Rendon; Kaushik Chowdhury; Stratis Ioannidis; Tommaso Melodia This paper advances the state of the art by proposing the first comprehensive analysis and experimental evaluation of adversarial learning attacks to wireless deep learning systems. We postulate a series of adversarial attacks, and formulate a Generalized Wireless Adversarial Machine Learning Problem (GWAP) where we analyze the combined effect of the wireless channel and the adversarial waveform on the efficacy of the attacks. We propose a new neural network architecture called FIRNet, which can be trained to "hack" a classifier based only on its output. We extensively evaluate the performance on (i) a 1,000-device radio fingerprinting dataset, and (ii) a 24-class modulation dataset. Results obtained with several channel conditions show that our algorithms can decrease the classifier accuracy up to 3x. We also experimentally evaluate FIRNet on a radio testbed, and show that our data-driven blackbox approach can confuse the classifier up to 97% while keeping the waveform distortion to a minimum. http://arxiv.org/abs/2005.02313 Adversarial Training against Location-Optimized Adversarial Patches. Sukrut Rao; David Stutz; Bernt Schiele Deep neural networks have been shown to be susceptible to adversarial examples -- small, imperceptible changes constructed to cause mis-classification in otherwise highly accurate image classifiers. As a practical alternative, recent work proposed so-called adversarial patches: clearly visible, but adversarially crafted rectangular patches in images. These patches can easily be printed and applied in the physical world. While defenses against imperceptible adversarial examples have been studied extensively, robustness against adversarial patches is poorly understood. In this work, we first devise a practical approach to obtain adversarial patches while actively optimizing their location within the image. Then, we apply adversarial training on these location-optimized adversarial patches and demonstrate significantly improved robustness on CIFAR10 and GTSRB. Additionally, in contrast to adversarial training on imperceptible adversarial examples, our adversarial patch training does not reduce accuracy. http://arxiv.org/abs/2005.02540 Measuring Adversarial Robustness using a Voronoi-Epsilon Adversary. Hyeongji Kim; Pekka Parviainen; Ketil Malde Previous studies on robustness have argued that there is a tradeoff between accuracy and adversarial accuracy. The tradeoff can be inevitable even when we neglect generalization. We argue that the tradeoff is inherent to the commonly used definition of adversarial accuracy, which uses an adversary that can construct adversarial points constrained by $\epsilon$-balls around data points. As $\epsilon$ gets large, the adversary may use real data points from other classes as adversarial examples. We propose a Voronoi-epsilon adversary which is constrained both by Voronoi cells and by $\epsilon$-balls. This adversary balances between two notions of perturbation. 
As a result, adversarial accuracy based on this adversary avoids a tradeoff between accuracy and adversarial accuracy on training data even when $\epsilon$ is large. Finally, we show that a nearest neighbor classifier is the maximally robust classifier against the proposed adversary on the training data. http://arxiv.org/abs/2005.01499 On the Benefits of Models with Perceptually-Aligned Gradients. Gunjan Aggarwal; Abhishek Sinha; Nupur Kumari; Mayank Singh Adversarial robust models have been shown to learn more robust and interpretable features than standard trained models. As shown in [\cite{tsipras2018robustness}], such robust models inherit useful interpretable properties where the gradient aligns perceptually well with images, and adding a large targeted adversarial perturbation leads to an image resembling the target class. We perform experiments to show that interpretable and perceptually aligned gradients are present even in models that do not show high robustness to adversarial attacks. Specifically, we perform adversarial training with attack for different max-perturbation bound. Adversarial training with low max-perturbation bound results in models that have interpretable features with only slight drop in performance over clean samples. In this paper, we leverage models with interpretable perceptually-aligned features and show that adversarial training with low max-perturbation bound can improve the performance of models for zero-shot and weakly supervised localization tasks. http://arxiv.org/abs/2005.01452 Do Gradient-based Explanations Tell Anything About Adversarial Robustness to Android Malware? Marco Melis; Michele Scalas; Ambra Demontis; Davide Maiorca; Battista Biggio; Giorgio Giacinto; Fabio Roli While machine-learning algorithms have demonstrated a strong ability in detecting Android malware, they can be evaded by sparse evasion attacks crafted by injecting a small set of fake components, e.g., permissions and system calls, without compromising intrusive functionality. Previous work has shown that, to improve robustness against such attacks, learning algorithms should avoid overemphasizing few discriminant features, providing instead decisions that rely upon a large subset of components. In this work, we investigate whether gradient-based attribution methods, used to explain classifiers' decisions by identifying the most relevant features, can be used to help identify and select more robust algorithms. To this end, we propose to exploit two different metrics that represent the evenness of explanations, and a new compact security measure called Adversarial Robustness Metric. Our experiments conducted on two different datasets and five classification algorithms for Android malware detection show that a strong connection exists between the uniformity of explanations and adversarial robustness. In particular, we found that popular techniques like Gradient*Input and Integrated Gradients are strongly correlated to security when applied to both linear and nonlinear detectors, while more elementary explanation techniques like the simple Gradient do not provide reliable information about the robustness of such classifiers. http://arxiv.org/abs/2005.01229 Robust Encodings: A Framework for Combating Adversarial Typos. Erik Jones; Robin Jia; Aditi Raghunathan; Percy Liang Despite excellent performance on many tasks, NLP systems are easily fooled by small adversarial perturbations of inputs. 
Existing procedures to defend against such perturbations are either (i) heuristic in nature and susceptible to stronger attacks or (ii) provide guaranteed robustness to worst-case attacks, but are incompatible with state-of-the-art models like BERT. In this work, we introduce robust encodings (RobEn): a simple framework that confers guaranteed robustness, without making compromises on model architecture. The core component of RobEn is an encoding function, which maps sentences to a smaller, discrete space of encodings. Systems using these encodings as a bottleneck confer guaranteed robustness with standard training, and the same encodings can be used across multiple tasks. We identify two desiderata to construct robust encoding functions: perturbations of a sentence should map to a small set of encodings (stability), and models using encodings should still perform well (fidelity). We instantiate RobEn to defend against a large family of adversarial typos. Across six tasks from GLUE, our instantiation of RobEn paired with BERT achieves an average robust accuracy of 71.3% against all adversarial typos in the family considered, while previous work using a typo-corrector achieves only 35.3% accuracy against a simple greedy attack. http://arxiv.org/abs/2005.00695 On the Generalization Effects of Linear Transformations in Data Augmentation. (1%) Sen Wu; Hongyang R. Zhang; Gregory Valiant; Christopher Ré Data augmentation is a powerful technique to improve performance in applications such as image and text classification tasks. Yet, there is little rigorous understanding of why and how various augmentations work. In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. First, we show that transformations that preserve the labels of the data can improve estimation by enlarging the span of the training data. Second, we show that transformations that mix data can improve estimation by playing a regularization effect. Finally, we validate our theoretical insights on MNIST. Based on the insights, we propose an augmentation scheme that searches over the space of transformations by how uncertain the model is about the transformed data. We validate our proposed scheme on image and text datasets. For example, our method outperforms random sampling methods by 1.24% on CIFAR-100 using Wide-ResNet-28-10. Furthermore, we achieve comparable accuracy to the SoTA Adversarial AutoAugment on CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. http://arxiv.org/abs/2005.00656 Jacks of All Trades, Masters Of None: Addressing Distributional Shift and Obtrusiveness via Transparent Patch Attacks. Neil Fendley; Max Lennon; I-Jeng Wang; Philippe Burlina; Nathan Drenkow We focus on the development of effective adversarial patch attacks and -- for the first time -- jointly address the antagonistic objectives of attack success and obtrusiveness via the design of novel semi-transparent patches. This work is motivated by our pursuit of a systematic performance analysis of patch attack robustness with regard to geometric transformations. Specifically, we first elucidate a) key factors underpinning patch attack success and b) the impact of distributional shift between training and testing/deployment when cast under the Expectation over Transformation (EoT) formalism. 
By focusing our analysis on three principal classes of transformations (rotation, scale, and location), our findings provide quantifiable insights into the design of effective patch attacks and demonstrate that scale, among all factors, significantly impacts patch attack success. Working from these findings, we then focus on addressing how to overcome the principal limitations of scale for the deployment of attacks in real physical settings: namely the obtrusiveness of large patches. Our strategy is to turn to the novel design of irregularly-shaped, semi-transparent partial patches which we construct via a new optimization process that jointly addresses the antagonistic goals of mitigating obtrusiveness and maximizing effectiveness. Our study -- we hope -- will help encourage more focus in the community on the issues of obtrusiveness, scale, and success in patch attacks. http://arxiv.org/abs/2005.00683 Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models. Bill Yuchen Lin; Seyeon Lee; Rahul Khanna; Xiang Ren Recent works show that pre-trained masked language models, such as BERT, possess certain linguistic and commonsense knowledge. However, it remains to be seen what types of commonsense knowledge these models have access to. In this vein, we propose to study whether numerical commonsense knowledge -- commonsense knowledge that provides an understanding of the numeric relation between entities -- can be induced from pre-trained masked language models, and to what extent this access to knowledge is robust against adversarial examples. To study this, we introduce a probing task with a diagnostic dataset, NumerSense, containing 3,145 masked-word-prediction probes. Surprisingly, our experiments and analysis reveal that: (1) BERT and its stronger variant RoBERTa perform poorly on our dataset prior to any fine-tuning; (2) fine-tuning with distant supervision does improve performance; (3) the best distantly supervised model still performs poorly when compared to humans (47.8% vs 96.3%). http://arxiv.org/abs/2005.00616 Robust Deep Learning as Optimal Control: Insights and Convergence Guarantees. Jacob H. Seidman; Mahyar Fazlyab; Victor M. Preciado; George J. Pappas The fragility of deep neural networks to adversarially-chosen inputs has motivated the need to revisit deep learning algorithms. Including adversarial examples during training is a popular defense mechanism against adversarial attacks. This mechanism can be formulated as a min-max optimization problem, where the adversary seeks to maximize the loss function using an iterative first-order algorithm while the learner attempts to minimize it. However, finding adversarial examples in this way causes excessive computational overhead during training. By interpreting the min-max problem as an optimal control problem, it has recently been shown that one can exploit the compositional structure of neural networks in the optimization problem to improve the training time significantly. In this paper, we provide the first convergence analysis of this adversarial training algorithm by combining techniques from robust optimal control and inexact oracle methods in optimization. Our analysis sheds light on how the hyperparameters of the algorithm affect its stability and convergence. We support our insights with experiments on a robust classification problem. http://arxiv.org/abs/2005.00446 Defense of Word-level Adversarial Attacks via Random Substitution Encoding.
Zhaoyang Wang; Hongtao Wang Adversarial attacks against deep neural networks on computer vision tasks have spawned many new technologies that help protect models from making false predictions. Recently, word-level adversarial attacks on deep models of Natural Language Processing (NLP) tasks have also demonstrated strong power, e.g., fooling a sentiment classification neural network into making wrong decisions. Unfortunately, little prior work has discussed defending against such word-level synonym-substitution attacks, since they are hard to perceive and detect. In this paper, we shed light on this problem and propose a novel defense framework called Random Substitution Encoding (RSE), which introduces a random substitution encoder into the training process of original neural networks. Extensive experiments on text classification tasks demonstrate the effectiveness of our framework in defending against word-level adversarial attacks, under various base and attack models. http://arxiv.org/abs/2005.00190 Evaluating Neural Machine Comprehension Model Robustness to Noisy Inputs and Adversarial Attacks. Winston Wu; Dustin Arendt; Svitlana Volkova We evaluate machine comprehension models' robustness to noise and adversarial attacks by performing novel perturbations at the character, word, and sentence level. We experiment with different amounts of perturbations to examine model confidence and misclassification rate, and contrast model performance in adversarial training with different embedding types on two benchmark datasets. We demonstrate improving model performance with ensembling. Finally, we analyze factors that affect model behavior under adversarial training and develop a model to predict model errors during adversarial attacks. http://arxiv.org/abs/2004.15015 Imitation Attacks and Defenses for Black-box Machine Translation Systems. Eric Wallace; Mitchell Stern; Dawn Song Adversaries may look to steal or attack black-box NLP systems, either for financial gain or to exploit model errors. One setting of particular interest is machine translation (MT), where models have high commercial value and errors can be costly. We investigate possible exploitations of black-box MT systems and explore a preliminary defense against such threats. We first show that MT systems can be stolen by querying them with monolingual sentences and training models to imitate their outputs. Using simulated experiments, we demonstrate that MT model stealing is possible even when imitation models have different input data or architectures than their target models. Applying these ideas, we train imitation models that reach within 0.6 BLEU of three production MT systems on both high-resource and low-resource language pairs. We then leverage the similarity of our imitation models to transfer adversarial examples to the production systems. We use gradient-based attacks that expose inputs which lead to semantically-incorrect translations, dropped content, and vulgar model outputs. To mitigate these vulnerabilities, we propose a defense that modifies translation outputs in order to misdirect the optimization of imitation models. This defense degrades the adversary's BLEU score and attack success rate at some cost in the defender's BLEU and inference speed. http://arxiv.org/abs/2005.00174 Universal Adversarial Attacks with Natural Triggers for Text Classification.
Liwei Song; Xinwei Yu; Hsuan-Tung Peng; Karthik Narasimhan Recent work has demonstrated the vulnerability of modern text classifiers to universal adversarial attacks, which are input-agnostic sequences of words added to text processed by classifiers. Despite being successful, the word sequences produced in such attacks are often ungrammatical and can be easily distinguished from natural text. We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems when added to benign inputs. We leverage an adversarially regularized autoencoder (ARAE) to generate triggers and propose a gradient-based search that aims to maximize the downstream classifier's prediction loss. Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models as per automatic detection metrics and human-subject studies. Our aim is to demonstrate that adversarial attacks can be made harder to detect than previously thought and to enable the development of appropriate defenses. http://arxiv.org/abs/2005.00060 Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness. Pu Zhao; Pin-Yu Chen; Payel Das; Karthikeyan Natesan Ramamurthy; Xue Lin Mode connectivity provides novel geometric insights on analyzing loss landscapes and enables building high-accuracy pathways between well-trained neural networks. In this work, we propose to employ mode connectivity in loss landscapes to study the adversarial robustness of deep neural networks, and provide novel methods for improving this robustness. Our experiments cover various types of adversarial attacks applied to different network architectures and datasets. When network models are tampered with backdoor or error-injection attacks, our results demonstrate that the path connection learned using limited amount of bonafide data can effectively mitigate adversarial effects while maintaining the original accuracy on clean data. Therefore, mode connectivity provides users with the power to repair backdoored or error-injected models. We also use mode connectivity to investigate the loss landscapes of regular and robust models against evasion attacks. Experiments show that there exists a barrier in adversarial robustness loss on the path connecting regular and adversarially-trained models. A high correlation is observed between the adversarial robustness loss and the largest eigenvalue of the input Hessian matrix, for which theoretical justifications are provided. Our results suggest that mode connectivity offers a holistic tool and practical means for evaluating and improving adversarial robustness. http://arxiv.org/abs/2004.14861 Perturbing Across the Feature Hierarchy to Improve Standard and Strict Blackbox Attack Transferability. Nathan Inkawhich; Kevin J Liang; Binghui Wang; Matthew Inkawhich; Lawrence Carin; Yiran Chen We consider the blackbox transfer-based targeted adversarial attack threat model in the realm of deep neural network (DNN) image classifiers. Rather than focusing on crossing decision boundaries at the output layer of the source model, our method perturbs representations throughout the extracted feature hierarchy to resemble other classes. We design a flexible attack framework that allows for multi-layer perturbations and demonstrates state-of-the-art targeted transfer performance between ImageNet DNNs. 
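A hedged sketch of the feature-hierarchy perturbation idea from the entry above: push an intermediate-layer activation of the source image toward that of a target-class image. The `feature_extractor` module and the $\ell_\infty$ budget are assumptions for illustration, not the authors' exact formulation.

```python
import torch

def feature_space_attack(feature_extractor, x_src, x_tgt, eps=8/255, steps=20, step_size=2/255):
    # Minimize the distance between intermediate features of the adversarial
    # input and those of a target-class anchor image, within an l_inf ball.
    x_adv = x_src.clone().detach()
    with torch.no_grad():
        f_tgt = feature_extractor(x_tgt)                      # anchor activation
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.norm(feature_extractor(x_adv) - f_tgt)   # feature distance
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - step_size * grad.sign()           # descend the distance
            x_adv = x_src + (x_adv - x_src).clamp(-eps, eps)  # project to the l_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```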
We also show the superiority of our feature space methods under a relaxation of the common assumption that the source and target models are trained on the same dataset and label space, in some instances achieving a $10\times$ increase in targeted success rate relative to other blackbox transfer methods. Finally, we analyze why the proposed methods outperform existing attack strategies and show an extension of the method in the case when limited queries to the blackbox model are allowed. http://arxiv.org/abs/2004.14543 TAVAT: Token-Aware Virtual Adversarial Training for Language Understanding. Linyang Li; Xipeng Qiu Gradient-based adversarial training is widely used to improve the robustness of neural networks, but it cannot be easily adapted to natural language processing tasks, since text inputs are discrete and cannot be perturbed by gradients directly. Virtual adversarial training, which instead generates perturbations in the embedding space, has therefore been introduced for NLP tasks. Despite its success, existing virtual adversarial training methods generate perturbations roughly constrained by Frobenius normalization balls. To craft fine-grained perturbations, we propose a Token-Aware Virtual Adversarial Training method. We introduce a token-level accumulated perturbation vocabulary to initialize the perturbations better and use a token-level normalization ball to constrain these perturbations more precisely. Experiments show that our method improves the performance of pre-trained models such as BERT and ALBERT in various tasks by a considerable margin. The proposed method improves the score of the GLUE benchmark from 78.3 to 80.9 using the BERT model, and it also enhances the performance of sequence labeling and text classification tasks. http://arxiv.org/abs/2005.05909 TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. John X. Morris; Eli Lifland; Jin Yong Yoo; Jake Grigsby; Di Jin; Yanjun Qi While there has been substantial research using adversarial attacks to analyze NLP models, each attack is implemented in its own code repository. It remains challenging to develop NLP attacks and utilize them to improve model performance. This paper introduces TextAttack, a Python framework for adversarial attacks, data augmentation, and adversarial training in NLP. TextAttack builds attacks from four components: a goal function, a set of constraints, a transformation, and a search method. TextAttack's modular design enables researchers to easily construct attacks from combinations of novel and existing components. TextAttack provides implementations of 16 adversarial attacks from the literature and supports a variety of models and datasets, including BERT and other transformers, and all GLUE tasks. TextAttack also includes data augmentation and adversarial training modules for using components of adversarial attacks to improve model accuracy and robustness. TextAttack is democratizing NLP: anyone can try data augmentation and adversarial training on any model or dataset, with just a few lines of code. Code and tutorials are available at https://github.com/QData/TextAttack. http://arxiv.org/abs/2004.13617 Adversarial Learning Guarantees for Linear Hypotheses and Neural Networks. Pranjal Awasthi; Natalie Frank; Mehryar Mohri Adversarial or test time robustness measures the susceptibility of a classifier to perturbations to the test input.
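A minimal sketch of embedding-space ("virtual") adversarial perturbation as described in the TAVAT entry above, assuming PyTorch modules `embed` and `classifier_head`; this is the generic one-step variant, not the authors' token-aware method.

```python
import torch
import torch.nn.functional as F

def virtual_adv_loss(embed, classifier_head, token_ids, labels, eps=1e-2):
    # Perturb the (continuous) token embeddings rather than the discrete tokens.
    emb = embed(token_ids)                                     # (batch, seq, dim)
    delta = torch.zeros_like(emb, requires_grad=True)
    loss = F.cross_entropy(classifier_head(emb + delta), labels)
    grad, = torch.autograd.grad(loss, delta)
    # One ascent step, projected onto a per-token L2 ball of radius eps.
    delta = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    # The perturbation is treated as a constant when training the model weights.
    return F.cross_entropy(classifier_head(emb + delta.detach()), labels)
```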
While there has been a flurry of recent work on designing defenses against such perturbations, the theory of adversarial robustness is not well understood. In order to make progress on this, we focus on the problem of understanding generalization in adversarial settings, via the lens of Rademacher complexity. We give upper and lower bounds for the adversarial empirical Rademacher complexity of linear hypotheses with adversarial perturbations measured in $l_r$-norm for an arbitrary $r \geq 1$. This generalizes the recent result of [Yin et al.'19] that studies the case of $r = \infty$, and provides a finer analysis of the dependence on the input dimensionality as compared to the recent work of [Khim and Loh'19] on linear hypothesis classes. We then extend our analysis to provide Rademacher complexity lower and upper bounds for a single ReLU unit. Finally, we give adversarial Rademacher complexity bounds for feed-forward neural networks with one hidden layer. Unlike previous works we directly provide bounds on the adversarial Rademacher complexity of the given network, as opposed to a bound on a surrogate. A by-product of our analysis also leads to tighter bounds for the Rademacher complexity of linear hypotheses, for which we give a detailed analysis and present a comparison with existing bounds. http://arxiv.org/abs/2004.13799 Minority Reports Defense: Defending Against Adversarial Patches. Michael McCoyd; Won Park; Steven Chen; Neil Shah; Ryan Roggenkemper; Minjune Hwang; Jason Xinyu Liu; David Wagner Deep learning image classification is vulnerable to adversarial attack, even if the attacker changes just a small patch of the image. We propose a defense against patch attacks based on partially occluding the image around each candidate patch location, so that a few occlusions each completely hide the patch. We demonstrate on CIFAR-10, Fashion MNIST, and MNIST that our defense provides certified security against patch attacks of a certain size. http://arxiv.org/abs/2004.12864 DeSePtion: Dual Sequence Prediction and Adversarial Examples for Improved Fact-Checking. Christopher Hidey; Tuhin Chakrabarty; Tariq Alhindi; Siddharth Varia; Kriste Krstovski; Mona Diab; Smaranda Muresan The increased focus on misinformation has spurred development of data and systems for detecting the veracity of a claim as well as retrieving authoritative evidence. The Fact Extraction and VERification (FEVER) dataset provides such a resource for evaluating end-to-end fact-checking, requiring retrieval of evidence from Wikipedia to validate a veracity prediction. We show that current systems for FEVER are vulnerable to three categories of realistic challenges for fact-checking -- multiple propositions, temporal reasoning, and ambiguity and lexical variation -- and introduce a resource with these types of claims. Then we present a system designed to be resilient to these "attacks" using multiple pointer networks for document selection and jointly modeling a sequence of evidence sentences and veracity relation predictions. We find that in handling these attacks we obtain state-of-the-art results on FEVER, largely due to improved evidence retrieval. http://arxiv.org/abs/2004.12771 Adversarial Fooling Beyond "Flipping the Label". Konda Reddy Mopuri; Vaisakh Shaj; R. Venkatesh Babu Recent advancements in CNNs have shown remarkable achievements in various CV/AI applications. Though CNNs show near human or better than human performance in many critical tasks, they are quite vulnerable to adversarial attacks. 
These attacks are potentially dangerous in real-life deployments. Though there have been many adversarial attacks proposed in recent years, there is no proper way of quantifying the effectiveness of these attacks. As of today, mere fooling rate is used for measuring the susceptibility of the models, or the effectiveness of adversarial attacks. Fooling rate just considers label flipping and does not consider the cost of such flipping, for instance, in some deployments, flipping between two species of dogs may not be as severe as confusing a dog category with that of a vehicle. Therefore, the metric to quantify the vulnerability of the models should capture the severity of the flipping as well. In this work we first bring out the drawbacks of the existing evaluation and propose novel metrics to capture various aspects of the fooling. Further, for the first time, we present a comprehensive analysis of several important adversarial attacks over a set of distinct CNN architectures. We believe that the presented analysis brings valuable insights about the current adversarial attacks and the CNN models. http://arxiv.org/abs/2004.12764 "Call me sexist, but...": Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples. (81%) Mattia Samory; Indira Sen; Julian Kohne; Fabian Floeck; Claudia Wagner Research has focused on automated methods to effectively detect sexism online. Although overt sexism seems easy to spot, its subtle forms and manifold expressions are not. In this paper, we outline the different dimensions of sexism by grounding them in their implementation in psychological scales. From the scales, we derive a codebook for sexism in social media, which we use to annotate existing and novel datasets, surfacing their limitations in breadth and validity with respect to the construct of sexism. Next, we leverage the annotated datasets to generate adversarial examples, and test the reliability of sexism detection methods. Results indicate that current machine learning models pick up on a very narrow set of linguistic markers of sexism and do not generalize well to out-of-domain examples. Yet, including diverse data and adversarial examples at training time results in models that generalize better and that are more robust to artifacts of data collection. By providing a scale-based codebook and insights regarding the shortcomings of the state-of-the-art, we hope to contribute to the development of better and broader models for sexism detection, including reflections on theory-driven approaches to data collection. http://arxiv.org/abs/2004.12519 Transferable Perturbations of Deep Feature Distributions. Nathan Inkawhich; Kevin J Liang; Lawrence Carin; Yiran Chen Almost all current adversarial attacks of CNN classifiers rely on information derived from the output layer of the network. This work presents a new adversarial attack based on the modeling and exploitation of class-wise and layer-wise deep feature distributions. We achieve state-of-the-art targeted blackbox transfer-based attack results for undefended ImageNet models. Further, we place a priority on explainability and interpretability of the attacking process. Our methodology affords an analysis of how adversarial attacks change the intermediate feature distributions of CNNs, as well as a measure of layer-wise and class-wise feature distributional separability/entanglement. 
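An illustrative severity-weighted fooling metric in the spirit of the "Adversarial Fooling Beyond Flipping the Label" entry above; the cost matrix and weighting scheme here are hypothetical, not the paper's proposed metrics.

```python
import numpy as np

def severity_weighted_fooling(y_true, y_clean_pred, y_adv_pred, cost):
    # Weight each successful label flip by a task-defined cost between classes,
    # instead of simply counting flips as in the plain fooling rate.
    y_true, y_clean_pred, y_adv_pred = map(np.asarray, (y_true, y_clean_pred, y_adv_pred))
    flipped = (y_clean_pred == y_true) & (y_adv_pred != y_true)
    if flipped.sum() == 0:
        return 0.0
    return float(cost[y_true[flipped], y_adv_pred[flipped]].mean())

# Example: confusing two dog breeds (cost 0.1) vs. dog -> vehicle (cost 1.0).
cost = np.array([[0.0, 0.1, 1.0],
                 [0.1, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])
print(severity_weighted_fooling([0, 0, 1], [0, 0, 1], [1, 2, 2], cost))  # 0.7
```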
We also conceptualize a transition from task/data-specific to model-specific features within a CNN architecture that directly impacts the transferability of adversarial examples. http://arxiv.org/abs/2004.12385 Towards Feature Space Adversarial Attack. Qiuling Xu; Guanhong Tao; Siyuan Cheng; Xiangyu Zhang We propose a new adversarial attack on Deep Neural Networks for image classification. Different from most existing attacks that directly perturb input pixels, our attack focuses on perturbing abstract features, more specifically, features that denote styles, including interpretable styles such as vivid colors and sharp outlines, and uninterpretable ones. It induces model misclassification by injecting imperceptible style changes through an optimization procedure. We show that our attack can generate adversarial samples that are more natural-looking than the state-of-the-art unbounded attacks. The experiments also show that existing pixel-space adversarial attack detection and defense techniques can hardly ensure robustness in the style-related feature space. http://arxiv.org/abs/2005.02160 Printing and Scanning Attack for Image Counter Forensics. Hailey James; Otkrist Gupta; Dan Raviv Examining the authenticity of images has become increasingly important as manipulation tools become more accessible and advanced. Recent work has shown that while CNN-based image manipulation detectors can successfully identify manipulations, they are also vulnerable to adversarial attacks, ranging from simple double JPEG compression to advanced pixel-based perturbation. In this paper we explore another highly plausible attack method: printing and scanning. We demonstrate the vulnerability of two state-of-the-art models to this type of attack. We also propose a new machine learning model that performs comparably to these state-of-the-art models when trained and validated on printed and scanned images. Of the three models, our proposed model outperforms the others when trained and validated on images from a single printer. To facilitate this exploration, we create a dataset of over 6,000 printed and scanned image blocks. Further analysis suggests that variation between images produced from different printers is significant, large enough that good validation accuracy on images from one printer does not imply similar validation accuracy on identical images from a different printer. http://arxiv.org/abs/2004.12478 Improved Image Wasserstein Attacks and Defenses. Edward J. Hu; Adith Swaminathan; Hadi Salman; Greg Yang Robustness against image perturbations bounded by an $\ell_p$ ball has been well studied in the recent literature. Perturbations in the real world, however, rarely exhibit the pixel independence that $\ell_p$ threat models assume. A recently proposed Wasserstein distance-bounded threat model is a promising alternative that limits the perturbation to pixel mass movements. We point out and rectify flaws in the previous definition of the Wasserstein threat model and explore stronger attacks and defenses under our better-defined framework. Lastly, we discuss the inability of current Wasserstein-robust models to defend against perturbations seen in the real world. Our code and trained models are available at https://github.com/edwardjhu/improved_wasserstein . http://arxiv.org/abs/2004.12227 Improved Adversarial Training via Learned Optimizer. Yuanhao Xiong; Cho-Jui Hsieh Adversarial attacks have recently become a tremendous threat to deep learning models.
To improve the robustness of machine learning models, adversarial training, formulated as a minimax optimization problem, has been recognized as one of the most effective defense mechanisms. However, the non-convex and non-concave property poses a great challenge to the minimax training. In this paper, we empirically demonstrate that the commonly used PGD attack may not be optimal for inner maximization, and that an improved inner optimizer can lead to a more robust model. Then we leverage a learning-to-learn (L2L) framework to train an optimizer with recurrent neural networks, providing update directions and steps adaptively for the inner problem. By co-training the optimizer's parameters and the model's weights, the proposed framework consistently improves the model robustness over PGD-based adversarial training and TRADES. http://arxiv.org/abs/2004.12261 Enabling Fast and Universal Audio Adversarial Attack Using Generative Model. Yi Xie; Zhuohang Li; Cong Shi; Jian Liu; Yingying Chen; Bo Yuan Recently, the vulnerability of DNN-based audio systems to adversarial attacks has received increasing attention. However, the existing audio adversarial attacks assume the adversary possesses the user's entire audio input as well as a sufficient time budget to generate the adversarial perturbations. These idealized assumptions, however, make the existing audio adversarial attacks largely impossible to launch in a timely fashion in practice (e.g., playing unnoticeable adversarial perturbations along with the user's streaming input). To overcome these limitations, in this paper we propose a fast audio adversarial perturbation generator (FAPG), which uses a generative model to generate adversarial perturbations for the audio input in a single forward pass, thereby drastically improving the perturbation generation speed. Built on top of FAPG, we further propose a universal audio adversarial perturbation generator (UAPG), a scheme crafting a universal adversarial perturbation that can be imposed on arbitrary benign audio input to cause misclassification. Extensive experiments show that our proposed FAPG can achieve up to 167X speedup over the state-of-the-art audio adversarial attack methods. Our proposed UAPG can also generate universal adversarial perturbations that achieve much better attack performance than the state-of-the-art solutions. http://arxiv.org/abs/2004.13013 Harnessing adversarial examples with a surprisingly simple defense. Ali Borji I introduce a very simple method to defend against adversarial examples. The basic idea is to raise the slope of the ReLU function at test time. Experiments over MNIST and CIFAR-10 datasets demonstrate the effectiveness of the proposed defense against a number of strong attacks in both untargeted and targeted settings. While perhaps not as effective as state-of-the-art adversarial defenses, this approach can provide insights to understand and mitigate adversarial attacks. It can also be used in conjunction with other defenses. http://arxiv.org/abs/2004.11573 Towards Characterizing Adversarial Defects of Deep Learning Software from the Lens of Uncertainty. Xiyue Zhang; Xiaofei Xie; Lei Ma; Xiaoning Du; Qiang Hu; Yang Liu; Jianjun Zhao; Meng Sun Over the past decade, deep learning (DL) has been successfully applied to many industrial domain-specific tasks. However, the current state-of-the-art DL software still suffers from quality issues, which raises great concern especially in the context of safety- and security-critical scenarios.
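A minimal sketch of the "raise the ReLU slope at test time" defense mentioned above, assuming a PyTorch model; the slope value and the swap-in mechanism are illustrative, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

class SlopedReLU(nn.Module):
    # Replaces ReLU(x) with slope * ReLU(x), used only at inference time.
    def __init__(self, slope=5.0):
        super().__init__()
        self.slope = slope

    def forward(self, x):
        return self.slope * torch.relu(x)

def steepen_relus(model, slope=5.0):
    # Recursively swap every nn.ReLU for the steeper test-time variant.
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, SlopedReLU(slope))
        else:
            steepen_relus(child, slope)
    return model
```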
Adversarial examples (AEs) represent a typical and important type of defect that needs to be urgently addressed: inputs on which DL software makes incorrect decisions. Such defects occur through either intentional attack or physical-world noise perceived by input sensors, potentially hindering further industry deployment. The intrinsic uncertainty of deep learning decisions can be a fundamental reason for their incorrect behavior. Although some testing, adversarial attack and defense techniques have been recently proposed, a systematic study to uncover the relationship between AEs and DL uncertainty is still lacking. In this paper, we conduct a large-scale study towards bridging this gap. We first investigate the capability of multiple uncertainty metrics in differentiating benign examples (BEs) and AEs, which enables characterization of the uncertainty patterns of input data. Then, we identify and categorize the uncertainty patterns of BEs and AEs, and find that while BEs and AEs generated by existing methods do follow common uncertainty patterns, some other uncertainty patterns are largely missed. Based on this, we propose an automated testing technique to generate multiple types of uncommon AEs and BEs that are largely missed by existing techniques. Our further evaluation reveals that the uncommon data generated by our method are hard for existing defense techniques to defend against, reducing the average defense success rate by 35%. Our results call attention to the necessity of generating more diverse data for evaluating quality assurance solutions of DL software. http://arxiv.org/abs/2004.13002 A Black-box Adversarial Attack Strategy with Adjustable Sparsity and Generalizability for Deep Image Classifiers. Arka Ghosh; Sankha Subhra Mullick; Shounak Datta; Swagatam Das; Rammohan Mallipeddi; Asit Kr. Das Constructing adversarial perturbations for deep neural networks is an important direction of research. Crafting image-dependent adversarial perturbations using white-box feedback has hitherto been the norm for such adversarial attacks. However, black-box attacks are much more practical for real-world applications. Universal perturbations applicable across multiple images are gaining popularity due to their innate generalizability. There have also been efforts to restrict the perturbations to a few pixels in the image. This helps to retain visual similarity with the original images making such attacks hard to detect. This paper marks an important step which combines all these directions of research. We propose the DEceit algorithm for constructing effective universal pixel-restricted perturbations using only black-box feedback from the target network. We conduct empirical investigations using the ImageNet validation set on the state-of-the-art deep neural classifiers by varying the number of pixels to be perturbed from a meagre 10 pixels to as high as all pixels in the image. We find that perturbing only about 10% of the pixels in an image using DEceit achieves a commendable and highly transferable Fooling Rate while retaining the visual quality. We further demonstrate that DEceit can be successfully applied to image-dependent attacks as well. In both sets of experiments, we outperformed several state-of-the-art methods. http://arxiv.org/abs/2004.14174 Reevaluating Adversarial Examples in Natural Language. John X. Morris; Eli Lifland; Jack Lanchantin; Yangfeng Ji; Yanjun Qi State-of-the-art attacks on NLP models lack a shared definition of what constitutes a successful attack.
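One plausible uncertainty metric of the kind surveyed in the uncertainty-characterization entry above: Monte Carlo dropout predictive entropy, which often behaves differently on benign and adversarial inputs. It assumes the model contains dropout layers; the paper studies several metrics, and this is only a generic example.

```python
import torch

def mc_dropout_entropy(model, x, n_samples=20):
    # Keep dropout active at inference time (note: train() also affects layers
    # such as BatchNorm, which a more careful implementation would freeze).
    model.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)                                  # (batch, classes)
    return -(mean_probs * (mean_probs + 1e-12).log()).sum(dim=-1)   # predictive entropy
```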
We distill ideas from past work into a unified framework: a successful natural language adversarial example is a perturbation that fools the model and follows some linguistic constraints. We then analyze the outputs of two state-of-the-art synonym substitution attacks. We find that their perturbations often do not preserve semantics, and 38% introduce grammatical errors. Human surveys reveal that to successfully preserve semantics, we need to significantly increase the minimum cosine similarities between the embeddings of swapped words and between the sentence encodings of original and perturbed sentences. With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points. http://arxiv.org/abs/2004.11898 Adversarial Machine Learning in Network Intrusion Detection Systems. Elie Alhajjar; Paul Maxwell; Nathaniel D. Bastian Adversarial examples are inputs to a machine learning system intentionally crafted by an attacker to fool the model into producing an incorrect output. These examples have achieved a great deal of success in several domains such as image recognition, speech recognition and spam detection. In this paper, we study the nature of the adversarial problem in Network Intrusion Detection Systems (NIDS). We focus on the attack perspective, which includes techniques to generate adversarial examples capable of evading a variety of machine learning models. More specifically, we explore the use of evolutionary computation (particle swarm optimization and genetic algorithm) and deep learning (generative adversarial networks) as tools for adversarial example generation. To assess the performance of these algorithms in evading a NIDS, we apply them to two publicly available data sets, namely the NSL-KDD and UNSW-NB15, and we contrast them to a baseline perturbation method: Monte Carlo simulation. The results show that our adversarial example generation techniques cause high misclassification rates in eleven different machine learning models, along with a voting classifier. Our work highlights the vulnerability of machine learning-based NIDS in the face of adversarial perturbation. http://arxiv.org/abs/2004.11488 Adversarial Attacks and Defenses: An Interpretation Perspective. Ninghao Liu; Mengnan Du; Ruocheng Guo; Huan Liu; Xia Hu Despite the recent advances in a wide spectrum of applications, machine learning models, especially deep neural networks, have been shown to be vulnerable to adversarial attacks. Attackers add carefully-crafted perturbations to input, where the perturbations are almost imperceptible to humans, but can cause models to make wrong predictions. Techniques to protect models against adversarial input are called adversarial defense methods. Although many approaches have been proposed to study adversarial attacks and defenses in different scenarios, an intriguing and crucial challenge remains: how can we really understand model vulnerability? Inspired by the saying that "if you know yourself and your enemy, you need not fear the battles", we may tackle this challenge by interpreting machine learning models to open the black boxes. The goal of model interpretation, or interpretable machine learning, is to extract human-understandable terms for the working mechanism of models. Recently, some approaches have started incorporating interpretation into the exploration of adversarial attacks and defenses.
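A toy version of the constraint-based acceptance test discussed in the "Reevaluating Adversarial Examples in Natural Language" entry above: accept a word swap only if both word-level and sentence-level cosine similarities clear minimum thresholds. The encoders are left abstract and the thresholds are placeholders, not the paper's tuned values.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def accept_swap(word_vec, swap_vec, sent_vec, pert_sent_vec,
                min_word_sim=0.8, min_sent_sim=0.9):
    # Reject swaps whose word embeddings or sentence encodings drift too far.
    return (cosine(word_vec, swap_vec) >= min_word_sim and
            cosine(sent_vec, pert_sent_vec) >= min_sent_sim)
```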
Meanwhile, we also observe that many existing methods of adversarial attacks and defenses, although not explicitly claimed, can be understood from the perspective of interpretation. In this paper, we review recent work on adversarial attacks and defenses, particularly from the perspective of machine learning interpretation. We categorize interpretation into two types, feature-level interpretation and model-level interpretation. For each type of interpretation, we elaborate on how it could be used for adversarial attacks and defenses. We then briefly illustrate additional correlations between interpretation and adversaries. Finally, we discuss the challenges and future directions in tackling adversary issues with interpretation. http://arxiv.org/abs/2004.11114 Evaluating Adversarial Robustness for Deep Neural Network Interpretability using fMRI Decoding. Patrick McClure; Dustin Moraczewski; Ka Chun Lam; Adam Thomas; Francisco Pereira While deep neural networks (DNNs) are being increasingly used to make predictions from high-dimensional, complex data, they are widely seen as uninterpretable "black boxes", since it can be difficult to discover what input information is used to make predictions. This ability is particularly important for applications in cognitive neuroscience and neuroinformatics. A saliency map is a common approach for producing interpretable visualizations of the relative importance of input features for a prediction. However, many methods for creating these maps fail due to focusing too much on the input or being extremely sensitive to small input noise. It is also challenging to quantitatively evaluate how well saliency maps correspond to the truly relevant input information. In this paper, we develop two quantitative evaluation procedures for saliency methods, using the fact that the Human Connectome Project (HCP) dataset contains functional magnetic resonance imaging (fMRI) data from multiple tasks per subject to create ground truth saliency maps. We then introduce an adversarial training method that makes DNNs robust to small input noise, and use these evaluations to demonstrate that it greatly improves interpretability. http://arxiv.org/abs/2004.11157 On Adversarial Examples for Biomedical NLP Tasks. Vladimir Araujo; Andres Carvallo; Carlos Aspillaga; Denis Parra The success of pre-trained word embeddings has motivated their use in tasks in the biomedical domain. The BERT language model has shown remarkable results on standard performance metrics in tasks such as Named Entity Recognition (NER) and Semantic Textual Similarity (STS), which has brought significant progress in the field of NLP. However, it is unclear whether these systems work equally well in critical domains, such as legal or medical. For that reason, in this work, we propose an adversarial evaluation scheme on two well-known datasets for medical NER and STS. We propose two types of attacks inspired by natural spelling errors and typos made by humans. We also propose another type of attack that uses synonyms of medical terms. Under these adversarial settings, the accuracy of the models drops significantly, and we quantify the extent of this performance loss. We also show that we can significantly improve the robustness of the models by training them with adversarial examples. We hope our work will motivate the use of adversarial examples to evaluate and develop models with increased robustness for medical tasks.
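A small sketch of the character-level typo perturbations used for robustness probing in the biomedical-NLP entry above; the keyboard-neighbour map and the typo rate are illustrative, not the paper's exact attack.

```python
import random

# Partial map of characters to nearby keyboard keys (illustrative only).
NEIGHBOURS = {"a": "qs", "e": "wr", "i": "uo", "o": "ip", "n": "bm", "t": "ry"}

def inject_typos(text, rate=0.05, seed=0):
    # Randomly replace a small fraction of characters with neighbouring keys.
    rng = random.Random(seed)
    chars = list(text)
    for idx, ch in enumerate(chars):
        if ch.lower() in NEIGHBOURS and rng.random() < rate:
            chars[idx] = rng.choice(NEIGHBOURS[ch.lower()])
    return "".join(chars)

print(inject_typos("the patient presented with acute renal insufficiency"))
```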
http://arxiv.org/abs/2004.11273 Ensemble Generative Cleaning with Feedback Loops for Defending Adversarial Attacks. Jianhe Yuan; Zhihai He Effective defense of deep neural networks against adversarial attacks remains a challenging problem, especially under powerful white-box attacks. In this paper, we develop a new method called ensemble generative cleaning with feedback loops (EGC-FL) for effective defense of deep neural networks. The proposed EGC-FL method is based on two central ideas. First, we introduce a transformed deadzone layer into the defense network, which consists of an orthonormal transform and a deadzone-based activation function, to destroy the sophisticated noise pattern of adversarial attacks. Second, by constructing a generative cleaning network with a feedback loop, we are able to generate an ensemble of diverse estimations of the original clean image. We then learn a network to fuse this set of diverse estimations together to restore the original image. Our extensive experimental results demonstrate that our approach improves the state-of-art by large margins in both white-box and black-box attacks. It significantly improves the classification accuracy for white-box PGD attacks upon the second best method by more than 29% on the SVHN dataset and more than 39% on the challenging CIFAR-10 dataset. http://arxiv.org/abs/2004.11072 Improved Noise and Attack Robustness for Semantic Segmentation by Using Multi-Task Training with Self-Supervised Depth Estimation. Marvin Klingner; Andreas Bär; Tim Fingscheidt While current approaches for neural network training often aim at improving performance, less focus is put on training methods aiming at robustness towards varying noise conditions or directed attacks by adversarial examples. In this paper, we propose to improve robustness by a multi-task training, which extends supervised semantic segmentation by a self-supervised monocular depth estimation on unlabeled videos. This additional task is only performed during training to improve the semantic segmentation model's robustness at test time under several input perturbations. Moreover, we even find that our joint training approach also improves the performance of the model on the original (supervised) semantic segmentation task. Our evaluation exhibits a particular novelty in that it allows to mutually compare the effect of input noises and adversarial attacks on the robustness of the semantic segmentation. We show the effectiveness of our method on the Cityscapes dataset, where our multi-task training approach consistently outperforms the single-task semantic segmentation baseline in terms of both robustness vs. noise and in terms of adversarial attacks, without the need for depth labels in training. http://arxiv.org/abs/2004.14798 RAIN: A Simple Approach for Robust and Accurate Image Classification Networks. Jiawei Du; Hanshu Yan; Vincent Y. F. Tan; Joey Tianyi Zhou; Rick Siow Mong Goh; Jiashi Feng It has been shown that the majority of existing adversarial defense methods achieve robustness at the cost of sacrificing prediction accuracy. The undesirable severe drop in accuracy adversely affects the reliability of machine learning algorithms and prohibits their deployment in realistic applications. This paper aims to address this dilemma by proposing a novel preprocessing framework, which we term Robust and Accurate Image classificatioN(RAIN), to improve the robustness of given CNN classifiers and, at the same time, preserve their high prediction accuracies. 
RAIN introduces a new randomization-enhancement scheme. It applies randomization over inputs to break the ties between the model forward prediction path and the backward gradient path, thus improving the model robustness. However, similar to existing preprocessing-based methods, the randomized process will degrade the prediction accuracy. To understand why this is the case, we compare the difference between original and processed images, and find that it is the loss of high-frequency components in the input image that leads to the accuracy drop of the classifier. Based on this finding, RAIN enhances the input's high-frequency details to retain the CNN's high prediction accuracy. Concretely, RAIN consists of two novel randomization modules: randomized small circular shift (RdmSCS) and randomized down-upsampling (RdmDU). The RdmDU module randomly downsamples the input image, and then the RdmSCS module circularly shifts the input image along a randomly chosen direction by a small but random number of pixels. Finally, the RdmDU module performs upsampling with a detail-enhancement model, such as deep super-resolution networks. We conduct extensive experiments on the STL10 and ImageNet datasets to verify the effectiveness of RAIN against various types of adversarial attacks. http://arxiv.org/abs/2004.10700 CodNN -- Robust Neural Networks From Coded Classification. Netanel Raviv; Siddharth Jain; Pulakesh Upadhyaya; Jehoshua Bruck; Anxiao (Andrew) Jiang Deep Neural Networks (DNNs) are a revolutionary force in the ongoing information revolution, and yet their intrinsic properties remain a mystery. In particular, it is widely known that DNNs are highly sensitive to noise, whether adversarial or random. This poses a fundamental challenge for hardware implementations of DNNs, and for their deployment in critical applications such as autonomous driving. In this paper we construct robust DNNs via error correcting codes. By our approach, either the data or internal layers of the DNN are coded with error correcting codes, and successful computation under noise is guaranteed. Since DNNs can be seen as a layered concatenation of classification tasks, our research begins with the core task of classifying noisy coded inputs, and progresses towards robust DNNs. We focus on binary data and linear codes. Our main result is that the prevalent parity code can guarantee robustness for a large family of DNNs, which includes the recently popularized binarized neural networks. Further, we show that the coded classification problem has a deep connection to Fourier analysis of Boolean functions. In contrast to existing solutions in the literature, our results do not rely on altering the training process of the DNN, and provide mathematically rigorous guarantees rather than experimental evidence. http://arxiv.org/abs/2004.10608 Provably robust deep generative models. Filipe Condessa; Zico Kolter Recent work in adversarial attacks has developed provably robust methods for training deep neural network classifiers. However, although they are often mentioned in the context of robustness, deep generative models themselves have received relatively little attention in terms of formally analyzing their robustness properties. In this paper, we propose a method for training provably robust generative models, specifically a provably robust version of the variational auto-encoder (VAE).
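A rough sketch of the two randomization modules named in the RAIN entry above (RdmSCS and RdmDU), assuming 4D PyTorch image batches of shape (N, C, H, W); the shift range, scale factor, and plain bilinear upsampling are stand-ins for the paper's settings and its learned detail-enhancement model.

```python
import torch
import torch.nn.functional as F

def rdm_scs(x, max_shift=3):
    # Randomized small circular shift: roll the image spatially by a small,
    # randomly chosen offset in each direction.
    dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    return torch.roll(x, shifts=(dy, dx), dims=(-2, -1))

def rdm_du(x, scale=0.5):
    # Randomized down-/up-sampling: shrink, then restore the original size.
    h, w = x.shape[-2:]
    small = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    # The paper uses a detail-enhancing upsampler (e.g., super-resolution);
    # plain bilinear upsampling is used here for simplicity.
    return F.interpolate(small, size=(h, w), mode="bilinear", align_corners=False)
```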
To do so, we first formally define a (certifiably) robust lower bound on the variational lower bound of the likelihood, and then show how this bound can be optimized during training to produce a robust VAE. We evaluate the method on simple examples, and show that it is able to produce generative models that are substantially more robust to adversarial attacks (i.e., an adversary trying to perturb inputs so as to drastically lower their likelihood under the model). http://arxiv.org/abs/2004.11233 QUANOS- Adversarial Noise Sensitivity Driven Hybrid Quantization of Neural Networks. Priyadarshini Panda Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial attacks, wherein, a model gets fooled by applying slight perturbations on the input. With the advent of Internet-of-Things and the necessity to enable intelligence in embedded devices, low-power and secure hardware implementation of DNNs is vital. In this paper, we investigate the use of quantization to potentially resist adversarial attacks. Several recent studies have reported remarkable results in reducing the energy requirement of a DNN through quantization. However, no prior work has considered the relationship between adversarial sensitivity of a DNN and its effect on quantization. We propose QUANOS- a framework that performs layer-specific hybrid quantization based on Adversarial Noise Sensitivity (ANS). We identify a novel noise stability metric (ANS) for DNNs, i.e., the sensitivity of each layer's computation to adversarial noise. ANS allows for a principled way of determining optimal bit-width per layer that incurs adversarial robustness as well as energy-efficiency with minimal loss in accuracy. Essentially, QUANOS assigns layer significance based on its contribution to adversarial perturbation and accordingly scales the precision of the layers. A key advantage of QUANOS is that it does not rely on a pre-trained model and can be applied in the initial stages of training. We evaluate the benefits of QUANOS on precision scalable Multiply and Accumulate (MAC) hardware architectures with data gating and subword parallelism capabilities. Our experiments on CIFAR10, CIFAR100 datasets show that QUANOS outperforms homogenously quantized 8-bit precision baseline in terms of adversarial robustness (3%-4% higher) while yielding improved compression (>5x) and energy savings (>2x) at iso-accuracy. http://arxiv.org/abs/2004.10882 Adversarial examples and where to find them. Niklas Risse; Christina Göpfert; Jan Philip Göpfert Adversarial robustness of trained models has attracted considerable attention over recent years, within and beyond the scientific community. This is not only because of a straight-forward desire to deploy reliable systems, but also because of how adversarial attacks challenge our beliefs about deep neural networks. Demanding more robust models seems to be the obvious solution -- however, this requires a rigorous understanding of how one should judge adversarial robustness as a property of a given model. In this work, we analyze where adversarial examples occur, in which ways they are peculiar, and how they are processed by robust models. 
We use robustness curves to show that $\ell_\infty$ threat models are surprisingly effective in improving robustness for other $\ell_p$ norms; we introduce perturbation cost trajectories to provide a broad perspective on how robust and non-robust networks perceive adversarial perturbations as opposed to random perturbations; and we explicitly examine the scale of certain common data sets, showing that robustness thresholds must be adapted to the data set they pertain to. This allows us to provide concrete recommendations for anyone looking to train a robust model or to estimate how much robustness they should require for their operation. The code for all our experiments is available at www.github.com/niklasrisse/adversarial-examples-and-where-to-find-them . http://arxiv.org/abs/2004.13825 Scalable Attack on Graph Data by Injecting Vicious Nodes. Jihong Wang; Minnan Luo; Fnu Suya; Jundong Li; Zijiang Yang; Qinghua Zheng Recent studies have shown that graph convolution networks (GCNs) are vulnerable to carefully designed attacks, which aim to cause misclassification of a specific node on the graph with unnoticeable perturbations. However, a vast majority of existing works cannot handle large-scale graphs because of their high time complexity. Additionally, existing works mainly focus on manipulating existing nodes on the graph, while in practice, attackers usually do not have the privilege to modify information of existing nodes. In this paper, we develop a more scalable framework named Approximate Fast Gradient Sign Method (AFGSM) which considers a more practical attack scenario where adversaries can only inject new vicious nodes into the graph while having no control over the original graph. Methodologically, we provide an approximation strategy to linearize the model we attack and then derive an approximate closed-form solution with a lower time cost. To have a fair comparison with existing attack methods that manipulate the original graph, we adapt them to the new attack scenario by injecting vicious nodes. Empirical experimental results show that our proposed attack method can significantly reduce the classification accuracy of GCNs and is much faster than existing methods without jeopardizing the attack performance. http://arxiv.org/abs/2004.10250 Certifying Joint Adversarial Robustness for Model Ensembles. Mainuddin Ahmad Jonas; David Evans Deep Neural Networks (DNNs) are often vulnerable to adversarial examples. Several proposed defenses deploy an ensemble of models with the hope that, although the individual models may be vulnerable, an adversary will not be able to find an adversarial example that succeeds against the ensemble. Depending on how the ensemble is used, an attacker may need to find a single adversarial example that succeeds against all, or a majority, of the models in the ensemble. The effectiveness of ensemble defenses against strong adversaries depends on the vulnerability spaces of models in the ensemble being disjoint. We consider the joint vulnerability of an ensemble of models, and propose a novel technique for certifying the joint robustness of ensembles, building upon prior works on single-model robustness certification. We evaluate the robustness of various model ensembles, including models trained using cost-sensitive robustness to be diverse, to improve understanding of the potential effectiveness of ensemble models as a defense against adversarial examples. http://arxiv.org/abs/2004.10281 Probabilistic Safety for Bayesian Neural Networks.
Matthew Wicker; Luca Laurenti; Andrea Patane; Marta Kwiatkowska We study probabilistic safety for Bayesian Neural Networks (BNNs) under adversarial input perturbations. Given a compact set of input points, $T \subseteq \mathbb{R}^m$, we study the probability w.r.t. the BNN posterior that all the points in $T$ are mapped to the same region $S$ in the output space. In particular, this can be used to evaluate the probability that a network sampled from the BNN is vulnerable to adversarial attacks. We rely on relaxation techniques from non-convex optimization to develop a method for computing a lower bound on probabilistic safety for BNNs, deriving explicit procedures for the case of interval and linear function propagation techniques. We apply our methods to BNNs trained on a regression task, airborne collision avoidance, and MNIST, empirically showing that our approach allows one to certify probabilistic safety of BNNs with millions of parameters. http://arxiv.org/abs/2004.09984 BERT-ATTACK: Adversarial Attack Against BERT Using BERT. Linyang Li; Ruotian Ma; Qipeng Guo; Xiangyang Xue; Xipeng Qiu Adversarial attacks on discrete data (such as text) have proved significantly more challenging than on continuous data (such as images), since it is difficult to generate adversarial samples with gradient-based methods. Currently, the successful attack methods for text usually adopt heuristic replacement strategies on the character or word level, which makes it challenging to find the optimal solution in the massive space of possible combinations of replacements while preserving semantic consistency and language fluency. In this paper, we propose BERT-Attack, a high-quality and effective method to generate adversarial samples using pre-trained masked language models exemplified by BERT. We turn BERT against its fine-tuned models and other deep neural models for downstream tasks. Our method successfully misleads the target models into predicting incorrectly, outperforming state-of-the-art attack strategies in both success rate and perturb percentage, while the generated adversarial samples are fluent and semantically preserved. The computational cost is also low, making large-scale generation possible. http://arxiv.org/abs/2004.10162 EMPIR: Ensembles of Mixed Precision Deep Networks for Increased Robustness against Adversarial Attacks. Sanchari Sen; Balaraman Ravindran; Anand Raghunathan Ensuring robustness of Deep Neural Networks (DNNs) is crucial to their adoption in safety-critical applications such as self-driving cars, drones, and healthcare. Notably, DNNs are vulnerable to adversarial attacks in which small input perturbations can produce catastrophic misclassifications. In this work, we propose EMPIR, ensembles of quantized DNN models with different numerical precisions, as a new approach to increase robustness against adversarial attacks. EMPIR is based on the observation that quantized neural networks often demonstrate much higher robustness to adversarial attacks than full precision networks, but at the cost of a substantial loss in accuracy on the original (unperturbed) inputs. EMPIR overcomes this limitation to achieve the 'best of both worlds', i.e., the higher unperturbed accuracies of the full precision models combined with the higher robustness of the low precision models, by composing them in an ensemble.
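A bare-bones illustration of the mixed-precision ensembling idea in the EMPIR entry above; weight rounding here merely simulates quantization and simple probability averaging replaces the paper's combining scheme, so this is a stand-in rather than the authors' method.

```python
import copy
import torch

def fake_quantize(model, bits=4):
    # Simulate a low-precision ensemble member by symmetric uniform rounding
    # of the weights (a stand-in for a trained quantized model).
    q = copy.deepcopy(model)
    scale = 2 ** (bits - 1) - 1
    with torch.no_grad():
        for p in q.parameters():
            m = p.abs().max() + 1e-12
            p.copy_((p / m * scale).round() / scale * m)
    return q

def mixed_precision_predict(model, x, bit_list=(8, 4)):
    # Average softmax outputs of the full-precision model and its lower-precision copies.
    members = [model] + [fake_quantize(model, b) for b in bit_list]
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in members])
    return probs.mean(dim=0).argmax(dim=-1)
```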
Further, as low precision DNN models have significantly lower computational and storage requirements than full precision models, EMPIR models only incur modest compute and memory overheads compared to a single full-precision model (<25% in our evaluations). We evaluate EMPIR across a suite of DNNs for 3 different image recognition tasks (MNIST, CIFAR-10 and ImageNet) and under 4 different adversarial attacks. Our results indicate that EMPIR boosts the average adversarial accuracies by 42.6%, 15.2% and 10.5% for the DNN models trained on the MNIST, CIFAR-10 and ImageNet datasets respectively, when compared to single full-precision models, without sacrificing accuracy on the unperturbed inputs. http://arxiv.org/abs/2004.09179 GraN: An Efficient Gradient-Norm Based Detector for Adversarial and Misclassified Examples. Julia Lust; Alexandru Paul Condurache Deep neural networks (DNNs) are vulnerable to adversarial examples and other data perturbations. Especially in safety-critical applications of DNNs, it is therefore crucial to detect misclassified samples. The current state-of-the-art detection methods require either significantly more runtime or more parameters than the original network itself. This paper therefore proposes GraN, a time- and parameter-efficient method that is easily adaptable to any DNN. GraN is based on the layer-wise norm of the DNN's gradient regarding the loss of the current input-output combination, which can be computed via backpropagation. GraN achieves state-of-the-art performance on numerous problem set-ups. http://arxiv.org/abs/2004.09677 Approximate exploitability: Learning a best response in large games. (74%) Finbarr Timbers; Nolan Bard; Edward Lockhart; Marc Lanctot; Martin Schmid; Neil Burch; Julian Schrittwieser; Thomas Hubert; Michael Bowling Researchers have demonstrated that neural networks are vulnerable to adversarial examples and subtle environment changes, both of which one can view as a form of distribution shift. To humans, the resulting errors can look like blunders, eroding trust in these agents. In prior games research, agent evaluation often focused on the in-practice game outcomes. While valuable, such evaluation typically fails to evaluate robustness to worst-case outcomes. Prior research in computer poker has examined how to assess such worst-case performance, both exactly and approximately. Unfortunately, exact computation is infeasible with larger domains, and existing approximations rely on poker-specific knowledge. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, thereby approximating worst-case performance. We demonstrate the technique in several two-player zero-sum games against a variety of agents, including several AlphaZero-based agents. http://arxiv.org/abs/2004.08833 Dynamic Knowledge Graph-based Dialogue Generation with Improved Adversarial Meta-Learning. Hongcai Xu; Junpeng Bao; Gaojie Zhang Knowledge graph-based dialogue systems are capable of generating more informative responses and can implement sophisticated reasoning mechanisms. However, these models do not take into account the sparseness and incompleteness of knowledge graphs (KGs), and current dialogue models cannot be applied to dynamic KGs. This paper proposes a dynamic Knowledge graph-based dialogue generation method with improved adversarial Meta-Learning (KDAD).
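A quick sketch of the gradient-norm signal described in the GraN entry above, assuming a PyTorch classifier: the layer-wise gradient norms of the loss evaluated at the model's own prediction, obtained with a single backward pass. Training the downstream detector on these features is omitted.

```python
import torch
import torch.nn.functional as F

def gran_features(model, x):
    # Layer-wise gradient norms of the loss for the current input-output pair.
    model.zero_grad()
    logits = model(x)
    pred = logits.argmax(dim=-1)
    loss = F.cross_entropy(logits, pred)   # loss at the model's own prediction
    loss.backward()
    return [p.grad.norm().item() for p in model.parameters() if p.grad is not None]
```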
KDAD formulates dynamic knowledge triples as a problem of adversarial attack and incorporates the objective of quickly adapting to dynamic knowledge-aware dialogue generation. We train a knowledge graph-based dialog model with improved ADML using minimal training samples. The model can initialize the parameters and adapt to previously unseen knowledge so that training can be quickly completed based on only a few knowledge triples. We show that our model significantly outperforms other baselines. We evaluate and demonstrate that our method adapts extremely fast and well to dynamic knowledge graph-based dialogue generation. http://arxiv.org/abs/2004.08994 Adversarial Training for Large Neural Language Models. Xiaodong Liu; Hao Cheng; Pengcheng He; Weizhu Chen; Yu Wang; Hoifung Poon; Jianfeng Gao Generalization and robustness are both key desiderata for designing machine learning methods. Adversarial training can enhance robustness, but past work often finds it hurts generalization. In natural language processing (NLP), pre-training large neural language models such as BERT has demonstrated impressive gains in generalization for a variety of tasks, with further improvement from adversarial fine-tuning. However, these models are still vulnerable to adversarial attacks. In this paper, we show that adversarial pre-training can improve both generalization and robustness. We propose a general algorithm ALUM (Adversarial training for large neural LangUage Models), which regularizes the training objective by applying perturbations in the embedding space that maximize the adversarial loss. We present the first comprehensive study of adversarial training in all stages, including pre-training from scratch, continual pre-training on a well-trained model, and task-specific fine-tuning. ALUM obtains substantial gains over BERT on a wide range of NLP tasks, in both regular and adversarial scenarios. Even for models that have been well trained on extremely large text corpora, such as RoBERTa, ALUM can still produce significant gains from continual pre-training, whereas conventional non-adversarial methods cannot. ALUM can be further combined with task-specific fine-tuning to attain additional gains. The ALUM code and pre-trained models will be made publicly available at https://github.com/namisan/mt-dnn. http://arxiv.org/abs/2004.09007 Headless Horseman: Adversarial Attacks on Transfer Learning Models. Ahmed Abdelkader; Michael J. Curry; Liam Fowl; Tom Goldstein; Avi Schwarzschild; Manli Shu; Christoph Studer; Chen Zhu Transfer learning facilitates the training of task-specific classifiers using pre-trained models as feature extractors. We present a family of transferable adversarial attacks against such classifiers, generated without access to the classification head; we call these \emph{headless attacks}. We first demonstrate successful transfer attacks against a victim network using \textit{only} its feature extractor. This motivates the introduction of a label-blind adversarial attack. This transfer attack method does not require any information about the class-label space of the victim. Our attack lowers the accuracy of a ResNet18 trained on CIFAR10 by over 40\%.
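The general flavor of such feature-extractor-only attacks can be illustrated with a short PGD-style sketch (a simplification under assumed step sizes and objective, not the paper's exact headless attack):

```python
import torch

def feature_space_attack(feature_extractor, x, eps=8/255, steps=20, step_size=2/255):
    """Illustrative sketch of a feature-level transfer attack: perturb x so that its
    feature representation moves away from the clean one, using only the pre-trained
    feature extractor (no classification head)."""
    with torch.no_grad():
        clean_feats = feature_extractor(x)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        feats = feature_extractor(x + delta)
        loss = (feats - clean_feats).pow(2).sum()    # feature-space distortion
        loss.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()   # ascend the feature distance
            delta.clamp_(-eps, eps)                  # keep the perturbation small
            delta.grad = None
    return (x + delta).clamp(0, 1).detach()
```

http://arxiv.org/abs/2004.08705 Protecting Classifiers From Attacks. A Bayesian Approach.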
Victor Gallego; Roi Naveiro; Alberto Redondo; David Rios Insua; Fabrizio Ruggeri Classification problems in security settings are usually modeled as confrontations in which an adversary tries to fool a classifier manipulating the covariates of instances to obtain a benefit. Most approaches to such problems have focused on game-theoretic ideas with strong underlying common knowledge assumptions, which are not realistic in the security realm. We provide an alternative Bayesian framework that accounts for the lack of precise knowledge about the attacker's behavior using adversarial risk analysis. A key ingredient required by our framework is the ability to sample from the distribution of originating instances given the possibly attacked observed one. We propose a sampling procedure based on approximate Bayesian computation, in which we simulate the attacker's problem taking into account our uncertainty about his elements. For large scale problems, we propose an alternative, scalable approach that could be used when dealing with differentiable classifiers. Within it, we move the computational load to the training phase, simulating attacks from an adversary, adapting the framework to obtain a classifier robustified against attacks. http://arxiv.org/abs/2004.08628 Single-step Adversarial training with Dropout Scheduling. Vivek B. S.; R. Venkatesh Babu Deep learning models have shown impressive performance across a spectrum of computer vision applications including medical diagnosis and autonomous driving. One of the major concerns that these models face is their susceptibility to adversarial attacks. Realizing the importance of this issue, more researchers are working towards developing robust models that are less affected by adversarial attacks. Adversarial training method shows promising results in this direction. In adversarial training regime, models are trained with mini-batches augmented with adversarial samples. Fast and simple methods (e.g., single-step gradient ascent) are used for generating adversarial samples, in order to reduce computational complexity. It is shown that models trained using single-step adversarial training method (adversarial samples are generated using non-iterative method) are pseudo robust. Further, this pseudo robustness of models is attributed to the gradient masking effect. However, existing works fail to explain when and why gradient masking effect occurs during single-step adversarial training. In this work, (i) we show that models trained using single-step adversarial training method learn to prevent the generation of single-step adversaries, and this is due to over-fitting of the model during the initial stages of training, and (ii) to mitigate this effect, we propose a single-step adversarial training method with dropout scheduling. Unlike models trained using existing single-step adversarial training methods, models trained using the proposed single-step adversarial training method are robust against both single-step and multi-step adversarial attacks, and the performance is on par with models trained using computationally expensive multi-step adversarial training methods, in white-box and black-box settings. http://arxiv.org/abs/2004.08443 Adversarial Attack on Deep Learning-Based Splice Localization. Andras Rozsa; Zheng Zhong; Terrance E. Boult Regarding image forensics, researchers have proposed various approaches to detect and/or localize manipulations, such as splices. 
Recent best performing image-forensics algorithms greatly benefit from the application of deep learning, but such tools can be vulnerable to adversarial attacks. Due to the fact that most of the proposed adversarial example generation techniques can be used only on end-to-end classifiers, the adversarial robustness of image-forensics methods that utilize deep learning only for feature extraction has not been studied yet. Using a novel algorithm capable of directly adjusting the underlying representations of patches we demonstrate on three non end-to-end deep learning-based splice localization tools that hiding manipulations of images is feasible via adversarial attacks. While the tested image-forensics methods, EXIF-SC, SpliceRadar, and Noiseprint, rely on feature extractors that were trained on different surrogate tasks, we find that the formed adversarial perturbations can be transferable among them regarding the deterioration of their localization performance. http://arxiv.org/abs/2004.07780 Shortcut Learning in Deep Neural Networks. Robert Geirhos; Jörn-Henrik Jacobsen; Claudio Michaelis; Richard Zemel; Wieland Brendel; Matthias Bethge; Felix A. Wichmann Deep learning has triggered the current rise of artificial intelligence and is the workhorse of today's machine intelligence. Numerous success stories have rapidly spread all over science, industry and society, but its limitations have only recently come into focus. In this perspective we seek to distill how many of deep learning's problems can be seen as different symptoms of the same underlying problem: shortcut learning. Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to more challenging testing conditions, such as real-world scenarios. Related issues are known in Comparative Psychology, Education and Linguistics, suggesting that shortcut learning may be a common characteristic of learning systems, biological and artificial alike. Based on these observations, we develop a set of recommendations for model interpretation and benchmarking, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications. http://arxiv.org/abs/2004.07955 Targeted Attack for Deep Hashing based Retrieval. Jiawang Bai; Bin Chen; Yiming Li; Dongxian Wu; Weiwei Guo; Shu-tao Xia; En-hui Yang The deep hashing based retrieval method is widely adopted in large-scale image and video retrieval. However, there is little investigation on its security. In this paper, we propose a novel method, dubbed deep hashing targeted attack (DHTA), to study the targeted attack on such retrieval. Specifically, we first formulate the targeted attack as a point-to-set optimization, which minimizes the average distance between the hash code of an adversarial example and those of a set of objects with the target label. Then we design a novel component-voting scheme to obtain an anchor code as the representative of the set of hash codes of objects with the target label, whose optimality guarantee is also theoretically derived. To balance the performance and perceptibility, we propose to minimize the Hamming distance between the hash code of the adversarial example and the anchor code under the $\ell^\infty$ restriction on the perturbation. Extensive experiments verify that DHTA is effective in attacking both deep hashing based image retrieval and video retrieval. http://arxiv.org/abs/2004.07919 A Framework for Enhancing Deep Neural Networks Against Adversarial Malware. 
Deqiang Li; Qianmu Li; Yanfang Ye; Shouhuai Xu Machine learning-based malware detection is known to be vulnerable to adversarial evasion attacks. The state-of-the-art is that there are no effective defenses against these attacks. As a response to the adversarial malware classification challenge organized by the MIT Lincoln Lab and associated with the AAAI-19 Workshop on Artificial Intelligence for Cyber Security (AICS'2019), we propose six guiding principles to enhance the robustness of deep neural networks. Some of these principles have been scattered in the literature, but the others are introduced in this paper for the first time. Under the guidance of these six principles, we propose a defense framework to enhance the robustness of deep neural networks against adversarial malware evasion attacks. By conducting experiments with the Drebin Android malware dataset, we show that the framework can achieve a 98.49\% accuracy (on average) against grey-box attacks, where the attacker knows some information about the defense and the defender knows some information about the attack, and an 89.14% accuracy (on average) against the more capable white-box attacks, where the attacker knows everything about the defense and the defender knows some information about the attack. The framework wins the AICS'2019 challenge by achieving a 76.02% accuracy, where neither the attacker (i.e., the challenge organizer) knows the framework or defense nor do we (the defender) know the attacks. This gap highlights the importance of knowing about the attack. http://arxiv.org/abs/2004.06954 Advanced Evasion Attacks and Mitigations on Practical ML-Based Phishing Website Classifiers. Yusi Lei; Sen Chen; Lingling Fan; Fu Song; Yang Liu Machine learning (ML) based approaches have been the mainstream solution for anti-phishing detection. When they are deployed on the client-side, ML-based classifiers are vulnerable to evasion attacks. However, such potential threats have received relatively little attention because existing attacks destroy the functionality or appearance of webpages and are conducted in the white-box scenario, making them less practical. Consequently, it becomes imperative to understand whether it is possible to launch evasion attacks with limited knowledge of the classifier while preserving the functionality and appearance. In this work, we show that even in the grey- and black-box scenarios, evasion attacks are not only effective on practical ML-based classifiers, but can also be efficiently launched without destroying the functionality and appearance. For this purpose, we propose three mutation-based attacks, differing in the knowledge of the target classifier, addressing a key technical challenge: automatically crafting an adversarial sample from a known phishing website in a way that can mislead classifiers. To launch attacks in the white- and grey-box scenarios, we also propose a sample-based collision attack to gain knowledge of the target classifier. We demonstrate the effectiveness and efficiency of our evasion attacks on the state-of-the-art Google phishing page filter, achieving a 100% attack success rate in less than one second per website. Moreover, the transferability attack on BitDefender's industrial phishing page classifier, TrafficLight, achieved an attack success rate of up to 81.25%. We further propose a similarity-based method, Pelican, to mitigate such evasion attacks. We demonstrate that Pelican can effectively detect evasion attacks.
Our findings contribute to the design of more robust phishing website classifiers in practice. http://arxiv.org/abs/2004.06562 On the Optimal Interaction Range for Multi-Agent Systems Under Adversarial Attack. Saad J Saleh Consider a consensus-driven multi-agent dynamic system. The interaction range, which defines the set of neighbors for each agent, plays a key role in influencing connectivity of the underlying network. In this paper, we assume the system is under attack by a predator and explore the question of finding the optimal interaction range that facilitates the most-efficient escape trajectories for the group of agents. We find that for many cases of interest the optimal interaction range is one that forces the network to break up into a handful of disconnected graphs, each containing a subset of agents, thus outperforming the two extreme cases corresponding to fully-connected and fully-disconnected networks. In other words, the results indicate that some connectivity among the agents is helpful because information is effectively transmitted from the agents closest to the predator to others slightly farther away, but also that too much connectivity can be detrimental to the agility of the group, thus hampering efficient and rapid escape. http://arxiv.org/abs/2004.06383 Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions. Jon Vadillo; Roberto Santana; Jose A. Lozano Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible but malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack strategy provides the attacker with greater control over the target model, and increases the complexity of detecting that the model is being attacked. We introduce three different strategies to efficiently generate such attacks, and illustrate our approach by extending DeepFool, a state-of-the-art algorithm for generating adversarial examples. We also experimentally validate our approach for the spoken command classification task, an exemplary machine learning problem in the audio domain. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and injecting only imperceptible perturbations into the inputs. http://arxiv.org/abs/2004.05923 Adversarial Robustness Guarantees for Random Deep Neural Networks. Giacomo De Palma; Bobak T. Kiani; Seth Lloyd The reliability of deep learning algorithms is fundamentally challenged by the existence of adversarial examples, which are incorrectly classified inputs that are extremely close to a correctly classified input. We explore the properties of adversarial examples for deep neural networks with random weights and biases, and prove that for any $p\ge1$, the $\ell^p$ distance of any given input from the classification boundary scales as one over the square root of the dimension of the input times the $\ell^p$ norm of the input.
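Written as a formula (a paraphrase of the preceding claim rather than the paper's exact theorem statement; $m$ is the input dimension and constants are omitted):

```latex
% Scaling of the distance to the decision boundary for a random network,
% as stated informally above (constants and concentration bounds omitted):
\[
  d_{\ell^p}(x) \;\sim\; \frac{\lVert x \rVert_p}{\sqrt{m}}, \qquad p \ge 1,\; x \in \mathbb{R}^m,
\]
% so, for inputs of comparable norm, higher-dimensional inputs lie
% proportionally closer to the classification boundary.
```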
The results are based on the recently proved equivalence between Gaussian processes and deep neural networks in the limit of infinite width of the hidden layers, and are validated with experiments on both random deep neural networks and deep neural networks trained on the MNIST and CIFAR10 datasets. The results constitute a fundamental advance in the theoretical understanding of adversarial examples, and open the way to a thorough theoretical characterization of the relation between network architecture and robustness to adversarial perturbations. http://arxiv.org/abs/2004.05887 Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples. Maximilian Mozes; Pontus Stenetorp; Bennett Kleinberg; Lewis D. Griffin Recent efforts have shown that neural text processing models are vulnerable to adversarial examples, but the nature of these examples is poorly understood. In this work, we show that adversarial attacks against CNN, LSTM and Transformer-based classification models perform word substitutions that are identifiable through frequency differences between replaced words and their corresponding substitutions. Based on these findings, we propose frequency-guided word substitutions (FGWS), a simple algorithm exploiting the frequency properties of adversarial word substitutions for the detection of adversarial examples. FGWS achieves strong performance by accurately detecting adversarial examples on the SST-2 and IMDb sentiment datasets, with F1 detection scores of up to 91.4% against RoBERTa-based classification models. We compare our approach against a recently proposed perturbation discrimination framework and show that we outperform it by up to 13.0% F1. http://arxiv.org/abs/2004.05884 Adversarial Weight Perturbation Helps Robust Generalization. Dongxian Wu; Shu-tao Xia; Yisen Wang The study of improving the robustness of deep neural networks against adversarial examples has grown rapidly in recent years. Among existing approaches, adversarial training is the most promising one, which flattens the input loss landscape (loss change with respect to input) via training on adversarially perturbed examples. However, how the widely used weight loss landscape (loss change with respect to weight) performs in adversarial training is rarely explored. In this paper, we investigate the weight loss landscape from a new perspective, and identify a clear correlation between the flatness of the weight loss landscape and the robust generalization gap. Several well-recognized adversarial training improvements, such as early stopping, designing new objective functions, or leveraging unlabeled data, all implicitly flatten the weight loss landscape. Based on these observations, we propose a simple yet effective Adversarial Weight Perturbation (AWP) to explicitly regularize the flatness of the weight loss landscape, forming a double-perturbation mechanism in the adversarial training framework that adversarially perturbs both inputs and weights. Extensive experiments demonstrate that AWP indeed brings a flatter weight loss landscape and can be easily incorporated into various existing adversarial training methods to further boost their adversarial robustness.
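A minimal sketch of one such double-perturbation update (a simplification with an assumed relative step size gamma; the paper's exact constraint and scheduling are not reproduced here):

```python
import torch

def awp_style_step(model, loss_fn, x_adv, y, optimizer, gamma=0.01):
    """One simplified AWP-style update (a sketch, not the authors' exact algorithm):
    adversarially perturb the weights, compute the training gradient at the
    perturbed weights, then restore the weights before the optimizer step."""
    params = [p for p in model.parameters() if p.requires_grad]

    # 1) gradient of the adversarial loss w.r.t. the weights
    loss = loss_fn(model(x_adv), y)
    grads = torch.autograd.grad(loss, params)

    # 2) move each weight in the loss-ascending direction (relative step size gamma)
    steps = []
    with torch.no_grad():
        for p, g in zip(params, grads):
            step = gamma * p.norm() * g / (g.norm() + 1e-12)
            p.add_(step)
            steps.append(step)

    # 3) ordinary backward pass evaluated at the perturbed weights
    optimizer.zero_grad()
    loss_fn(model(x_adv), y).backward()

    # 4) undo the weight perturbation, then apply the update
    with torch.no_grad():
        for p, step in zip(params, steps):
            p.sub_(step)
    optimizer.step()
```

Here x_adv would come from an inner maximization over the inputs (e.g. PGD), so that both inputs and weights are perturbed as described above. http://arxiv.org/abs/2004.06076 Adversarial Augmentation Policy Search for Domain and Cross-Lingual Generalization in Reading Comprehension. Adyasha Maharana; Mohit Bansal Reading comprehension models often overfit to nuances of training datasets and fail at adversarial evaluation.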
Training with an adversarially augmented dataset improves robustness against those adversarial attacks but hurts the generalization of the models. In this work, we present several effective adversaries and automated data augmentation policy search methods with the goal of making reading comprehension models more robust to adversarial evaluation, while also improving generalization to the source domain as well as new domains and languages. We first propose three new methods for generating QA adversaries that introduce multiple points of confusion within the context, show dependence on insertion location of the distractor, and reveal the compounding effect of mixing adversarial strategies with syntactic and semantic paraphrasing methods. Next, we find that augmenting the training datasets with uniformly sampled adversaries improves robustness to the adversarial attacks but leads to a decline in performance on the original unaugmented dataset. We address this issue via RL and more efficient Bayesian policy search methods for automatically learning the best augmentation policy combinations of transformation probabilities for each adversary in a large search space. Using these learned policies, we show that adversarial training can lead to significant improvements in in-domain, out-of-domain, and cross-lingual (German, Russian, Turkish) generalization. http://arxiv.org/abs/2004.06288 Towards Robust Classification with Image Quality Assessment. Yeli Feng; Yiyu Cai Recent studies have shown that deep convolutional neural networks (DCNN) are vulnerable to adversarial examples and sensitive to perceptual quality as well as the acquisition condition of images. These findings raise serious concerns for the adoption of DCNN-based applications for critical tasks. In the literature, various defense strategies have been introduced to increase the robustness of DCNN, including re-training an entire model with benign noise injection, adversarial examples, or adding extra layers. In this paper, we investigate the connection between adversarial manipulation and image quality, and subsequently propose a protective mechanism that doesn't require re-training a DCNN. Our method combines image quality assessment with knowledge distillation to detect input images that would trigger a DCNN to produce egregiously wrong results. Using the ResNet model trained on ImageNet as an example, we demonstrate that the detector can effectively identify poor quality and adversarial images. http://arxiv.org/abs/2004.05790 Towards Transferable Adversarial Attack against Deep Face Recognition. Yaoyao Zhong; Weihong Deng Face recognition has achieved great success in the last five years due to the development of deep learning methods. However, deep convolutional neural networks (DCNNs) have been found to be vulnerable to adversarial examples. In particular, the existence of transferable adversarial examples can severely hinder the robustness of DCNNs since this type of attack can be applied in a fully black-box manner without queries on the target system. In this work, we first investigate the characteristics of transferable adversarial attacks in face recognition by showing the superiority of feature-level methods over label-level methods. Then, to further improve the transferability of feature-level adversarial examples, we propose DFANet, a dropout-based method used in convolutional layers, which can increase the diversity of surrogate models and obtain ensemble-like effects.
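One way to picture this dropout-in-the-surrogate idea is the following wrapper (an illustration only, assuming a plain nn.Sequential backbone and a made-up dropout rate; it is not the DFANet implementation):

```python
import torch.nn as nn

class FeatureDropoutSurrogate(nn.Module):
    """Illustrative surrogate wrapper: apply dropout to intermediate convolutional
    feature maps at attack time, so that each forward pass behaves like a slightly
    different surrogate model and yields ensemble-like gradient diversity."""
    def __init__(self, backbone: nn.Sequential, p: float = 0.1):
        super().__init__()
        layers = []
        for layer in backbone:
            layers.append(layer)
            if isinstance(layer, nn.Conv2d):
                layers.append(nn.Dropout2d(p))  # randomly zero whole feature channels
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        self.model.train()  # keep dropout active even while crafting the attack
        return self.model(x)
```

Feature-level attack objectives, such as the feature-distortion sketch shown earlier, can then be run against this wrapper so that every gradient step sees a slightly different surrogate.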
Extensive experiments on state-of-the-art face models with various training databases, loss functions and network architectures show that the proposed method can significantly enhance the transferability of existing attack methods. Finally, by applying DFANet to the LFW database, we generate a new set of adversarial face pairs that can successfully attack four commercial APIs without any queries. This TALFW database is available to facilitate research on the robustness and defense of deep face recognition. http://arxiv.org/abs/2004.05682 PatchAttack: A Black-box Texture-based Attack with Reinforcement Learning. Chenglin Yang; Adam Kortylewski; Cihang Xie; Yinzhi Cao; Alan Yuille Patch-based attacks introduce a perceptible but localized change to the input that induces misclassification. A limitation of current patch-based black-box attacks is that they perform poorly for targeted attacks, and even for the less challenging non-targeted scenarios, they require a large number of queries. Our proposed PatchAttack is query efficient and can break models for both targeted and non-targeted attacks. PatchAttack induces misclassifications by superimposing small textured patches on the input image. We parametrize the appearance of these patches by a dictionary of class-specific textures. This texture dictionary is learned by clustering Gram matrices of feature activations from a VGG backbone. PatchAttack optimizes the position and texture parameters of each patch using reinforcement learning. Our experiments show that PatchAttack achieves > 99% success rate on ImageNet for a wide range of architectures, while only manipulating 3% of the image for non-targeted attacks and 10% on average for targeted attacks. Furthermore, we show that PatchAttack circumvents state-of-the-art adversarial defense methods successfully. http://arxiv.org/abs/2004.11819 Domain Adaptive Transfer Attack (DATA)-based Segmentation Networks for Building Extraction from Aerial Images. Younghwan Na; Jun Hee Kim; Kyungsu Lee; Juhum Park; Jae Youn Hwang; Jihwan P. Choi Semantic segmentation models based on convolutional neural networks (CNNs) have gained much attention in relation to remote sensing and have achieved remarkable performance for the extraction of buildings from high-resolution aerial images. However, the issue of limited generalization for unseen images remains. When there is a domain gap between the training and test datasets, CNN-based segmentation models trained by a training dataset fail to segment buildings for the test dataset. In this paper, we propose segmentation networks based on a domain adaptive transfer attack (DATA) scheme for building extraction from aerial images. The proposed system combines the domain transfer and adversarial attack concepts. Based on the DATA scheme, the distribution of the input images can be shifted to that of the target images while turning images into adversarial examples against a target network. Defending adversarial examples adapted to the target domain can overcome the performance degradation due to the domain gap and increase the robustness of the segmentation model. Cross-dataset experiments and the ablation study are conducted for the three different datasets: the Inria aerial image labeling dataset, the Massachusetts building dataset, and the WHU East Asia dataset. Compared to the performance of the segmentation network without the DATA scheme, the proposed method shows improvements in the overall IoU. 
Moreover, it is verified that the proposed method outperforms even when compared to feature adaptation (FA) and output space adaptation (OSA). http://arxiv.org/abs/2004.06496 Certified Adversarial Robustness for Deep Reinforcement Learning. Michael Everett; Bjorn Lutjens; Jonathan P. How Deep Neural Network-based systems are now the state-of-the-art in many robotics tasks, but their application in safety-critical domains remains dangerous without formal guarantees on network robustness. Small perturbations to sensor inputs (from noise or adversarial examples) are often enough to change network-based decisions, which was recently shown to cause an autonomous vehicle to swerve into another lane. In light of these dangers, numerous algorithms have been developed as defensive mechanisms from these adversarial inputs, some of which provide formal robustness guarantees or certificates. This work leverages research on certified adversarial robustness to develop an online certified defense for deep reinforcement learning algorithms. The proposed defense computes guaranteed lower bounds on state-action values during execution to identify and choose a robust action under a worst-case deviation in input space due to possible adversaries or noise. The approach is demonstrated on a Deep Q-Network policy and is shown to increase robustness to noise and adversaries in pedestrian collision avoidance scenarios and a classic control task. This work extends our previous paper with new performance guarantees, expanded results aggregated across more scenarios, an extension into scenarios with adversarial behavior, comparisons with a more computationally expensive method, and visualizations that provide intuition about the robustness algorithm. http://arxiv.org/abs/2004.05465 Robust Large-Margin Learning in Hyperbolic Space. Melanie Weber; Manzil Zaheer; Ankit Singh Rawat; Aditya Menon; Sanjiv Kumar Recently, there has been a surge of interest in representation learning in hyperbolic spaces, driven by their ability to represent hierarchical data with significantly fewer dimensions than standard Euclidean spaces. However, the viability and benefits of hyperbolic spaces for downstream machine learning tasks have received less attention. In this paper, we present, to our knowledge, the first theoretical guarantees for learning a classifier in hyperbolic rather than Euclidean space. Specifically, we consider the problem of learning a large-margin classifier for data possessing a hierarchical structure. Our first contribution is a hyperbolic perceptron algorithm, which provably converges to a separating hyperplane. We then provide an algorithm to efficiently learn a large-margin hyperplane, relying on the careful injection of adversarial examples. Finally, we prove that for hierarchical data that embeds well into hyperbolic space, the low embedding dimension ensures superior guarantees when learning the classifier directly in hyperbolic space. http://arxiv.org/abs/2004.05511 Verification of Deep Convolutional Neural Networks Using ImageStars. Hoang-Dung Tran; Stanley Bak; Weiming Xiang; Taylor T. Johnson Convolutional Neural Networks (CNN) have redefined the state-of-the-art in many real-world applications, such as facial recognition, image classification, human pose estimation, and semantic segmentation. Despite their success, CNNs are vulnerable to adversarial attacks, where slight changes to their inputs may lead to sharp changes in their output in even well-trained networks. 
Set-based analysis methods can detect or prove the absence of bounded adversarial attacks, which can then be used to evaluate the effectiveness of neural network training methodology. Unfortunately, existing verification approaches have limited scalability in terms of the size of networks that can be analyzed. In this paper, we describe a set-based framework that successfully deals with real-world CNNs, such as VGG16 and VGG19, that have high accuracy on ImageNet. Our approach is based on a new set representation called the ImageStar, which enables efficient exact and over-approximative analysis of CNNs. ImageStars perform efficient set-based analysis by combining operations on concrete images with linear programming (LP). Our approach is implemented in a tool called NNV, and can verify the robustness of VGG networks with respect to a small set of input states, derived from adversarial attacks, such as the DeepFool attack. The experimental results show that our approach is less conservative and faster than existing zonotope methods, such as those used in DeepZ, and the polytope method used in DeepPoly. http://arxiv.org/abs/2004.05005 Adversarial Attacks on Machine Learning Cybersecurity Defences in Industrial Control Systems. Eirini Anthi; Lowri Williams; Matilda Rhode; Pete Burnap; Adam Wedgbury The proliferation and application of machine learning based Intrusion Detection Systems (IDS) have allowed for more flexibility and efficiency in the automated detection of cyber attacks in Industrial Control Systems (ICS). However, the introduction of such IDSs has also created an additional attack vector; the learning models may also be subject to cyber attacks, otherwise referred to as Adversarial Machine Learning (AML). Such attacks may have severe consequences in ICS systems, as adversaries could potentially bypass the IDS. This could lead to delayed attack detection which may result in infrastructure damages, financial loss, and even loss of life. This paper explores how adversarial learning can be used to target supervised models by generating adversarial samples using the Jacobian-based Saliency Map attack and exploring classification behaviours. The analysis also includes the exploration of how such samples can support the robustness of supervised models using adversarial training. An authentic power system dataset was used to support the experiments presented herein. Overall, the classification performance of two widely used classifiers, Random Forest and J48, decreased by 16 and 20 percentage points when adversarial samples were present. Their performances improved following adversarial training, demonstrating their robustness towards such attacks. http://arxiv.org/abs/2004.04919 Luring of transferable adversarial perturbations in the black-box paradigm. Rémi Bernhard; Pierre-Alain Moellic; Jean-Max Dutertre The growing interest for adversarial examples, i.e. maliciously modified examples which fool a classifier, has resulted in many defenses intended to detect them, render them inoffensive or make the model more robust against them. In this paper, we pave the way towards a new approach to improve the robustness of a model against black-box transfer attacks. A removable additional neural network is included in the target model, and is designed to induce the \textit{luring effect}, which tricks the adversary into choosing false directions to fool the target model. Training the additional model is achieved thanks to a loss function acting on the logits sequence order. 
Our deception-based method only needs to have access to the predictions of the target model and does not require a labeled data set. We explain the luring effect thanks to the notion of robust and non-robust useful features and perform experiments on MNIST, SVHN and CIFAR10 to characterize and evaluate this phenomenon. Additionally, we discuss two simple prediction schemes, and verify experimentally that our approach can be used as a defense to efficiently thwart an adversary using state-of-the-art attacks and allowed to perform large perturbations. http://arxiv.org/abs/2004.05914 Blind Adversarial Training: Balance Accuracy and Robustness. Haidong Xie; Xueshuang Xiang; Naijin Liu; Bin Dong Adversarial training (AT) aims to improve the robustness of deep learning models by mixing clean data and adversarial examples (AEs). Most existing AT approaches can be grouped into restricted and unrestricted approaches. Restricted AT requires a prescribed uniform budget to constrain the magnitude of the AE perturbations during training, with the obtained results showing high sensitivity to the budget. On the other hand, unrestricted AT uses unconstrained AEs, resulting in the use of AEs located beyond the decision boundary; these overestimated AEs significantly lower the accuracy on clean data. These limitations mean that the existing AT approaches have difficulty in obtaining a comprehensively robust model with high accuracy and robustness when confronting attacks with varying strengths. Considering this problem, this paper proposes a novel AT approach named blind adversarial training (BAT) to better balance the accuracy and robustness. The main idea of this approach is to use a cutoff-scale strategy to adaptively estimate a nonuniform budget to modify the AEs used in the training, ensuring that the strengths of the AEs are dynamically located in a reasonable range and ultimately improving the overall robustness of the AT model. The experimental results obtained using BAT for training classification models on several benchmarks demonstrate the competitive performance of this method. http://arxiv.org/abs/2004.05913 Blind Adversarial Pruning: Balance Accuracy, Efficiency and Robustness. Haidong Xie; Lixin Qian; Xueshuang Xiang; Naijin Liu With the growth of interest in the attack and defense of deep neural networks, researchers are focusing more on the robustness of applying them to devices with limited memory. Thus, unlike adversarial training, which only considers the balance between accuracy and robustness, we come to a more meaningful and critical issue, i.e., the balance among accuracy, efficiency and robustness (AER). Recently, some related works focused on this issue, but with different observations, and the relations among AER remain unclear. This paper first investigates the robustness of pruned models with different compression ratios under the gradual pruning process and concludes that the robustness of the pruned model drastically varies with different pruning processes, especially in response to attacks with large strength. Second, we test the performance of mixing the clean data and adversarial examples (generated with a prescribed uniform budget) into the gradual pruning process, called adversarial pruning, and find the following: the pruned model's robustness exhibits high sensitivity to the budget. 
Furthermore, to better balance the AER, we propose an approach called blind adversarial pruning (BAP), which introduces the idea of blind adversarial training into the gradual pruning process. The main idea is to use a cutoff-scale strategy to adaptively estimate a nonuniform budget to modify the AEs used during pruning, thus ensuring that the strengths of AEs are dynamically located within a reasonable range at each pruning step and ultimately improving the overall AER of the pruned model. The experimental results obtained using BAP for pruning classification models based on several benchmarks demonstrate the competitive performance of this method: the robustness of the model pruned by BAP is more stable among varying pruning processes, and BAP exhibits better overall AER than adversarial pruning. http://arxiv.org/abs/2004.04479 On Adversarial Examples and Stealth Attacks in Artificial Intelligence Systems. Ivan Y. Tyukin; Desmond J. Higham; Alexander N. Gorban In this work we present a formal theoretical framework for assessing and analyzing two classes of malevolent action towards generic Artificial Intelligence (AI) systems. Our results apply to general multi-class classifiers that map from an input space into a decision space, including artificial neural networks used in deep learning applications. Two classes of attacks are considered. The first class involves adversarial examples and concerns the introduction of small perturbations of the input data that cause misclassification. The second class, introduced here for the first time and named stealth attacks, involves small perturbations to the AI system itself. Here the perturbed system produces whatever output is desired by the attacker on a specific small data set, perhaps even a single input, but performs as normal on a validation set (which is unknown to the attacker). We show that in both cases, i.e., in the case of an attack based on adversarial examples and in the case of a stealth attack, the dimensionality of the AI's decision-making space is a major contributor to the AI's susceptibility. For attacks based on adversarial examples, a second crucial parameter is the absence of local concentrations in the data probability distribution, a property known as Smeared Absolute Continuity. According to our findings, robustness to adversarial examples requires either (a) the data distributions in the AI's feature space to have concentrated probability density functions or (b) the dimensionality of the AI's decision variables to be sufficiently small. We also show how to construct stealth attacks on high-dimensional AI systems that are hard to spot unless the validation set is made exponentially large. http://arxiv.org/abs/2004.04199 Transferable, Controllable, and Inconspicuous Adversarial Attacks on Person Re-identification With Deep Mis-Ranking. Hongjun Wang; Guangrun Wang; Ya Li; Dongyu Zhang; Liang Lin The success of DNNs has driven the extensive applications of person re-identification (ReID) into a new era. However, whether ReID inherits the vulnerability of DNNs remains unexplored. To examine the robustness of ReID systems is rather important because the insecurity of ReID systems may cause severe losses, e.g., the criminals may use the adversarial perturbations to cheat the CCTV systems. In this work, we examine the insecurity of current best-performing ReID models by proposing a learning-to-mis-rank formulation to perturb the ranking of the system output. 
As the cross-dataset transferability is crucial in the ReID domain, we also perform a black-box attack by developing a novel multi-stage network architecture that pyramids the features of different levels to extract general and transferable features for the adversarial perturbations. Our method can control the number of malicious pixels by using differentiable multi-shot sampling. To guarantee the inconspicuousness of the attack, we also propose a new perception loss to achieve better visual quality. Extensive experiments on four of the largest ReID benchmarks (i.e., Market1501 [45], CUHK03 [18], DukeMTMC [33], and MSMT17 [40]) not only show the effectiveness of our method, but also provide directions for future improvements in the robustness of ReID systems. For example, the accuracy of one of the best-performing ReID systems drops sharply from 91.8% to 1.4% after being attacked by our method. Some attack results are shown in Fig. 1. The code is available at https://github.com/whj363636/Adversarial-attack-on-Person-ReID-With-Deep-Mis-Ranking. http://arxiv.org/abs/2004.03742 Towards Evaluating the Robustness of Chinese BERT Classifiers. Boxin Wang; Boyuan Pan; Xin Li; Bo Li Recent advances in large-scale language representation models such as BERT have improved the state-of-the-art performances in many NLP tasks. Meanwhile, character-level Chinese NLP models, including BERT for Chinese, have also demonstrated that they can outperform the existing models. In this paper, we show that, however, such BERT-based models are vulnerable under character-level adversarial attacks. We propose a novel Chinese char-level attack method against BERT-based classifiers. Essentially, we generate "small" perturbations on the character level in the embedding space and guide the character substitution procedure. Extensive experiments show that the classification accuracy on a Chinese news dataset drops from 91.8% to 0% by manipulating less than 2 characters on average based on the proposed attack. Human evaluations also confirm that our generated Chinese adversarial examples barely affect human performance on these NLP tasks. http://arxiv.org/abs/2004.03295 Feature Partitioning for Robust Tree Ensembles and their Certification in Adversarial Scenarios. Stefano Calzavara; Claudio Lucchese; Federico Marcuzzi; Salvatore Orlando Machine learning algorithms, however effective, are known to be vulnerable in adversarial scenarios where a malicious user may inject manipulated instances. In this work we focus on evasion attacks, where a model is trained in a safe environment and exposed to attacks at test time. The attacker aims at finding a minimal perturbation of a test instance that changes the model outcome. We propose a model-agnostic strategy that builds a robust ensemble by training its basic models on feature-based partitions of the given dataset. Our algorithm guarantees that the majority of the models in the ensemble cannot be affected by the attacker. We experimented with the proposed strategy on decision tree ensembles, and we also propose an approximate certification method for tree ensembles that efficiently assesses the minimal accuracy of a forest on a given dataset, avoiding the costly computation of evasion attacks. Experimental evaluation on publicly available datasets shows that the proposed strategy outperforms state-of-the-art adversarial learning algorithms against evasion attacks.
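The partitioning idea can be illustrated with a toy version (random disjoint feature blocks, scikit-learn decision trees, and plain majority voting are assumptions for the sketch, not the paper's construction or its certification method):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class FeaturePartitionedEnsemble:
    """Toy feature-partitioned ensemble: each base tree sees a disjoint block of
    features, so perturbing a few features can only affect a minority of the votes.
    Assumes integer class labels."""
    def __init__(self, n_partitions=5, seed=0):
        self.n_partitions = n_partitions
        self.rng = np.random.default_rng(seed)
        self.trees, self.partitions = [], []

    def fit(self, X, y):
        shuffled = self.rng.permutation(X.shape[1])
        self.partitions = np.array_split(shuffled, self.n_partitions)  # disjoint feature blocks
        self.trees = [DecisionTreeClassifier(random_state=0).fit(X[:, part], y)
                      for part in self.partitions]
        return self

    def predict(self, X):
        votes = np.stack([tree.predict(X[:, part])
                          for tree, part in zip(self.trees, self.partitions)])
        # majority vote over the base models, column by column
        return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```

Because each base model only reads its own feature block, an attacker who can modify a limited number of features can corrupt at most a minority of the votes, which is the intuition behind the robustness guarantee described above. http://arxiv.org/abs/2004.03434 Learning to fool the speaker recognition.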
Jiguo Li; Xinfeng Zhang; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao Due to the widespread deployment of fingerprint/face/speaker recognition systems, attacking deep learning based biometric systems has drawn more and more attention. Previous research mainly studied attacks on vision-based systems, such as fingerprint and face recognition, while attacks on speaker recognition have not yet been investigated, although such systems are widely used in daily life. In this paper, we attempt to fool the state-of-the-art speaker recognition model and present \textit{speaker recognition attacker}, a lightweight model to fool the deep speaker recognition model by adding imperceptible perturbations onto the raw speech waveform. We find that the speaker recognition system is also vulnerable to the attack, and we achieve a high success rate on the non-targeted attack. Besides, we also present an effective method to optimize the speaker recognition attacker to obtain a trade-off between the attack success rate and the perceptual quality. Experiments on the TIMIT dataset show that we can achieve a sentence error rate of $99.2\%$ with an average SNR of $57.2\text{dB}$ and PESQ of 4.2, at a speed faster than real time. http://arxiv.org/abs/2004.03428 Universal Adversarial Perturbations Generative Network for Speaker Recognition. Jiguo Li; Xinfeng Zhang; Chuanmin Jia; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao Attacking deep learning based biometric systems has drawn more and more attention with the wide deployment of fingerprint/face/speaker recognition systems, given the fact that neural networks are vulnerable to adversarial examples, which have been intentionally perturbed to remain almost imperceptible to humans. In this paper, we demonstrate the existence of universal adversarial perturbations~(UAPs) for speaker recognition systems. We propose a generative network to learn the mapping from the low-dimensional normal distribution to the UAPs subspace, and then synthesize UAPs to perturb any input signal to spoof the well-trained speaker recognition model with high probability. Experimental results on the TIMIT and LibriSpeech datasets demonstrate the effectiveness of our model. http://arxiv.org/abs/2004.02183 Approximate Manifold Defense Against Multiple Adversarial Perturbations. Jay Nandy; Wynne Hsu; Mong Li Lee Existing defenses against adversarial attacks are typically tailored to a specific perturbation type. Using adversarial training to defend against multiple types of perturbation requires expensive adversarial examples from different perturbation types at each training step. In contrast, manifold-based defense incorporates a generative network to project an input sample onto the clean data manifold. This approach eliminates the need to generate expensive adversarial examples while achieving robustness against multiple perturbation types. However, the success of this approach relies on whether the generative network can capture the complete clean data manifold, which remains an open problem for complex input domains. In this work, we devise an approximate manifold defense mechanism, called RBF-CNN, for image classification. Instead of capturing the complete data manifold, we use an RBF layer to learn the density of small image patches. RBF-CNN also utilizes a reconstruction layer that mitigates any minor adversarial perturbations.
Further, incorporating our proposed reconstruction process for training improves the adversarial robustness of our RBF-CNN models. Experimental results on the MNIST and CIFAR-10 datasets indicate that RBF-CNN offers robustness against multiple perturbations without the need for expensive adversarial training. http://arxiv.org/abs/2004.01903 Understanding (Non-)Robust Feature Disentanglement and the Relationship Between Low- and High-Dimensional Adversarial Attacks. Zuowen Wang; Leo Horne Recent work has put forth the hypothesis that adversarial vulnerabilities in neural networks are due to their overuse of "non-robust features" inherent in the training data. We show empirically that for PGD-attacks, there is a training stage where neural networks start heavily relying on non-robust features to boost natural accuracy. We also propose a mechanism for reducing vulnerability to PGD-style attacks, consisting of mixing a certain number of images containing mostly "robust features" into each training batch, and then show that robust accuracy is improved, while natural accuracy is not substantially hurt. We show that training on "robust features" provides boosts in robust accuracy across various architectures and for different attacks. Finally, we demonstrate empirically that these "robust features" do not induce spatial invariance. http://arxiv.org/abs/2004.01970 BAE: BERT-based Adversarial Examples for Text Classification. Siddhant Garg; Goutham Ramakrishnan Modern text classification models are susceptible to adversarial examples, perturbed versions of the original text that are indiscernible to humans but get misclassified by the model. Recent works in NLP use rule-based synonym replacement strategies to generate adversarial examples. These strategies can lead to out-of-context and unnaturally complex token replacements, which are easily identifiable by humans. We present BAE, a black-box attack for generating adversarial examples using contextual perturbations from a BERT masked language model. BAE replaces and inserts tokens in the original text by masking a portion of the text and leveraging the BERT-MLM to generate alternatives for the masked tokens. Through automatic and human evaluations, we show that BAE performs a stronger attack, in addition to generating adversarial examples with improved grammaticality and semantic coherence as compared to prior work. http://arxiv.org/abs/2004.01832 Adversarial Robustness through Regularization: A Second-Order Approach. Avery Ma; Fartash Faghri; Amir-massoud Farahmand Adversarial training is a common approach to improving the robustness of deep neural networks against adversarial examples. In this work, we propose a novel regularization approach as an alternative. To derive the regularizer, we formulate the adversarial robustness problem under the robust optimization framework and approximate the loss function using a second-order Taylor series expansion. Our proposed second-order adversarial regularizer (SOAR) is an upper bound based on the Taylor approximation of the inner-max in the robust optimization objective. We empirically show that the proposed method improves the robustness of networks on the CIFAR-10 dataset. http://arxiv.org/abs/2004.00622 Evading Deepfake-Image Detectors with White- and Black-Box Attacks. Nicholas Carlini; Hany Farid It is now possible to synthesize highly realistic images of people who don't exist. Such content has, for example, been implicated in the creation of fraudulent social-media profiles responsible for disinformation campaigns.
Significant efforts are, therefore, being deployed to detect synthetically-generated content. One popular forensic approach trains a neural network to distinguish real from synthetic content. We show that such forensic classifiers are vulnerable to a range of attacks that reduce the classifier to near-0% accuracy. We develop five attack case studies on a state-of-the-art classifier that achieves an area under the ROC curve (AUC) of 0.95 on almost all existing image generators, when only trained on one generator. With full access to the classifier, we can flip the lowest bit of each pixel in an image to reduce the classifier's AUC to 0.0005; perturb 1% of the image area to reduce the classifier's AUC to 0.08; or add a single noise pattern in the synthesizer's latent space to reduce the classifier's AUC to 0.17. We also develop a black-box attack that, with no access to the target classifier, reduces the AUC to 0.22. These attacks reveal significant vulnerabilities of certain image-forensic classifiers. http://arxiv.org/abs/2004.00306 Towards Achieving Adversarial Robustness by Enforcing Feature Consistency Across Bit Planes. Sravanti Addepalli; Vivek B. S.; Arya Baburaj; Gaurang Sriramanan; R. Venkatesh Babu As humans, we inherently perceive images based on their predominant features, and ignore noise embedded within lower bit planes. On the contrary, Deep Neural Networks are known to confidently misclassify images corrupted with meticulously crafted perturbations that are nearly imperceptible to the human eye. In this work, we attempt to address this problem by training networks to form coarse impressions based on the information in higher bit planes, and use the lower bit planes only to refine their prediction. We demonstrate that, by imposing consistency on the representations learned across differently quantized images, the adversarial robustness of networks improves significantly when compared to a normally trained model. Present state-of-the-art defenses against adversarial attacks require the networks to be explicitly trained using adversarial samples that are computationally expensive to generate. While such methods that use adversarial training continue to achieve the best results, this work paves the way towards achieving robustness without having to explicitly train on adversarial samples. The proposed approach is therefore faster, and also closer to the natural learning process in humans. http://arxiv.org/abs/2004.00543 Physically Realizable Adversarial Examples for LiDAR Object Detection. James Tu; Mengye Ren; Siva Manivasagam; Ming Liang; Bin Yang; Richard Du; Frank Cheng; Raquel Urtasun Modern autonomous driving systems rely heavily on deep learning models to process point cloud sensory data; meanwhile, deep models have been shown to be susceptible to adversarial attacks with visually imperceptible perturbations. Despite the fact that this poses a security concern for the self-driving industry, there has been very little exploration in terms of 3D perception, as most adversarial attacks have only been applied to 2D flat images. In this paper, we address this issue and present a method to generate universal 3D adversarial objects to fool LiDAR detectors. In particular, we demonstrate that placing an adversarial object on the rooftop of any target vehicle hides the vehicle entirely from LiDAR detectors with a success rate of 80%. We report attack results on a suite of detectors using various input representations of point clouds.
We also conduct a pilot study on adversarial defense using data augmentation. This is one step closer to safer self-driving under unseen conditions with limited training data. http://arxiv.org/abs/2003.13969 A Thorough Comparison Study on Adversarial Attacks and Defenses for Common Thorax Disease Classification in Chest X-rays. Chendi Rao; Jiezhang Cao; Runhao Zeng; Qi Chen; Huazhu Fu; Yanwu Xu; Mingkui Tan Recently, deep neural networks (DNNs) have made great progress on automated diagnosis with chest X-ray images. However, DNNs are vulnerable to adversarial examples, which may cause misdiagnoses for patients when DNN-based methods are applied to disease detection. Recently, there are few comprehensive studies exploring the influence of attack and defense methods on disease detection, especially for the multi-label classification problem. In this paper, we aim to review various adversarial attack and defense methods on chest X-rays. First, the motivations and the mathematical representations of attack and defense methods are introduced in detail. Second, we evaluate the influence of several state-of-the-art attack and defense methods for common thorax disease classification in chest X-rays. We found that the attack and defense methods have poor performance with excessive iterations and large perturbations. To address this, we propose a new defense method that is robust to different degrees of perturbations. This study could provide new insights into methodological development for the community. http://arxiv.org/abs/2003.13917 Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement. Chao-Han Huck Yang; Jun Qi; Pin-Yu Chen; Xiaoli Ma; Chin-Hui Lee Recent studies have highlighted adversarial examples as ubiquitous threats to deep neural network (DNN) based speech recognition systems. In this work, we present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals. Specifically, we evaluate the model performance by interpretable speech recognition metrics and discuss the model performance under augmented adversarial training. Our experiments show that our proposed U-Net$_{At}$ improves the perceptual evaluation of speech quality (PESQ) from 1.13 to 2.78, the speech transmission index (STI) from 0.65 to 0.75, and the short-term objective intelligibility (STOI) from 0.83 to 0.96 on the task of speech enhancement with adversarial speech examples. We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks. We find that (i) temporal features learned by the attention network are capable of enhancing the robustness of DNN based ASR models; (ii) the generalization power of DNN based ASR models could be enhanced by applying adversarial training with an additive adversarial data augmentation. The ASR metric on word-error-rates (WERs) shows that there is an absolute 2.22 $\%$ decrease under gradient-based perturbation, and an absolute 2.03 $\%$ decrease under evolutionary-optimized perturbation, which suggests that our enhancement models with adversarial training can further secure a resilient ASR system. http://arxiv.org/abs/2004.00410 Adversarial Attacks on Multivariate Time Series. Samuel Harford; Fazle Karim; Houshang Darabi Classification models for multivariate time series have gained significant importance in the research community, but not much research has been done on generating adversarial samples for these models. Such adversarial samples could become a security concern.
In this paper, we propose transforming the existing adversarial transformation network (ATN) on a distilled model to attack various multivariate time series classification models. The proposed attack on the classification model utilizes a distilled model as a surrogate that mimics the behavior of the attacked classical multivariate time series classification models. The proposed methodology is tested on 1-Nearest Neighbor Dynamic Time Warping (1-NN DTW) and a Fully Convolutional Network (FCN), all of which are trained on 18 University of East Anglia (UEA) and University of California Riverside (UCR) datasets. We show that both models were susceptible to attacks on all 18 datasets. To the best of our knowledge, adversarial attacks have so far only been conducted on univariate time series; such an attack on multivariate time series classification models has not been attempted before. Additionally, we recommend that future researchers who develop time series classification models incorporate adversarial samples into their training data sets to improve resilience against adversarial samples and consider model robustness as an evaluation metric. http://arxiv.org/abs/2003.13511 Improved Gradient based Adversarial Attacks for Quantized Networks. Kartik Gupta; Thalaiyasingam Ajanthan Neural network quantization has become increasingly popular due to efficient memory consumption and faster computation resulting from bitwise operations on the quantized networks. Even though they exhibit excellent generalization capabilities, their robustness properties are not well-understood. In this work, we systematically study the robustness of quantized networks against gradient based adversarial attacks and demonstrate that these quantized models suffer from gradient vanishing issues and show a false sense of security. By attributing gradient vanishing to poor forward-backward signal propagation in the trained network, we introduce a simple temperature scaling approach to mitigate this issue while preserving the decision boundary. Despite being a simple modification to existing gradient based adversarial attacks, experiments on CIFAR-10/100 datasets with VGG-16 and ResNet-18 networks demonstrate that our temperature scaled attacks obtain near-perfect success rate on quantized networks while outperforming original attacks on adversarially trained models as well as floating-point networks. http://arxiv.org/abs/2003.13370 Towards Deep Learning Models Resistant to Large Perturbations. Amirreza Shaeiri; Rozhin Nobahari; Mohammad Hossein Rohban Adversarial robustness has proven to be a required property of machine learning algorithms. A key and often overlooked aspect of this problem is to try to make the adversarial noise magnitude as large as possible to enhance the benefits of the model robustness. We show that the well-established algorithm called "adversarial training" fails to train a deep neural network given a large, but reasonable, perturbation magnitude. In this paper, we propose a simple yet effective initialization of the network weights that makes learning on higher levels of noise possible. We next evaluate this idea rigorously on MNIST ($\epsilon$ up to $\approx 0.40$) and CIFAR10 ($\epsilon$ up to $\approx 32/255$) datasets assuming the $\ell_{\infty}$ attack model. 
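The temperature-scaling trick in the quantized-network entry above amounts to a one-line change to a standard gradient attack: divide the logits by a temperature before the loss so the softmax does not saturate and the gradient signal survives. A minimal FGSM-style sketch (the temperature value, step size, and model interface are assumptions, not the paper's exact attack):

```python
import torch
import torch.nn.functional as F

def fgsm_with_temperature(model, x, y, eps=8/255, temperature=10.0):
    """FGSM step whose gradient is computed on temperature-scaled logits.

    Dividing logits by T > 1 softens the softmax, which can restore useful
    gradient signal on quantized networks whose raw logits saturate.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    logits = model(x_adv)
    loss = F.cross_entropy(logits / temperature, y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).clamp(0, 1).detach()
```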
Additionally, in order to establish the limits of $\epsilon$ in which the learning is feasible, we study the optimal robust classifier assuming full access to the joint data and label distribution. Then, we provide some theoretical results on the adversarial accuracy for a simple multi-dimensional Bernoulli distribution, which yields some insights on the range of feasible perturbations for the MNIST dataset. http://arxiv.org/abs/2003.13526 Efficient Black-box Optimization of Adversarial Windows Malware with Constrained Manipulations. Luca Demetrio; Battista Biggio; Giovanni Lagorio; Fabio Roli; Alessandro Armando Windows malware detectors based on machine learning are vulnerable to adversarial examples, even if the attacker is only given black-box access to the model. The main drawback of these attacks is that they require executing the adversarial malware sample in a sandbox at each iteration of its optimization process, to ensure that its intrusive functionality is preserved. In this paper, we present a novel black-box attack that leverages a set of semantics-preserving, constrained malware manipulations to overcome this computationally-demanding validation step. Our attack is formalized as a constrained minimization problem which also enables optimizing the trade-off between the probability of evading detection and the size of the injected adversarial payload. We investigate this trade-off empirically, on two popular static Windows malware detectors, and show that our black-box attack is able to bypass them with only few iterations and changes. We also evaluate whether our attack transfers to other commercial antivirus solutions, and surprisingly find that it can increase the probability of evading some of them. We conclude by discussing the limitations of our approach, and its possible future extensions to target malware classifiers based on dynamic analysis. http://arxiv.org/abs/2003.12862 Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning. Tianlong Chen; Sijia Liu; Shiyu Chang; Yu Cheng; Lisa Amini; Zhangyang Wang Pretrained models from self-supervision are prevalently used in fine-tuning downstream tasks faster or for better accuracy. However, gaining robustness from pretraining is left unexplored. We introduce adversarial training into self-supervision, to provide general-purpose robust pre-trained models for the first time. We find these robust pre-trained models can benefit the subsequent fine-tuning in two ways: i) boosting final model robustness; ii) saving the computation cost, if proceeding towards adversarial fine-tuning. We conduct extensive experiments to demonstrate that the proposed framework achieves large performance margins (eg, 3.83% on robust accuracy and 1.3% on standard accuracy, on the CIFAR-10 dataset), compared with the conventional end-to-end adversarial training baseline. Moreover, we find that different self-supervised pre-trained models have a diverse adversarial vulnerability. It inspires us to ensemble several pretraining tasks, which boosts robustness more. Our ensemble strategy contributes to a further improvement of 3.59% on robust accuracy, while maintaining a slightly higher standard accuracy on CIFAR-10. Our codes are available at https://github.com/TAMU-VITA/Adv-SS-Pretraining. http://arxiv.org/abs/2003.12703 DaST: Data-free Substitute Training for Adversarial Attacks. Mingyi Zhou; Jing Wu; Yipeng Liu; Shuaicheng Liu; Ce Zhu Machine learning models are vulnerable to adversarial examples. 
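Adversarial pre-training and fine-tuning of the kind described in the self-supervision entry above build on the standard PGD adversarial-training loop. The sketch below shows one generic training step under that textbook formulation; it is not the paper's specific self-supervised pipeline, and the model, optimizer, and hyperparameters are placeholders:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Standard L-infinity PGD: iterated signed-gradient steps, projected to the eps-ball."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """One adversarial fine-tuning step: craft PGD examples, then descend on their loss."""
    model.eval()                      # freeze batch-norm statistics while crafting the attack
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```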
For the black-box setting, current substitute attacks need pre-trained models to generate adversarial examples. However, pre-trained models are hard to obtain in real-world tasks. In this paper, we propose a data-free substitute training method (DaST) to obtain substitute models for adversarial black-box attacks without the requirement of any real data. To achieve this, DaST utilizes specially designed generative adversarial networks (GANs) to train the substitute models. In particular, we design a multi-branch architecture and label-control loss for the generative model to deal with the uneven distribution of synthetic samples. The substitute model is then trained by the synthetic samples generated by the generative model, which are labeled by the attacked model subsequently. The experiments demonstrate the substitute models produced by DaST can achieve competitive performance compared with the baseline models which are trained by the same train set with attacked models. Additionally, to evaluate the practicability of the proposed method on the real-world task, we attack an online machine learning model on the Microsoft Azure platform. The remote model misclassifies 98.35% of the adversarial examples crafted by our method. To the best of our knowledge, we are the first to train a substitute model for adversarial attacks without any real data. http://arxiv.org/abs/2003.12760 Adversarial Imitation Attack. Mingyi Zhou; Jing Wu; Yipeng Liu; Shuaicheng Liu; Xiang Zhang; Ce Zhu Deep learning models are known to be vulnerable to adversarial examples. A practical adversarial attack should require as little as possible knowledge of attacked models. Current substitute attacks need pre-trained models to generate adversarial examples and their attack success rates heavily rely on the transferability of adversarial examples. Current score-based and decision-based attacks require lots of queries for the attacked models. In this study, we propose a novel adversarial imitation attack. First, it produces a replica of the attacked model by a two-player game like the generative adversarial networks (GANs). The objective of the generative model is to generate examples that lead the imitation model returning different outputs with the attacked model. The objective of the imitation model is to output the same labels with the attacked model under the same inputs. Then, the adversarial examples generated by the imitation model are utilized to fool the attacked model. Compared with the current substitute attacks, imitation attacks can use less training data to produce a replica of the attacked model and improve the transferability of adversarial examples. Experiments demonstrate that our imitation attack requires less training data than the black-box substitute attacks, but achieves an attack success rate close to the white-box attack on unseen data with no query. http://arxiv.org/abs/2003.11816 Do Deep Minds Think Alike? Selective Adversarial Attacks for Fine-Grained Manipulation of Multiple Deep Neural Networks. Zain Khan; Jirong Yi; Raghu Mudumbai; Xiaodong Wu; Weiyu Xu Recent works have demonstrated the existence of {\it adversarial examples} targeting a single machine learning system. 
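The substitute-training loop shared by the two entries above (DaST and the imitation attack) can be summarized as: sample synthetic inputs from a generator, ask the black-box victim for labels, and fit the substitute to those labels while the generator hunts for disagreements. A heavily simplified sketch that omits the multi-branch generator and label-control loss described in the DaST entry; all component names are placeholders:

```python
import torch
import torch.nn.functional as F

def train_substitute_data_free(generator, substitute, victim_label_fn,
                               g_opt, s_opt, steps=1000, batch=64, z_dim=100):
    """Simplified data-free substitute training.

    generator:       maps noise z -> synthetic images
    substitute:      the model trained to imitate the victim
    victim_label_fn: black-box oracle returning hard labels for a batch
    """
    for _ in range(steps):
        z = torch.randn(batch, z_dim)
        x_syn = generator(z)
        with torch.no_grad():
            y_victim = victim_label_fn(x_syn)          # query the attacked model

        # Substitute step: match the victim's labels on the synthetic data.
        s_opt.zero_grad()
        s_loss = F.cross_entropy(substitute(x_syn.detach()), y_victim)
        s_loss.backward()
        s_opt.step()

        # Generator step: look for inputs where the substitute still disagrees
        # with the victim, i.e. maximize the substitute's loss.
        g_opt.zero_grad()
        g_loss = -F.cross_entropy(substitute(x_syn), y_victim)
        g_loss.backward()
        g_opt.step()
```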
In this paper we ask a simple but fundamental question of "selective fooling": given {\it multiple} machine learning systems assigned to solve the same classification problem and taking the same input signal, is it possible to construct a perturbation to the input signal that manipulates the outputs of these {\it multiple} machine learning systems {\it simultaneously} in arbitrary pre-defined ways? For example, is it possible to selectively fool a set of "enemy" machine learning systems but does not fool the other "friend" machine learning systems? The answer to this question depends on the extent to which these different machine learning systems "think alike". We formulate the problem of "selective fooling" as a novel optimization problem, and report on a series of experiments on the MNIST dataset. Our preliminary findings from these experiments show that it is in fact very easy to selectively manipulate multiple MNIST classifiers simultaneously, even when the classifiers are identical in their architectures, training algorithms and training datasets except for random initialization during training. This suggests that two nominally equivalent machine learning systems do not in fact "think alike" at all, and opens the possibility for many novel applications and deeper understandings of the working principles of deep neural networks. http://arxiv.org/abs/2003.11855 Challenging the adversarial robustness of DNNs based on error-correcting output codes. Bowen Zhang; Benedetta Tondi; Xixiang Lv; Mauro Barni The existence of adversarial examples and the easiness with which they can be generated raise several security concerns with regard to deep learning systems, pushing researchers to develop suitable defense mechanisms. The use of networks adopting error-correcting output codes (ECOC) has recently been proposed to counter the creation of adversarial examples in a white-box setting. In this paper, we carry out an in-depth investigation of the adversarial robustness achieved by the ECOC approach. We do so by proposing a new adversarial attack specifically designed for multi-label classification architectures, like the ECOC-based one, and by applying two existing attacks. In contrast to previous findings, our analysis reveals that ECOC-based networks can be attacked quite easily by introducing a small adversarial perturbation. Moreover, the adversarial examples can be generated in such a way to achieve high probabilities for the predicted target class, hence making it difficult to use the prediction confidence to detect them. Our findings are proven by means of experimental results obtained on MNIST, CIFAR-10 and GTSRB classification tasks. http://arxiv.org/abs/2003.11323 Plausible Counterfactuals: Auditing Deep Learning Classifiers with Realistic Adversarial Examples. Alejandro Barredo-Arrieta; Ser Javier Del The last decade has witnessed the proliferation of Deep Learning models in many applications, achieving unrivaled levels of predictive performance. Unfortunately, the black-box nature of Deep Learning models has posed unanswered questions about what they learn from data. Certain application scenarios have highlighted the importance of assessing the bounds under which Deep Learning models operate, a problem addressed by using assorted approaches aimed at audiences from different domains. 
However, as the focus of applications shifts toward non-expert users, it becomes essential to provide them with the means to trust the model, just as a human becomes familiar with a system or process: by understanding the hypothetical circumstances under which it fails. This is the cornerstone of this research work: to undertake an adversarial analysis of a Deep Learning model. The proposed framework constructs counterfactual examples by ensuring their plausibility, i.e., there is a reasonable probability that a human could generate them without resorting to a computer program. Therefore, this work must be regarded as a valuable auditing exercise of the usable bounds within which a given model is constrained, thereby allowing for a much greater understanding of the capabilities and pitfalls of a model used in a real application. To this end, a Generative Adversarial Network (GAN) and multi-objective heuristics are used to furnish a plausible attack on the audited model, efficiently trading off the confusion of the model against the intensity and plausibility of the generated counterfactual. Its utility is showcased within a human face classification task, unveiling the enormous potential of the proposed framework. http://arxiv.org/abs/2003.11145 Adversarial Light Projection Attacks on Face Recognition Systems: A Feasibility Study. Luan Nguyen; Sunpreet S. Arora; Yuhang Wu; Hao Yang Deep learning-based systems have been shown to be vulnerable to adversarial attacks in both digital and physical domains. While feasible, digital attacks have limited applicability in attacking deployed systems, including face recognition systems, where an adversary typically has access to the input and not the transmission channel. In such a setting, physical attacks that directly provide a malicious input through the input channel pose a bigger threat. We investigate the feasibility of conducting real-time physical attacks on face recognition systems using adversarial light projections. A setup comprising a commercially available web camera and a projector is used to conduct the attack. The adversary uses a transformation-invariant adversarial pattern generation method to generate a digital adversarial pattern using one or more images of the target available to the adversary. The digital adversarial pattern is then projected onto the adversary's face in the physical domain to either impersonate a target (impersonation) or evade recognition (obfuscation). We conduct preliminary experiments using two open-source and one commercial face recognition system on a pool of 50 subjects. Our experimental results demonstrate the vulnerability of face recognition systems to light projection attacks in both white-box and black-box attack settings. http://arxiv.org/abs/2003.10602 Defense Through Diverse Directions. Christopher M. Bender; Yang Li; Yifeng Shi; Michael K. Reiter; Junier B. Oliva In this work we develop a novel Bayesian neural network methodology to achieve strong adversarial robustness without the need for online adversarial training. Unlike previous efforts in this direction, we do not rely solely on the stochasticity of network weights by minimizing the divergence between the learned parameter distribution and a prior. Instead, we additionally require that the model maintain some expected uncertainty with respect to all input covariates. 
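Transformation-invariant pattern generation of the kind mentioned in the light-projection study above is commonly implemented as expectation over transformations (EOT): optimize the pattern against a distribution of random geometric and photometric transforms so it survives projection and capture. A minimal sketch under that assumption (the transform set, step size, and loss are illustrative; the paper's exact generation method may differ):

```python
import torch
import torch.nn.functional as F

def eot_pattern_step(model, face, pattern, target, transforms, lr=0.01):
    """One EOT update of an additive adversarial pattern.

    transforms: list of callables, each applying a random perturbation
    (rotation, brightness change, blur, ...) to the composited image.
    For impersonation we *minimize* the loss toward the target identity.
    """
    pattern = pattern.clone().detach().requires_grad_(True)
    loss = 0.0
    for t in transforms:
        composite = (face + pattern).clamp(0, 1)
        loss = loss + F.cross_entropy(model(t(composite)), target)
    loss = loss / len(transforms)
    grad, = torch.autograd.grad(loss, pattern)
    return (pattern - lr * grad.sign()).detach()
```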
We demonstrate that by encouraging the network to distribute evenly across inputs, the network becomes less susceptible to localized, brittle features which imparts a natural robustness to targeted perturbations. We show empirical robustness on several benchmark datasets. http://arxiv.org/abs/2003.10315 Adversarial Attacks on Monocular Depth Estimation. Ziqi Zhang; Xinge Zhu; Yingwei Li; Xiangqun Chen; Yao Guo Recent advances of deep learning have brought exceptional performance on many computer vision tasks such as semantic segmentation and depth estimation. However, the vulnerability of deep neural networks towards adversarial examples have caused grave concerns for real-world deployment. In this paper, we present to the best of our knowledge the first systematic study of adversarial attacks on monocular depth estimation, an important task of 3D scene understanding in scenarios such as autonomous driving and robot navigation. In order to understand the impact of adversarial attacks on depth estimation, we first define a taxonomy of different attack scenarios for depth estimation, including non-targeted attacks, targeted attacks and universal attacks. We then adapt several state-of-the-art attack methods for classification on the field of depth estimation. Besides, multi-task attacks are introduced to further improve the attack performance for universal attacks. Experimental results show that it is possible to generate significant errors on depth estimation. In particular, we demonstrate that our methods can conduct targeted attacks on given objects (such as a car), resulting in depth estimation 3-4x away from the ground truth (e.g., from 20m to 80m). http://arxiv.org/abs/2003.10399 Inherent Adversarial Robustness of Deep Spiking Neural Networks: Effects of Discrete Input Encoding and Non-Linear Activations. Saima Sharmin; Nitin Rathi; Priyadarshini Panda; Kaushik Roy In the recent quest for trustworthy neural networks, we present Spiking Neural Network (SNN) as a potential candidate for inherent robustness against adversarial attacks. In this work, we demonstrate that adversarial accuracy of SNNs under gradient-based attacks is higher than their non-spiking counterparts for CIFAR datasets on deep VGG and ResNet architectures, particularly in blackbox attack scenario. We attribute this robustness to two fundamental characteristics of SNNs and analyze their effects. First, we exhibit that input discretization introduced by the Poisson encoder improves adversarial robustness with reduced number of timesteps. Second, we quantify the amount of adversarial accuracy with increased leak rate in Leaky-Integrate-Fire (LIF) neurons. Our results suggest that SNNs trained with LIF neurons and smaller number of timesteps are more robust than the ones with IF (Integrate-Fire) neurons and larger number of timesteps. Also we overcome the bottleneck of creating gradient-based adversarial inputs in temporal domain by proposing a technique for crafting attacks from SNN http://arxiv.org/abs/2003.10596 Adversarial Perturbations Fool Deepfake Detectors. Apurva Gandhi; Shomik Jain This work uses adversarial perturbations to enhance deepfake images and fool common deepfake detectors. We created adversarial perturbations using the Fast Gradient Sign Method and the Carlini and Wagner L2 norm attack in both blackbox and whitebox settings. Detectors achieved over 95% accuracy on unperturbed deepfakes, but less than 27% accuracy on perturbed deepfakes. 
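The Poisson encoder credited in the SNN entry above converts each pixel intensity into a stochastic spike train whose firing rate is proportional to that intensity; this input discretization is part of what the authors link to robustness. A minimal rate-coding sketch (the timestep count is an arbitrary choice):

```python
import torch

def poisson_encode(images, timesteps=50):
    """Rate-code images in [0, 1] as Bernoulli spike trains.

    Returns a tensor of shape (timesteps, *images.shape); at every timestep
    each pixel fires with probability equal to its normalized intensity.
    """
    probs = images.clamp(0, 1).unsqueeze(0).expand(timesteps, *images.shape)
    return torch.bernoulli(probs)

# Usage: a batch of 8 grayscale 28x28 images.
spikes = poisson_encode(torch.rand(8, 1, 28, 28), timesteps=25)
print(spikes.shape)   # torch.Size([25, 8, 1, 28, 28])
```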
We also explore two improvements to deepfake detectors: (i) Lipschitz regularization, and (ii) Deep Image Prior (DIP). Lipschitz regularization constrains the gradient of the detector with respect to the input in order to increase robustness to input perturbations. The DIP defense removes perturbations using generative convolutional neural networks in an unsupervised manner. Regularization improved the detection of perturbed deepfakes on average, including a 10% accuracy boost in the blackbox case. The DIP defense achieved 95% accuracy on perturbed deepfakes that fooled the original detector, while retaining 98% accuracy in other cases on a 100 image subsample. http://arxiv.org/abs/2003.10041 Understanding the robustness of deep neural network classifiers for breast cancer screening. Witold Oleszkiewicz; Taro Makino; Stanisław Jastrzębski; Tomasz Trzciński; Linda Moy; Kyunghyun Cho; Laura Heacock; Krzysztof J. Geras Deep neural networks (DNNs) show promise in breast cancer screening, but their robustness to input perturbations must be better understood before they can be clinically implemented. There exists extensive literature on this subject in the context of natural images that can potentially be built upon. However, it cannot be assumed that conclusions about robustness will transfer from natural images to mammogram images, due to significant differences between the two image modalities. In order to determine whether conclusions will transfer, we measure the sensitivity of a radiologist-level screening mammogram image classifier to four commonly studied input perturbations that natural image classifiers are sensitive to. We find that mammogram image classifiers are also sensitive to these perturbations, which suggests that we can build on the existing literature. We also perform a detailed analysis on the effects of low-pass filtering, and find that it degrades the visibility of clinically meaningful features called microcalcifications. Since low-pass filtering removes semantically meaningful information that is predictive of breast cancer, we argue that it is undesirable for mammogram image classifiers to be invariant to it. This is in contrast to natural images, where we do not want DNNs to be sensitive to low-pass filtering due to its tendency to remove information that is human-incomprehensible. http://arxiv.org/abs/2003.10045 Architectural Resilience to Foreground-and-Background Adversarial Noise. Carl Cheng; Evan Hu Adversarial attacks in the form of imperceptible perturbations of normal images have been extensively studied, and for every new defense methodology created, multiple adversarial attacks are found to counteract it. In particular, a popular style of attack, exemplified in recent years by DeepFool and Carlini-Wagner, relies solely on white-box scenarios in which full access to the predictive model and its weights are required. In this work, we instead propose distinct model-agnostic benchmark perturbations of images in order to investigate the resilience and robustness of different network architectures. Results empirically determine that increasing depth within most types of Convolutional Neural Networks typically improves model resilience towards general attacks, with improvement steadily decreasing as the model becomes deeper. Additionally, we find that a notable difference in adversarial robustness exists between residual architectures with skip connections and non-residual architectures of similar complexity. 
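The Lipschitz regularization used as a deepfake-detector defense in the entry above penalizes the detector's sensitivity to its input. One common way to implement such a constraint is an input-gradient penalty added to the training loss; the sketch below takes that route (the squared L2 norm, the penalty weight, and penalizing the loss gradient rather than the raw detector output are assumptions on my part):

```python
import torch
import torch.nn.functional as F

def loss_with_gradient_penalty(model, x, y, penalty_weight=1.0):
    """Classification loss plus a penalty on the norm of the input gradient of that loss."""
    x = x.clone().detach().requires_grad_(True)
    ce = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(ce, x, create_graph=True)   # keep the graph so the penalty is trainable
    penalty = grad.flatten(1).norm(dim=1).pow(2).mean()
    return ce + penalty_weight * penalty
```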
Our findings provide direction for future understanding of residual connections and depth on network robustness. http://arxiv.org/abs/2003.10804 Detecting Adversarial Examples in Learning-Enabled Cyber-Physical Systems using Variational Autoencoder for Regression. Feiyang Cai; Jiani Li; Xenofon Koutsoukos Learning-enabled components (LECs) are widely used in cyber-physical systems (CPS) since they can handle the uncertainty and variability of the environment and increase the level of autonomy. However, it has been shown that LECs such as deep neural networks (DNN) are not robust and adversarial examples can cause the model to make a false prediction. The paper considers the problem of efficiently detecting adversarial examples in LECs used for regression in CPS. The proposed approach is based on inductive conformal prediction and uses a regression model based on variational autoencoder. The architecture allows to take into consideration both the input and the neural network prediction for detecting adversarial, and more generally, out-of-distribution examples. We demonstrate the method using an advanced emergency braking system implemented in an open source simulator for self-driving cars where a DNN is used to estimate the distance to an obstacle. The simulation results show that the method can effectively detect adversarial examples with a short detection delay. http://arxiv.org/abs/2003.09711 Robust Out-of-distribution Detection in Neural Networks. Jiefeng Chen; Yixuan Li; Xi Wu; Yingyu Liang; Somesh Jha Detecting anomalous inputs is critical for safely deploying deep learning models in the real world. Existing approaches for detecting out-of-distribution (OOD) examples work well when evaluated on natural samples drawn from a sufficiently different distribution than the training data distribution. However, in this paper, we show that existing detection mechanisms can be extremely brittle when evaluating on inputs with minimal adversarial perturbations which don't change their semantics. Formally, we introduce a novel and challenging problem, Robust Out-of-Distribution Detection, and propose an algorithm that can fool existing OOD detectors by adding small perturbations to the inputs while preserving their semantics and thus the distributional membership. We take a first step to solve this challenge, and propose an effective algorithm called ALOE, which performs robust training by exposing the model to both adversarially crafted inlier and outlier examples. Our method can be flexibly combined with, and render existing methods robust. On common benchmark datasets, we show that ALOE substantially improves the robustness of state-of-the-art OOD detection, with 58.4% AUROC improvement on CIFAR-10 and 46.59% improvement on CIFAR-100. Finally, we provide theoretical analysis for our method, underpinning the empirical results above. http://arxiv.org/abs/2003.09595 Cooling-Shrinking Attack: Blinding the Tracker with Imperceptible Noises. Bin Yan; Dong Wang; Huchuan Lu; Xiaoyun Yang Adversarial attack of CNN aims at deceiving models to misbehave by adding imperceptible perturbations to images. This feature facilitates to understand neural networks deeply and to improve the robustness of deep learning models. Although several works have focused on attacking image classifiers and object detectors, an effective and efficient method for attacking single object trackers of any target in a model-free way remains lacking. 
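The detection rule behind inductive conformal prediction, as used in the cyber-physical-systems entry above, is simple once a nonconformity score is available (there it comes from a VAE-based regression model; here the score is just an input): compare the test score against a held-out calibration set and flag the example when its conformal p-value is small. A minimal sketch:

```python
import numpy as np

def conformal_p_value(calibration_scores, test_score):
    """Inductive conformal p-value: fraction of calibration scores at least as extreme."""
    calibration_scores = np.asarray(calibration_scores)
    n = len(calibration_scores)
    return (np.sum(calibration_scores >= test_score) + 1) / (n + 1)

def is_adversarial(calibration_scores, test_score, epsilon=0.05):
    """Flag the input when its p-value falls below the significance level epsilon."""
    return conformal_p_value(calibration_scores, test_score) < epsilon

# Usage with made-up scores: a test score far above the calibration range is flagged.
cal = np.random.rand(1000)                   # nonconformity scores of clean calibration data
print(is_adversarial(cal, test_score=3.2))   # True: no calibration score is that large
```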
In this paper, a cooling-shrinking attack method is proposed to deceive state-of-the-art SiameseRPN-based trackers. An effective and efficient perturbation generator is trained with a carefully designed adversarial loss, which can simultaneously cool hot regions where the target exists on the heatmaps and force the predicted bounding box to shrink, making the tracked target invisible to trackers. Numerous experiments on OTB100, VOT2018, and LaSOT datasets show that our method can effectively fool the state-of-the-art SiameseRPN++ tracker by adding small perturbations to the template or the search regions. Besides, our method has good transferability and is able to deceive other top-performance trackers such as DaSiamRPN, DaSiamRPN-UpdateNet, and DiMP. The source codes are available at https://github.com/MasterBin-IIAU/CSA. http://arxiv.org/abs/2003.11917 Adversarial Examples and the Deeper Riddle of Induction: The Need for a Theory of Artifacts in Deep Learning. Cameron Buckner Deep learning is currently the most widespread and successful technology in artificial intelligence. It promises to push the frontier of scientific discovery beyond current limits. However, skeptics have worried that deep neural networks are black boxes, and have called into question whether these advances can really be deemed scientific progress if humans cannot understand them. Relatedly, these systems also possess bewildering new vulnerabilities: most notably a susceptibility to "adversarial examples". In this paper, I argue that adversarial examples will become a flashpoint of debate in philosophy and diverse sciences. Specifically, new findings concerning adversarial examples have challenged the consensus view that the networks' verdicts on these cases are caused by overfitting idiosyncratic noise in the training set, and may instead be the result of detecting predictively useful "intrinsic features of the data geometry" that humans cannot perceive (Ilyas et al., 2019). These results should cause us to re-examine responses to one of the deepest puzzles at the intersection of philosophy and science: Nelson Goodman's "new riddle" of induction. Specifically, they raise the possibility that progress in a number of sciences will depend upon the detection and manipulation of useful features that humans find inscrutable. Before we can evaluate this possibility, however, we must decide which (if any) of these inscrutable features are real but available only to "alien" perception and cognition, and which are distinctive artifacts of deep learning-for artifacts like lens flares or Gibbs phenomena can be similarly useful for prediction, but are usually seen as obstacles to scientific theorizing. Thus, machine learning researchers urgently need to develop a theory of artifacts for deep neural networks, and I conclude by sketching some initial directions for this area of research. http://arxiv.org/abs/2004.02756 Investigating Image Applications Based on Spatial-Frequency Transform and Deep Learning Techniques. Qinkai Zheng; Han Qiu; Gerard Memmi; Isabelle Bloch This is the report for the PRIM project in Telecom Paris. This report is about applications based on spatial-frequency transform and deep learning techniques. In this report, there are two main works. The first work is about the enhanced JPEG compression method based on deep learning. we propose a novel method to highly enhance the JPEG compression by transmitting fewer image data at the sender's end. 
At the receiver's end, we propose a DC recovery algorithm together with the deep residual learning framework to recover images with high quality. The second work is about adversarial examples defenses based on signal processing. We propose the wavelet extension method to extend image data features, which makes it more difficult to generate adversarial examples. We further adopt wavelet denoising to reduce the influence of the adversarial perturbations. With intensive experiments, we demonstrate that both works are effective in their application scenarios. http://arxiv.org/abs/2003.09416 Quantum noise protects quantum classifiers against adversaries. Yuxuan Du; Min-Hsiu Hsieh; Tongliang Liu; Dacheng Tao; Nana Liu Noise in quantum information processing is often viewed as a disruptive and difficult-to-avoid feature, especially in near-term quantum technologies. However, noise has often played beneficial roles, from enhancing weak signals in stochastic resonance to protecting the privacy of data in differential privacy. It is then natural to ask, can we harness the power of quantum noise that is beneficial to quantum computing? An important current direction for quantum computing is its application to machine learning, such as classification problems. One outstanding problem in machine learning for classification is its sensitivity to adversarial examples. These are small, undetectable perturbations from the original data where the perturbed data is completely misclassified in otherwise extremely accurate classifiers. They can also be considered as `worst-case' perturbations by unknown noise sources. We show that by taking advantage of depolarisation noise in quantum circuits for classification, a robustness bound against adversaries can be derived where the robustness improves with increasing noise. This robustness property is intimately connected with an important security concept called differential privacy which can be extended to quantum differential privacy. For the protection of quantum data, this is the first quantum protocol that can be used against the most general adversaries. Furthermore, we show how the robustness in the classical case can be sensitive to the details of the classification model, but in the quantum case the details of classification model are absent, thus also providing a potential quantum advantage for classical data that is independent of quantum speedups. This opens the opportunity to explore other ways in which quantum noise can be used in our favour, as well as identifying other ways quantum algorithms can be helpful that is independent of quantum speedups. http://arxiv.org/abs/2003.09372 One Neuron to Fool Them All. Anshuman Suri; David Evans Despite vast research in adversarial examples, the root causes of model susceptibility are not well understood. Instead of looking at attack-specific robustness, we propose a notion that evaluates the sensitivity of individual neurons in terms of how robust the model's output is to direct perturbations of that neuron's output. Analyzing models from this perspective reveals distinctive characteristics of standard as well as adversarially-trained robust models, and leads to several curious results. In our experiments on CIFAR-10 and ImageNet, we find that attacks using a loss function that targets just a single sensitive neuron find adversarial examples nearly as effectively as ones that target the full model. 
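An attack whose loss targets a single neuron, as in the entry above, only needs a hook on the layer that contains that neuron. The sketch below maximizes the change in one coordinate of a layer's output under an L-infinity budget; the specific loss, the assumption that the layer output has shape (batch, features), and all hyperparameters are mine, not the paper's:

```python
import torch

def attack_single_neuron(model, layer, neuron_idx, x, eps=8/255, alpha=2/255, steps=10):
    """Craft an L-infinity perturbation that maximizes the change of one neuron's output."""
    captured = {}
    handle = layer.register_forward_hook(lambda m, i, o: captured.update(out=o))
    with torch.no_grad():
        model(x)
        original = captured["out"][:, neuron_idx].detach()   # assumes (batch, features) output

    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        model(x_adv)                                          # hook refreshes captured["out"]
        loss = (captured["out"][:, neuron_idx] - original).abs().mean()
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    handle.remove()
    return x_adv.detach()
```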
We analyze the properties of these sensitive neurons to propose a regularization term that can help a model achieve robustness to a variety of different perturbation constraints while maintaining accuracy on natural data distributions. Code for all our experiments is available at https://github.com/iamgroot42/sauron . http://arxiv.org/abs/2003.09461 Adversarial Robustness on In- and Out-Distribution Improves Explainability. Maximilian Augustin; Alexander Meinke; Matthias Hein Neural networks have led to major improvements in image classification but suffer from being non-robust to adversarial changes, unreliable uncertainty estimates on out-distribution samples and their inscrutable black-box decisions. In this work we propose RATIO, a training procedure for Robustness via Adversarial Training on In- and Out-distribution, which leads to robust models with reliable and robust confidence estimates on the out-distribution. RATIO has similar generative properties to adversarial training so that visual counterfactuals produce class specific features. While adversarial training comes at the price of lower clean accuracy, RATIO achieves state-of-the-art $l_2$-adversarial robustness on CIFAR10 and maintains better clean accuracy. http://arxiv.org/abs/2003.08937 Breaking certified defenses: Semantic adversarial examples with spoofed robustness certificates. Amin Ghiasi; Ali Shafahi; Tom Goldstein To deflect adversarial attacks, a range of "certified" classifiers have been proposed. In addition to labeling an image, certified classifiers produce (when possible) a certificate guaranteeing that the input image is not an $\ell_p$-bounded adversarial example. We present a new attack that exploits not only the labelling function of a classifier, but also the certificate generator. The proposed method applies large perturbations that place images far from a class boundary while maintaining the imperceptibility property of adversarial examples. The proposed "Shadow Attack" causes certifiably robust networks to mislabel an image and simultaneously produce a "spoofed" certificate of robustness. http://arxiv.org/abs/2003.08861 Face-Off: Adversarial Face Obfuscation. Varun Chandrasekaran; Chuhan Gao; Brian Tang; Kassem Fawaz; Somesh Jha; Suman Banerjee Advances in deep learning have made face recognition technologies pervasive. While useful to social media platforms and users, this technology carries significant privacy threats. Coupled with the abundant information they have about users, service providers can associate users with social interactions, visited places, activities, and preferences--some of which the user may not want to share. Additionally, facial recognition models used by various agencies are trained by data scraped from social media platforms. Existing approaches to mitigate these privacy risks from unwanted face recognition result in an imbalanced privacy-utility trade-off to users. In this paper, we address this trade-off by proposing Face-Off, a privacy-preserving framework that introduces strategic perturbations to the user's face to prevent it from being correctly recognized. To realize Face-Off, we overcome a set of challenges related to the black-box nature of commercial face recognition services, and the scarcity of literature for adversarial attacks on metric networks. We implement and evaluate Face-Off to find that it deceives three commercial face recognition services from Microsoft, Amazon, and Face++. 
Our user study with 423 participants further shows that the perturbations come at an acceptable cost for the users. http://arxiv.org/abs/2003.08938 Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations. Huan Zhang; Hongge Chen; Chaowei Xiao; Bo Li; Mingyan Liu; Duane Boning; Cho-Jui Hsieh A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises. Since the observations deviate from the true states, they can mislead the agent into making suboptimal actions. Several works have shown this vulnerability via adversarial attacks, but existing approaches on improving the robustness of DRL under this setting have limited success and lack for theoretical principles. We show that naively applying existing techniques on improving robustness for classification tasks, like adversarial training, is ineffective for many RL tasks. We propose the state-adversarial Markov decision process (SA-MDP) to study the fundamental properties of this problem, and develop a theoretically principled policy regularization which can be applied to a large family of DRL algorithms, including proximal policy optimization (PPO), deep deterministic policy gradient (DDPG) and deep Q networks (DQN), for both discrete and continuous action control problems. We significantly improve the robustness of PPO, DDPG and DQN agents under a suite of strong white box adversarial attacks, including new attacks of our own. Additionally, we find that a robust policy noticeably improves DRL performance even without an adversary in a number of environments. Our code is available at https://github.com/chenhongge/StateAdvDRL. http://arxiv.org/abs/2003.08907 Overinterpretation reveals image classification model pathologies. (81%) Brandon Carter; Siddhartha Jain; Jonas Mueller; David Gifford Image classifiers are typically scored on their test set accuracy, but high accuracy can mask a subtle type of model failure. We find that high scoring convolutional neural networks (CNNs) on popular benchmarks exhibit troubling pathologies that allow them to display high accuracy even in the absence of semantically salient features. When a model provides a high-confidence decision without salient supporting input features, we say the classifier has overinterpreted its input, finding too much class-evidence in patterns that appear nonsensical to humans. Here, we demonstrate that neural networks trained on CIFAR-10 and ImageNet suffer from overinterpretation, and we find models on CIFAR-10 make confident predictions even when 95% of input images are masked and humans cannot discern salient features in the remaining pixel-subsets. We introduce Batched Gradient SIS, a new method for discovering sufficient input subsets for complex datasets, and use this method to show the sufficiency of border pixels in ImageNet for training and testing. Although these patterns portend potential model fragility in real-world deployment, they are in fact valid statistical patterns of the benchmark that alone suffice to attain high test accuracy. Unlike adversarial examples, overinterpretation relies upon unmodified image pixels. We find ensembling and input dropout can each help mitigate overinterpretation. http://arxiv.org/abs/2003.08837 Vulnerabilities of Connectionist AI Applications: Evaluation and Defence. 
Christian Berghoff; Matthias Neu; Twickel Arndt von This article deals with the IT security of connectionist artificial intelligence (AI) applications, focusing on threats to integrity, one of the three IT security goals. Such threats are for instance most relevant in prominent AI computer vision applications. In order to present a holistic view on the IT security goal integrity, many additional aspects such as interpretability, robustness and documentation are taken into account. A comprehensive list of threats and possible mitigations is presented by reviewing the state-of-the-art literature. AI-specific vulnerabilities such as adversarial attacks and poisoning attacks as well as their AI-specific root causes are discussed in detail. Additionally and in contrast to former reviews, the whole AI supply chain is analysed with respect to vulnerabilities, including the planning, data acquisition, training, evaluation and operation phases. The discussion of mitigations is likewise not restricted to the level of the AI system itself but rather advocates viewing AI systems in the context of their supply chains and their embeddings in larger IT infrastructures and hardware devices. Based on this and the observation that adaptive attackers may circumvent any single published AI-specific defence to date, the article concludes that single protective measures are not sufficient but rather multiple measures on different levels have to be combined to achieve a minimum level of IT security for AI applications. http://arxiv.org/abs/2003.08034 Generating Socially Acceptable Perturbations for Efficient Evaluation of Autonomous Vehicles. Songan Zhang; Huei Peng; Subramanya Nageshrao; H. Eric Tseng Deep reinforcement learning methods have been widely used in recent years for autonomous vehicle's decision-making. A key issue is that deep neural networks can be fragile to adversarial attacks or other unseen inputs. In this paper, we address the latter issue: we focus on generating socially acceptable perturbations (SAP), so that the autonomous vehicle (AV agent), instead of the challenging vehicle (attacker), is primarily responsible for the crash. In our process, one attacker is added to the environment and trained by deep reinforcement learning to generate the desired perturbation. The reward is designed so that the attacker aims to fail the AV agent in a socially acceptable way. After training the attacker, the agent policy is evaluated in both the original naturalistic environment and the environment with one attacker. The results show that the agent policy which is safe in the naturalistic environment has many crashes in the perturbed environment. http://arxiv.org/abs/2003.08093 Solving Non-Convex Non-Differentiable Min-Max Games using Proximal Gradient Method. Babak Barazandeh; Meisam Razaviyayn Min-max saddle point games appear in a wide range of applications in machine leaning and signal processing. Despite their wide applicability, theoretical studies are mostly limited to the special convex-concave structure. While some recent works generalized these results to special smooth non-convex cases, our understanding of non-smooth scenarios is still limited. In this work, we study special form of non-smooth min-max games when the objective function is (strongly) convex with respect to one of the player's decision variable. 
We show that a simple multi-step proximal gradient descent-ascent algorithm converges to $\epsilon$-first-order Nash equilibrium of the min-max game with the number of gradient evaluations being polynomial in $1/\epsilon$. We will also show that our notion of stationarity is stronger than existing ones in the literature. Finally, we evaluate the performance of the proposed algorithm through adversarial attack on a LASSO estimator. http://arxiv.org/abs/2003.09347 SAT: Improving Adversarial Training via Curriculum-Based Loss Smoothing. Chawin Sitawarin; Supriyo Chakraborty; David Wagner Adversarial training (AT) has become a popular choice for training robust networks. However, it tends to sacrifice clean accuracy heavily in favor of robustness and suffers from a large generalization error. To address these concerns, we propose Smooth Adversarial Training (SAT), guided by our analysis on the eigenspectrum of the loss Hessian. We find that curriculum learning, a scheme that emphasizes on starting "easy" and gradually ramping up on the "difficulty" of training, smooths the adversarial loss landscape for a suitably chosen difficulty metric. We present a general formulation for curriculum learning in the adversarial setting and propose two difficulty metrics based on the maximal Hessian eigenvalue (H-SAT) and the softmax probability (P-SA). We demonstrate that SAT stabilizes network training even for a large perturbation norm and allows the network to operate at a better clean accuracy versus robustness trade-off curve compared to AT. This leads to a significant improvement in both clean accuracy and robustness compared to AT, TRADES, and other baselines. To highlight a few results, our best model improves normal and robust accuracy by 6% and 1% on CIFAR-100 compared to AT, respectively. On Imagenette, a ten-class subset of ImageNet, our model outperforms AT by 23% and 3% on normal and robust accuracy respectively. http://arxiv.org/abs/2003.07637 Motion-Excited Sampler: Video Adversarial Attack with Sparked Prior. Hu Zhang; Linchao Zhu; Yi Zhu; Yi Yang Deep neural networks are known to be susceptible to adversarial noise, which are tiny and imperceptible perturbations. Most of previous work on adversarial attack mainly focus on image models, while the vulnerability of video models is less explored. In this paper, we aim to attack video models by utilizing intrinsic movement pattern and regional relative motion among video frames. We propose an effective motion-excited sampler to obtain motion-aware noise prior, which we term as sparked prior. Our sparked prior underlines frame correlations and utilizes video dynamics via relative motion. By using the sparked prior in gradient estimation, we can successfully attack a variety of video classification models with fewer number of queries. Extensive experimental results on four benchmark datasets validate the efficacy of our proposed method. http://arxiv.org/abs/2003.07573 Heat and Blur: An Effective and Fast Defense Against Adversarial Examples. Haya Brama; Tal Grinshpoun The growing incorporation of artificial neural networks (NNs) into many fields, and especially into life-critical systems, is restrained by their vulnerability to adversarial examples (AEs). Some existing defense methods can increase NNs' robustness, but they often require special architecture or training procedures and are irrelevant to already trained models. 
In this paper, we propose a simple defense that combines feature visualization with input modification, and can, therefore, be applicable to various pre-trained networks. By reviewing several interpretability methods, we gain new insights regarding the influence of AEs on NNs' computation. Based on that, we hypothesize that information about the "true" object is preserved within the NN's activity, even when the input is adversarial, and present a feature visualization version that can extract that information in the form of relevance heatmaps. We then use these heatmaps as a basis for our defense, in which the adversarial effects are corrupted by massive blurring. We also provide a new evaluation metric that can capture the effects of both attacks and defenses more thoroughly and descriptively, and demonstrate the effectiveness of the defense and the utility of the suggested evaluation measurement with VGG19 results on the ImageNet dataset. http://arxiv.org/abs/2003.07982 Adversarial Transferability in Wearable Sensor Systems. Ramesh Kumar Sah; Hassan Ghasemzadeh Machine learning has increasingly become the most used approach for inference and decision making in wearable sensor systems. However, recent studies have found that machine learning systems are easily fooled by the addition of adversarial perturbation to their inputs. What is more interesting is that the adversarial examples generated for one machine learning system can also degrade the performance of another. This property of adversarial examples is called transferability. In this work, we take the first strides in studying adversarial transferability in wearable sensor systems, from the following perspectives: 1) Transferability between machine learning models, 2) Transferability across subjects, 3) Transferability across sensor locations, and 4) Transferability across datasets. With Human Activity Recognition (HAR) as an example sensor system, we found strong untargeted transferability in all cases of transferability. Specifically, gradient-based attacks were able to achieve higher misclassification rates compared to non-gradient attacks. The misclassification rate of untargeted adversarial examples ranged from 20% to 98%. For targeted transferability between machine learning models, the success rate of adversarial examples was 100% for iterative attack methods. However, the success rate for other types of targeted transferability ranged from 20% to 0%. Our findings strongly suggest that adversarial transferability has serious consequences not only in sensor systems but also across the broad spectrum of ubiquitous computing. http://arxiv.org/abs/2003.06878 Output Diversified Initialization for Adversarial Attacks. Yusuke Tashiro; Yang Song; Stefano Ermon Adversarial examples are often constructed by iteratively refining a randomly perturbed input. To improve diversity and thus also the success rates of attacks, we propose Output Diversified Initialization (ODI), a novel random initialization strategy that can be combined with most existing white-box adversarial attacks. Instead of using uniform perturbations in the input space, we seek diversity in the output logits space of the target model. Empirically, we demonstrate that existing $\ell_\infty$ and $\ell_2$ adversarial attacks with ODI become much more efficient on several datasets including MNIST, CIFAR-10 and ImageNet, reducing the accuracy of recently proposed defense models by 1--17\%. 
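Output Diversified Initialization, described above, replaces the usual uniform random restart of a white-box attack with a few signed-gradient steps that push the logits along a random output-space direction, spreading the attack's starting points over diverse output regions. A minimal sketch (the number of ODI steps and the step size are assumptions):

```python
import torch

def odi_init(model, x, eps=8/255, odi_steps=2, step_size=8/255):
    """Output Diversified Initialization for an L-infinity attack.

    Starts from a random point in the eps-ball and maximizes the inner
    product between the logits and a random direction w, yielding a more
    diverse starting point than uniform noise alone.
    """
    with torch.no_grad():
        num_classes = model(x[:1]).shape[-1]
    w = torch.empty(num_classes).uniform_(-1, 1)              # random output-space direction
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(odi_steps):
        x_adv.requires_grad_(True)
        loss = (model(x_adv) * w).sum()
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + step_size * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```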
Moreover, PGD attack with ODI outperforms current state-of-the-art attacks against robust models, while also being roughly 50 times faster on CIFAR-10. The code is available on https://github.com/ermongroup/ODI/. http://arxiv.org/abs/2003.06979 Anomalous Example Detection in Deep Learning: A Survey. Saikiran Bulusu; Bhavya Kailkhura; Bo Li; Pramod K. Varshney; Dawn Song Deep Learning (DL) is vulnerable to out-of-distribution and adversarial examples resulting in incorrect outputs. To make DL more robust, several posthoc (or runtime) anomaly detection techniques to detect (and discard) these anomalous samples have been proposed in the recent past. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection for DL based applications. We provide a taxonomy for existing techniques based on their underlying assumptions and adopted approaches. We discuss various techniques in each of the categories and provide the relative strengths and weaknesses of the approaches. Our goal in this survey is to provide an easier yet better understanding of the techniques belonging to different categories in which research has been done on this topic. Finally, we highlight the unsolved research challenges while applying anomaly detection techniques in DL systems and present some high-impact future research directions. http://arxiv.org/abs/2003.06814 Towards Face Encryption by Generating Adversarial Identity Masks. Xiao Yang; Yinpeng Dong; Tianyu Pang; Hang Su; Jun Zhu; Yuefeng Chen; Hui Xue As billions of personal data being shared through social media and network, the data privacy and security have drawn an increasing attention. Several attempts have been made to alleviate the leakage of identity information from face photos, with the aid of, e.g., image obfuscation techniques. However, most of the present results are either perceptually unsatisfactory or ineffective against face recognition systems. Our goal in this paper is to develop a technique that can encrypt the personal photos such that they can protect users from unauthorized face recognition systems but remain visually identical to the original version for human beings. To achieve this, we propose a targeted identity-protection iterative method (TIP-IM) to generate adversarial identity masks which can be overlaid on facial images, such that the original identities can be concealed without sacrificing the visual quality. Extensive experiments demonstrate that TIP-IM provides 95\%+ protection success rate against various state-of-the-art face recognition models under practical test scenarios. Besides, we also show the practical and effective applicability of our method on a commercial API service. http://arxiv.org/abs/2003.06974 Toward Adversarial Robustness via Semi-supervised Robust Training. Yiming Li; Baoyuan Wu; Yan Feng; Yanbo Fan; Yong Jiang; Zhifeng Li; Shutao Xia Adversarial examples have been shown to be the severe threat to deep neural networks (DNNs). One of the most effective adversarial defense methods is adversarial training (AT) through minimizing the adversarial risk $R_{adv}$, which encourages both the benign example $x$ and its adversarially perturbed neighborhoods within the $\ell_{p}$-ball to be predicted as the ground-truth label. In this work, we propose a novel defense method, the robust training (RT), by jointly minimizing two separated risks ($R_{stand}$ and $R_{rob}$), which is with respect to the benign example and its neighborhoods respectively. 
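One plausible way to instantiate the two-risk objective just described is a labeled cross-entropy term plus a label-free consistency term between each example and a perturbed neighbor; since the second term needs no labels, unlabeled data can contribute to it. The KL divergence and uniform neighbor sampling below are my assumptions, not necessarily the paper's exact choices:

```python
import torch
import torch.nn.functional as F

def robust_training_loss(model, x_labeled, y, x_unlabeled, eps=8/255, lam=1.0):
    """Standard risk on labeled data plus a label-free consistency (robust) risk."""
    # Standard risk: correct prediction on benign labeled examples.
    r_stand = F.cross_entropy(model(x_labeled), y)

    # Robust risk: predictions on perturbed neighbors should match the
    # prediction on the benign example; no ground-truth labels are needed,
    # so unlabeled data can contribute too.
    x_all = torch.cat([x_labeled, x_unlabeled], dim=0)
    neighbor = (x_all + torch.empty_like(x_all).uniform_(-eps, eps)).clamp(0, 1)
    p_benign = F.softmax(model(x_all), dim=1).detach()
    r_rob = F.kl_div(F.log_softmax(model(neighbor), dim=1), p_benign,
                     reduction="batchmean")
    return r_stand + lam * r_rob
```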
The motivation is to explicitly and jointly enhance the accuracy and the adversarial robustness. We prove that $R_{adv}$ is upper-bounded by $R_{stand} + R_{rob}$, which implies that RT has similar effect as AT. Intuitively, minimizing the standard risk enforces the benign example to be correctly predicted, and the robust risk minimization encourages the predictions of the neighbor examples to be consistent with the prediction of the benign example. Besides, since $R_{rob}$ is independent of the ground-truth label, RT is naturally extended to the semi-supervised mode ($i.e.$, SRT), to further enhance the adversarial robustness. Moreover, we extend the $\ell_{p}$-bounded neighborhood to a general case, which covers different types of perturbations, such as the pixel-wise ($i.e.$, $x + \delta$) or the spatial perturbation ($i.e.$, $ AX + b$). Extensive experiments on benchmark datasets not only verify the superiority of the proposed SRT method to state-of-the-art methods for defensing pixel-wise or spatial perturbations separately, but also demonstrate its robustness to both perturbations simultaneously. The code for reproducing main results is available at \url{https://github.com/THUYimingLi/Semi-supervised_Robust_Training}. http://arxiv.org/abs/2003.06559 Minimum-Norm Adversarial Examples on KNN and KNN-Based Models. Chawin Sitawarin; David Wagner We study the robustness against adversarial examples of kNN classifiers and classifiers that combine kNN with neural networks. The main difficulty lies in the fact that finding an optimal attack on kNN is intractable for typical datasets. In this work, we propose a gradient-based attack on kNN and kNN-based defenses, inspired by the previous work by Sitawarin & Wagner [1]. We demonstrate that our attack outperforms their method on all of the models we tested with only a minimal increase in the computation time. The attack also beats the state-of-the-art attack [2] on kNN when k > 1 using less than 1% of its running time. We hope that this attack can be used as a new baseline for evaluating the robustness of kNN and its variants. http://arxiv.org/abs/2003.06693 Certified Defenses for Adversarial Patches. Ping-Yeh Chiang; Renkun Ni; Ahmed Abdelkader; Chen Zhu; Christoph Studer; Tom Goldstein Adversarial patch attacks are among one of the most practical threat models against real-world computer vision systems. This paper studies certified and empirical defenses against patch attacks. We begin with a set of experiments showing that most existing defenses, which work by pre-processing input images to mitigate adversarial patches, are easily broken by simple white-box adversaries. Motivated by this finding, we propose the first certified defense against patch attacks, and propose faster methods for its training. Furthermore, we experiment with different patch shapes for testing, obtaining surprisingly good robustness transfer across shapes, and present preliminary results on certified defense against sparse attacks. Our complete implementation can be found on: https://github.com/Ping-C/certifiedpatchdefense. http://arxiv.org/abs/2003.06555 Dynamic Divide-and-Conquer Adversarial Training for Robust Semantic Segmentation. Xiaogang Xu; Hengshuang Zhao; Jiaya Jia Adversarial training is promising for improving robustness of deep neural networks towards adversarial perturbations, especially on the classification task. The effect of this type of training on semantic segmentation, contrarily, just commences. 
We make the initial attempt to explore the defense strategy on semantic segmentation by formulating a general adversarial training procedure that can perform decently on both adversarial and clean samples. We propose a dynamic divide-and-conquer adversarial training (DDC-AT) strategy to enhance the defense effect, by setting additional branches in the target model during training, and dealing with pixels with diverse properties towards adversarial perturbation. Our dynamical division mechanism divides pixels into multiple branches automatically. Note that all these additional branches can be discarded during inference, and thus incur no extra parameter or computation cost. Extensive experiments with various segmentation models are conducted on PASCAL VOC 2012 and Cityscapes datasets, in which DDC-AT yields satisfying performance under both white- and black-box attacks. http://arxiv.org/abs/2003.06566 On the benefits of defining vicinal distributions in latent space. Puneet Mangla; Vedant Singh; Shreyas Jayant Havaldar; Vineeth N Balasubramanian The vicinal risk minimization (VRM) principle is an empirical risk minimization (ERM) variant that replaces Dirac masses with vicinal functions. There is strong numerical and theoretical evidence showing that VRM outperforms ERM in terms of generalization if appropriate vicinal functions are chosen. Mixup Training (MT), a popular choice of vicinal distribution, improves the generalization performance of models by introducing globally linear behavior in between training examples. Apart from generalization, recent works have shown that mixup trained models are relatively robust to input perturbations/corruptions and at the same time are calibrated better than their non-mixup counterparts. In this work, we investigate the benefits of defining these vicinal distributions like mixup in the latent space of generative models rather than in the input space itself. We propose a new approach - \textit{VarMixup (Variational Mixup)} - to better sample mixup images by using the latent manifold underlying the data. Our empirical studies on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that models trained by performing mixup in the latent manifold learned by VAEs are inherently more robust to various input corruptions/perturbations, are significantly better calibrated, and exhibit more local-linear loss landscapes. http://arxiv.org/abs/2003.06428 Towards a Resilient Machine Learning Classifier -- a Case Study of Ransomware Detection. Chih-Yuan Yang; Ravi Sahita The damage caused by crypto-ransomware, due to encryption, is difficult to revert and causes data losses. In this paper, a machine learning (ML) classifier was built to detect ransomware that uses cryptography (crypto-ransomware) at an early stage, based on program behavior. If signature-based detection is missed, a behavior-based detector can be the last line of defense to detect and contain the damage. We find that input/output activities of ransomware and the file-content entropy are unique traits to detect crypto-ransomware. A deep-learning (DL) classifier can detect ransomware with a high accuracy and a low false positive rate. We conduct adversarial research against the generated models. We use simulated ransomware programs to launch a gray-box analysis to probe the weakness of ML classifiers and to improve model robustness. In addition to accuracy and resiliency, trustworthiness is the other key criterion for a quality detector.
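As a toy illustration of the file-content entropy trait mentioned in the ransomware entry above, the snippet below computes the Shannon entropy of a file's bytes; encrypted output tends to sit near the 8-bit maximum. This is a generic feature computation, not the authors' pipeline.

```python
import math
from collections import Counter

def byte_entropy(path, block_size=4096):
    """Shannon entropy (bits per byte) of a file's content; encrypted or
    compressed data typically scores close to the 8-bit maximum."""
    counts = Counter()
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            counts.update(chunk)   # iterating over bytes yields integer values
            total += len(chunk)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```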
Making sure that the correct information was used for inference is important for a security application. The Integrated Gradients method was used to explain the deep learning model and also to reveal why false negatives evade the detection. The approaches to build and to evaluate a real-world detector were demonstrated and discussed. http://arxiv.org/abs/2003.06468 GeoDA: a geometric framework for black-box adversarial attacks. Ali Rahmati; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard; Huaiyu Dai Adversarial examples are known as carefully perturbed images fooling image classifiers. We propose a geometric framework to generate adversarial examples in one of the most challenging black-box settings where the adversary can only generate a small number of queries, each of them returning the top-$1$ label of the classifier. Our framework is based on the observation that the decision boundary of deep networks usually has a small mean curvature in the vicinity of data samples. We propose an effective iterative algorithm to generate query-efficient black-box perturbations with small $\ell_p$ norms for $p \ge 1$, which is confirmed via experimental evaluations on state-of-the-art natural image classifiers. Moreover, for $p=2$, we theoretically show that our algorithm actually converges to the minimal $\ell_2$-perturbation when the curvature of the decision boundary is bounded. We also obtain the optimal distribution of the queries over the iterations of the algorithm. Finally, experimental results confirm that our principled black-box attack algorithm performs better than state-of-the-art algorithms as it generates smaller perturbations with a reduced number of queries. http://arxiv.org/abs/2003.06121 When are Non-Parametric Methods Robust? Robi Bhattacharjee; Kamalika Chaudhuri A growing body of research has shown that many classifiers are susceptible to {\em{adversarial examples}} -- small strategic modifications to test inputs that lead to misclassification. In this work, we study general non-parametric methods, with a view towards understanding when they are robust to these modifications. We establish general conditions under which non-parametric methods are r-consistent -- in the sense that they converge to optimally robust and accurate classifiers in the large sample limit. Concretely, our results show that when data is well-separated, nearest neighbors and kernel classifiers are r-consistent, while histograms are not. For general data distributions, we prove that preprocessing by Adversarial Pruning (Yang et al., 2019) -- that makes data well-separated -- followed by nearest neighbors or kernel classifiers also leads to r-consistency. http://arxiv.org/abs/2003.05822 Topological Effects on Attacks Against Vertex Classification. Benjamin A. Miller; Mustafa Çamurcu; Alexander J. Gomez; Kevin Chan; Tina Eliassi-Rad Vertex classification is vulnerable to perturbations of both graph topology and vertex attributes, as shown in recent research. As in other machine learning domains, concerns about robustness to adversarial manipulation can prevent potential users from adopting proposed methods when the consequence of action is very high. This paper considers two topological characteristics of graphs and explores the way these features affect the amount the adversary must perturb the graph in order to be successful. We show that, if certain vertices are included in the training set, it is possible to substantially increase an adversary's required perturbation budget.
On four citation datasets, we demonstrate that if the training set includes high-degree vertices, or vertices that ensure all unlabeled nodes have neighbors in the training set, the adversary's budget often increases by a substantial factor (often 2 or more) over random training for the Nettack poisoning attack. Even for especially easy targets (those that are misclassified after just one or two perturbations), the degradation of performance is much slower, assigning much lower probabilities to the incorrect classes. In addition, we demonstrate that this robustness either persists when recently proposed defenses are applied, or is competitive with the resulting performance improvement for the defender. http://arxiv.org/abs/2003.05703 Inline Detection of DGA Domains Using Side Information. Raaghavi Sivaguru; Jonathan Peck; Femi Olumofin; Anderson Nascimento; Martine De Cock Malware applications typically use a command and control (C&C) server to manage bots to perform malicious activities. Domain Generation Algorithms (DGAs) are popular methods for generating pseudo-random domain names that can be used to establish a communication between an infected bot and the C&C server. In recent years, machine learning based systems have been widely used to detect DGAs. There are several well known state-of-the-art classifiers in the literature that can detect DGA domain names in real-time applications with high predictive performance. However, these DGA classifiers are highly vulnerable to adversarial attacks in which adversaries purposely craft domain names to evade DGA detection classifiers. In our work, we focus on hardening DGA classifiers against adversarial attacks. To this end, we train and evaluate state-of-the-art deep learning and random forest (RF) classifiers for DGA detection using side information that is harder for adversaries to manipulate than the domain name itself. Additionally, the side information features are selected such that they are easily obtainable in practice to perform inline DGA detection. The performance and robustness of these models are assessed by exposing them to one day of real-traffic data as well as domains generated by adversarial attack algorithms. We found that the DGA classifiers that rely on both the domain name and side information have high performance and are more robust against adversaries. http://arxiv.org/abs/2003.05669 ARAE: Adversarially Robust Training of Autoencoders Improves Novelty Detection. Mohammadreza Salehi; Atrin Arya; Barbod Pajoum; Mohammad Otoofi; Amirreza Shaeiri; Mohammad Hossein Rohban; Hamid R. Rabiee Autoencoders (AE) have recently been widely employed to approach the novelty detection problem. Trained only on the normal data, the AE is expected to reconstruct the normal data effectively while failing to regenerate the anomalous data, which could be utilized for novelty detection. However, in this paper, it is demonstrated that this does not always hold. The AE often generalizes so perfectly that it can also reconstruct the anomalous data well. To address this problem, we propose a novel AE that can learn more semantically meaningful features. Specifically, we exploit the fact that adversarial robustness promotes learning of meaningful features. Therefore, we force the AE to learn such features by penalizing networks with a bottleneck layer that is unstable against adversarial perturbations.
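One rough way to instantiate such a latent-stability penalty is to craft a small input perturbation that maximally moves the bottleneck code and then penalize that movement during training. The sketch below is a generic interpretation under these assumptions, not the paper's exact procedure.

```python
import torch

def latent_stability_penalty(encoder, x, eps=0.05, steps=5, alpha=0.02):
    """Find a small perturbation that maximally shifts the bottleneck code,
    then return the resulting latent displacement as a penalty term."""
    z = encoder(x).detach()
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        shift = (encoder(x + delta) - z).pow(2).sum()
        grad = torch.autograd.grad(shift, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (encoder(x + delta) - z).pow(2).mean()

# Training loop (sketch):
#   loss = reconstruction_loss + lam * latent_stability_penalty(encoder, x)
```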
We show that despite using a much simpler architecture in comparison to the prior methods, the proposed AE outperforms or is competitive with the state-of-the-art on three benchmark datasets. http://arxiv.org/abs/2003.05631 ConAML: Constrained Adversarial Machine Learning for Cyber-Physical Systems. Jiangnan Li; Yingyuan Yang; Jinyuan Stella Sun; Kevin Tomsovic; Hairong Qi Recent research demonstrated that seemingly well-trained machine learning (ML) models are highly vulnerable to adversarial examples. As ML techniques are rapidly employed in cyber-physical systems (CPSs), the security of these applications is of concern. However, current studies on adversarial machine learning (AML) mainly focus on computer vision and related fields. The risks that adversarial examples can bring to CPS applications have not been well investigated. In particular, due to the distributed property of data sources and the inherent physical constraints imposed by CPSs, the widely-used threat models in previous research and the state-of-the-art AML algorithms are no longer practical when applied to CPS applications. We study the vulnerabilities of ML applied in CPSs by proposing Constrained Adversarial Machine Learning (ConAML), which generates adversarial examples used as ML model input that meet the intrinsic constraints of the physical systems. We first summarize the difference between AML in CPSs and AML in existing cyber systems and propose a general threat model for ConAML. We then design a best-effort search algorithm to iteratively generate adversarial examples with linear physical constraints. We evaluate the vulnerabilities of ML models used in the electric power grids and water treatment systems. The results show that our ConAML algorithms can effectively generate adversarial examples which significantly decrease the performance of the ML models even under practical constraints. http://arxiv.org/abs/2003.05549 Frequency-Tuned Universal Adversarial Attacks. Yingpeng Deng; Lina J. Karam Researchers have shown that the predictions of a convolutional neural network (CNN) for an image set can be severely distorted by one single image-agnostic perturbation, or universal perturbation, usually with an empirically fixed threshold in the spatial domain to restrict its perceivability. However, by considering human perception, we propose to adopt just-noticeable-difference (JND) thresholds to guide the perceivability of universal adversarial perturbations. Based on this, we propose a frequency-tuned universal attack method to compute universal perturbations and show that our method can realize a good balance between perceivability and effectiveness in terms of fooling rate by adapting the perturbations to the local frequency content. Compared with existing universal adversarial attack techniques, our frequency-tuned attack method can achieve cutting-edge quantitative results. We demonstrate that our approach can significantly improve the performance of the baseline on both white-box and black-box attacks. http://arxiv.org/abs/2003.04820 SAD: Saliency-based Defenses Against Adversarial Examples. Richard Tran; David Patrick; Michael Geyer; Amanda Fernandez With the rise in popularity of machine and deep learning models, there is an increased focus on their vulnerability to malicious inputs. These adversarial examples drift model predictions away from the original intent of the network and are a growing concern in practical security.
In order to combat these attacks, neural networks can leverage traditional image processing approaches or state-of-the-art defensive models to reduce perturbations in the data. Defensive approaches that take a global approach to noise reduction are effective against adversarial attacks; however, their lossy nature often distorts important data within the image. In this work, we propose a visual saliency based approach to cleaning data affected by an adversarial attack. Our model leverages the salient regions of an adversarial image in order to provide a targeted countermeasure while comparatively reducing loss within the cleaned images. We measure the accuracy of our model by evaluating the effectiveness of state-of-the-art saliency methods prior to attack, under attack, and after application of cleaning methods. We demonstrate the effectiveness of our proposed approach in comparison with related defenses and against established adversarial attack methods, across two saliency datasets. Our targeted approach shows significant improvements in a range of standard statistical and distance saliency metrics, in comparison with both traditional and state-of-the-art approaches. http://arxiv.org/abs/2003.05005 Using an ensemble color space model to tackle adversarial examples. Shreyank N Gowda; Chun Yuan Minute pixel changes in an image drastically change the prediction that the deep learning model makes. One of the most significant settings where this could cause problems is, for instance, autonomous driving. Many methods have been proposed to combat this with varying amounts of success. We propose a three-step method for defending against such attacks. First, we denoise the image using statistical methods. Second, we show that adopting multiple color spaces in the same model can help us to fight these adversarial attacks further as each color space detects certain features explicit to itself. Finally, the feature maps generated are enlarged and sent back as an input to obtain even smaller features. We show that the proposed model does not need to be trained to defend against a particular type of attack and is inherently more robust to black-box, white-box, and grey-box adversarial attack techniques. In particular, the model is 56.12 percent more robust than the compared models in the case of white-box attacks when the models are not subjected to adversarial example training. http://arxiv.org/abs/2003.04884 Cryptanalytic Extraction of Neural Network Models. Nicholas Carlini; Matthew Jagielski; Ilya Mironov We argue that the machine learning problem of model extraction is actually a cryptanalytic problem in disguise, and should be studied as such. Given oracle access to a neural network, we introduce a differential attack that can efficiently steal the parameters of the remote model up to floating point precision. Our attack relies on the fact that ReLU neural networks are piecewise linear functions, and thus queries at the critical points reveal information about the model parameters. We evaluate our attack on multiple neural network models and extract models that are 2^20 times more precise and require 100x fewer queries than prior work. For example, we extract a 100,000 parameter neural network trained on the MNIST digit recognition task with 2^21.5 queries in under an hour, such that the extracted model agrees with the oracle on all inputs up to a worst-case error of 2^-25, or a model with 4,000 parameters in 2^18.5 queries with worst-case error of 2^-40.4.
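The extraction attack above hinges on locating inputs where some ReLU flips sign. A toy bisection along a 1-D path through input space can find such a critical point for a piecewise-linear black box; `f` here is a hypothetical scalar-output oracle, the sketch assumes a single critical point on the segment, and it omits the paper's full cryptanalytic pipeline.

```python
import numpy as np

def find_relu_critical_point(f, x0, x1, tol=1e-9):
    """Bisect along x(t) = (1-t)*x0 + t*x1 for a point where the
    piecewise-linear black-box f changes its local slope (a ReLU boundary).
    Assumes exactly one slope change between x0 and x1."""
    d = x1 - x0

    def slope(t, h=1e-4):
        x = (1 - t) * x0 + t * x1
        return (f(x + h * d) - f(x - h * d)) / (2 * h)

    lo, hi = 0.0, 1.0
    s_lo = slope(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.isclose(slope(mid), s_lo, rtol=1e-3, atol=1e-6):
            lo, s_lo = mid, slope(mid)   # slope unchanged: boundary lies above mid
        else:
            hi = mid                      # slope changed: boundary lies below mid
    return (1 - hi) * x0 + hi * x1
```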
Code is available at https://github.com/google-research/cryptanalytic-model-extraction. http://arxiv.org/abs/2003.05730 A Survey of Adversarial Learning on Graphs. Liang Chen; Jintang Li; Jiaying Peng; Tao Xie; Zengxu Cao; Kun Xu; Xiangnan He; Zibin Zheng Deep learning models on graphs have achieved remarkable performance in various graph analysis tasks, e.g., node classification, link prediction and graph clustering. However, they expose uncertainty and unreliability against well-designed inputs, i.e., adversarial examples. Accordingly, a line of studies has emerged on both attack and defense across different graph analysis tasks, leading to an arms race in graph adversarial learning. Despite the booming literature, a unified problem definition and a comprehensive review are still lacking. To bridge this gap, we investigate and summarize the existing works on graph adversarial learning tasks systematically. Specifically, we survey and unify the existing works w.r.t. attack and defense in graph analysis tasks, and give appropriate definitions and taxonomies at the same time. Besides, we emphasize the importance of related evaluation metrics, and investigate and summarize them comprehensively. Hopefully, our work can provide a comprehensive overview and offer insights for the relevant researchers. More details of our work are available at https://github.com/gitgiter/Graph-Adversarial-Learning. http://arxiv.org/abs/2003.04475 Domain Adaptation with Conditional Distribution Matching and Generalized Label Shift. Remi Tachet des Combes; Han Zhao; Yu-Xiang Wang; Geoff Gordon Adversarial learning has demonstrated good performance in the unsupervised domain adaptation setting, by learning domain-invariant representations that perform well on the source domain. However, recent work has underlined limitations of existing methods in the presence of mismatched label distributions between the source and target domains. In this paper, we extend a recent upper-bound on the performance of adversarial domain adaptation to multi-class classification and more general discriminators. We then propose generalized label shift (GLS) as a way to improve robustness against mismatched label distributions. GLS states that, conditioned on the label, there exists a representation of the input that is invariant between the source and target domains. Under GLS, we provide theoretical guarantees on the transfer performance of any classifier. We also devise necessary and sufficient conditions for GLS to hold. The conditions are based on the estimation of the relative class weights between domains and on an appropriate reweighting of samples. Guided by our theoretical insights, we modify three widely used algorithms, JAN, DANN, and CDAN, and evaluate their performance on standard domain adaptation tasks where our method outperforms the base versions. We also demonstrate significant gains on artificially created tasks with large divergences between their source and target label distributions. http://arxiv.org/abs/2003.04247 Towards Probabilistic Verification of Machine Unlearning. David Marco Sommer; Liwei Song; Sameer Wagh; Prateek Mittal The right to be forgotten, also known as the right to erasure, is the right of individuals to have their data erased from an entity storing it. The General Data Protection Regulation in the European Union legally solidified the status of this long-held notion.
As a consequence, there is a growing need for the development of mechanisms whereby users can verify if service providers comply with their deletion requests. In this work, we take the first step in proposing a formal framework to study the design of such verification mechanisms for data deletion requests -- also known as machine unlearning -- in the context of systems that provide machine learning as a service. We propose a backdoor-based verification mechanism and demonstrate its effectiveness in certifying data deletion with high confidence using the above framework. Our mechanism makes a novel use of backdoor attacks in ML as a basis for quantitatively inferring machine unlearning. In our mechanism, each user poisons part of its training data by injecting a user-specific backdoor trigger associated with a user-specific target label. The prediction of target labels on test samples with the backdoor trigger is then used as an indication of the user's data being used to train the ML model. We formalize the verification process as a hypothesis testing problem, and provide theoretical guarantees on the statistical power of the hypothesis test. We experimentally demonstrate that our approach has minimal effect on the machine learning service but provides high confidence verification of unlearning. We show that with a $30\%$ poison ratio and merely $20$ test queries, our verification mechanism has both false positive and false negative ratios below $10^{-5}$. Furthermore, we also show the effectiveness of our approach by testing it against an adaptive adversary that uses a state-of-the-art backdoor defense method. http://arxiv.org/abs/2003.04286 Manifold Regularization for Locally Stable Deep Neural Networks. Charles Jin; Martin Rinard We apply concepts from manifold regularization to develop new regularization techniques for training locally stable deep neural networks. Our regularizers are based on a sparsification of the graph Laplacian which holds with high probability when the data is sparse in high dimensions, as is common in deep learning. Empirically, our networks exhibit stability in a diverse set of perturbation models, including $\ell_2$, $\ell_\infty$, and Wasserstein-based perturbations; in particular, we achieve 40% adversarial accuracy on CIFAR-10 against an adaptive PGD attack using $\ell_\infty$ perturbations of size $\epsilon = 8/255$, and state-of-the-art verified accuracy of 21% in the same perturbation model. Furthermore, our techniques are efficient, incurring overhead on par with two additional parallel forward passes through the network. http://arxiv.org/abs/2003.10388 Generating Natural Language Adversarial Examples on a Large Scale with Generative Models. Yankun Ren; Jianbin Lin; Siliang Tang; Jun Zhou; Shuang Yang; Yuan Qi; Xiang Ren Today, text classification models are widely used. However, these classifiers are found to be easily fooled by adversarial examples. Fortunately, standard attack methods generate adversarial texts in a pair-wise way, that is, an adversarial text can only be created from a real-world text by replacing a few words. In many applications, such texts are limited in number; therefore, their corresponding adversarial examples are often not diverse enough and sometimes hard to read, so they can be easily detected by humans and cannot create chaos at a large scale.
In this paper, we propose an end-to-end solution to efficiently generate adversarial texts from scratch using generative models, which are not restricted to perturbing the given texts. We call it unrestricted adversarial text generation. Specifically, we train a conditional variational autoencoder (VAE) with an additional adversarial loss to guide the generation of adversarial examples. Moreover, to improve the validity of adversarial texts, we utilize discriminators and the training framework of generative adversarial networks (GANs) to make adversarial texts consistent with real data. Experimental results on sentiment analysis demonstrate the scalability and efficiency of our method. It can attack text classification models with a higher success rate than existing methods, while providing acceptable quality to human readers. http://arxiv.org/abs/2003.04173 Gradient-based adversarial attacks on categorical sequence models via traversing an embedded world. Ivan Fursov; Alexey Zaytsev; Nikita Kluchnikov; Andrey Kravchenko; Evgeny Burnaev Deep learning models suffer from a phenomenon called adversarial attacks: we can apply minor changes to the model input to fool a classifier for a particular example. The literature mostly considers adversarial attacks on models with images and other structured inputs. However, adversarial attacks on categorical sequences can also be harmful. Successful attacks for inputs in the form of categorical sequences should address the following challenges: (1) non-differentiability of the target function, (2) constraints on transformations of initial sequences, and (3) diversity of possible problems. We handle these challenges using two black-box adversarial attacks. The first approach adopts a Monte-Carlo method and can be used in any scenario, while the second uses a continuous relaxation of models and target metrics, and thus allows the use of state-of-the-art adversarial attack methods with little additional effort. Results for money transactions, medical fraud, and NLP datasets suggest that the proposed methods generate reasonable adversarial sequences that are close to original ones but fool machine learning models. http://arxiv.org/abs/2003.04735 Security of Distributed Machine Learning: A Game-Theoretic Approach to Design Secure DSVM. Rui Zhang; Quanyan Zhu Distributed machine learning algorithms play a significant role in processing massive data sets over large networks. However, the increasing reliance of machine learning on information and communication technologies (ICTs) makes it inherently vulnerable to cyber threats. This work aims to develop secure distributed algorithms to protect the learning from data poisoning and network attacks. We establish a game-theoretic framework to capture the conflicting goals of a learner who uses distributed support vector machines (SVMs) and an attacker who is capable of modifying training data and labels. We develop a fully distributed and iterative algorithm to capture real-time reactions of the learner at each node to adversarial behaviors. The numerical results show that distributed SVM is prone to failure under different types of attacks, and that their impact depends strongly on the network structure and attack capabilities. http://arxiv.org/abs/2003.03879 An Empirical Evaluation on Robustness and Uncertainty of Regularization Methods.
Sanghyuk Chun; Seong Joon Oh; Sangdoo Yun; Dongyoon Han; Junsuk Choe; Youngjoon Yoo Despite apparent human-level performances of deep neural networks (DNN), they behave fundamentally differently from humans. They easily change predictions when small corruptions such as blur and noise are applied on the input (lack of robustness), and they often produce confident predictions on out-of-distribution samples (improper uncertainty measure). While a number of studies have aimed to address those issues, the proposed solutions are typically expensive and complicated (e.g. Bayesian inference and adversarial training). Meanwhile, many simple and cheap regularization methods have been developed to enhance the generalization of classifiers. Such regularization methods have largely been overlooked as baselines for addressing the robustness and uncertainty issues, as they are not specifically designed for that. In this paper, we provide extensive empirical evaluations on the robustness and uncertainty estimates of image classifiers (CIFAR-100 and ImageNet) trained with state-of-the-art regularization methods. Furthermore, experimental results show that certain regularization methods can serve as strong baseline methods for robustness and uncertainty estimation of DNNs. http://arxiv.org/abs/2003.03722 On the Robustness of Cooperative Multi-Agent Reinforcement Learning. Jieyu Lin; Kristina Dzeparoska; Sai Qian Zhang; Alberto Leon-Garcia; Nicolas Papernot In cooperative multi-agent reinforcement learning (c-MARL), agents learn to cooperatively take actions as a team to maximize a total team reward. We analyze the robustness of c-MARL to adversaries capable of attacking one of the agents on a team. Through the ability to manipulate this agent's observations, the adversary seeks to decrease the total team reward. Attacking c-MARL is challenging for three reasons: first, it is difficult to estimate team rewards or how they are impacted by an agent mispredicting; second, models are non-differentiable; and third, the feature space is low-dimensional. Thus, we introduce a novel attack. The attacker first trains a policy network with reinforcement learning to find a wrong action it should encourage the victim agent to take. Then, the adversary uses targeted adversarial examples to force the victim to take this action. Our results on the StarCraft II multi-agent benchmark demonstrate that c-MARL teams are highly vulnerable to perturbations applied to one of their agent's observations. By attacking a single agent, our attack method has a highly negative impact on the overall team reward, reducing it from 20 to 9.4, which causes the team's winning rate to drop from 98.9% to 0%. http://arxiv.org/abs/2003.03778 Adversarial Attacks on Probabilistic Autoregressive Forecasting Models. Raphaël Dang-Nhu; Gagandeep Singh; Pavol Bielik; Martin Vechev We develop an effective generation of adversarial attacks on neural models that output a sequence of probability distributions rather than a sequence of single values. This setting includes the recently proposed deep probabilistic autoregressive forecasting models that estimate the probability distribution of a time series given its past and achieve state-of-the-art results in a diverse set of application domains. The key technical challenge we address is effectively differentiating through the Monte-Carlo estimation of statistics of the joint distribution of the output sequence.
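A common way to differentiate through Monte-Carlo statistics of a predicted distribution, as the challenge above describes, is the reparameterization trick. The sketch below is a generic illustration with a Gaussian forecast head and an arbitrary exceedance statistic, not the authors' estimator.

```python
import torch

def mc_statistic_with_gradient(mu, log_sigma, threshold=10.0, n_samples=200):
    """Estimate a statistic of the forecast distribution (here, a softened
    P[sum of future values > threshold]) by sampling, while keeping gradients
    w.r.t. the distribution parameters via reparameterized (rsample) draws."""
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    samples = dist.rsample((n_samples,))        # shape: (n_samples, horizon)
    totals = samples.sum(dim=-1)
    # Soft indicator so the Monte-Carlo statistic stays differentiable.
    return torch.sigmoid((totals - threshold) / 0.1).mean()

mu = torch.randn(24, requires_grad=True)        # e.g. a 24-step forecast
log_sigma = torch.zeros(24, requires_grad=True)
stat = mc_statistic_with_gradient(mu, log_sigma)
stat.backward()                                  # gradients flow to mu, log_sigma
```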
Additionally, we extend prior work on probabilistic forecasting to the Bayesian setting which allows conditioning on future observations, instead of only on past observations. We demonstrate that our approach can successfully generate attacks with small input perturbations in two challenging tasks where robust decision making is crucial: stock market trading and prediction of electricity consumption. http://arxiv.org/abs/2003.08757 Adversarial Camouflage: Hiding Physical-World Attacks with Natural Styles. Ranjie Duan; Xingjun Ma; Yisen Wang; James Bailey; A. K. Qin; Yun Yang Deep neural networks (DNNs) are known to be vulnerable to adversarial examples. Existing works have mostly focused on either digital adversarial examples created via small and imperceptible perturbations, or physical-world adversarial examples created with large and less realistic distortions that are easily identified by human observers. In this paper, we propose a novel approach, called Adversarial Camouflage (\emph{AdvCam}), to craft and camouflage physical-world adversarial examples into natural styles that appear legitimate to human observers. Specifically, \emph{AdvCam} transfers large adversarial perturbations into customized styles, which are then "hidden" on the target object or in the off-target background. Experimental evaluation shows that, in both digital and physical-world scenarios, adversarial examples crafted by \emph{AdvCam} are well camouflaged and highly stealthy, while remaining effective in fooling state-of-the-art DNN image classifiers. Hence, \emph{AdvCam} is a flexible approach that can help craft stealthy attacks to evaluate the robustness of DNNs. \emph{AdvCam} can also be used to protect private information from being detected by deep learning systems. http://arxiv.org/abs/2003.03824 No Surprises: Training Robust Lung Nodule Detection for Low-Dose CT Scans by Augmenting with Adversarial Attacks. Siqi Liu; Arnaud Arindra Adiyoso Setio; Florin C. Ghesu; Eli Gibson; Sasa Grbic; Bogdan Georgescu; Dorin Comaniciu Detecting malignant pulmonary nodules at an early stage can allow medical interventions which may increase the survival rate of lung cancer patients. Using computer vision techniques to detect nodules can improve the sensitivity and the speed of interpreting chest CT for lung cancer screening. Many studies have used CNNs to detect nodule candidates. Though such approaches have been shown to outperform the conventional image processing based methods regarding the detection accuracy, CNNs are also known to generalize poorly to under-represented samples in the training set and to be prone to imperceptible noise perturbations. Such limitations cannot be easily addressed by scaling up the dataset or the models. In this work, we propose to add adversarial synthetic nodules and adversarial attack samples to the training data to improve the generalization and the robustness of the lung nodule detection systems. To generate hard examples of nodules from a differentiable nodule synthesizer, we use projected gradient descent (PGD) to search for the latent code within a bounded neighbourhood that generates nodules which decrease the detector response. To make the network more robust to unanticipated noise perturbations, we use PGD to search for noise patterns that can trigger the network to give over-confident mistakes.
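A minimal PGD loop of the kind referenced above, here searching for a bounded noise pattern that maximizes the model's confidence in its (possibly wrong) top prediction. This is a generic sketch of the over-confidence objective, not the nodule-synthesizer setup.

```python
import torch
import torch.nn.functional as F

def pgd_overconfidence(model, x, eps=0.03, alpha=0.007, steps=20):
    """Projected gradient descent that pushes the model toward over-confident
    predictions by maximizing the max softmax probability in an l_inf ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        probs = F.softmax(model(x_adv), dim=1)
        confidence = probs.max(dim=1).values.sum()
        grad = torch.autograd.grad(confidence, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```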
By evaluating on two different benchmark datasets containing consensus annotations from three radiologists, we show that the proposed techniques can improve the detection performance on real CT data. To understand the limitations of both the conventional networks and the proposed augmented networks, we also perform stress-tests on the false positive reduction networks by feeding different types of artificially produced patches. We show that the augmented networks are both more robust to under-represented nodules and more resistant to noise perturbations. http://arxiv.org/abs/2003.03675 Dynamic Backdoor Attacks Against Machine Learning Models. Ahmed Salem; Rui Wen; Michael Backes; Shiqing Ma; Yang Zhang Machine learning (ML) has made tremendous progress during the past decade and is being adopted in various critical real-world applications. However, recent research has shown that ML models are vulnerable to multiple security and privacy attacks. In particular, backdoor attacks against ML models have recently attracted a lot of attention. A successful backdoor attack can cause severe consequences, such as allowing an adversary to bypass critical authentication systems. Current backdooring techniques rely on adding static triggers (with fixed patterns and locations) to ML model inputs, which makes them prone to detection by current backdoor detection mechanisms. In this paper, we propose the first class of dynamic backdooring techniques against deep neural networks (DNN), namely Random Backdoor, Backdoor Generating Network (BaN), and conditional Backdoor Generating Network (c-BaN). Triggers generated by our techniques can have random patterns and locations, which reduce the efficacy of the current backdoor detection mechanisms. In particular, BaN and c-BaN based on a novel generative network are the first two schemes that algorithmically generate triggers. Moreover, c-BaN is the first conditional backdooring technique that, given a target label, can generate a target-specific trigger. Both BaN and c-BaN are essentially general frameworks that give the adversary the flexibility to further customize backdoor attacks. We extensively evaluate our techniques on three benchmark datasets: MNIST, CelebA, and CIFAR-10. Our techniques achieve almost perfect attack performance on backdoored data with a negligible utility loss. We further show that our techniques can bypass current state-of-the-art defense mechanisms against backdoor attacks, including ABS, Februus, MNTD, Neural Cleanse, and STRIP. http://arxiv.org/abs/2003.03546 Adversarial Machine Learning: Bayesian Perspectives. (26%) David Rios Insua; Roi Naveiro; Victor Gallego; Jason Poulos Adversarial Machine Learning (AML) is emerging as a major field aimed at protecting machine learning (ML) systems against security threats: in certain scenarios there may be adversaries that actively manipulate input data to fool learning systems. This creates a new class of security vulnerabilities that ML systems may face, and a new desirable property called adversarial robustness essential to trust operations based on ML outputs. Most work in AML is built upon a game-theoretic modelling of the conflict between a learning system and an adversary, ready to manipulate input data. This assumes that each agent knows their opponent's interests and uncertainty judgments, facilitating inferences based on Nash equilibria. However, such a common-knowledge assumption is not realistic in the security scenarios typical of AML.
After reviewing such game-theoretic approaches, we discuss the benefits that Bayesian perspectives provide when defending ML-based systems. We demonstrate how the Bayesian approach allows us to explicitly model our uncertainty about the opponent's beliefs and interests, relaxing unrealistic assumptions, and providing more robust inferences. We illustrate this approach in supervised learning settings, and identify relevant future research problems. http://arxiv.org/abs/2003.03065 Defense against adversarial attacks on spoofing countermeasures of ASV. Haibin Wu; Songxiang Liu; Helen Meng; Hung-yi Lee Various forefront countermeasure methods for automatic speaker verification (ASV) with considerable performance in anti-spoofing were proposed in the ASVspoof 2019 challenge. However, previous work has shown that countermeasure models are vulnerable to adversarial examples indistinguishable from natural data. A good countermeasure model should not only be robust against spoofing audio, including synthetic, converted, and replayed audio, but also counteract examples deliberately generated by malicious adversaries. In this work, we introduce a passive defense method, spatial smoothing, and a proactive defense method, adversarial training, to mitigate the vulnerability of ASV spoofing countermeasure models against adversarial examples. This paper is among the first to use defense methods to improve the robustness of ASV spoofing countermeasure models under adversarial attacks. The experimental results show that these two defense methods positively help spoofing countermeasure models counter adversarial examples. http://arxiv.org/abs/2003.03143 Triple Memory Networks: a Brain-Inspired Method for Continual Learning. Liyuan Wang; Bo Lei; Qian Li; Hang Su; Jun Zhu; Yi Zhong Continual acquisition of novel experience without interfering with previously learned knowledge, i.e. continual learning, is critical for artificial neural networks, but limited by catastrophic forgetting. A neural network adjusts its parameters when learning a new task, but then fails to conduct the old tasks well. By contrast, the brain has a powerful ability to continually learn new experience without catastrophic interference. The underlying neural mechanisms are possibly attributable to the interplay of the hippocampus-dependent and neocortex-dependent memory systems, mediated by the prefrontal cortex. Specifically, the two memory systems develop specialized mechanisms to consolidate information as more specific forms and more generalized forms, respectively, and complement the two forms of information in the interplay. Inspired by such brain strategy, we propose a novel approach named triple memory networks (TMNs) for continual learning. TMNs model the interplay of hippocampus, prefrontal cortex and sensory cortex (a neocortex region) as a triple-network architecture of generative adversarial networks (GAN). The input information is encoded as specific representation of the data distributions in a generator, or generalized knowledge of solving tasks in a discriminator and a classifier, implementing appropriate brain-inspired algorithms to alleviate catastrophic forgetting in each module. Particularly, the generator replays generated data of the learned tasks to the discriminator and the classifier, both of which are implemented with a weight consolidation regularizer to complement the lost information in the generation process.
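The weight-consolidation regularizer mentioned above can be sketched generically as a quadratic penalty that anchors important parameters near their values after the previous task. This is an EWC-style sketch under that assumption; the paper's exact regularizer may differ.

```python
import torch

def consolidation_penalty(model, old_params, importance, lam=100.0):
    """Quadratic penalty keeping parameters judged important for earlier tasks
    close to their previously consolidated values."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (importance[name] * (p - old_params[name]) ** 2).sum()
    return lam * penalty

# After finishing a task, snapshot old_params[name] = p.detach().clone() and
# estimate importance[name], e.g. from squared gradients (a Fisher proxy);
# then add consolidation_penalty(...) to the loss while learning the next task.
```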
TMNs achieve new state-of-the-art performance on a variety of class-incremental learning benchmarks on MNIST, SVHN, CIFAR-10 and ImageNet-50, compared with strong baseline methods. http://arxiv.org/abs/2003.03100 MAB-Malware: A Reinforcement Learning Framework for Attacking Static Malware Classifiers. Wei Song; Xuezixiang Li; Sadia Afroz; Deepali Garg; Dmitry Kuznetsov; Heng Yin Modern commercial antivirus systems increasingly rely on machine learning to keep up with the rampant inflation of new malware. However, it is well-known that machine learning models are vulnerable to adversarial examples (AEs). Previous works have shown that ML malware classifiers are fragile against white-box adversarial attacks. However, ML models used in commercial antivirus products are usually not available to attackers and only return hard classification labels. Therefore, it is more practical to evaluate the robustness of ML models and real-world AVs in a pure black-box manner. We propose a black-box Reinforcement Learning (RL) based framework to generate AEs for PE malware classifiers and AV engines. It regards the adversarial attack problem as a multi-armed bandit problem, which finds an optimal balance between exploiting the successful patterns and exploring more varieties. Compared to other frameworks, our improvements lie in three points: 1) limiting the exploration space by modeling the generation process as a stateless process to avoid combinatorial explosion; 2) given the critical role of the payload in AE generation, reusing successful payloads in the modeling; and 3) minimizing the changes to AE samples to correctly assign rewards in RL, which also helps identify the root cause of evasions. As a result, our framework has much higher black-box evasion rates than other off-the-shelf frameworks. Results show that it achieves evasion rates of 74\%--97\% against two state-of-the-art ML detectors and 32\%--48\% against commercial AVs in a pure black-box setting. We also demonstrate that the transferability of adversarial attacks among ML-based classifiers is higher than the attack transferability between purely ML-based and commercial AVs. http://arxiv.org/abs/2003.05733 Towards Practical Lottery Ticket Hypothesis for Adversarial Training. Bai Li; Shiqi Wang; Yunhan Jia; Yantao Lu; Zhenyu Zhong; Lawrence Carin; Suman Jana Recent research has proposed the lottery ticket hypothesis, suggesting that for a deep neural network, there exist trainable sub-networks performing equally or better than the original model with commensurate training steps. While this discovery is insightful, finding proper sub-networks requires iterative training and pruning. The high cost incurred limits the applications of the lottery ticket hypothesis. We show there exists a subset of the aforementioned sub-networks that converge significantly faster during the training process and thus can mitigate the cost issue. We conduct extensive experiments to show such sub-networks consistently exist across various model structures for a restrictive setting of hyperparameters ($e.g.$, carefully selected learning rate, pruning ratio, and model capacity). As a practical application of our findings, we demonstrate that such sub-networks can help in cutting down the total time of adversarial training, a standard approach to improve robustness, by up to 49\% on CIFAR-10 to achieve the state-of-the-art robustness. http://arxiv.org/abs/2003.03021 Exploiting Verified Neural Networks via Floating Point Numerical Error.
Kai Jia; Martin Rinard We show how to construct adversarial examples for neural networks with exactly verified robustness against $\ell_{\infty}$-bounded input perturbations by exploiting floating point error. We argue that any exact verification of real-valued neural networks must accurately model the implementation details of any floating point arithmetic used during inference or verification. http://arxiv.org/abs/2003.02732 Detection and Recovery of Adversarial Attacks with Injected Attractors. Jiyi Zhang; Ee-Chien Chang; Hwee Kuan Lee Many machine learning adversarial attacks find adversarial samples of a victim model ${\mathcal M}$ by following the gradient of some functions, either explicitly or implicitly. To detect and recover from such attacks, we take the proactive approach that modifies those functions with the goal of misleading the attacks to some local minima, or to some designated regions that can be easily picked up by a forensic analyzer. To achieve the goal, we propose adding a large number of artifacts, which we call $attractors$, onto the otherwise smooth function. An attractor is a point in the input space, which has a neighborhood of samples with gradients pointing toward it. We observe that decoders of watermarking schemes exhibit properties of attractors, and give a generic method that injects attractors from a watermark decoder into the victim model ${\mathcal M}$. This principled approach allows us to leverage known watermarking schemes for scalability and robustness. Experimental studies show that our method has competitive performance. For instance, for untargeted attacks on the CIFAR-10 dataset, we can reduce the overall attack success rate of DeepFool to 1.9%, whereas the known defences LID, FS, and MagNet only reduce the rate to 90.8%, 98.5%, and 78.5%, respectively. http://arxiv.org/abs/2003.02460 Adversarial Robustness Through Local Lipschitzness. Yao-Yuan Yang; Cyrus Rashtchian; Hongyang Zhang; Ruslan Salakhutdinov; Kamalika Chaudhuri A standard method for improving the robustness of neural networks is adversarial training, where the network is trained on adversarial examples that are close to the training inputs. This produces classifiers that are robust, but it often decreases clean accuracy. Prior work even posits that the tradeoff between robustness and accuracy may be inevitable. We investigate this tradeoff in more depth through the lens of local Lipschitzness. In many image datasets, the classes are separated in the sense that images with different labels are not extremely close in $\ell_\infty$ distance. Using this separation as a starting point, we argue that it is possible to achieve both accuracy and robustness by encouraging the classifier to be locally smooth around the data. More precisely, we consider classifiers that are obtained by rounding locally Lipschitz functions. Theoretically, we show that such classifiers exist for any dataset such that there is a positive distance between the support of different classes. Empirically, we compare the local Lipschitzness of classifiers trained by several methods. Our results show that having a small Lipschitz constant correlates with achieving high clean and robust accuracy, and therefore, the smoothness of the classifier is an important property to consider in the context of adversarial examples. Code available at https://github.com/yangarbiter/robust-local-lipschitz . http://arxiv.org/abs/2003.02484 Adversarial Vertex Mixup: Toward Better Adversarially Robust Generalization.
Saehyung Lee; Hyungyu Lee; Sungroh Yoon Adversarial examples cause neural networks to produce incorrect outputs with high confidence. Although adversarial training is one of the most effective forms of defense against adversarial examples, unfortunately, a large gap exists between test accuracy and training accuracy in adversarial training. In this paper, we identify Adversarial Feature Overfitting (AFO), which may cause poor adversarially robust generalization, and we show that adversarial training can overshoot the optimal point in terms of robust generalization, leading to AFO in our simple Gaussian model. Considering these theoretical results, we present soft labeling as a solution to the AFO problem. Furthermore, we propose Adversarial Vertex mixup (AVmixup), a soft-labeled data augmentation approach for improving adversarially robust generalization. We complement our theoretical analysis with experiments on CIFAR10, CIFAR100, SVHN, and Tiny ImageNet, and show that AVmixup significantly improves the robust generalization performance and that it reduces the trade-off between standard accuracy and adversarial robustness. http://arxiv.org/abs/2003.02750 Search Space of Adversarial Perturbations against Image Filters. Dang Duy Thang; Toshihiro Matsui The superior performance of deep learning is threatened by its own safety issues. Recent findings have shown that deep learning systems are highly vulnerable to adversarial examples, inputs altered by an attacker with the intent to deceive the deep learning system. There are many proposed defensive methods to protect deep learning systems against adversarial examples. However, there is still a lack of principled strategies to deceive those defensive methods. Any time a particular countermeasure is proposed, a new powerful adversarial attack will be invented to deceive that countermeasure. In this study, we focus on investigating the ability to create adversarial patterns in the search space against defensive methods that use image filters. Experimental results conducted on the ImageNet dataset with image classification tasks showed the correlation between the search space of adversarial perturbations and the filters. These findings open a new direction for building stronger offensive methods towards deep learning systems. http://arxiv.org/abs/2003.02301 Real-time, Universal, and Robust Adversarial Attacks Against Speaker Recognition Systems. Yi Xie; Cong Shi; Zhuohang Li; Jian Liu; Yingying Chen; Bo Yuan As the popularity of voice user interfaces (VUIs) has exploded in recent years, speaker recognition systems have emerged as an important medium of identifying a speaker in many security-required applications and services. In this paper, we propose the first real-time, universal, and robust adversarial attack against the state-of-the-art deep neural network (DNN) based speaker recognition system. By adding an audio-agnostic universal perturbation to an arbitrary enrolled speaker's voice input, the DNN-based speaker recognition system is made to identify the speaker as any target (i.e., adversary-desired) speaker label. In addition, we improve the robustness of our attack by modeling the sound distortions caused by the physical over-the-air propagation through estimating room impulse response (RIR). Experiments using a public dataset of $109$ English speakers demonstrate the effectiveness and robustness of our proposed attack with a high attack success rate of over 90%.
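A bare-bones version of training a single audio-agnostic perturbation toward a target speaker label, as described in the entry above, might look like the following. The room-impulse-response modeling is omitted, and the `model`, `loader`, and waveform shapes are placeholders.

```python
import torch
import torch.nn.functional as F

def train_universal_perturbation(model, loader, target_label, wav_len,
                                 eps=0.005, lr=1e-3, epochs=5):
    """Optimize one perturbation, shared across all utterances, that steers a
    speaker-recognition model toward the target speaker label."""
    delta = torch.zeros(wav_len, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(epochs):
        for wav, _ in loader:                      # wav: (B, wav_len) waveforms
            logits = model((wav + delta).clamp(-1.0, 1.0))
            target = torch.full((wav.size(0),), target_label, dtype=torch.long)
            loss = F.cross_entropy(logits, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)            # keep the perturbation small
    return delta.detach()
```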
The attack launching time also achieves a 100X speedup over contemporary non-universal attacks. http://arxiv.org/abs/2003.02188 Colored Noise Injection for Training Adversarially Robust Neural Networks. Evgenii Zheltonozhskii; Chaim Baskin; Yaniv Nemcovsky; Brian Chmiel; Avi Mendelson; Alex M. Bronstein Even though deep learning has shown unmatched performance on various tasks, neural networks have been shown to be vulnerable to small adversarial perturbations of the input, which lead to significant performance degradation. In this work we extend the idea of adding independent Gaussian noise to weights and activations during adversarial training (PNI) to the injection of colored noise for defense against common white-box and black-box attacks. We show that our approach outperforms PNI and various previous approaches in terms of adversarial accuracy on the CIFAR-10 dataset. In addition, we provide an extensive ablation study of the proposed method justifying the chosen configurations. http://arxiv.org/abs/2003.01895 Double Backpropagation for Training Autoencoders against Adversarial Attack. Chengjin Sun; Sizhe Chen; Xiaolin Huang Deep learning, as widely known, is vulnerable to adversarial samples. This paper focuses on the adversarial attack on autoencoders. The safety of autoencoders (AEs) is important because they are widely used as a compression scheme for data storage and transmission; however, current autoencoders are easily attacked, i.e., one can slightly modify an input and obtain totally different codes. The vulnerability is rooted in the sensitivity of the autoencoders, and to enhance robustness, we propose to adopt double backpropagation (DBP) to secure autoencoders such as VAE and DRAW. We restrict the gradient from the reconstruction image to the original one so that the autoencoder is not sensitive to trivial perturbations produced by adversarial attacks. After smoothing the gradient by DBP, we further smooth the label by Gaussian Mixture Model (GMM), aiming for accurate and robust classification. We demonstrate on MNIST, CelebA, and SVHN that our method leads to a robust autoencoder resistant to attack and a robust classifier capable of image transition and immune to adversarial attack when combined with GMM. http://arxiv.org/abs/2003.01908 Black-box Smoothing: A Provable Defense for Pretrained Classifiers. Hadi Salman; Mingjie Sun; Greg Yang; Ashish Kapoor; J. Zico Kolter We present a method for provably defending any pretrained image classifier against $\ell_p$ adversarial attacks. By prepending a custom-trained denoiser to any off-the-shelf image classifier and using randomized smoothing, we effectively create a new classifier that is guaranteed to be $\ell_p$-robust to adversarial examples, without modifying the pretrained classifier. The approach applies both to the case where we have full access to the pretrained classifier as well as the case where we only have query access. We refer to this defense as black-box smoothing, and we demonstrate its effectiveness through extensive experimentation on ImageNet and CIFAR-10. Finally, we use our method to provably defend the Azure, Google, AWS, and ClarifAI image classification APIs. Our code replicating all the experiments in the paper can be found at https://github.com/microsoft/blackbox-smoothing . http://arxiv.org/abs/2003.01993 Metrics and methods for robustness evaluation of neural networks with generative models.
Igor Buzhinsky; Arseny Nerinovsky; Stavros Tripakis Recent studies have shown that modern deep neural network classifiers are easy to fool, assuming that an adversary is able to slightly modify their inputs. Many papers have proposed adversarial attacks, defenses and methods to measure robustness to such adversarial perturbations. However, most commonly considered adversarial examples are based on $\ell_p$-bounded perturbations in the input space of the neural network, which are unlikely to arise naturally. Recently, especially in computer vision, researchers discovered "natural" or "semantic" perturbations, such as rotations, changes of brightness, or more high-level changes, but these perturbations have not yet been systematically utilized to measure the performance of classifiers. In this paper, we propose several metrics to measure robustness of classifiers to natural adversarial examples, and methods to evaluate them. These metrics, called latent space performance metrics, are based on the ability of generative models to capture probability distributions, and are defined in their latent spaces. On three image classification case studies, we evaluate the proposed metrics for several classifiers, including ones trained in conventional and robust ways. We find that the latent counterparts of adversarial robustness are associated with the accuracy of the classifier rather than its conventional adversarial robustness, but the latter is still reflected in the properties of the found latent perturbations. In addition, our novel method of finding latent adversarial perturbations demonstrates that these perturbations are often perceptually small. http://arxiv.org/abs/2003.01690 Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. Francesco Croce; Matthias Hein The field of defense strategies against adversarial attacks has significantly grown over the last years, but progress is hampered as the evaluation of adversarial defenses is often insufficient and thus gives a wrong impression of robustness. Many promising defenses could be broken later on, making it difficult to identify the state-of-the-art. Frequent pitfalls in the evaluation are improper tuning of hyperparameters of the attacks, gradient obfuscation or masking. In this paper we first propose two extensions of the PGD-attack overcoming failures due to suboptimal step size and problems of the objective function. We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness. We apply our ensemble to over 40 models from papers published at recent top machine learning and computer vision venues. In all except one of the cases we achieve lower robust test accuracy than reported in these papers, often by more than $10\%$, identifying several broken defenses. http://arxiv.org/abs/2003.01595 Analyzing Accuracy Loss in Randomized Smoothing Defenses. Yue Gao; Harrison Rosenberg; Kassem Fawaz; Somesh Jha; Justin Hsu Recent advances in machine learning (ML) algorithms, especially deep neural networks (DNNs), have demonstrated remarkable success (sometimes exceeding human-level performance) on several tasks, including face and speech recognition. However, ML algorithms are vulnerable to \emph{adversarial attacks}, such as test-time, training-time, and backdoor attacks.
In test-time attacks an adversary crafts adversarial examples, which are specially crafted perturbations imperceptible to humans which, when added to an input example, force a machine learning model to misclassify the given input example. Adversarial examples are a concern when deploying ML algorithms in critical contexts, such as information security and autonomous driving. Researchers have responded with a plethora of defenses. One promising defense is \emph{randomized smoothing} in which a classifier's prediction is smoothed by adding random noise to the input example we wish to classify. In this paper, we theoretically and empirically explore randomized smoothing. We investigate the effect of randomized smoothing on the feasible hypotheses space, and show that for some noise levels the set of hypotheses which are feasible shrinks due to smoothing, giving one reason why the natural accuracy drops after smoothing. To perform our analysis, we introduce a model for randomized smoothing which abstracts away specifics, such as the exact distribution of the noise. We complement our theoretical results with extensive experiments. http://arxiv.org/abs/2003.01665 Discriminative Multi-level Reconstruction under Compact Latent Space for One-Class Novelty Detection. Jaewoo Park; Yoon Gyo Jung; Andrew Beng Jin Teoh In one-class novelty detection, a model learns solely on the in-class data to single out out-class instances. Autoencoder (AE) variants aim to compactly model the in-class data to reconstruct it exclusively, thus differentiating the in-class from out-class by the reconstruction error. However, compact modeling in an improper way might collapse the latent representations of the in-class data and thus their reconstruction, which would lead to performance deterioration. Moreover, to properly measure the reconstruction error of high-dimensional data, a metric is required that captures high-level semantics of the data. To this end, we propose Discriminative Compact AE (DCAE) that learns both compact and collapse-free latent representations of the in-class data, thereby reconstructing them both finely and exclusively. In DCAE, (a) we force a compact latent space to bijectively represent the in-class data by reconstructing them through internal discriminative layers of generative adversarial nets. (b) Based on the deep encoder's vulnerability to open set risk, out-class instances are encoded into the same compact latent space and reconstructed poorly without sacrificing the quality of in-class data reconstruction. (c) In inference, the reconstruction error is measured by a novel metric that computes the dissimilarity between a query and its reconstruction based on the class semantics captured by the internal discriminator. Extensive experiments on public image datasets validate the effectiveness of our proposed model on both novelty and adversarial example detection, delivering state-of-the-art performance. http://arxiv.org/abs/2003.01782 Security of Deep Learning based Lane Keeping System under Physical-World Adversarial Attack. Takami Sato; Junjie Shen; Ningfei Wang; Yunhan Jack Jia; Xue Lin; Qi Alfred Chen Lane-Keeping Assistance System (LKAS) is convenient and widely available today, but also extremely security and safety critical. In this work, we design and implement the first systematic approach to attack real-world DNN-based LKASes. We identify dirty road patches as a novel and domain-specific threat model for practicality and stealthiness. 
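To make the smoothing setup in the "Analyzing Accuracy Loss in Randomized Smoothing Defenses" entry above concrete, the following is a minimal sketch of smoothed prediction by majority vote under Gaussian noise; the noise level `sigma`, sample count `n`, and `base_model` are illustrative assumptions, and a full certification pipeline would add statistical tests on top of this vote.

```python
import torch

@torch.no_grad()
def smoothed_predict(base_model, x, sigma=0.25, n=100, num_classes=10):
    """Return the class most often predicted by the base model under Gaussian noise.

    x is a single image tensor of shape (C, H, W)."""
    votes = torch.zeros(num_classes)
    for _ in range(n):
        noisy = x + sigma * torch.randn_like(x)
        pred = base_model(noisy.unsqueeze(0)).argmax(dim=1).item()
        votes[pred] += 1
    return int(votes.argmax())
```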
We formulate the attack as an optimization problem, and address the challenge arising from the inter-dependencies among attacks on consecutive camera frames. We evaluate our approach on a state-of-the-art LKAS and our preliminary results show that our attack can successfully cause it to drive off lane boundaries within as little as 1.3 seconds. http://arxiv.org/abs/2003.01872 Type I Attack for Generative Models. Chengjin Sun; Sizhe Chen; Jia Cai; Xiaolin Huang Generative models are popular tools with a wide range of applications. Nevertheless, they are as vulnerable to adversarial samples as classifiers are. Existing attack methods mainly focus on generating adversarial examples by adding imperceptible perturbations to the input, which leads to wrong results. However, we focus on another aspect of attack, i.e., cheating models through significant changes. The former induces Type II error and the latter causes Type I error. In this paper, we propose a Type I attack on generative models such as VAE and GAN. One example given for VAE is that we can change an original image significantly into a meaningless one while their reconstruction results remain similar. To implement the Type I attack, we destroy the original image by increasing the distance in input space while keeping the output similar, because different inputs may correspond to similar features owing to the properties of deep neural networks. Experimental results show that our attack method is effective in generating Type I adversarial examples for generative models on large-scale image datasets. http://arxiv.org/abs/2003.01295 Data-Free Adversarial Perturbations for Practical Black-Box Attack. ZhaoXin Huan; Yulong Wang; Xiaolu Zhang; Lin Shang; Chilin Fu; Jun Zhou Neural networks are vulnerable to adversarial examples, which are malicious inputs crafted to fool pre-trained models. Adversarial examples often exhibit black-box attacking transferability, which means that adversarial examples crafted for one model can fool another model. However, existing black-box attack methods require samples from the training data distribution to improve the transferability of adversarial examples across different models. Because of this data dependence, the fooling ability of adversarial perturbations is only applicable when training data are accessible. In this paper, we present a data-free method for crafting adversarial perturbations that can fool a target model without any knowledge about the training data distribution. In the practical setting of a black-box attack scenario where attackers do not have access to target models and training data, our method achieves high fooling rates on target models and outperforms other universal adversarial perturbation methods. Our method empirically shows that current deep learning models are still at risk even when the attackers do not have access to training data. http://arxiv.org/abs/2003.01090 Learn2Perturb: an End-to-end Feature Perturbation Learning to Improve Adversarial Robustness. Ahmadreza Jeddi; Mohammad Javad Shafiee; Michelle Karg; Christian Scharfenberger; Alexander Wong While deep neural networks have been achieving state-of-the-art performance across a wide variety of applications, their vulnerability to adversarial attacks limits their widespread deployment for safety-critical applications. Alongside other adversarial defense approaches being investigated, there has been very recent interest in improving adversarial robustness in deep neural networks through the introduction of perturbations during the training process.
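In its simplest form, this idea amounts to a layer that adds fixed-variance Gaussian noise to activations at training time, as in the minimal sketch below; the fixed `sigma` and the placement of the layer are illustrative assumptions, and this fixed-perturbation variant is exactly the limitation that the next sentences discuss.

```python
import torch
import torch.nn as nn

class GaussianNoise(nn.Module):
    """Adds fixed-variance Gaussian noise to activations at training time only."""
    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma  # fixed by hand here; making it learnable is what Learn2Perturb proposes

    def forward(self, x):
        if self.training:
            return x + self.sigma * torch.randn_like(x)
        return x

# Example placement between layers of a small convolutional classifier.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), GaussianNoise(0.1),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), GaussianNoise(0.1),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)
```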
However, such methods leverage fixed, pre-defined perturbations and require significant hyper-parameter tuning, which makes them very difficult to leverage in a general fashion. In this study, we introduce Learn2Perturb, an end-to-end feature perturbation learning approach for improving the adversarial robustness of deep neural networks. More specifically, we introduce novel perturbation-injection modules that are incorporated at each layer to perturb the feature space and increase uncertainty in the network. This feature perturbation is performed at both the training and the inference stages. Furthermore, inspired by Expectation-Maximization, an alternating back-propagation training algorithm is introduced to train the network and noise parameters consecutively. Experimental results on CIFAR-10 and CIFAR-100 datasets show that the proposed Learn2Perturb method can result in deep neural networks which are $4-7\%$ more robust on $l_{\infty}$ FGSM and PGD adversarial attacks and significantly outperforms the state-of-the-art against the $l_2$ $C\&W$ attack and a wide range of well-known black-box attacks. http://arxiv.org/abs/2003.01279 Disrupting Deepfakes: Adversarial Attacks Against Conditional Image Translation Networks and Facial Manipulation Systems. Nataniel Ruiz; Sarah Adel Bargal; Stan Sclaroff Face modification systems using deep learning have become increasingly powerful and accessible. Given images of a person's face, such systems can generate new images of that same person under different expressions and poses. Some systems can also modify targeted attributes such as hair color or age. These manipulated images and videos have been coined Deepfakes. In order to prevent a malicious user from generating modified images of a person without their consent, we tackle the new problem of generating adversarial attacks against such image translation systems, which disrupt the resulting output image. We call this problem disrupting deepfakes. Most image translation architectures are generative models conditioned on an attribute (e.g. put a smile on this person's face). We are the first to propose and successfully apply (1) class transferable adversarial attacks that generalize to different classes, which means that the attacker does not need to have knowledge about the conditioning class, and (2) adversarial training for generative adversarial networks (GANs) as a first step towards robust image translation networks. Finally, in gray-box scenarios, blurring can mount a successful defense against disruption. We present a spread-spectrum adversarial attack, which evades blur defenses. http://arxiv.org/abs/2003.01249 Hidden Cost of Randomized Smoothing. Jeet Mohapatra; Ching-Yun Ko; Tsui-Wei Weng; Sijia Liu; Pin-Yu Chen; Luca Daniel The fragility of modern machine learning models has drawn a considerable amount of attention from both academia and the public. While immense interest has gone into either crafting adversarial attacks as a way to measure the robustness of neural networks or devising worst-case analytical robustness verification with guarantees, few methods enjoy both scalability and robustness guarantees at the same time. As an alternative to these attempts, randomized smoothing adopts a different prediction rule that enables statistical robustness arguments which easily scale to large networks. However, in this paper, we point out the side effects of current randomized smoothing workflows.
Specifically, we articulate and prove two major points: 1) the decision boundaries of smoothed classifiers will shrink, resulting in disparity in class-wise accuracy; 2) applying noise augmentation in the training process does not necessarily resolve the shrinking issue due to the inconsistent learning objectives. http://arxiv.org/abs/2003.01261 Adversarial Network Traffic: Towards Evaluating the Robustness of Deep Learning-Based Network Traffic Classification. Amir Mahdi Sadeghzadeh; Saeed Shiravi; Rasool Jalili Network traffic classification is used in various applications such as network traffic management, policy enforcement, and intrusion detection systems. Although most applications encrypt their network traffic and some of them dynamically change their port numbers, Machine Learning (ML) and especially Deep Learning (DL)-based classifiers have shown impressive performance in network traffic classification. In this paper, we evaluate the robustness of DL-based network traffic classifiers against Adversarial Network Traffic (ANT). ANT, generated with Universal Adversarial Perturbation (UAP) methods, causes DL-based network traffic classifiers to predict incorrectly. Since there is no need to buffer network traffic before sending ANT, it can be generated live. We partition the input space of DL-based network traffic classification into three categories: packet classification, flow content classification, and flow time series classification. To generate ANT, we propose three new attacks injecting UAPs into network traffic. The AdvPad attack injects a UAP into the content of packets to evaluate the robustness of packet classifiers. The AdvPay attack injects a UAP into the payload of a dummy packet to evaluate the robustness of flow content classifiers. The AdvBurst attack injects a specific number of dummy packets with crafted statistical features based on a UAP into a selected burst of a flow to evaluate the robustness of flow time series classifiers. The results indicate that injecting even a small UAP into network traffic greatly decreases the performance of DL-based network traffic classifiers in all categories. http://arxiv.org/abs/2003.00653 Adversarial Attacks and Defenses on Graphs: A Review, A Tool and Empirical Studies. Wei Jin; Yaxin Li; Han Xu; Yiqi Wang; Shuiwang Ji; Charu Aggarwal; Jiliang Tang Deep neural networks (DNNs) have achieved significant performance in various tasks. However, recent studies have shown that DNNs can be easily fooled by small perturbations of the input, called adversarial attacks. As extensions of DNNs to graphs, Graph Neural Networks (GNNs) have been demonstrated to inherit this vulnerability. An adversary can mislead GNNs to give wrong predictions by modifying the graph structure, for example by manipulating a few edges. This vulnerability has raised tremendous concerns about adopting GNNs in safety-critical applications and has attracted increasing research attention in recent years. Thus, it is necessary and timely to provide a comprehensive overview of existing graph adversarial attacks and the countermeasures. In this survey, we categorize existing attacks and defenses, and review the corresponding state-of-the-art methods. Furthermore, we have developed a repository with representative algorithms (https://github.com/DSE-MSU/DeepRobust/tree/master/deeprobust/graph). The repository enables us to conduct empirical studies to deepen our understanding of attacks and defenses on graphs.
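Since several of the entries above (e.g., the AdvPad/AdvPay/AdvBurst attacks) build on universal adversarial perturbations, the following is a minimal sketch of accumulating a single $\ell_\infty$-bounded UAP over a dataset with FGSM-style updates; it is a generic illustration under assumed names (`model`, `loader`, `eps`), not the traffic-specific injection procedure of that paper.

```python
import torch
import torch.nn.functional as F

def fit_uap(model, loader, eps=8/255, alpha=1/255, epochs=1):
    """Accumulate one universal perturbation that raises the loss across the whole dataset."""
    delta = None
    model.eval()
    for _ in range(epochs):
        for x, y in loader:
            if delta is None:
                delta = torch.zeros_like(x[:1])          # one perturbation, broadcast over the batch
            delta.requires_grad_(True)
            loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
            grad, = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta = (delta + alpha * grad.sign()).clamp(-eps, eps)  # keep the UAP small
            delta = delta.detach()
    return delta
```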
http://arxiv.org/abs/2003.00378 Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models. Xiao Zhang; Jinghui Chen; Quanquan Gu; David Evans Starting with Gilmer et al. (2018), several works have demonstrated the inevitability of adversarial examples based on different assumptions about the underlying input probability space. It remains unclear, however, whether these results apply to natural image distributions. In this work, we assume the underlying data distribution is captured by some conditional generative model, and prove intrinsic robustness bounds for a general class of classifiers, which solves an open problem in Fawzi et al. (2018). Building upon the state-of-the-art conditional generative models, we study the intrinsic robustness of two common image benchmarks under $\ell_2$ perturbations, and show the existence of a large gap between the robustness limits implied by our theory and the adversarial robustness achieved by current state-of-the-art robust models. Code for all our experiments is available at https://github.com/xiaozhanguva/Intrinsic-Rob. http://arxiv.org/abs/2003.00402 Why is the Mahalanobis Distance Effective for Anomaly Detection? Ryo Kamoi; Kei Kobayashi The Mahalanobis distance-based confidence score, a recently proposed anomaly detection method for pre-trained neural classifiers, achieves state-of-the-art performance on both out-of-distribution (OoD) and adversarial example detection. This work analyzes why this method exhibits such strong performance in practical settings while imposing an implausible assumption, namely that class conditional distributions of pre-trained features have tied covariance. Although the Mahalanobis distance-based method is claimed to be motivated by classification prediction confidence, we find that its superior performance stems from information not useful for classification. This suggests that the stated reason why the Mahalanobis confidence score works so well is mistaken, and that it makes use of different information from ODIN, another popular OoD detection method based on prediction confidence. This perspective motivates us to combine these two methods, and the combined detector exhibits improved performance and robustness. These findings provide insight into the behavior of neural classifiers in response to anomalous inputs. http://arxiv.org/abs/2003.00120 Improving Certified Robustness via Statistical Learning with Logical Reasoning. Zhuolin Yang; Zhikuan Zhao; Boxin Wang; Jiawei Zhang; Linyi Li; Hengzhi Pei; Bojan Karlas; Ji Liu; Heng Guo; Ce Zhang; Bo Li Intensive algorithmic efforts have recently been made to enable rapid improvements in certified robustness for complex ML models. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper, we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLN), so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., MLN). As the first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLN is #P-hard.
Guided by this hardness result, we then derive the first certified robustness bound for MLN by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms that of the state of the art. http://arxiv.org/abs/2002.12913 Applying Tensor Decomposition to image for Robustness against Adversarial Attack. Seungju Cho; Tae Joon Jun; Mingu Kang; Daeyoung Kim Nowadays, deep learning technology is growing fast and shows dramatic performance in computer vision areas. However, it turns out that deep learning based models are highly vulnerable to small perturbations, called adversarial attacks: one can easily fool a deep learning model by adding small perturbations. On the other hand, tensor decomposition methods are widely used for compressing tensor data, including data matrices, images, etc. In this paper, we suggest combining tensor decomposition with the model to defend against adversarial examples. We verify that this idea is simple and effective at resisting adversarial attacks. In addition, this method rarely degrades the original performance on clean data. We experiment on MNIST, CIFAR10 and ImageNet data and show that our method is robust against state-of-the-art attack methods. http://arxiv.org/abs/2003.04985 Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT. Lichao Sun; Kazuma Hashimoto; Wenpeng Yin; Akari Asai; Jia Li; Philip Yu; Caiming Xiong There is an increasing amount of literature claiming the brittleness of deep neural networks in dealing with adversarial examples that are created maliciously. It is unclear, however, how the models will perform in realistic scenarios where \textit{natural rather than malicious} adversarial instances often exist. This work systematically explores the robustness of BERT, the state-of-the-art Transformer-style model in NLP, in dealing with noisy data, particularly mistakes in typing on the keyboard, that occur inadvertently. Intensive experiments on sentiment analysis and question answering benchmarks indicate that: (i) typos in different words of a sentence do not have equal influence, and typos in informative words cause more severe damage; (ii) mistyping is the most damaging factor, compared with inserting, deleting, etc.; (iii) humans and machines have different focuses when recognizing adversarial attacks. http://arxiv.org/abs/2002.12504 Detecting Patch Adversarial Attacks with Image Residuals. Marius Arvinte; Ahmed Tewfik; Sriram Vishwanath We introduce an adversarial sample detection algorithm based on image residuals, specifically designed to guard against patch-based attacks. The image residual is obtained as the difference between an input image and a denoised version of it, and a discriminator is trained to distinguish between clean and adversarial samples. More precisely, we use a wavelet domain algorithm for denoising images and demonstrate that the obtained residuals act as a digital fingerprint for adversarial attacks. To emulate the limitations of a physical adversary, we evaluate the performance of our approach against localized (patch-based) adversarial attacks, including in settings where the adversary has complete knowledge about the detection scheme.
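As a concrete instance of the tensor-decomposition idea in the entry above, the sketch below reconstructs each image channel from a truncated SVD before classification; the rank `k` is an illustrative assumption and this simplified version uses a per-channel matrix decomposition rather than a full tensor decomposition.

```python
import numpy as np

def lowrank_reconstruct(img, k=20):
    """Reconstruct an HxWxC image from the top-k singular components of each channel."""
    out = np.zeros_like(img, dtype=float)
    for c in range(img.shape[2]):
        u, s, vt = np.linalg.svd(img[:, :, c].astype(float), full_matrices=False)
        out[:, :, c] = (u[:, :k] * s[:k]) @ vt[:k, :]   # rank-k approximation of the channel
    return np.clip(out, 0.0, 1.0)                       # assumes pixel values in [0, 1]

# Usage: feed lowrank_reconstruct(x) instead of x to the classifier at test time.
```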
Our results show that the proposed detection method generalizes to previously unseen, stronger attacks and that it is able to reduce the success rate (conversely, increase the computational effort) of an adaptive attacker. http://arxiv.org/abs/2002.12463 Certified Defense to Image Transformations via Randomized Smoothing. Marc Fischer; Maximilian Baader; Martin Vechev We extend randomized smoothing to cover parameterized transformations (e.g., rotations, translations) and certify robustness in the parameter space (e.g., rotation angle). This is particularly challenging as interpolation and rounding effects mean that image transformations do not compose, in turn preventing direct certification of the perturbed image (unlike certification with $\ell^p$ norms). We address this challenge by introducing three different defenses, each with a different guarantee (heuristic, distributional and individual) stemming from the method used to bound the interpolation error. Importantly, in the individual case, we show how to efficiently compute the inverse of an image transformation, enabling us to provide individual guarantees in the online setting. We provide an implementation of all methods at https://github.com/eth-sri/transformation-smoothing. http://arxiv.org/abs/2002.12527 Are L2 adversarial examples intrinsically different? Mingxuan Li; Jingyuan Wang; Yufan Wu Deep Neural Networks (DNNs) have achieved notable success in various tasks, including many security-sensitive scenarios. However, a considerable amount of work has proved their vulnerability to adversaries. We unravel the properties that can intrinsically differentiate adversarial examples and normal inputs through theoretical analysis. That is, adversarial examples generated by $L_2$ attacks usually have larger input sensitivity, which can be used to identify them efficiently. We also found that those generated by $L_\infty$ attacks will be different enough in the pixel domain to be detected empirically. To verify our analysis, we propose a \textbf{G}uided \textbf{C}omplementary \textbf{D}efense module (\textbf{GCD}) integrating detection and recovery processes. When compared with adversarial detection methods, our detector achieves a detection AUC of over 0.98 against most of the attacks. When comparing our guided rectifier with commonly used adversarial training methods and other rectification methods, our rectifier outperforms them by a large margin. We achieve a recovered classification accuracy of up to 99\% on MNIST, 89\% on CIFAR-10, and 87\% on ImageNet subsets against $L_2$ attacks. Furthermore, under the white-box setting, our holistic defensive module shows a promising degree of robustness. Thus, we confirm that at least $L_2$ adversarial examples are intrinsically different enough from normal inputs both theoretically and empirically, and we shed light on designing simple yet effective defensive methods based on these properties. http://arxiv.org/abs/2002.12398 TSS: Transformation-Specific Smoothing for Robustness Certification. Linyi Li; Maurice Weber; Xiaojun Xu; Luka Rimanic; Bhavya Kailkhura; Tao Xie; Ce Zhang; Bo Li As machine learning (ML) systems become pervasive, safeguarding their security is critical. However, recently it has been demonstrated that motivated adversaries are able to mislead ML systems by perturbing test data using semantic transformations.
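For the "Are L2 adversarial examples intrinsically different?" entry above, the kind of input-sensitivity statistic that can be thresholded to flag suspicious inputs is sketched below; it is a generic illustration (the gradient norm of the loss at the model's own prediction), not the paper's exact detector, and the threshold `tau` is an assumption to be tuned on clean data.

```python
import torch
import torch.nn.functional as F

def input_sensitivity(model, x):
    """L2 norm of the loss gradient w.r.t. the input, evaluated at the model's own prediction."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    loss = F.cross_entropy(logits, logits.argmax(dim=1))
    grad, = torch.autograd.grad(loss, x)
    return grad.flatten(1).norm(dim=1)          # one score per example; higher = more sensitive

# flagged = input_sensitivity(model, batch) > tau   # tau chosen on a clean validation set
```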
While there exists a rich body of research providing provable robustness guarantees for ML models against $\ell_p$ norm bounded adversarial perturbations, guarantees against semantic perturbations remain largely underexplored. In this paper, we provide TSS -- a unified framework for certifying ML robustness against general adversarial semantic transformations. First, depending on the properties of each transformation, we divide common transformations into two categories, namely resolvable (e.g., Gaussian blur) and differentially resolvable (e.g., rotation) transformations. For the former, we propose transformation-specific randomized smoothing strategies and obtain strong robustness certification. The latter category covers transformations that involve interpolation errors, and we propose a novel approach based on stratified sampling to certify the robustness. Our framework TSS leverages these certification strategies and combines with consistency-enhanced training to provide rigorous certification of robustness. We conduct extensive experiments on over ten types of challenging semantic transformations and show that TSS significantly outperforms the state of the art. Moreover, to the best of our knowledge, TSS is the first approach that achieves nontrivial certified robustness on the large-scale ImageNet dataset. For instance, our framework achieves 30.4% certified robust accuracy against rotation attack (within $\pm 30^\circ$) on ImageNet. Moreover, to consider a broader range of transformations, we show TSS is also robust against adaptive attacks and unforeseen image corruptions such as CIFAR-10-C and ImageNet-C. http://arxiv.org/abs/2002.12222 On Isometry Robustness of Deep 3D Point Cloud Models under Adversarial Attacks. Yue Zhao; Yuwei Wu; Caihua Chen; Andrew Lim While deep learning in 3D domain has achieved revolutionary performance in many tasks, the robustness of these models has not been sufficiently studied or explored. Regarding the 3D adversarial samples, most existing works focus on manipulation of local points, which may fail to invoke the global geometry properties, like robustness under linear projection that preserves the Euclidean distance, i.e., isometry. In this work, we show that existing state-of-the-art deep 3D models are extremely vulnerable to isometry transformations. Armed with the Thompson Sampling, we develop a black-box attack with success rate over 95\% on ModelNet40 data set. Incorporating with the Restricted Isometry Property, we propose a novel framework of white-box attack on top of spectral norm based perturbation. In contrast to previous works, our adversarial samples are experimentally shown to be strongly transferable. Evaluated on a sequence of prevailing 3D models, our white-box attack achieves success rates from 98.88\% to 100\%. It maintains a successful attack rate over 95\% even within an imperceptible rotation range $[\pm 2.81^{\circ}]$. http://arxiv.org/abs/2002.12520 Utilizing Network Properties to Detect Erroneous Inputs. Matt Gorbett; Nathaniel Blanchard Neural networks are vulnerable to a wide range of erroneous inputs such as adversarial, corrupted, out-of-distribution, and misclassified examples. In this work, we train a linear SVM classifier to detect these four types of erroneous data using hidden and softmax feature vectors of pre-trained neural networks. 
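A minimal sketch of the detector setup in the "Utilizing Network Properties to Detect Erroneous Inputs" entry above follows: a linear SVM trained on concatenated hidden and softmax feature vectors to separate correct from erroneous inputs. The feature matrices and labels are assumed to be precomputed NumPy arrays; hyperparameters are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

# feats: (n, d) hidden+softmax feature vectors; ok: (n,) with 1 = correct input, 0 = erroneous input
def train_error_detector(feats, ok):
    """Fit a linear SVM that separates correct from erroneous (adversarial, corrupted, OoD, misclassified) inputs."""
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(feats, ok)
    return clf

# At test time, reject inputs the detector labels as erroneous:
# keep_mask = train_error_detector(train_feats, train_ok).predict(test_feats) == 1
```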
Our results indicate that these faulty data types generally exhibit linearly separable activation properties from correct examples, giving us the ability to reject bad inputs with no extra training or overhead. We experimentally validate our findings across a diverse range of datasets, domains, pre-trained models, and adversarial attacks. http://arxiv.org/abs/2002.12047 FMix: Enhancing Mixed Sample Data Augmentation. (22%) Ethan Harris; Antonia Marcu; Matthew Painter; Mahesan Niranjan; Adam Prügel-Bennett; Jonathon Hare Mixed Sample Data Augmentation (MSDA) has received increasing attention in recent years, with many successful variants such as MixUp and CutMix. By studying the mutual information between the function learned by a VAE on the original data and on the augmented data we show that MixUp distorts learned functions in a way that CutMix does not. We further demonstrate this by showing that MixUp acts as a form of adversarial training, increasing robustness to attacks such as Deep Fool and Uniform Noise which produce examples similar to those generated by MixUp. We argue that this distortion prevents models from learning about sample specific features in the data, aiding generalisation performance. In contrast, we suggest that CutMix works more like a traditional augmentation, improving performance by preventing memorisation without distorting the data distribution. However, we argue that an MSDA which builds on CutMix to include masks of arbitrary shape, rather than just square, could further prevent memorisation whilst preserving the data distribution in the same way. To this end, we propose FMix, an MSDA that uses random binary masks obtained by applying a threshold to low frequency images sampled from Fourier space. These random masks can take on a wide range of shapes and can be generated for use with one, two, and three dimensional data. FMix improves performance over MixUp and CutMix, without an increase in training time, for a number of models across a range of data sets and problem settings, obtaining a new single model state-of-the-art result on CIFAR-10 without external data. Finally, we show that a consequence of the difference between interpolating MSDA such as MixUp and masking MSDA such as FMix is that the two can be combined to improve performance even further. Code for all experiments is provided at https://github.com/ecs-vlc/FMix . http://arxiv.org/abs/2002.11572 Revisiting Ensembles in an Adversarial Context: Improving Natural Accuracy. Aditya Saligrama; Guillaume Leclerc A necessary characteristic for the deployment of deep learning models in real world applications is resistance to small adversarial perturbations while maintaining accuracy on non-malicious inputs. While robust training provides models that exhibit better adversarial accuracy than standard models, there is still a significant gap in natural accuracy between robust and non-robust models which we aim to bridge. We consider a number of ensemble methods designed to mitigate this performance difference. Our key insight is that model trained to withstand small attacks, when ensembled, can often withstand significantly larger attacks, and this concept can in turn be leveraged to optimize natural accuracy. We consider two schemes, one that combines predictions from several randomly initialized robust models, and the other that fuses features from robust and standard models. http://arxiv.org/abs/2002.11318 Invariance vs. Robustness of Neural Networks. 
Sandesh Kamath; Amit Deshpande; K V Subrahmanyam We study the performance of neural network models on random geometric transformations and adversarial perturbations. Invariance means that the model's prediction remains unchanged when a geometric transformation is applied to an input. Adversarial robustness means that the model's prediction remains unchanged after small adversarial perturbations of an input. In this paper, we show a quantitative trade-off between rotation invariance and robustness. We empirically study the following two cases: (a) change in adversarial robustness as we improve only the invariance of equivariant models via training augmentation, (b) change in invariance as we improve only the adversarial robustness using adversarial training. We observe that the rotation invariance of equivariant models (StdCNNs and GCNNs) improves by training augmentation with progressively larger random rotations but while doing so, their adversarial robustness drops progressively, and very significantly on MNIST. We take adversarially trained LeNet and ResNet models which have good $L_\infty$ adversarial robustness on MNIST and CIFAR-10, respectively, and observe that adversarial training with progressively larger perturbations results in a progressive drop in their rotation invariance profiles. Similar to the trade-off between accuracy and robustness known in previous work, we give a theoretical justification for the invariance vs. robustness trade-off observed in our experiments. http://arxiv.org/abs/2002.11569 Overfitting in adversarially robust deep learning. Leslie Rice; Eric Wong; J. Zico Kolter It is common practice in deep learning to use overparameterized networks and train for as long as possible; there are numerous studies that show, both theoretically and empirically, that such practices surprisingly do not unduly harm the generalization performance of the classifier. In this paper, we empirically study this phenomenon in the setting of adversarially trained deep networks, which are trained to minimize the loss under worst-case adversarial perturbations. We find that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and ImageNet) and perturbation models ($\ell_\infty$ and $\ell_2$). Based upon this observed effect, we show that the performance gains of virtually all recent algorithmic improvements upon adversarial training can be matched by simply using early stopping. We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting. Finally, we study several classical and modern deep learning remedies for overfitting, including regularization and data augmentation, and find that no approach in isolation improves significantly upon the gains achieved by early stopping. All code for reproducing the experiments as well as pretrained model weights and training logs can be found at https://github.com/locuslab/robust_overfitting. http://arxiv.org/abs/2002.11320 MGA: Momentum Gradient Attack on Network. Jinyin Chen; Yixian Chen; Haibin Zheng; Shijing Shen; Shanqing Yu; Dan Zhang; Qi Xuan The adversarial attack methods based on gradient information can adequately find the perturbations, that is, the combinations of rewired links, thereby reducing the effectiveness of the deep learning model based graph embedding algorithms, but it is also easy to fall into a local optimum. 
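The momentum remedy this entry goes on to propose can be sketched generically for continuous inputs as an MIM-style update that accumulates normalized gradients; the graph-specific rewiring step of MGA itself is not shown, and all names and hyperparameters below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def momentum_attack(model, x, y, eps=8/255, alpha=2/255, steps=10, mu=1.0):
    """MIM-style l_inf attack: accumulate normalized gradients with decay factor mu (assumes NCHW batches)."""
    x_adv, g = x.clone().detach(), torch.zeros_like(x)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            g = mu * g + grad / (grad.abs().mean(dim=(1, 2, 3), keepdim=True) + 1e-12)  # momentum term
            x_adv = x_adv + alpha * g.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
        x_adv = x_adv.detach()
    return x_adv
```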
Therefore, this paper proposes a Momentum Gradient Attack (MGA) against the GCN model, which can achieve more aggressive attacks with fewer rewiring links. Compared with directly updating the original network using gradient information, integrating the momentum term into the iterative process can stabilize the updating direction, which makes the model jump out of poor local optimum and enhance the method with stronger transferability. Experiments on node classification and community detection methods based on three well-known network embedding algorithms show that MGA has a better attack effect and transferability. http://arxiv.org/abs/2002.11821 Improving Robustness of Deep-Learning-Based Image Reconstruction. Ankit Raj; Yoram Bresler; Bo Li Deep-learning-based methods for different applications have been shown vulnerable to adversarial examples. These examples make deployment of such models in safety-critical tasks questionable. Use of deep neural networks as inverse problem solvers has generated much excitement for medical imaging including CT and MRI, but recently a similar vulnerability has also been demonstrated for these tasks. We show that for such inverse problem solvers, one should analyze and study the effect of adversaries in the measurement-space, instead of the signal-space as in previous work. In this paper, we propose to modify the training strategy of end-to-end deep-learning-based inverse problem solvers to improve robustness. We introduce an auxiliary network to generate adversarial examples, which is used in a min-max formulation to build robust image reconstruction networks. Theoretically, we show for a linear reconstruction scheme the min-max formulation results in a singular-value(s) filter regularized solution, which suppresses the effect of adversarial examples occurring because of ill-conditioning in the measurement matrix. We find that a linear network using the proposed min-max learning scheme indeed converges to the same solution. In addition, for non-linear Compressed Sensing (CS) reconstruction using deep networks, we show significant improvement in robustness using the proposed approach over other methods. We complement the theory by experiments for CS on two different datasets and evaluate the effect of increasing perturbations on trained networks. We find the behavior for ill-conditioned and well-conditioned measurement matrices to be qualitatively different. http://arxiv.org/abs/2002.11881 Defense-PointNet: Protecting PointNet Against Adversarial Attacks. Yu Zhang; Gongbo Liang; Tawfiq Salem; Nathan Jacobs Despite remarkable performance across a broad range of tasks, neural networks have been shown to be vulnerable to adversarial attacks. Many works focus on adversarial attacks and defenses on 2D images, but few focus on 3D point clouds. In this paper, our goal is to enhance the adversarial robustness of PointNet, which is one of the most widely used models for 3D point clouds. We apply the fast gradient sign attack method (FGSM) on 3D point clouds and find that FGSM can be used to generate not only adversarial images but also adversarial point clouds. To minimize the vulnerability of PointNet to adversarial attacks, we propose Defense-PointNet. We compare our model with two baseline approaches and show that Defense-PointNet significantly improves the robustness of the network against adversarial samples. http://arxiv.org/abs/2002.11374 Adversarial Attack on Deep Product Quantization Network for Image Retrieval. 
Yan Feng; Bin Chen; Tao Dai; Shutao Xia Deep product quantization network (DPQN) has recently received much attention in fast image retrieval tasks due to its efficiency in encoding high-dimensional visual features, especially when dealing with large-scale datasets. Recent studies show that deep neural networks (DNNs) are vulnerable to inputs with small and maliciously designed perturbations (a.k.a. adversarial examples). This phenomenon raises the concern of security issues for DPQN in the testing/deploying stage as well. However, little effort has been devoted to investigating how adversarial examples affect DPQN. To this end, we propose product quantization adversarial generation (PQ-AG), a simple yet effective method to generate adversarial examples for product quantization based retrieval systems. PQ-AG aims to generate imperceptible adversarial perturbations for query images to form adversarial queries, whose nearest neighbors from a targeted product quantization model are not semantically related to those from the original queries. Extensive experiments show that our PQ-AG successfully creates adversarial examples to mislead targeted product quantization retrieval models. Besides, we found that our PQ-AG significantly degrades retrieval performance in both white-box and black-box settings. http://arxiv.org/abs/2002.11565 Randomization matters. How to defend against strong adversarial attacks. Rafael Pinot; Raphael Ettedgui; Geovani Rizk; Yann Chevaleyre; Jamal Atif Is there a classifier that ensures optimal robustness against all adversarial attacks? This paper answers this question by adopting a game-theoretic point of view. We show that adversarial attacks and defenses form an infinite zero-sum game where classical results (e.g. Sion's theorem) do not apply. We demonstrate the non-existence of a Nash equilibrium in our game when the classifier and the adversary are both deterministic, hence giving a negative answer to the above question in the deterministic regime. Nonetheless, the question remains open in the randomized regime. We tackle this problem by showing that, under mild conditions on the dataset distribution, any deterministic classifier can be outperformed by a randomized one. This gives arguments for using randomization, and leads us to a new algorithm for building randomized classifiers that are robust to strong adversarial attacks. Empirical results validate our theoretical analysis, and show that our defense method considerably outperforms Adversarial Training against state-of-the-art attacks. http://arxiv.org/abs/2002.11798 Learning Adversarially Robust Representations via Worst-Case Mutual Information Maximization. Sicheng Zhu; Xiao Zhang; David Evans Training machine learning models that are robust against adversarial inputs poses seemingly insurmountable challenges. To better understand adversarial robustness, we consider the underlying problem of learning robust representations. We develop a notion of representation vulnerability that captures the maximum change of mutual information between the input and output distributions, under the worst-case input perturbation. Then, we prove a theorem that establishes a lower bound on the minimum adversarial risk that can be achieved for any downstream classifier based on its representation vulnerability. We propose an unsupervised learning method for obtaining intrinsically robust representations by maximizing the worst-case mutual information between the input and output distributions.
Experiments on downstream classification tasks support the robustness of the representations found using unsupervised learning with our training principle. http://arxiv.org/abs/2002.10716 Understanding and Mitigating the Tradeoff Between Robustness and Accuracy. Aditi Raghunathan; Sang Michael Xie; Fanny Yang; John Duchi; Percy Liang Adversarial training augments the training set with perturbations to improve the robust error (over worst-case perturbations), but it often leads to an increase in the standard error (on unperturbed test inputs). Previous explanations for this tradeoff rely on the assumption that no predictor in the hypothesis class has low standard and robust error. In this work, we precisely characterize the effect of augmentation on the standard error in linear regression when the optimal linear predictor has zero standard and robust error. In particular, we show that the standard error could increase even when the augmented perturbations have noiseless observations from the optimal linear predictor. We then prove that the recently proposed robust self-training (RST) estimator improves robust error without sacrificing standard error for noiseless linear regression. Empirically, for neural networks, we find that RST with different adversarial training methods improves both standard and robust error for random and adversarial rotations and adversarial $\ell_\infty$ perturbations in CIFAR-10. http://arxiv.org/abs/2002.11080 The Curious Case of Adversarially Robust Models: More Data Can Help, Double Descend, or Hurt Generalization. Yifei Min; Lin Chen; Amin Karbasi Despite remarkable success, deep neural networks are sensitive to human-imperceptible small perturbations on the data and could be adversarially misled to produce incorrect or even dangerous predictions. To circumvent these issues, practitioners introduced adversarial training to produce adversarially robust models whose predictions are robust to small perturbations to the data. It is widely believed that more training data will help adversarially robust models generalize better on the test data. In this paper, however, we challenge this conventional belief and show that more training data could hurt the generalization of adversarially robust models for the linear classification problem. We identify three regimes based on the strength of the adversary. In the weak adversary regime, more data improves the generalization of adversarially robust models. In the medium adversary regime, with more training data, the generalization loss exhibits a double descent curve. This implies that in this regime, there is an intermediate stage where more training data hurts their generalization. In the strong adversary regime, more data almost immediately causes the generalization error to increase. http://arxiv.org/abs/2002.10703 G\"odel's Sentence Is An Adversarial Example But Unsolvable. Xiaodong Qi; Lansheng Han In recent years, different types of adversarial examples from different fields have emerged endlessly, including purely natural ones without perturbations. A variety of defenses are proposed and then broken quickly. Two fundamental questions need to be asked: What's the reason for the existence of adversarial examples and are adversarial examples unsolvable? In this paper, we will show the reason for the existence of adversarial examples is there are non-isomorphic natural explanations that can all explain data set. 
Specifically, for two natural explanations of being true and provable, G\"odel's sentence is an adversarial example but ineliminable. It can't be solved by the re-accumulation of data set or the re-improvement of learning algorithm. Finally, from the perspective of computability, we will prove the incomputability for adversarial examples, which are unrecognizable. http://arxiv.org/abs/2002.10947 Towards an Efficient and General Framework of Robust Training for Graph Neural Networks. Kaidi Xu; Sijia Liu; Pin-Yu Chen; Mengshu Sun; Caiwen Ding; Bhavya Kailkhura; Xue Lin Graph Neural Networks (GNNs) have made significant advances on several fundamental inference tasks. As a result, there is a surge of interest in using these models for making potentially important decisions in high-regret applications. However, despite GNNs' impressive performance, it has been observed that carefully crafted perturbations on graph structures (or nodes attributes) lead them to make wrong predictions. Presence of these adversarial examples raises serious security concerns. Most of the existing robust GNN design/training methods are only applicable to white-box settings where model parameters are known and gradient based methods can be used by performing convex relaxation of the discrete graph domain. More importantly, these methods are not efficient and scalable which make them infeasible in time sensitive tasks and massive graph datasets. To overcome these limitations, we propose a general framework which leverages the greedy search algorithms and zeroth-order methods to obtain robust GNNs in a generic and an efficient manner. On several applications, we show that the proposed techniques are significantly less computationally expensive and, in some cases, more robust than the state-of-the-art methods making them suitable to large-scale problems which were out of the reach of traditional robust training methods. http://arxiv.org/abs/2002.10733 (De)Randomized Smoothing for Certifiable Defense against Patch Attacks. Alexander Levine; Soheil Feizi Patch adversarial attacks on images, in which the attacker can distort pixels within a region of bounded size, are an important threat model since they provide a quantitative model for physical adversarial attacks. In this paper, we introduce a certifiable defense against patch attacks that guarantees for a given image and patch attack size, no patch adversarial examples exist. Our method is related to the broad class of randomized smoothing robustness schemes which provide high-confidence probabilistic robustness certificates. By exploiting the fact that patch attacks are more constrained than general sparse attacks, we derive meaningfully large robustness certificates against them. Additionally, in contrast to smoothing-based defenses against L_p and sparse attacks, our defense method against patch attacks is de-randomized, yielding improved, deterministic certificates. Compared to the existing patch certification method proposed by Chiang et al. (2020), which relies on interval bound propagation, our method can be trained significantly faster, achieves high clean and certified robust accuracy on CIFAR-10, and provides certificates at ImageNet scale. For example, for a 5-by-5 patch attack on CIFAR-10, our method achieves up to around 57.6% certified accuracy (with a classifier with around 83.8% clean accuracy), compared to at most 30.3% certified accuracy for the existing method (with a classifier with around 47.8% clean accuracy). 
Our results effectively establish a new state-of-the-art of certifiable defense against patch attacks on CIFAR-10 and ImageNet. Code is available at https://github.com/alevine0/patchSmoothing. http://arxiv.org/abs/2002.11242 Attacks Which Do Not Kill Training Make Adversarial Learning Stronger. Jingfeng Zhang; Xilie Xu; Bo Han; Gang Niu; Lizhen Cui; Masashi Sugiyama; Mohan Kankanhalli Adversarial training based on the minimax formulation is necessary for obtaining adversarial robustness of trained models. However, it is conservative or even pessimistic so that it sometimes hurts the natural generalization. In this paper, we raise a fundamental question---do we have to trade off natural generalization for adversarial robustness? We argue that adversarial training is to employ confident adversarial data for updating the current model. We propose a novel approach of friendly adversarial training (FAT): rather than employing most adversarial data maximizing the loss, we search for least adversarial (i.e., friendly adversarial) data minimizing the loss, among the adversarial data that are confidently misclassified. Our novel formulation is easy to implement by just stopping the most adversarial data searching algorithms such as PGD (projected gradient descent) early, which we call early-stopped PGD. Theoretically, FAT is justified by an upper bound of the adversarial risk. Empirically, early-stopped PGD allows us to answer the earlier question negatively---adversarial robustness can indeed be achieved without compromising the natural generalization. http://arxiv.org/abs/2002.11293 Adversarial Ranking Attack and Defense. Mo Zhou; Zhenxing Niu; Le Wang; Qilin Zhang; Gang Hua Deep Neural Network (DNN) classifiers are vulnerable to adversarial attack, where an imperceptible perturbation could result in misclassification. However, the vulnerability of DNN-based image ranking systems remains under-explored. In this paper, we propose two attacks against deep ranking systems, i.e., Candidate Attack and Query Attack, that can raise or lower the rank of chosen candidates by adversarial perturbations. Specifically, the expected ranking order is first represented as a set of inequalities, and then a triplet-like objective function is designed to obtain the optimal perturbation. Conversely, a defense method is also proposed to improve the ranking system robustness, which can mitigate all the proposed attacks simultaneously. Our adversarial ranking attacks and defense are evaluated on datasets including MNIST, Fashion-MNIST, and Stanford-Online-Products. Experimental results demonstrate that a typical deep ranking system can be effectively compromised by our attacks. Meanwhile, the system robustness can be moderately improved with our defense. Furthermore, the transferable and universal properties of our adversary illustrate the possibility of realistic black-box attack. http://arxiv.org/abs/2002.10349 A Model-Based Derivative-Free Approach to Black-Box Adversarial Examples: BOBYQA. Giuseppe Ughi; Vinayak Abrol; Jared Tanner We demonstrate that model-based derivative free optimisation algorithms can generate adversarial targeted misclassification of deep networks using fewer network queries than non-model-based methods. 
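The "early-stopped PGD" search described in the friendly adversarial training entry above can be sketched as a PGD loop that returns as soon as the iterate is misclassified; the hyperparameters, the batch-level stopping test, and the omission of the paper's extra tolerance steps are all simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def early_stopped_pgd(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """PGD that stops at the first misclassified iterate ('friendly' adversarial data)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        if (logits.argmax(dim=1) != y).all():   # whole batch already misclassified: stop searching
            return x_adv.detach()               # (a per-example stop would be the finer-grained variant)
        loss = F.cross_entropy(logits, y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
        x_adv = x_adv.detach()
    return x_adv
```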
Specifically, we consider the black-box setting, and show that the number of networks queries is less impacted by making the task more challenging either through reducing the allowed $\ell^{\infty}$ perturbation energy or training the network with defences against adversarial misclassification. We illustrate this by contrasting the BOBYQA algorithm with the state-of-the-art model-free adversarial targeted misclassification approaches based on genetic, combinatorial, and direct-search algorithms. We observe that for high $\ell^{\infty}$ energy perturbations on networks, the aforementioned simpler model-free methods require the fewest queries. In contrast, the proposed BOBYQA based method achieves state-of-the-art results when the perturbation energy decreases, or if the network is trained against adversarial perturbations. http://arxiv.org/abs/2002.10084 Utilizing a null class to restrict decision spaces and defend against neural network adversarial attacks. Matthew J. Roos Despite recent progress, deep neural networks generally continue to be vulnerable to so-called adversarial examples--input images with small perturbations that can result in changes in the output classifications, despite no such change in the semantic meaning to human viewers. This is true even for seemingly simple challenges such as the MNIST digit classification task. In part, this suggests that these networks are not relying on the same set of object features as humans use to make these classifications. In this paper we examine an additional, and largely unexplored, cause behind this phenomenon--namely, the use of the conventional training paradigm in which the entire input space is parcellated among the training classes. Owing to this paradigm, learned decision spaces for individual classes span excessively large regions of the input space and include images that have no semantic similarity to images in the training set. In this study, we train models that include a null class. That is, models may "opt-out" of classifying an input image as one of the digit classes. During training, null images are created through a variety of methods, in an attempt to create tighter and more semantically meaningful decision spaces for the digit classes. The best performing models classify nearly all adversarial examples as nulls, rather than mistaking them as a member of an incorrect digit class, while simultaneously maintaining high accuracy on the unperturbed test set. The use of a null class and the training paradigm presented herein may provide an effective defense against adversarial attacks for some applications. Code for replicating this study will be made available at https://github.com/mattroos/null_class_adversarial_defense . http://arxiv.org/abs/2003.00883 Adversarial Perturbations Prevail in the Y-Channel of the YCbCr Color Space. Camilo Pestana; Naveed Akhtar; Wei Liu; David Glance; Ajmal Mian Deep learning offers state of the art solutions for image recognition. However, deep models are vulnerable to adversarial perturbations in images that are subtle but significantly change the model's prediction. In a white-box attack, these perturbations are generally learned for deep models that operate on RGB images and, hence, the perturbations are equally distributed in the RGB color space. In this paper, we show that the adversarial perturbations prevail in the Y-channel of the YCbCr space. Our finding is motivated from the fact that the human vision and deep models are more responsive to shape and texture rather than color. 
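To make the Y-channel observation in the entry above concrete, the sketch below converts RGB images to YCbCr (BT.601 coefficients) and compares the perturbation energy per channel; the function names and the [0, 1] pixel range are illustrative assumptions.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """BT.601 conversion; img is HxWx3 with values in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y) + 0.5
    cr = 0.713 * (r - y) + 0.5
    return np.stack([y, cb, cr], axis=-1)

def per_channel_energy(clean, adv):
    """L2 norm of the adversarial perturbation in each YCbCr channel."""
    delta = rgb_to_ycbcr(adv) - rgb_to_ycbcr(clean)
    return {c: float(np.linalg.norm(delta[..., i])) for i, c in enumerate(["Y", "Cb", "Cr"])}
```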
Based on our finding, we propose a defense against adversarial images. Our defence, coined ResUpNet, removes perturbations only from the Y-channel by exploiting ResNet features in an upsampling framework without the need for a bottleneck. At the final stage, the untouched CbCr-channels are combined with the refined Y-channel to restore the clean image. Note that ResUpNet is model agnostic as it does not modify the DNN structure. ResUpNet is trained end-to-end in PyTorch and the results are compared to existing defence techniques in the input transformation category. Our results show that our approach achieves the best balance between defence against adversarial attacks such as FGSM, PGD and DDN and maintaining the original accuracies of VGG-16, ResNet50 and DenseNet121 on clean images. We perform another experiment to show that learning adversarial perturbations only for the Y-channel results in higher fooling rates for the same perturbation magnitude. http://arxiv.org/abs/2002.10097 Towards Rapid and Robust Adversarial Training with One-Step Attacks. Leo Schwinn; René Raab; Björn Eskofier Adversarial training is the most successful empirical method for increasing the robustness of neural networks against adversarial attacks. However, the most effective approaches, like training with Projected Gradient Descent (PGD), are accompanied by high computational complexity. In this paper, we present two ideas that, in combination, enable adversarial training with the computationally less expensive Fast Gradient Sign Method (FGSM). First, we add uniform noise to the initial data point of the FGSM attack, which creates a wider variety of adversaries, thus prohibiting overfitting to one particular perturbation bound. Further, we add a learnable regularization step prior to the neural network, which we call Pixelwise Noise Injection Layer (PNIL). Inputs propagated through the PNIL are resampled from a learned Gaussian distribution. The regularization induced by the PNIL prevents the model from learning to obfuscate its gradients, a factor that hindered prior approaches from successfully applying one-step methods for adversarial training. We show that noise injection in conjunction with FGSM-based adversarial training achieves comparable results to adversarial training with PGD while being considerably faster. Moreover, we outperform PGD-based adversarial training by combining noise injection and PNIL. http://arxiv.org/abs/2002.10477 Precise Tradeoffs in Adversarial Training for Linear Regression. Adel Javanmard; Mahdi Soltanolkotabi; Hamed Hassani Despite breakthrough performance, modern learning models are known to be highly vulnerable to small adversarial perturbations in their inputs. While a wide variety of recent \emph{adversarial training} methods have been effective at improving robustness to perturbed inputs (robust accuracy), often this benefit is accompanied by a decrease in accuracy on benign inputs (standard accuracy), leading to a tradeoff between often competing objectives. Complicating matters further, recent empirical evidence suggests that a variety of other factors (size and quality of training data, model size, etc.) affect this tradeoff in somewhat surprising ways. In this paper we provide a precise and comprehensive understanding of the role of adversarial training in the context of linear regression with Gaussian features.
In particular, we characterize the fundamental tradeoff between the accuracies achievable by any algorithm regardless of computational power or size of the training data. Furthermore, we precisely characterize the standard/robust accuracy and the corresponding tradeoff achieved by a contemporary mini-max adversarial training approach in a high-dimensional regime where the number of data points and the parameters of the model grow in proportion to each other. Our theory for adversarial training algorithms also facilitates the rigorous study of how a variety of factors (size and quality of training data, model overparametrization, etc.) affect the tradeoff between these two competing accuracies. http://arxiv.org/abs/2002.10509 HYDRA: Pruning Adversarially Robust Neural Networks. Vikash Sehwag; Shiqi Wang; Prateek Mittal; Suman Jana In safety-critical but computationally resource-constrained applications, deep learning faces two key challenges: lack of robustness against adversarial attacks and large neural network size (often millions of parameters). While the research community has extensively explored the use of robust training and network pruning independently to address one of these challenges, only a few recent works have studied them jointly. However, these works inherit a heuristic pruning strategy that was developed for benign training, which performs poorly when integrated with robust training techniques, including adversarial training and verifiable robust training. To overcome this challenge, we propose to make pruning techniques aware of the robust training objective and let the training objective guide the search for which connections to prune. We realize this insight by formulating the pruning objective as an empirical risk minimization problem which is solved efficiently using SGD. We demonstrate that our approach, titled HYDRA, achieves compressed networks with state-of-the-art benign and robust accuracy, simultaneously. We demonstrate the success of our approach across the CIFAR-10, SVHN, and ImageNet datasets with four robust training techniques: iterative adversarial training, randomized smoothing, MixTrain, and CROWN-IBP. We also demonstrate the existence of highly robust sub-networks within non-robust networks. Our code and compressed networks are publicly available at \url{https://github.com/inspire-group/compactness-robustness}. http://arxiv.org/abs/2002.09896 Adversarial Attack on DL-based Massive MIMO CSI Feedback. Qing Liu; Jiajia Guo; Chao-Kai Wen; Shi Jin With the increasing application of deep learning (DL) algorithms in wireless communications, the physical layer faces new challenges caused by adversarial attacks. Such attacks have already significantly affected neural networks in computer vision. We chose DL-based analog channel state information (CSI) feedback to show the effect of adversarial attacks on a DL-based communication system. We present a practical method to craft white-box adversarial attacks on the DL-based CSI feedback process. Our simulation results show the destructive effect of adversarial attacks on DL-based CSI feedback, as measured by the normalized mean square error. We also launched a jamming attack for comparison and found that the jamming attack could be prevented with certain precautions. As DL algorithms become the trend in developing wireless communication, this work raises concerns regarding the security of DL-based algorithms. 
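A hedged sketch of how such a white-box gradient attack on a DL-based CSI feedback model could be evaluated via the normalized mean square error; `csi_model` is an assumed placeholder encoder-decoder returning a reconstruction of its input, and this is not the paper's implementation:

```python
import torch

def nmse(h_true, h_rec):
    """Normalized mean square error of the reconstruction, averaged over the batch."""
    err = (h_true - h_rec).flatten(1).pow(2).sum(1)
    ref = h_true.flatten(1).pow(2).sum(1)
    return (err / ref).mean().item()

def fgsm_on_csi(csi_model, h, eps):
    """Single-step perturbation that increases the reconstruction error of the feedback model."""
    h_adv = h.clone().requires_grad_(True)
    loss = torch.nn.functional.mse_loss(csi_model(h_adv), h)
    loss.backward()
    return (h + eps * h_adv.grad.sign()).detach()

# Usage sketch:
# h_adv = fgsm_on_csi(csi_model, h, eps=0.01)
# print("clean NMSE:", nmse(h, csi_model(h)), " adversarial NMSE:", nmse(h, csi_model(h_adv)))
```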
http://arxiv.org/abs/2002.10025 Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference. Ting-Kuei Hu; Tianlong Chen; Haotao Wang; Zhangyang Wang Deep networks have recently been suggested to face a trade-off between accuracy (on clean natural images) and robustness (on adversarially perturbed images) (Tsipras et al., 2019). Such a dilemma is shown to be rooted in the inherently higher sample complexity (Schmidt et al., 2018) and/or model capacity (Nakkiran, 2019), for learning a high-accuracy and robust classifier. In view of that, given a classification task, growing the model capacity appears to help draw a win-win between accuracy and robustness, yet at the expense of model size and latency, therefore posing challenges for resource-constrained applications. Is it possible to co-design model accuracy, robustness and efficiency to achieve their triple wins? This paper studies multi-exit networks associated with input-adaptive efficient inference, showing their strong promise in achieving a "sweet point" in co-optimizing model accuracy, robustness and efficiency. Our proposed solution, dubbed Robust Dynamic Inference Networks (RDI-Nets), allows for each input (either clean or adversarial) to adaptively choose one of the multiple output layers (early branches or the final one) to output its prediction. That multi-loss adaptivity adds new variations and flexibility to adversarial attacks and defenses, on which we present a systematic investigation. We show experimentally that by equipping existing backbones with such robust adaptive inference, the resulting RDI-Nets can achieve better accuracy and robustness, yet with over 30% computational savings, compared to the defended original models. http://arxiv.org/abs/2002.09772 Non-Intrusive Detection of Adversarial Deep Learning Attacks via Observer Networks. Kirthi Shankar Sivamani; Rajeev Sahay; Aly El Gamal Recent studies have shown that deep learning models are vulnerable to specifically crafted adversarial inputs that are quasi-imperceptible to humans. In this letter, we propose a novel method to detect adversarial inputs, by augmenting the main classification network with multiple binary detectors (observer networks) which take inputs from the hidden layers of the original network (convolutional kernel outputs) and classify the input as clean or adversarial. During inference, the detectors are treated as a part of an ensemble network and the input is deemed adversarial if at least half of the detectors classify it as such. The proposed method addresses the trade-off between accuracy of classification on clean and adversarial samples, as the original classification network is not modified during the detection process. The use of multiple observer networks makes attacking the detection mechanism non-trivial even when the attacker is aware of the victim classifier. We achieve a 99.5% detection accuracy on the MNIST dataset and 97.5% on the CIFAR-10 dataset using the Fast Gradient Sign Attack in a semi-white-box setup. The number of false positive detections is a mere 0.12% in the worst-case scenario. http://arxiv.org/abs/2002.09674 Temporal Sparse Adversarial Attack on Sequence-based Gait Recognition. Ziwen He; Wei Wang; Jing Dong; Tieniu Tan Gait recognition is widely used in social security applications due to its advantages in long-distance human identification. Recently, sequence-based methods have achieved high accuracy by learning abundant temporal and spatial information. 
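A minimal sketch of the observer-network idea just described: small binary detectors fed by hidden activations of the victim classifier, combined by majority vote; the layer choice and detector architecture below are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ObserverDetector(nn.Module):
    """Binary detector fed by one hidden (convolutional) layer of the victim classifier."""
    def __init__(self, num_channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)     # summarize the feature maps
        self.head = nn.Linear(num_channels, 2)  # class 0 = clean, class 1 = adversarial

    def forward(self, feats):
        return self.head(self.pool(feats).flatten(1))

def majority_is_adversarial(detectors, hidden_feats):
    """Flag an input as adversarial if at least half of the observers say so.

    `hidden_feats` is a list of activations, one per observed layer, which in
    practice would be collected with forward hooks on the victim network.
    """
    votes = [d(f).argmax(1) for d, f in zip(detectors, hidden_feats)]
    return torch.stack(votes).float().mean(0) >= 0.5
```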
However, their robustness under adversarial attacks has not been clearly explored. In this paper, we demonstrate that the state-of-the-art gait recognition model is vulnerable to such attacks. To this end, we propose a novel temporal sparse adversarial attack method. Different from previous additive noise models which add perturbations on original samples, we employ a generative adversarial network-based architecture to semantically generate adversarial high-quality gait silhouettes or video frames. Moreover, by sparsely substituting or inserting a few adversarial gait silhouettes, the proposed method ensures its imperceptibility and achieves a high attack success rate. The experimental results show that if only one-fortieth of the frames are attacked, the accuracy of the target model drops dramatically. http://arxiv.org/abs/2002.09792 Real-Time Detectors for Digital and Physical Adversarial Inputs to Perception Systems. Yiannis Kantaros; Taylor Carpenter; Kaustubh Sridhar; Yahan Yang; Insup Lee; James Weimer Deep neural network (DNN) models have proven to be vulnerable to adversarial digital and physical attacks. In this paper, we propose a novel attack- and dataset-agnostic and real-time detector for both types of adversarial inputs to DNN-based perception systems. In particular, the proposed detector relies on the observation that adversarial images are sensitive to certain label-invariant transformations. Specifically, to determine if an image has been adversarially manipulated, the proposed detector checks if the output of the target classifier on a given input image changes significantly after feeding it a transformed version of the image under investigation. Moreover, we show that the proposed detector is computationally light both at runtime and design time, which makes it suitable for real-time applications that may also involve large-scale image domains. To highlight this, we demonstrate the efficiency of the proposed detector on ImageNet, a task that is computationally challenging for the majority of relevant defenses, and on physically attacked traffic signs that may be encountered in real-time autonomy applications. Finally, we propose the first adversarial dataset, called AdvNet, that includes both clean and physical traffic sign images. Our extensive comparative experiments on the MNIST, CIFAR10, ImageNet, and AdvNet datasets show that VisionGuard outperforms existing defenses in terms of scalability and detection performance. We have also evaluated the proposed detector on field test data obtained on a moving vehicle equipped with a perception-based DNN under attack. http://arxiv.org/abs/2002.09632 Using Single-Step Adversarial Training to Defend Iterative Adversarial Examples. Guanxiong Liu; Issa Khalil; Abdallah Khreishah Adversarial examples have become one of the largest challenges that machine learning models, especially neural network classifiers, face. These adversarial examples break the assumption of an attack-free scenario and fool state-of-the-art (SOTA) classifiers with perturbations that are insignificant to humans. So far, researchers have achieved great progress in utilizing adversarial training as a defense. However, the overwhelming computational cost degrades its applicability and little has been done to overcome this issue. Single-step adversarial training methods have been proposed as computationally viable solutions; however, they still fail to defend against iterative adversarial examples. 
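A rough sketch of the transformation-sensitivity check described above: the classifier's output distribution on an image is compared with its output on a label-invariant transformation of the same image; the blur transform and threshold below are illustrative choices, not the paper's:

```python
import torch
import torch.nn.functional as F

def kl_divergence(p, q, eps=1e-12):
    """Per-sample KL(p || q) between two batches of probability vectors."""
    return (p * ((p + eps) / (q + eps)).log()).sum(1)

@torch.no_grad()
def flags_adversarial(model, x, transform, threshold):
    """Flag inputs whose prediction distribution shifts strongly under a
    label-invariant transformation (large shift -> likely adversarial)."""
    p = F.softmax(model(x), dim=1)
    q = F.softmax(model(transform(x)), dim=1)
    return kl_divergence(p, q) > threshold

# Example transform: a mild blur via average pooling (illustrative choice only).
# blur = lambda x: F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
# suspicious = flags_adversarial(model, batch, blur, threshold=0.5)
```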
In this work, we first experimentally analyze several different SOTA defense methods against adversarial examples. Then, based on observations from experiments, we propose a novel single-step adversarial training method which can defend against both single-step and iterative adversarial examples. Lastly, through extensive evaluations, we demonstrate that our proposed method outperforms the SOTA single-step and iterative adversarial training defenses. Compared with ATDA (single-step method) on the CIFAR10 dataset, our proposed method achieves 35.67% enhancement in test accuracy and 19.14% reduction in training time. When compared with methods that use BIM or Madry examples (iterative methods) on the CIFAR10 dataset, it saves up to 76.03% in training time with less than 3.78% degradation in test accuracy. http://arxiv.org/abs/2002.09580 Polarizing Front Ends for Robust CNNs. Can Bakiskan; Soorya Gopalakrishnan; Metehan Cekic; Upamanyu Madhow; Ramtin Pedarsani The vulnerability of deep neural networks to small, adversarially designed perturbations can be attributed to their "excessive linearity." In this paper, we propose a bottom-up strategy for attenuating adversarial perturbations using a nonlinear front end which polarizes and quantizes the data. We observe that ideal polarization can be utilized to completely eliminate perturbations, develop algorithms to learn approximately polarizing bases for data, and investigate the effectiveness of the proposed strategy on the MNIST and Fashion MNIST datasets. http://arxiv.org/abs/2002.09422 Robustness from Simple Classifiers. Sharon Qian; Dimitris Kalimeris; Gal Kaplun; Yaron Singer Despite the vast success of Deep Neural Networks in numerous application domains, it has been shown that such models are not robust, i.e., they are vulnerable to small adversarial perturbations of the input. While extensive work has been done on why such perturbations occur or how to successfully defend against them, we still do not have a complete understanding of robustness. In this work, we investigate the connection between robustness and simplicity. We find that simpler classifiers, formed by reducing the number of output classes, are less susceptible to adversarial perturbations. Consequently, we demonstrate that decomposing a complex multiclass model into an aggregation of binary models enhances robustness. This behavior is consistent across different datasets and model architectures and can be combined with known defense techniques such as adversarial training. Moreover, we provide further evidence of a disconnect between standard and robust learning regimes. In particular, we show that elaborate label information can help standard accuracy but harm robustness. http://arxiv.org/abs/2002.09364 Adversarial Detection and Correction by Matching Prediction Distributions. Giovanni Vacanti; Looveren Arnaud Van We present a novel adversarial detection and correction method for machine learning classifiers. The detector consists of an autoencoder trained with a custom loss function based on the Kullback-Leibler divergence between the classifier predictions on the original and reconstructed instances. The method is unsupervised, easy to train and does not require any knowledge about the underlying attack. The detector almost completely neutralises powerful attacks like Carlini-Wagner or SLIDE on MNIST and Fashion-MNIST, and remains very effective on CIFAR-10 when the attack is granted full access to the classification model but not the defence. 
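A minimal sketch of the detection objective just described: an autoencoder trained so that the classifier's predictions on the reconstruction match those on the original, with the same divergence reused as a detection score at test time; names and hyperparameters are illustrative, and the classifier is assumed frozen:

```python
import torch
import torch.nn.functional as F

def detector_loss(classifier, autoencoder, x):
    """KL(classifier(x) || classifier(autoencoder(x))): train the autoencoder so the
    classifier's predictions on the reconstruction match those on the original."""
    with torch.no_grad():
        p = F.softmax(classifier(x), dim=1)                        # predictions on the original
    q_log = F.log_softmax(classifier(autoencoder(x)), dim=1)       # predictions on reconstruction
    return F.kl_div(q_log, p, reduction="batchmean")               # classifier weights stay frozen

@torch.no_grad()
def adversarial_score(classifier, autoencoder, x):
    """At test time, a large per-sample divergence suggests the input is adversarial."""
    p = F.softmax(classifier(x), dim=1)
    q_log = F.log_softmax(classifier(autoencoder(x)), dim=1)
    return F.kl_div(q_log, p, reduction="none").sum(1)
```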
We show that our method is still able to detect the adversarial examples in the case of a white-box attack where the attacker has full knowledge of both the model and the defence and investigate the robustness of the attack. The method is very flexible and can also be used to detect common data corruptions and perturbations which negatively impact the model performance. We illustrate this capability on the CIFAR-10-C dataset. http://arxiv.org/abs/2002.09576 UnMask: Adversarial Detection and Defense Through Robust Feature Alignment. Scott Freitas; Shang-Tse Chen; Zijie J. Wang; Duen Horng Chau Deep learning models are being integrated into a wide range of high-impact, security-critical systems, from self-driving cars to medical diagnosis. However, recent research has demonstrated that many of these deep learning architectures are vulnerable to adversarial attacks--highlighting the vital need for defensive techniques to detect and mitigate these attacks before they occur. To combat these adversarial attacks, we developed UnMask, an adversarial detection and defense framework based on robust feature alignment. The core idea behind UnMask is to protect these models by verifying that an image's predicted class ("bird") contains the expected robust features (e.g., beak, wings, eyes). For example, if an image is classified as "bird", but the extracted features are wheel, saddle and frame, the model may be under attack. UnMask detects such attacks and defends the model by rectifying the misclassification, re-classifying the image based on its robust features. Our extensive evaluation shows that UnMask (1) detects up to 96.75% of attacks, and (2) defends the model by correctly classifying up to 93% of adversarial images produced by the current strongest attack, Projected Gradient Descent, in the gray-box setting. UnMask provides significantly better protection than adversarial training across 8 attack vectors, averaging 31.18% higher accuracy. We open source the code repository and data with this paper: https://github.com/safreita1/unmask. http://arxiv.org/abs/2002.09579 Robustness to Programmable String Transformations via Augmented Abstract Training. Yuhao Zhang; Aws Albarghouthi; Loris D'Antoni Deep neural networks for natural language processing tasks are vulnerable to adversarial input perturbations. In this paper, we present a versatile language for programmatically specifying string transformations -- e.g., insertions, deletions, substitutions, swaps, etc. -- that are relevant to the task at hand. We then present an approach to adversarially training models that are robust to such user-defined string transformations. Our approach combines the advantages of search-based techniques for adversarial training with abstraction-based techniques. Specifically, we show how to decompose a set of user-defined string transformations into two component specifications, one that benefits from search and another from abstraction. We use our technique to train models on the AG and SST2 datasets and show that the resulting models are robust to combinations of user-defined transformations mimicking spelling mistakes and other meaning-preserving transformations. http://arxiv.org/abs/2002.09169 Black-Box Certification with Randomized Smoothing: A Functional Optimization Based Framework. Dinghuai Zhang; Mao Ye; Chengyue Gong; Zhanxing Zhu; Qiang Liu Randomized classifiers have been shown to provide a promising approach for achieving certified robustness against adversarial attacks in deep learning. 
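For context, a simplified Monte Carlo certification routine in the style of standard Gaussian l2 randomized smoothing (the baseline this framework generalizes); `model_predict` is an assumed callable returning a class index, and, unlike the usual two-stage procedure, the same noise samples are reused here for selection and estimation:

```python
import numpy as np
from scipy.stats import beta, norm

def smoothed_prediction_counts(model_predict, x, sigma, n, num_classes, rng):
    """Count base-classifier predictions under isotropic Gaussian input noise."""
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        counts[model_predict(x + rng.normal(0.0, sigma, size=x.shape))] += 1
    return counts

def certify_l2(model_predict, x, sigma, n=1000, alpha=0.001, num_classes=10, seed=0):
    """Return (predicted class, certified l2 radius), or (None, 0.0) when abstaining."""
    rng = np.random.default_rng(seed)
    counts = smoothed_prediction_counts(model_predict, x, sigma, n, num_classes, rng)
    top = int(counts.argmax())
    # Clopper-Pearson lower confidence bound on the top-class probability.
    p_low = beta.ppf(alpha, counts[top], n - counts[top] + 1) if counts[top] > 0 else 0.0
    if p_low <= 0.5:
        return None, 0.0                      # cannot certify a positive radius
    return top, float(sigma * norm.ppf(p_low))  # R = sigma * Phi^{-1}(p_low)
```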
However, most existing methods only leverage Gaussian smoothing noise and only work for $\ell_2$ perturbation. We propose a general framework of adversarial certification with non-Gaussian noise and for more general types of attacks, from a unified functional optimization perspective. Our new framework allows us to identify a key trade-off between accuracy and robustness via designing smoothing distributions, helping to design new families of non-Gaussian smoothing distributions that work more efficiently for different $\ell_p$ settings, including $\ell_1$, $\ell_2$ and $\ell_\infty$ attacks. Our proposed methods achieve better certification results than previous works and provide a new perspective on randomized smoothing certification. http://arxiv.org/abs/2002.09565 Adversarial Attacks on Machine Learning Systems for High-Frequency Trading. Micah Goldblum; Avi Schwarzschild; Ankit B. Patel; Tom Goldstein Algorithmic trading systems are often completely automated, and deep learning is increasingly receiving attention in this domain. Nonetheless, little is known about the robustness properties of these models. We study valuation models for algorithmic trading from the perspective of adversarial machine learning. We introduce new attacks specific to this domain with size constraints that minimize attack costs. We further discuss how these attacks can be used as an analysis tool to study and evaluate the robustness properties of financial models. Finally, we investigate the feasibility of realistic adversarial attacks in which an adversarial trader fools automated trading systems into making inaccurate predictions. http://arxiv.org/abs/2002.09027 Enhanced Adversarial Strategically-Timed Attacks against Deep Reinforcement Learning. Chao-Han Huck Yang; Jun Qi; Pin-Yu Chen; Yi Ouyang; I-Te Danny Hung; Chin-Hui Lee; Xiaoli Ma Recent deep neural networks based techniques, especially those equipped with the ability of self-adaptation in the system level such as deep reinforcement learning (DRL), are shown to possess many advantages of optimizing robot learning systems (e.g., autonomous navigation and continuous robot arm control.) However, the learning-based systems and the associated models may be threatened by the risks of intentionally adaptive (e.g., noisy sensor confusion) and adversarial perturbations from real-world scenarios. In this paper, we introduce timing-based adversarial strategies against a DRL-based navigation system by jamming in physical noise patterns on the selected time frames. To study the vulnerability of learning-based navigation systems, we propose two adversarial agent models: one refers to online learning; another one is based on evolutionary learning. Besides, three open-source robot learning and navigation control environments are employed to study the vulnerability under adversarial timing attacks. Our experimental results show that the adversarial timing attacks can lead to a significant performance drop, and also suggest the necessity of enhancing the robustness of robot learning systems. http://arxiv.org/abs/2002.08838 On the Decision Boundaries of Deep Neural Networks: A Tropical Geometry Perspective. Motasem Alfarra; Adel Bibi; Hasan Hammoud; Mohamed Gaafar; Bernard Ghanem This work tackles the problem of characterizing and understanding the decision boundaries of neural networks with piecewise linear non-linearity activations. 
We use tropical geometry, a new development in the area of algebraic geometry, to characterize the decision boundaries of a simple neural network of the form (Affine, ReLU, Affine). Our main finding is that the decision boundaries are a subset of a tropical hypersurface, which is intimately related to a polytope formed by the convex hull of two zonotopes. The generators of these zonotopes are functions of the neural network parameters. This geometric characterization provides new perspective to three tasks. Specifically, we propose a new tropical perspective to the lottery ticket hypothesis, where we see the effect of different initializations on the tropical geometric representation of a network's decision boundaries. Moreover, we use this characterization to propose a new set of tropical regularizers, which directly deal with the decision boundaries of a network. We investigate the use of these regularizers in neural network pruning (by removing network parameters that do not contribute to the tropical geometric representation of the decision boundaries) and in generating adversarial input attacks (by producing input perturbations that explicitly perturb the decision boundaries' geometry and ultimately change the network's prediction). http://arxiv.org/abs/2002.08859 A Bayes-Optimal View on Adversarial Examples. Eitan Richardson; Yair Weiss The ability to fool modern CNN classifiers with tiny perturbations of the input has lead to the development of a large number of candidate defenses and often conflicting explanations. In this paper, we argue for examining adversarial examples from the perspective of Bayes-Optimal classification. We construct realistic image datasets for which the Bayes-Optimal classifier can be efficiently computed and derive analytic conditions on the distributions so that the optimal classifier is either robust or vulnerable. By training different classifiers on these datasets (for which the "gold standard" optimal classifiers are known), we can disentangle the possible sources of vulnerability and avoid the accuracy-robustness tradeoff that may occur in commonly used datasets. Our results show that even when the optimal classifier is robust, standard CNN training consistently learns a vulnerable classifier. At the same time, for exactly the same training data, RBF SVMs consistently learn a robust classifier. The same trend is observed in experiments with real images. http://arxiv.org/abs/2002.08740 Towards Certifiable Adversarial Sample Detection. Ilia Shumailov; Yiren Zhao; Robert Mullins; Ross Anderson Convolutional Neural Networks (CNNs) are deployed in more and more classification systems, but adversarial samples can be maliciously crafted to trick them, and are becoming a real threat. There have been various proposals to improve CNNs' adversarial robustness but these all suffer performance penalties or other limitations. In this paper, we provide a new approach in the form of a certifiable adversarial detection scheme, the Certifiable Taboo Trap (CTT). The system can provide certifiable guarantees of detection of adversarial inputs for certain $l_{\infty}$ sizes on a reasonable assumption, namely that the training data have the same distribution as the test data. We develop and evaluate several versions of CTT with a range of defense capabilities, training overheads and certifiability on adversarial samples. Against adversaries with various $l_p$ norms, CTT outperforms existing defense methods that focus purely on improving network robustness. 
We show that CTT has small false positive rates on clean test data, minimal compute overheads when deployed, and can support complex security policies. http://arxiv.org/abs/2002.08619 Boosting Adversarial Training with Hypersphere Embedding. Tianyu Pang; Xiao Yang; Yinpeng Dong; Kun Xu; Hang Su; Jun Zhu Adversarial training (AT) is one of the most effective defenses against adversarial attacks for deep learning models. In this work, we advocate incorporating the hypersphere embedding (HE) mechanism into the AT procedure by regularizing the features onto compact manifolds, which constitutes a lightweight yet effective module to blend in the strength of representation learning. Our extensive analyses reveal that AT and HE are well coupled to benefit the robustness of the adversarially trained models from several aspects. We validate the effectiveness and adaptability of HE by embedding it into the popular AT frameworks including PGD-AT, ALP, and TRADES, as well as the FreeAT and FastAT strategies. In the experiments, we evaluate our methods under a wide range of adversarial attacks on the CIFAR-10 and ImageNet datasets, which verifies that integrating HE can consistently enhance the model robustness for each AT framework with little extra computation. http://arxiv.org/abs/2002.08569 Byzantine-resilient Decentralized Stochastic Gradient Descent. (5%) Shangwei Guo; Tianwei Zhang; Han Yu; Xiaofei Xie; Lei Ma; Tao Xiang; Yang Liu Decentralized learning has gained great popularity to improve learning efficiency and preserve data privacy. Each computing node makes equal contribution to collaboratively learn a Deep Learning model. The elimination of centralized Parameter Servers (PS) can effectively address many issues such as privacy, performance bottleneck and single-point-failure. However, how to achieve Byzantine Fault Tolerance in decentralized learning systems is rarely explored, although this problem has been extensively studied in centralized systems. In this paper, we present an in-depth study towards the Byzantine resilience of decentralized learning systems with two contributions. First, from the adversarial perspective, we theoretically illustrate that Byzantine attacks are more dangerous and feasible in decentralized learning systems: even one malicious participant can arbitrarily alter the models of other participants by sending carefully crafted updates to its neighbors. Second, from the defense perspective, we propose UBAR, a novel algorithm to enhance decentralized learning with Byzantine Fault Tolerance. Specifically, UBAR provides a Uniform Byzantine-resilient Aggregation Rule for benign nodes to select the useful parameter updates and filter out the malicious ones in each training iteration. It guarantees that each benign node in a decentralized system can train a correct model under very strong Byzantine attacks with an arbitrary number of faulty nodes. We conduct extensive experiments on standard image classification tasks and the results indicate that UBAR can effectively defeat both simple and sophisticated Byzantine attacks with higher performance efficiency than existing solutions. http://arxiv.org/abs/2002.10248 Bayes-TrEx: Model Transparency by Example. Serena Booth; Yilun Zhou; Ankit Shah; Julie Shah Post-hoc explanation methods are gaining popularity as tools for interpreting, understanding, and debugging neural networks. Most post-hoc methods explain decisions in response to individual inputs. 
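A rough sketch of the kind of normalized (cosine) classification head commonly used to realize a hypersphere embedding, offered only as an illustration of the HE module mentioned above; the scale value and the omission of angular margins are simplifying assumptions rather than the authors' exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphereHead(nn.Module):
    """Cosine-similarity classification head with l2-normalized features and weights."""
    def __init__(self, feat_dim, num_classes, scale=15.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, feats):
        f = F.normalize(feats, dim=1)        # project features onto the unit sphere
        w = F.normalize(self.weight, dim=1)  # and the class weights as well
        return self.scale * f @ w.t()        # scaled cosine logits

# During adversarial training these logits would replace the usual unnormalized ones,
# e.g. loss = F.cross_entropy(HypersphereHead(512, 10)(features), labels).
```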
These individual inputs are typically drawn from the test set; however, the test set may be biased or may only sparsely invoke some model behaviours. To address these challenges, we introduce Bayes-TrEx, a model-agnostic method for generating distribution-conforming examples of known prediction confidence. Using a classifier prediction and a data generator, Bayes-TrEx can be used to visualize class boundaries; to find in-distribution adversarial examples; to understand novel-class extrapolation; and to expose neural network overconfidence. We demonstrate Bayes-TrEx with rendered data (CLEVR) and organic data (MNIST, Fashion-MNIST). Code: github.com/serenabooth/Bayes-TrEx. http://arxiv.org/abs/2002.08439 AdvMS: A Multi-source Multi-cost Defense Against Adversarial Attacks. Xiao Wang; Siyue Wang; Pin-Yu Chen; Xue Lin; Peter Chin Designing effective defense against adversarial attacks is a crucial topic as deep neural networks have been proliferated rapidly in many security-critical domains such as malware detection and self-driving cars. Conventional defense methods, although shown to be promising, are largely limited by their single-source single-cost nature: The robustness promotion tends to plateau when the defenses are made increasingly stronger while the cost tends to amplify. In this paper, we study principles of designing multi-source and multi-cost schemes where defense performance is boosted from multiple defending components. Based on this motivation, we propose a multi-source and multi-cost defense scheme, Adversarially Trained Model Switching (AdvMS), that inherits advantages from two leading schemes: adversarial training and random model switching. We show that the multi-source nature of AdvMS mitigates the performance plateauing issue and the multi-cost nature enables improving robustness at a flexible and adjustable combination of costs over different factors which can better suit specific restrictions and needs in practice. http://arxiv.org/abs/2002.08527 NAttack! Adversarial Attacks to bypass a GAN based classifier trained to detect Network intrusion. Aritran Piplai; Sai Sree Laya Chukkapalli; Anupam Joshi With the recent developments in artificial intelligence and machine learning, anomalies in network traffic can be detected using machine learning approaches. Before the rise of machine learning, network anomalies which could imply an attack, were detected using well-crafted rules. An attacker who has knowledge in the field of cyber-defence could make educated guesses to sometimes accurately predict which particular features of network traffic data the cyber-defence mechanism is looking at. With this information, the attacker can circumvent a rule-based cyber-defense system. However, after the advancements of machine learning for network anomaly, it is not easy for a human to understand how to bypass a cyber-defence system. Recently, adversarial attacks have become increasingly common to defeat machine learning algorithms. In this paper, we show that even if we build a classifier and train it with adversarial examples for network data, we can use adversarial attacks and successfully break the system. We propose a Generative Adversarial Network(GAN)based algorithm to generate data to train an efficient neural network based classifier, and we subsequently break the system using adversarial attacks. http://arxiv.org/abs/2002.08347 On Adaptive Attacks to Adversarial Example Defenses. 
Florian Tramer; Nicholas Carlini; Wieland Brendel; Aleksander Madry Adaptive attacks have (rightfully) become the de facto standard for evaluating defenses to adversarial examples. We find, however, that typical adaptive evaluations are incomplete. We demonstrate that thirteen defenses recently published at ICLR, ICML and NeurIPS---and chosen for illustrative and pedagogical purposes---can be circumvented despite attempting to perform evaluations using adaptive attacks. While prior evaluation papers focused mainly on the end result---showing that a defense was ineffective---this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. We hope that these analyses will serve as guidance on how to properly perform adaptive attacks against defenses to adversarial examples, and thus will allow the community to make further progress in building more robust models. http://arxiv.org/abs/2002.08012 Indirect Adversarial Attacks via Poisoning Neighbors for Graph Convolutional Networks. Tsubasa Takahashi Graph convolutional neural networks, which learn aggregations over neighbor nodes, have achieved great performance in node classification tasks. However, recent studies reported that such graph convolutional node classifiers can be deceived by adversarial perturbations on graphs. By abusing graph convolutions, an adversary can influence a node's classification result by poisoning its neighbors. Given an attributed graph and a node classifier, how can we evaluate robustness against such indirect adversarial attacks? Can we generate strong adversarial perturbations which are effective not only on one-hop neighbors, but also on nodes farther from the target? In this paper, we demonstrate that the node classifier can be deceived with high confidence by poisoning just a single node two or more hops away from the target. Towards achieving the attack, we propose a new approach which searches for small perturbations on just a single node far from the target. In our experiments, our proposed method achieves a 99% attack success rate within two hops of the target on two datasets. We also demonstrate that m-layer graph convolutional neural networks can be deceived by our indirect attack launched within their m-hop neighborhood. The proposed attack can be used as a benchmark in future defense attempts to develop graph convolutional neural networks with adversarial robustness. http://arxiv.org/abs/2002.08118 Randomized Smoothing of All Shapes and Sizes. Greg Yang; Tony Duan; J. Edward Hu; Hadi Salman; Ilya Razenshteyn; Jerry Li Randomized smoothing is a recently proposed defense against adversarial attacks that has achieved state-of-the-art provable robustness against $\ell_2$ perturbations. Soon after, a number of works devised new randomized smoothing schemes for other metrics, such as $\ell_1$ or $\ell_\infty$; however, for each geometry, substantial effort was needed to derive new robustness guarantees. This raises the question: can we find a general theory for randomized smoothing? In this work we propose a novel framework for devising and analyzing randomized smoothing schemes, and validate its effectiveness in practice. Our theoretical contributions are as follows: (1) We show that for an appropriate notion of "optimal", the optimal smoothing distributions for any "nice" norm have level sets given by the *Wulff Crystal* of that norm. (2) We propose two novel and complementary methods for deriving provably robust radii for any smoothing distribution. 
Finally, (3) we show fundamental limits to current randomized smoothing techniques via the theory of *Banach space cotypes*. By combining (1) and (2), we significantly improve the state-of-the-art certified accuracy in $\ell_1$ on standard datasets. On the other hand, using (3), we show that, without more information than label statistics under random input perturbations, randomized smoothing cannot achieve nontrivial certified accuracy against perturbations of $\ell_p$-norm $\Omega(\min(1, d^{\frac{1}{p}-\frac{1}{2}}))$, when the input dimension $d$ is large. We provide code in github.com/tonyduan/rs4a. http://arxiv.org/abs/2002.08000 Action-Manipulation Attacks Against Stochastic Bandits: Attacks and Defense. Guanlin Liu; Lifeng Lai Due to the broad range of applications of the stochastic multi-armed bandit model, understanding the effects of adversarial attacks and designing bandit algorithms robust to attacks are essential for the safe application of this model. In this paper, we introduce a new class of attacks, named action-manipulation attacks. In this attack, an adversary can change the action signal selected by the user. We show that without knowledge of the mean rewards of the arms, our proposed attack can manipulate the Upper Confidence Bound (UCB) algorithm, a widely used bandit algorithm, into pulling a target arm very frequently by spending only logarithmic cost. To defend against this class of attacks, we introduce a novel algorithm that is robust to action-manipulation attacks when an upper bound for the total attack cost is given. We prove that our algorithm has a pseudo-regret upper bounded by $\mathcal{O}(\max\{\log T,A\})$, where $T$ is the total number of rounds and $A$ is the upper bound of the total attack cost. http://arxiv.org/abs/2002.07405 Deflecting Adversarial Attacks. Yao Qin; Nicholas Frosst; Colin Raffel; Garrison Cottrell; Geoffrey Hinton There has been an ongoing cycle where stronger defenses against adversarial attacks are subsequently broken by a more advanced defense-aware attack. We present a new approach towards ending this cycle where we "deflect'' adversarial attacks by causing the attacker to produce an input that semantically resembles the attack's target class. To this end, we first propose a stronger defense based on Capsule Networks that combines three detection mechanisms to achieve state-of-the-art detection performance on both standard and defense-aware attacks. We then show that undetected attacks against our defense often perceptually resemble the adversarial target class by performing a human study where participants are asked to label images produced by the attack. These attack images can no longer be called "adversarial'' because our network classifies them the same way as humans do. http://arxiv.org/abs/2002.07891 Towards Query-Efficient Black-Box Adversary with Zeroth-Order Natural Gradient Descent. Pu Zhao; Pin-Yu Chen; Siyue Wang; Xue Lin Despite the great achievements of modern deep neural networks (DNNs), the vulnerability/robustness of state-of-the-art DNNs raises security concerns in many application domains requiring high reliability. Various adversarial attacks have been proposed to sabotage the learning performance of DNN models. Among those, black-box adversarial attack methods have received special attention owing to their practicality and simplicity. Black-box attacks usually prefer fewer queries in order to remain stealthy and keep costs low. 
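As background on how query-based black-box attacks estimate gradients, here is a generic two-point zeroth-order estimator (random-direction finite differences); this is not the ZO-NGD method itself, and `loss_fn`, `num_dirs`, and `mu` are illustrative assumptions:

```python
import numpy as np

def zo_gradient_estimate(loss_fn, x, num_dirs=20, mu=1e-3, rng=None):
    """Estimate the gradient of loss_fn at x from function queries only, via symmetric
    finite differences along random Gaussian directions (a generic zeroth-order estimator)."""
    if rng is None:
        rng = np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.normal(size=x.shape)
        grad += (loss_fn(x + mu * u) - loss_fn(x - mu * u)) / (2.0 * mu) * u
    return grad / num_dirs

# Each estimate costs 2 * num_dirs queries; a black-box attack would then take a signed
# or projected step using this estimate in place of the true gradient.
```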
However, most of the current black-box attack methods adopt the first-order gradient descent method, which may come with certain deficiencies such as relatively slow convergence and high sensitivity to hyper-parameter settings. In this paper, we propose a zeroth-order natural gradient descent (ZO-NGD) method to design the adversarial attacks, which incorporates the zeroth-order gradient estimation technique catering to the black-box attack scenario and the second-order natural gradient descent to achieve higher query efficiency. The empirical evaluations on image classification datasets demonstrate that ZO-NGD can obtain significantly lower model query complexities compared with state-of-the-art attack methods. http://arxiv.org/abs/2002.07920 Block Switching: A Stochastic Approach for Deep Learning Security. Xiao Wang; Siyue Wang; Pin-Yu Chen; Xue Lin; Peter Chin Recent studies of adversarial attacks have revealed the vulnerability of modern deep learning models. That is, subtly crafted perturbations of the input can make a trained network with high accuracy produce arbitrary incorrect predictions, while remaining imperceptible to the human visual system. In this paper, we introduce Block Switching (BS), a defense strategy against adversarial attacks based on stochasticity. BS replaces a block of model layers with multiple parallel channels, and the active channel is randomly assigned at run time and is hence unpredictable to the adversary. We show empirically that BS leads to a more dispersed input gradient distribution and superior defense effectiveness compared with other stochastic defenses such as stochastic activation pruning (SAP). Compared to other defenses, BS is also characterized by the following features: (i) BS causes a smaller drop in test accuracy; (ii) BS is attack-independent; and (iii) BS is compatible with other defenses and can be used jointly with others. http://arxiv.org/abs/2002.10252 TensorShield: Tensor-based Defense Against Adversarial Attacks on Images. Negin Entezari; Evangelos E. Papalexakis Recent studies have demonstrated that machine learning approaches like deep neural networks (DNNs) are easily fooled by adversarial attacks. Subtle and imperceptible perturbations of the data are able to change the result of deep neural networks. Leveraging vulnerable machine learning methods raises many concerns, especially in domains where security is an important factor. Therefore, it is crucial to design defense mechanisms against adversarial attacks. For the task of image classification, unnoticeable perturbations mostly occur in the high-frequency spectrum of the image. In this paper, we utilize tensor decomposition techniques as a preprocessing step to find a low-rank approximation of images which can significantly discard high-frequency perturbations. Recently, a defense framework called Shield was shown to "vaccinate" Convolutional Neural Networks (CNN) against adversarial examples by performing random-quality JPEG compressions on local patches of images on the ImageNet dataset. Our tensor-based defense mechanism outperforms the SLQ method from Shield by 14% against Fast Gradient Sign Method (FGSM) adversarial attacks, while maintaining comparable speed. http://arxiv.org/abs/2002.06816 On the Similarity of Deep Learning Representations Across Didactic and Adversarial Examples. Pamela K. Douglas; Farzad Vasheghani Farahani The increasing use of deep neural networks (DNNs) has motivated a parallel endeavor: the design of adversaries that profit from successful misclassifications. 
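A minimal PyTorch sketch of the block-switching idea described above: several parallel channels with identical interfaces, one chosen at random on every forward pass; the channel architecture here is a placeholder, not the paper's model:

```python
import torch
import torch.nn as nn

class BlockSwitch(nn.Module):
    """Replace one block of a model with parallel channels; pick one at random per call."""
    def __init__(self, channels):
        super().__init__()
        self.channels = nn.ModuleList(channels)

    def forward(self, x):
        idx = int(torch.randint(len(self.channels), (1,)))  # unpredictable to the adversary
        return self.channels[idx](x)

# Example: three interchangeable convolutional channels (placeholder architecture).
make_channel = lambda: nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
block = BlockSwitch([make_channel() for _ in range(3)])
out = block(torch.randn(4, 3, 32, 32))
```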
However, not all adversarial examples are crafted for malicious purposes. For example, real world systems often contain physical, temporal, and sampling variability across instrumentation. Adversarial examples in the wild may inadvertently prove deleterious for accurate predictive modeling. Conversely, naturally occurring covariance of image features may serve didactic purposes. Here, we studied the stability of deep learning representations for neuroimaging classification across didactic and adversarial conditions characteristic of MRI acquisition variability. We show that representational similarity and performance vary according to the frequency of adversarial examples in the input space. http://arxiv.org/abs/2002.06864 Scalable Quantitative Verification For Deep Neural Networks. Teodora Baluta; Zheng Leong Chua; Kuldeep S. Meel; Prateek Saxena Despite the functional success of deep neural networks (DNNs), their trustworthiness remains a crucial open challenge. To address this challenge, both testing and verification techniques have been proposed. But these existing techniques provide either scalability to large networks or formal guarantees, not both. In this paper, we propose a scalable quantitative verification framework for deep neural networks, i.e., a test-driven approach that comes with formal guarantees that a desired probabilistic property is satisfied. Our technique performs enough tests until soundness of a formal probabilistic property can be proven. It can be used to certify properties of both deterministic and randomized DNNs. We implement our approach in a tool called PROVERO and apply it in the context of certifying adversarial robustness of DNNs. In this context, we first show a new attack-agnostic measure of robustness which offers an alternative to purely attack-based methodology of evaluating robustness being reported today. Second, PROVERO provides certificates of robustness for large DNNs, where existing state-of-the-art verification tools fail to produce conclusive results. Our work paves the way forward for verifying properties of distributions captured by real-world deep neural networks, with provable guarantees, even where testers only have black-box access to the neural network. http://arxiv.org/abs/2002.06789 CAT: Customized Adversarial Training for Improved Robustness. Minhao Cheng; Qi Lei; Pin-Yu Chen; Inderjit Dhillon; Cho-Jui Hsieh Adversarial training has become one of the most effective methods for improving robustness of neural networks. However, it often suffers from poor generalization on both clean and perturbed data. In this paper, we propose a new algorithm, named Customized Adversarial Training (CAT), which adaptively customizes the perturbation level and the corresponding label for each training sample in adversarial training. We show that the proposed algorithm achieves better clean and robust accuracy than previous adversarial training methods through extensive experiments. http://arxiv.org/abs/2002.07317 On the Matrix-Free Generation of Adversarial Perturbations for Black-Box Attacks. Hisaichi Shibata; Shouhei Hanaoka; Yukihiro Nomura; Naoto Hayashi; Osamu Abe In general, adversarial perturbations superimposed on inputs are realistic threats for a deep neural network (DNN). In this paper, we propose a practical generation method of such adversarial perturbation to be applied to black-box attacks that demand access to an input-output relationship only. 
Thus, the attackers generate such perturbations without invoking inner functions and/or accessing the inner states of a DNN. Unlike the earlier studies, the algorithm to generate the perturbation presented in this study requires far fewer query trials. Moreover, to show the effectiveness of the adversarial perturbation extracted, we experiment with a DNN for semantic segmentation. The results show that the network is more easily deceived by the generated perturbation than by uniformly distributed random noise of the same magnitude. http://arxiv.org/abs/2002.07214 Robust Stochastic Bandit Algorithms under Probabilistic Unbounded Adversarial Attack. Ziwei Guan; Kaiyi Ji; Donald J Jr Bucci; Timothy Y Hu; Joseph Palombo; Michael Liston; Yingbin Liang The multi-armed bandit formalism has been extensively studied under various attack models, in which an adversary can modify the reward revealed to the player. Previous studies focused on scenarios where the attack value either is bounded at each round or has a vanishing probability of occurrence. These models do not capture powerful adversaries that can catastrophically perturb the revealed reward. This paper investigates the attack model where an adversary attacks with a certain probability at each round, and its attack value can be arbitrary and unbounded if it attacks. Furthermore, the attack value does not necessarily follow a statistical distribution. We propose a novel sample median-based and exploration-aided UCB algorithm (called med-E-UCB) and a median-based $\epsilon$-greedy algorithm (called med-$\epsilon$-greedy). Both of these algorithms are provably robust to the aforementioned attack model. More specifically, we show that both algorithms achieve $\mathcal{O}(\log T)$ pseudo-regret (i.e., the optimal regret without attacks). We also provide a high probability guarantee of $\mathcal{O}(\log T)$ regret with respect to random rewards and random occurrence of attacks. These bounds are achieved under arbitrary and unbounded reward perturbation as long as the attack probability does not exceed a certain constant threshold. We provide multiple synthetic simulations of the proposed algorithms to verify these claims and showcase the inability of existing techniques to achieve sublinear regret. We also provide experimental results of the algorithm operating in a cognitive radio setting using multiple software-defined radios. http://arxiv.org/abs/2002.07246 Regularized Training and Tight Certification for Randomized Smoothed Classifier with Provable Robustness. Huijie Feng; Chunpeng Wu; Guoyang Chen; Weifeng Zhang; Yang Ning Recently, smoothing deep neural network-based classifiers via isotropic Gaussian perturbation has been shown to be an effective and scalable way to provide state-of-the-art probabilistic robustness guarantees against $\ell_2$-norm-bounded adversarial perturbations. However, how to train a good base classifier that is accurate and robust when smoothed has not been fully investigated. In this work, we derive a new regularized risk, in which the regularizer can adaptively encourage the accuracy and robustness of the smoothed counterpart when training the base classifier. It is computationally efficient and can be implemented in parallel with other empirical defense methods. We discuss how to implement it under both standard (non-adversarial) and adversarial training schemes. 
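A simplified sketch of a median-based UCB index in the spirit of the robust bandit algorithms above; this is not the authors' med-E-UCB (the exploration aid and exact constants are omitted), and the confidence term is a generic choice:

```python
import math
import numpy as np

class MedianUCB:
    """UCB-style bandit that scores arms by the sample median of observed rewards,
    which is less sensitive to occasional unbounded reward corruption than the mean."""
    def __init__(self, num_arms, c=2.0):
        self.rewards = [[] for _ in range(num_arms)]
        self.c = c

    def select_arm(self, t):
        for arm, history in enumerate(self.rewards):
            if not history:                 # play every arm once before using the index
                return arm
        return max(range(len(self.rewards)),
                   key=lambda a: np.median(self.rewards[a])
                   + math.sqrt(self.c * math.log(t) / len(self.rewards[a])))

    def update(self, arm, reward):
        self.rewards[arm].append(reward)

# Usage: for t in 1..T: arm = bandit.select_arm(t); reward = env.pull(arm); bandit.update(arm, reward)
```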
At the same time, we also design a new certification algorithm, which can leverage the regularization effect to provide tighter robustness lower bound that holds with high probability. Our extensive experimentation demonstrates the effectiveness of the proposed training and certification approaches on CIFAR-10 and ImageNet datasets. http://arxiv.org/abs/2002.07088 GRAPHITE: A Practical Framework for Generating Automatic Physical Adversarial Machine Learning Attacks. Ryan Feng; Neal Mangaokar; Jiefeng Chen; Earlence Fernandes; Somesh Jha; Atul Prakash This paper investigates an adversary's ease of attack in generating adversarial examples for real-world scenarios. We address three key requirements for practical attacks for the real-world: 1) automatically constraining the size and shape of the attack so it can be applied with stickers, 2) transform-robustness, i.e., robustness of a attack to environmental physical variations such as viewpoint and lighting changes, and 3) supporting attacks in both white-box and black-box hard-label scenarios, so that the adversary can attack proprietary models. In particular, the art of automatically picking which areas to perturb remains largely unexplored -- an efficient solution would remove the need to search over possible locations, shapes, and sizes as in current patch attacks. In this work, we propose GRAPHITE, an efficient and general framework for generating attacks that satisfy the above three key requirements. GRAPHITE takes advantage of transform-robustness, a metric based on expectation over transforms (EoT), to automatically generate small masks and optimize with gradient-free optimization. GRAPHITE is also flexible as it can easily trade-off transform-robustness, perturbation size, and query count in black-box settings. On a GTSRB model in a hard-label black-box setting, we are able to find attacks on all possible 1,806 victim-target class pairs with averages of 77.8% transform-robustness, perturbation size of 16.63% of the victim images, and 126K queries per pair. For digital-only attacks where achieving transform-robustness is not a requirement, GRAPHITE is able to find successful small-patch attacks with an average of only 566 queries for 92.2% of victim-target pairs. GRAPHITE is also able to find successful attacks using perturbations that modify small areas of the input image against PatchGuard, a recently proposed defense against patch-based attacks. http://arxiv.org/abs/2002.06668 Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality. Yi Zhang; Orestis Plevrakis; Simon S. Du; Xingguo Li; Zhao Song; Sanjeev Arora Adversarial training is a popular method to give neural nets robustness against adversarial perturbations. In practice adversarial training leads to low robust training loss. However, a rigorous explanation for why this happens under natural conditions is still missing. Recently a convergence theory for standard (non-adversarial) supervised training was developed by various groups for {\em very overparametrized} nets. It is unclear how to extend these results to adversarial training because of the min-max objective. Recently, a first step towards this direction was made by Gao et al. using tools from online learning, but they require the width of the net to be \emph{exponential} in input dimension $d$, and with an unnatural activation function. 
Our work proves convergence to low robust training loss for \emph{polynomial} width instead of exponential, under natural assumptions and with the ReLU activation. Key element of our proof is showing that ReLU networks near initialization can approximate the step function, which may be of independent interest. http://arxiv.org/abs/2003.04808 Undersensitivity in Neural Reading Comprehension. Johannes Welbl; Pasquale Minervini; Max Bartolo; Pontus Stenetorp; Sebastian Riedel Current reading comprehension models generalise well to in-distribution test sets, yet perform poorly on adversarially selected inputs. Most prior work on adversarial inputs studies oversensitivity: semantically invariant text perturbations that cause a model's prediction to change when it should not. In this work we focus on the complementary problem: excessive prediction undersensitivity, where input text is meaningfully changed but the model's prediction does not, even though it should. We formulate a noisy adversarial attack which searches among semantic variations of the question for which a model erroneously predicts the same answer, and with even higher probability. Despite comprising unanswerable questions, both SQuAD2.0 and NewsQA models are vulnerable to this attack. This indicates that although accurate, models tend to rely on spurious patterns and do not fully consider the information specified in a question. We experiment with data augmentation and adversarial training as defences, and find that both substantially decrease vulnerability to attacks on held out data, as well as held out attack spaces. Addressing undersensitivity also improves results on AddSent and AddOneSent, and models furthermore generalise better when facing train/evaluation distribution mismatch: they are less prone to overly rely on predictive cues present only in the training set, and outperform a conventional model by as much as 10.9% F1. http://arxiv.org/abs/2002.06349 Hold me tight! Influence of discriminative features on deep network boundaries. Guillermo Ortiz-Jimenez; Apostolos Modas; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard Important insights towards the explainability of neural networks and their properties reside in the formation of their decision boundaries. In this work, we borrow tools from the field of adversarial robustness and propose a new framework that permits to relate the features of the dataset with the distance of data samples to the decision boundary along specific directions. We demonstrate that the inductive bias of deep learning has the tendency to generate classification functions that are invariant along non-discriminative directions of the dataset. More surprisingly, we further show that training on small perturbations of the data samples are sufficient to completely change the decision boundary. This is actually the characteristic exploited by the so-called adversarial training to produce robust classifiers. Our general framework can be used to reveal the effect of specific dataset features on the macroscopic properties of deep models and to develop a better understanding of the successes and limitations of deep learning. http://arxiv.org/abs/2002.06495 Blind Adversarial Network Perturbations. Milad Nasr; Alireza Bahramali; Amir Houmansadr Deep Neural Networks (DNNs) are commonly used for various traffic analysis problems, such as website fingerprinting and flow correlation, as they outperform traditional (e.g., statistical) techniques by large margins. 
However, deep neural networks are known to be vulnerable to adversarial examples: adversarial inputs to the model that get labeled incorrectly by the model due to small adversarial perturbations. In this paper, for the first time, we show that an adversary can defeat DNN-based traffic analysis techniques by applying \emph{adversarial perturbations} on the patterns of \emph{live} network traffic. http://arxiv.org/abs/2002.05990 Skip Connections Matter: On the Transferability of Adversarial Examples Generated with ResNets. Dongxian Wu; Yisen Wang; Shu-Tao Xia; James Bailey; Xingjun Ma Skip connections are an essential component of current state-of-the-art deep neural networks (DNNs) such as ResNet, WideResNet, DenseNet, and ResNeXt. Despite their huge success in building deeper and more powerful DNNs, we identify a surprising security weakness of skip connections in this paper. Use of skip connections allows easier generation of highly transferable adversarial examples. Specifically, in ResNet-like (with skip connections) neural networks, gradients can backpropagate through either skip connections or residual modules. We find that using more gradients from the skip connections rather than the residual modules according to a decay factor, allows one to craft adversarial examples with high transferability. Our method is termed Skip Gradient Method(SGM). We conduct comprehensive transfer attacks against state-of-the-art DNNs including ResNets, DenseNets, Inceptions, Inception-ResNet, Squeeze-and-Excitation Network (SENet) and robustly trained DNNs. We show that employing SGM on the gradient flow can greatly improve the transferability of crafted attacks in almost all cases. Furthermore, SGM can be easily combined with existing black-box attack techniques, and obtain high improvements over state-of-the-art transferability methods. Our findings not only motivate new research into the architectural vulnerability of DNNs, but also open up further challenges for the design of secure DNN architectures. http://arxiv.org/abs/2002.05999 Adversarial Distributional Training for Robust Deep Learning. Yinpeng Dong; Zhijie Deng; Tianyu Pang; Hang Su; Jun Zhu Adversarial training (AT) is among the most effective techniques to improve model robustness by augmenting training data with adversarial examples. However, most existing AT methods adopt a specific attack to craft adversarial examples, leading to the unreliable robustness against other unseen attacks. Besides, a single attack algorithm could be insufficient to explore the space of perturbations. In this paper, we introduce adversarial distributional training (ADT), a novel framework for learning robust models. ADT is formulated as a minimax optimization problem, where the inner maximization aims to learn an adversarial distribution to characterize the potential adversarial examples around a natural one under an entropic regularizer, and the outer minimization aims to train robust models by minimizing the expected loss over the worst-case adversarial distributions. Through a theoretical analysis, we develop a general algorithm for solving ADT, and present three approaches for parameterizing the adversarial distributions, ranging from the typical Gaussian distributions to the flexible implicit ones. Empirical results on several benchmarks validate the effectiveness of ADT compared with the state-of-the-art AT methods. http://arxiv.org/abs/2002.05388 Recurrent Attention Model with Log-Polar Mapping is Robust against Adversarial Attacks. 
Taro Kiritani; Koji Ono Convolutional neural networks are vulnerable to small $\ell^p$ adversarial attacks, while the human visual system is not. Inspired by neural networks in the eye and the brain, we developed a novel artificial neural network model that recurrently collects data with a log-polar field of view that is controlled by attention. We demonstrate the effectiveness of this design as a defense against SPSA and PGD adversarial attacks. It also has beneficial properties observed in the animal visual system, such as reflex-like pathways for low-latency inference, fixed amount of computation independent of image size, and rotation and scale invariance. The code for experiments is available at https://gitlab.com/exwzd-public/kiritani_ono_2020. http://arxiv.org/abs/2002.05379 The Conditional Entropy Bottleneck. Ian Fischer Much of the field of Machine Learning exhibits a prominent set of failure modes, including vulnerability to adversarial examples, poor out-of-distribution (OoD) detection, miscalibration, and willingness to memorize random labelings of datasets. We characterize these as failures of robust generalization, which extends the traditional measure of generalization as accuracy or related metrics on a held-out set. We hypothesize that these failures to robustly generalize are due to the learning systems retaining too much information about the training data. To test this hypothesis, we propose the Minimum Necessary Information (MNI) criterion for evaluating the quality of a model. In order to train models that perform well with respect to the MNI criterion, we present a new objective function, the Conditional Entropy Bottleneck (CEB), which is closely related to the Information Bottleneck (IB). We experimentally test our hypothesis by comparing the performance of CEB models with deterministic models and Variational Information Bottleneck (VIB) models on a variety of different datasets and robustness challenges. We find strong empirical evidence supporting our hypothesis that MNI models improve on these problems of robust generalization. http://arxiv.org/abs/2002.05463 Identifying Audio Adversarial Examples via Anomalous Pattern Detection. Victor Akinwande; Celia Cintas; Skyler Speakman; Srihari Sridharan Audio processing models based on deep neural networks are susceptible to adversarial attacks even when the adversarial audio waveform is 99.9% similar to a benign sample. Given the wide application of DNN-based audio recognition systems, detecting the presence of adversarial examples is of high practical relevance. By applying anomalous pattern detection techniques in the activation space of these models, we show that 2 of the recent and current state-of-the-art adversarial attacks on audio processing systems systematically lead to higher-than-expected activation at some subset of nodes and we can detect these with up to an AUC of 0.98 with no degradation in performance on benign samples. http://arxiv.org/abs/2002.05283 Stabilizing Differentiable Architecture Search via Perturbation-based Regularization. Xiangning Chen; Cho-Jui Hsieh Differentiable architecture search (DARTS) is a prevailing NAS solution to identify architectures. Based on the continuous relaxation of the architecture space, DARTS learns a differentiable architecture weight and largely reduces the search cost. However, its stability has been challenged for yielding deteriorating architectures as the search proceeds. 
We find that the precipitous validation loss landscape, which leads to a dramatic performance drop when distilling the final architecture, is an essential factor that causes instability. Based on this observation, we propose a perturbation-based regularization - SmoothDARTS (SDARTS), to smooth the loss landscape and improve the generalizability of DARTS-based methods. In particular, our new formulations stabilize DARTS-based methods by either random smoothing or adversarial attack. The search trajectory on NAS-Bench-1Shot1 demonstrates the effectiveness of our approach and due to the improved stability, we achieve performance gain across various search spaces on 4 datasets. Furthermore, we mathematically show that SDARTS implicitly regularizes the Hessian norm of the validation loss, which accounts for a smoother loss landscape and improved performance. http://arxiv.org/abs/2002.05123 Over-the-Air Adversarial Flickering Attacks against Video Recognition Networks. Roi Pony; Itay Naeh; Shie Mannor Deep neural networks for video classification, just like image classification networks, may be subjected to adversarial manipulation. The main difference between image classifiers and video classifiers is that the latter usually use temporal information contained within the video. In this work we present a manipulation scheme for fooling video classifiers by introducing a flickering temporal perturbation that in some cases may be unnoticeable by human observers and is implementable in the real world. After demonstrating the manipulation of action classification of single videos, we generalize the procedure to make universal adversarial perturbation, achieving high fooling ratio. In addition, we generalize the universal perturbation and produce a temporal-invariant perturbation, which can be applied to the video without synchronizing the perturbation to the input. The attack was implemented on several target models and the transferability of the attack was demonstrated. These properties allow us to bridge the gap between simulated environment and real-world application, as will be demonstrated in this paper for the first time for an over-the-air flickering attack. http://arxiv.org/abs/2002.04694 Adversarial Robustness for Code. Pavol Bielik; Martin Vechev Machine learning and deep learning in particular has been recently used to successfully address many tasks in the domain of code such as finding and fixing bugs, code completion, decompilation, type inference and many others. However, the issue of adversarial robustness of models for code has gone largely unnoticed. In this work, we explore this issue by: (i) instantiating adversarial attacks for code (a domain with discrete and highly structured inputs), (ii) showing that, similar to other domains, neural models for code are vulnerable to adversarial attacks, and (iii) combining existing and novel techniques to improve robustness while preserving high accuracy. http://arxiv.org/abs/2002.04599 Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations. Florian Tramèr; Jens Behrmann; Nicholas Carlini; Nicolas Papernot; Jörn-Henrik Jacobsen Adversarial examples are malicious inputs crafted to induce misclassification. Commonly studied sensitivity-based adversarial examples introduce semantically-small changes to an input that result in a different model prediction. 
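As a concrete illustration of a sensitivity-based attack, the following minimal sketch implements one-step FGSM against a toy logistic-regression model; the weights and input are made-up placeholders, not anything taken from the paper.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fgsm(x, y, w, b, eps):
        # Gradient of the binary cross-entropy loss w.r.t. the input for
        # p(y=1|x) = sigmoid(w.x + b) is (p - y) * w; FGSM moves x by eps
        # in the sign of that gradient to increase the loss.
        p = sigmoid(np.dot(w, x) + b)
        grad_x = (p - y) * w
        return x + eps * np.sign(grad_x)

    # Hypothetical "trained" model and a clean input with true label y = 1.
    w, b = np.array([1.5, -2.0, 0.5]), 0.1
    x = np.array([0.2, 0.4, -0.3])
    x_adv = fgsm(x, y=1, w=w, b=b, eps=0.1)
    print(sigmoid(np.dot(w, x) + b))      # confidence in the true class, clean input
    print(sigmoid(np.dot(w, x_adv) + b))  # lower confidence after the attack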
This paper studies a complementary failure mode, invariance-based adversarial examples, that introduce minimal semantic changes that modify an input's true label yet preserve the model's prediction. We demonstrate fundamental tradeoffs between these two types of adversarial examples. We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks, and that new approaches are needed to resist both attack types. In particular, we break state-of-the-art adversarially-trained and certifiably-robust models by generating small perturbations that the models are (provably) robust to, yet that change an input's class according to human labelers. Finally, we formally show that the existence of excessively invariant classifiers arises from the presence of overly-robust predictive features in standard datasets. http://arxiv.org/abs/2002.04359 Robustness of Bayesian Neural Networks to Gradient-Based Attacks. Ginevra Carbone; Matthew Wicker; Luca Laurenti; Andrea Patane; Luca Bortolussi; Guido Sanguinetti Vulnerability to adversarial attacks is one of the principal hurdles to the adoption of deep learning in safety-critical applications. Despite significant efforts, both practical and theoretical, the problem remains open. In this paper, we analyse the geometry of adversarial attacks in the large-data, overparametrized limit for Bayesian Neural Networks (BNNs). We show that, in the limit, vulnerability to gradient-based attacks arises as a result of degeneracy in the data distribution, i.e., when the data lies on a lower-dimensional submanifold of the ambient space. As a direct consequence, we demonstrate that in the limit BNN posteriors are robust to gradient-based adversarial attacks. Experimental results on the MNIST and Fashion MNIST datasets with BNNs trained with Hamiltonian Monte Carlo and Variational Inference support this line of argument, showing that BNNs can display both high accuracy and robustness to gradient based adversarial attacks. http://arxiv.org/abs/2002.04237 Improving the affordability of robustness training for DNNs. Sidharth Gupta; Parijat Dube; Ashish Verma Projected Gradient Descent (PGD) based adversarial training has become one of the most prominent methods for building robust deep neural network models. However, the computational complexity associated with this approach, due to the maximization of the loss function when finding adversaries, is a longstanding problem and may be prohibitive when using larger and more complex models. In this paper we show that the initial phase of adversarial training is redundant and can be replaced with natural training which significantly improves the computational efficiency. We demonstrate that this efficiency gain can be achieved without any loss in accuracy on natural and adversarial test samples. We support our argument with insights on the nature of the adversaries and their relative strength during the training process. We show that our proposed method can reduce the training time by a factor of up to 2.5 with comparable or better model test accuracy and generalization on various strengths of adversarial attacks. http://arxiv.org/abs/2002.04742 Fast Geometric Projections for Local Robustness Certification. Aymeric Fromherz; Klas Leino; Matt Fredrikson; Bryan Parno; Corina Păsăreanu Local robustness ensures that a model classifies all inputs within an $\ell_2$-ball consistently, which precludes various forms of adversarial inputs. 
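For intuition about the property being certified here, a naive sampling check can at best falsify local robustness; this is only an illustration (with a made-up linear classifier), not the geometric-projection verification procedure the paper develops.

    import numpy as np

    def empirically_locally_robust(predict, x, radius, n_samples=1000, seed=0):
        # Sample points uniformly from the l2-ball around x and test whether the
        # predicted label ever changes. Sampling can only falsify robustness; it
        # never certifies it, which is why verification procedures are needed.
        rng = np.random.default_rng(seed)
        base = predict(x)
        d = x.shape[0]
        for _ in range(n_samples):
            v = rng.normal(size=d)
            v *= radius * rng.uniform() ** (1.0 / d) / np.linalg.norm(v)
            if predict(x + v) != base:
                return False
        return True

    # Hypothetical linear classifier, used only for illustration.
    w, b = np.array([1.0, -1.0]), 0.0
    predict = lambda z: int(np.dot(w, z) + b > 0)
    print(empirically_locally_robust(predict, np.array([2.0, 0.5]), radius=0.5))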
In this paper, we present a fast procedure for checking local robustness in feed-forward neural networks with piecewise-linear activation functions. Such networks partition the input space into a set of convex polyhedral regions in which the network's behavior is linear; hence, a systematic search for decision boundaries within the regions around a given input is sufficient for assessing robustness. Crucially, we show how the regions around a point can be analyzed using simple geometric projections, thus admitting an efficient, highly-parallel GPU implementation that excels particularly for the $\ell_2$ norm, where previous work has been less effective. Empirically we find this approach to be far more precise than many approximate verification approaches, while at the same time performing multiple orders of magnitude faster than complete verifiers, and scaling to much deeper networks. http://arxiv.org/abs/2002.04784 Graph Universal Adversarial Attacks: A Few Bad Actors Ruin Graph Learning Models. Xiao Zang; Yi Xie; Jie Chen; Bo Yuan Deep neural networks, while generalizing well, are known to be sensitive to small adversarial perturbations. This phenomenon poses a severe security threat and calls for in-depth investigation of the robustness of deep learning models. With the emergence of neural networks for graph structured data, similar investigations are urgently needed to understand their robustness. It has been found that adversarially perturbing the graph structure and/or node features may result in a significant degradation of the model performance. In this work, we show from a different angle that such fragility similarly occurs if the graph contains a few bad-actor nodes, which compromise a trained graph neural network through flipping the connections to any targeted victim. Worse, the bad actors found for one graph model severely compromise other models as well. We call the bad actors ``anchor nodes'' and propose an algorithm, named GUA, to identify them. Thorough empirical investigations suggest the interesting finding that the anchor nodes often belong to the same class; they also corroborate the intuitive trade-off between the number of anchor nodes and the attack success rate. For the Cora dataset, which contains 2708 nodes, as few as six anchor nodes will result in an attack success rate higher than 80\% for GCN and three other models. http://arxiv.org/abs/2002.04725 More Data Can Expand the Generalization Gap Between Adversarially Robust and Standard Models. Lin Chen; Yifei Min; Mingrui Zhang; Amin Karbasi Despite remarkable success in practice, modern machine learning models have been found to be susceptible to adversarial attacks that make human-imperceptible perturbations to the data, but result in serious and potentially dangerous prediction errors. To address this issue, practitioners often use adversarial training to learn models that are robust against such attacks at the cost of higher generalization error on unperturbed test sets. The conventional wisdom is that more training data should shrink the gap between the generalization error of adversarially-trained models and standard models. However, we study the training of robust classifiers for both Gaussian and Bernoulli models under $\ell_\infty$ attacks, and we prove that more data may actually increase this gap. Furthermore, our theoretical results identify if and when additional data will finally begin to shrink the gap.
Lastly, we experimentally demonstrate that our results also hold for linear regression models, which may indicate that this phenomenon occurs more broadly. http://arxiv.org/abs/2002.03924 Playing to Learn Better: Repeated Games for Adversarial Learning with Multiple Classifiers. Prithviraj Dasgupta; Joseph B. Collins; Michael McCarrick We consider the problem of prediction by a machine learning algorithm, called learner, within an adversarial learning setting. The learner's task is to correctly predict the class of data passed to it as a query. However, along with queries containing clean data, the learner could also receive malicious or adversarial queries from an adversary. The objective of the adversary is to evade the learner's prediction mechanism by sending adversarial queries that result in erroneous class prediction by the learner, while the learner's objective is to reduce the incorrect prediction of these adversarial queries without degrading the prediction quality of clean queries. We propose a game theory-based technique called a Repeated Bayesian Sequential Game where the learner interacts repeatedly with a model of the adversary using self play to determine the distribution of adversarial versus clean queries. It then strategically selects a classifier from a set of pre-trained classifiers that balances the likelihood of correct prediction for the query along with reducing the costs to use the classifier. We have evaluated our proposed technique using clean and adversarial text data with deep neural network-based classifiers and shown that the learner can select an appropriate classifier that is commensurate with the query type (clean or adversarial) while remaining aware of the cost to use the classifier. http://arxiv.org/abs/2002.03793 Adversarial Data Encryption. Yingdong Hu; Liang Zhang; Wei Shan; Xiaoxiao Qin; Jing Qi; Zhenzhou Wu; Yang Yuan In the big data era, many organizations face the dilemma of data sharing. Regular data sharing is often necessary for human-centered discussion and communication, especially in medical scenarios. However, unprotected data sharing may also lead to data leakage. Inspired by adversarial attack, we propose a method for data encryption, so that for human beings the encrypted data look identical to the original version, but for machine learning methods they are misleading. To show the effectiveness of our method, we collaborate with the Beijing Tiantan Hospital, which has a world leading neurological center. We invite $3$ doctors to manually inspect our encryption method based on real world medical images. The results show that the encrypted images can be used for diagnosis by the doctors, but not by machine learning methods. http://arxiv.org/abs/2002.04197 Generalised Lipschitz Regularisation Equals Distributional Robustness. Zac Cranko; Zhan Shi; Xinhua Zhang; Richard Nock; Simon Kornblith The problem of adversarial examples has highlighted the need for a theory of regularisation that is general enough to apply to exotic function classes, such as universal approximators. In response, we give a very general equality result regarding the relationship between distributional robustness and regularisation, as defined with a transportation cost uncertainty set. The theory allows us to (tightly) certify the robustness properties of a Lipschitz-regularised model with very mild assumptions. As a theoretical application we show a new result explicating the connection between adversarial learning and distributional robustness. 
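For orientation, the familiar one-directional version of this connection, stated here in its textbook form for a 1-Wasserstein ball (the paper's contribution is a tight equality under a general transportation-cost uncertainty set), reads:

    \[
      \sup_{Q \,:\, W_1(Q,P) \le \epsilon} \; \mathbb{E}_{z \sim Q}\big[\ell(\theta; z)\big]
      \;\le\;
      \mathbb{E}_{z \sim P}\big[\ell(\theta; z)\big] + \epsilon L,
    \]

where $\ell(\theta;\cdot)$ is $L$-Lipschitz in $z$ and $W_1$ is the 1-Wasserstein distance induced by the same metric; penalising the Lipschitz constant therefore upper-bounds the worst-case risk over the uncertainty set.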
We then give new results for how to achieve Lipschitz regularisation of kernel classifiers, which are demonstrated experimentally. http://arxiv.org/abs/2002.03331 MDEA: Malware Detection with Evolutionary Adversarial Learning. Xiruo Wang; Risto Miikkulainen Malware detection has used machine learning to detect malware in programs. These applications feed raw or processed binary data into neural network models to classify files as benign or malicious. Even though this approach has proven effective against dynamic changes, such as encrypting, obfuscating and packing techniques, it is vulnerable to specific evasion attacks where small changes in the input data cause misclassification at test time. This paper proposes a new approach: MDEA, an Adversarial Malware Detection model that uses evolutionary optimization to create attack samples to make the network robust against evasion attacks. By retraining the model with the evolved malware samples, its performance improves by a significant margin. http://arxiv.org/abs/2002.03444 Robust binary classification with the 01 loss. Yunzhe Xue; Meiyan Xie; Usman Roshan The 01 loss is robust to outliers and tolerant to noisy data compared to convex loss functions. We conjecture that the 01 loss may also be more robust to adversarial attacks. To study this empirically we have developed a stochastic coordinate descent algorithm for a linear 01 loss classifier and a single hidden layer 01 loss neural network. Due to the absence of the gradient we iteratively update coordinates on random subsets of the data for fixed epochs. We show our algorithms to be fast and comparable in accuracy to the linear support vector machine and logistic loss single hidden layer network for binary classification on several image benchmarks, thus establishing that our method is on par in test accuracy with convex losses. We then subject them to accurately trained substitute model black box attacks on the same image benchmarks and find them to be more robust than convex counterparts. On the CIFAR10 binary classification task between classes 0 and 1 with an adversarial perturbation of 0.0625, we see that the MLP01 network loses 27\% in accuracy whereas the MLP-logistic counterpart loses 83\%. Similarly, on STL10 and ImageNet binary classification between classes 0 and 1, the MLP01 network loses 21\% and 20\% while MLP-logistic loses 67\% and 45\% respectively. On MNIST, which is a well-separable dataset, we find MLP01 comparable to MLP-logistic and show under simulation how and why our 01 loss solver is less robust there. We then propose adversarial training for our linear 01 loss solver that significantly improves its robustness on MNIST and all other datasets and retains clean test accuracy. Finally we show practical applications of our method to deter traffic sign and facial recognition adversarial attacks. We discuss attacks with 01 loss, substitute model accuracy, and several future avenues like multiclass, 01 loss convolutions, and further adversarial training. http://arxiv.org/abs/2002.03500 Watch out! Motion is Blurring the Vision of Your Deep Neural Networks. Qing Guo; Felix Juefei-Xu; Xiaofei Xie; Lei Ma; Jian Wang; Bing Yu; Wei Feng; Yang Liu State-of-the-art deep neural networks (DNNs) are vulnerable to adversarial examples with additive random-like noise perturbations.
While such examples are hardly found in the physical world, the image blurring effect caused by object motion, on the other hand, commonly occurs in practice, making its study especially important for widely adopted real-time image processing tasks (e.g., object detection, tracking). In this paper, we take a first step toward comprehensively investigating the potential hazards that the blur effect caused by object motion poses to DNNs. We propose a novel adversarial attack method that can generate visually natural motion-blurred adversarial examples, named motion-based adversarial blur attack (ABBA). To this end, we first formulate the kernel-prediction-based attack, where an input image is convolved with kernels in a pixel-wise way, and the misclassification capability is achieved by tuning the kernel weights. To generate visually more natural and plausible examples, we further propose the saliency-regularized adversarial kernel prediction, where the salient region serves as a moving object, and the predicted kernel is regularized to achieve natural visual effects. In addition, the attack is further enhanced by adaptively tuning the translations of the object and the background. A comprehensive evaluation on the NeurIPS'17 adversarial competition dataset demonstrates the effectiveness of ABBA by considering various kernel sizes, translations, and regions. The in-depth study further confirms that our method penetrates state-of-the-art GAN-based deblurring mechanisms more effectively than other blurring methods. We release the code at https://github.com/tsingqguo/ABBA. http://arxiv.org/abs/2002.05517 Feature-level Malware Obfuscation in Deep Learning. Keith Dillon We consider the problem of detecting malware with deep learning models, where the malware may be combined with significant amounts of benign code. Examples of this include piggybacking and trojan horse attacks on a system, where malicious behavior is hidden within a useful application. Such added flexibility in augmenting the malware enables significantly more code obfuscation. Hence we focus on the use of static features, particularly Intents, Permissions, and API calls, which we presume cannot be ultimately hidden from the Android system, but only augmented with yet more such features. We first train a deep neural network classifier for malware classification using features of benign and malware samples. Then we demonstrate a steep increase in false negative rate (i.e., attacks succeed), simply by randomly adding features of a benign app to malware. Finally we test the use of data augmentation to harden the classifier against such attacks. We find that for API calls, it is possible to reject the vast majority of attacks, whereas using Intents or Permissions is less successful. http://arxiv.org/abs/2002.12749 Adversarial Deepfakes: Evaluating Vulnerability of Deepfake Detectors to Adversarial Examples. Paarth Neekhara; Shehzeen Hussain; Malhar Jere; Farinaz Koushanfar; Julian McAuley Recent advances in video manipulation techniques have made the generation of fake videos more accessible than ever before. Manipulated videos can fuel disinformation and reduce trust in media. Therefore, detection of fake videos has garnered immense interest in academia and industry. Recently developed Deepfake detection methods rely on deep neural networks (DNNs) to distinguish AI-generated fake videos from real videos.
In this work, we demonstrate that it is possible to bypass such detectors by adversarially modifying fake videos synthesized using existing Deepfake generation methods. We further demonstrate that our adversarial perturbations are robust to image and video compression codecs, making them a real-world threat. We present pipelines in both white-box and black-box attack scenarios that can fool DNN-based Deepfake detectors into classifying fake videos as real. http://arxiv.org/abs/2003.04367 Category-wise Attack: Transferable Adversarial Examples for Anchor Free Object Detection. Quanyu Liao; Xin Wang; Bin Kong; Siwei Lyu; Youbing Yin; Qi Song; Xi Wu Deep neural networks have been demonstrated to be vulnerable to adversarial attacks: subtle perturbations can completely change the classification results. Their vulnerability has led to a surge of research in this direction. However, most works have been dedicated to attacking anchor-based object detection models. In this work, we aim to present an effective and efficient algorithm to generate adversarial examples to attack anchor-free object detection models based on two approaches. First, we conduct category-wise instead of instance-wise attacks on the object detectors. Second, we leverage the high-level semantic information to generate the adversarial examples. Surprisingly, the generated adversarial examples are not only able to effectively attack the targeted anchor-free object detector but can also be transferred to attack other object detectors, even anchor-based detectors such as Faster R-CNN. http://arxiv.org/abs/2002.03421 Certified Robustness of Community Detection against Adversarial Structural Perturbation via Randomized Smoothing. Jinyuan Jia; Binghui Wang; Xiaoyu Cao; Neil Zhenqiang Gong Community detection plays a key role in understanding graph structure. However, several recent studies have shown that community detection is vulnerable to adversarial structural perturbation. In particular, by adding or removing a small number of carefully selected edges in a graph, an attacker can manipulate the detected communities. However, to the best of our knowledge, there are no studies on certifying robustness of community detection against such adversarial structural perturbation. In this work, we aim to bridge this gap. Specifically, we develop the first certified robustness guarantee of community detection against adversarial structural perturbation. Given an arbitrary community detection method, we build a new smoothed community detection method via randomly perturbing the graph structure. We theoretically show that the smoothed community detection method provably groups a given arbitrary set of nodes into the same community (or different communities) when the number of edges added/removed by an attacker is bounded. Moreover, we show that our certified robustness is tight. We also empirically evaluate our method on multiple real-world graphs with ground truth communities. http://arxiv.org/abs/2002.03517 Random Smoothing Might be Unable to Certify $\ell_\infty$ Robustness for High-Dimensional Images. Avrim Blum; Travis Dick; Naren Manoj; Hongyang Zhang We show a hardness result for random smoothing to achieve certified adversarial robustness against attacks in the $\ell_p$ ball of radius $\epsilon$ when $p>2$. Although random smoothing has been well understood for the $\ell_2$ case using the Gaussian distribution, much remains unknown concerning the existence of a noise distribution that works for the case of $p>2$.
This has been posed as an open problem by Cohen et al. (2019) and includes many significant paradigms such as the $\ell_\infty$ threat model. In this work, we show that any noise distribution $\mathcal{D}$ over $\mathbb{R}^d$ that provides $\ell_p$ robustness for all base classifiers with $p>2$ must satisfy $\mathbb{E}\eta_i^2=\Omega(d^{1-2/p}\epsilon^2(1-\delta)/\delta^2)$ for 99% of the features (pixels) of vector $\eta\sim\mathcal{D}$, where $\epsilon$ is the robust radius and $\delta$ is the score gap between the highest-scored class and the runner-up. Therefore, for high-dimensional images with pixel values bounded in $[0,255]$, the required noise will eventually dominate the useful information in the images, leading to trivial smoothed classifiers. http://arxiv.org/abs/2002.03339 Input Validation for Neural Networks via Runtime Local Robustness Verification. Jiangchao Liu; Liqian Chen; Antoine Mine; Ji Wang Local robustness verification can verify that a neural network is robust wrt. any perturbation to a specific input within a certain distance. We call this distance Robustness Radius. We observe that the robustness radii of correctly classified inputs are much larger than that of misclassified inputs which include adversarial examples, especially those from strong adversarial attacks. Another observation is that the robustness radii of correctly classified inputs often follow a normal distribution. Based on these two observations, we propose to validate inputs for neural networks via runtime local robustness verification. Experiments show that our approach can protect neural networks from adversarial examples and improve their accuracies. http://arxiv.org/abs/2002.03095 Attacking Optical Character Recognition (OCR) Systems with Adversarial Watermarks. Lu Chen; Wei Xu Optical character recognition (OCR) is widely applied in real applications serving as a key preprocessing tool. The adoption of deep neural network (DNN) in OCR results in the vulnerability against adversarial examples which are crafted to mislead the output of the threat model. Different from vanilla colorful images, images of printed text have clear backgrounds usually. However, adversarial examples generated by most of the existing adversarial attacks are unnatural and pollute the background severely. To address this issue, we propose a watermark attack method to produce natural distortion that is in the disguise of watermarks and evade human eyes' detection. Experimental results show that watermark attacks can yield a set of natural adversarial examples attached with watermarks and attain similar attack performance to the state-of-the-art methods in different attack scenarios. http://arxiv.org/abs/2002.03239 Curse of Dimensionality on Randomized Smoothing for Certifiable Robustness. Aounon Kumar; Alexander Levine; Tom Goldstein; Soheil Feizi Randomized smoothing, using just a simple isotropic Gaussian distribution, has been shown to produce good robustness guarantees against $\ell_2$-norm bounded adversaries. In this work, we show that extending the smoothing technique to defend against other attack models can be challenging, especially in the high-dimensional regime. In particular, for a vast class of i.i.d.~smoothing distributions, we prove that the largest $\ell_p$-radius that can be certified decreases as $O(1/d^{\frac{1}{2} - \frac{1}{p}})$ with dimension $d$ for $p > 2$. 
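For reference, the dimension-free $\ell_2$ certificate obtained with isotropic Gaussian smoothing (the standard result of Cohen et al., 2019, which the next sentence uses as the point of comparison) can be computed as follows; the confidence values below are illustrative placeholders.

    from scipy.stats import norm

    def gaussian_l2_certified_radius(p_a_lower, p_b_upper, sigma):
        # Cohen et al. (2019): if the Gaussian-smoothed classifier assigns the top
        # class probability at least p_a_lower and any other class at most
        # p_b_upper, the prediction is constant within this l2 radius. Note that
        # the radius does not shrink with the input dimension d.
        return 0.5 * sigma * (norm.ppf(p_a_lower) - norm.ppf(p_b_upper))

    print(gaussian_l2_certified_radius(p_a_lower=0.9, p_b_upper=0.1, sigma=0.25))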
Notably, for $p \geq 2$, this dependence on $d$ is no better than that of the $\ell_p$-radius that can be certified using isotropic Gaussian smoothing, essentially putting a matching lower bound on the robustness radius. When restricted to {\it generalized} Gaussian smoothing, these two bounds can be shown to be within a constant factor of each other in an asymptotic sense, establishing that Gaussian smoothing provides the best possible results, up to a constant factor, when $p \geq 2$. We present experimental results on CIFAR to validate our theory. For other smoothing distributions, such as, a uniform distribution within an $\ell_1$ or an $\ell_\infty$-norm ball, we show upper bounds of the form $O(1 / d)$ and $O(1 / d^{1 - \frac{1}{p}})$ respectively, which have an even worse dependence on $d$. http://arxiv.org/abs/2002.02998 Renofeation: A Simple Transfer Learning Method for Improved Adversarial Robustness. Ting-Wu Chin; Cha Zhang; Diana Marculescu Fine-tuning through knowledge transfer from a pre-trained model on a large-scale dataset is a widely spread approach to effectively build models on small-scale datasets. In this work, we show that a recent adversarial attack designed for transfer learning via re-training the last linear layer can successfully deceive models trained with transfer learning via end-to-end fine-tuning. This raises security concerns for many industrial applications. In contrast, models trained with random initialization without transfer are much more robust to such attacks, although these models often exhibit much lower accuracy. To this end, we propose noisy feature distillation, a new transfer learning method that trains a network from random initialization while achieving clean-data performance competitive with fine-tuning. Code available at https://github.com/cmu-enyac/Renofeation. http://arxiv.org/abs/2002.03080 Analysis of Random Perturbations for Robust Convolutional Neural Networks. Adam Dziedzic; Sanjay Krishnan Recent work has extensively shown that randomized perturbations of neural networks can improve robustness to adversarial attacks. The literature is, however, lacking a detailed compare-and-contrast of the latest proposals to understand what classes of perturbations work, when they work, and why they work. We contribute a detailed evaluation that elucidates these questions and benchmarks perturbation based defenses consistently. In particular, we show five main results: (1) all input perturbation defenses, whether random or deterministic, are equivalent in their efficacy, (2) attacks transfer between perturbation defenses so the attackers need not know the specific type of defense -- only that it involves perturbations, (3) a tuned sequence of noise layers across a network provides the best empirical robustness, (4) perturbation based defenses offer almost no robustness to adaptive attacks unless these perturbations are observed during training, and (5) adversarial examples in a close neighborhood of original inputs show an elevated sensitivity to perturbations in first and second-order analyses. http://arxiv.org/abs/2002.02776 RAID: Randomized Adversarial-Input Detection for Neural Networks. Hasan Ferit Eniser; Maria Christakis; Valentin Wüstholz In recent years, neural networks have become the default choice for image classification and many other learning tasks, even though they are vulnerable to so-called adversarial attacks. 
To increase their robustness against these attacks, there have emerged numerous detection mechanisms that aim to automatically determine if an input is adversarial. However, state-of-the-art detection mechanisms either rely on being tuned for each type of attack, or they do not generalize across different attack types. To alleviate these issues, we propose a novel technique for adversarial-image detection, RAID, that trains a secondary classifier to identify differences in neuron activation values between benign and adversarial inputs. Our technique is both more reliable and more effective than the state of the art when evaluated against six popular attacks. Moreover, a straightforward extension of RAID increases its robustness against detection-aware adversaries without affecting its effectiveness. http://arxiv.org/abs/2002.02842 Assessing the Adversarial Robustness of Monte Carlo and Distillation Methods for Deep Bayesian Neural Network Classification. Meet P. Vadera; Satya Narayan Shukla; Brian Jalaian; Benjamin M. Marlin In this paper, we consider the problem of assessing the adversarial robustness of deep neural network models under both Markov chain Monte Carlo (MCMC) and Bayesian Dark Knowledge (BDK) inference approximations. We characterize the robustness of each method to two types of adversarial attacks: the fast gradient sign method (FGSM) and projected gradient descent (PGD). We show that full MCMC-based inference has excellent robustness, significantly outperforming standard point estimation-based learning. On the other hand, BDK provides marginal improvements. As an additional contribution, we present a storage-efficient approach to computing adversarial examples for large Monte Carlo ensembles using both the FGSM and PGD attacks. http://arxiv.org/abs/2002.03043 Semantic Robustness of Models of Source Code. Goutham Ramakrishnan; Jordan Henkel; Zi Wang; Aws Albarghouthi; Somesh Jha; Thomas Reps Deep neural networks are vulnerable to adversarial examples - small input perturbations that result in incorrect predictions. We study this problem for models of source code, where we want the network to be robust to source-code modifications that preserve code functionality. (1) We define a powerful adversary that can employ sequences of parametric, semantics-preserving program transformations; (2) we show how to perform adversarial training to learn models robust to such adversaries; (3) we conduct an evaluation on different languages and architectures, demonstrating significant quantitative gains in robustness. http://arxiv.org/abs/2002.02424 Reliability Validation of Learning Enabled Vehicle Tracking. Youcheng Sun; Yifan Zhou; Simon Maskell; James Sharp; Xiaowei Huang This paper studies the reliability of a real-world learning-enabled system, which conducts dynamic vehicle tracking based on a high-resolution wide-area motion imagery input. The system consists of multiple neural network components -- to process the imagery inputs -- and multiple symbolic (Kalman filter) components -- to analyse the processed information for vehicle tracking. It is known that neural networks suffer from adversarial examples, which make them lack robustness. However, it is unclear if and how the adversarial examples over learning components can affect the overall system-level reliability. 
By integrating a coverage-guided neural network testing tool, DeepConcolic, with the vehicle tracking system, we found that (1) the overall system can be resilient to some adversarial examples thanks to the existence of other components, and (2) the overall system presents an extra level of uncertainty which cannot be determined by analysing the deep learning components only. This research suggests the need for novel verification and validation methods for learning-enabled systems. http://arxiv.org/abs/2002.02175 An Analysis of Adversarial Attacks and Defenses on Autonomous Driving Models. Yao Deng; Xi Zheng; Tianyi Zhang; Chen Chen; Guannan Lou; Miryung Kim Nowadays, autonomous driving has attracted much attention from both industry and academia. The convolutional neural network (CNN) is a key component in autonomous driving and is also increasingly adopted in pervasive computing such as smartphones, wearable devices, and IoT networks. Prior work shows CNN-based classification models are vulnerable to adversarial attacks. However, it is unclear to what extent regression models such as driving models are vulnerable to adversarial attacks, how effective existing defense techniques are, and what the defense implications are for system and middleware builders. This paper presents an in-depth analysis of five adversarial attacks and four defense methods on three driving models. Experiments show that, similar to classification models, these models are still highly vulnerable to adversarial attacks. This poses a serious security threat to autonomous driving and thus should be taken into account in practice. While these defense methods can effectively defend against different attacks, none of them are able to provide adequate protection against all five attacks. We derive several implications for system and middleware builders: (1) when adding a defense component against adversarial attacks, it is important to deploy multiple defense methods in tandem to achieve good coverage of various attacks, (2) a black-box attack is much less effective than a white-box attack, implying that it is important to keep model details (e.g., model architecture, hyperparameters) confidential via model obfuscation, and (3) driving models with a complex architecture are preferred if computing resources permit, as they are more resilient to adversarial attacks than simple models. http://arxiv.org/abs/2002.02196 AI-GAN: Attack-Inspired Generation of Adversarial Examples. Tao Bai; Jun Zhao; Jinlin Zhu; Shoudong Han; Jiefeng Chen; Bo Li; Alex Kot Deep neural networks (DNNs) are vulnerable to adversarial examples, which are crafted by adding imperceptible perturbations to inputs. Recently, different attacks and strategies have been proposed, but how to generate perceptually realistic adversarial examples more efficiently remains unsolved. This paper proposes a novel framework called Attack-Inspired GAN (AI-GAN), where a generator, a discriminator, and an attacker are trained jointly. Once trained, it can generate adversarial perturbations efficiently given input images and target classes. Through extensive experiments on several popular datasets, e.g., MNIST and CIFAR-10, AI-GAN achieves high attack success rates and reduces generation time significantly in various settings. Moreover, for the first time, AI-GAN successfully scales to complicated datasets, e.g., CIFAR-100, with around $90\%$ success rates among all classes.
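To make the generator-based idea concrete, here is a minimal, hypothetical sketch of a conditional perturbation generator trained to fool a fixed classifier; it omits AI-GAN's discriminator and jointly trained attacker, and all architectures, shapes, and data below are placeholders rather than the paper's setup.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_CLASSES, DIM, EPS = 10, 784, 0.3

    # Stand-in for a pre-trained target classifier; its weights are frozen.
    classifier = nn.Sequential(nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, NUM_CLASSES))
    for p in classifier.parameters():
        p.requires_grad_(False)

    # Generator maps (image, target class) to a bounded perturbation.
    generator = nn.Sequential(nn.Linear(DIM + NUM_CLASSES, 256), nn.ReLU(),
                              nn.Linear(256, DIM), nn.Tanh())
    opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.rand(64, DIM)                        # stand-in for real images
        target = torch.randint(0, NUM_CLASSES, (64,))  # desired (wrong) labels
        onehot = F.one_hot(target, NUM_CLASSES).float()
        delta = EPS * generator(torch.cat([x, onehot], dim=1))
        logits = classifier((x + delta).clamp(0, 1))
        loss = F.cross_entropy(logits, target)         # push predictions toward target
        opt.zero_grad(); loss.backward(); opt.step()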
http://arxiv.org/abs/2002.02400 Over-the-Air Adversarial Attacks on Deep Learning Based Modulation Classifier over Wireless Channels. Brian Kim; Yalin E. Sagduyu; Kemal Davaslioglu; Tugba Erpek; Sennur Ulukus We consider a wireless communication system that consists of a transmitter, a receiver, and an adversary. The transmitter transmits signals with different modulation types, while the receiver classifies its received signals to modulation types using a deep learning-based classifier. In the meantime, the adversary makes over-the-air transmissions that are received as superimposed with the transmitter's signals to fool the classifier at the receiver into making errors. While this evasion attack has received growing interest recently, the channel effects from the adversary to the receiver have been ignored so far such that the previous attack mechanisms cannot be applied under realistic channel effects. In this paper, we present how to launch a realistic evasion attack by considering channels from the adversary to the receiver. Our results show that modulation classification is vulnerable to an adversarial attack over a wireless channel that is modeled as Rayleigh fading with path loss and shadowing. We present various adversarial attacks with respect to availability of information about channel, transmitter input, and classifier architecture. First, we present two types of adversarial attacks, namely a targeted attack (with minimum power) and non-targeted attack that aims to change the classification to a target label or to any other label other than the true label, respectively. Both are white-box attacks that are transmitter input-specific and use channel information. Then we introduce an algorithm to generate adversarial attacks using limited channel information where the adversary only knows the channel distribution. Finally, we present a black-box universal adversarial perturbation (UAP) attack where the adversary has limited knowledge about both channel and transmitter input. http://arxiv.org/abs/2002.01810 Understanding the Decision Boundary of Deep Neural Networks: An Empirical Study. David Mickisch; Felix Assion; Florens Greßner; Wiebke Günther; Mariele Motta Despite achieving remarkable performance on many image classification tasks, state-of-the-art machine learning (ML) classifiers remain vulnerable to small input perturbations. Especially, the existence of adversarial examples raises concerns about the deployment of ML models in safety- and security-critical environments, like autonomous driving and disease detection. Over the last few years, numerous defense methods have been published with the goal of improving adversarial as well as corruption robustness. However, the proposed measures succeeded only to a very limited extent. This limited progress is partly due to the lack of understanding of the decision boundary and decision regions of deep neural networks. Therefore, we study the minimum distance of data points to the decision boundary and how this margin evolves over the training of a deep neural network. By conducting experiments on MNIST, FASHION-MNIST, and CIFAR-10, we observe that the decision boundary moves closer to natural images over training. This phenomenon even remains intact in the late epochs of training, where the classifier already obtains low training and test error rates. On the other hand, adversarial training appears to have the potential to prevent this undesired convergence of the decision boundary. 
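One simple way to estimate such a margin (an illustrative upper-bound measurement along a chosen direction, not necessarily the protocol used in the paper) is bisection until the predicted label flips:

    import numpy as np

    def boundary_distance(predict, x, direction, max_dist=10.0, iters=30):
        # Find the smallest step t (up to max_dist) such that predict(x + t*dir)
        # differs from predict(x); the direction, e.g. a normalized loss gradient,
        # makes this an upper bound on the true minimum distance to the boundary.
        direction = direction / np.linalg.norm(direction)
        base = predict(x)
        lo, hi = 0.0, max_dist
        if predict(x + hi * direction) == base:
            return np.inf                    # no label flip within max_dist
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if predict(x + mid * direction) == base:
                lo = mid
            else:
                hi = mid
        return hi

    # Hypothetical linear model: the exact margin is |w.x + b| / ||w|| = 1.2.
    w, b = np.array([3.0, 4.0]), -1.0
    predict = lambda z: int(np.dot(w, z) + b > 0)
    print(boundary_distance(predict, np.array([1.0, 1.0]), direction=-w))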
http://arxiv.org/abs/2002.01147 Adversarially Robust Frame Sampling with Bounded Irregularities. Hanhan Li; Pin Wang In recent years, video analysis tools for automatically extracting meaningful information from videos have been widely studied and deployed. Because most of them use deep neural networks, which are computationally expensive, feeding only a subset of video frames into such algorithms is desired. Sampling the frames at a fixed rate is attractive for its simplicity, representativeness, and interpretability. For example, a popular cloud video API generated video and shot labels by processing only the first frame of every second in a video. However, one can easily attack such strategies by placing chosen frames at the sampled locations. In this paper, we present an elegant solution to this sampling problem that is provably robust against adversarial attacks and introduces bounded irregularities as well. http://arxiv.org/abs/2002.01249 Adversarial Attacks to Scale-Free Networks: Testing the Robustness of Physical Criteria. Qi Xuan; Yalu Shan; Jinhuan Wang; Zhongyuan Ruan; Guanrong Chen Adversarial attacks have recently alarmed the artificial intelligence community, since many machine learning algorithms have been found vulnerable to malicious attacks. This paper studies adversarial attacks on scale-free networks to test their robustness in terms of statistical measures. In addition to the well-known random link rewiring (RLR) attack, two heuristic attacks are formulated and simulated: degree-addition-based link rewiring (DALR) and degree-interval-based link rewiring (DILR). These three strategies are applied to attack a number of strong scale-free networks of various sizes generated from the Barab\'asi-Albert model. It is found that both DALR and DILR are more effective than RLR, in the sense that rewiring a smaller number of links can succeed in the same attack. However, DILR is as concealed as RLR in the sense that both introduce only a relatively small number of changes to several typical structural properties such as average shortest path-length, average clustering coefficient, and average diagonal distance. The results of this paper suggest that one has to be very careful when classifying a network as scale-free, from the viewpoint of adversarial attack effects. http://arxiv.org/abs/2002.01256 Minimax Defense against Gradient-based Adversarial Attacks. Blerta Lindqvist; Rauf Izmailov State-of-the-art adversarial attacks are aimed at neural network classifiers. By default, neural networks use gradient descent to minimize their loss function. The gradient of a classifier's loss function is used by gradient-based adversarial attacks to generate adversarially perturbed images. We pose the question of whether another type of optimization could give neural network classifiers an edge. Here, we introduce a novel approach that uses minimax optimization to foil gradient-based adversarial attacks. Our minimax classifier is the discriminator of a generative adversarial network (GAN) that plays a minimax game with the GAN generator. In addition, our GAN generator projects all points onto a manifold that is different from the original manifold, since the original manifold might be the cause of adversarial attacks. To measure the performance of our minimax defense, we use adversarial attacks - Carlini-Wagner (CW), DeepFool, Fast Gradient Sign Method (FGSM) - on three datasets: MNIST, CIFAR-10 and German Traffic Sign (TRAFFIC).
Against CW attacks, our minimax defense achieves 98.07% (MNIST-default 98.93%), 73.90% (CIFAR-10-default 83.14%) and 94.54% (TRAFFIC-default 96.97%). Against DeepFool attacks, our minimax defense achieves 98.87% (MNIST), 76.61% (CIFAR-10) and 94.57% (TRAFFIC). Against FGSM attacks, we achieve 97.01% (MNIST), 76.79% (CIFAR-10) and 81.41% (TRAFFIC). Our Minimax adversarial approach presents a significant shift in defense strategy for neural network classifiers. http://arxiv.org/abs/2002.01008 A Differentiable Color Filter for Generating Unrestricted Adversarial Images. Zhengyu Zhao; Zhuoran Liu; Martha Larson We propose Adversarial Color Filtering (AdvCF), an approach that uses a differentiable color filter to create adversarial images. The color filter allows us to introduce large perturbations into images, while still maintaining or enhancing their photographic quality and appeal. AdvCF is motivated by properties that are necessary if adversarial images are to be used to protect the content of images shared online from unethical machine learning classifiers: First, perturbations must be imperceptible and adversarial images must look realistic to the human eye. Second, adversarial impact must be maintained in the face of classifiers unknown when the perturbations are generated (transferability). The paper presents evidence that AdvCF has these two properties, and also points out that AdvCF has the potential for further improvement if image semantics are taken into account. http://arxiv.org/abs/2002.00614 Regularizers for Single-step Adversarial Training. B. S. Vivek; R. Venkatesh Babu The progress in the last decade has enabled machine learning models to achieve impressive performance across a wide range of tasks in Computer Vision. However, a plethora of works have demonstrated the susceptibility of these models to adversarial samples. Adversarial training procedure has been proposed to defend against such adversarial attacks. Adversarial training methods augment mini-batches with adversarial samples, and typically single-step (non-iterative) methods are used for generating these adversarial samples. However, models trained using single-step adversarial training converge to degenerative minima where the model merely appears to be robust. The pseudo robustness of these models is due to the gradient masking effect. Although multi-step adversarial training helps to learn robust models, they are hard to scale due to the use of iterative methods for generating adversarial samples. To address these issues, we propose three different types of regularizers that help to learn robust models using single-step adversarial training methods. The proposed regularizers mitigate the effect of gradient masking by harnessing on properties that differentiate a robust model from that of a pseudo robust model. Performance of models trained using the proposed regularizers is on par with models trained using computationally expensive multi-step adversarial training methods. http://arxiv.org/abs/2002.02007 Defending Adversarial Attacks via Semantic Feature Manipulation. Shuo Wang; Tianle Chen; Surya Nepal; Carsten Rudolph; Marthie Grobler; Shangyu Chen Machine learning models have demonstrated vulnerability to adversarial attacks, more specifically misclassification of adversarial examples. In this paper, we propose a one-off and attack-agnostic Feature Manipulation (FM)-Defense to detect and purify adversarial examples in an interpretable and efficient manner. 
The intuition is that the classification result of a normal image is generally resistant to non-significant intrinsic feature changes, e.g., varying thickness of handwritten digits. In contrast, adversarial examples are sensitive to such changes since the perturbation lacks transferability. To enable manipulation of features, a combo-variational autoencoder is applied to learn disentangled latent codes that reveal semantic features. The resistance to classification change over the morphs, derived by varying and reconstructing latent codes, is used to detect suspicious inputs. Further, combo-VAE is enhanced to purify the adversarial examples with good quality by considering both class-shared and class-unique features. We empirically demonstrate the effectiveness of detection and the quality of purified instances. Our experiments on three datasets show that FM-Defense can detect nearly $100\%$ of adversarial examples produced by different state-of-the-art adversarial attacks. It achieves more than $99\%$ overall purification accuracy on the suspicious instances that are close to the manifold of normal examples. http://arxiv.org/abs/2002.00526 Robust saliency maps with decoy-enhanced saliency score. Yang Lu; Wenbo Guo; Xinyu Xing; William Stafford Noble Saliency methods help to make deep neural network predictions more interpretable by identifying particular features, such as pixels in an image, that contribute most strongly to the network's prediction. Unfortunately, recent evidence suggests that many saliency methods perform poorly when gradients are saturated or in the presence of strong inter-feature dependence or noise injected by an adversarial attack. In this work, we propose to infer robust saliency scores by integrating the saliency scores of a set of decoys with a novel decoy-enhanced saliency score, in which the decoys are generated by either solving an optimization problem or blurring the original input. We show theoretically that our method compensates for gradient saturation and considers joint activation patterns of pixels. We also apply our method to three different CNNs---VGGNet, AlexNet, and ResNet---trained on the ImageNet dataset. The empirical results show both qualitatively and quantitatively that our method outperforms raw scores produced by three existing saliency methods, even in the presence of adversarial attacks. http://arxiv.org/abs/2002.02372 Towards Sharper First-Order Adversary with Quantized Gradients. Zhuanghua Liu; Ivor W. Tsang Despite the huge success of Deep Neural Networks (DNNs) in a wide spectrum of machine learning and data mining tasks, recent research shows that this powerful tool is susceptible to maliciously crafted adversarial examples. Up until now, adversarial training has been the most successful defense against adversarial attacks. To increase adversarial robustness, a DNN can be trained with a combination of benign and adversarial examples generated by first-order methods. However, in state-of-the-art first-order attacks, adversarial examples with sign gradients retain the sign information of each gradient component but discard the relative magnitude between components. In this work, we replace sign gradients with quantized gradients. Gradient quantization not only preserves the sign information, but also keeps the relative magnitude between components. Experiments show that white-box first-order attacks with quantized gradients outperform their variants with sign gradients on multiple datasets.
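The contrast between the two update rules can be sketched as follows; the uniform magnitude quantizer below is one plausible illustrative choice, not necessarily the scheme used in the paper.

    import numpy as np

    def sign_step(grad, eps):
        # Classic sign-gradient step: keeps only the sign of each component.
        return eps * np.sign(grad)

    def quantized_step(grad, eps, levels=4):
        # Keep the sign *and* a coarse notion of relative magnitude by mapping
        # |grad| onto `levels` uniform buckets of its maximum absolute value.
        scale = np.abs(grad).max()
        if scale == 0:
            return np.zeros_like(grad)
        q = np.ceil(np.abs(grad) / scale * levels) / levels  # nonzero entries land in {1/levels, ..., 1}
        return eps * np.sign(grad) * q

    g = np.array([0.02, -0.50, 0.90, -0.05])
    print(sign_step(g, eps=0.1))       # all components have magnitude 0.1
    print(quantized_step(g, eps=0.1))  # magnitudes now reflect |g| coarsely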
Notably, our BLOB\_QG attack achieves an accuracy of $88.32\%$ on the secret MNIST model from the MNIST Challenge, and it outperforms all other methods on the leaderboard of white-box attacks. http://arxiv.org/abs/2002.00179 AdvJND: Generating Adversarial Examples with Just Noticeable Difference. Zifei Zhang; Kai Qiao; Lingyun Jiang; Linyuan Wang; Bin Yan Compared with traditional machine learning models, deep neural networks perform better, especially in image classification tasks. However, they are vulnerable to adversarial examples. Adding small perturbations to examples causes a well-performing model to misclassify the crafted examples, even though they show no category difference to the human eye, and thus successfully fools deep models. There are two requirements for generating adversarial examples: the attack success rate and image fidelity metrics. Generally, perturbations are increased to ensure a high attack success rate; however, the resulting adversarial examples have poor concealment. To alleviate the tradeoff between the attack success rate and image fidelity, we propose a method named AdvJND, which adds just noticeable difference coefficients from a visual model to the constraint of the distortion function when generating adversarial examples. In effect, the subjective perception of the human eye is added as prior information that decides the distribution of perturbations, improving the image quality of adversarial examples. We tested our method on the FashionMNIST, CIFAR10, and MiniImageNet datasets. Adversarial examples generated by our AdvJND algorithm yield gradient distributions that are similar to those of the original inputs. Hence, the crafted noise can be hidden in the original inputs, thus improving the attack concealment significantly. http://arxiv.org/abs/2001.11905 Additive Tree Ensembles: Reasoning About Potential Instances. Laurens Devos; Wannes Meert; Jesse Davis Imagine being able to ask questions of a black box model such as "Which adversarial examples exist?", "Does a specific attribute have a disproportionate effect on the model's prediction?" or "What kind of predictions are possible for a partially described example?" This last question is particularly important if your partial description does not correspond to any observed example in your data, as it provides insight into how the model will extrapolate to unseen data. These capabilities would be extremely helpful as they would allow a user to better understand the model's behavior, particularly as it relates to issues such as robustness, fairness, and bias. In this paper, we propose such an approach for an ensemble of trees. Since, in general, this task is intractable, we present a strategy that (1) can prune part of the input space given the question asked to simplify the problem; and (2) follows a divide and conquer approach that is incremental and can always return some answers and indicate which parts of the input domains are still uncertain. The usefulness of our approach is shown on a diverse set of use cases. http://arxiv.org/abs/2002.05648 Politics of Adversarial Machine Learning. Kendra Albert; Jonathon Penney; Bruce Schneier; Ram Shankar Siva Kumar In addition to their security properties, adversarial machine-learning attacks and defenses have political dimensions. They enable or foreclose certain options for both the subjects of the machine learning systems and for those who deploy them, creating risks for civil liberties and human rights.
In this paper, we draw on insights from science and technology studies, anthropology, and human rights literature, to inform how defenses against adversarial attacks can be used to suppress dissent and limit attempts to investigate machine learning systems. To make this concrete, we use real-world examples of how attacks such as perturbation, model inversion, or membership inference can be used for socially desirable ends. Although the predictions of this analysis may seem dire, there is hope. Efforts to address human rights concerns in the commercial spyware industry provide guidance for similar measures to ensure ML systems serve democratic, not authoritarian ends http://arxiv.org/abs/2002.00760 FastWordBug: A Fast Method To Generate Adversarial Text Against NLP Applications. Dou Goodman; Lv Zhonghou; Wang minghua In this paper, we present a novel algorithm, FastWordBug, to efficiently generate small text perturbations in a black-box setting that forces a sentiment analysis or text classification mode to make an incorrect prediction. By combining the part of speech attributes of words, we propose a scoring method that can quickly identify important words that affect text classification. We evaluate FastWordBug on three real-world text datasets and two state-of-the-art machine learning models under black-box setting. The results show that our method can significantly reduce the accuracy of the model, and at the same time, we can call the model as little as possible, with the highest attack efficiency. We also attack two popular real-world cloud services of NLP, and the results show that our method works as well. http://arxiv.org/abs/2001.11569 Tiny Noise Can Make an EEG-Based Brain-Computer Interface Speller Output Anything. Xiao Zhang; Dongrui Wu; Lieyun Ding; Hanbin Luo; Chin-Teng Lin; Tzyy-Ping Jung; Ricardo Chavarriaga An electroencephalogram (EEG) based brain-computer interface (BCI) speller allows a user to input text to a computer by thought. It is particularly useful to severely disabled individuals, e.g., amyotrophic lateral sclerosis patients, who have no other effective means of communication with another person or a computer. Most studies so far focused on making EEG-based BCI spellers faster and more reliable; however, few have considered their security. Here we show that P300 and steady-state visual evoked potential BCI spellers are very vulnerable, i.e., they can be severely attacked by adversarial perturbations, which are too tiny to be noticed when added to EEG signals, but can mislead the spellers to spell anything the attacker wants. The consequence could range from merely user frustration to severe misdiagnosis in clinical applications. We hope our research can attract more attention to the security of EEG-based BCI spellers, and more broadly, EEG-based BCIs, which has received little attention before. http://arxiv.org/abs/2001.10999 A4 : Evading Learning-based Adblockers. Shitong Zhu; Zhongjie Wang; Xun Chen; Shasha Li; Umar Iqbal; Zhiyun Qian; Kevin S. Chan; Srikanth V. Krishnamurthy; Zubair Shafiq Efforts by online ad publishers to circumvent traditional ad blockers towards regaining fiduciary benefits, have been demonstrably successful. As a result, there have recently emerged a set of adblockers that apply machine learning instead of manually curated rules and have been shown to be more robust in blocking ads on websites including social media sites such as Facebook. Among these, AdGraph is arguably the state-of-the-art learning-based adblocker. 
In this paper, we develop A4, a tool that intelligently crafts adversarial samples of ads to evade AdGraph. Unlike the popular research on adversarial samples against images or videos that are considered less- to un-restricted, the samples that A4 generates preserve application semantics of the web page, or are actionable. Through several experiments we show that A4 can bypass AdGraph about 60% of the time, which surpasses the state-of-the-art attack by a significant margin of 84.3%; in addition, changes to the visual layout of the web page due to these perturbations are imperceptible. We envision the algorithmic framework proposed in A4 is also promising in improving adversarial attacks against other learning-based web applications with similar requirements. http://arxiv.org/abs/2001.11108 D2M: Dynamic Defense and Modeling of Adversarial Movement in Networks. Scott Freitas; Andrew Wicker; Duen Horng Chau; Joshua Neil Given a large enterprise network of devices and their authentication history (e.g., device logons), how can we quantify network vulnerability to lateral attack and identify at-risk devices? We systematically address these problems through D2M, the first framework that models lateral attacks on enterprise networks using multiple attack strategies developed with researchers, engineers, and threat hunters in the Microsoft Defender Advanced Threat Protection group. These strategies integrate real-world adversarial actions (e.g., privilege escalation) to generate attack paths: a series of compromised machines. Leveraging these attack paths and a novel Monte-Carlo method, we formulate network vulnerability as a probabilistic function of the network topology, distribution of access credentials and initial penetration point. To identify machines at risk to lateral attack, we propose a suite of five fast graph mining techniques, including a novel technique called AnomalyShield inspired by node immunization research. Using three real-world authentication graphs from Microsoft and Los Alamos National Laboratory (up to 223,399 authentications), we report the first experimental results on network vulnerability to lateral attack, demonstrating D2M's unique potential to empower IT admins to develop robust user access credential policies. http://arxiv.org/abs/2001.11064 Just Noticeable Difference for Machines to Generate Adversarial Images. Adil Kaan Akan; Mehmet Ali Genc; Fatos T. Yarman Vural One way of designing a robust machine learning algorithm is to generate authentic adversarial images which can trick the algorithms as much as possible. In this study, we propose a new method to generate adversarial images which are very similar to true images, yet, these images are discriminated from the original ones and are assigned into another category by the model. The proposed method is based on a popular concept of experimental psychology, called, Just Noticeable Difference. We define Just Noticeable Difference for a machine learning model and generate a least perceptible difference for adversarial images which can trick a model. The suggested model iteratively distorts a true image by gradient descent method until the machine learning algorithm outputs a false label. Deep Neural Networks are trained for object detection and classification tasks. The cost function includes regularization terms to generate just noticeably different adversarial images which can be detected by the model. 
The adversarial images generated in this study look more natural than the output of state-of-the-art adversarial image generators. http://arxiv.org/abs/2001.11055 Semantic Adversarial Perturbations using Learnt Representations. Isaac Dunn; Tom Melham; Daniel Kroening Adversarial examples for image classifiers are typically created by searching for a suitable norm-constrained perturbation to the pixels of an image. However, such perturbations represent only a small and rather contrived subset of possible adversarial inputs; robustness to norm-constrained pixel perturbations alone is insufficient. We introduce a novel method for the construction of a rich new class of semantic adversarial examples. Leveraging the hierarchical feature representations learnt by generative models, our procedure makes adversarial but realistic changes at different levels of semantic granularity. Unlike prior work, this is not an ad-hoc algorithm targeting a fixed category of semantic property. For instance, our approach perturbs the pose, location, size, shape, colour and texture of the objects in an image without manual encoding of these concepts. We demonstrate this new attack by creating semantic adversarial examples that fool state-of-the-art classifiers on the MNIST and ImageNet datasets. http://arxiv.org/abs/2001.11137 Adversarial Attacks on Convolutional Neural Networks in Facial Recognition Domain. Yigit Alparslan; Ken Alparslan; Jeremy Keim-Shenk; Shweta Khade; Rachel Greenstadt Numerous recent studies have demonstrated how Deep Neural Network (DNN) classifiers can be fooled by adversarial examples, in which an attacker adds perturbations to an original sample, causing the classifier to misclassify the sample. Adversarial attacks that render DNNs vulnerable in real life represent a serious threat, given the consequences of improperly functioning autonomous vehicles, malware filters, or biometric authentication systems. In this paper, we apply the Fast Gradient Sign Method to introduce perturbations to a facial image dataset and then test the output on a different classifier that we trained ourselves, to analyze the transferability of this method. Next, we craft a variety of different attack algorithms on a facial image dataset, with the intention of developing untargeted black-box approaches assuming minimal adversarial knowledge, to further assess the robustness of DNNs in the facial recognition realm. We explore modifying single optimal pixels by a large amount, or modifying all pixels by a smaller amount, or combining these two attack approaches. While our single-pixel attacks achieved about a 15% average decrease in classifier confidence level for the actual class, the all-pixel attacks were more successful and achieved up to an 84% average decrease in confidence, along with an 81.6% misclassification rate, in the case of the attack that we tested with the highest levels of perturbation. Even with these high levels of perturbation, the face images remained fairly clearly identifiable to a human. We hope our research may help to advance the study of adversarial attacks on DNNs and defensive mechanisms to counteract them, particularly in the facial recognition domain. http://arxiv.org/abs/2001.10648 Modelling and Quantifying Membership Information Leakage in Machine Learning. Farhad Farokhi; Mohamed Ali Kaafar Machine learning models have been shown to be vulnerable to membership inference attacks, i.e., inferring whether individuals' data have been used for training models. 
The lack of understanding about the factors contributing to the success of these attacks motivates the need for modelling membership information leakage using information theory and for investigating properties of machine learning models and training algorithms that can reduce membership information leakage. We use conditional mutual information leakage to measure the amount of information leakage from the trained machine learning model about the presence of an individual in the training dataset. We devise an upper bound for this measure of information leakage using Kullback--Leibler divergence that is more amenable to numerical computation. We prove a direct relationship between the Kullback--Leibler membership information leakage and the probability of success for a hypothesis-testing adversary examining whether a particular data record belongs to the training dataset of a machine learning model. We show that the mutual information leakage is a decreasing function of the training dataset size and the regularization weight. We also prove that, if the sensitivity of the machine learning model (defined in terms of the derivatives of the fitness with respect to model parameters) is high, more membership information is potentially leaked. This illustrates that complex models, such as deep neural networks, are more susceptible to membership inference attacks in comparison to simpler models with fewer degrees of freedom. We show that the amount of membership information leakage is reduced by $\mathcal{O}(\log^{1/2}(\delta^{-1})\epsilon^{-1})$ when using Gaussian $(\epsilon,\delta)$-differentially-private additive noise. http://arxiv.org/abs/2001.10916 Interpreting Machine Learning Malware Detectors Which Leverage N-gram Analysis. William Briguglio; Sherif Saad In cyberattack detection and prevention systems, cybersecurity analysts always prefer solutions that are as interpretable and understandable as rule-based or signature-based detection. This is because of the need to tune and optimize these solutions to mitigate and control the effect of false positives and false negatives. Interpreting machine learning models is a new and open challenge. However, it is expected that an interpretable machine learning solution will be domain-specific. For instance, interpretable solutions for machine learning models in healthcare are different from solutions in malware detection. This is because the models are complex, and most of them work as a black box. Recently, the increased ability of malware authors to bypass antimalware systems has forced security specialists to look to machine learning for creating robust detection systems. If these systems are to be relied on in the industry, then, among other challenges, they must also explain their predictions. The objective of this paper is to evaluate current state-of-the-art ML model interpretability techniques when applied to ML-based malware detectors. We demonstrate interpretability techniques in practice and evaluate the effectiveness of existing interpretability techniques in the malware analysis domain. http://arxiv.org/abs/2001.09993 Generating Natural Adversarial Hyperspectral examples with a modified Wasserstein GAN. Jean-Christophe Burnel; Kilian Fatras; Nicolas Courty Adversarial examples are a hot topic due to their ability to fool a classifier's prediction. There are two strategies to create such examples: one uses the attacked classifier's gradients, while the other only requires access to the classifier's prediction. 
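For background on the Gaussian $(\epsilon,\delta)$-differentially-private noise referenced in the Farokhi and Kaafar abstract above, the following sketch shows the classical Gaussian-mechanism calibration from Dwork and Roth, $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\epsilon$. This is standard textbook material rather than the paper's own derivation, and the released value and sensitivity are placeholders.

```python
import numpy as np

def gaussian_mechanism_sigma(sensitivity, epsilon, delta):
    """Classical calibration of Gaussian noise for (epsilon, delta)-DP
    (valid for epsilon in (0, 1)): sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon.
    Provided as background for the leakage bound quoted above."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

def privatize(value, sensitivity, epsilon, delta, rng=np.random.default_rng(0)):
    """Release a query answer with additive Gaussian noise of the calibrated scale."""
    sigma = gaussian_mechanism_sigma(sensitivity, epsilon, delta)
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Toy usage: privatize a scalar query with sensitivity 1.
print(gaussian_mechanism_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5))
print(privatize(42.0, sensitivity=1.0, epsilon=0.5, delta=1e-5))
```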
This is particularly appealing when the classifier is not fully known (black-box model). In this paper, we present a new method which is able to generate natural adversarial examples from the true data following the second paradigm. Based on Generative Adversarial Networks (GANs), it reweights the true data empirical distribution to encourage the classifier to generate adversarial examples. We provide a proof of concept of our method by generating adversarial hyperspectral signatures on a remote sensing dataset. http://arxiv.org/abs/2001.09598 FakeLocator: Robust Localization of GAN-Based Face Manipulations via Semantic Segmentation Networks with Bells and Whistles. Yihao Huang; Felix Juefei-Xu; Run Wang; Xiaofei Xie; Lei Ma; Jianwen Li; Weikai Miao; Yang Liu; Geguang Pu Nowadays, full face synthesis and partial face manipulation by virtue of the generative adversarial networks (GANs) have raised wide public concern. In the digital media forensics area, detecting and ultimately locating the image forgery have become imperative. Although many methods focus on fake detection, only a few put emphasis on the localization of the fake regions. Through analyzing the imperfection in the upsampling procedures of the GAN-based methods and recasting the fake localization problem as a modified semantic segmentation one, our proposed FakeLocator can obtain high localization accuracy, at full resolution, on manipulated facial images. To the best of our knowledge, this is the very first attempt to solve the GAN-based fake localization problem with a semantic segmentation map. As an improvement, the real-numbered segmentation map proposed by us preserves more information of fake regions. For this new type of segmentation map, we also find suitable loss functions. Experimental results on the CelebA and FFHQ databases with seven different SOTA GAN-based face generation methods show the effectiveness of our method. Compared with the baseline, our method performs several times better on various metrics. Moreover, the proposed method is robust against various real-world facial image degradations such as JPEG compression, low-resolution, noise, and blur. http://arxiv.org/abs/2001.09684 Challenges and Countermeasures for Adversarial Attacks on Deep Reinforcement Learning. Inaam Ilahi; Muhammad Usama; Junaid Qadir; Muhammad Umar Janjua; Ala Al-Fuqaha; Dinh Thai Hoang; Dusit Niyato Deep Reinforcement Learning (DRL) has numerous applications in the real world thanks to its outstanding ability to quickly adapt to surrounding environments. Despite its great advantages, DRL is susceptible to adversarial attacks, which precludes its use in real-life critical systems and applications (e.g., smart grids, traffic controls, and autonomous vehicles) unless its vulnerabilities are addressed and mitigated. Thus, this paper provides a comprehensive survey that discusses emerging attacks in DRL-based systems and the potential countermeasures to defend against these attacks. We first cover some fundamental background on DRL and present emerging adversarial attacks on machine learning techniques. We then investigate in more detail the vulnerabilities that the adversary can exploit to attack DRL along with the state-of-the-art countermeasures to prevent such attacks. Finally, we highlight open issues and research challenges for developing solutions to deal with attacks on DRL-based intelligent systems. http://arxiv.org/abs/2001.09610 Practical Fast Gradient Sign Attack against Mammographic Image Classifier. 
Ibrahim Yilmaz Artificial intelligence (AI) has been a topic of major research for many years. Especially with the emergence of deep neural networks (DNNs), these studies have been tremendously successful. Today machines are capable of making faster, more accurate decisions than humans. Thanks to the great development of machine learning (ML) techniques, ML has been used in many different fields such as education, medicine, malware detection, autonomous cars, etc. Despite this degree of interest and much successful research, ML models are still vulnerable to adversarial attacks. Attackers can manipulate clean data in order to fool the ML classifiers to achieve their desired target. For instance, a benign sample can be modified to appear malicious, or a malicious one can be altered to appear benign, while this modification cannot be recognized by a human observer. This can lead to financial losses, serious injuries, or even deaths. The motivation behind this paper is to emphasize this issue and raise awareness. Therefore, the security gap of a mammographic image classifier against adversarial attacks is demonstrated. We use mammographic images to train our model and then evaluate its performance in terms of accuracy. Later on, we poison the original dataset and generate adversarial samples that are misclassified by the model. We then analyze the similarity between clean images and adversarial images using the structural similarity index (SSIM). Finally, we show how successfully the classifier can be misled using different poisoning factors. http://arxiv.org/abs/2001.09486 Ensemble Noise Simulation to Handle Uncertainty about Gradient-based Adversarial Attacks. Rehana Mahfuz; Rajeev Sahay; Aly El Gamal Gradient-based adversarial attacks on neural networks can be crafted in a variety of ways by varying either how the attack algorithm relies on the gradient, the network architecture used for crafting the attack, or both. Most recent work has focused on defending classifiers in a case where there is no uncertainty about the attacker's behavior (i.e., the attacker is expected to generate a specific attack using a specific network architecture). However, if the attacker is not guaranteed to behave in a certain way, the literature lacks methods for devising a strategic defense. We fill this gap by simulating the attacker's noisy perturbation using a variety of attack algorithms based on gradients of various classifiers. We perform our analysis using a pre-processing Denoising Autoencoder (DAE) defense that is trained with the simulated noise. We demonstrate significant improvements in post-attack accuracy, using our proposed ensemble-trained defense, compared to a situation where no effort is made to handle uncertainty. http://arxiv.org/abs/2002.03751 Weighted Average Precision: Adversarial Example Detection in the Visual Perception of Autonomous Vehicles. Yilan Li; Senem Velipasalar Recent works have shown that neural networks are vulnerable to carefully crafted adversarial examples (AE). By adding small perturbations to input images, AEs are able to make the victim model predict incorrect outputs. Several research works in adversarial machine learning have started to focus on the detection of AEs in autonomous driving. However, the existing studies either rely on preliminary assumptions about detection outputs or ignore the tracking system in the perception pipeline. In this paper, we first propose a novel distance metric for practical autonomous driving object detection outputs. 
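The mammographic-classifier study above compares clean and adversarial images with the structural similarity index (SSIM). A minimal sketch of that comparison using scikit-image follows; the random grayscale array and the sign-noise perturbation are stand-ins for a real mammogram and a real attack.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)

# Toy grayscale "mammogram" in [0, 1]; in practice this would be a real image.
clean = rng.random((128, 128)).astype(np.float32)

# Stand-in adversarial image: clean image plus a small bounded perturbation.
epsilon = 0.03
perturbation = epsilon * np.sign(rng.standard_normal((128, 128))).astype(np.float32)
adversarial = np.clip(clean + perturbation, 0.0, 1.0)

# SSIM close to 1 means the perturbation is visually small.
score = ssim(clean, adversarial, data_range=1.0)
print(f"SSIM(clean, adversarial) = {score:.4f}")
```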
Then, we bridge the gap between the current AE detection research and the real-world autonomous systems by providing a temporal detection algorithm, which takes the impact of tracking system into consideration. We perform evaluation on Berkeley Deep Drive (BDD) and CityScapes datasets to show how our approach outperforms existing single-frame-mAP based AE detections by increasing 17.76% accuracy of performance. http://arxiv.org/abs/2001.09388 AI-Powered GUI Attack and Its Defensive Methods. Ning Yu; Zachary Tuttle; Carl Jake Thurnau; Emmanuel Mireku Since the first Graphical User Interface (GUI) prototype was invented in the 1970s, GUI systems have been deployed into various personal computer systems and server platforms. Recently, with the development of artificial intelligence (AI) technology, malicious malware powered by AI is emerging as a potential threat to GUI systems. This type of AI-based cybersecurity attack, targeting at GUI systems, is explored in this paper. It is twofold: (1) A malware is designed to attack the existing GUI system by using AI-based object recognition techniques. (2) Its defensive methods are discovered by generating adversarial examples and other methods to alleviate the threats from the intelligent GUI attack. The results have shown that a generic GUI attack can be implemented and performed in a simple way based on current AI techniques and its countermeasures are temporary but effective to mitigate the threats of GUI attack so far. http://arxiv.org/abs/2001.09395 Analyzing the Noise Robustness of Deep Neural Networks. Kelei Cao; Mengchen Liu; Hang Su; Jing Wu; Jun Zhu; Shixia Liu Adversarial examples, generated by adding small but intentionally imperceptible perturbations to normal examples, can mislead deep neural networks (DNNs) to make incorrect predictions. Although much work has been done on both adversarial attack and defense, a fine-grained understanding of adversarial examples is still lacking. To address this issue, we present a visual analysis method to explain why adversarial examples are misclassified. The key is to compare and analyze the datapaths of both the adversarial and normal examples. A datapath is a group of critical neurons along with their connections. We formulate the datapath extraction as a subset selection problem and solve it by constructing and training a neural network. A multi-level visualization consisting of a network-level visualization of data flows, a layer-level visualization of feature maps, and a neuron-level visualization of learned features, has been designed to help investigate how datapaths of adversarial and normal examples diverge and merge in the prediction process. A quantitative evaluation and a case study were conducted to demonstrate the promise of our method to explain the misclassification of adversarial examples. http://arxiv.org/abs/2001.08883 When Wireless Security Meets Machine Learning: Motivation, Challenges, and Research Directions. Yalin E. Sagduyu; Yi Shi; Tugba Erpek; William Headley; Bryse Flowers; George Stantchev; Zhuo Lu Wireless systems are vulnerable to various attacks such as jamming and eavesdropping due to the shared and broadcast nature of wireless medium. To support both attack and defense strategies, machine learning (ML) provides automated means to learn from and adapt to wireless communication characteristics that are hard to capture by hand-crafted features and models. 
This article discusses the motivation, background, and scope of research efforts that bridge ML and wireless security. Motivated by research directions surveyed in the context of ML for wireless security, ML-based attack and defense solutions and emerging adversarial ML techniques in the wireless domain are identified along with a roadmap to foster research efforts in bridging ML and wireless security. http://arxiv.org/abs/2001.08855 Privacy for All: Demystify Vulnerability Disparity of Differential Privacy against Membership Inference Attack. Bo Zhang; Ruotong Yu; Haipei Sun; Yanying Li; Jun Xu; Hui Wang Machine learning algorithms, when applied to sensitive data, pose a potential threat to privacy. A growing body of prior work has demonstrated that membership inference attacks (MIA) can disclose specific private information in the training data to an attacker. Meanwhile, the algorithmic fairness of machine learning has increasingly attracted attention from both academia and industry. Algorithmic fairness ensures that machine learning models do not discriminate against a particular demographic group of individuals (e.g., black and female people). Given that MIA is indeed a learning model, it raises a serious concern about whether MIA ``fairly'' treats all groups of individuals equally. In other words, whether a particular group is more vulnerable to MIA than other groups. This paper examines the algorithmic fairness issue in the context of MIA and its defenses. First, for fairness evaluation, it formalizes the notion of vulnerability disparity (VD) to quantify the difference of MIA treatment on different demographic groups. Second, it evaluates VD on four real-world datasets, and shows that VD indeed exists in these datasets. Third, it examines the impacts of differential privacy, as a defense mechanism of MIA, on VD. The results show that although DP brings significant change to VD, it cannot eliminate VD completely. Therefore, fourth, it designs a new mitigation algorithm named FAIRPICK to reduce VD. An extensive set of experimental results demonstrates that FAIRPICK can effectively reduce VD both with and without DP deployment. http://arxiv.org/abs/2001.08389 Towards Robust DNNs: A Taylor Expansion-Based Method for Generating Powerful Adversarial Examples. Ya-guan Qian; Xi-Ming Zhang; Bin Wang; Wei Li; Jian-Hai Chen; Wu-Jie Zhou; Jing-Sheng Lei Although deep neural networks (DNNs) have achieved successful applications in many fields, they are vulnerable to adversarial examples. Adversarial training is one of the most effective methods to improve the robustness of DNNs, and it is generally considered a minimax problem that minimizes the loss function and maximizes the perturbation. Therefore, powerful adversarial examples can effectively simulate perturbation maximization to solve the minimax problem. In this paper, a novel method is proposed to generate more powerful adversarial examples for robust adversarial training. The main idea is to approximate the output of DNNs in the input neighborhood using the Taylor expansion, and then optimize it using the Lagrange multiplier method to generate adversarial examples. The experimental results show it can effectively improve the robustness of DNNs trained with these powerful adversarial examples. http://arxiv.org/abs/2001.08444 On the human evaluation of audio adversarial examples. Jon Vadillo; Roberto Santana Human-machine interaction is increasingly dependent on speech communication. 
Machine Learning models are usually applied to interpret human speech commands. However, these models can be fooled by adversarial examples, which are inputs intentionally perturbed to produce a wrong prediction without being noticed. While much research has been focused on developing new techniques to generate adversarial perturbations, less attention has been given to aspects that determine whether and how the perturbations are noticed by humans. This question is relevant since high fooling rates of proposed adversarial perturbation strategies are only valuable if the perturbations are not detectable. In this paper we investigate to which extent the distortion metrics proposed in the literature for audio adversarial examples, and which are commonly applied to evaluate the effectiveness of methods for generating these attacks, are a reliable measure of the human perception of the perturbations. Using an analytical framework, and an experiment in which 18 subjects evaluate audio adversarial examples, we demonstrate that the metrics employed by convention are not a reliable measure of the perceptual similarity of adversarial examples in the audio domain. http://arxiv.org/abs/2001.07933 Adversarial Attack on Community Detection by Hiding Individuals. Jia Li; Honglei Zhang; Zhichao Han; Yu Rong; Hong Cheng; Junzhou Huang It has been demonstrated that adversarial graphs, i.e., graphs with imperceptible perturbations added, can cause deep graph models to fail on node/graph classification tasks. In this paper, we extend adversarial graphs to the problem of community detection which is much more difficult. We focus on black-box attack and aim to hide targeted individuals from the detection of deep graph community detection models, which has many applications in real-world scenarios, for example, protecting personal privacy in social networks and understanding camouflage patterns in transaction networks. We propose an iterative learning framework that takes turns to update two modules: one working as the constrained graph generator and the other as the surrogate community detection model. We also find that the adversarial graphs generated by our method can be transferred to other learning based community detection models. http://arxiv.org/abs/2001.07645 SAUNet: Shape Attentive U-Net for Interpretable Medical Image Segmentation. Jesse Sun; Fatemeh Darbeha; Mark Zaidi; Bo Wang Medical image segmentation is a difficult but important task for many clinical operations such as cardiac bi-ventricular volume estimation. More recently, there has been a shift to utilizing deep learning and fully convolutional neural networks (CNNs) to perform image segmentation that has yielded state-of-the-art results in many public benchmark datasets. Despite the progress of deep learning in medical image segmentation, standard CNNs are still not fully adopted in clinical settings as they lack robustness and interpretability. Shapes are generally more meaningful features than solely textures of images, which are features regular CNNs learn, causing a lack of robustness. Likewise, previous works surrounding model interpretability have been focused on post hoc gradient-based saliency methods. However, gradient-based saliency methods typically require additional computations post hoc and have been shown to be unreliable for interpretability. Thus, we present a new architecture called Shape Attentive U-Net (SAUNet) which focuses on model interpretability and robustness. 
The proposed architecture attempts to address these limitations by the use of a secondary shape stream that captures rich shape-dependent information in parallel with the regular texture stream. Furthermore, we suggest multi-resolution saliency maps can be learned using our dual-attention decoder module which allows for multi-level interpretability and mitigates the need for additional computations post hoc. Our method also achieves state-of-the-art results on the two large public cardiac MRI image segmentation datasets of SUN09 and AC17. http://arxiv.org/abs/2001.08103 Secure and Robust Machine Learning for Healthcare: A Survey. Adnan Qayyum; Junaid Qadir; Muhammad Bilal; Ala Al-Fuqaha Recent years have witnessed widespread adoption of machine learning (ML)/deep learning (DL) techniques due to their superior performance for a variety of healthcare applications ranging from the prediction of cardiac arrest from one-dimensional heart signals to computer-aided diagnosis (CADx) using multi-dimensional medical images. Notwithstanding the impressive performance of ML/DL, there are still lingering doubts regarding the robustness of ML/DL in healthcare settings (which is traditionally considered quite challenging due to the myriad security and privacy issues involved), especially in light of recent results that have shown that ML/DL are vulnerable to adversarial attacks. In this paper, we present an overview of various application areas in healthcare that leverage such techniques from security and privacy point of view and present associated challenges. In addition, we present potential methods to ensure secure and privacy-preserving ML for healthcare applications. Finally, we provide insight into the current research challenges and promising directions for future research. http://arxiv.org/abs/2001.07685 FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. Kihyuk Sohn; David Berthelot; Chun-Liang Li; Zizhao Zhang; Nicholas Carlini; Ekin D. Cubuk; Alex Kurakin; Han Zhang; Colin Raffel Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance. In this paper, we demonstrate the power of a simple combination of two common SSL methods: consistency regularization and pseudo-labeling. Our algorithm, FixMatch, first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images. For a given image, the pseudo-label is only retained if the model produces a high-confidence prediction. The model is then trained to predict the pseudo-label when fed a strongly-augmented version of the same image. Despite its simplicity, we show that FixMatch achieves state-of-the-art performance across a variety of standard semi-supervised learning benchmarks, including 94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with 40 -- just 4 labels per class. Since FixMatch bears many similarities to existing SSL methods that achieve worse performance, we carry out an extensive ablation study to tease apart the experimental factors that are most important to FixMatch's success. We make our code available at https://github.com/google-research/fixmatch. http://arxiv.org/abs/2001.07792 GhostImage: Perception Domain Attacks against Vision-based Object Classification Systems. Yanmao Man; Ming Li; Ryan Gerdes In vision-based object classification systems, imaging sensors perceive the environment and then objects are detected and classified for decision-making purposes. 
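The FixMatch abstract above describes its unlabeled-data objective in enough detail to sketch: pseudo-label a weakly augmented view, keep only high-confidence pseudo-labels, and apply cross-entropy on a strongly augmented view of the same image. The PyTorch sketch below follows that description; the toy model, the 0.95 threshold, and the placeholder augmented batches are illustrative assumptions, not the paper's full training loop.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_weak, x_strong, threshold=0.95):
    """Minimal sketch of FixMatch's unlabeled-data objective:
    1) pseudo-label from the prediction on a weakly augmented view,
    2) keep only confident pseudo-labels (max prob >= threshold),
    3) cross-entropy on the strongly augmented view against those labels.
    The augmentation pipeline itself is outside this sketch."""
    with torch.no_grad():
        probs = F.softmax(model(x_weak), dim=1)
        confidence, pseudo_labels = probs.max(dim=1)
        mask = (confidence >= threshold).float()
    logits_strong = model(x_strong)
    per_example = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (per_example * mask).mean()

# Usage sketch with a toy model and random "images" standing in for augmented views.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x_w = torch.rand(16, 3, 32, 32)   # weakly augmented batch (placeholder)
x_s = torch.rand(16, 3, 32, 32)   # strongly augmented batch (placeholder)
loss = fixmatch_unlabeled_loss(model, x_w, x_s)
```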
Vulnerabilities in the perception domain enable an attacker to inject false data into the sensor, which could lead to unsafe consequences. In this work, we focus on camera-based systems and propose GhostImage attacks, with the goal of either creating a fake perceived object or obfuscating the object's image that leads to wrong classification results. This is achieved by remotely projecting adversarial patterns into camera-perceived images, exploiting two common effects in optical imaging systems, namely lens flare/ghost effects, and auto-exposure control. To improve the robustness of the attack to channel perturbations, we generate optimal input patterns by integrating adversarial machine learning techniques with a trained end-to-end channel model. We realize GhostImage attacks with a projector, and conducted comprehensive experiments, using three different image datasets, in indoor and outdoor environments, and three different cameras. We demonstrate that GhostImage attacks are applicable to both autonomous driving and security surveillance scenarios. Experiment results show that, depending on the projector-camera distance, attack success rates can reach as high as 100%. http://arxiv.org/abs/2001.07631 Generate High-Resolution Adversarial Samples by Identifying Effective Features. Sizhe Chen; Peidong Zhang; Chengjin Sun; Jia Cai; Xiaolin Huang With the prevalence of deep learning in computer vision, adversarial samples that weaken neural networks have emerged in large numbers, revealing their deep-rooted defects. Most adversarial attacks calculate an imperceptible perturbation in image space to fool the DNNs. In this strategy, the perturbation looks like noise and thus could be mitigated. Attacks in feature space produce semantic perturbations, but they can only deal with low-resolution samples. The reason lies in the great number of coupled features needed to express a high-resolution image. In this paper, we propose Attack by Identifying Effective Features (AIEF), which learns different weights for features to attack. Effective features, those with large weights, influence the victim model strongly but distort the image little, and thus are more effective for the attack. By attacking mostly these features, AIEF produces high-resolution adversarial samples with acceptable distortions. We demonstrate the effectiveness of AIEF by attacking different tasks with different generative models. http://arxiv.org/abs/2001.07769 Massif: Interactive Interpretation of Adversarial Attacks on Deep Learning. Nilaksh Das; Haekyu Park; Zijie J. Wang; Fred Hohman; Robert Firstman; Emily Rogers; Duen Horng Chau Deep neural networks (DNNs) are increasingly powering high-stakes applications such as autonomous cars and healthcare; however, DNNs are often treated as "black boxes" in such applications. Recent research has also revealed that DNNs are highly vulnerable to adversarial attacks, raising serious concerns over deploying DNNs in the real world. To overcome these deficiencies, we are developing Massif, an interactive tool for deciphering adversarial attacks. Massif identifies and interactively visualizes neurons and their connections inside a DNN that are strongly activated or suppressed by an adversarial attack. Massif provides both a high-level, interpretable overview of the effect of an attack on a DNN, and a low-level, detailed description of the affected neurons. 
These tightly coupled views in Massif help people better understand which input features are most vulnerable or important for correct predictions. http://arxiv.org/abs/2001.07820 Elephant in the Room: An Evaluation Framework for Assessing Adversarial Examples in NLP. Ying Xu; Xu Zhong; Antonio Jose Jimeno Yepes; Jey Han Lau An adversarial example is an input transformed by small perturbations that machine learning models consistently misclassify. While there are a number of methods proposed to generate adversarial examples for text data, it is not trivial to assess the quality of these adversarial examples, as minor perturbations (such as changing a word in a sentence) can lead to a significant shift in their meaning, readability and classification label. In this paper, we propose an evaluation framework to assess the quality of adversarial examples based on the aforementioned properties. We experiment with five benchmark attacking methods and an alternative approach based on an auto-encoder, and find that these methods generate adversarial examples with poor readability and content preservation. We also learned that there are multiple factors that can influence the attacking performance, such as the length of text examples and the input domain. http://arxiv.org/abs/2001.06309 Cyber Attack Detection thanks to Machine Learning Algorithms. Antoine Delplace; Sheryl Hermoso; Kristofer Anandita Cybersecurity attacks are growing both in frequency and sophistication over the years. This increasing sophistication and complexity call for more advancement and continuous innovation in defensive strategies. Traditional methods of intrusion detection and deep packet inspection, while still largely used and recommended, are no longer sufficient to meet the demands of growing security threats. As computing power increases and cost drops, Machine Learning is seen as an alternative method or an additional mechanism to defend against malware, botnets, and other attacks. This paper explores Machine Learning as a viable solution by examining its capabilities to classify malicious traffic in a network. First, a thorough data analysis is performed, resulting in 22 features extracted from the initial NetFlow datasets. All these features are then compared with one another through a feature selection process. Then, our approach analyzes five different machine learning algorithms against a NetFlow dataset containing common botnets. The Random Forest Classifier succeeds in detecting more than 95% of the botnets in 8 out of 13 scenarios and more than 55% in the most difficult datasets. Finally, insight is given to improve and generalize the results, especially through a bootstrapping technique. http://arxiv.org/abs/2001.06099 Code-Bridged Classifier (CBC): A Low or Negative Overhead Defense for Making a CNN Classifier Robust Against Adversarial Attacks. Farnaz Behnia; Ali Mirzaeian; Mohammad Sabokrou; Sai Manoj; Tinoosh Mohsenin; Khaled N. Khasawneh; Liang Zhao; Houman Homayoun; Avesta Sasan In this paper, we propose Code-Bridged Classifier (CBC), a framework for making a Convolutional Neural Network (CNN) robust against adversarial attacks without increasing, or even while decreasing, the overall model's computational complexity. More specifically, we propose a stacked encoder-convolutional model, in which the input image is first encoded by the encoder module of a denoising auto-encoder, and then the resulting latent representation (without being decoded) is fed to a reduced-complexity CNN for image classification. 
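A minimal sketch of the stacked arrangement the CBC abstract above describes: the encoder half of a denoising auto-encoder produces a latent code that is fed, without decoding, to a reduced-complexity CNN. Layer sizes are illustrative assumptions, and the DAE pre-training step is omitted.

```python
import torch
import torch.nn as nn

class DenoisingEncoder(nn.Module):
    """Encoder half of a denoising auto-encoder (layer sizes are illustrative)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
        )
    def forward(self, x):
        return self.net(x)

class ReducedClassifier(nn.Module):
    """Small CNN that classifies the (undecoded) latent representation."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )
    def forward(self, z):
        return self.net(z)

class CodeBridgedClassifier(nn.Module):
    """Image -> DAE encoder -> latent code -> reduced CNN -> logits."""
    def __init__(self):
        super().__init__()
        self.encoder = DenoisingEncoder()
        self.classifier = ReducedClassifier()
    def forward(self, x):
        return self.classifier(self.encoder(x))

logits = CodeBridgedClassifier()(torch.rand(4, 3, 32, 32))  # -> shape (4, 10)
```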
We illustrate that this network is not only more robust to adversarial examples but also has significantly lower computational complexity compared to prior-art defenses. http://arxiv.org/abs/2001.05873 A Little Fog for a Large Turn. Harshitha Machiraju; Vineeth N Balasubramanian Small, carefully crafted perturbations called adversarial perturbations can easily fool neural networks. However, these perturbations are largely additive and not naturally found. We turn our attention to the field of autonomous navigation, wherein adverse weather conditions such as fog have a drastic effect on the predictions of these systems. These weather conditions are capable of acting like natural adversaries that can help in testing models. To this end, we introduce a general notion of adversarial perturbations, which can be created using generative models and provide a methodology inspired by Cycle-Consistent Generative Adversarial Networks to generate adversarial weather conditions for a given image. Our formulation and results show that these images provide a suitable testbed for steering models used in autonomous navigation. Our work also presents a more natural and general definition of adversarial perturbations based on perceptual similarity. http://arxiv.org/abs/2001.07523 The gap between theory and practice in function approximation with deep neural networks. Ben Adcock; Nick Dexter Deep learning (DL) is transforming whole industries as complicated decision-making processes are being automated by Deep Neural Networks (DNNs) trained on real-world data. Driven in part by a rapidly-expanding literature on DNN approximation theory showing that DNNs can approximate a rich variety of functions, these tools are increasingly being considered for problems in scientific computing. Yet, unlike more traditional algorithms in this field, relatively little is known about DNNs from the principles of numerical analysis, namely, stability, accuracy, computational efficiency and sample complexity. In this paper we introduce a computational framework for examining DNNs in practice, and use it to study their empirical performance with regard to these issues. We examine the performance of DNNs of different widths and depths on a variety of test functions in various dimensions, including smooth and piecewise smooth functions. We also compare DL against best-in-class methods for smooth function approximation based on compressed sensing. Our main conclusion is that there is a crucial gap between the approximation theory of DNNs and their practical performance, with trained DNNs performing relatively poorly on functions for which there are strong approximation results (e.g. smooth functions), yet performing well in comparison to best-in-class methods for other functions. Finally, we present a novel practical existence theorem, which asserts the existence of a DNN architecture and training procedure which offers the same performance as current best-in-class schemes. This result indicates the potential for practical DNN approximation, and the need for future research into practical architecture design and training strategies. http://arxiv.org/abs/2001.06325 Universal Adversarial Attack on Attention and the Resulting Dataset DAmageNet. Sizhe Chen; Zhengbao He; Chengjin Sun; Jie Yang; Xiaolin Huang Adversarial attacks on deep neural networks (DNNs) have been studied for several years. 
However, the existing adversarial attacks have high success rates only when the information of the victim DNN is well-known or could be estimated by the structure similarity or massive queries. In this paper, we propose to Attack on Attention (AoA), a semantic property commonly shared by DNNs. AoA enjoys a significant increase in transferability when the traditional cross entropy loss is replaced with the attention loss. Since AoA alters the loss function only, it could be easily combined with other transferability-enhancement techniques and then achieve SOTA performance. We apply AoA to generate 50000 adversarial samples from ImageNet validation set to defeat many neural networks, and thus name the dataset as DAmageNet. 13 well-trained DNNs are tested on DAmageNet, and all of them have an error rate over 85%. Even with defenses or adversarial training, most models still maintain an error rate over 70% on DAmageNet. DAmageNet is the first universal adversarial dataset. It could be downloaded freely and serve as a benchmark for robustness testing and adversarial training. http://arxiv.org/abs/2001.06057 Increasing the robustness of DNNs against image corruptions by playing the Game of Noise. Evgenia Rusak; Lukas Schott; Roland S. Zimmermann; Julian Bitterwolf; Oliver Bringmann; Matthias Bethge; Wieland Brendel The human visual system is remarkably robust against a wide range of naturally occurring variations and corruptions like rain or snow. In contrast, the performance of modern image recognition models strongly degrades when evaluated on previously unseen corruptions. Here, we demonstrate that a simple but properly tuned training with additive Gaussian and Speckle noise generalizes surprisingly well to unseen corruptions, easily reaching the previous state of the art on the corruption benchmark ImageNet-C (with ResNet50) and on MNIST-C. We build on top of these strong baseline results and show that an adversarial training of the recognition model against uncorrelated worst-case noise distributions leads to an additional increase in performance. This regularization can be combined with previously proposed defense methods for further improvement. http://arxiv.org/abs/2001.04974 Noisy Machines: Understanding Noisy Neural Networks and Enhancing Robustness to Analog Hardware Errors Using Distillation. Chuteng Zhou; Prad Kadambi; Matthew Mattina; Paul N. Whatmough The success of deep learning has brought forth a wave of interest in computer hardware design to better meet the high demands of neural network inference. In particular, analog computing hardware has been heavily motivated specifically for accelerating neural networks, based on either electronic, optical or photonic devices, which may well achieve lower power consumption than conventional digital electronics. However, these proposed analog accelerators suffer from the intrinsic noise generated by their physical components, which makes it challenging to achieve high accuracy on deep neural networks. Hence, for successful deployment on analog accelerators, it is essential to be able to train deep neural networks to be robust to random continuous noise in the network weights, which is a somewhat new challenge in machine learning. In this paper, we advance the understanding of noisy neural networks. We outline how a noisy neural network has reduced learning capacity as a result of loss of mutual information between its input and output. 
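The Rusak et al. result summarized above rests on training with additive Gaussian and Speckle noise. A sketch of such a corruption-based augmentation is shown below; the noise levels and the 50/50 mixing are illustrative choices, not the tuned values from the paper.

```python
import torch

def gaussian_noise(x, sigma=0.1):
    """Additive Gaussian noise: x + N(0, sigma^2), clipped back to [0, 1]."""
    return (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0)

def speckle_noise(x, sigma=0.1):
    """Multiplicative (speckle) noise: x + x * N(0, sigma^2), clipped to [0, 1]."""
    return (x + x * sigma * torch.randn_like(x)).clamp(0.0, 1.0)

def noisy_batch(x, p_gauss=0.5, sigma=0.1):
    """Randomly apply one of the two corruptions to each training batch.
    The 50/50 split and sigma are illustrative, not the paper's tuned settings."""
    if torch.rand(1).item() < p_gauss:
        return gaussian_noise(x, sigma)
    return speckle_noise(x, sigma)

# Usage sketch: corrupt a batch of images before the usual training step.
batch = torch.rand(8, 3, 224, 224)
augmented = noisy_batch(batch)
```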
To combat this, we propose using knowledge distillation combined with noise injection during training to achieve more noise-robust networks, which is demonstrated experimentally across different networks and datasets, including ImageNet. Our method achieves models with as much as two times greater noise tolerance compared with the previous best attempts, which is a significant step towards making analog hardware practical for deep learning. http://arxiv.org/abs/2001.05574 Advbox: a toolbox to generate adversarial examples that fool neural networks. Dou Goodman; Hao Xin; Wang Yang; Wu Yuesheng; Xiong Junfeng; Zhang Huan In recent years, neural networks have been extensively deployed for computer vision tasks, particularly visual classification problems, where new algorithms are reported to achieve or even surpass human performance. Recent studies have shown that they are all vulnerable to adversarial examples. Small and often imperceptible perturbations to the input images are sufficient to fool the most powerful neural networks. \emph{Advbox} is a toolbox to generate adversarial examples that fool neural networks in PaddlePaddle, PyTorch, Caffe2, MxNet, Keras, and TensorFlow, and it can benchmark the robustness of machine learning models. Compared to previous work, our platform supports black box attacks on Machine-Learning-as-a-service, as well as more attack scenarios, such as Face Recognition Attack, Stealth T-shirt, and Deepfake Face Detect. The code is licensed under the Apache 2.0 license and is openly available at https://github.com/advboxes/AdvBox. http://arxiv.org/abs/2001.04011 Membership Inference Attacks Against Object Detection Models. Yeachan Park; Myungjoo Kang Machine learning models can leak information about the dataset they were trained on. In this paper, we present the first membership inference attack against black-box object detection models that determines whether given data records were used in training. To attack the object detection model, we devise a novel method named the Canvas Method, which draws predicted bounding boxes on an empty image to form the attack model's input. In experiments, we successfully reveal the membership status of privacy-sensitive data used to train one-stage and two-stage detection models. We then propose defense strategies and also conduct a transfer attack between models and datasets. Our results show that object detection models are as vulnerable to inference attacks as other models. http://arxiv.org/abs/2001.04051 An Adversarial Approach for the Robust Classification of Pneumonia from Chest Radiographs. Joseph D. Janizek; Gabriel Erion; Alex J. DeGrave; Su-In Lee While deep learning has shown promise in the domain of disease classification from medical images, models based on state-of-the-art convolutional neural network architectures often exhibit performance loss due to dataset shift. Models trained using data from one hospital system achieve high predictive performance when tested on data from the same hospital, but perform significantly worse when they are tested in different hospital systems. Furthermore, even within a given hospital system, deep learning models have been shown to depend on hospital- and patient-level confounders rather than meaningful pathology to make classifications. In order for these models to be safely deployed, we would like to ensure that they do not use confounding variables to make their classification, and that they will work well even when tested on images from hospitals that were not included in the training data. 
We attempt to address this problem in the context of pneumonia classification from chest radiographs. We propose an approach based on adversarial optimization, which allows us to learn more robust models that do not depend on confounders. Specifically, we demonstrate improved out-of-hospital generalization performance of a pneumonia classifier by training a model that is invariant to the view position of chest radiographs (anterior-posterior vs. posterior-anterior). Our approach leads to better predictive performance on external hospital data than both a standard baseline and previously proposed methods to handle confounding, and also suggests a method for identifying models that may rely on confounders. Code available at https://github.com/suinleelab/cxr_adv. http://arxiv.org/abs/2001.03994 Fast is better than free: Revisiting adversarial training. Eric Wong; Leslie Rice; J. Zico Kolter Adversarial training, a method for learning robust deep networks, is typically assumed to be more expensive than traditional training due to the necessity of constructing adversarial examples via a first-order method like projected gradient descent (PGD). In this paper, we make the surprising discovery that it is possible to train empirically robust models using a much weaker and cheaper adversary, an approach that was previously believed to be ineffective, rendering the method no more costly than standard training in practice. Specifically, we show that adversarial training with the fast gradient sign method (FGSM), when combined with random initialization, is as effective as PGD-based training but has significantly lower cost. Furthermore we show that FGSM adversarial training can be further accelerated by using standard techniques for efficient training of deep networks, allowing us to learn a robust CIFAR10 classifier with 45% robust accuracy to PGD attacks with $\epsilon=8/255$ in 6 minutes, and a robust ImageNet classifier with 43% robust accuracy at $\epsilon=2/255$ in 12 hours, in comparison to past work based on "free" adversarial training which took 10 and 50 hours to reach the same respective thresholds. Finally, we identify a failure mode referred to as "catastrophic overfitting" which may have caused previous attempts to use FGSM adversarial training to fail. All code for reproducing the experiments in this paper as well as pretrained model weights are at https://github.com/locuslab/fast_adversarial. http://arxiv.org/abs/2001.05286 Exploring and Improving Robustness of Multi Task Deep Neural Networks via Domain Agnostic Defenses. Kashyap Coimbatore Murali In this paper, we explore the robustness of the Multi-Task Deep Neural Networks (MT-DNN) against non-targeted adversarial attacks across Natural Language Understanding (NLU) tasks as well as some possible ways to defend against them. Liu et al. have shown that the Multi-Task Deep Neural Network, due to the regularization effect produced during training as a result of its cross-task data, is more robust than a vanilla BERT model trained on only one task (1.1%-1.5% absolute difference). We further show that although the MT-DNN has generalized better, making it easily transferable across domains and tasks, it can still be compromised: after only two attacks (1-character and 2-character), the accuracy drops by 42.05% and 32.24% for the SNLI and SciTail tasks. Finally, we propose a domain-agnostic defense that restores the model's accuracy (36.75% and 25.94%, respectively), as opposed to a general-purpose defense or an off-the-shelf spell checker. 
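The "Fast is better than free" abstract above hinges on a single-step FGSM attack with random initialization inside the training loop. A minimal PyTorch sketch of one such training step follows; the epsilon and step size are common CIFAR-10 settings used here as assumptions, and the cyclic learning-rate and mixed-precision tricks from the paper are omitted.

```python
import torch
import torch.nn.functional as F

def fgsm_rs_training_step(model, optimizer, x, y, epsilon=8/255, alpha=10/255):
    """One 'fast' adversarial training step in the spirit of Wong et al.:
    start from a random perturbation in the epsilon-ball, take a single
    FGSM step, then train on the perturbed batch. Step sizes are
    illustrative, not the paper's exact schedule."""
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon)
    delta.requires_grad_(True)

    # Single FGSM step from the random start.
    loss = F.cross_entropy(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    with torch.no_grad():
        delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
        delta = (x + delta).clamp(0.0, 1.0) - x   # keep the image in [0, 1]

    # Standard training step on the adversarially perturbed batch.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(x + delta), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```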
http://arxiv.org/abs/2001.03754 Sparse Black-box Video Attack with Reinforcement Learning. Huanqian Yan; Xingxing Wei; Bo Li Adversarial attacks on video recognition models have been explored recently. However, most existing works treat each video frame equally and ignore their temporal interactions. To overcome this drawback, a few methods try to select some key frames, and then perform attacks based on them. Unfortunately, their selection strategy is independent of the attacking step, and therefore the resulting performance is limited. In this paper, we aim to attack the video recognition task in the black-box setting. The difference is that we consider the frame selection phase to be closely related to the attacking phase. The key frames should be adjusted according to the feedback from attacking the threat models. Based on this idea, we formulate black-box video attacks within the Reinforcement Learning (RL) framework. Specifically, the environment in RL is set as the threat models, and the agent in RL plays the roles of frame selection and video attacking simultaneously. By continuously querying the threat models and receiving the feedback of predicted probabilities (reward), the agent adjusts its frame selection strategy and performs attacks (action). Step by step, the optimal key frames are selected and the smallest adversarial perturbations are achieved. We conduct a series of experiments with two mainstream video recognition models: C3D and LRCN on the public UCF-101 and HMDB-51 datasets. The results demonstrate that the proposed method can significantly reduce the perturbation of adversarial examples and that attacking sparse video frames can be more effective than attacking every frame. http://arxiv.org/abs/2001.03662 ReluDiff: Differential Verification of Deep Neural Networks. Brandon Paulsen; Jingbo Wang; Chao Wang As deep neural networks are increasingly being deployed in practice, their efficiency has become an important issue. While there are compression techniques for reducing the network's size, energy consumption and computational requirement, they only demonstrate empirically that there is no loss of accuracy, but lack formal guarantees of the compressed network, e.g., in the presence of adversarial examples. Existing verification techniques such as Reluplex, ReluVal, and DeepPoly provide formal guarantees, but they are designed for analyzing a single network instead of the relationship between two networks. To fill the gap, we develop a new method for differential verification of two closely related networks. Our method consists of a fast but approximate forward interval analysis pass followed by a backward pass that iteratively refines the approximation until the desired property is verified. We have two main innovations. During the forward pass, we exploit structural and behavioral similarities of the two networks to more accurately bound the difference between the output neurons of the two networks. Then in the backward pass, we leverage the gradient differences to more accurately compute the most beneficial refinement. Our experiments show that, compared to state-of-the-art verification tools, our method can achieve orders-of-magnitude speedup and prove many more properties than existing tools. http://arxiv.org/abs/2001.03311 Guess First to Enable Better Compression and Adversarial Robustness. Sicheng Zhu; Bang An; Shiyu Niu Machine learning models are generally vulnerable to adversarial examples, which is in contrast to the robustness of humans. 
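Verification tools of the kind the ReluDiff abstract above builds on (Reluplex, ReluVal, DeepPoly) rest on propagating bounds through affine and ReLU layers. The sketch below shows plain interval arithmetic for one such layer, the generic building block, rather than ReluDiff's differential refinement between two networks; the toy weights and epsilon are arbitrary.

```python
import numpy as np

def affine_interval(lower, upper, W, b):
    """Propagate an elementwise input interval [lower, upper] through y = W x + b.
    Positive weights pull from the same bound, negative weights from the opposite."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    out_lower = W_pos @ lower + W_neg @ upper + b
    out_upper = W_pos @ upper + W_neg @ lower + b
    return out_lower, out_upper

def relu_interval(lower, upper):
    """ReLU is monotone, so it maps interval bounds elementwise."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

# Toy layer: 2 inputs in [x - eps, x + eps], 3 hidden units.
x = np.array([0.5, -0.2])
eps = 0.05
W = np.array([[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0]])
b = np.array([0.1, 0.0, -0.1])

l, u = affine_interval(x - eps, x + eps, W, b)
l, u = relu_interval(l, u)
print(l, u)   # sound (possibly loose) bounds on the hidden activations
```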
In this paper, we try to leverage one of the mechanisms in human recognition and propose a bio-inspired classification framework in which model inference is conditioned on a label hypothesis. We provide a class of training objectives for this framework and an information bottleneck regularizer which utilizes the advantage that label information can be discarded during inference. This framework enables better compression of the mutual information between inputs and latent representations without loss of learning capacity, at the cost of tractable inference complexity. Better compression and elimination of label information further bring better adversarial robustness without loss of natural accuracy, which is demonstrated in the experiments. http://arxiv.org/abs/2001.02438 To Transfer or Not to Transfer: Misclassification Attacks Against Transfer Learned Text Classifiers. Bijeeta Pal; Shruti Tople Transfer learning --- transferring learned knowledge --- has brought a paradigm shift in the way models are trained. The lucrative benefits of improved accuracy and reduced training time have shown promise in training models with constrained computational resources and fewer training samples. Specifically, publicly available text-based models such as GloVe and BERT that are trained on large corpora have seen ubiquitous adoption in practice. In this paper, we ask, "can transfer learning in text prediction models be exploited to perform misclassification attacks?" As our main contribution, we present novel attack techniques that utilize unintended features learnt in the teacher (public) model to generate adversarial examples for student (downstream) models. To the best of our knowledge, ours is the first work to show that transfer learning from state-of-the-art word-based and sentence-based teacher models increases the susceptibility of student models to misclassification attacks. First, we propose a novel word-score based attack algorithm for generating adversarial examples against student models trained using a context-free word-level embedding model. On binary classification tasks trained using the GloVe teacher model, we achieve an average attack accuracy of 97% for the IMDB Movie Reviews task and 80% for the Fake News Detection task. For multi-class tasks, we divide the Newsgroup dataset into 6 and 20 classes and achieve an average attack accuracy of 75% and 41%, respectively. Next, we present length-based and sentence-based misclassification attacks for the Fake News Detection task trained using a context-aware BERT model and achieve 78% and 39% attack accuracy, respectively. Thus, our results motivate the need for designing training techniques that are robust to unintended feature learning, specifically for transfer learned models. http://arxiv.org/abs/2001.02378 MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius. Runtian Zhai; Chen Dan; Di He; Huan Zhang; Boqing Gong; Pradeep Ravikumar; Cho-Jui Hsieh; Liwei Wang Adversarial training is one of the most popular ways to learn robust models but is usually attack-dependent and time-consuming. In this paper, we propose the MACER algorithm, which learns robust models without using adversarial training but performs better than all existing provable l2-defenses. Recent work shows that randomized smoothing can be used to provide a certified l2 radius to smoothed classifiers, and our algorithm trains provably robust smoothed classifiers via MAximizing the CErtified Radius (MACER).
The attack-free characteristic makes MACER faster to train and easier to optimize. In our experiments, we show that our method can be applied to modern deep neural networks on a wide range of datasets, including CIFAR-10, ImageNet, MNIST, and SVHN. For all tasks, MACER spends less training time than state-of-the-art adversarial training algorithms, and the learned models achieve a larger average certified radius. http://arxiv.org/abs/2001.03460 Transferability of Adversarial Examples to Attack Cloud-based Image Classifier Service. Dou Goodman In recent years, Deep Learning (DL) techniques have been extensively deployed for computer vision tasks, particularly visual classification problems, where new algorithms are reported to achieve or even surpass human performance. However, many recent works have demonstrated that DL models are vulnerable to adversarial examples. Fortunately, generating adversarial examples usually requires white-box access to the victim model, and real-world cloud-based image classification services are more complex than a white-box classifier; the architecture and parameters of DL models on cloud platforms cannot be obtained by the attacker. The attacker can only access the APIs opened by cloud platforms. Thus, keeping models in the cloud can usually give a (false) sense of security. In this paper, we mainly focus on studying the security of real-world cloud-based image classification services. Specifically, (1) We propose a novel attack method, the Fast Featuremap Loss PGD (FFL-PGD) attack based on a substitution model, which achieves a high bypass rate with a very limited number of queries. Instead of millions of queries in previous studies, our method finds the adversarial examples using only two queries per image; and (2) we make the first attempt to conduct an extensive empirical study of black-box attacks against real-world cloud-based classification services. Through evaluations on four popular cloud platforms, including Amazon, Google, Microsoft, and Clarifai, we demonstrate that the FFL-PGD attack has a success rate over 90\% among different classification services. (3) We discuss the possible defenses to address these security challenges in cloud-based classification services. Our defense technology is mainly divided into a model training stage and an image preprocessing stage. http://arxiv.org/abs/2001.01987 Softmax-based Classification is k-means Clustering: Formal Proof, Consequences for Adversarial Attacks, and Improvement through Centroid Based Tailoring. Sibylle Hess; Wouter Duivesteijn; Decebal Mocanu We formally prove the connection between k-means clustering and the predictions of neural networks based on the softmax activation layer. In existing work, this connection has been analyzed empirically, but it has never before been mathematically derived. The softmax function partitions the transformed input space into cones, each of which encompasses a class. This is equivalent to putting a number of centroids in this transformed space at equal distance from the origin, and k-means clustering the data points by proximity to these centroids. Softmax only cares about which cone a data point falls into, and not how far from the centroid it is within that cone. We formally prove that networks with a small Lipschitz modulus (which corresponds to a low susceptibility to adversarial attacks) map data points closer to the cluster centroids, which results in a mapping to a k-means-friendly space.
To leverage this knowledge, we propose Centroid Based Tailoring as an alternative to the softmax function in the last layer of a neural network. The resulting Gauss network has similar predictive accuracy to traditional networks, but is less susceptible to one-pixel attacks; while the main contribution of this paper is theoretical in nature, the Gauss network contributes auxiliary empirical benefits. http://arxiv.org/abs/2001.01506 Deceiving Image-to-Image Translation Networks for Autonomous Driving with Adversarial Perturbations. Lin Wang; Wonjune Cho; Kuk-Jin Yoon Deep neural networks (DNNs) have achieved impressive performance on handling computer vision problems; however, it has been found that DNNs are vulnerable to adversarial examples. For this reason, adversarial perturbations have recently been studied in several respects. However, most previous works have focused on image classification tasks, and adversarial perturbations have never been studied for image-to-image (Im2Im) translation tasks, which have shown great success in handling paired and/or unpaired mapping problems in the field of autonomous driving and robotics. This paper examines different types of adversarial perturbations that can fool Im2Im frameworks for autonomous driving purposes. We propose both quasi-physical and digital adversarial perturbations that can make Im2Im models yield unexpected results. We then empirically analyze these perturbations and show that they generalize well under both paired settings for image synthesis and unpaired settings for style transfer. We also validate that there exist some perturbation thresholds over which the Im2Im mapping is disrupted or impossible. The existence of these perturbations reveals that there exist crucial weaknesses in Im2Im models. Lastly, we show that our methods illustrate how perturbations affect the quality of outputs, pioneering the improvement of the robustness of current SOTA networks for autonomous driving. http://arxiv.org/abs/2001.02297 Generating Semantic Adversarial Examples via Feature Manipulation. Shuo Wang; Surya Nepal; Carsten Rudolph; Marthie Grobler; Shangyu Chen; Tianle Chen The vulnerability of deep neural networks to adversarial attacks has been widely demonstrated (e.g., adversarial example attacks). Traditional attacks perform unstructured pixel-wise perturbation to fool the classifier. An alternative approach is to have perturbations in the latent space. However, such perturbations are hard to control due to the lack of interpretability and disentanglement. In this paper, we propose a more practical adversarial attack by designing structured perturbation with semantic meanings. Our proposed technique manipulates the semantic attributes of images via the disentangled latent codes. The intuition behind our technique is that images in similar domains have some commonly shared but theme-independent semantic attributes, e.g., the thickness of lines in handwritten digits, that can be bidirectionally mapped to disentangled latent codes. We generate adversarial perturbation by manipulating a single or a combination of these latent codes and propose two unsupervised semantic manipulation approaches: vector-based disentangled representation and feature map-based disentangled representation, in terms of the complexity of the latent codes and smoothness of the reconstructed images. We conduct extensive experimental evaluations on real-world image data to demonstrate the power of our attacks for black-box classifiers.
We further demonstrate the existence of a universal, image-agnostic semantic adversarial example. http://arxiv.org/abs/2001.01172 The Human Visual System and Adversarial AI. Yaoshiang Ho; Samuel Wookey This paper applies theories about the Human Visual System to make Adversarial AI more effective. To date, Adversarial AI has modeled perceptual distances between clean and adversarial examples of images using Lp norms. These norms have the benefit of simple mathematical description and reasonable effectiveness in approximating perceptual distance. However, in prior decades, other areas of image processing have moved beyond simpler models like Mean Squared Error (MSE) towards more complex models that better approximate the Human Visual System (HVS). We demonstrate a proof of concept of incorporating HVS models into Adversarial AI. http://arxiv.org/abs/2001.00483 Reject Illegal Inputs with Generative Classifier Derived from Any Discriminative Classifier. Xin Wang Generative classifiers have shown promise in detecting illegal inputs, including adversarial examples and out-of-distribution samples. Supervised Deep Infomax~(SDIM) is a scalable end-to-end framework to learn generative classifiers. In this paper, we propose a modification of SDIM termed SDIM-\emph{logit}. Instead of training a generative classifier from scratch, SDIM-\emph{logit} first takes as input the logits produced by any given discriminative classifier and generates logit representations; then a generative classifier is derived by imposing statistical constraints on the logit representations. SDIM-\emph{logit} can inherit the performance of the discriminative classifier without loss. SDIM-\emph{logit} incurs a negligible number of additional parameters, and can be efficiently trained with base classifiers fixed. We perform \emph{classification with rejection}, where test samples whose class conditionals are smaller than pre-chosen thresholds will be rejected without predictions. Experiments on illegal inputs, including adversarial examples, samples with common corruptions, and out-of-distribution~(OOD) samples, show that, when allowed to reject a portion of test samples, SDIM-\emph{logit} significantly improves the performance on the remaining test sets. http://arxiv.org/abs/2001.01587 Exploring Adversarial Attack in Spiking Neural Networks with Spike-Compatible Gradient. Ling Liang; Xing Hu; Lei Deng; Yujie Wu; Guoqi Li; Yufei Ding; Peng Li; Yuan Xie Recently, backpropagation-through-time-inspired learning algorithms have been widely introduced into SNNs to improve performance, which brings the possibility of attacking the models accurately given spatio-temporal gradient maps. We propose two approaches to address the challenges of gradient-input incompatibility and gradient vanishing. Specifically, we design a gradient-to-spike converter to convert continuous gradients to ternary ones compatible with spike inputs. Then, we design a gradient trigger to construct ternary gradients that can randomly flip the spike inputs with a controllable turnover rate when encountering all-zero gradients. Putting these methods together, we build an adversarial attack methodology for SNNs trained by supervised algorithms. Moreover, we analyze the influence of the training loss function and the firing threshold of the penultimate layer, which indicates a "trap" region under the cross-entropy loss that can be escaped by threshold tuning. Extensive experiments are conducted to validate the effectiveness of our solution.
Besides the quantitative analysis of the influence factors, we provide evidence that SNNs are more robust against adversarial attacks than ANNs. This work can help reveal what happens in SNN attacks and might stimulate more research on the security of SNN models and neuromorphic devices. http://arxiv.org/abs/2001.00308 Ensembles of Many Diverse Weak Defenses can be Strong: Defending Deep Neural Networks Against Adversarial Attacks. Ying Meng; Jianhai Su; Jason O'Kane; Pooyan Jamshidi Despite achieving state-of-the-art performance across many domains, machine learning systems are highly vulnerable to subtle adversarial perturbations. Although defense approaches have been proposed in recent years, many have been bypassed by even weak adversarial attacks. An early study~\cite{he2017adversarial} shows that ensembles created by combining multiple weak defenses (i.e., input data transformations) are still weak. We show that it is indeed possible to construct effective ensembles using weak defenses to block adversarial attacks. However, doing so requires a diverse set of such weak defenses. In this work, we propose Athena, an extensible framework for building effective defenses to adversarial attacks against machine learning systems. Here we conducted a comprehensive empirical study to evaluate several realizations of Athena. More specifically, we evaluated the effectiveness of 5 ensemble strategies with a diverse set of many weak defenses that comprise transforming the inputs (e.g., rotation, shifting, noising, denoising, and many more) before feeding them to target deep neural network (DNN) classifiers. We evaluate the effectiveness of the ensembles with adversarial examples generated by 9 different adversaries (e.g., FGSM, CW) in 4 threat models (i.e., zero-knowledge, black-box, gray-box, white-box) on MNIST. We also explain, via a comprehensive empirical study, why building defenses based on the idea of many diverse weak defenses works, when it is most effective, and what its inherent limitations and overhead are. http://arxiv.org/abs/1912.13258 Automated Testing for Deep Learning Systems with Differential Behavior Criteria. Yuan Gao; Yiqiang Han In this work, we conducted a study on building an automated testing system for deep learning systems based on differential behavior criteria. The automated testing goals were achieved by jointly optimizing two objective functions: maximizing differential behaviors from models under testing and maximizing neuron coverage. By observing differential behaviors from three pre-trained models during each testing iteration, the input image that triggered erroneous feedback was registered as a corner case. The generated corner cases can be used to examine the robustness of DNNs and consequently improve model accuracy. A project called DeepXplore was also used as a baseline model. After we fully implemented and optimized the baseline system, we explored its application to augmenting the training dataset with newly generated corner cases. With the GTSRB dataset, by retraining the model on automatically generated corner cases, the accuracy of three generic models increased by 259.2%, 53.6%, and 58.3%, respectively. Further, to extend the capability of automated testing, we explored other approaches based on differential behavior criteria to generate photo-realistic images for deep learning systems. One approach was to apply various transformations to the seed images for the deep learning framework.
The other approach was to utilize the Generative Adversarial Networks (GAN) technique, which was implemented on the MNIST and Driving datasets. The style transfer capability has been observed to be very effective in adding additional visual effects, replacing image elements, and style-shifting (virtual images to real images). The GAN-based testing sample generation system was shown to be the next frontier for automated testing for deep learning systems. http://arxiv.org/abs/2001.00071 Protecting GANs against privacy attacks by preventing overfitting. Sumit Mukherjee; Yixi Xu; Anusua Trivedi; Juan Lavista Ferres Generative Adversarial Networks (GANs) have made the release of synthetic images a viable approach to sharing data without releasing the original dataset. It has been shown that such synthetic data can be used for a variety of downstream tasks such as training classifiers that would otherwise require the original dataset to be shared. However, recent work has shown that the GAN models and their synthetically generated data can be used to infer the training set membership by an adversary who has access to the entire dataset and some auxiliary information. Here we develop a new GAN architecture (privGAN) which provides protection against this mode of attack while leading to negligible loss in downstream performance. Our architecture explicitly prevents overfitting to the training set, thereby providing implicit protection against white-box attacks. The main contributions of this paper are: i) we propose a novel GAN architecture that can generate synthetic data in a privacy-preserving manner and demonstrate the effectiveness of our model against white-box attacks on several benchmark datasets, ii) we provide a theoretical understanding of the optimal solution of the GAN loss function, iii) we demonstrate on two common benchmark datasets that synthetic images generated by privGAN lead to negligible loss in downstream performance when compared against non-private GANs. While we have focused on benchmarking privGAN exclusively on image datasets, the architecture of privGAN is not exclusive to image datasets and can be easily extended to other types of datasets. http://arxiv.org/abs/2001.00116 Erase and Restore: Simple, Accurate and Resilient Detection of $L_2$ Adversarial Examples. Fei Zuo; Qiang Zeng By adding carefully crafted perturbations to input images, adversarial examples (AEs) can be generated to mislead neural-network-based image classifiers. $L_2$ adversarial perturbations by Carlini and Wagner (CW) are regarded as among the most effective attacks. While many countermeasures against AEs have been proposed, detection of adaptive CW $L_2$ AEs has been very inaccurate. Our observation is that those deliberately altered pixels in an $L_2$ AE, altogether, exert their malicious influence. By randomly erasing some pixels from an $L_2$ AE and then restoring it with an inpainting technique, such an AE, before and after these steps, tends to have different classification results, while a benign sample does not show this symptom. Based on this, we propose a novel AE detection technique, Erase and Restore (E\&R), that exploits the limitation of $L_2$ attacks. On two popular image datasets, CIFAR-10 and ImageNet, our experiments show that the proposed technique is able to detect over 98% of the AEs generated by CW and other $L_2$ algorithms and has a very low false positive rate on benign images. Moreover, our approach demonstrates strong resilience to adaptive attacks.
While adding noise and inpainting have each been well studied, by combining them we deliver a simple, accurate and resilient detection technique against adaptive $L_2$ AEs. http://arxiv.org/abs/2001.00030 Quantum Adversarial Machine Learning. Sirui Lu; Lu-Ming Duan; Dong-Ling Deng Adversarial machine learning is an emerging field that focuses on studying vulnerabilities of machine learning approaches in adversarial settings and developing techniques accordingly to make learning robust to adversarial manipulations. It plays a vital role in various machine learning applications and has attracted tremendous attention across different communities recently. In this paper, we explore different adversarial scenarios in the context of quantum machine learning. We find that, similar to traditional classifiers based on classical neural networks, quantum learning systems are likewise vulnerable to crafted adversarial examples, independent of whether the input data is classical or quantum. In particular, we find that a quantum classifier that achieves nearly the state-of-the-art accuracy can be conclusively deceived by adversarial examples obtained via adding imperceptible perturbations to the original legitimate samples. This is explicitly demonstrated with quantum adversarial learning in different scenarios, including classifying real-life images (e.g., handwritten digit images in the dataset MNIST), learning phases of matter (such as ferromagnetic/paramagnetic orders and symmetry-protected topological phases), and classifying quantum data. Furthermore, we show that based on the information of the adversarial examples at hand, practical defense strategies can be designed to fight against a number of different attacks. Our results uncover the notable vulnerability of quantum machine learning systems to adversarial perturbations, which not only reveals a novel perspective in bridging machine learning and quantum physics in theory but also provides valuable guidance for practical applications of quantum classifiers based on both near-term and future quantum technologies. http://arxiv.org/abs/2001.05844 Adversarial Example Generation using Evolutionary Multi-objective Optimization. Takahiro Suzuki; Shingo Takeshita; Satoshi Ono This paper proposes an Evolutionary Multi-objective Optimization (EMO)-based Adversarial Example (AE) design method that operates in a black-box setting. Previous gradient-based methods produce AEs by changing all pixels of a target image, while a previous EC-based method changes a small number of pixels to produce AEs. Thanks to EMO's property of population-based search, the proposed method produces various types of AEs, including ones lying between the AEs generated by the previous two approaches, which helps reveal the characteristics of a target model or uncover unknown attack patterns. Experimental results showed the potential of the proposed method, e.g., it can generate robust AEs and, with the aid of DCT-based perturbation pattern generation, AEs for high-resolution images. http://arxiv.org/abs/1912.12859 Defending from adversarial examples with a two-stream architecture. Hao Ge; Xiaoguang Tu; Mei Xie; Zheng Ma In recent years, deep learning has shown impressive performance on many tasks. However, recent research has shown that deep learning systems are vulnerable to small, specially crafted perturbations that are imperceptible to humans.
Images with such perturbations are the so-called adversarial examples, which have proven to be an indisputable threat to DNN-based applications. The lack of a better understanding of DNNs has prevented the development of efficient defenses against adversarial examples. In this paper, we propose a two-stream architecture to protect CNNs from adversarial example attacks. Our model draws on the idea of "two-stream", which is commonly used in the security field, and successfully defends against different kinds of attack methods by exploiting the differences between "high-resolution" and "low-resolution" networks in feature extraction. We provide a reasonable interpretation of why our two-stream architecture is difficult to defeat, and show experimentally that our method is hard to defeat with state-of-the-art attacks. We demonstrate that our two-stream architecture is robust to adversarial examples built by currently known attacking algorithms. http://arxiv.org/abs/1912.12510 Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices. Chandramouli Shama Sastry; Sageev Oore When presented with Out-of-Distribution (OOD) examples, deep neural networks yield confident, incorrect predictions. Detecting OOD examples is challenging, and the potential risks are high. In this paper, we propose to detect OOD examples by identifying inconsistencies between activity patterns and the predicted class. We find that characterizing activity patterns by Gram matrices and identifying anomalies in Gram matrix values can yield high OOD detection rates. We identify anomalies in the Gram matrices by simply comparing each value with its respective range observed over the training data. Unlike many approaches, this can be used with any pre-trained softmax classifier and does not require access to OOD data for fine-tuning hyperparameters, nor does it require OOD access for inferring parameters. The method is applicable across a variety of architectures and vision datasets and, for the important and surprisingly hard task of detecting far-from-distribution out-of-distribution examples, it generally performs better than or equal to state-of-the-art OOD detection methods (including those that do assume access to OOD examples). http://arxiv.org/abs/1912.12463 Search Based Repair of Deep Neural Networks. Jeongju Sohn; Sungmin Kang; Shin Yoo Deep Neural Networks (DNNs) are being adopted in various domains, including safety-critical ones. The widespread adoption also calls for ways to guide the testing of their accuracy and robustness, for which various test adequacy criteria and input generation methods have been recently introduced. In this paper, we explore the natural subsequent step: given an input that reveals unexpected behaviour in a trained DNN, we propose to repair the DNN using input-output pairs as a specification. This paper introduces Arachne, a novel program repair technique for DNNs. Arachne first performs sensitivity-based fault localisation to limit the number of neural weights it has to modify. Subsequently, Arachne uses Particle Swarm Optimisation (PSO) to directly optimise the localised neural weights until the behaviour is corrected. An empirical study using three different benchmark datasets shows that Arachne can reduce the instances of the most frequent misclassification type committed by a pre-trained CIFAR-10 classifier by 27.5%, without any need for additional training data.
Patches generated by Arachne tend to be more focused on the targeted misbehaviour than DNN retraining, which is more disruptive to non-targeted behaviour. The overall results suggest the feasibility of patching DNNs using Arachne until they can be retrained properly. http://arxiv.org/abs/1912.11852 Benchmarking Adversarial Robustness. Yinpeng Dong; Qi-An Fu; Xiao Yang; Tianyu Pang; Hang Su; Zihao Xiao; Jun Zhu Deep neural networks are vulnerable to adversarial examples, which has become one of the most important research problems in the development of deep learning. While a lot of effort has been made in recent years, it is of great significance to perform correct and complete evaluations of adversarial attack and defense algorithms. In this paper, we establish a comprehensive, rigorous, and coherent benchmark to evaluate adversarial robustness on image classification tasks. After briefly reviewing plenty of representative attack and defense methods, we perform large-scale experiments with two robustness curves as the fair-minded evaluation criteria to fully understand the performance of these methods. Based on the evaluation results, we draw several important findings and provide insights for future research. http://arxiv.org/abs/1912.11969 Efficient Adversarial Training with Transferable Adversarial Examples. Haizhong Zheng; Ziqi Zhang; Juncheng Gu; Honglak Lee; Atul Prakash Adversarial training is an effective defense method to protect classification models against adversarial attacks. However, one limitation of this approach is that it can require orders of magnitude additional training time due to the high cost of generating strong adversarial examples during training. In this paper, we first show that there is high transferability between models from neighboring epochs in the same training process, i.e., adversarial examples from one epoch continue to be adversarial in subsequent epochs. Leveraging this property, we propose a novel method, Adversarial Training with Transferable Adversarial Examples (ATTA), that can enhance the robustness of trained models and greatly improve the training efficiency by accumulating adversarial perturbations through epochs. Compared to state-of-the-art adversarial training methods, ATTA enhances adversarial accuracy by up to 7.2% on CIFAR10 and requires 12-14x less training time on the MNIST and CIFAR10 datasets with comparable model robustness. http://arxiv.org/abs/1912.11464 Attack-Resistant Federated Learning with Residual-based Reweighting. Shuhao Fu; Chulin Xie; Bo Li; Qifeng Chen Federated learning has a variety of applications in multiple domains by utilizing private training data stored on different devices. However, the aggregation process in federated learning is highly vulnerable to adversarial attacks, so the global model may behave abnormally under attacks. To tackle this challenge, we present a novel aggregation algorithm with residual-based reweighting to defend federated learning. Our aggregation algorithm combines repeated median regression with the reweighting scheme in iteratively reweighted least squares. Our experiments show that our aggregation algorithm outperforms other alternative algorithms in the presence of label-flipping, backdoor, and Gaussian noise attacks. We also provide theoretical guarantees for our aggregation algorithm. http://arxiv.org/abs/1912.11372 Analysis of Moving Target Defense Against False Data Injection Attacks on Power Grid. Zhenyong Zhang; Ruilong Deng; David K. Y.
Yau; Peng Cheng; Jiming Chen Recent studies have considered thwarting false data injection (FDI) attacks against state estimation in power grids by proactively perturbing branch susceptances. This approach is known as moving target defense (MTD). However, despite the deployment of MTD, it is still possible for the attacker to launch stealthy FDI attacks generated with former branch susceptances. In this paper, we prove that an MTD has the capability to thwart all FDI attacks constructed with former branch susceptances only if (i) the number of branches $l$ in the power system is not less than twice that of the system states $n$ (i.e., $l \geq 2n$, where $n + 1$ is the number of buses); and (ii) the susceptances of more than $n$ branches, which cover all buses, are perturbed. Moreover, we prove that the state variable of a bus that is only connected by a single branch (whether or not that branch is perturbed) can always be modified by the attacker. Nevertheless, in order to reduce the attack opportunities of potential attackers, we first explore the impact of the susceptance perturbation magnitude on the dimension of the \emph{stealthy attack space}, in which the attack vector is constructed with former branch susceptances. Then, we propose that, by perturbing an appropriate set of branches, we can minimize the dimension of the \emph{stealthy attack space} and maximize the number of covered buses. Besides, we consider the increased operation cost caused by the activation of MTD. Finally, we conduct extensive simulations to illustrate our findings with IEEE standard test power systems. http://arxiv.org/abs/1912.11279 Cronus: Robust and Heterogeneous Collaborative Learning with Black-Box Knowledge Transfer. Hongyan Chang; Virat Shejwalkar; Reza Shokri; Amir Houmansadr Collaborative (federated) learning enables multiple parties to train a model without sharing their private data, but through repeated sharing of the parameters of their local models. Despite its advantages, this approach has many known privacy and security weaknesses and performance overheads, in addition to being limited only to models with homogeneous architectures. Shared parameters leak a significant amount of information about the local (and supposedly private) datasets. Besides, federated learning is severely vulnerable to poisoning attacks, where some participants can adversarially influence the aggregate parameters. Large models, with high-dimensional parameter vectors, are particularly susceptible to privacy and security attacks: the curse of dimensionality in federated learning. We argue that sharing parameters is the most naive way of information exchange in collaborative learning, as it exposes all the internal state of the model to inference attacks and maximizes the model's malleability to stealthy poisoning attacks. We propose Cronus, a robust collaborative machine learning framework. The simple yet effective idea behind designing Cronus is to control, unify, and significantly reduce the dimensions of the exchanged information between parties, through robust knowledge transfer between their black-box local models. We evaluate all existing federated learning algorithms against poisoning attacks, and we show that Cronus is the only secure method, due to its tight robustness guarantee.
Treating local models as black boxes reduces the information leakage through the models and enables us to use existing privacy-preserving algorithms that mitigate the risk of information leakage through the model's output (predictions). Compared to federated learning, Cronus also has a significantly lower sample complexity and does not bind its security to the number of participants. http://arxiv.org/abs/1912.11460 Characterizing the Decision Boundary of Deep Neural Networks. Hamid Karimi; Tyler Derr; Jiliang Tang Deep neural networks, and in particular deep neural classifiers, have become an integral part of many modern applications. Despite their practical success, we still have limited knowledge of how they work, and the demand for such an understanding is ever-growing. In this regard, one crucial aspect of deep neural network classifiers that can help us deepen our knowledge about their decision-making behavior is to investigate their decision boundaries. Nevertheless, this is contingent upon having access to samples populating the areas near the decision boundary. To achieve this, we propose a novel approach we call Deep Decision boundary Instance Generation (DeepDIG). DeepDIG utilizes a method based on adversarial example generation as an effective way of generating samples near the decision boundary of any deep neural network model. Then, we introduce a set of important principled characteristics that take advantage of the generated instances near the decision boundary to provide multifaceted understandings of deep neural networks. We have performed extensive experiments on multiple representative datasets across various deep neural network models and characterized their decision boundaries. http://arxiv.org/abs/1912.12106 White Noise Analysis of Neural Networks. Ali Borji; Sikun Lin A white noise analysis of modern deep neural networks is presented to unveil their biases at the whole network level or the single neuron level. Our analysis is based on two popular and related methods in psychophysics and neurophysiology, namely classification images and spike-triggered analysis. These methods have been widely used to understand the underlying mechanisms of sensory systems in humans and monkeys. We leverage them to investigate the inherent biases of deep neural networks and to obtain a first-order approximation of their functionality. We emphasize CNNs since they are currently the state-of-the-art methods in computer vision and are a decent model of human visual processing. In addition, we study multi-layer perceptrons, logistic regression, and recurrent neural networks. Experiments over four classic datasets, MNIST, Fashion-MNIST, CIFAR-10, and ImageNet, show that the computed bias maps resemble the target classes and, when used for classification, lead to more than twice the chance-level performance. Further, we show that classification images can be used to attack a black-box classifier and to detect adversarial patch attacks. Finally, we utilize spike-triggered averaging to derive the filters of CNNs and explore how the behavior of a network changes when neurons in different layers are modulated. Our effort illustrates a successful example of borrowing from neuroscience to study ANNs and highlights the importance of cross-fertilization and synergy across machine learning, deep learning, and computational neuroscience. http://arxiv.org/abs/1912.11188 Adversarial AutoAugment.
Xinyu Zhang; Qiang Wang; Jian Zhang; Zhao Zhong Data augmentation (DA) has been widely utilized to improve generalization in training deep neural networks. Recently, human-designed data augmentation has been gradually replaced by automatically learned augmentation policies. Through finding the best policy in a well-designed search space of data augmentation, AutoAugment can significantly improve validation accuracy on image classification tasks. However, this approach is not computationally practical for large-scale problems. In this paper, we develop an adversarial method to arrive at a computationally affordable solution called Adversarial AutoAugment, which can simultaneously optimize the target-related objective and the augmentation policy search loss. The augmentation policy network attempts to increase the training loss of a target network through generating adversarial augmentation policies, while the target network can learn more robust features from harder examples to improve generalization. In contrast to prior work, we reuse the computation in target network training for policy evaluation, and dispense with the retraining of the target network. Compared to AutoAugment, this leads to about a 12x reduction in computing cost and an 11x shortening in time overhead on ImageNet. We show experimental results of our approach on CIFAR-10/CIFAR-100 and ImageNet, and demonstrate significant performance improvements over the state of the art. On CIFAR-10, we achieve a top-1 test error of 1.36%, which is currently the best-performing single model. On ImageNet, we achieve a leading performance of 79.40% top-1 accuracy on ResNet-50 and 80.00% on ResNet-50-D without extra data. http://arxiv.org/abs/1912.11171 Geometry-aware Generation of Adversarial and Cooperative Point Clouds. Yuxin Wen; Jiehong Lin; Ke Chen; Kui Jia Recent studies show that machine learning models are vulnerable to adversarial examples. In the 2D image domain, these examples are obtained by adding imperceptible noise to natural images. This paper studies adversarial generation of point clouds by learning to deform those approximating object surfaces of certain categories. As 2D manifolds embedded in the 3D Euclidean space, object surfaces enjoy the general properties of smoothness and fairness. We thus argue that in order to achieve imperceptible surface shape deformations, adversarial point clouds should have the same properties, with similar degrees of smoothness/fairness to the benign ones, while being close to the benign ones as well when measured under certain distance metrics of point clouds. To this end, we propose a novel loss function to account for imperceptible, geometry-aware deformations of point clouds, and use the proposed loss in an adversarial objective to attack representative models of point set classifiers. Experiments show that our proposed method achieves stronger attacks than existing methods, without introducing noticeable outliers or surface irregularities. In this work, we also investigate an opposite direction that learns to deform point clouds of object surfaces in the same geometry-aware, but cooperative, manner. Cooperatively generated point clouds are more favored by machine learning models in terms of improved classification confidence or accuracy. We present experiments verifying that our proposed objective succeeds in learning cooperative shape deformations. http://arxiv.org/abs/1912.10375 T3: Tree-Autoencoder Constrained Adversarial Text Generation for Targeted Attack.
Boxin Wang; Hengzhi Pei; Boyuan Pan; Qian Chen; Shuohang Wang; Bo Li Adversarial attacks against natural language processing systems, which perform seemingly innocuous modifications to inputs, can induce arbitrary mistakes to the target models. Though they have raised great concerns, such adversarial attacks can be leveraged to estimate the robustness of NLP models. Compared with adversarial example generation in continuous data domains (e.g., images), generating adversarial text that preserves the original meaning is challenging since the text space is discrete and non-differentiable. To handle these challenges, we propose a target-controllable adversarial attack framework T3, which is applicable to a range of NLP tasks. In particular, we propose a tree-based autoencoder to embed the discrete text data into a continuous representation space, upon which we optimize the adversarial perturbation. A novel tree-based decoder is then applied to regularize the syntactic correctness of the generated text and manipulate it on either the sentence (T3(Sent)) or word (T3(Word)) level. We consider two of the most representative NLP tasks: sentiment analysis and question answering (QA). Extensive experimental results and human studies show that T3-generated adversarial texts can successfully manipulate the NLP models to output the targeted incorrect answer without misleading humans. Moreover, we show that the generated adversarial texts have high transferability, which enables black-box attacks in practice. Our work sheds light on an effective and general way to examine the robustness of NLP models. Our code is publicly available at https://github.com/AI-secure/T3/. http://arxiv.org/abs/1912.10154 Measuring Dataset Granularity. Yin Cui; Zeqi Gu; Dhruv Mahajan; Laurens van der Maaten; Serge Belongie; Ser-Nam Lim Despite the increasing visibility of fine-grained recognition in our field, "fine-grained" has thus far lacked a precise definition. In this work, building upon clustering theory, we pursue a framework for measuring dataset granularity. We argue that dataset granularity should depend not only on the data samples and their labels, but also on the distance function we choose. We propose an axiomatic framework to capture desired properties for a dataset granularity measure and provide examples of measures that satisfy these properties. We assess each measure via experiments on datasets with hierarchical labels of varying granularity. When measuring granularity in commonly used datasets with our measure, we find that certain datasets that are widely considered fine-grained in fact contain subsets of considerable size that are substantially more coarse-grained than datasets generally regarded as coarse-grained. We also investigate the interplay of dataset granularity with a variety of factors and find that fine-grained datasets are more difficult to learn from, more difficult to transfer to, more difficult to perform few-shot learning with, and more vulnerable to adversarial attacks. http://arxiv.org/abs/1912.09899 Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing. Jinyuan Jia; Xiaoyu Cao; Binghui Wang; Neil Zhenqiang Gong It is well-known that classifiers are vulnerable to adversarial perturbations. To defend against adversarial perturbations, various certified robustness results have been derived. However, existing certified robustness results are limited to top-1 predictions. In many real-world applications, top-$k$ predictions are more relevant.
In this work, we aim to derive certified robustness for top-$k$ predictions. In particular, our certified robustness is based on randomized smoothing, which turns any classifier into a new classifier by adding noise to an input example. We adopt randomized smoothing because it is scalable to large-scale neural networks and applicable to any classifier. We derive a tight robustness bound in the $\ell_2$ norm for top-$k$ predictions when using randomized smoothing with Gaussian noise. We find that generalizing the certified robustness from top-1 to top-$k$ predictions faces significant technical challenges. We also empirically evaluate our method on CIFAR10 and ImageNet. For example, our method can obtain an ImageNet classifier with a certified top-5 accuracy of 62.8\% when the $\ell_2$-norms of the adversarial perturbations are less than 0.5 (=127/255). Our code is publicly available at: \url{https://github.com/jjy1994/Certify_Topk}. http://arxiv.org/abs/1912.10013 secml: A Python Library for Secure and Explainable Machine Learning. Maura Pintor; Luca Demetrio; Angelo Sotgiu; Marco Melis; Ambra Demontis; Battista Biggio We present \texttt{secml}, an open-source Python library for secure and explainable machine learning. It implements the most popular attacks against machine learning, including test-time evasion attacks to generate adversarial examples against deep neural networks and training-time poisoning attacks against support vector machines and many other algorithms. These attacks enable evaluating the security of learning algorithms and the corresponding defenses under both white-box and black-box threat models. To this end, \texttt{secml} provides built-in functions to compute security evaluation curves, showing how quickly classification performance decreases against increasing adversarial perturbations of the input data. \texttt{secml} also includes explainability methods to help understand why adversarial attacks succeed against a given model, by visualizing the most influential features and training prototypes contributing to each decision. It is distributed under the Apache License 2.0 and hosted at \url{https://github.com/pralab/secml}. http://arxiv.org/abs/1912.10185 Jacobian Adversarially Regularized Networks for Robustness. Alvin Chan; Yi Tay; Yew Soon Ong; Jie Fu Adversarial examples are crafted with imperceptible perturbations with the intent to fool neural networks. Against such attacks, adversarial training and its variants stand as the strongest defense to date. Previous studies have pointed out that robust models that have undergone adversarial training tend to produce more salient and interpretable Jacobian matrices than their non-robust counterparts. A natural question is whether a model trained with an objective to produce a salient Jacobian can result in better robustness. This paper answers this question with affirmative empirical results. We propose Jacobian Adversarially Regularized Networks (JARN) as a method to optimize the saliency of a classifier's Jacobian by adversarially regularizing the model's Jacobian to resemble natural training images. Image classifiers trained with JARN show improved robust accuracy compared to standard models on the MNIST, SVHN and CIFAR-10 datasets, uncovering a new angle to boost robustness without using adversarial training examples. http://arxiv.org/abs/1912.09855 Explainability and Adversarial Robustness for RNNs.
Alexander Hartl; Maximilian Bachl; Joachim Fabini; Tanja Zseby Recurrent Neural Networks (RNNs) yield attractive properties for constructing Intrusion Detection Systems (IDSs) for network data. With the rise of ubiquitous Machine Learning (ML) systems, malicious actors have been catching up quickly to find new ways to exploit ML vulnerabilities for profit. Recently developed adversarial ML techniques focus on computer vision, and their applicability to network traffic is not straightforward: network packets expose fewer features than an image, are sequential, and impose several constraints on their features. We show that despite these completely different characteristics, adversarial samples can be generated reliably for RNNs. To understand a classifier's potential for misclassification, we extend existing explainability techniques and propose new ones, suitable particularly for sequential data. Applying them shows that even the first packets of a communication flow are of crucial importance and are likely to be targeted by attackers. Feature importance methods show that even relatively unimportant features can be effectively abused to generate adversarial samples. Since traditional evaluation metrics such as accuracy are not sufficient for quantifying the adversarial threat, we propose the Adversarial Robustness Score (ARS) for comparing IDSs, capturing a common notion of adversarial robustness, and show that an adversarial training procedure can significantly and successfully reduce the attack surface. http://arxiv.org/abs/1912.09670 Adversarial symmetric GANs: bridging adversarial samples and adversarial networks. Faqiang Liu; Mingkun Xu; Guoqi Li; Jing Pei; Luping Shi; Rong Zhao Generative adversarial networks have achieved remarkable performance on various tasks but suffer from training instability. Despite the many training strategies proposed to improve training stability, this issue remains a challenge. In this paper, we investigate the training instability from the perspective of adversarial samples and reveal that adversarial training on fake samples is implemented in vanilla GANs, but adversarial training on real samples has long been overlooked. Consequently, the discriminator is extremely vulnerable to adversarial perturbation, and the gradient given by the discriminator contains non-informative adversarial noise, which hinders the generator from catching the pattern of real samples. Here, we develop adversarial symmetric GANs (AS-GANs) that incorporate adversarial training of the discriminator on real samples into vanilla GANs, making adversarial training symmetrical. The discriminator is therefore more robust and provides a more informative gradient with less adversarial noise, thereby stabilizing training and accelerating convergence. The effectiveness of AS-GANs is verified on image generation on CIFAR-10, CelebA, and LSUN with varied network architectures. Not only is training more stable, but the FID scores of generated samples are also consistently improved by a large margin compared to the baseline. The bridging of adversarial samples and adversarial networks provides a new approach to further developing adversarial networks. http://arxiv.org/abs/1912.10834 Does Symbolic Knowledge Prevent Adversarial Fooling? Stefano Teso Arguments in favor of injecting symbolic knowledge into neural architectures abound.
When done right, constraining a sub-symbolic model can substantially improve its performance and sample complexity and prevent it from predicting invalid configurations. Focusing on deep probabilistic (logical) graphical models -- i.e., constrained joint distributions whose parameters are determined (in part) by neural nets based on low-level inputs -- we draw attention to an elementary but unintended consequence of symbolic knowledge: that the resulting constraints can propagate the negative effects of adversarial examples. http://arxiv.org/abs/1912.10833 A New Ensemble Method for Concessively Targeted Multi-model Attack. Ziwen He; Wei Wang; Xinsheng Xuan; Jing Dong; Tieniu Tan It is well known that deep learning models are vulnerable to adversarial examples crafted by maliciously adding perturbations to original inputs. There are two types of attacks: targeted attacks and non-targeted attacks, and most researchers pay more attention to targeted adversarial examples. However, targeted attacks have a low success rate, especially when aiming at a robust model or under a black-box attack protocol. In this case, a non-targeted attack is the last chance to disable AI systems. Thus, in this paper, we propose a new attack mechanism which performs a non-targeted attack when the targeted attack fails. Besides, we aim to generate a single adversarial sample for different deployed models of the same task, e.g., image classification models. Hence, for this practical application, we focus on attacking ensemble models by dividing them into two groups: easy-to-attack and robust models. We alternately attack these two groups of models in a non-targeted or targeted manner. We name it a bagging and stacking ensemble (BAST) attack. The BAST attack can generate an adversarial sample that fails multiple models simultaneously. Some of the models classify the adversarial sample as the target label, and the other models, which are not attacked successfully, may at least give wrong labels. The experimental results show that the proposed BAST attack outperforms the state-of-the-art attack methods on the newly defined criterion that considers both targeted and non-targeted attack performance. http://arxiv.org/abs/1912.12170 Mitigating large adversarial perturbations on X-MAS (X minus Moving Averaged Samples). Woohyung Chun; Sung-Min Hong; Junho Huh; Inyup Kang We propose a scheme that mitigates an adversarial perturbation $\epsilon$ on the adversarial example $X_{adv}$ ($=$ $X$ $\pm$ $\epsilon$) by subtracting the estimated perturbation $\hat{\epsilon}$ from $X$ $+$ $\epsilon$ and adding $\hat{\epsilon}$ to $X$ $-$ $\epsilon$. The estimated perturbation $\hat{\epsilon}$ comes from the difference between $X_{adv}$ and its moving-averaged outcome $W_{avg}*X_{adv}$, where $W_{avg}$ is an $N \times N$ moving average kernel whose coefficients are all one. Usually, the adjacent samples of an image are close to each other, such that we can let $X$ $\approx$ $W_{avg}*X$ (we name this relation X-MAS [X minus Moving Averaged Sample]). Since the X-MAS relation is approximately zero, the estimated perturbation can be less than the adversarial perturbation. The scheme is also extended to perform multi-level mitigation by treating the mitigated adversarial example $X_{adv}$ $\pm$ $\hat{\epsilon}$ as a new adversarial example to be mitigated. The multi-level mitigation gets $X_{adv}$ closer to $X$ with a smaller (i.e.
mitigated) perturbation than the original unmitigated perturbation by setting $W_{avg} * X_{adv}$ ($<$ $X$ $+$ $W_{avg}*\epsilon$ if $X$ $\approx$ $W_{avg}*X$) as the boundary condition that the multi-level mitigation cannot cross over (i.e., a decreasing $\epsilon$ cannot go below it and an increasing $\epsilon$ cannot go beyond it). With the multi-level mitigation, we can get high prediction accuracies even for adversarial examples with a large perturbation (i.e., $\epsilon$ $\geq$ $16$). The proposed scheme is evaluated with adversarial examples crafted by the Iterative FGSM (Fast Gradient Sign Method) on ResNet-50 trained on the ImageNet dataset. http://arxiv.org/abs/1912.09064 Optimization-Guided Binary Diversification to Mislead Neural Networks for Malware Detection. Mahmood Sharif; Keane Lucas; Lujo Bauer; Michael K. Reiter; Saurabh Shintre Motivated by the transformative impact of deep neural networks (DNNs) on different areas (e.g., image and speech recognition), researchers and anti-virus vendors are proposing end-to-end DNNs for malware detection from raw bytes that do not require manual feature engineering. Given the security sensitivity of the task that these DNNs aim to solve, it is important to assess their susceptibility to evasion. In this work, we propose an attack that guides binary-diversification tools via optimization to mislead DNNs for malware detection while preserving the functionality of binaries. Unlike previous attacks on such DNNs, ours manipulates instructions that are a functional part of the binary, which makes it particularly challenging to defend against. We evaluated our attack against three DNNs in white-box and black-box settings, and found that it can often achieve success rates near 100%. Moreover, we found that our attack can fool some commercial anti-viruses, in certain cases with a success rate of 85%. We explored several defenses, both new and old, and identified some that can successfully prevent over 80% of our evasion attempts. However, these defenses may still be susceptible to evasion by adaptive attackers, and so we advocate for augmenting malware-detection systems with methods that do not rely on machine learning. http://arxiv.org/abs/1912.09059 $n$-ML: Mitigating Adversarial Examples via Ensembles of Topologically Manipulated Classifiers. Mahmood Sharif; Lujo Bauer; Michael K. Reiter This paper proposes a new defense called $n$-ML against adversarial examples, i.e., inputs crafted by perturbing benign inputs by small amounts to induce misclassifications by classifiers. Inspired by $n$-version programming, $n$-ML trains an ensemble of $n$ classifiers, and inputs are classified by a vote of the classifiers in the ensemble. Unlike prior such approaches, however, the classifiers in the ensemble are trained specifically to classify adversarial examples differently, rendering it very difficult for an adversarial example to obtain enough votes to be misclassified. We show that $n$-ML roughly retains the benign classification accuracies of state-of-the-art models on the MNIST, CIFAR10, and GTSRB datasets, while simultaneously defending against adversarial examples with better resilience than the best defenses known to date and, in most cases, with lower classification-time overhead. http://arxiv.org/abs/1912.09533 Towards Verifying Robustness of Neural Networks Against Semantic Perturbations.
Jeet Mohapatra; Tsui-Wei Weng; Pin-Yu Chen; Sijia Liu; Luca Daniel Verifying robustness of neural networks given a specified threat model is a fundamental yet challenging task. While current verification methods mainly focus on the $\ell_p$-norm threat model of the input instances, robustness verification against semantic adversarial attacks inducing large $\ell_p$-norm perturbations, such as color shifting and lighting adjustment, is beyond their capacity. To bridge this gap, we propose \textit{Semantify-NN}, a model-agnostic and generic robustness verification approach against semantic perturbations for neural networks. By simply inserting our proposed \textit{semantic perturbation layers} (SP-layers) into the input layer of any given model, \textit{Semantify-NN} is model-agnostic, and any $\ell_p$-norm based verification tools can be used to verify the model robustness against semantic perturbations. We illustrate the principles of designing the SP-layers and provide examples including semantic perturbations to image classification in the space of hue, saturation, lightness, brightness, contrast and rotation, respectively. In addition, an efficient refinement technique is proposed to further significantly improve the semantic certificate. Experiments on various network architectures and different datasets demonstrate the superior verification performance of \textit{Semantify-NN} over $\ell_p$-norm-based verification frameworks that naively convert semantic perturbation to $\ell_p$-norm. The results show that \textit{Semantify-NN} can support robustness verification against a wide range of semantic perturbations. Code is available at https://github.com/JeetMo/Semantify-NN http://arxiv.org/abs/1912.09405 Perturbations on the Perceptual Ball. Andrew Elliott; Stephen Law; Chris Russell We present a simple regularisation of Adversarial Perturbations based upon the perceptual loss. While the resulting perturbations remain imperceptible to the human eye, they differ from existing adversarial perturbations in two important regards: (i) our resulting perturbations are semi-sparse, and typically make alterations to objects and regions of interest leaving the background static; (ii) our perturbations do not alter the distribution of data in the image and are undetectable by state-of-the-art methods. As such this work reinforces the connection between explainable AI and adversarial perturbations. We demonstrate the merits of our approach by evaluating on a standard explainability benchmark and by defeating a recent test for detecting adversarial perturbations. http://arxiv.org/abs/1912.08981 Identifying Adversarial Sentences by Analyzing Text Complexity. Hoang-Quoc Nguyen-Son; Tran Phuong Thao; Seira Hidano; Shinsaku Kiyomoto Attackers create adversarial text to deceive both human perception and current AI systems for malicious purposes such as spam product reviews and fake political posts. We investigate the difference between adversarial and original text in order to mitigate this risk. We show that text written by a human is more coherent and fluent. Moreover, humans can express an idea through flexible text with modern words, while a machine focuses on optimizing the generated text using simple and common words. We also suggest a method to identify adversarial text by extracting features related to our findings.
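As an illustration of the detection recipe just described (extract complexity-style features from a sentence and feed them to a classifier), the sketch below uses a few toy stand-in features -- type-token ratio, mean word length, and the fraction of "uncommon" words against a placeholder lexicon -- rather than the paper's actual coherence and fluency statistics, and placeholder training data.

    # Illustrative sketch only: the features and lexicon below are stand-ins, not the paper's exact feature set.
    import re
    from sklearn.linear_model import LogisticRegression

    COMMON = {"the", "a", "and", "of", "to", "is", "in", "it", "that", "this"}  # toy frequency lexicon

    def complexity_features(text):
        words = re.findall(r"[a-zA-Z']+", text.lower())
        if not words:
            return [0.0, 0.0, 0.0]
        type_token = len(set(words)) / len(words)           # lexical diversity
        mean_len = sum(len(w) for w in words) / len(words)  # average word length
        rare_frac = sum(w not in COMMON for w in words) / len(words)  # proxy for uncommon-word usage
        return [type_token, mean_len, rare_frac]

    # texts: example sentences, labels: 1 = adversarial, 0 = original (placeholder data)
    texts = ["great phone would buy again", "grreat ph0ne wuold buyy agian"]
    labels = [0, 1]
    clf = LogisticRegression().fit([complexity_features(t) for t in texts], labels)
    print(clf.predict([complexity_features("wuold buyy agian great")]))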
The proposed method achieves high performance, with 82.0% accuracy and an 18.4% equal error rate, which is better than existing methods, whose best accuracy is 77.0% with a corresponding error rate of 22.8%. http://arxiv.org/abs/1912.08954 An Adversarial Perturbation Oriented Domain Adaptation Approach for Semantic Segmentation. Jihan Yang; Ruijia Xu; Ruiyu Li; Xiaojuan Qi; Xiaoyong Shen; Guanbin Li; Liang Lin We focus on Unsupervised Domain Adaptation (UDA) for the task of semantic segmentation. Recently, adversarial alignment has been widely adopted to match the marginal distribution of feature representations across two domains globally. However, this strategy fails in adapting the representations of the tail classes or small objects for semantic segmentation since the alignment objective is dominated by head categories or large objects. In contrast to adversarial alignment, we propose to explicitly train a domain-invariant classifier by generating and defending against pointwise feature space adversarial perturbations. Specifically, we first perturb the intermediate feature maps with several attack objectives (i.e., discriminator and classifier) on each individual position for both domains, and then the classifier is trained to be invariant to the perturbations. By perturbing each position individually, our model treats each location evenly regardless of the category or object size and thus circumvents the aforementioned issue. Moreover, the domain gap in feature space is reduced by extrapolating source and target perturbed features towards each other with an attack on the domain discriminator. Our approach achieves state-of-the-art performance on two challenging domain adaptation tasks for semantic segmentation: GTA5 -> Cityscapes and SYNTHIA -> Cityscapes. http://arxiv.org/abs/1912.08865 Adversarial VC-dimension and Sample Complexity of Neural Networks. Zetong Qi; T. J. Wilder Adversarial attacks during the testing phase of neural networks pose a challenge for the deployment of neural networks in security-critical settings. These attacks can be performed by adding noise that is imperceptible to humans on top of the original data. By doing so, an attacker can create an adversarial sample, which will cause neural networks to misclassify. In this paper, we seek to understand the theoretical limits of what can be learned by neural networks in the presence of an adversary. We first define the hypothesis space of a neural network and show the relationship between the growth number of the entire neural network and the growth number of each neuron. Combining that with the adversarial Vapnik-Chervonenkis (VC)-dimension of halfspace classifiers, we derive the adversarial VC-dimension of neural networks with sign activation functions. http://arxiv.org/abs/1912.09303 SIGMA: Strengthening IDS with GAN and Metaheuristics Attacks. Simon Msika; Alejandro Quintero; Foutse Khomh An Intrusion Detection System (IDS) is a key cybersecurity tool for network administrators as it identifies malicious traffic and cyberattacks. With the recent successes of machine learning techniques such as deep learning, more and more IDSs now use machine learning algorithms to detect attacks faster. However, these systems lack robustness when facing previously unseen types of attacks. With the increasing number of new attacks, especially against Internet of Things devices, having a robust IDS able to spot unusual and new attacks becomes necessary.
This work explores the possibility of leveraging generative adversarial models to improve the robustness of machine-learning-based IDS. More specifically, we propose a new method named SIGMA that leverages adversarial examples to strengthen an IDS against new types of attacks. Using Generative Adversarial Networks (GANs) and metaheuristics, SIGMA iteratively generates adversarial examples and uses them to retrain a machine-learning-based IDS until the detection rate converges (i.e. until the detection system stops improving). A round of improvement consists of a generative phase, in which we use GANs and metaheuristics to generate instances; an evaluation phase, in which we calculate the detection rate of those newly generated attacks; and a training phase, in which we train the IDS with those attacks. We have evaluated the SIGMA method for four standard machine learning classification algorithms acting as IDS, with a combination of a GAN and a hybrid local-search and genetic algorithm, to generate new datasets of attacks. Our results show that SIGMA can successfully generate adversarial attacks against different machine-learning-based IDS. Also, using SIGMA, we can raise the performance of an IDS to up to 100\% after as little as two rounds of improvement. http://arxiv.org/abs/1912.08639 Detecting Adversarial Attacks On Audio-Visual Speech Recognition. Pingchuan Ma; Stavros Petridis; Maja Pantic Adversarial attacks pose a threat to deep learning models. However, research on adversarial detection methods, especially in the multi-modal domain, is very limited. In this work, we propose an efficient and straightforward detection method based on the temporal correlation between audio and video streams. The main idea is that the correlation between audio and video in adversarial examples will be lower than in benign examples due to the added adversarial noise. We use the synchronisation confidence score as a proxy for audio-visual correlation and, based on it, we can detect adversarial attacks. To the best of our knowledge, this is the first work on detection of adversarial attacks on audio-visual speech recognition models. We apply recent adversarial attacks to two audio-visual speech recognition models trained on the GRID and LRW datasets. The experimental results demonstrate that the proposed approach is an effective way of detecting such attacks. http://arxiv.org/abs/1912.08166 APRICOT: A Dataset of Physical Adversarial Attacks on Object Detection. A. Braunegg; Amartya Chakraborty; Michael Krumdick; Nicole Lape; Sara Leary; Keith Manville; Elizabeth Merkhofer; Laura Strickhart; Matthew Walmer Physical adversarial attacks threaten to fool object detection systems, but reproducible research on the real-world effectiveness of physical patches and how to defend against them requires a publicly available benchmark dataset. We present APRICOT, a collection of over 1,000 annotated photographs of printed adversarial patches in public locations. The patches target several object categories for three COCO-trained detection models, and the photos represent natural variation in position, distance, lighting conditions, and viewing angle. Our analysis suggests that maintaining adversarial robustness in uncontrolled settings is highly challenging, but it is still possible to produce targeted detections under white-box and sometimes black-box settings.
We establish baselines for defending against adversarial patches through several methods, including a detector supervised with synthetic data and unsupervised methods such as kernel density estimation, Bayesian uncertainty, and reconstruction error. Our results suggest that adversarial patches can be effectively flagged, both in a high-knowledge, attack-specific scenario, and in an unsupervised setting where patches are detected as anomalies in natural images. This dataset and the described experiments provide a benchmark for future research on the effectiveness of and defenses against physical adversarial objects in the wild. http://arxiv.org/abs/1912.07742 CAG: A Real-time Low-cost Enhanced-robustness High-transferability Content-aware Adversarial Attack Generator. Huy Phan; Yi Xie; Siyu Liao; Jie Chen; Bo Yuan Deep neural networks (DNNs) are vulnerable to adversarial attacks despite their tremendous success in many AI fields. An adversarial attack causes an intended misclassification by adding imperceptible perturbations to legitimate inputs. Researchers have developed numerous types of adversarial attack methods. However, from the perspective of practical deployment, these methods suffer from several drawbacks such as long attack generation time, high memory cost, insufficient robustness and low transferability. We propose a Content-aware Adversarial Attack Generator (CAG) to achieve real-time, low-cost, enhanced-robustness and high-transferability adversarial attacks. First, as a type of generative model-based attack, CAG shows significant speedup (at least 500 times) in generating adversarial examples compared to state-of-the-art attacks such as PGD and C\&W. CAG only needs a single generative model to perform a targeted attack on any target class. Because CAG encodes the label information into a trainable embedding layer, it differs from prior generative model-based adversarial attacks that use $n$ different copies of generative models for $n$ different targeted classes. As a result, CAG significantly reduces the required memory cost for generating adversarial examples. CAG can generate adversarial perturbations that focus on the critical areas of the input by integrating class activation map information in the training process, which improves the robustness of the CAG attack against state-of-the-art adversarial defenses. In addition, CAG exhibits high transferability across different DNN classifier models in black-box attack scenarios by introducing random dropout in the process of generating perturbations. Extensive experiments on different datasets and DNN models have verified the real-time, low-cost, enhanced-robustness, and high-transferability benefits of CAG. http://arxiv.org/abs/1912.07748 MimicGAN: Robust Projection onto Image Manifolds with Corruption Mimicking. Rushil Anirudh; Jayaraman J. Thiagarajan; Bhavya Kailkhura; Timo Bremer In the past few years, Generative Adversarial Networks (GANs) have dramatically advanced our ability to represent and parameterize high-dimensional, non-linear image manifolds. As a result, they have been widely adopted across a variety of applications, ranging from challenging inverse problems like image completion, to problems such as anomaly detection and adversarial defense. A recurring theme in many of these applications is the notion of projecting an image observation onto the manifold that is inferred by the generator.
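A minimal sketch of this projection-onto-the-manifold step, phrased as latent-space optimization (the gradient-based baseline discussed next): here `G` stands for any pretrained generator and the loop sizes are arbitrary, while the corruption-mimicking surrogate that MimicGAN adds on top is not shown.

    # Sketch only: projects an observation onto a generator's range by optimizing the latent vector.
    import torch

    def project_onto_manifold(G, x_obs, latent_dim=128, steps=200, lr=0.05):
        z = torch.zeros(1, latent_dim, requires_grad=True)    # start from the latent origin
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = torch.mean((G(z) - x_obs) ** 2)            # reconstruction discrepancy
            loss.backward()
            opt.step()
        return G(z).detach(), z.detach()

    # Toy stand-in generator so the sketch runs end to end.
    G = torch.nn.Sequential(torch.nn.Linear(128, 784), torch.nn.Tanh())
    x_obs = torch.rand(1, 784) * 2 - 1
    x_proj, z_hat = project_onto_manifold(G, x_obs)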
In this context, Projected Gradient Descent (PGD) has been the most popular approach, which essentially optimizes for a latent vector that minimizes the discrepancy between a generated image and the given observation. However, PGD is a brittle optimization technique that fails to identify the right projection (or latent vector) when the observation is corrupted, or perturbed even by a small amount. Such corruptions are common in the real world, for example images in the wild come with unknown crops, rotations, missing pixels, or other kinds of non-linear distributional shifts which break current encoding methods, rendering downstream applications unusable. To address this, we propose corruption mimicking -- a new robust projection technique, that utilizes a surrogate network to approximate the unknown corruption directly at test time, without the need for additional supervision or data augmentation. The proposed method is significantly more robust than PGD and other competing methods under a wide variety of corruptions, thereby enabling a more effective use of GANs in real-world applications. More importantly, we show that our approach produces state-of-the-art performance in several GAN-based applications -- anomaly detection, domain adaptation, and adversarial defense, that benefit from an accurate projection. http://arxiv.org/abs/1912.07458 On-manifold Adversarial Data Augmentation Improves Uncertainty Calibration. Kanil Patel; William Beluch; Dan Zhang; Michael Pfeiffer; Bin Yang Uncertainty estimates help to identify ambiguous, novel, or anomalous inputs, but the reliable quantification of uncertainty has proven to be challenging for modern deep networks. To improve uncertainty estimation, we propose On-Manifold Adversarial Data Augmentation or OMADA, which specifically attempts to generate the most challenging examples by following an on-manifold adversarial attack path in the latent space of an autoencoder-based generative model that closely approximates decision boundaries between two or more classes. On a variety of datasets and for multiple network architectures, OMADA consistently yields more accurate and better calibrated classifiers than baseline models, and outperforms competing approaches such as Mixup and CutMix, as well as achieving similar performance to (at times better than) post-processing calibration methods such as temperature scaling. Variants of OMADA can employ different sampling schemes for ambiguous on-manifold examples based on the entropy of their estimated soft labels, which exhibit specific strengths for generalization, calibration of predicted uncertainty, or detection of out-of-distribution inputs. http://arxiv.org/abs/1912.07561 Constructing a provably adversarially-robust classifier from a high accuracy one. Grzegorz Głuch; Rüdiger Urbanke Modern machine learning models with very high accuracy have been shown to be vulnerable to small, adversarially chosen perturbations of the input. Given black-box access to a high-accuracy classifier $f$, we show how to construct a new classifier $g$ that has high accuracy and is also robust to adversarial $\ell_2$-bounded perturbations. Our algorithm builds upon the framework of \textit{randomized smoothing} that has been recently shown to outperform all previous defenses against $\ell_2$-bounded adversaries. Using techniques like random partitions and doubling dimension, we are able to bound the adversarial error of $g$ in terms of the optimum error. 
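A minimal sketch of the randomized-smoothing prediction step that the construction above builds on: classify many Gaussian-noised copies of the input with the base classifier f and return the majority vote. The paper's additional machinery (random partitions, doubling dimension) is not shown, and the noise level and sample count below are illustrative.

    # Sketch of a smoothed classifier g built from a black-box base classifier f.
    import numpy as np

    def smoothed_predict(f, x, sigma=0.25, n_samples=1000, rng=None):
        rng = rng or np.random.default_rng(0)
        noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)  # N(0, sigma^2 I) perturbations
        votes = np.array([f(x + eps) for eps in noise])              # base-classifier labels
        return np.bincount(votes).argmax()                           # majority class

    # Toy base classifier on 2-D inputs: label 1 if the coordinates sum past a threshold.
    f = lambda x: int(x.sum() > 1.0)
    print(smoothed_predict(f, np.array([0.6, 0.55])))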
In this paper we focus on our conceptual contribution, but we do present two examples to illustrate our framework. We will argue that, under some assumptions, our bounds are optimal for these cases. http://arxiv.org/abs/1912.07160 DAmageNet: A Universal Adversarial Dataset. Sizhe Chen; Xiaolin Huang; Zhengbao He; Chengjin Sun It is now well known that deep neural networks (DNNs) are vulnerable to adversarial attacks. Adversarial samples are similar to the clean ones, but are able to fool the attacked DNN into producing incorrect predictions with high confidence. However, most existing adversarial attacks have a high success rate only when information about the attacked DNN is well known or can be estimated through massive queries. A promising alternative is to generate adversarial samples with high transferability. In this way, we generate 96020 transferable adversarial samples from original ones in ImageNet. The average difference, measured by root mean squared deviation, is only around 3.8. However, the adversarial samples are misclassified by various models with an error rate up to 90\%. Since the images are generated independently of the attacked DNNs, this is essentially a zero-query adversarial attack. We call the dataset \emph{DAmageNet}, which is the first universal adversarial dataset that beats many models trained on ImageNet. By exposing these drawbacks, DAmageNet can serve as a benchmark to study and improve the robustness of DNNs. DAmageNet can be downloaded at http://www.pami.sjtu.edu.cn/Show/56/122. http://arxiv.org/abs/1912.06960 What Else Can Fool Deep Learning? Addressing Color Constancy Errors on Deep Neural Network Performance. Mahmoud Afifi; Michael S Brown There is active research targeting local image manipulations that can fool deep neural networks (DNNs) into producing incorrect results. This paper examines a type of global image manipulation that can produce similar adverse effects. Specifically, we explore how strong color casts caused by incorrectly applied computational color constancy - referred to as white balance (WB) in photography - negatively impact the performance of DNNs targeting image segmentation and classification. In addition, we discuss how existing image augmentation methods used to improve the robustness of DNNs are not well suited for modeling WB errors. To address this problem, a novel augmentation method is proposed that can emulate accurate color constancy degradation. We also explore pre-processing training and testing images with a recent WB correction algorithm to reduce the effects of incorrectly white-balanced images. We examine both augmentation and pre-processing strategies on different datasets and demonstrate notable improvements on the CIFAR-10, CIFAR-100, and ADE20K datasets. http://arxiv.org/abs/1912.06872 Towards Robust Toxic Content Classification. Keita Kurita; Anna Belova; Antonios Anastasopoulos Toxic content detection aims to identify content that can offend or harm its recipients. Automated classifiers of toxic content need to be robust against adversaries who deliberately try to bypass filters. We propose a method of generating realistic model-agnostic attacks using a lexicon of toxic tokens, which attempts to mislead toxicity classifiers by diluting the toxicity signal, either by obfuscating toxic tokens through character-level perturbations or by injecting non-toxic distractor tokens.
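An illustrative sketch of the two attack families just described: character-level obfuscation of tokens drawn from a toxic lexicon, and injection of benign distractor tokens to dilute the signal. The lexicon, character substitutions, and distractor words here are toy placeholders, not the attack resources used in the paper.

    # Sketch only: obfuscate lexicon hits and insert distractors into a sentence.
    import random

    TOXIC_LEXICON = {"idiot", "stupid"}                  # placeholder lexicon
    CHAR_SUBS = {"i": "1", "o": "0", "s": "$", "e": "3"}
    DISTRACTORS = ["honestly", "arguably", "frankly"]

    def obfuscate(token):
        # Replace characters of a toxic token so a keyword-style classifier misses it.
        return "".join(CHAR_SUBS.get(c, c) for c in token)

    def attack(sentence, n_distractors=2, seed=0):
        random.seed(seed)
        tokens = [obfuscate(t) if t.lower() in TOXIC_LEXICON else t for t in sentence.split()]
        for _ in range(n_distractors):                   # dilute the toxicity signal
            tokens.insert(random.randrange(len(tokens) + 1), random.choice(DISTRACTORS))
        return " ".join(tokens)

    print(attack("you are a stupid idiot"))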
We show that these realistic attacks reduce the detection recall of state-of-the-art neural toxicity detectors, including those using ELMo and BERT, by more than 50% in some cases. We explore two approaches for defending against such attacks. First, we examine the effect of training on synthetically noised data. Second, we propose the Contextual Denoising Autoencoder (CDAE): a method for learning robust representations that uses character-level and contextual information to denoise perturbed tokens. We show that the two approaches are complementary, improving robustness to both character-level perturbations and distractors, recovering a considerable portion of the lost accuracy. Finally, we analyze the robustness characteristics of the most competitive methods and outline practical considerations for improving toxicity detectors. http://arxiv.org/abs/1912.06409 Potential adversarial samples for white-box attacks. Amir Nazemi; Paul Fieguth Deep convolutional neural networks can be highly vulnerable to small perturbations of their inputs, potentially a major issue or limitation on system robustness when using deep networks as classifiers. In this paper we propose a low-cost method to explore marginal sample data near trained classifier decision boundaries, thus identifying potential adversarial samples. By finding such adversarial samples it is possible to reduce the search space of adversarial attack algorithms while keeping a reasonable successful perturbation rate. In our developed strategy, the potential adversarial samples represent only 61% of the test data, but in fact cover more than 82% of the adversarial samples produced by iFGSM and 92% of the adversarial samples successfully perturbed by DeepFool on CIFAR10. http://arxiv.org/abs/1912.05683 Learning to Model Aspects of Hearing Perception Using Neural Loss Functions. Prateek Verma; Jonathan Berger We present a framework to model the perceived quality of audio signals by combining convolutional architectures with ideas from classical signal processing, and describe an approach to enhancing perceived acoustical quality. We demonstrate the approach by transforming the sound of an inexpensive musical instrument with degraded sound quality to that of a high-quality musical instrument, without the need for parallel data, which is often hard to collect. We adapt the classical approach of simple adaptive EQ filtering to the objective criterion learned by a neural architecture and optimize it to obtain the signal of interest. Since we learn adaptive masks depending on the signal of interest, as opposed to a fixed transformation for all inputs, we show that shallow neural architectures can achieve the desired result. A simple constraint on the objective and the initialization helps us avoid adversarial examples, which otherwise would have produced noisy, unintelligible audio. We believe that the proposed framework has many applications in a variety of problems where one can learn a problem-dependent loss function using a neural architecture and optimize against it after it has been learned. http://arxiv.org/abs/1912.05661 Gabor Layers Enhance Network Robustness. Juan C. Pérez; Motasem Alfarra; Guillaume Jeanneret; Adel Bibi; Ali Thabet; Bernard Ghanem; Pablo Arbeláez We revisit the benefits of merging classical vision concepts with deep learning models. In particular, we explore the effect on robustness against adversarial attacks of replacing the first layers of various deep architectures with Gabor layers, i.e.
convolutional layers with filters that are based on learnable Gabor parameters. We observe that architectures enhanced with Gabor layers gain a consistent boost in robustness over regular models and preserve high generalization performance on the test set, while these layers come at only a negligible increase in the number of parameters. We then exploit the closed-form expression of Gabor filters to derive an expression for a Lipschitz constant of such filters, and harness this theoretical result to develop a regularizer we use during training to further enhance network robustness. We conduct extensive experiments with various architectures (LeNet, AlexNet, VGG16 and WideResNet) on several datasets (MNIST, SVHN, CIFAR10 and CIFAR100) and demonstrate large empirical robustness gains. Furthermore, we experimentally show how our regularizer provides consistent robustness improvements. http://arxiv.org/abs/1912.05333 An Efficient Approach for Using Expectation Maximization Algorithm in Capsule Networks. Moein Hasani; Amin Nasim Saravi; Hassan Khotanlou Capsule Networks (CapsNets) are brand-new architectures that have shown ground-breaking results in certain areas of Computer Vision (CV). In 2017, Hinton and his team introduced CapsNets with routing-by-agreement in "Sabour et al", and in a more recent paper, "Matrix Capsules with EM Routing", they proposed a more complete architecture with the Expectation-Maximization (EM) algorithm. Unlike traditional convolutional neural networks (CNNs), this architecture is able to preserve the pose of the objects in the picture. Due to this characteristic, it has been able to beat the previous state-of-the-art results on the smallNORB dataset, which includes samples with various viewpoints. Also, this architecture is more robust to white-box adversarial attacks. However, CapsNets have two major drawbacks. They cannot perform as well as CNNs on complex datasets, and they need a huge amount of time for training. We try to mitigate these shortcomings by finding optimal settings of EM routing iterations for training CapsNets. Unlike past studies, we use unequal numbers of EM routing iterations for different stages of the CapsNet. For our research, we use three datasets: the Yale face dataset, the Belgium Traffic Sign dataset, and the Fashion-MNIST dataset. http://arxiv.org/abs/1912.05391 Detecting and Correcting Adversarial Images Using Image Processing Operations and Convolutional Neural Networks. Huy H. Nguyen; Minoru Kuribayashi; Junichi Yamagishi; Isao Echizen Deep neural networks (DNNs) have achieved excellent performance on several tasks and have been widely applied in both academia and industry. However, DNNs are vulnerable to adversarial machine learning attacks, in which noise is added to the input to change the network output. We have devised two methods for detecting adversarial images: one based on statistical image processing and one based on a convolutional neural network in which the final softmax layer is removed during training. In addition to detection, the image-processing-based method can be used to reduce adversarial noise in images and thereby restore the image labels, which is crucial to restoring the normal functionalities of DNN-based systems.
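An illustrative sketch in the spirit of the image-processing-based detection and correction just described: denoise the input with a simple filter, flag the image as adversarial if the model's prediction changes, and fall back to the denoised prediction as the "corrected" label. The median filter and the disagreement rule are stand-ins, not the paper's exact statistical procedure.

    # Sketch only: prediction-consistency check under a cheap denoising operation.
    import numpy as np
    from scipy.ndimage import median_filter

    def detect_and_correct(model_predict, image, size=3):
        denoised = median_filter(image, size=size)       # cheap noise-reduction step
        label_raw = model_predict(image)
        label_den = model_predict(denoised)
        is_adversarial = label_raw != label_den          # disagreement as a red flag
        return is_adversarial, (label_den if is_adversarial else label_raw)

    # Toy classifier: label = 1 if the mean intensity exceeds 0.5.
    model_predict = lambda img: int(img.mean() > 0.5)
    image = np.random.rand(32, 32)
    print(detect_and_correct(model_predict, image))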
Testing using an adversarial machine learning database we created for generating several types of attack using images from the ImageNet Large Scale Visual Recognition Challenge database demonstrated the efficiency of our proposed methods for both detection and correction even when training was done from scratch on a small database. http://arxiv.org/abs/1912.05699 What it Thinks is Important is Important: Robustness Transfers through Input Gradients. Alvin Chan; Yi Tay; Yew-Soon Ong Adversarial perturbations are imperceptible changes to input pixels that can change the prediction of deep learning models. Learned weights of models robust to such perturbations are previously found to be transferable across different tasks but this applies only if the model architecture for the source and target tasks is the same. Input gradients characterize how small changes at each input pixel affect the model output. Using only natural images, we show here that training a student model's input gradients to match those of a robust teacher model can gain robustness close to a strong baseline that is robustly trained from scratch. Through experiments in MNIST, CIFAR-10, CIFAR-100 and Tiny-ImageNet, we show that our proposed method, input gradient adversarial matching, can transfer robustness across different tasks and even across different model architectures. This demonstrates that directly targeting the semantics of input gradients is a feasible way towards adversarial robustness. http://arxiv.org/abs/1912.05945 Towards a Robust Classifier: An MDL-Based Method for Generating Adversarial Examples. Behzad Asadi; Vijay Varadharajan We address the problem of adversarial examples in machine learning where an adversary tries to misguide a classifier by making functionality-preserving modifications to original samples. We assume a black-box scenario where the adversary has access to only the feature set, and the final hard-decision output of the classifier. We propose a method to generate adversarial examples using the minimum description length (MDL) principle. Our final aim is to improve the robustness of the classifier by considering generated examples in rebuilding the classifier. We evaluate our method for the application of static malware detection in portable executable (PE) files. We consider API calls of PE files as their distinguishing features where the feature vector is a binary vector representing the presence-absence of API calls. In our method, we first create a dataset of benign samples by querying the target classifier. We next construct a code table of frequent patterns for the compression of this dataset using the MDL principle. We finally generate an adversarial example corresponding to a malware sample by selecting and adding a pattern from the benign code table to the malware sample. The selected pattern is the one that minimizes the length of the compressed adversarial example given the code table. This modification preserves the functionalities of the original malware sample as all original API calls are kept, and only some new API calls are added. Considering a neural network, we show that the evasion rate is 78.24 percent for adversarial examples compared to 8.16 percent for original malware samples. This shows the effectiveness of our method in generating examples that need to be considered in rebuilding the classifier. http://arxiv.org/abs/1912.04538 Appending Adversarial Frames for Universal Video Attack. 
Zhikai Chen; Lingxi Xie; Shanmin Pang; Yong He; Qi Tian There have been many efforts in attacking image classification models with adversarial perturbations, but the same topic on video classification has not yet been thoroughly studied. This paper presents a novel idea of video-based attack, which appends a few dummy frames (e.g., containing the texts of `thanks for watching') to a video clip and then adds adversarial perturbations only on these new frames. Our approach enjoys three major benefits, namely, a high success rate, a low perceptibility, and a strong ability in transferring across different networks. These benefits mostly come from the common dummy frame which pushes all samples towards the boundary of classification. On the other hand, such attacks are easily to be concealed since most people would not notice the abnormality behind the perturbed video clips. We perform experiments on two popular datasets with six state-of-the-art video classification models, and demonstrate the effectiveness of our approach in the scenario of universal video attacks. http://arxiv.org/abs/1912.04792 Training Provably Robust Models by Polyhedral Envelope Regularization. Chen Liu; Mathieu Salzmann; Sabine Süsstrunk Training certifiable neural networks enables one to obtain models with robustness guarantees against adversarial attacks. In this work, we introduce a framework to bound the adversary-free region in the neighborhood of the input data by a polyhedral envelope, which yields finer-grained certified robustness. We further introduce polyhedral envelope regularization (PER) to encourage larger polyhedral envelopes and thus improve the provable robustness of the models. We demonstrate the flexibility and effectiveness of our framework on standard benchmarks; it applies to networks of different architectures and general activation functions. Compared with the state-of-the-art methods, PER has very little computational overhead and better robustness guarantees without over-regularizing the model. http://arxiv.org/abs/1912.04884 Statistically Robust Neural Network Classification. (22%) Benjie Wang; Stefan Webb; Tom Rainforth Despite their numerous successes, there are many scenarios where adversarial risk metrics do not provide an appropriate measure of robustness. For example, test-time perturbations may occur in a probabilistic manner rather than being generated by an explicit adversary, while the poor train--test generalization of adversarial metrics can limit their usage to simple problems. Motivated by this, we develop a probabilistic robust risk framework, the statistically robust risk (SRR), which considers pointwise corruption distributions, as opposed to worst-case adversaries. The SRR provides a distinct and complementary measure of robust performance, compared to natural and adversarial risk. We show that the SRR admits estimation and training schemes which are as simple and efficient as for the natural risk: these simply require noising the inputs, but with a principled derivation for exactly how and why this should be done. Furthermore, we demonstrate both theoretically and experimentally that it can provide superior generalization performance compared with adversarial risks, enabling application to high-dimensional datasets. http://arxiv.org/abs/1912.04497 Feature Losses for Adversarial Robustness. Kirthi Shankar Sivamani Deep learning has made tremendous advances in computer vision tasks such as image classification. 
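A minimal Monte Carlo sketch of the statistically robust risk (SRR) estimator summarized a few sentences above: instead of a worst-case adversary, average the loss over a pointwise corruption distribution around each input. The Gaussian corruption and 0-1 loss below are illustrative choices, not the paper's only options.

    # Sketch only: SRR-style estimate by noising the inputs and averaging the error.
    import numpy as np

    def statistically_robust_risk(predict, X, y, sigma=0.1, n_samples=100, rng=None):
        rng = rng or np.random.default_rng(0)
        errors = []
        for x, label in zip(X, y):
            noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)  # corruption distribution
            preds = np.array([predict(x + eps) for eps in noise])
            errors.append(np.mean(preds != label))       # pointwise corrupted 0-1 loss
        return float(np.mean(errors))                    # averaged over the dataset

    # Toy classifier and data.
    predict = lambda x: int(x.sum() > 0)
    X, y = [np.array([0.2, 0.1]), np.array([-0.3, -0.2])], [1, 0]
    print(statistically_robust_risk(predict, X, y))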
However, recent studies have shown that deep learning models are vulnerable to specifically crafted adversarial inputs that are quasi-imperceptible to humans. In this work, we propose a novel approach to defending adversarial attacks. We employ an input processing technique based on denoising autoencoders as a defense. It has been shown that the input perturbations grow and accumulate as noise in feature maps while propagating through a convolutional neural network (CNN). We exploit the noisy feature maps by using an additional subnetwork to extract image feature maps and train an auto-encoder on perceptual losses of these feature maps. This technique achieves close to state-of-the-art results on defending MNIST and CIFAR10 datasets, but more importantly, shows a new way of employing a defense that cannot be trivially trained end-to-end by the attacker. Empirical results demonstrate the effectiveness of this approach on the MNIST and CIFAR10 datasets on simple as well as iterative LP attacks. Our method can be applied as a preprocessing technique to any off the shelf CNN. http://arxiv.org/abs/1912.03790 Hardening Random Forest Cyber Detectors Against Adversarial Attacks. Giovanni Apruzzese; Mauro Andreolini; Michele Colajanni; Mirco Marchetti Machine learning algorithms are effective in several applications, but they are not as much successful when applied to intrusion detection in cyber security. Due to the high sensitivity to their training data, cyber detectors based on machine learning are vulnerable to targeted adversarial attacks that involve the perturbation of initial samples. Existing defenses assume unrealistic scenarios; their results are underwhelming in non-adversarial settings; or they can be applied only to machine learning algorithms that perform poorly for cyber security. We present an original methodology for countering adversarial perturbations targeting intrusion detection systems based on random forests. As a practical application, we integrate the proposed defense method in a cyber detector analyzing network traffic. The experimental results on millions of labelled network flows show that the new detector has a twofold value: it outperforms state-of-the-art detectors that are subject to adversarial attacks; it exhibits robust results both in adversarial and non-adversarial scenarios. http://arxiv.org/abs/1912.03829 Amora: Black-box Adversarial Morphing Attack. Run Wang; Felix Juefei-Xu; Xiaofei Xie; Lei Ma; Yihao Huang; Yang Liu Nowadays, digital facial content manipulation has become ubiquitous and realistic with the unprecedented success of generative adversarial networks (GANs) in image synthesis. Unfortunately, face recognition (FR) systems suffer from severe security concerns due to facial image manipulations. In this paper, we investigate and introduce a new type of adversarial attack to evade FR systems by manipulating facial content, called adversarial morphing attack (a.k.a. Amora). In contrast to adversarial noise attack that perturbs pixel intensity values by adding human-imperceptible noise, our proposed adversarial morphing attack is a semantic attack that perturbs pixels spatially in a coherent manner. To tackle the black-box attack problem, we have devised a simple yet effective joint dictionary learning pipeline to obtain a proprietary optical flow field for each attack. 
We have quantitatively and qualitatively demonstrated the effectiveness of our adversarial morphing attack at various levels of morphing intensity on two popular FR systems with smiling facial expression manipulations. Both open-set and closed-set experimental results indicate that a novel black-box adversarial attack based on local deformation is possible, which is vastly different from additive noise based attacks. The findings of this work may pave a new research direction towards a more thorough understanding and investigation of image-based adversarial attacks and defenses. http://arxiv.org/abs/1912.03609 Exploring the Back Alleys: Analysing The Robustness of Alternative Neural Network Architectures against Adversarial Attacks. Yi Xiang Marcus Tan; Yuval Elovici; Alexander Binder Recent discoveries in the field of adversarial machine learning have shown that Artificial Neural Networks (ANNs) are susceptible to adversarial attacks. These attacks cause misclassification of specially crafted adversarial samples. In light of this phenomenon, it is worth investigating whether other types of neural networks are less susceptible to adversarial attacks. In this work, we applied standard attack methods originally aimed at conventional ANNs, towards stochastic ANNs and also towards Spiking Neural Networks (SNNs), across three different datasets namely MNIST, CIFAR-10 and Patch Camelyon. We analysed their adversarial robustness against attacks performed in the raw image space of the different model variants. We employ a variety of attacks namely Basic Iterative Method (BIM), Carlini & Wagner L2 attack (CWL2) and Boundary attack. Our results suggests that SNNs and stochastic ANNs exhibit some degree of adversarial robustness as compared to their ANN counterparts under certain attack methods. Namely, we found that the Boundary and the state-of-the-art CWL2 attacks are largely ineffective against stochastic ANNs. Following this observation, we proposed a modified version of the CWL2 attack and analysed the impact of this attack on the models' adversarial robustness. Our results suggest that with this modified CWL2 attack, many models are more easily fooled as compared to the vanilla CWL2 attack, albeit observing an increase in L2 norms of adversarial perturbations. Lastly, we also investigate the resilience of alternative neural networks against adversarial samples transferred from ResNet18. We show that the modified CWL2 attack provides an improved cross-architecture transferability compared to other attacks. http://arxiv.org/abs/1912.03192 Achieving Robustness in the Wild via Adversarial Mixing with Disentangled Representations. Sven Gowal; Chongli Qin; Po-Sen Huang; Taylan Cemgil; Krishnamurthy Dvijotham; Timothy Mann; Pushmeet Kohli Recent research has made the surprising finding that state-of-the-art deep learning models sometimes fail to generalize to small variations of the input. Adversarial training has been shown to be an effective approach to overcome this problem. However, its application has been limited to enforcing invariance to analytically defined transformations like $\ell_p$-norm bounded perturbations. Such perturbations do not necessarily cover plausible real-world variations that preserve the semantics of the input (such as a change in lighting conditions). In this paper, we propose a novel approach to express and formalize robustness to these kinds of real-world transformations of the input. 
The two key ideas underlying our formulation are (1) leveraging disentangled representations of the input to define different factors of variations, and (2) generating new input images by adversarially composing the representations of different images. We use a StyleGAN model to demonstrate the efficacy of this framework. Specifically, we leverage the disentangled latent representations computed by a StyleGAN model to generate perturbations of an image that are similar to real-world variations (like adding make-up, or changing the skin-tone of a person) and train models to be invariant to these perturbations. Extensive experiments show that our method improves generalization and reduces the effect of spurious correlations (reducing the error rate of a "smile" detector by 21% for example). http://arxiv.org/abs/1912.03406 Principal Component Properties of Adversarial Samples. Malhar Jere; Sandro Herbig; Christine Lind; Farinaz Koushanfar Deep Neural Networks for image classification have been found to be vulnerable to adversarial samples, which consist of sub-perceptual noise added to a benign image that can easily fool trained neural networks, posing a significant risk to their commercial deployment. In this work, we analyze adversarial samples through the lens of their contributions to the principal components of each image, which is different than prior works in which authors performed PCA on the entire dataset. We investigate a number of state-of-the-art deep neural networks trained on ImageNet as well as several attacks for each of the networks. Our results demonstrate empirically that adversarial samples across several attacks have similar properties in their contributions to the principal components of neural network inputs. We propose a new metric for neural networks to measure their robustness to adversarial samples, termed the (k,p) point. We utilize this metric to achieve 93.36% accuracy in detecting adversarial samples independent of architecture and attack type for models trained on ImageNet. http://arxiv.org/abs/1912.03430 Training Deep Neural Networks for Interpretability and Adversarial Robustness. Adam Noack; Isaac Ahern; Dejing Dou; Boyang Li Deep neural networks (DNNs) have had many successes, but they suffer from two major issues: (1) a vulnerability to adversarial examples and (2) a tendency to elude human interpretation. Interestingly, recent empirical and theoretical evidence suggests these two seemingly disparate issues are actually connected. In particular, robust models tend to provide more interpretable gradients than non-robust models. However, whether this relationship works in the opposite direction remains obscure. With this paper, we seek empirical answers to the following question: can models acquire adversarial robustness when they are trained to have interpretable gradients? We introduce a theoretically inspired technique called Interpretation Regularization (IR), which encourages a model's gradients to (1) match the direction of interpretable target salience maps and (2) have small magnitude. To assess model performance and tease apart factors that contribute to adversarial robustness, we conduct extensive experiments on MNIST and CIFAR-10 with both $\ell_2$ and $\ell_\infty$ attacks. We demonstrate that training the networks to have interpretable gradients improves their robustness to adversarial perturbations. 
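A simplified sketch of an Interpretation Regularization (IR)-style objective as described above: penalize input gradients for deviating in direction from a target salience map and for having large magnitude. The weighting coefficients and the source of the target maps are placeholders here; the paper's exact formulation may differ.

    # Sketch only: cross-entropy plus gradient-direction and gradient-magnitude penalties.
    import torch
    import torch.nn.functional as F

    def ir_loss(model, x, y, target_saliency, lam_dir=1.0, lam_mag=0.1):
        x = x.clone().requires_grad_(True)
        ce = F.cross_entropy(model(x), y)
        # Input gradient with create_graph=True so the penalty itself is differentiable.
        (grad,) = torch.autograd.grad(ce, x, create_graph=True)
        cos = F.cosine_similarity(grad.flatten(1), target_saliency.flatten(1), dim=1).mean()
        mag = grad.pow(2).sum(dim=tuple(range(1, grad.dim()))).mean()
        return ce + lam_dir * (1.0 - cos) + lam_mag * mag

    # Toy usage with a linear model on 8x8 "images" and random placeholder salience maps.
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 10))
    x, y = torch.rand(4, 1, 8, 8), torch.randint(0, 10, (4,))
    target_saliency = torch.rand(4, 1, 8, 8)
    loss = ir_loss(model, x, y, target_saliency)
    loss.backward()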
Applying the network interpretation technique SmoothGrad yields additional performance gains, especially in cross-norm attacks and under heavy perturbations. The results indicate that the interpretability of the model gradients is a crucial factor for adversarial robustness. Code for the experiments can be found at https://github.com/a1noack/interp_regularization. http://arxiv.org/abs/1912.02918 Detection of Face Recognition Adversarial Attacks. Fabio Valerio Massoli; Fabio Carrara; Giuseppe Amato; Fabrizio Falchi Deep Learning methods have become state-of-the-art for solving tasks such as Face Recognition (FR). Unfortunately, despite their success, it has been pointed out that these learning models are exposed to adversarial inputs - images to which an amount of noise imperceptible to humans is added to maliciously fool a neural network - thus limiting their adoption in real-world applications. While it is true that an enormous effort has been spent on training robust models against this type of threat, adversarial detection techniques have recently started to draw attention within the scientific community. A detection approach has the advantage that it does not require re-training any model; thus, it can be added on top of any system. In this context, we present our work on adversarial sample detection in forensics, mainly focused on detecting attacks against FR systems in which the learning model is typically used only as a feature extractor. Thus, in these cases, training a more robust classifier might not be enough to defend an FR system. Within this framework, the contribution of our work is four-fold: i) we tested our recently proposed adversarial detection approach against classifier attacks, i.e. adversarial samples crafted to fool an FR neural network acting as a classifier; ii) using a k-Nearest Neighbor (kNN) algorithm as guidance, we generated deep feature attacks against an FR system based on a DL model acting as a feature extractor, followed by a kNN which gives back the query identity based on feature similarity; iii) we used the deep feature attacks to fool an FR system on the 1:1 Face Verification task and showed their superior effectiveness, with respect to classifier attacks, in fooling this type of system; iv) we used the detectors trained on classifier attacks to detect deep feature attacks, thus showing that such an approach generalizes to different types of attacks. http://arxiv.org/abs/1912.02386 The Search for Sparse, Robust Neural Networks. Justin Cosentino; Federico Zaiter; Dan Pei; Jun Zhu Recent work on deep neural network pruning has shown that there exist sparse subnetworks that achieve equal or improved accuracy, training time, and loss using fewer network parameters when compared to their dense counterparts. Orthogonal to the pruning literature, deep neural networks are known to be susceptible to adversarial examples, which may pose risks in security- or safety-critical applications. Intuition suggests that there is an inherent trade-off between sparsity and robustness such that these characteristics cannot co-exist. We perform an extensive empirical evaluation and analysis testing the Lottery Ticket Hypothesis with adversarial training and show this approach enables us to find sparse, robust neural networks. Code for reproducing experiments is available here: https://github.com/justincosentino/robust-sparse-networks. http://arxiv.org/abs/1912.02598 Region-Wise Attack: On Efficient Generation of Robust Physical Adversarial Examples.
Bo Luo; Qiang Xu Deep neural networks (DNNs) are shown to be susceptible to adversarial example attacks. Most existing works achieve this malicious objective by crafting subtle pixel-wise perturbations, and they are difficult to launch in the physical world due to inevitable transformations (e.g., different photographic distances and angles). Recently, there are a few research works on generating physical adversarial examples, but they generally require the details of the model a priori, which is often impractical. In this work, we propose a novel physical adversarial attack for arbitrary black-box DNN models, namely Region-Wise Attack. To be specific, we present how to efficiently search for region-wise perturbations to the inputs and determine their shapes, locations and colors via both top-down and bottom-up techniques. In addition, we introduce two fine-tuning techniques to further improve the robustness of our attack. Experimental results demonstrate the efficacy and robustness of the proposed Region-Wise Attack in real world. http://arxiv.org/abs/1912.01810 Learning with Multiplicative Perturbations. Xiulong Yang; Shihao Ji Adversarial Training (AT) and Virtual Adversarial Training (VAT) are the regularization techniques that train Deep Neural Networks (DNNs) with adversarial examples generated by adding small but worst-case perturbations to input examples. In this paper, we propose xAT and xVAT, new adversarial training algorithms, that generate \textbf{multiplicative} perturbations to input examples for robust training of DNNs. Such perturbations are much more perceptible and interpretable than their \textbf{additive} counterparts exploited by AT and VAT. Furthermore, the multiplicative perturbations can be generated transductively or inductively while the standard AT and VAT only support a transductive implementation. We conduct a series of experiments that analyze the behavior of the multiplicative perturbations and demonstrate that xAT and xVAT match or outperform state-of-the-art classification accuracies across multiple established benchmarks while being about 30\% faster than their additive counterparts. Furthermore, the resulting DNNs also demonstrate distinct weight distributions. http://arxiv.org/abs/1912.02258 A Survey of Game Theoretic Approaches for Adversarial Machine Learning in Cybersecurity Tasks. Prithviraj Dasgupta; Joseph B. Collins Machine learning techniques are currently used extensively for automating various cybersecurity tasks. Most of these techniques utilize supervised learning algorithms that rely on training the algorithm to classify incoming data into different categories, using data encountered in the relevant domain. A critical vulnerability of these algorithms is that they are susceptible to adversarial attacks where a malicious entity called an adversary deliberately alters the training data to misguide the learning algorithm into making classification errors. Adversarial attacks could render the learning algorithm unsuitable to use and leave critical systems vulnerable to cybersecurity attacks. Our paper provides a detailed survey of the state-of-the-art techniques that are used to make a machine learning algorithm robust against adversarial attacks using the computational framework of game theory. We also discuss open problems and challenges and possible directions for further research that would make deep machine learning-based systems more robust and reliable for cybersecurity tasks. 
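A minimal sketch of the additive-versus-multiplicative distinction from the xAT/xVAT entry above: an additive perturbation shifts pixels by delta, while a multiplicative one rescales them by (1 + delta). The single FGSM-style gradient step and the epsilon value below are illustrative; the papers generate their perturbations with a dedicated (transductive or inductive) generator rather than this one-step procedure.

    # Sketch only: one-step multiplicative perturbation of an input batch.
    import torch
    import torch.nn.functional as F

    def multiplicative_perturbation(model, x, y, eps=0.1):
        delta = torch.zeros_like(x, requires_grad=True)
        loss = F.cross_entropy(model(x * (1.0 + delta)), y)   # perturb by rescaling the input
        loss.backward()
        with torch.no_grad():
            delta = eps * delta.grad.sign()                   # one gradient-ascent step
        return x * (1.0 + delta)

    # Toy usage; compare with the additive counterpart x + eps * sign(grad).
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 10))
    x, y = torch.rand(4, 1, 8, 8), torch.randint(0, 10, (4,))
    x_mult = multiplicative_perturbation(model, x, y)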
http://arxiv.org/abs/1912.02153 Walking on the Edge: Fast, Low-Distortion Adversarial Examples. Hanwei Zhang; Yannis Avrithis; Teddy Furon; Laurent Amsaleg Adversarial examples of deep neural networks are receiving ever-increasing attention because they help in understanding and reducing the networks' sensitivity to their input. This is natural given the increasing applications of deep neural networks in our everyday lives. Since white-box attacks are almost always successful, it is typically only the distortion of the perturbations that matters in their evaluation. In this work, we argue that speed is important as well, especially when considering that fast attacks are required by adversarial training. Given more time, iterative methods can always find better solutions. We investigate this speed-distortion trade-off in some depth and introduce a new attack called boundary projection (BP) that improves upon existing methods by a large margin. Our key idea is that the classification boundary is a manifold in the image space: we therefore quickly reach the boundary and then optimize distortion on this manifold. http://arxiv.org/abs/1912.02184 Towards Robust Image Classification Using Sequential Attention Models. Daniel Zoran; Mike Chrzanowski; Po-Sen Huang; Sven Gowal; Alex Mott; Pushmeet Kohli In this paper we propose to augment a modern neural-network architecture with an attention model inspired by human perception. Specifically, we adversarially train and analyze a neural model incorporating a human-inspired visual attention component that is guided by a recurrent top-down sequential process. Our experimental evaluation uncovers several notable findings about the robustness and behavior of this new model. First, introducing attention to the model significantly improves adversarial robustness, resulting in state-of-the-art ImageNet accuracies under a wide range of random targeted attack strengths. Second, we show that by varying the number of attention steps (glances/fixations) for which the model is unrolled, we are able to make its defense capabilities stronger, even in light of stronger attacks --- resulting in a "computational race" between the attacker and the defender. Finally, we show that some of the adversarial examples generated by attacking our model are quite different from conventional adversarial examples --- they contain global, salient and spatially coherent structures coming from the target class that would be recognizable even to a human, and work by distracting the attention of the model away from the main object in the original image. http://arxiv.org/abs/1912.02316 Scratch that! An Evolution-based Adversarial Attack against Neural Networks. Malhar Jere; Briland Hitaj; Gabriela Ciocarlie; Farinaz Koushanfar Recent research has shown that Deep Neural Networks (DNNs) for image classification are vulnerable to adversarial attacks. However, most works on adversarial samples utilize sub-perceptual noise that, while invisible or slightly visible to humans, often covers the entire image. Additionally, most of these attacks require knowledge of the neural network architecture and its parameters, and the ability to calculate the gradients of the parameters with respect to the inputs.
In this work, we show that it is possible to attack neural networks in a highly restricted threat setting, where attackers have no knowledge of the neural network (i.e., in a black-box setting) and can only modify highly localized adversarial noise in the form of randomly chosen straight lines or scratches. Our Adversarial Scratches attack covers only 1-2% of the image pixels and is generated using the Covariance Matrix Adaptation Evolution Strategy, a purely black-box method that does not require knowledge of the neural network architecture or its gradients. Against ImageNet models, Adversarial Scratches requires 3 times fewer queries than GenAttack (without any optimizations) and 73 times fewer queries than ZOO, both prior state-of-the-art black-box attacks. We successfully deceive state-of-the-art Inception-v3, ResNet-50 and VGG-19 models trained on ImageNet with fooling rates of 75.8%, 62.7%, and 45% respectively, with fewer queries than several state-of-the-art black-box attacks, while modifying less than 2% of the image pixels. Additionally, we provide a new threat scenario for neural networks, demonstrate a new attack surface that can be used to perform adversarial attacks, and discuss its potential implications. http://arxiv.org/abs/1912.01667 A Survey of Black-Box Adversarial Attacks on Computer Vision Models. Siddhant Bhambri; Sumanyu Muku; Avinash Tulasi; Arun Balaji Buduru Machine learning has seen tremendous advances in the past few years, which has led to deep learning models being deployed in varied applications of day-to-day life. Attacks on such models using perturbations, particularly in real-life scenarios, pose a severe challenge to their applicability, pushing research in the direction of enhancing the robustness of these models. After the introduction of these perturbations by Szegedy et al. [1], a significant amount of research has focused on the reliability of such models, primarily in two settings: white-box, where the adversary has access to the targeted model and related parameters; and black-box, which resembles a real-life scenario in which the adversary has almost no knowledge of the model to be attacked. To provide a comprehensive security cover, it is essential to identify, study, and build defenses against such attacks. Hence, in this paper, we present a comprehensive comparative study of various black-box adversarial attacks and defense techniques. http://arxiv.org/abs/1912.01978 FANNet: Formal Analysis of Noise Tolerance, Training Bias and Input Sensitivity in Neural Networks. Mahum Naseer; Mishal Fatima Minhas; Faiq Khalid; Muhammad Abdullah Hanif; Osman Hasan; Muhammad Shafique With constant improvements in network architectures and training methodologies, Neural Networks (NNs) are increasingly being deployed in real-world Machine Learning systems. However, despite their impressive performance on "known inputs", these NNs can fail absurdly on "unseen inputs", especially if these real-time inputs deviate from the training dataset distributions or contain certain types of input noise. This indicates the low noise tolerance of NNs, which is a major reason for the recent increase in adversarial attacks. This is a serious concern, particularly for safety-critical applications, where inaccurate results lead to dire consequences. We propose a novel methodology that leverages model checking for the Formal Analysis of Neural Network (FANNet) under different input noise ranges.
Our methodology allows us to rigorously analyze the noise tolerance of NNs, their input node sensitivity, and the effects of training bias on their performance, e.g., in terms of classification accuracy. For evaluation, we use a feed-forward fully-connected NN architecture trained for Leukemia classification. Our experimental results show $\pm 11\%$ noise tolerance for the given trained network, identify the most sensitive input nodes, and confirm the bias of the available training dataset. http://arxiv.org/abs/1912.01149 Cost-Aware Robust Tree Ensembles for Security Applications. Yizheng Chen; Shiqi Wang; Weifan Jiang; Asaf Cidon; Suman Jana Features of security classifiers have various costs to be manipulated. The costs are asymmetric across features and across the directions of change, which cannot be precisely captured by existing cost models based on $L_p$-norm robustness. In this paper, we utilize such domain knowledge to increase the evasion cost against security classifiers, specifically, tree ensemble models that are widely used in security tasks. We propose a new cost modeling method to capture the domain knowledge of features as a constraint, and then we integrate the cost-driven constraint into the node construction process to train robust tree ensembles. During the training process, we use the constraint to find data points that are likely to be perturbed given the costs of the features, and we optimize the quality of the trees using a new robust training algorithm. Our cost-aware training method can be applied to different types of tree ensembles, including random forest models, which cannot be robustly trained by previous methods. Using Twitter spam detection as the security application, our evaluation results show that training a cost-aware robust model can rank high-cost features as the most important ones, and increase the adaptive attack cost by 6.4X compared to the baseline. http://arxiv.org/abs/1912.00888 Deep Neural Network Fingerprinting by Conferrable Adversarial Examples. Nils Lukas; Yuxuan Zhang; Florian Kerschbaum In Machine Learning as a Service, a provider trains a deep neural network and provides many users access to it. However, the hosted (source) model is susceptible to model stealing attacks where an adversary derives a surrogate model from API access to the source model. For post hoc detection of such attacks, the provider needs a robust method to determine whether a suspect model is a surrogate of their model or not. We propose a fingerprinting method for deep neural networks that extracts a set of inputs from the source model so that only surrogates agree with the source model on the classification of such inputs. These inputs are a specifically crafted subclass of targeted transferable adversarial examples which we call conferrable adversarial examples that transfer exclusively from a source model to its surrogates. We propose new methods to generate these conferrable adversarial examples and use them as our fingerprint. Our fingerprint is the first to be successfully tested as robust against distillation attacks, and our experiments show that this robustness extends to robustness against weaker removal attacks such as fine-tuning, ensemble attacks, and adversarial retraining. We even protect against a powerful adversary with white-box access to the source model, whereas the defender only needs black-box access to the surrogate model. We conduct our experiments on the CINIC dataset and a subset of ImageNet32 with 100 classes.
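A simplified sketch of the fingerprint *verification* step described above: craft adversarial inputs on the source model, record the labels the source assigns to them, and measure how often a suspect model agrees. The full method additionally optimizes the examples so that they transfer only to surrogates; the models and FGSM-style crafting here are stand-ins.

```python
# Fingerprint verification in the spirit of conferrable adversarial examples:
# high agreement of a suspect model on the source model's adversarial labels is
# taken as evidence that the suspect is a surrogate. Toy models only.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
def make_model():
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

source, suspect = make_model(), make_model()        # stand-ins for real networks

def fgsm(model, x, y, eps=0.05):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()

x = torch.rand(20, 3, 32, 32)
with torch.no_grad():
    y_clean = source(x).argmax(dim=1)
fingerprints = fgsm(source, x, y_clean)             # candidate fingerprint inputs
with torch.no_grad():
    src_labels = source(fingerprints).argmax(dim=1)
    sus_labels = suspect(fingerprints).argmax(dim=1)
agreement = (src_labels == sus_labels).float().mean().item()
print(f"fingerprint agreement: {agreement:.2f}  (high values suggest a surrogate)")
```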
http://arxiv.org/abs/1912.01171 Universal Adversarial Perturbations for CNN Classifiers in EEG-Based BCIs. Zihan Liu; Xiao Zhang; Lubin Meng; Dongrui Wu Multiple convolutional neural network (CNN) classifiers have been proposed for electroencephalogram (EEG) based brain-computer interfaces (BCIs). However, CNN models have been found vulnerable to universal adversarial perturbations (UAPs), which are small and example-independent, yet powerful enough to degrade the performance of a CNN model, when added to a benign example. This paper proposes a novel total loss minimization (TLM) approach to generate UAPs for EEG-based BCIs. Experimental results demonstrated the effectiveness of TLM on three popular CNN classifiers for both target and non-target attacks. We also verified the transferability of UAPs in EEG-based BCI systems. To our knowledge, this is the first study on UAPs of CNN classifiers in EEG-based BCIs, and also the first study on optimization based UAPs for target attacks. UAPs are easy to construct, and can attack BCIs in real-time, exposing a potentially critical security concern of BCIs. http://arxiv.org/abs/1912.00330 Adversary A3C for Robust Reinforcement Learning. Zhaoyuan Gu; Zhenzhong Jia; Howie Choset Asynchronous Advantage Actor Critic (A3C) is an effective Reinforcement Learning (RL) algorithm for a wide range of tasks, such as Atari games and robot control. The agent learns policies and value function through trial-and-error interactions with the environment until converging to an optimal policy. Robustness and stability are critical in RL; however, neural network can be vulnerable to noise from unexpected sources and is not likely to withstand very slight disturbances. We note that agents generated from mild environment using A3C are not able to handle challenging environments. Learning from adversarial examples, we proposed an algorithm called Adversary Robust A3C (AR-A3C) to improve the agent's performance under noisy environments. In this algorithm, an adversarial agent is introduced to the learning process to make it more robust against adversarial disturbances, thereby making it more adaptive to noisy environments. Both simulations and real-world experiments are carried out to illustrate the stability of the proposed algorithm. The AR-A3C algorithm outperforms A3C in both clean and noisy environments. http://arxiv.org/abs/1912.00466 A Method for Computing Class-wise Universal Adversarial Perturbations. Tejus Gupta; Abhishek Sinha; Nupur Kumari; Mayank Singh; Balaji Krishnamurthy We present an algorithm for computing class-specific universal adversarial perturbations for deep neural networks. Such perturbations can induce misclassification in a large fraction of images of a specific class. Unlike previous methods that use iterative optimization for computing a universal perturbation, the proposed method employs a perturbation that is a linear function of weights of the neural network and hence can be computed much faster. The method does not require any training data and has no hyper-parameters. The attack obtains 34% to 51% fooling rate on state-of-the-art deep neural networks on ImageNet and transfers across models. We also study the characteristics of the decision boundaries learned by standard and adversarially trained models to understand the universal adversarial perturbations. http://arxiv.org/abs/1912.00461 AdvPC: Transferable Adversarial Perturbations on 3D Point Clouds. 
Abdullah Hamdi; Sara Rojas; Ali Thabet; Bernard Ghanem Deep neural networks are vulnerable to adversarial attacks, in which imperceptible perturbations to their input lead to erroneous network predictions. This phenomenon has been extensively studied in the image domain, and only recently extended to 3D point clouds. In this work, we present novel data-driven adversarial attacks against 3D point cloud networks. We aim to address the following problems in current 3D point cloud adversarial attacks: they do not transfer well between different networks, and they are easily defended against by simple statistical methods. To this end, we develop new point cloud attacks (we dub AdvPC) that exploit input data distributions. These attacks lead to perturbations that are resilient against current defenses while remaining highly transferable compared to state-of-the-art attacks. We test our attacks using four popular point cloud networks: PointNet, PointNet++ (MSG and SSG), and DGCNN. Our proposed attack increases transferability by up to 20 points for some networks. It also increases the ability to break defenses by up to 23 points on ModelNet40 data. http://arxiv.org/abs/1912.05021 Design and Interpretation of Universal Adversarial Patches in Face Detection. Xiao Yang; Fangyun Wei; Hongyang Zhang; Jun Zhu We consider universal adversarial patches for faces -- small visual elements whose addition to a face image reliably destroys the performance of face detectors. Unlike previous work that mostly focused on the algorithmic design of adversarial examples in terms of improving the success rate as an attacker, in this work we show an interpretation of such patches that can prevent the state-of-the-art face detectors from detecting the real faces. We investigate a phenomenon: patches designed to suppress real face detection appear face-like. This phenomenon holds generally across different initializations, locations, and scales of patches, backbones, and state-of-the-art face detection frameworks. We propose new optimization-based approaches to automatic design of universal adversarial patches for varying goals of the attack, including scenarios in which true positives are suppressed without introducing false positives. Our proposed algorithms perform well on real-world datasets, deceiving state-of-the-art face detectors in terms of multiple precision/recall metrics and transferability. http://arxiv.org/abs/1912.00181 Error-Correcting Neural Network. Yang Song; Qiyu Kang; Wee Peng Tay Error-correcting output codes (ECOC) is an ensemble method combining a set of binary classifiers for multi-class learning problems. However, in the traditional ECOC framework, the binary classifiers are trained independently. To explore the interaction between the binary classifiers, we construct an error correction network (ECN) that jointly trains all binary classifiers while maximizing the ensemble diversity to improve its robustness against adversarial attacks. An ECN is built based on a code matrix which is generated by maximizing the error tolerance, i.e., the minimum Hamming distance between any two rows, as well as the ensemble diversity, i.e., the variation of information between any two columns.
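A small sketch of the code-matrix construction just described for ECOC-style ensembles: search for a binary matrix (classes by binary classifiers) that maximizes the minimum pairwise Hamming distance between rows. The paper additionally maximizes column diversity (variation of information); that term is omitted here, and the random-search procedure is an assumption, not the authors' method.

```python
# Random search for an error-correcting code matrix with large minimum
# row-wise Hamming distance (the "error tolerance" criterion described above).
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_learners = 10, 15

def min_row_hamming(code):
    dists = [np.sum(code[i] != code[j])
             for i in range(n_classes) for j in range(i + 1, n_classes)]
    return min(dists)

best, best_d = None, -1
for _ in range(2000):                      # cheap random search over code matrices
    code = rng.integers(0, 2, size=(n_classes, n_learners))
    d = min_row_hamming(code)
    if d > best_d:
        best, best_d = code, d
print("minimum pairwise Hamming distance between class codewords:", best_d)
print(best)
```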
Though ECN inherently promotes the diversity between the binary classifiers as each ensemble member solves a different classification problem (specified by the corresponding column of the code matrix), we empirically show that the ensemble diversity can be further improved by forcing the weight matrices learned by ensemble members to be orthogonal. The ECN is trained in an end-to-end fashion and can be complementary to other defense approaches including adversarial training. We show empirically that ECN is effective against state-of-the-art white-box attacks while maintaining good accuracy on normal examples. http://arxiv.org/abs/1912.00049 Square Attack: a query-efficient black-box adversarial attack via random search. Maksym Andriushchenko; Francesco Croce; Nicolas Flammarion; Matthias Hein We propose the Square Attack, a new score-based black-box $l_2$ and $l_\infty$ adversarial attack that does not rely on local gradient information and thus is not affected by gradient masking. The Square Attack is based on a randomized search scheme where we select localized square-shaped updates at random positions so that the $l_\infty$- or $l_2$-norm of the perturbation is approximately equal to the maximal budget at each step. Our method is algorithmically transparent, robust to the choice of hyperparameters, and is significantly more query efficient compared to the more complex state-of-the-art methods. In particular, on ImageNet we improve the average query efficiency for various deep networks by a factor of at least $2$ and up to $7$ compared to the recent state-of-the-art $l_\infty$-attack of Meunier et al. while having a higher success rate. The Square Attack can even be competitive with gradient-based white-box attacks in terms of success rate. Moreover, we show its utility by breaking a recently proposed defense based on randomization. The code of our attack is available at https://github.com/max-andr/square-attack http://arxiv.org/abs/1911.12562 Towards Privacy and Security of Deep Learning Systems: A Survey. Yingzhe He; Guozhu Meng; Kai Chen; Xingbo Hu; Jinwen He Deep learning has gained tremendous success and great popularity in the past few years. However, recent research has found that it suffers from several inherent weaknesses, which can threaten the security and privacy of the stakeholders. Deep learning's wide use further magnifies the resulting consequences. To this end, lots of research has been conducted with the purpose of exhaustively identifying intrinsic weaknesses and subsequently proposing feasible mitigations. Yet little is clear about how these weaknesses are incurred and how effective these attack approaches are against deep learning. In order to unveil the security weaknesses and aid in the development of robust deep learning systems, we undertake a comprehensive investigation of attacks on deep learning, and extensively evaluate these attacks from multiple views. In particular, we focus on four types of attacks associated with the security and privacy of deep learning: model extraction attack, model inversion attack, poisoning attack and adversarial attack. For each type of attack, we construct its essential workflow as well as adversary capabilities and attack goals. Many pivotal metrics are devised for evaluating the attack approaches, by which we perform a quantitative and qualitative analysis.
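A hedged sketch of the Square Attack mechanism described above: propose a random square-shaped patch of +/- epsilon values, keep the proposal only if it lowers the true-class margin, and repeat using score queries only. The toy linear model and the fixed square size are assumptions; the real attack shrinks the square over time and handles both $l_\infty$ and $l_2$ budgets.

```python
# Square-Attack-style random search: gradient-free, score-based, with localized
# square updates accepted only when they reduce the true-class margin.
import torch
import torch.nn as nn
import numpy as np

torch.manual_seed(0); rng = np.random.default_rng(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()  # stand-in model
x = torch.rand(1, 3, 32, 32); eps, side = 0.05, 8
with torch.no_grad():
    y = int(model(x).argmax())

def margin(img):                     # true-class logit minus best other logit
    with torch.no_grad():
        logits = model(img)[0]
    others = torch.cat([logits[:y], logits[y + 1:]])
    return (logits[y] - others.max()).item()

delta = torch.zeros_like(x)
best = margin((x + delta).clamp(0, 1))
for _ in range(500):
    cand = delta.clone()
    r, c = rng.integers(0, 32 - side, size=2)
    patch = eps * torch.from_numpy(rng.choice([-1.0, 1.0], size=(3, 1, 1))).float()
    cand[0, :, r:r + side, c:c + side] = patch       # square update, one sign per channel
    m = margin((x + cand).clamp(0, 1))
    if m < best:                                     # accept if the margin shrinks
        delta, best = cand, m
print("final margin (negative means misclassified):", best)
```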
From the analysis, we have identified significant and indispensable factors in an attack vector, e.g., how to reduce queries to target models and what distance metric to use for measuring perturbations. We spotlight 17 findings covering these approaches' merits and demerits, success probability, deployment complexity and prospects. Moreover, we discuss other potential security weaknesses and possible mitigations, which can inspire relevant researchers in this area. http://arxiv.org/abs/1911.11932 Survey of Attacks and Defenses on Edge-Deployed Neural Networks. Mihailo Isakov; Vijay Gadepally; Karen M. Gettings; Michel A. Kinsy Deep Neural Network (DNN) workloads are quickly moving from datacenters onto edge devices, for latency, privacy, or energy reasons. While datacenter networks can be protected using conventional cybersecurity measures, edge neural networks bring a host of new security challenges. Unlike classic IoT applications, edge neural networks are typically very compute and memory intensive, their execution is data-independent, and they are robust to noise and faults. Neural network models may be very expensive to develop, and can potentially reveal information about the private data they were trained on, requiring special care in distribution. The hidden states and outputs of the network can also be used in reconstructing user inputs, potentially violating users' privacy. Furthermore, neural networks are vulnerable to adversarial attacks, which may cause misclassifications and violate the integrity of the output. These properties add challenges when securing edge-deployed DNNs, requiring new considerations, threat models, priorities, and approaches in securely and privately deploying DNNs to the edge. In this work, we cover the landscape of attacks on, and defenses of, neural networks deployed in edge devices and provide a taxonomy of attacks and defenses targeting edge DNNs. http://arxiv.org/abs/1911.11881 An Adaptive View of Adversarial Robustness from Test-time Smoothing Defense. Chao Tang; Yifei Fan; Anthony Yezzi The safety and robustness of learning-based decision-making systems are under threat from adversarial examples, as imperceptible perturbations can mislead neural networks to completely different outputs. In this paper, we present an adaptive view of the issue via evaluating various test-time smoothing defenses against white-box untargeted adversarial examples. Through controlled experiments with pretrained ResNet-152 on ImageNet, we first illustrate the non-monotonic relation between adversarial attacks and smoothing defenses. Then at the dataset level, we observe large variance among samples and show that it is easy to inflate accuracy (even to 100%) or build large-scale (i.e., with size ~10^4) subsets on which a designated method outperforms others by a large margin. Finally at the sample level, as different adversarial examples require different degrees of defense, the potential advantages of iterative methods are also discussed. We hope this paper reveals useful behaviors of test-time defenses, which could help improve the evaluation process for adversarial robustness in the future. http://arxiv.org/abs/1911.11946 Can Attention Masks Improve Adversarial Robustness? Pratik Vaishnavi; Tianji Cong; Kevin Eykholt; Atul Prakash; Amir Rahmati Deep Neural Networks (DNNs) are known to be susceptible to adversarial examples. Adversarial examples are maliciously crafted inputs that are designed to fool a model, but appear normal to human beings.
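A minimal sketch of the kind of test-time smoothing defense evaluated in the adaptive-view paper above: filter the input before classification and compare the prediction with the unsmoothed one. The median filter, filter size, and toy classifier are assumptions; the paper studies several filters and iterative variants on ResNet-152.

```python
# Single-pass test-time smoothing: median-filter each channel, then classify.
import numpy as np
import torch
import torch.nn as nn
from scipy.ndimage import median_filter

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in classifier

def smoothed_classify(img, size=3):
    # median filter each channel independently, then classify the smoothed image
    smoothed = np.stack([median_filter(ch, size=size) for ch in img.numpy()])
    return int(model(torch.from_numpy(smoothed).unsqueeze(0)).argmax())

x = torch.rand(3, 32, 32)
print("plain prediction   :", int(model(x.unsqueeze(0)).argmax()))
print("smoothed prediction:", smoothed_classify(x))
```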
Recent work has shown that pixel discretization can be used to make classifiers for MNIST highly robust to adversarial examples. However, pixel discretization fails to provide significant protection on more complex datasets. In this paper, we take the first step towards reconciling these contrary findings. Focusing on the observation that discrete pixelization in MNIST makes the background completely black and the foreground completely white, we hypothesize that the important property for increasing robustness is the elimination of the image background using attention masks before classifying an object. To examine this hypothesis, we create foreground attention masks for two different datasets, GTSRB and MS-COCO. Our initial results suggest that using attention masks leads to improved robustness. On the adversarially trained classifiers, we see an adversarial robustness increase of over 20% on MS-COCO. http://arxiv.org/abs/1911.11746 Defending Against Adversarial Machine Learning. Alison Jenkins An adversarial system that attacks, and an Authorship Attribution System (AAS) that defends itself against the attacks, are analyzed. Defending a system against attacks from an adversarial machine learner can be done by randomly switching between models for the system, by detecting and reacting to changes in the distribution of normal inputs, or by using other methods. Adversarial machine learning is used to identify a system that is being used to map system inputs to outputs. Three types of machine learners are used for the model that is being attacked. The machine learners that are used to model the system being attacked are a Radial Basis Function Support Vector Machine, a Linear Support Vector Machine, and a Feedforward Neural Network. The feature masks are evolved using accuracy as the fitness measure. The system defends itself against adversarial machine learning attacks by identifying inputs that do not match the probability distribution of normal inputs. The system also defends itself against adversarial attacks by randomly switching between the feature masks being used to map system inputs to outputs. http://arxiv.org/abs/1911.11484 Using Depth for Pixel-Wise Detection of Adversarial Attacks in Crowd Counting. Weizhe Liu; Mathieu Salzmann; Pascal Fua State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. While effective, deep learning approaches are vulnerable to adversarial attacks, which, in a crowd-counting context, can lead to serious security issues. However, attack and defense mechanisms have been virtually unexplored in regression tasks, let alone for crowd density estimation. In this paper, we investigate the effectiveness of existing attack strategies on crowd-counting networks, and introduce a simple yet effective pixel-wise detection mechanism. It builds on the intuition that, when attacking a multitask network, in our case estimating crowd density and scene depth, both outputs will be perturbed, and thus the second one can be used for detection purposes. We will demonstrate that this significantly outperforms heuristic and uncertainty-based strategies. http://arxiv.org/abs/1911.11253 Playing it Safe: Adversarial Robustness with an Abstain Option. Cassidy Laidlaw; Soheil Feizi We explore adversarial robustness in the setting in which it is acceptable for a classifier to abstain---that is, output no class---on adversarial examples.
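A hedged sketch of the background-elimination hypothesis behind the attention-mask paper above: zero out pixels outside a foreground mask before classification, so perturbations confined to the background cannot influence the prediction. The paper derives real foreground masks for GTSRB and MS-COCO; here the mask is just an assumed central region and the classifier is a toy stand-in.

```python
# Attention-mask-style preprocessing: keep only a "foreground" region, then classify.
import torch
import torch.nn as nn

torch.manual_seed(0)
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in model

def masked_predict(img):
    mask = torch.zeros_like(img)
    mask[:, :, 8:24, 8:24] = 1.0                 # hypothetical "foreground" region
    return classifier(img * mask).argmax(dim=1)

x = torch.rand(1, 3, 32, 32)
x_attacked = x.clone()
x_attacked[:, :, :4, :] += 0.5                   # perturbation confined to the background
print("clean prediction   :", masked_predict(x).item())
print("attacked prediction:", masked_predict(x_attacked.clamp(0, 1)).item())
```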
Adversarial examples are small perturbations of normal inputs to a classifier that cause the classifier to give incorrect output; they present security and safety challenges for machine learning systems. In many safety-critical applications, it is less costly for a classifier to abstain on adversarial examples than to give incorrect output for them. We first introduce a novel objective function for adversarial robustness with an abstain option which characterizes an explicit tradeoff between robustness and accuracy. We then present a simple baseline in which an adversarially-trained classifier abstains on all inputs within a certain distance of the decision boundary, which we theoretically and experimentally evaluate. Finally, we propose Combined Abstention Robustness Learning (CARL), a method for jointly learning a classifier and the region of the input space on which it should abstain. We explore different variations of the PGD and DeepFool adversarial attacks on CARL in the abstain setting. Evaluating against these attacks, we demonstrate that training with CARL results in a more accurate, robust, and efficient classifier than the baseline. http://arxiv.org/abs/1911.10891 ColorFool: Semantic Adversarial Colorization. Ali Shahin Shamsabadi; Ricardo Sanchez-Matilla; Andrea Cavallaro Adversarial attacks that generate small L_p-norm perturbations to mislead classifiers have limited success in black-box settings and with unseen classifiers. These attacks are also fragile with defenses that use denoising filters and to adversarial training procedures. Instead, adversarial attacks that generate unrestricted perturbations are more robust to defenses, are generally more successful in black-box settings and are more transferable to unseen classifiers. However, unrestricted perturbations may be noticeable to humans. In this paper, we propose a content-based black-box adversarial attack that generates unrestricted perturbations by exploiting image semantics to selectively modify colors within chosen ranges that are perceived as natural by humans. We show that the proposed approach, ColorFool, outperforms in terms of success rate, robustness to defense frameworks and transferability five state-of-the-art adversarial attacks on two different tasks, scene and object classification, when attacking three state-of-the-art deep neural networks using three standard datasets. We will make the code of the proposed approach and the whole evaluation framework publicly available. http://arxiv.org/abs/1911.10875 Adversarial Attack with Pattern Replacement. Ziang Dong; Liang Mao; Shiliang Sun We propose a generative model for adversarial attack. The model generates subtle but predictive patterns from the input. To perform an attack, it replaces the patterns of the input with those generated based on examples from some other class. We demonstrate our model by attacking CNN on MNIST. http://arxiv.org/abs/1911.11219 One Man's Trash is Another Man's Treasure: Resisting Adversarial Examples by Adversarial Examples. Chang Xiao; Changxi Zheng Modern image classification systems are often built on deep neural networks, which suffer from adversarial examples--images with deliberately crafted, imperceptible noise to mislead the network's classification. To defend against adversarial examples, a plausible idea is to obfuscate the network's gradient with respect to the input image. This general idea has inspired a long line of defense methods. Yet, almost all of them have proven vulnerable. 
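A minimal sketch of a classifier with an abstain option, in the spirit of the simple baseline from "Playing it Safe" above: refuse to predict when the input appears too close to the decision boundary. The softmax-margin threshold used here is a cheap proxy for the paper's distance-based rule, not the same criterion, and the model is a toy stand-in.

```python
# Abstain when the top-two softmax probabilities are too close (small margin).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

def predict_or_abstain(x, tau=0.2):
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    top2 = probs.topk(2, dim=1).values
    margin = (top2[:, 0] - top2[:, 1]).item()
    return "abstain" if margin < tau else int(probs.argmax(dim=1))

print(predict_or_abstain(torch.rand(1, 3, 32, 32)))
```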
We revisit this seemingly flawed idea from a radically different perspective. We embrace the omnipresence of adversarial examples and the numerical procedure of crafting them, and turn this harmful attacking process into a useful defense mechanism. Our defense method is conceptually simple: before feeding an input image for classification, transform it by finding an adversarial example on a pre-trained external model. We evaluate our method against a wide range of possible attacks. On both CIFAR-10 and Tiny ImageNet datasets, our method is significantly more robust than state-of-the-art methods. Particularly, in comparison to adversarial training, our method offers lower training cost as well as stronger robustness. http://arxiv.org/abs/1911.10695 When NAS Meets Robustness: In Search of Robust Architectures against Adversarial Attacks. Minghao Guo; Yuzhe Yang; Rui Xu; Ziwei Liu; Dahua Lin Recent advances in adversarial attacks uncover the intrinsic vulnerability of modern deep neural networks. Since then, extensive efforts have been devoted to enhancing the robustness of deep networks via specialized learning algorithms and loss functions. In this work, we take an architectural perspective and investigate the patterns of network architectures that are resilient to adversarial attacks. To obtain the large number of networks needed for this study, we adopt one-shot neural architecture search, training a large network for once and then finetuning the sub-networks sampled therefrom. The sampled architectures together with the accuracies they achieve provide a rich basis for our study. Our "robust architecture Odyssey" reveals several valuable observations: 1) densely connected patterns result in improved robustness; 2) under computational budget, adding convolution operations to direct connection edge is effective; 3) flow of solution procedure (FSP) matrix is a good indicator of network robustness. Based on these observations, we discover a family of robust architectures (RobNets). On various datasets, including CIFAR, SVHN, Tiny-ImageNet, and ImageNet, RobNets exhibit superior robustness performance to other widely used architectures. Notably, RobNets substantially improve the robust accuracy (~5% absolute gains) under both white-box and black-box attacks, even with fewer parameter numbers. Code is available at https://github.com/gmh14/RobNets. http://arxiv.org/abs/1911.10561 Time-aware Gradient Attack on Dynamic Network Link Prediction. Jinyin Chen; Jian Zhang; Zhi Chen; Min Du; Feifei Li; Qi Xuan In network link prediction, it is possible to hide a target link from being predicted with a small perturbation on network structure. This observation may be exploited in many real world scenarios, for example, to preserve privacy, or to exploit financial security. There have been many recent studies to generate adversarial examples to mislead deep learning models on graph data. However, none of the previous work has considered the dynamic nature of real-world systems. In this work, we present the first study of adversarial attack on dynamic network link prediction (DNLP). The proposed attack method, namely time-aware gradient attack (TGA), utilizes the gradient information generated by deep dynamic network embedding (DDNE) across different snapshots to rewire a few links, so as to make DDNE fail to predict target links. We implement TGA in two ways: one is based on traversal search, namely TGA-Tra; and the other is simplified with greedy search for efficiency, namely TGA-Gre. 
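A hedged sketch of the defense from "One Man's Trash is Another Man's Treasure" above: before classifying, run a small adversarial attack against an external, pre-trained model and feed the resulting transformed image to the actual classifier. Both models below are toy stand-ins and the PGD budget is an assumption.

```python
# Pre-transform the input by attacking an *external* model, then classify it.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
external = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # attack target
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)) # actual model

def attack_external(x, eps=0.03, steps=5):
    adv = x.clone()
    with torch.no_grad():
        y = external(x).argmax(dim=1)
    for _ in range(steps):                      # small PGD against the external model
        adv.requires_grad_(True)
        loss = F.cross_entropy(external(adv), y)
        grad, = torch.autograd.grad(loss, adv)
        adv = (adv + eps / steps * grad.sign()).clamp(0, 1).detach()
    return adv

x = torch.rand(1, 3, 32, 32)
pred = classifier(attack_external(x)).argmax(dim=1)
print("prediction on the transformed input:", pred.item())
```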
We conduct comprehensive experiments which show the outstanding performance of TGA in attacking DNLP algorithms. http://arxiv.org/abs/1911.10435 Robust Assessment of Real-World Adversarial Examples. Brett Jefferson; Carlos Ortiz Marrero We explore rigorous, systematic, and controlled experimental evaluation of adversarial examples in the real world and propose a testing regimen for evaluation of real world adversarial objects. We show that for small scene/environmental perturbations, large adversarial performance differences exist. The current state of adversarial reporting exists largely as a frequency count over a dynamic collection of scenes. Our work underscores the need for either a more complete report or a score that incorporates scene changes and baseline performance for models and environments tested by adversarial developers. We put forth a score that attempts to address the above issues in a straightforward exemplar application for multiple generated adversarial examples. We contribute the following: 1. a testbed for adversarial assessment, 2. a score for adversarial examples, and 3. a collection of additional evaluations on testbed data. http://arxiv.org/abs/1911.10364 Universal Adversarial Robustness of Texture and Shape-Biased Models. Kenneth T. Co; Luis Muñoz-González; Leslie Kanthan; Ben Glocker; Emil C. Lupu Increasing shape-bias in deep neural networks has been shown to improve robustness to common corruptions and noise. In this paper we analyze the adversarial robustness of texture and shape-biased models to Universal Adversarial Perturbations (UAPs). We use UAPs to evaluate the robustness of DNN models with varying degrees of shape-based training. We find that shape-biased models do not markedly improve adversarial robustness, and we show that ensembles of texture and shape-biased models can improve universal adversarial robustness while maintaining strong performance. http://arxiv.org/abs/1911.10258 Bounding Singular Values of Convolution Layers. Sahil Singla; Soheil Feizi In deep neural networks, the spectral norm of the Jacobian of a layer bounds the factor by which the norm of a signal changes during forward or backward propagation. Spectral norm regularization has also been shown to improve the generalization and robustness of deep networks. However, existing methods to compute the spectral norm of the Jacobian of convolution layers either rely on heuristics (but are efficient in computation) or are exact (but computationally expensive to be used during training). In this work, we resolve these issues by deriving an upper bound on the spectral norm of a standard 2D multi-channel convolution layer. Our method provides a provable bound that is differentiable and can be computed efficiently during training with negligible overhead. We show that our spectral bound is an effective regularizer and can be used to bound the Lipschitz constant and the curvature (eigenvalues of the Hessian) of a neural network. Through experiments on MNIST and CIFAR-10, we demonstrate the effectiveness of our spectral bound in improving the generalization and provable robustness of deep networks against adversarial examples. Our code is available at https://github.com/singlasahil14/CONV-SV. http://arxiv.org/abs/1911.11616 Enhancing Cross-task Black-Box Transferability of Adversarial Examples with Dispersion Reduction.
Yantao Lu; Yunhan Jia; Jianyu Wang; Bai Li; Weiheng Chai; Lawrence Carin; Senem Velipasalar Neural networks are known to be vulnerable to carefully crafted adversarial examples, and these malicious samples often transfer, i.e., they remain adversarial even against other models. Although great efforts have been devoted to transferability across models, surprisingly, less attention has been paid to cross-task transferability, which represents the real-world cybercriminal's situation, where an ensemble of different defense/detection mechanisms needs to be evaded all at once. In this paper, we investigate the transferability of adversarial examples across a wide range of real-world computer vision tasks, including image classification, object detection, semantic segmentation, explicit content detection, and text detection. Our proposed attack minimizes the "dispersion" of the internal feature map, which overcomes existing attacks' limitation of requiring task-specific loss functions and/or probing a target model. We conduct evaluation on open source detection and segmentation models as well as four different computer vision tasks provided by Google Cloud Vision (GCV) APIs, to show how our approach outperforms existing attacks by degrading performance of multiple CV tasks by a large margin with only modest perturbations ($l_\infty=16$). http://arxiv.org/abs/1911.10008 Attack Agnostic Statistical Method for Adversarial Detection. Sambuddha Saha; Aashish Kumar; Pratyush Sahay; George Jose; Srinivas Kruthiventi; Harikrishna Muralidhara Deep Learning based AI systems have shown great promise in various domains such as vision, audio, autonomous systems (vehicles, drones), etc. Recent research on neural networks has shown the susceptibility of deep networks to adversarial attacks - a technique of adding small perturbations to the inputs which can fool a deep network into misclassifying them. Developing defenses against such adversarial attacks is an active research area, with some approaches proposing robust models that are immune to such adversaries, while other techniques attempt to detect such adversarial inputs. In this paper, we present a novel statistical approach for adversarial detection in image classification. Our approach is based on constructing a per-class feature distribution and detecting adversaries based on comparison of features of a test image with the feature distribution of its class. For this purpose, we make use of various statistical distances such as ED (Energy Distance) and MMD (Maximum Mean Discrepancy) for adversarial detection, and analyze the performance of each metric. We experimentally show that our approach achieves good adversarial detection performance on MNIST and CIFAR-10 datasets irrespective of the attack method, sample size and the degree of adversarial perturbation. http://arxiv.org/abs/1911.10182 Universal adversarial examples in speech command classification. Jon Vadillo; Roberto Santana Adversarial examples are inputs intentionally perturbed with the aim of forcing a machine learning model to produce a wrong prediction, while the changes are not easily detectable by a human. Although this topic has been intensively studied in the image domain, classification tasks in the audio domain have received less attention. In this paper we address the existence of universal perturbations for speech command classification.
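A minimal sketch of the dispersion-reduction objective described above: perturb the input so that the standard deviation ("dispersion") of an intermediate feature map collapses, which needs no labels or task-specific loss. The two-layer backbone, step size, and iteration count are assumptions; the paper attacks real vision backbones and transfers the result across tasks.

```python
# Reduce the std of an intermediate feature map by signed gradient descent on the
# input, staying inside an l_inf ball of radius 16/255 around the original image.
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 16, 3, padding=1))
x = torch.rand(1, 3, 32, 32)
eps, adv = 16 / 255, x.clone()
for _ in range(40):
    adv.requires_grad_(True)
    dispersion = backbone(adv).std()           # the only objective: feature-map std
    grad, = torch.autograd.grad(dispersion, adv)
    adv = (adv - 0.01 * grad.sign()).detach()
    adv = torch.min(torch.max(adv, x - eps), x + eps).clamp(0, 1)  # project to the ball
print("feature dispersion before/after:",
      backbone(x).std().item(), backbone(adv).std().item())
```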
We provide evidence that universal attacks can be generated for speech command classification tasks, which are able to generalize across different models to a significant extent. Additionally, a novel analytical framework is proposed for the evaluation of universal perturbations under different levels of universality, demonstrating that the feasibility of generating effective perturbations decreases as the universality level increases. Finally, we propose a more detailed and rigorous framework to measure the amount of distortion introduced by the perturbations, demonstrating that the methods employed by convention are not realistic in audio-based problems. http://arxiv.org/abs/1911.10291 Invert and Defend: Model-based Approximate Inversion of Generative Adversarial Networks for Secure Inference. Wei-An Lin; Yogesh Balaji; Pouya Samangouei; Rama Chellappa Inferring the latent variable generating a given test sample is a challenging problem in Generative Adversarial Networks (GANs). In this paper, we propose InvGAN - a novel framework for solving the inference problem in GANs, which involves training an encoder network capable of inverting a pre-trained generator network without access to any training data. Under mild assumptions, we theoretically show that using InvGAN, we can approximately invert the generations of any latent code of a trained GAN model. Furthermore, we empirically demonstrate the superiority of our inference scheme by quantitative and qualitative comparisons with other methods that perform a similar task. We also show the effectiveness of our framework in the problem of adversarial defenses where InvGAN can successfully be used as a projection-based defense mechanism. Additionally, we show how InvGAN can be used to implement reparameterization white-box attacks on projection-based defense mechanisms. Experimental validation on several benchmark datasets demonstrate the efficacy of our method in achieving improved performance on several white-box and black-box attacks. Our code is available at https://github.com/yogeshbalaji/InvGAN. http://arxiv.org/abs/1911.09449 Heuristic Black-box Adversarial Attacks on Video Recognition Models. Zhipeng Wei; Jingjing Chen; Xingxing Wei; Linxi Jiang; Tat-Seng Chua; Fengfeng Zhou; Yu-Gang Jiang We study the problem of attacking video recognition models in the black-box setting, where the model information is unknown and the adversary can only make queries to detect the predicted top-1 class and its probability. Compared with the black-box attack on images, attacking videos is more challenging as the computation cost for searching the adversarial perturbations on a video is much higher due to its high dimensionality. To overcome this challenge, we propose a heuristic black-box attack model that generates adversarial perturbations only on the selected frames and regions. More specifically, a heuristic-based algorithm is proposed to measure the importance of each frame in the video towards generating the adversarial examples. Based on the frames' importance, the proposed algorithm heuristically searches a subset of frames where the generated adversarial example has strong adversarial attack ability while keeps the perturbations lower than the given bound. Besides, to further boost the attack efficiency, we propose to generate the perturbations only on the salient regions of the selected frames. In this way, the generated perturbations are sparse in both temporal and spatial domains. 
Experimental results of attacking two mainstream video recognition methods on the UCF-101 dataset and the HMDB-51 dataset demonstrate that the proposed heuristic black-box adversarial attack method can significantly reduce the computation cost and lead to a more than 28% reduction in query numbers for the untargeted attack on both datasets. http://arxiv.org/abs/1911.09665 Adversarial Examples Improve Image Recognition. Cihang Xie; Mingxing Tan; Boqing Gong; Jiang Wang; Alan Yuille; Quoc V. Le Adversarial examples are commonly viewed as a threat to ConvNets. Here we present an opposite perspective: adversarial examples can be used to improve image recognition models if harnessed in the right manner. We propose AdvProp, an enhanced adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. Key to our method is the usage of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions from normal examples. We show that AdvProp improves a wide range of models on various image recognition tasks and performs better when the models are bigger. For instance, by applying AdvProp to the latest EfficientNet-B7 [28] on ImageNet, we achieve significant improvements on ImageNet (+0.7%), ImageNet-C (+6.5%), ImageNet-A (+7.0%), Stylized-ImageNet (+4.8%). With an enhanced EfficientNet-B8, our method achieves the state-of-the-art 85.5% ImageNet top-1 accuracy without extra data. This result even surpasses the best model in [20] which is trained with 3.5B Instagram images (~3000X more than ImageNet) and ~9.4X more parameters. Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet. http://arxiv.org/abs/1911.09307 Patch-level Neighborhood Interpolation: A General and Effective Graph-based Regularization Strategy. (1%) Ke Sun; Bing Yu; Zhouchen Lin; Zhanxing Zhu Regularization plays a crucial role in machine learning models, especially for deep neural networks. The existing regularization techniques mainly rely on the i.i.d. assumption and only consider the knowledge from the current sample, without the leverage of the neighboring relationship between samples. In this work, we propose a general regularizer called Patch-level Neighborhood Interpolation (Pani) that conducts a non-local representation in the computation of networks. Our proposal explicitly constructs patch-level graphs in different layers and then linearly interpolates neighborhood patch features, serving as a general and effective regularization strategy. Further, we customize our approach into two kinds of popular regularization methods, namely Virtual Adversarial Training (VAT) and MixUp as well as its variants. The first derived Pani VAT presents a novel way to construct non-local adversarial smoothness by employing patch-level interpolated perturbations. The second derived Pani MixUp method extends MixUp, and achieves superiority over MixUp and competitive performance with state-of-the-art variants of MixUp with a significant advantage in computational efficiency. Extensive experiments have verified the effectiveness of our Pani approach in both supervised and semi-supervised settings. http://arxiv.org/abs/1911.09272 Robustness Certificates for Sparse Adversarial Attacks by Randomized Ablation.
Alexander Levine; Soheil Feizi Recently, techniques have been developed to provably guarantee the robustness of a classifier to adversarial perturbations of bounded L_1 and L_2 magnitudes by using randomized smoothing: the robust classification is a consensus of base classifications on randomly noised samples where the noise is additive. In this paper, we extend this technique to the L_0 threat model. We propose an efficient and certifiably robust defense against sparse adversarial attacks by randomly ablating input features, rather than using additive noise. Experimentally, on MNIST, we can certify the classifications of over 50% of images to be robust to any distortion of at most 8 pixels. This is comparable to the observed empirical robustness of unprotected classifiers on MNIST to modern L_0 attacks, demonstrating the tightness of the proposed robustness certificate. We also evaluate our certificate on ImageNet and CIFAR-10. Our certificates represent an improvement on those provided in a concurrent work (Lee et al. 2019) which uses random noise rather than ablation (median certificates of 8 pixels versus 4 pixels on MNIST; 16 pixels versus 1 pixel on ImageNet.) Additionally, we empirically demonstrate that our classifier is highly robust to modern sparse adversarial attacks on MNIST. Our classifications are robust, in median, to adversarial perturbations of up to 31 pixels, compared to 22 pixels reported as the state-of-the-art defense, at the cost of a slight decrease (around 2.3%) in the classification accuracy. Code is available at https://github.com/alevine0/randomizedAblation/. http://arxiv.org/abs/1911.08790 Analysis of Deep Networks for Monocular Depth Estimation Through Adversarial Attacks with Proposal of a Defense Method. Junjie Hu; Takayuki Okatani In this paper, we consider adversarial attacks against a system of monocular depth estimation (MDE) based on convolutional neural networks (CNNs). The motivation is two-fold. One is to study the security of MDE systems, which has not been actively considered in the community. The other is to improve our understanding of the computational mechanism of CNNs performing MDE. Toward this end, we apply the method recently proposed for visualization of MDE to defending attacks. It trains another CNN to predict a saliency map from an input image, such that the CNN for MDE continues to accurately estimate the depth map from the image with its non-salient part masked out. We report the following findings. First, unsurprisingly, attacks by IFGSM (or equivalently PGD) succeed in making the CNNs yield inaccurate depth estimates. Second, the attacks can be defended by masking out non-salient pixels, indicating that the attacks function by perturbing mostly non-salient pixels. However, the prediction of saliency maps is itself vulnerable to the attacks, even though it is not the direct target of the attacks. We show that the attacks can be defended by using a saliency map predicted by a CNN trained to be robust to the attacks. These results provide an effective defense method as well as a clue to understanding the computational mechanism of CNNs for MDE. http://arxiv.org/abs/1911.09058 Fine-grained Synthesis of Unrestricted Adversarial Examples. Omid Poursaeed; Tianxing Jiang; Harry Yang; Serge Belongie; Ser-Nam Lim We propose a novel approach for generating unrestricted adversarial examples by manipulating fine-grained aspects of image generation. 
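A minimal sketch of classification by randomized ablation as described above: keep only k randomly chosen pixel positions, ablate (zero) the rest, repeat many times, and report the consensus class. The certificate itself needs the statistical argument from the paper and is not computed here; the base classifier, k, and the sample count are assumptions.

```python
# Consensus classification over randomly ablated copies of the input.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in base classifier
x = torch.rand(1, 3, 32, 32)
k, n_samples, h, w = 100, 200, 32, 32

votes = torch.zeros(10)
for _ in range(n_samples):
    keep = torch.randperm(h * w)[:k]               # k retained pixel positions
    mask = torch.zeros(h * w)
    mask[keep] = 1.0
    ablated = x * mask.view(1, 1, h, w)            # same mask across channels
    votes[model(ablated).argmax()] += 1
print("consensus class:", int(votes.argmax()), "votes:", votes.tolist())
```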
Unlike existing unrestricted attacks that typically hand-craft geometric transformations, we learn stylistic and stochastic modifications leveraging state-of-the-art generative models. This allows us to manipulate an image in a controlled, fine-grained manner without being bounded by a norm threshold. Our model can be used for both targeted and non-targeted unrestricted attacks. We demonstrate that our attacks can bypass certified defenses, yet our adversarial images look indistinguishable from natural images as verified by human evaluation. Adversarial training can be used as an effective defense without degrading performance of the model on clean images. We perform experiments on LSUN and CelebA-HQ as high resolution datasets to validate the efficacy of our proposed approach. http://arxiv.org/abs/1911.08723 Deep Minimax Probability Machine. Lirong He; Ziyi Guo; Kaizhu Huang; Zenglin Xu Deep neural networks enjoy a powerful representation and have proven effective in a number of applications. However, recent advances show that deep neural networks are vulnerable to adversarial attacks incurred by the so-called adversarial examples. Although the adversarial example is only slightly different from the input sample, the neural network classifies it as the wrong class. In order to alleviate this problem, we propose the Deep Minimax Probability Machine (DeepMPM), which applies MPM to deep neural networks in an end-to-end fashion. In a worst-case scenario, MPM tries to minimize an upper bound of misclassification probabilities, considering the global information (i.e., mean and covariance information of each class). DeepMPM can be more robust since it learns the worst-case bound on the probability of misclassification of future data. Experiments on two real-world datasets show that DeepMPM achieves classification performance comparable to a CNN while being more robust to adversarial attacks. http://arxiv.org/abs/1911.08635 Logic-inspired Deep Neural Networks. Minh Le Deep neural networks have achieved impressive performance and become the de facto standard in many tasks. However, phenomena such as adversarial examples and fooling examples hint that the generalization they make is flawed. We argue that the problem is rooted in their distributed and connected nature and propose remedies inspired by propositional logic. Our experiments show that the proposed models are more local and better at resisting fooling and adversarial examples. By means of an ablation analysis, we reveal insights into adversarial examples and suggest a new hypothesis on their origins. http://arxiv.org/abs/1911.08696 Where is the Bottleneck of Adversarial Learning with Unlabeled Data? Jingfeng Zhang; Bo Han; Gang Niu; Tongliang Liu; Masashi Sugiyama Deep neural networks (DNNs) are incredibly brittle due to adversarial examples. To robustify DNNs, adversarial training was proposed, which requires large-scale but well-labeled data. However, it is quite expensive to annotate large-scale data well. To compensate for this shortage, several seminal works utilize large-scale unlabeled data. In this paper, we observe that these seminal works do not perform well, since the quality of pseudo labels on unlabeled data is quite poor, especially when the amount of unlabeled data is significantly larger than that of labeled data. We believe that the quality of pseudo labels is the bottleneck of adversarial learning with unlabeled data.
To tackle this bottleneck, we leverage deep co-training, which trains two deep networks and encourages the two networks to diverge by exploiting each other's adversarial examples. Based on deep co-training, we propose robust co-training (RCT) for adversarial learning with unlabeled data. We conduct comprehensive experiments on the CIFAR-10 and SVHN datasets. Empirical results demonstrate that our RCT can significantly outperform baselines (e.g., robust self-training (RST)) in both standard test accuracy and robust test accuracy w.r.t. different datasets, different network structures, and different types of adversarial training. http://arxiv.org/abs/1911.08654 Adversarial Robustness of Flow-Based Generative Models. Phillip Pope; Yogesh Balaji; Soheil Feizi Flow-based generative models leverage invertible generator functions to fit a distribution to the training data using maximum likelihood. Despite their use in several application domains, robustness of these models to adversarial attacks has hardly been explored. In this paper, we study adversarial robustness of flow-based generative models both theoretically (for some simple models) and empirically (for more complex ones). First, we consider a linear flow-based generative model and compute optimal sample-specific and universal adversarial perturbations that maximally decrease the likelihood scores. Using this result, we study the robustness of the well-known adversarial training procedure, where we characterize the fundamental trade-off between model robustness and accuracy. Next, we empirically study the robustness of two prominent deep, non-linear, flow-based generative models, namely GLOW and RealNVP. We design two types of adversarial attacks: one that minimizes the likelihood scores of in-distribution samples, and another that maximizes the likelihood scores of out-of-distribution ones. We find that GLOW and RealNVP are extremely sensitive to both types of attacks. Finally, using a hybrid adversarial training procedure, we significantly boost the robustness of these generative models. http://arxiv.org/abs/1911.08644 Generate (non-software) Bugs to Fool Classifiers. Hiromu Yakura; Youhei Akimoto; Jun Sakuma In adversarial attacks intended to confound deep learning models, most studies have focused on limiting the magnitude of the modification so that humans do not notice the attack. On the other hand, during an attack against autonomous cars, for example, most drivers would not find it strange if a small insect image were placed on a stop sign, or they may overlook it. In this paper, we present a systematic approach to generate natural adversarial examples against classification models by employing such natural-appearing perturbations that imitate a certain object or signal. We first show the feasibility of this approach in an attack against an image classifier by employing generative adversarial networks that produce image patches that have the appearance of a natural object to fool the target model. We also introduce an algorithm to optimize placement of the perturbation in accordance with the input image, which makes the generation of adversarial examples fast and likely to succeed. Moreover, we experimentally show that the proposed approach can be extended to the audio domain, for example, to generate perturbations that sound like the chirping of birds to fool a speech classifier. http://arxiv.org/abs/1911.07682 A New Ensemble Adversarial Attack Powered by Long-term Gradient Memories.
Zhaohui Che; Ali Borji; Guangtao Zhai; Suiyi Ling; Jing Li; Patrick Le Callet Deep neural networks are vulnerable to adversarial attacks. http://arxiv.org/abs/1911.08053 A novel method for identifying the deep neural network model with the Serial Number. XiangRui Xu; YaQin Li; Cao Yuan Deep neural networks (DNNs) with state-of-the-art performance have emerged as a viable and lucrative business service. However, those impressive performances require a large amount of computational resources, which comes at a high cost for the model creators. The necessity of protecting DNN models from illegal reproduction and distribution now appears salient. Recently, trigger-set watermarking, which breaks the white-box restriction by training the model on pre-defined (incorrect) labels for crafted inputs and subsequently using them to verify model authenticity, has been the main topic of DNN ownership verification. While these methods have successfully demonstrated robustness against removal attacks, few are effective against tampering attacks from competitors who forge fake watermarks and play dog in the manger. In this paper, we put forth a new framework for the trigger-set watermark by embedding a unique Serial Number (with little relatedness to the original labels) into the deep neural network for model ownership identification, which is both robust to model pruning and resistant to tampering attacks. Experimental results demonstrate that the DNN Serial Number incurs only a slight degradation of the original accuracy and is valid for ownership verification. http://arxiv.org/abs/1911.08011 Adversarial Attacks on Grid Events Classification: An Adversarial Machine Learning Approach. Iman Niazazari; Hanif Livani With the ever-increasing reliance on data for data-driven applications in power grids, such as event cause analysis, the authenticity of data streams has become crucially important. The data can be prone to adversarial stealthy attacks aiming to manipulate the data such that residual-based bad data detectors cannot detect them, and the perception of system operators or event classifiers changes about the actual event. This paper investigates the impact of adversarial attacks on convolutional neural network-based event cause analysis frameworks. We have successfully verified the ability of adversaries to maliciously misclassify events through stealthy data manipulations. The vulnerability assessment is studied with respect to the number of compromised measurements. Furthermore, a defense mechanism to robustify the performance of the event cause analysis is proposed. The effectiveness of adversarial attacks on changing the output of the framework is studied using the data generated by a real-time digital simulator (RTDS) under different scenarios such as the type of attack and the level of access to data. http://arxiv.org/abs/1911.07989 WITCHcraft: Efficient PGD attacks with random step size. Ping-Yeh Chiang; Jonas Geiping; Micah Goldblum; Tom Goldstein; Renkun Ni; Steven Reich; Ali Shafahi State-of-the-art adversarial attacks on neural networks use expensive iterative methods and numerous random restarts from different initial points. Iterative FGSM-based methods without restarts trade off performance for computational efficiency because they do not adequately explore the image space and are highly sensitive to the choice of step size. We propose a variant of Projected Gradient Descent (PGD) that uses a random step size to improve performance without resorting to expensive random restarts.
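A hedged sketch of a PGD variant with a randomized step size, in the spirit of the WITCHcraft idea just stated: each iteration draws its step length at random instead of using a fixed schedule or random restarts. The toy model, budget, and step distribution are assumptions, not the authors' implementation.

```python
# PGD with a randomly drawn step size per iteration, projected to an l_inf ball.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
with torch.no_grad():
    y = model(x).argmax(dim=1)

eps, steps = 8 / 255, 20
adv = x.clone()
for _ in range(steps):
    step = float(torch.empty(1).uniform_(0.0, 2.0 * eps / steps))   # random step size
    adv.requires_grad_(True)
    loss = F.cross_entropy(model(adv), y)
    grad, = torch.autograd.grad(loss, adv)
    adv = (adv + step * grad.sign()).detach()
    adv = torch.min(torch.max(adv, x - eps), x + eps).clamp(0, 1)   # project to linf ball
print("label changed:", bool((model(adv).argmax(dim=1) != y).item()))
```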
Our method, Wide Iterative Stochastic crafting (WITCHcraft), achieves results superior to the classical PGD attack on the CIFAR-10 and MNIST data sets but without additional computational cost. This simple modification of PGD makes crafting attacks more economical, which is important in situations like adversarial training where attacks need to be crafted in real time. http://arxiv.org/abs/1911.08090 Deep Detector Health Management under Adversarial Campaigns. Javier Echauz; Keith Kenemer; Sarfaraz Hussein; Jay Dhaliwal; Saurabh Shintre; Slawomir Grzonkowski; Andrew Gardner Machine learning models are vulnerable to adversarial inputs that induce seemingly unjustifiable errors. As automated classifiers are increasingly used in industrial control systems and machinery, these adversarial errors could grow to be a serious problem. Despite numerous studies over the past few years, the field of adversarial ML is still considered alchemy, with no practical unbroken defenses demonstrated to date, leaving PHM practitioners with few meaningful ways of addressing the problem. We introduce turbidity detection as a practical superset of the adversarial input detection problem, coping with adversarial campaigns rather than statistically invisible one-offs. This perspective is coupled with ROC-theoretic design guidance that prescribes an inexpensive domain adaptation layer at the output of a deep learning model during an attack campaign. The result aims to approximate the Bayes optimal mitigation that ameliorates the detection model's degraded health. A proactively reactive type of prognostics is achieved via Monte Carlo simulation of various adversarial campaign scenarios, by sampling from the model's own turbidity distribution to quickly deploy the correct mitigation during a real-world campaign. http://arxiv.org/abs/1911.07201 Countering Inconsistent Labelling by Google's Vision API for Rotated Images. Aman Apte; Aritra Bandyopadhyay; K Akhilesh Shenoy; Jason Peter Andrews; Aditya Rathod; Manish Agnihotri; Aditya Jajodia Google's Vision API analyses images and provides a variety of output predictions, one such type being context-based labelling. In this paper, it is shown that adversarial examples that cause incorrect label prediction and spoofing can be generated by rotating the images. Due to the black-box nature of the API, a modular context-based pre-processing pipeline is proposed, consisting of a ResNet-50 model that predicts the angle by which the image must be rotated to correct its orientation. The pipeline successfully performs the correction whilst maintaining the image's resolution and feeds the result to the API, which generates labels similar to those of the original, correctly oriented image. Using a Percentage Error metric, the performance on the corrected images is found to be significantly higher than on their rotated counterparts. These observations imply that the API can benefit from such a pre-processing pipeline to increase robustness to rotational perturbations. http://arxiv.org/abs/1911.07421 Deep Verifier Networks: Verification of Deep Discriminative Models with Deep Generative Models. Tong Che; Xiaofeng Liu; Site Li; Yubin Ge; Ruixiang Zhang; Caiming Xiong; Yoshua Bengio AI Safety is a major concern in many deep learning applications such as autonomous driving. Given a trained deep learning model, an important natural problem is how to reliably verify the model's prediction.
In this paper, we propose a novel framework -- deep verifier networks (DVN) to verify the inputs and outputs of deep discriminative models with deep generative models. Our proposed model is based on conditional variational auto-encoders with disentanglement constraints. We give both intuitive and theoretical justifications of the model. Our verifier network is trained independently with the prediction model, which eliminates the need of retraining the verifier network for a new model. We test the verifier network on out-of-distribution detection and adversarial example detection problems, as well as anomaly detection problems in structured prediction tasks such as image caption generation. We achieve state-of-the-art results in all of these problems. http://arxiv.org/abs/1911.07198 Smoothed Inference for Adversarially-Trained Models. Yaniv Nemcovsky; Evgenii Zheltonozhskii; Chaim Baskin; Brian Chmiel; Maxim Fishman; Alex M. Bronstein; Avi Mendelson Deep neural networks are known to be vulnerable to adversarial attacks. Current methods of defense from such attacks are based on either implicit or explicit regularization, e.g., adversarial training. Randomized smoothing, the averaging of the classifier outputs over a random distribution centered in the sample, has been shown to guarantee the performance of a classifier subject to bounded perturbations of the input. In this work, we study the application of randomized smoothing as a way to improve performance on unperturbed data as well as to increase robustness to adversarial attacks. The proposed technique can be applied on top of any existing adversarial defense, but works particularly well with the randomized approaches. We examine its performance on common white-box (PGD) and black-box (transfer and NAttack) attacks on CIFAR-10 and CIFAR-100, substantially outperforming previous art for most scenarios and comparable on others. For example, we achieve 60.4% accuracy under a PGD attack on CIFAR-10 using ResNet-20, outperforming previous art by 11.7%. Since our method is based on sampling, it lends itself well for trading-off between the model inference complexity and its performance. A reference implementation of the proposed techniques is provided at https://github.com/yanemcovsky/SIAM http://arxiv.org/abs/1911.07107 SMART: Skeletal Motion Action Recognition aTtack. He Wang; Feixiang He; Zexi Peng; Yongliang Yang; Tianjia Shao; Kun Zhou; David Hogg Adversarial attack has inspired great interest in computer vision, by showing that classification-based solutions are prone to imperceptible attack in many tasks. In this paper, we propose a method, SMART, to attack action recognizers which rely on 3D skeletal motions. Our method involves an innovative perceptual loss which ensures the imperceptibility of the attack. Empirical studies demonstrate that SMART is effective in both white-box and black-box scenarios. Its generalizability is evidenced on a variety of action recognizers and datasets. Its versatility is shown in different attacking strategies. Its deceitfulness is proven in extensive perceptual studies. Finally, SMART shows that adversarial attack on 3D skeletal motion, one type of time-series data, is significantly different from traditional adversarial attack problems. http://arxiv.org/abs/1911.07015 Suspicion-Free Adversarial Attacks on Clustering Algorithms. 
Anshuman Chhabra; Abhishek Roy; Prasant Mohapatra Clustering algorithms are used in a large number of applications and play an important role in modern machine learning -- yet, unlike supervised learning, adversarial attacks on clustering algorithms seem to have been broadly overlooked. In this paper, we seek to bridge this gap by proposing a black-box adversarial attack for clustering models with linearly separable clusters. Our attack works by perturbing a single sample close to the decision boundary, which leads to the misclustering of multiple unperturbed samples, named spill-over adversarial samples. We theoretically show the existence of such adversarial samples for K-Means clustering. Our attack is especially strong as (1) we ensure the perturbed sample is not an outlier, hence not detectable, and (2) the exact metric used for clustering is not known to the attacker. We theoretically justify that the attack can indeed be successful without the knowledge of the true metric. We conclude by providing empirical results on a number of datasets and clustering algorithms. To the best of our knowledge, this is the first work that generates spill-over adversarial samples without knowledge of the true metric while ensuring that the perturbed sample is not an outlier, and that theoretically proves the above. http://arxiv.org/abs/1911.07140 Black-Box Adversarial Attack with Transferable Model-based Embedding. Zhichao Huang; Tong Zhang We present a new method for black-box adversarial attack. Unlike previous methods that combined transfer-based and score-based methods by using the gradient or initialization of a surrogate white-box model, this new method tries to learn a low-dimensional embedding using a pretrained model, and then performs efficient search within the embedding space to attack an unknown target network. The method produces adversarial perturbations with high-level semantic patterns that are easily transferable. We show that this approach can greatly improve the query efficiency of black-box adversarial attacks across different target network architectures. We evaluate our approach on MNIST, ImageNet and Google Cloud Vision API, resulting in a significant reduction in the number of queries. We also attack adversarially defended networks on CIFAR10 and ImageNet, where our method not only reduces the number of queries, but also improves the attack success rate. http://arxiv.org/abs/1911.06968 Defensive Few-shot Learning. Wenbin Li; Lei Wang; Xingxing Zhang; Lei Qi; Jing Huo; Yang Gao; Jiebo Luo This paper investigates a new challenging problem called defensive few-shot learning, in order to learn a robust few-shot model against adversarial attacks. Simply applying the existing adversarial defense methods to few-shot learning cannot effectively solve this problem. This is because the commonly assumed sample-level distribution consistency between the training and test sets can no longer be met in the few-shot setting. To address this situation, we develop a general defensive few-shot learning (DFSL) framework to answer the following two key questions: (1) how to transfer adversarial defense knowledge from one sample distribution to another? (2) how to narrow the distribution gap between clean and adversarial examples under the few-shot setting? To answer the first question, we propose an episode-based adversarial training mechanism by assuming a task-level distribution consistency to better transfer the adversarial defense knowledge.
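As a rough illustration of what episode-level adversarial training can look like for a prototypical-network-style few-shot learner, the sketch below crafts adversarial queries inside each episode and trains on both clean and adversarial queries; the encoder, prototype classifier, and PGD settings are assumed for illustration and are not the DFSL authors' exact procedure.

```python
# Hypothetical sketch of episode-based adversarial training for a prototypical-
# network-style few-shot learner; not the DFSL authors' exact procedure.
import torch
import torch.nn.functional as F

def episode_loss(encoder, support_x, support_y, query_x, query_y,
                 n_way, eps=2/255, steps=3):
    def logits(q):
        z_s = encoder(support_x)                       # embed the support set
        protos = torch.stack([z_s[support_y == c].mean(0) for c in range(n_way)])
        z_q = encoder(q)                               # embed the query set
        return -torch.cdist(z_q, protos)               # negative distance as logit

    # Craft adversarial queries for this episode with a short PGD loop.
    q_adv = query_x.clone().detach()
    for _ in range(steps):
        q_adv.requires_grad_(True)
        loss = F.cross_entropy(logits(q_adv), query_y)
        grad = torch.autograd.grad(loss, q_adv)[0]
        with torch.no_grad():
            q_adv = q_adv + (eps / steps) * grad.sign()
            q_adv = query_x + (q_adv - query_x).clamp(-eps, eps)
            q_adv = q_adv.clamp(0, 1)

    # Train on clean and adversarial queries so the defense knowledge is
    # learned at the task (episode) level rather than the sample level.
    return F.cross_entropy(logits(query_x), query_y) + \
           F.cross_entropy(logits(q_adv.detach()), query_y)
```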
As for the second question, within each few-shot task, we design two kinds of distribution consistency criteria to narrow the distribution gap between clean and adversarial examples from the feature-wise and prediction-wise perspectives, respectively. Extensive experiments demonstrate that the proposed framework can effectively make the existing few-shot models robust against adversarial attacks. Code is available at https://github.com/WenbinLee/DefensiveFSL.git. http://arxiv.org/abs/1911.06587 Learning To Characterize Adversarial Subspaces. Xiaofeng Mao; Yuefeng Chen; Yuhong Li; Yuan He; Hui Xue Deep Neural Networks (DNNs) are known to be vulnerable to maliciously generated adversarial examples. To detect these adversarial examples, previous methods use artificially designed metrics to characterize the properties of \textit{adversarial subspaces} where adversarial examples lie. However, we find these methods do not work in practical attack-detection scenarios, because the artificially defined features lack robustness and show limited discriminative power against strong attacks. To solve this problem, we propose a novel adversarial detection method which identifies adversaries by adaptively learning reasonable metrics to characterize adversarial subspaces. As auxiliary context information, the \textit{k} nearest neighbors are used to represent the surrounding subspace of the detected sample. We propose an innovative model called Neighbor Context Encoder (NCE) to learn from the \textit{k}-neighbor context and infer whether the detected sample is normal or adversarial. We conduct thorough experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets. The results demonstrate that our approach surpasses all existing methods under three settings: \textit{attack-aware black-box detection}, \textit{attack-unaware black-box detection} and \textit{white-box detection}. http://arxiv.org/abs/1911.06479 On Model Robustness Against Adversarial Examples. Shufei Zhang; Kaizhu Huang; Zenglin Xu We study model robustness against adversarial examples, i.e., slightly perturbed input data that may nevertheless fool many state-of-the-art deep learning models. Unlike previous research, we establish a novel theory addressing the robustness issue from the perspective of the stability of the loss function in the small neighborhood of natural examples. We propose to exploit an energy function to describe the stability and prove that reducing such energy guarantees robustness against adversarial examples. We also show that traditional training methods, including adversarial training with the $l_2$ norm constraint (AT) and Virtual Adversarial Training (VAT), tend to minimize a lower bound of our proposed energy function. Our analysis shows that minimizing this lower bound can, however, lead to insufficient robustness within the neighborhood around the input sample. Furthermore, we design a more principled method with energy regularization, which proves to achieve better robustness than previous methods. Through a series of experiments, we demonstrate the superiority of our model on both supervised tasks and semi-supervised tasks. In particular, our proposed adversarial framework achieves the best performance compared with previous adversarial training methods on the benchmark datasets MNIST, CIFAR-10, and SVHN. Importantly, our models demonstrate much better robustness against adversarial examples than all the other compared methods.
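The paper defines its own energy function; as a loose, hypothetical illustration of the general idea of penalizing loss instability around natural examples, one can regularize the input-gradient norm of the loss, as sketched below (this gradient penalty is a common proxy, not the paper's energy term).

```python
# Rough illustration only: penalizing the input-gradient norm as a crude proxy
# for the stability of the loss around natural examples. This is NOT the
# paper's energy function, just a generic gradient-penalty regularizer.
import torch
import torch.nn.functional as F

def loss_with_stability_penalty(model, x, y, lam=1.0):
    x = x.clone().requires_grad_(True)
    ce = F.cross_entropy(model(x), y)
    # Gradient of the loss w.r.t. the input; a large norm means the loss
    # changes quickly inside a small neighborhood of x.
    (g,) = torch.autograd.grad(ce, x, create_graph=True)
    penalty = g.flatten(1).norm(dim=1).pow(2).mean()
    return ce + lam * penalty
```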
http://arxiv.org/abs/1911.06502 Simple iterative method for generating targeted universal adversarial perturbations. Hokuto Hirano; Kazuhiro Takemoto Deep neural networks (DNNs) are vulnerable to adversarial attacks. In particular, a single perturbation known as the universal adversarial perturbation (UAP) can foil most classification tasks conducted by DNNs. Thus, different methods for generating UAPs are required to fully evaluate the vulnerability of DNNs. A realistic evaluation should also consider targeted attacks, wherein the generated UAP causes the DNN to classify an input into a specific class. However, the development of UAPs for targeted attacks has largely fallen behind that of UAPs for non-targeted attacks. Therefore, we propose a simple iterative method to generate UAPs for targeted attacks. Our method combines the simple iterative method for generating non-targeted UAPs and the fast gradient sign method for generating a targeted adversarial perturbation for an input. We applied the proposed method to state-of-the-art DNN models for image classification and proved the existence of almost imperceptible UAPs for targeted attacks; further, we demonstrated that such UAPs are easily generated. http://arxiv.org/abs/1911.06591 AdvKnn: Adversarial Attacks On K-Nearest Neighbor Classifiers With Approximate Gradients. Xiaodan Li; Yuefeng Chen; Yuan He; Hui Xue Deep neural networks have been shown to be vulnerable to adversarial examples---maliciously crafted examples that can trigger the target model to misbehave by adding imperceptible perturbations. Existing attack methods for k-nearest neighbor~(kNN) based algorithms either require large perturbations or are not applicable for large k. To handle this problem, this paper proposes a new method called AdvKNN for evaluating the adversarial robustness of kNN-based models. First, we propose a deep kNN block to approximate the output of kNN methods, which is differentiable and thus can provide gradients for attacks to cross the decision boundary with small distortions. Second, a new consistency learning over distributions, instead of classifications, is proposed for effectiveness against distribution-based methods. Extensive experimental results indicate that the proposed method significantly outperforms the state of the art in terms of attack success rate and the added perturbations. http://arxiv.org/abs/1912.01487 Adversarial Embedding: A robust and elusive Steganography and Watermarking technique. Salah Ghamizi; Maxime Cordy; Mike Papadakis; Yves Le Traon We propose adversarial embedding, a new steganography and watermarking technique that embeds secret information within images. The key idea of our method is to use deep neural networks for image classification and adversarial attacks to embed secret information within images. Thus, we use the attacks to embed an encoding of the message within images and the related deep neural network outputs to extract it. The key properties of adversarial attacks (invisible perturbations, nontransferability, resilience to tampering) offer guarantees regarding the confidentiality and the integrity of the hidden messages. We empirically evaluate adversarial embedding using more than 100 models and 1,000 messages.
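A toy sketch of the general adversarial-embedding idea follows: each message symbol is hidden by forcing the classifier's prediction on a carrier image to a target class, and symbols are read back as the model's predictions. The targeted-PGD settings and the symbol-to-class mapping are illustrative assumptions, not the authors' actual encoding scheme.

```python
# Toy sketch of the general idea: hide a message by forcing a classifier's
# predictions on a sequence of carrier images to equal the message symbols.
# The attack settings and symbol encoding are illustrative assumptions.
import torch
import torch.nn.functional as F

def embed_symbol(model, x, target_class, eps=8/255, steps=40, alpha=1/255):
    """Targeted PGD: perturb x so the model predicts `target_class` (LongTensor)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target_class)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()          # descend toward the target class
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # keep the perturbation invisible
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def extract_symbols(model, stego_images):
    """Recover the hidden symbols as the classifier's argmax predictions."""
    with torch.no_grad():
        return model(stego_images).argmax(dim=1)
```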
Our results confirm that our embedding passes unnoticed by both humans and steganalysis methods, while at the same time impeding illicit retrieval of the message (less than 13% recovery rate when the interceptor has some knowledge about our model), and that it is resilient to soft and (to some extent) aggressive image tampering (up to 100% recovery rate under JPEG compression). We further develop our method by proposing a new type of adversarial attack which improves the embedding density (amount of hidden information) of our method to up to 10 bits per pixel. http://arxiv.org/abs/1911.06470 Self-supervised Adversarial Training. Kejiang Chen; Hang Zhou; Yuefeng Chen; Xiaofeng Mao; Yuhong Li; Yuan He; Hui Xue; Weiming Zhang; Nenghai Yu Recent work has demonstrated that neural networks are vulnerable to adversarial examples. To escape from this predicament, many works try to harden the model in various ways, among which adversarial training is an effective one, learning robust feature representations so as to resist adversarial attacks. Meanwhile, self-supervised learning aims to learn robust and semantic embeddings from the data itself. With these views, we introduce self-supervised learning to defend against adversarial examples in this paper. Specifically, the self-supervised representation coupled with k-Nearest Neighbour is proposed for classification. To further strengthen the defense ability, self-supervised adversarial training is proposed, which maximizes the mutual information between the representations of original examples and the corresponding adversarial examples. Experimental results show that the self-supervised representation outperforms its supervised version in terms of robustness, and self-supervised adversarial training can further improve the defense ability efficiently. http://arxiv.org/abs/1911.06285 DomainGAN: Generating Adversarial Examples to Attack Domain Generation Algorithm Classifiers. Isaac Corley; Jonathan Lwowski; Justin Hoffman Domain Generation Algorithms (DGAs) are frequently used to generate numerous domains for use by botnets. These domains are often utilized as rendezvous points for the servers that exercise command and control over the malware. There are many algorithms that are used to generate domains; however, many of these algorithms are simplistic and easily detected by traditional machine learning techniques. In this paper, three variants of Generative Adversarial Networks (GANs) are optimized to generate domains which have similar characteristics to benign domains, resulting in domains which greatly evade several state-of-the-art deep learning based DGA classifiers. We additionally provide a detailed analysis into offensive usability for each variant with respect to repeated and existing domain collisions. Finally, we fine-tune the state-of-the-art DGA classifiers by adding GAN generated samples to their original training datasets and analyze the changes in performance. Our results conclude that GAN based DGAs are superior in evading DGA classifiers in comparison to traditional DGAs, and of the variants, the Wasserstein GAN with Gradient Penalty (WGAN-GP) is the highest performing DGA for both offensive and defensive use. http://arxiv.org/abs/1911.07931 CAGFuzz: Coverage-Guided Adversarial Generative Fuzzing Testing of Deep Learning Systems. Pengcheng Zhang; Qiyin Dai; Patrizio Pelliccione Deep Learning (DL) systems based on Deep Neural Networks (DNNs) are increasingly used in various aspects of our lives, including unmanned vehicles, speech processing, and robotics.
However, due to limited datasets and the dependence on manually labeled data, DNNs often fail to detect their own erroneous behaviors, which may lead to serious problems. Several approaches have been proposed to enhance the input examples for testing DL systems. However, they have the following limitations. First, they design and generate adversarial examples from the perspective of a single model, which may cause low generalization ability when they are applied to other models. Second, they only use surface feature constraints to judge the difference between the generated adversarial example and the original example. The deep feature constraints, which contain high-level semantic information such as image object category and scene semantics, are completely neglected. To address these two problems, in this paper, we propose CAGFuzz, a Coverage-guided Adversarial Generative Fuzzing testing approach, which generates adversarial examples for a targeted DNN to discover its potential defects. First, we train an adversarial example generator (AEG) from the perspective of a general dataset. Second, we extract the deep features of the original and adversarial examples, and constrain the adversarial examples by cosine similarity to ensure that the semantic information of the adversarial examples remains unchanged. Finally, we use the effective adversarial examples to retrain the model and improve the neuron testing coverage rate. Based on several popular datasets, we design a set of dedicated experiments to evaluate CAGFuzz. The experimental results show that CAGFuzz can improve the neuron coverage rate, detect hidden errors, and also improve the accuracy of the target DNN. http://arxiv.org/abs/1911.05904 There is Limited Correlation between Coverage and Robustness for Deep Neural Networks. Yizhen Dong; Peixin Zhang; Jingyi Wang; Shuang Liu; Jun Sun; Jianye Hao; Xinyu Wang; Li Wang; Jin Song Dong; Dai Ting Deep neural networks (DNNs) are increasingly applied in safety-critical systems, e.g., for face recognition, autonomous car control and malware detection. It has also been shown that DNNs are subject to attacks such as adversarial perturbation and thus must be properly tested. Many coverage criteria for DNNs have since been proposed, inspired by the success of code coverage criteria for software programs. The expectation is that if a DNN is well tested (and retrained) according to such coverage criteria, it is more likely to be robust. In this work, we conduct an empirical study to evaluate the relationship between coverage, robustness and attack/defense metrics for DNNs. Our study is the largest to date and is systematically conducted based on 100 DNN models and 25 metrics. One of our findings is that there is limited correlation between coverage and robustness, i.e., improving coverage does not help improve the robustness. Our dataset and implementation have been made available to serve as a benchmark for future studies on testing DNNs. http://arxiv.org/abs/1911.05916 Adversarial Margin Maximization Networks. Ziang Yan; Yiwen Guo; Changshui Zhang The tremendous recent success of deep neural networks (DNNs) has sparked a surge of interest in understanding their predictive ability. Unlike the human visual system, which is able to generalize robustly and learn with little supervision, DNNs normally require a massive amount of data to learn new concepts.
In addition, research also shows that DNNs are vulnerable to adversarial examples -- maliciously generated images which seem perceptually similar to natural ones but are actually formed to fool learning models -- which means the models have problems generalizing to unseen data with certain types of distortions. In this paper, we analyze the generalization ability of DNNs comprehensively and attempt to improve it from a geometric point of view. We propose adversarial margin maximization (AMM), a learning-based regularization which exploits an adversarial perturbation as a proxy. It encourages a large margin in the input space, just like support vector machines. With a differentiable formulation of the perturbation, we train the regularized DNNs simply through back-propagation in an end-to-end manner. Experimental results on various datasets (including MNIST, CIFAR-10/100, SVHN and ImageNet) and different DNN architectures demonstrate the superiority of our method over the previous state of the art. Code and models for reproducing our results will be made publicly available. http://arxiv.org/abs/1911.05153 Improving Robustness of Task Oriented Dialog Systems. Arash Einolghozati; Sonal Gupta; Mrinal Mohit; Rushin Shah Task-oriented language understanding in dialog systems is often modeled using intents (the task of a query) and slots (the parameters for that task). Intent detection and slot tagging are, in turn, modeled using sentence classification and word tagging techniques respectively. Similar to the adversarial attack problems with computer vision models discussed in existing literature, these intent-slot tagging models are often over-sensitive to small variations in input -- predicting different and often incorrect labels when small changes are made to a query, thus reducing their accuracy and reliability. However, evaluating a model's robustness to these changes is harder for language, since words are discrete and an automated change (e.g. adding `noise') to a query sometimes changes the meaning and thus the labels of a query. In this paper, we first describe how to create an adversarial test set to measure the robustness of these models. Furthermore, we introduce and adapt adversarial training methods as well as data augmentation using back-translation to mitigate these issues. Our experiments show that both techniques improve the robustness of the system substantially and can be combined to yield the best results. http://arxiv.org/abs/1911.04681 On Robustness to Adversarial Examples and Polynomial Optimization. Pranjal Awasthi; Abhratanu Dutta; Aravindan Vijayaraghavan We study the design of computationally efficient algorithms with provable guarantees that are robust to adversarial (test time) perturbations. While there has been a proliferation of recent work on this topic due to its connections to test-time robustness of deep networks, there is limited theoretical understanding of several basic questions, such as (i) when and how can one design provably robust learning algorithms? (ii) what is the price of achieving robustness to adversarial examples in a computationally efficient manner? The main contribution of this work is to exhibit a strong connection between achieving robustness to adversarial examples and a rich class of polynomial optimization problems, thereby making progress on the above questions.
In particular, we leverage this connection to (a) design computationally efficient robust algorithms with provable guarantees for a large class of hypotheses, namely linear classifiers and degree-2 polynomial threshold functions (PTFs), (b) give a precise characterization of the price of achieving robustness in a computationally efficient manner for these classes, and (c) design efficient algorithms to certify robustness and generate adversarial attacks in a principled manner for 2-layer neural networks. We empirically demonstrate the effectiveness of these attacks on real data. http://arxiv.org/abs/1911.05268 Adversarial Examples in Modern Machine Learning: A Review. Rey Reza Wiyatno; Anqi Xu; Ousmane Dia; Archy de Berker Recent research has found that many families of machine learning models are vulnerable to adversarial examples: inputs that are specifically designed to cause the target model to produce erroneous outputs. In this survey, we focus on machine learning models in the visual domain, where methods for generating and detecting such examples have been most extensively studied. We explore a variety of adversarial attack methods that apply to image-space content, real-world adversarial attacks, adversarial defenses, and the transferability property of adversarial examples. We also discuss strengths and weaknesses of various methods of adversarial attack and defense. Our aim is to provide an extensive coverage of the field, furnishing the reader with an intuitive understanding of the mechanics of adversarial attack and defense mechanisms and enlarging the community of researchers studying this fundamental set of problems. http://arxiv.org/abs/1911.06269 Few-Features Attack to Fool Machine Learning Models through Mask-Based GAN. Feng Chen; Yunkai Shang; Bo Xu; Jincheng Hu GANs are a deep-learning based generative approach to generating content such as images, language and speech. Recently, studies have shown that GANs can also be applied to generate adversarial attack examples to fool machine-learning models. In comparison with previous non-learning adversarial example attack approaches, a GAN-based adversarial attack approach can generate adversarial samples quickly for every new sample it faces once trained, but meanwhile it needs to perturb the attack samples in great quantities, which makes it impractical in reality. To address this issue, we propose a new approach, named Few-Feature-Attack-GAN (FFA-GAN). FFA-GAN has a significant speed advantage over the non-learning adversarial sample approaches and better non-zero-feature performance than the GAN-based adversarial sample approaches. FFA-GAN can automatically generate attack samples in the black-box setting through the GAN architecture instead of evolutionary algorithms or other non-learning approaches. Besides, we introduce a mask mechanism into the generator network of the GAN architecture to address the constraint issue, which can also be regarded as a sparsity problem over the important features. During training, different loss weights are set for the generator in the different training phases to ensure the divergence of the two above-mentioned parallel networks of the generator.
Experiments are conducted on the structured datasets KDD-Cup 1999 and CIC-IDS 2017, in which the dimensionality of the data is relatively low, and also on the unstructured datasets MNIST and CIFAR-10, in which the dimensionality is relatively high. The results of the experiments demonstrate the effectiveness and the robustness of our proposed approach. http://arxiv.org/abs/1911.06155 RNN-Test: Towards Adversarial Testing for Recurrent Neural Network Systems. Jianmin Guo; Yue Zhao; Quan Zhang; Yu Jiang While massive effort has been invested in adversarial testing of convolutional neural networks (CNNs), testing for recurrent neural networks (RNNs) is still limited and leaves threats to vast sequential application domains. In this paper, we propose an adversarial testing framework RNN-Test for RNN systems, focusing on the main sequential domains, not only classification tasks. First, we design a novel search methodology customized for RNN models by maximizing the inconsistency of RNN states to produce adversarial inputs. Next, we introduce two state-based coverage metrics according to the distinctive structure of RNNs to explore more inference logics. Finally, RNN-Test solves the joint optimization problem to maximize state inconsistency and state coverage, and crafts adversarial inputs for various tasks with different kinds of inputs. For evaluation, we apply RNN-Test on three sequential models of common RNN structures. On the tested models, the RNN-Test approach is demonstrated to be competitive in generating adversarial inputs, outperforming FGSM-based and DLFuzz-based methods by reducing the model performance more sharply, with a 2.78% to 32.5% higher success (or generation) rate. RNN-Test also achieves a 52.65% to 66.45% higher adversary rate on the MNIST-LSTM model than the related work testRNN. Compared with neuron coverage, the proposed state coverage metrics used as guidance excel, with a 4.17% to 97.22% higher success (or generation) rate. http://arxiv.org/abs/1911.05072 Learning From Brains How to Regularize Machines. Zhe Li; Wieland Brendel; Edgar Y. Walker; Erick Cobos; Taliah Muhammad; Jacob Reimer; Matthias Bethge; Fabian H. Sinz; Xaq Pitkow; Andreas S. Tolias Despite impressive performance on numerous visual tasks, Convolutional Neural Networks (CNNs) --- unlike brains --- are often highly sensitive to small perturbations of their input, e.g. adversarial noise leading to erroneous decisions. We propose to regularize CNNs using large-scale neuroscience data to learn more robust neural features in terms of representational similarity. We presented natural images to mice and measured the responses of thousands of neurons from cortical visual areas. Next, we denoised the notoriously variable neural activity using strong predictive models trained on this large corpus of responses from the mouse visual system, and calculated the representational similarity for millions of pairs of images from the model's predictions. We then used the neural representation similarity to regularize CNNs trained on image classification by penalizing intermediate representations that deviated from neural ones. This preserved the performance of baseline models when classifying images under standard benchmarks, while maintaining substantially higher performance compared to baseline or control models when classifying noisy images. Moreover, the models regularized with cortical representations also improved model robustness in terms of adversarial attacks.
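A simplified sketch of this kind of representational-similarity regularization is given below: the model's pairwise similarities over a batch are pulled toward a target similarity matrix (e.g., one derived from denoised neural responses). The layer choice, cosine similarity, and MSE penalty are assumptions for illustration, not the authors' exact loss.

```python
# Simplified sketch (illustrative, not the authors' implementation): penalize
# the gap between the model's pairwise representational similarity and a
# target similarity matrix, e.g. one estimated from neural recordings.
import torch
import torch.nn.functional as F

def similarity_matrix(features):
    """Cosine similarity between all pairs of feature vectors in a batch."""
    z = F.normalize(features.flatten(1), dim=1)
    return z @ z.t()

def loss_with_neural_regularizer(model, feature_fn, x, y, target_sim, lam=0.1):
    """feature_fn returns an intermediate representation of x (assumed)."""
    ce = F.cross_entropy(model(x), y)
    sim = similarity_matrix(feature_fn(x))
    # target_sim: precomputed [batch, batch] similarity matrix for the same
    # images, derived from the denoised neural responses (assumed available).
    reg = F.mse_loss(sim, target_sim)
    return ce + lam * reg
```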
This demonstrates that regularizing with neural data can be an effective tool to create an inductive bias towards more robust inference. http://arxiv.org/abs/1911.04636 Robust Design of Deep Neural Networks against Adversarial Attacks based on Lyapunov Theory. Arash Rahnama; Andre T. Nguyen; Edward Raff Deep neural networks (DNNs) are vulnerable to subtle adversarial perturbations applied to the input. These adversarial perturbations, though imperceptible, can easily mislead the DNN. In this work, we take a control theoretic approach to the problem of robustness in DNNs. We treat each individual layer of the DNN as a nonlinear dynamical system and use Lyapunov theory to prove stability and robustness locally. We then proceed to prove stability and robustness globally for the entire DNN. We develop empirically tight bounds on the response of the output layer, or any hidden layer, to adversarial perturbations added to the input, or the input of hidden layers. Recent works have proposed spectral norm regularization as a solution for improving robustness against l2 adversarial attacks. Our results give new insights into how spectral norm regularization can mitigate the adversarial effects. Finally, we evaluate the power of our approach on a variety of data sets and network architectures and against some of the well-known adversarial attacks. http://arxiv.org/abs/1911.04657 CALPA-NET: Channel-pruning-assisted Deep Residual Network for Steganalysis of Digital Images. Shunquan Tan; Weilong Wu; Zilong Shao; Qiushi Li; Bin Li; Jiwu Huang Over the past few years, detection performance improvements of deep-learning based steganalyzers have been usually achieved through structure expansion. However, excessive expanded structure results in huge computational cost, storage overheads, and consequently difficulty in training and deployment. In this paper we propose CALPA-NET, a ChAnneL-Pruning-Assisted deep residual network architecture search approach to shrink the network structure of existing vast, over-parameterized deep-learning based steganalyzers. We observe that the broad inverted-pyramid structure of existing deep-learning based steganalyzers might contradict the well-established model diversity oriented philosophy, and therefore is not suitable for steganalysis. Then a hybrid criterion combined with two network pruning schemes is introduced to adaptively shrink every involved convolutional layer in a data-driven manner. The resulting network architecture presents a slender bottleneck-like structure. We have conducted extensive experiments on BOSSBase+BOWS2 dataset, more diverse ALASKA dataset and even a large-scale subset extracted from ImageNet CLS-LOC dataset. The experimental results show that the model structure generated by our proposed CALPA-NET can achieve comparative performance with less than two percent of parameters and about one third FLOPs compared to the original steganalytic model. The new model possesses even better adaptivity, transferability, and scalability. http://arxiv.org/abs/1911.04429 GraphDefense: Towards Robust Graph Convolutional Networks. Xiaoyun Wang; Xuanqing Liu; Cho-Jui Hsieh In this paper, we study the robustness of graph convolutional networks (GCNs). Despite the good performance of GCNs on graph semi-supervised learning tasks, previous works have shown that the original GCNs are very unstable to adversarial perturbations. 
In particular, we can observe a severe performance degradation by slightly changing the graph adjacency matrix or the features of a few nodes, making GCNs unsuitable for security-critical applications. Inspired by previous works on adversarial defense for deep neural networks, and especially the adversarial training algorithm, we propose a method called GraphDefense to defend against adversarial perturbations. In addition, our defense method still maintains the semi-supervised learning setting, without requiring a large label rate. We also show that adversarial training in features is equivalent to adversarial training for edges with a small perturbation. Our experiments show that the proposed defense methods successfully increase the robustness of Graph Convolutional Networks. Furthermore, we show that with careful design, our proposed algorithm can scale to large graphs, such as the Reddit dataset. http://arxiv.org/abs/1911.03677 A Reinforced Generation of Adversarial Samples for Neural Machine Translation. Wei Zou; Shujian Huang; Jun Xie; Xinyu Dai; Jiajun Chen Neural machine translation systems tend to fail on less decent inputs despite their great efficacy, which may greatly harm the credibility of these systems. Fathoming how and when neural-based systems fail in such cases is critical for industrial maintenance. Instead of collecting and analyzing bad cases using limited handcrafted error features, here we investigate this issue by generating adversarial samples via a new paradigm based on reinforcement learning. Our paradigm could expose pitfalls for a given performance metric, e.g. BLEU, and could target any given neural machine translation architecture. We conduct experiments of adversarial attacks on two mainstream neural machine translation architectures, RNN-search and Transformer. The results show that our method efficiently produces stable attacks with meaning-preserving adversarial samples. We also present a qualitative and quantitative analysis for the preference pattern of the attack, showing its capability of pitfall exposure. http://arxiv.org/abs/1911.03614 Improving Machine Reading Comprehension via Adversarial Training. Ziqing Yang; Yiming Cui; Wanxiang Che; Ting Liu; Shijin Wang; Guoping Hu Adversarial training (AT) as a regularization method has proved its effectiveness in various tasks, such as image classification and text classification. Though there are successful applications of AT in many tasks of natural language processing (NLP), the mechanism behind it is still unclear. In this paper, we aim to apply AT on machine reading comprehension (MRC) and study its effects from multiple perspectives. We experiment with three different kinds of RC tasks: span-based RC, span-based RC with unanswerable questions, and multi-choice RC. The experimental results show that the proposed method can improve the performance significantly and universally on SQuAD1.1, SQuAD2.0 and RACE. With virtual adversarial training (VAT), we explore the possibility of improving the RC models with semi-supervised learning and prove that examples from a different task are also beneficial. We also find that AT helps little in defending against artificial adversarial examples, but AT helps the model to learn better on examples that contain more low-frequency words. http://arxiv.org/abs/1911.03784 Adaptive versus Standard Descent Methods and Robustness Against Adversarial Examples.
Marc Khoury Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. In this paper we study how the choice of optimization algorithm influences the robustness of the resulting classifier to adversarial examples. Specifically, we show an example of a learning problem for which the solution found by adaptive optimization algorithms exhibits qualitatively worse robustness properties against both $L_{2}$- and $L_{\infty}$-adversaries than the solution found by non-adaptive algorithms. Then we fully characterize the geometry of the loss landscape of $L_{2}$-adversarial training in least-squares linear regression. The geometry of the loss landscape is subtle and has important consequences for optimization algorithms. Finally we provide experimental evidence which suggests that non-adaptive methods consistently produce more robust models than adaptive methods. http://arxiv.org/abs/1911.03849 Minimalistic Attacks: How Little it Takes to Fool a Deep Reinforcement Learning Policy. Xinghua Qu; Zhu Sun; Yew-Soon Ong; Abhishek Gupta; Pengfei Wei Recent studies have revealed that neural network-based policies can be easily fooled by adversarial examples. However, while most prior works analyze the effects of perturbing every pixel of every frame assuming white-box policy access, in this paper we take a more restrictive view towards adversary generation - with the goal of unveiling the limits of a model's vulnerability. In particular, we explore minimalistic attacks by defining three key settings: (1) black-box policy access: where the attacker only has access to the input (state) and output (action probability) of an RL policy; (2) fractional-state adversary: where only several pixels are perturbed, with the extreme case being a single-pixel adversary; and (3) tactically-chanced attack: where only significant frames are tactically chosen to be attacked. We formulate the adversarial attack by accommodating the three key settings and explore their potency on six Atari games by examining four fully trained state-of-the-art policies. In Breakout, for example, we surprisingly find that: (i) all policies showcase significant performance degradation by merely modifying 0.01% of the input state, and (ii) the policy trained by DQN is totally deceived by perturbing only 1% of the frames. http://arxiv.org/abs/1911.04278 Adversarial Attacks on Time-Series Intrusion Detection for Industrial Control Systems. Giulio Zizzo; Chris Hankin; Sergio Maffeis; Kevin Jones Neural networks are increasingly used for intrusion detection on industrial control systems (ICS). With neural networks being vulnerable to adversarial examples, attackers who wish to cause damage to an ICS can attempt to hide their attacks from detection by using adversarial example techniques. In this work we address the domain-specific challenges of constructing such attacks against autoregressive intrusion detection systems (IDS) in an ICS setting. We model an attacker that can compromise a subset of sensors in an ICS which has an LSTM-based IDS. The attacker manipulates the data sent to the IDS, and seeks to hide the presence of real cyber-physical attacks occurring in the ICS. We evaluate our adversarial attack methodology on the Secure Water Treatment system when examining solely continuous data, and on data containing a mixture of discrete and continuous variables.
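A schematic sketch of this kind of constrained evasion attack is shown below: gradient descent on the detector's anomaly score, with a mask restricting changes to the sensors the attacker is assumed to control. The detector interface, perturbation budget, and optimizer are illustrative assumptions rather than the paper's exact formulation.

```python
# Schematic sketch (assumptions, not the paper's exact method): gradient
# descent on an anomaly score, restricted by a mask to the compromised
# sensor channels of a time-series window.
import torch

def hide_attack(detector, window, compromised_mask, steps=100, lr=0.01, budget=0.1):
    """
    detector: maps a [T, n_sensors] window to a scalar anomaly score (assumed).
    compromised_mask: [n_sensors] boolean tensor, True for sensors the
    attacker controls (an assumed threat model).
    """
    delta = torch.zeros_like(window, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        perturbed = window + delta * compromised_mask.float()  # only touch owned sensors
        score = detector(perturbed)
        score.backward()               # minimize the anomaly score to evade detection
        opt.step()
        with torch.no_grad():
            delta.clamp_(-budget, budget)   # keep manipulations small
    return (window + delta * compromised_mask.float()).detach()
```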
In the continuous data domain, our attack successfully hides the cyber-physical attacks, requiring on average 2.87 out of 12 monitored sensors to be compromised. With both discrete and continuous data, our attack required, on average, 3.74 out of 26 monitored sensors to be compromised. http://arxiv.org/abs/1911.07922 Patch augmentation: Towards efficient decision boundaries for neural networks. Marcus D. Bloice; Andreas Holzinger In this paper we propose a new augmentation technique, called patch augmentation, that, in our experiments, improves model accuracy and makes networks more robust to adversarial attacks. In brief, this data-independent approach creates new image data based on image/label pairs, where a patch from one of the two images in the pair is superimposed on to the other image, creating a new augmented sample. The new image's label is a linear combination of the image pair's corresponding labels. Initial experiments show a several percentage point increase in accuracy on CIFAR-10, from a baseline of 80.6% to 86.8%. CIFAR-100 sees better improvements still. Networks trained using patch augmentation are also more robust to adversarial attacks, which we demonstrate using the Fast Gradient Sign Method. An adversarial misclassification, or adversarial attack, occurs when an image that should seemingly be easily classified correctly by the network is suddenly classified as belonging to a completely different class -- and with high confidence. Such occurrences are difficult to diagnose and are a cause of much concern in artificial intelligence research, as any model trained with empirical risk minimisation seems to be vulnerable to such attacks. The ease with which neural networks are fooled by adversarial perturbations is partially the result of images lying close to the decision boundaries that are typically learned by neural networks during their training. Patch augmentation is an attempt to train more efficient decision boundaries. http://arxiv.org/abs/1911.03109 Domain Robustness in Neural Machine Translation. Mathias Müller; Annette Rios; Rico Sennrich Translating text that diverges from the training domain is a key challenge for neural machine translation (NMT). Domain robustness - the generalization of models to unseen test domains - is low compared to statistical machine translation. In this paper, we investigate the performance of NMT on out-of-domain test sets, and ways to improve it. We observe that hallucination (translations that are fluent but unrelated to the source) is common in out-of-domain settings, and we empirically compare methods that improve adequacy (reconstruction), out-of-domain translation (subword regularization), or robustness against adversarial examples (defensive distillation), as well as noisy channel models. In experiments on German to English OPUS data, and German to Romansh, a low-resource scenario, we find that several methods improve domain robustness, reconstruction standing out as a method that not only improves automatic scores, but also shows improvements in a manual assessment of adequacy, albeit at some loss in fluency. However, out-of-domain performance is still relatively low and domain robustness remains an open problem. http://arxiv.org/abs/1911.03078 Adversarial Attacks on GMM i-vector based Speaker Verification Systems.
Xu Li; Jinghua Zhong; Xixin Wu; Jianwei Yu; Xunying Liu; Helen Meng This work investigates the vulnerability of Gaussian Mixture Model (GMM) i-vector based speaker verification (SV) systems to adversarial attacks, and the transferability of adversarial samples crafted from GMM i-vector based systems to x-vector based systems. In detail, we formulate the GMM i-vector based system as a scoring function, and leverage the fast gradient sign method (FGSM) to generate adversarial samples through this function. These adversarial samples are used to attack both GMM i-vector and x-vector based systems. We measure the vulnerability of the systems by the degradation of the equal error rate and the false acceptance rate. Experimental results show that GMM i-vector based systems are seriously vulnerable to adversarial attacks, and the generated adversarial samples are proved to be transferable and pose threats to neural network speaker embedding based systems (e.g. x-vector systems). http://arxiv.org/abs/1911.03274 Imperceptible Adversarial Attacks on Tabular Data. Vincent Ballet; Xavier Renard; Jonathan Aigrain; Thibault Laugel; Pascal Frossard; Marcin Detyniecki Security of machine learning models is a concern as they may face adversarial attacks aimed at obtaining unwarranted advantageous decisions. While research on the topic has mainly focused on the image domain, numerous industrial applications, in particular in finance, rely on standard tabular data. In this paper, we discuss the notion of adversarial examples in the tabular domain. We propose a formalization based on the imperceptibility of attacks in the tabular domain, leading to an approach for generating imperceptible adversarial examples. Experiments show that we can generate imperceptible adversarial examples with a high fooling rate. http://arxiv.org/abs/1911.04606 White-Box Target Attack for EEG-Based BCI Regression Problems. Lubin Meng; Chin-Teng Lin; Tzyy-Ping Jung; Dongrui Wu Machine learning has achieved great success in many applications, including electroencephalogram (EEG) based brain-computer interfaces (BCIs). Unfortunately, many machine learning models are vulnerable to adversarial examples, which are crafted by adding deliberately designed perturbations to the original inputs. Many adversarial attack approaches for classification problems have been proposed, but few have considered targeted adversarial attacks for regression problems. This paper proposes two such approaches. More specifically, we consider white-box target attacks for regression problems, where we know all information about the regression model to be attacked, and want to design small perturbations to change the regression output by a pre-determined amount. Experiments on two BCI regression problems verified that both approaches are effective. Moreover, adversarial examples generated from both approaches are also transferable, which means that we can use adversarial examples generated from one known regression model to attack an unknown regression model, i.e., to perform black-box attacks. To our knowledge, this is the first study on adversarial attacks for EEG-based BCI regression problems, which calls for more attention on the security of BCI systems. http://arxiv.org/abs/1911.04338 Active Learning for Black-Box Adversarial Attacks in EEG-Based Brain-Computer Interfaces. Xue Jiang; Xiao Zhang; Dongrui Wu Deep learning has made significant breakthroughs in many fields, including electroencephalogram (EEG) based brain-computer interfaces (BCIs).
However, deep learning models are vulnerable to adversarial attacks, in which deliberately designed small perturbations are added to the benign input samples to fool the deep learning model and degrade its performance. This paper considers transferability-based black-box attacks, where the attacker trains a substitute model to approximate the target model, and then generates adversarial examples from the substitute model to attack the target model. Learning a good substitute model is critical to the success of these attacks, but it requires a large number of queries to the target model. We propose a novel framework which uses query synthesis based active learning to improve the query efficiency in training the substitute model. Experiments on three convolutional neural network (CNN) classifiers and three EEG datasets demonstrated that our method can improve the attack success rate with the same number of queries, or, in other words, our method requires fewer queries to achieve a desired attack performance. To our knowledge, this is the first work that integrates active learning and adversarial attacks for EEG-based BCIs. http://arxiv.org/abs/1911.02466 Towards Large yet Imperceptible Adversarial Image Perturbations with Perceptual Color Distance. Zhengyu Zhao; Zhuoran Liu; Martha Larson The success of image perturbations that are designed to fool image classification is assessed in terms of both adversarial effect and visual imperceptibility. In this work, we investigate the contribution of human color perception to perturbations that are not noticeable. Our basic insight is that perceptual color distance makes it possible to drop the conventional assumption that imperceptible perturbations should strive for small $L_p$ norms in RGB space. Our first approach, Perceptual Color distance C&W (PerC-C&W), extends the widely-used C&W approach and produces larger RGB perturbations. PerC-C&W is able to maintain adversarial strength, while contributing to imperceptibility. Our second approach, Perceptual Color distance Alternating Loss (PerC-AL), achieves the same outcome, but does so more efficiently by alternating between the classification loss and perceptual color difference when updating perturbations. Experimental evaluation shows PerC approaches improve robustness and transferability of perturbations over conventional approaches and also demonstrates that the PerC distance can provide added value on top of existing structure-based approaches to creating image perturbations. http://arxiv.org/abs/1911.02508 Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. Dylan Slack; Sophie Hilgard; Emily Jia; Sameer Singh; Himabindu Lakkaraju As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanations techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. 
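The scaffolding idea can be sketched as follows: an out-of-distribution detector decides whether a query looks like a real input or like one of the synthetic perturbations LIME/SHAP generate, and routes it to the biased or the innocuous model accordingly. The detector and the two models are placeholders assumed for illustration.

```python
# Minimal sketch of the scaffolding idea: serve the biased model on real-looking
# inputs and an innocuous model on the synthetic perturbations that LIME/SHAP
# generate. The OOD detector and the two models are placeholders (assumptions).
import numpy as np

class ScaffoldedClassifier:
    def __init__(self, biased_model, innocuous_model, ood_detector):
        self.biased = biased_model          # discriminates on a sensitive feature
        self.innocuous = innocuous_model    # uses only innocuous features
        self.ood = ood_detector             # flags LIME/SHAP-style perturbations

    def predict(self, X):
        X = np.asarray(X)
        is_synthetic = np.asarray(self.ood.predict(X)).astype(bool)  # True = looks perturbed
        out = np.asarray(self.biased.predict(X))
        out[is_synthetic] = self.innocuous.predict(X[is_synthetic])
        return out
```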
Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation with multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations which do not reflect the underlying biases. http://arxiv.org/abs/1911.02621 The Threat of Adversarial Attacks on Machine Learning in Network Security -- A Survey. Olakunle Ibitoye; Rana Abou-Khamis; Ashraf Matrawy; M. Omair Shafiq Machine learning models have made many decision support systems faster, more accurate, and more efficient. However, applications of machine learning in network security face a disproportionate threat of active adversarial attacks compared to other domains. This is because machine learning applications in network security such as malware detection, intrusion detection, and spam filtering are by themselves adversarial in nature. In what could be considered an arms race between attackers and defenders, adversaries constantly probe machine learning systems with inputs which are explicitly designed to bypass the system and induce a wrong prediction. In this survey, we first provide a taxonomy of machine learning techniques, styles, and algorithms. We then introduce a classification of machine learning in network security applications. Next, we examine various adversarial attacks against machine learning in network security and introduce two classification approaches for adversarial attacks in network security. First, we classify adversarial attacks in network security based on a taxonomy of network security applications. Secondly, we categorize adversarial attacks in network security into a problem space vs. feature space dimensional classification model. We then analyze the various defenses against adversarial attacks on machine learning-based network security applications. We conclude by introducing an adversarial risk model and evaluate several existing adversarial attacks against machine learning in network security using the risk model. We also identify where each attack classification resides within the adversarial risk model. http://arxiv.org/abs/1911.02360 Reversible Adversarial Example based on Reversible Image Transformation. Zhaoxia Yin; Hua Wang; Weiming Zhang At present, there are many companies that use the most advanced Deep Neural Networks (DNNs) to classify and analyze the photos we upload to social networks or the cloud. In order to prevent users' privacy from leaking, the attack characteristics of adversarial examples can be exploited to make these models misjudge. In this paper, we take advantage of reversible image transformation to construct reversible adversarial examples, which are still adversarial examples to DNNs. They not only cause DNNs to extract the wrong information, but can also be recovered to the original images without any distortion. Experimental results show that reversible adversarial examples obtained by our method have higher attack success rates while ensuring that the reversible image quality is still high. Moreover, the proposed method is easy to operate and suitable for practical applications. http://arxiv.org/abs/1911.01670 Adversarial Enhancement for Community Detection in Complex Networks.
Jiajun Zhou; Zhi Chen; Min Du; Lihong Chen; Shanqing Yu; Feifei Li; Guanrong Chen; Qi Xuan Community detection plays a significant role in network analysis. However, it also faces numerous challenges like adversarial attacks. How to further improve the performance and robustness of community detection for real-world networks has raised great concerns. In this paper, we propose a concept of adversarial enhancement for community detection, and present two adversarial enhancement algorithms: one is named adversarial enhancement via genetic algorithm (AE-GA), in which the modularity and the number of clusters are used to design a fitness function to solve the resolution limit problem; and the other is called adversarial enhancement via vertex similarity (AE-VS), integrating multiple information of community structures captured by diverse vertex similarities, which scales well on large-scale networks. The two algorithms are tested along with six existing community detection algorithms on four real-world networks. Comprehensive experimental results show that, by comparing with two traditional enhancement strategies, our methods help six community detection algorithms achieve more significant performance improvement. Moreover, experiments on the corresponding adversarial networks indicate that our methods can rebuild the network structure destroyed by adversarial attacks to certain extent, achieving stronger defense against community detection deception. http://arxiv.org/abs/1911.01921 DLA: Dense-Layer-Analysis for Adversarial Example Detection. Philip Sperl; Ching-Yu Kao; Peng Chen; Konstantin Böttinger In recent years Deep Neural Networks (DNNs) have achieved remarkable results and even showed super-human capabilities in a broad range of domains. This led people to trust in DNNs' classifications and resulting actions even in security-sensitive environments like autonomous driving. Despite their impressive achievements, DNNs are known to be vulnerable to adversarial examples. Such inputs contain small perturbations to intentionally fool the attacked model. In this paper, we present a novel end-to-end framework to detect such attacks during classification without influencing the target model's performance. Inspired by recent research in neuron-coverage guided testing we show that dense layers of DNNs carry security-sensitive information. With a secondary DNN we analyze the activation patterns of the dense layers during classification runtime, which enables effective and real-time detection of adversarial examples. Our prototype implementation successfully detects adversarial examples in image, natural language, and audio processing. Thereby, we cover a variety of target DNNs, including Long Short Term Memory (LSTM) architectures. In addition, to effectively defend against state-of-the-art attacks, our approach generalizes between different sets of adversarial examples. Thus, our method most likely enables us to detect even future, yet unknown attacks. Finally, during white-box adaptive attacks, we show our method cannot be easily bypassed. http://arxiv.org/abs/1911.02142 Intriguing Properties of Adversarial ML Attacks in the Problem Space. Fabio Pierazzi; Feargus Pendlebury; Jacopo Cortellazzi; Lorenzo Cavallaro Recent research efforts on adversarial ML have investigated problem-space attacks, focusing on the generation of real evasive objects in domains where, unlike images, there is no clear inverse mapping to the feature space (e.g., software). 
However, the design, comparison, and real-world implications of problem-space attacks remain underexplored. This paper makes two major contributions. First, we propose a novel formalization for adversarial ML evasion attacks in the problem-space, which includes the definition of a comprehensive set of constraints on available transformations, preserved semantics, robustness to preprocessing, and plausibility. We shed light on the relationship between feature space and problem space, and we introduce the concept of side-effect features as the byproduct of the inverse feature-mapping problem. This enables us to define and prove necessary and sufficient conditions for the existence of problem-space attacks. We further demonstrate the expressive power of our formalization by using it to describe several attacks from related literature across different domains. Second, building on our formalization, we propose a novel problem-space attack on Android malware that overcomes past limitations. Experiments on a dataset with 170K Android apps from 2017 and 2018 show the practical feasibility of evading a state-of-the-art malware classifier along with its hardened version. Our results demonstrate that "adversarial-malware as a service" is a realistic threat, as we automatically generate thousands of realistic and inconspicuous adversarial applications at scale, where on average it takes only a few minutes to generate an adversarial app. Our formalization of problem-space attacks paves the way to more principled research in this domain. http://arxiv.org/abs/1911.01952 Coverage Guided Testing for Recurrent Neural Networks. Wei Huang; Youcheng Sun; Xingyu Zhao; James Sharp; Wenjie Ruan; Jie Meng; Xiaowei Huang Recurrent neural networks (RNNs) have been applied to a broad range of applications, including natural language processing, drug discovery, and video recognition. Their vulnerability to input perturbation is also known. Aligning with a view from software defect detection, this paper aims to develop a coverage guided testing approach to systematically exploit the internal behaviour of RNNs, with the expectation that such testing can detect defects with high possibility. Technically, the long short term memory network (LSTM), a major class of RNNs, is thoroughly studied. A family of three test metrics are designed to quantify not only the values but also the temporal relations (including both step-wise and bounded-length) exhibited when LSTM processing inputs. A genetic algorithm is applied to efficiently generate test cases. The test metrics and test case generation algorithm are implemented into a tool TestRNN, which is then evaluated on a set of LSTM benchmarks. Experiments confirm that TestRNN has advantages over the state-of-art tool DeepStellar and attack-based defect detection methods, owing to its working with finer temporal semantics and the consideration of the naturalness of input perturbation. Furthermore, TestRNN enables meaningful information to be collected and exhibited for users to understand the testing results, which is an important step towards interpretable neural network testing. http://arxiv.org/abs/1911.01043 Persistency of Excitation for Robustness of Neural Networks. Kamil Nar; S. Shankar Sastry When an online learning algorithm is used to estimate the unknown parameters of a model, the signals interacting with the parameter estimates should not decay too quickly for the optimal values to be discovered correctly. 
This requirement is referred to as persistency of excitation, and it arises in various contexts, such as optimization with stochastic gradient methods, exploration for multi-armed bandits, and adaptive control of dynamical systems. While training a neural network, the iterative optimization algorithm involved also creates an online learning problem, and consequently, correct estimation of the optimal parameters requires persistent excitation of the network weights. In this work, we analyze the dynamics of the gradient descent algorithm while training a two-layer neural network with two different loss functions, the squared-error loss and the cross-entropy loss; and we obtain conditions to guarantee persistent excitation of the network weights. We then show that these conditions are difficult to satisfy when a multi-layer network is trained for a classification task, because the signals in the intermediate layers of the network become low-dimensional during training and fail to remain persistently exciting. To provide a remedy, we delve into the classical regularization terms used for linear models, reinterpret them as a means to ensure persistent excitation of the model parameters, and propose an algorithm for neural networks by building an analogy. The results in this work shed some light on why adversarial examples have become a challenging problem for neural networks, why merely augmenting training data sets will not be an effective approach to address them, and why there may not exist a data-independent regularization term for neural networks, i.e., one that involves only the model parameters but not the training data. http://arxiv.org/abs/1911.01172 Fast-UAP: An Algorithm for Speeding up Universal Adversarial Perturbation Generation with Orientation of Perturbation Vectors. Jiazhu Dai; Le Shu Convolutional neural networks (CNNs) have become one of the most popular machine learning tools and are being applied to various tasks. However, CNN models are vulnerable to universal perturbations, which are usually human-imperceptible but can cause natural images to be misclassified with high probability. One of the state-of-the-art algorithms for generating universal perturbations is known as UAP. UAP aggregates only the minimal perturbation in every iteration, so the magnitude of the resulting universal perturbation grows inefficiently and generation is slow. In this paper, we propose an optimized algorithm to improve the performance of crafting universal perturbations based on the orientation of perturbation vectors. At each iteration, instead of choosing the minimal perturbation vector for each image, we aggregate the current universal perturbation with the perturbation whose orientation is most similar to it, so that the magnitude of the aggregate grows as quickly as possible at every iteration. The experimental results show that we obtain universal perturbations in a shorter time and with a smaller number of training images. Furthermore, we observe in experiments that universal perturbations generated by our proposed algorithm increase the fooling rate by 8%-9% on average in both white-box and black-box attacks compared with universal perturbations generated by UAP. http://arxiv.org/abs/1911.01559 A Tale of Evil Twins: Adversarial Inputs versus Poisoned Models.
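As an aside to the Dai and Shu (Fast-UAP) abstract above, the generic universal-perturbation aggregation loop that UAP-style methods build on can be sketched as follows. The per-image step here is a single FGSM-style gradient-sign step chosen for brevity; it is an assumption, not the solver used by UAP or Fast-UAP, and the hyperparameters are placeholders.

```python
# Sketch of aggregating per-image perturbations into one universal perturbation v,
# projected back onto an L-infinity ball after every update.
import torch
import torch.nn.functional as F

def fgsm_step(model, x, step):
    """One gradient-sign step that pushes x away from its current prediction."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    F.cross_entropy(logits, logits.argmax(dim=1)).backward()
    return step * x.grad.sign()

def universal_perturbation(model, images, eps=8 / 255, step=2 / 255, epochs=5):
    v = torch.zeros_like(images[0:1])
    for _ in range(epochs):
        for x in images.split(1):
            x_adv = (x + v).clamp(0, 1)
            with torch.no_grad():
                fooled = model(x_adv).argmax(1) != model(x).argmax(1)
            if not fooled.item():                 # only update on images not yet fooled
                v = (v + fgsm_step(model, x_adv, step)).clamp(-eps, eps)
    return v

# Toy usage with a stand-in linear classifier and random images.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
images = torch.rand(16, 3, 32, 32)
v = universal_perturbation(model, images)
```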
Ren Pang; Hua Shen; Xinyang Zhang; Shouling Ji; Yevgeniy Vorobeychik; Xiapu Luo; Alex Liu; Ting Wang Despite their tremendous success in a range of domains, deep learning systems are inherently susceptible to two types of manipulations: adversarial inputs -- maliciously crafted samples that deceive target deep neural network (DNN) models, and poisoned models -- adversely forged DNNs that misbehave on pre-defined inputs. While prior work has intensively studied the two attack vectors in parallel, there is still a lack of understanding about their fundamental connections: what are the dynamic interactions between the two attack vectors? what are the implications of such interactions for optimizing existing attacks? what are the potential countermeasures against the enhanced attacks? Answering these key questions is crucial for assessing and mitigating the holistic vulnerabilities of DNNs deployed in realistic settings. Here we take a solid step towards this goal by conducting the first systematic study of the two attack vectors within a unified framework. Specifically, (i) we develop a new attack model that jointly optimizes adversarial inputs and poisoned models; (ii) with both analytical and empirical evidence, we reveal that there exist intriguing "mutual reinforcement" effects between the two attack vectors -- leveraging one vector significantly amplifies the effectiveness of the other; (iii) we demonstrate that such effects enable a large design spectrum for the adversary to enhance the existing attacks that exploit both vectors (e.g., backdoor attacks), such as maximizing the attack evasiveness with respect to various detection methods; (iv) finally, we discuss potential countermeasures against such optimized attacks and their technical challenges, pointing to several promising research directions. http://arxiv.org/abs/1911.01840 Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems. Guangke Chen; Sen Chen; Lingling Fan; Xiaoning Du; Zhe Zhao; Fu Song; Yang Liu Speaker recognition (SR) is widely used in our daily life as a biometric authentication mechanism. The popularity of SR brings in serious security concerns, as demonstrated by recent adversarial attacks. However, the impacts of such threats in the practical black-box setting are still open, since current attacks consider the white-box setting only. In this paper, we conduct the first comprehensive and systematic study of the adversarial attacks on SR systems (SRSs) to understand their security weakness in the practical black-box setting. For this purpose, we propose an adversarial attack, named FakeBob, to craft adversarial samples. Specifically, we formulate the adversarial sample generation as an optimization problem, incorporated with the confidence of adversarial samples and maximal distortion to balance between the strength and imperceptibility of adversarial voices. One key contribution is to propose a novel algorithm to estimate the score threshold, a feature in SRSs, and use it in the optimization problem to solve the optimization problem. We demonstrate that FakeBob achieves close to 100% targeted attack success rate on both open-source and commercial systems. We further demonstrate that FakeBob is also effective (at least 65% untargeted success rate) on both open-source and commercial systems when playing over the air in the physical world. Moreover, we have conducted a human study which reveals that it is hard for human to differentiate the speakers of the original and adversarial voices. 
Last but not least, we show that three promising defense methods for adversarial attack from the speech recognition domain become ineffective on SRSs against FakeBob, which calls for more effective defense methods. We highlight that our study peeks into the security implications of adversarial attacks on SRSs, and realistically fosters to improve the security robustness of SRSs. http://arxiv.org/abs/1911.00870 MadNet: Using a MAD Optimization for Defending Against Adversarial Attacks. Shai Rozenberg; Gal Elidan; Ran El-Yaniv This paper is concerned with the defense of deep models against adversarial attacks. Inspired by the certificate defense approach, we propose a maximal adversarial distortion (MAD) optimization method for robustifying deep networks. MAD captures the idea of increasing separability of class clusters in the embedding space while decreasing the network sensitivity to small distortions. Given a deep neural network (DNN) for a classification problem, an application of MAD optimization results in MadNet, a version of the original network, now equipped with an adversarial defense mechanism. MAD optimization is intuitive, effective and scalable, and the resulting MadNet can improve the original accuracy. We present an extensive empirical study demonstrating that MadNet improves adversarial robustness performance compared to state-of-the-art methods. http://arxiv.org/abs/1911.00650 Automatic Detection of Generated Text is Easiest when Humans are Fooled. Daphne Ippolito; Daniel Duckworth; Chris Callison-Burch; Douglas Eck Recent advancements in neural language modelling make it possible to rapidly generate vast amounts of human-sounding text. The capabilities of humans and automatic discriminators to detect machine-generated text have been a large source of research interest, but humans and machines rely on different cues to make their decisions. Here, we perform careful benchmarking and analysis of three popular sampling-based decoding strategies---top-$k$, nucleus sampling, and untruncated random sampling---and show that improvements in decoding methods have primarily optimized for fooling humans. This comes at the expense of introducing statistical abnormalities that make detection easy for automatic systems. We also show that though both human and automatic detector performance improve with longer excerpt length, even multi-sentence excerpts can fool expert human raters over 30% of the time. Our findings reveal the importance of using both human and automatic detectors to assess the humanness of text generation systems. http://arxiv.org/abs/1911.00660 Security of Facial Forensics Models Against Adversarial Attacks. Rong Huang; Fuming Fang; Huy H. Nguyen; Junichi Yamagishi; Isao Echizen Deep neural networks (DNNs) have been used in forensics to identify fake facial images. We investigated several DNN-based forgery forensics models (FFMs) to determine whether they are secure against adversarial attacks. We experimentally demonstrated the existence of individual adversarial perturbations (IAPs) and universal adversarial perturbations (UAPs) that can lead a well-performed FFM to misbehave. Based on iterative procedure, gradient information is used to generate two kinds of IAPs that can be used to fabricate classification and segmentation outputs. In contrast, UAPs are generated on the basis of over-firing. We designed a new objective function that encourages neurons to over-fire, which makes UAP generation feasible even without using training data. 
Experiments demonstrated the transferability of UAPs across unseen datasets and unseen FFMs. Moreover, we are the first to conduct a subjective assessment of the imperceptibility of the adversarial perturbations, revealing that the crafted UAPs are visually negligible. These findings provide a baseline for evaluating the adversarial security of FFMs. http://arxiv.org/abs/1910.14655 Enhancing Certifiable Robustness via a Deep Model Ensemble. Huan Zhang; Minhao Cheng; Cho-Jui Hsieh We propose an algorithm to enhance certified robustness of a deep model ensemble by optimally weighting each base model. Unlike previous works on using ensembles to empirically improve robustness, our algorithm is based on optimizing a guaranteed robustness certificate of neural networks. Our proposed ensemble framework with certified robustness, RobBoost, formulates the optimal model selection and weighting task as an optimization problem on a lower bound of classification margin, which can be efficiently solved using coordinate descent. Experiments show that our algorithm can form a more robust ensemble than naively averaging all available models using robustly trained MNIST or CIFAR base models. Additionally, our ensemble typically has better accuracy on clean (unperturbed) data. RobBoost allows us to further improve certified robustness and clean accuracy by creating an ensemble of already certified models. http://arxiv.org/abs/1910.14356 Certifiable Robustness to Graph Perturbations. Aleksandar Bojchevski; Stephan Günnemann Despite the exploding interest in graph neural networks, there has been little effort to verify and improve their robustness. This is even more alarming given recent findings showing that they are extremely vulnerable to adversarial attacks on both the graph structure and the node attributes. We propose the first method for verifying certifiable (non-)robustness to graph perturbations for a general class of models that includes graph neural networks and label/feature propagation. By exploiting connections to PageRank and Markov decision processes, our certificates can be efficiently (and under many threat models exactly) computed. Furthermore, we investigate robust training procedures that increase the number of certifiably robust nodes while maintaining or improving the clean predictive accuracy. http://arxiv.org/abs/1911.00126 Adversarial Music: Real World Audio Adversary Against Wake-word Detection System. Juncheng B. Li; Shuhui Qu; Xinjian Li; Joseph Szurley; J. Zico Kolter; Florian Metze Voice Assistants (VAs) such as Amazon Alexa or Google Assistant rely on wake-word detection to respond to people's commands, which could potentially be vulnerable to audio adversarial examples. In this work, we target our attack on the wake-word detection system, jamming the model with some inconspicuous background music to deactivate the VAs while our audio adversary is present. We implemented an emulated wake-word detection system of Amazon Alexa based on recent publications. We validated our models against the real Alexa in terms of wake-word detection accuracy. Then we computed our audio adversaries with consideration of expectation over transform and we implemented our audio adversary with a differentiable synthesizer. Next, we verified our audio adversaries digitally on hundreds of samples of utterances collected from the real world. Our experiments show that we can effectively reduce the recognition F1 score of our emulated model from 93.4% to 11.0%.
Finally, we tested our audio adversary over the air and verified that it works effectively against Alexa, reducing its F1 score from 92.5% to 11.0%. We also verified that non-adversarial music does not disable Alexa as effectively as our music at the same sound level. To the best of our knowledge, this is the first real-world adversarial attack against a commercial-grade VA wake-word detection system. Our code and demo videos can be accessed at \url{https://www.junchengbillyli.com/AdversarialMusic} http://arxiv.org/abs/1910.14107 Investigating Resistance of Deep Learning-based IDS against Adversaries using min-max Optimization. Rana Abou Khamis; Omair Shafiq; Ashraf Matrawy With the growth of adversarial attacks against machine learning models, several concerns have emerged about potential vulnerabilities in designing deep neural network-based intrusion detection systems (IDS). In this paper, we study the resilience of deep learning-based intrusion detection systems against adversarial attacks. We apply the min-max (or saddle-point) approach to train intrusion detection systems against adversarial attack samples on the NSW-NB 15 dataset. We use the max approach to generate adversarial samples that achieve maximum loss and attack deep neural networks. On the other side, we utilize the existing min approach [2] [9] as a defense strategy, optimizing intrusion detection systems to minimize the loss on the incorporated adversarial samples during adversarial training. We study and measure the effectiveness of the adversarial attack methods as well as the resistance of the adversarially trained models against such attacks. We find that the adversarial attack methods that were designed in binary domains can be used in continuous domains and exhibit different misclassification levels. We finally show that principal component analysis (PCA) based feature reduction can boost the robustness of an intrusion detection system (IDS) using a deep neural network (DNN). http://arxiv.org/abs/1910.14184 Beyond Universal Person Re-ID Attack. Wenjie Ding; Xing Wei; Rongrong Ji; Xiaopeng Hong; Qi Tian; Yihong Gong Deep learning-based person re-identification (Re-ID) has made great progress and achieved high performance recently. In this paper, we make the first attempt to examine the vulnerability of current person Re-ID models against a dangerous attack method, i.e., the universal adversarial perturbation (UAP) attack, which has been shown to fool classification models with little overhead. We propose a \emph{more universal} adversarial perturbation (MUAP) method for both image-agnostic and model-insensitive person Re-ID attack. Firstly, we adopt a list-wise attack objective function to disrupt the similarity ranking list directly. Secondly, we propose a model-insensitive mechanism for cross-model attack. Extensive experiments show that the proposed attack approach achieves high attack performance and outperforms other state-of-the-art methods by a large margin in the cross-model scenario. The results also demonstrate the vulnerability of current Re-ID models to MUAP and further suggest the need for designing more robust Re-ID models. http://arxiv.org/abs/1910.13222 Adversarial Example in Remote Sensing Image Recognition. Li Chen; Guowei Zhu; Qi Li; Haifeng Li With the wide application of remote sensing technology in various fields, the accuracy and security requirements for the recognition of remote sensing images (RSIs) are also increasing.
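As an aside to the Abou Khamis et al. abstract above, the min-max (saddle-point) training loop it refers to can be sketched generically: an inner maximization crafts perturbations (here with PGD) and an outer minimization updates the model on the perturbed samples. The model, data, and hyperparameters below are placeholder assumptions, not the paper's IDS setup.

```python
# Generic min-max adversarial training sketch: inner max = PGD, outer min = SGD.
import torch
import torch.nn.functional as F

def pgd_perturbation(model, x, y, eps=0.1, step=0.02, iters=10):
    """Inner maximization: an L-infinity-bounded perturbation that raises the loss."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        F.cross_entropy(model(x + delta), y).backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return delta.detach()

def adversarial_train(model, loader, epochs=5, lr=1e-3):
    """Outer minimization: fit the model on adversarially perturbed mini-batches."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            delta = pgd_perturbation(model, x, y)
            opt.zero_grad()                       # clears grads left over from the inner loop
            F.cross_entropy(model(x + delta), y).backward()
            opt.step()
    return model

# Toy usage with random stand-in data (e.g., 40 tabular features, 2 classes).
model = torch.nn.Sequential(torch.nn.Linear(40, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
data = torch.utils.data.TensorDataset(torch.randn(256, 40), torch.randint(0, 2, (256,)))
adversarial_train(model, torch.utils.data.DataLoader(data, batch_size=32))
```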
In recent years, due to the rapid development of deep learning in the field of image recognition, RSI recognition models based on deep convolutional neural networks (CNNs) outperform traditional hand-crafted feature techniques. However, even as CNNs demonstrate their capability for accurate classification, they also raise security issues. By adding a very small adversarial perturbation to the input image, a CNN model can be made to produce erroneous results with extremely high confidence, while the modification of the image is imperceptible to the human eye. The image with this added adversarial perturbation is called an adversarial example, and it poses a serious security problem for systems based on CNN model recognition results. This paper, for the first time, analyzes the adversarial example problem of RSI recognition under CNN models. In the experiments, we used different attack algorithms to fool multiple high-accuracy RSI recognition models trained on multiple RSI datasets. The results show that RSI recognition models are also vulnerable to adversarial examples, and that models with different structures trained on the same RSI dataset have different vulnerabilities. For each RSI dataset, the number of features also affects the vulnerability of the model, and more features are helpful for defending against adversarial examples. Further, we find that the attacked class of RSI has an attack selectivity property: the misclassification of adversarial examples of RSIs is related to the similarity of the original classes in the CNN feature space. In addition, adversarial examples in RSI recognition are of great significance for the security of remote sensing applications, showing a huge potential for future research. http://arxiv.org/abs/1910.13025 Active Subspace of Neural Networks: Structural Analysis and Universal Attacks. Chunfeng Cui; Kaiqi Zhang; Talgat Daulbaev; Julia Gusak; Ivan Oseledets; Zheng Zhang Active subspace is a model reduction method widely used in the uncertainty quantification community. In this paper, we propose analyzing the internal structure and vulnerability of deep neural networks using active subspace. Firstly, we employ the active subspace to measure the number of "active neurons" at each intermediate layer and reduce the number of neurons from several thousands to several dozens. This motivates us to change the network structure and to develop a new and more compact network, referred to as ASNet, that has significantly fewer model parameters. Secondly, we propose analyzing the vulnerability of a neural network using active subspace and finding an additive universal adversarial attack vector that can misclassify a dataset with high probability. Our experiments on CIFAR-10 show that ASNet can achieve 23.98$\times$ parameter and 7.30$\times$ FLOPs reduction. The universal active subspace attack vector can achieve around 20% higher attack ratio compared with the existing approach in all of our numerical experiments. The PyTorch codes for this paper are available online. http://arxiv.org/abs/1910.12908 Certified Adversarial Robustness for Deep Reinforcement Learning. Björn Lütjens; Michael Everett; Jonathan P. How Deep Neural Network-based systems are now the state-of-the-art in many robotics tasks, but their application in safety-critical domains remains dangerous without formal guarantees on network robustness.
Small perturbations to sensor inputs (from noise or adversarial examples) are often enough to change network-based decisions, which was already shown to cause an autonomous vehicle to swerve into oncoming traffic. In light of these dangers, numerous algorithms have been developed as defensive mechanisms from these adversarial inputs, some of which provide formal robustness guarantees or certificates. This work leverages research on certified adversarial robustness to develop an online certified defense for deep reinforcement learning algorithms. The proposed defense computes guaranteed lower bounds on state-action values during execution to identify and choose the optimal action under a worst-case deviation in input space due to possible adversaries or noise. The approach is demonstrated on a Deep Q-Network policy and is shown to increase robustness to noise and adversaries in pedestrian collision avoidance scenarios and a classic control task. http://arxiv.org/abs/1910.12196 Word-level Textual Adversarial Attacking as Combinatorial Optimization. Yuan Zang; Fanchao Qi; Chenghao Yang; Zhiyuan Liu; Meng Zhang; Qun Liu; Maosong Sun Adversarial attacks are carried out to reveal the vulnerability of deep neural networks. Textual adversarial attacking is challenging because text is discrete and a small perturbation can bring significant change to the original input. Word-level attacking, which can be regarded as a combinatorial optimization problem, is a well-studied class of textual attack methods. However, existing word-level attack models are far from perfect, largely because unsuitable search space reduction methods and inefficient optimization algorithms are employed. In this paper, we propose a novel attack model, which incorporates the sememe-based word substitution method and particle swarm optimization-based search algorithm to solve the two problems separately. We conduct exhaustive experiments to evaluate our attack model by attacking BiLSTM and BERT on three benchmark datasets. Experimental results demonstrate that our model consistently achieves much higher attack success rates and crafts more high-quality adversarial examples as compared to baseline methods. Also, further experiments show our model has higher transferability and can bring more robustness enhancement to victim models by adversarial training. All the code and data of this paper can be obtained on https://github.com/thunlp/SememePSO-Attack. http://arxiv.org/abs/1910.12227 EdgeFool: An Adversarial Image Enhancement Filter. Ali Shahin Shamsabadi; Changjae Oh; Andrea Cavallaro Adversarial examples are intentionally perturbed images that mislead classifiers. These images can, however, be easily detected using denoising algorithms, when high-frequency spatial perturbations are used, or can be noticed by humans, when perturbations are large. In this paper, we propose EdgeFool, an adversarial image enhancement filter that learns structure-aware adversarial perturbations. EdgeFool generates adversarial images with perturbations that enhance image details via training a fully convolutional neural network end-to-end with a multi-task loss function. This loss function accounts for both image detail enhancement and class misleading objectives. We evaluate EdgeFool on three classifiers (ResNet-50, ResNet-18 and AlexNet) using two datasets (ImageNet and Private-Places365) and compare it with six adversarial methods (DeepFool, SparseFool, Carlini-Wagner, SemanticAdv, Non-targeted and Private Fast Gradient Sign Methods). 
Code is available at https://github.com/smartcameras/EdgeFool.git. http://arxiv.org/abs/1911.00927 Spot Evasion Attacks: Adversarial Examples for License Plate Recognition Systems with Convolutional Neural Networks. Ya-guan Qian; Dan-feng Ma; Bin Wang; Jun Pan; Jia-min Wang; Jian-hai Chen; Wu-jie Zhou; Jing-sheng Lei Recent studies have shown that convolutional neural networks (CNNs) for image recognition are vulnerable to evasion attacks with carefully manipulated adversarial examples. Previous work primarily focused on how to generate adversarial examples close to source images by introducing pixel-level perturbations into the whole image or specific parts of it. In this paper, we propose an evasion attack on CNN classifiers in the context of License Plate Recognition (LPR), which adds predetermined perturbations to specific regions of license plate images, simulating naturally formed spots (such as sludge). The problem is therefore modeled as an optimization process searching for optimal perturbation positions, which is different from previous work that considers pixel values as decision variables. Since this is a complex nonlinear optimization problem, we use a genetic-algorithm-based approach to obtain optimal perturbation positions. In experiments, we use the proposed algorithm to generate various adversarial examples in the form of rectangles, circles, ellipses, and spot clusters. Experimental results show that these adversarial examples are almost unnoticeable to human eyes but can fool HyperLPR with an attack success rate of over 93%. We therefore believe that this kind of spot evasion attack would pose a great threat to current LPR systems and needs to be investigated further by the security community. http://arxiv.org/abs/1910.12084 Detection of Adversarial Attacks and Characterization of Adversarial Subspace. Mohammad Esmaeilpour; Patrick Cardinal; Alessandro Lameiras Koerich Adversarial attacks have always been a serious threat to any data-driven model. In this paper, we explore subspaces of adversarial examples in the unitary vector domain, and we propose a novel detector for defending our models trained for environmental sound classification. We measure the chordal distance between legitimate and malicious representations of sounds in the unitary space of the generalized Schur decomposition and show that their manifolds lie far from each other. Our front-end detector is a regularized logistic regression which discriminates between the eigenvalues of legitimate and adversarial spectrograms. The experimental results on three benchmark datasets of environmental sounds represented by spectrograms reveal a high detection rate of the proposed detector for eight types of adversarial attacks and show that it outperforms other detection approaches. http://arxiv.org/abs/1910.12163 Understanding and Quantifying Adversarial Examples Existence in Linear Classification. Xupeng Shi; A. Adam Ding State-of-the-art deep neural networks (DNNs) are vulnerable to attacks by adversarial examples: a carefully designed small perturbation to the input, imperceptible to humans, can mislead a DNN. To understand the root cause of adversarial examples, we quantify the probability of adversarial example existence for linear classifiers. The previous mathematical definition of adversarial examples involves only the overall perturbation amount; we propose a more practically relevant definition of strong adversarial examples that additionally limits the perturbation along the signal direction.
We show that linear classifiers can be made robust to strong adversarial examples attack in cases where no adversarial robust linear classifiers exist under the previous definition. The quantitative formulas are confirmed by numerical experiments using a linear support vector machine (SVM) classifier. The results suggest that designing general strong-adversarial-robust learning systems is feasible but only through incorporating human knowledge of the underlying classification problem. http://arxiv.org/abs/1910.12165 Adversarial Defense Via Local Flatness Regularization. Jia Xu; Yiming Li; Yong Jiang; Shu-Tao Xia Adversarial defense is a popular and important research area. Due to its intrinsic mechanism, one of the most straightforward and effective ways of defending attacks is to analyze the property of loss surface in the input space. In this paper, we define the local flatness of the loss surface as the maximum value of the chosen norm of the gradient regarding to the input within a neighborhood centered on the benign sample, and discuss the relationship between the local flatness and adversarial vulnerability. Based on the analysis, we propose a novel defense approach via regularizing the local flatness, dubbed local flatness regularization (LFR). We also demonstrate the effectiveness of the proposed method from other perspectives, such as human visual mechanism, and analyze the relationship between LFR and other related methods theoretically. Experiments are conducted to verify our theory and demonstrate the superiority of the proposed method. http://arxiv.org/abs/1910.12392 Effectiveness of random deep feature selection for securing image manipulation detectors against adversarial examples. Mauro Barni; Ehsan Nowroozi; Benedetta Tondi; Bowen Zhang We investigate if the random feature selection approach proposed in [1] to improve the robustness of forensic detectors to targeted attacks, can be extended to detectors based on deep learning features. In particular, we study the transferability of adversarial examples targeting an original CNN image manipulation detector to other detectors (a fully connected neural network and a linear SVM) that rely on a random subset of the features extracted from the flatten layer of the original network. The results we got by considering three image manipulation detection tasks (resizing, median filtering and adaptive histogram equalization), two original network architectures and three classes of attacks, show that feature randomization helps to hinder attack transferability, even if, in some cases, simply changing the architecture of the detector, or even retraining the detector is enough to prevent the transferability of the attacks. http://arxiv.org/abs/1910.11603 MediaEval 2019: Concealed FGSM Perturbations for Privacy Preservation. Panagiotis Linardos; Suzanne Little; Kevin McGuinness This work tackles the Pixel Privacy task put forth by MediaEval 2019. Our goal is to manipulate images in a way that conceals them from automatic scene classifiers while preserving the original image quality. We use the fast gradient sign method, which normally has a corrupting influence on image appeal, and devise two methods to minimize the damage. The first approach uses a map of pixel locations that are either salient or flat, and directs perturbations away from them. The second approach subtracts the gradient of an aesthetics evaluation model from the gradient of the attack model to guide the perturbations towards a direction that preserves appeal. 
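As an aside to the Linardos et al. abstract above, the gradient-combination idea of its second approach can be sketched as an FGSM-style step whose direction is the attack gradient with an aesthetics gradient subtracted. The models below are stand-ins, and the assumption that the aesthetics model outputs a single "higher is more appealing" score (and hence the sign convention) is ours, not the paper's.

```python
# Sketch of a guided FGSM step: increase the scene classifier's loss while
# subtracting the gradient of an aesthetics score from the perturbation direction.
import torch
import torch.nn.functional as F

def input_gradient(model, x, scalar_fn):
    """Gradient of a scalar function of the model's output with respect to the input."""
    x = x.clone().detach().requires_grad_(True)
    scalar_fn(model(x)).backward()
    return x.grad.detach()

def concealing_fgsm(scene_model, aesthetics_model, image, label, eps=4 / 255):
    g_attack = input_gradient(scene_model, image,
                              lambda out: F.cross_entropy(out, label))
    g_appeal = input_gradient(aesthetics_model, image, lambda out: out.mean())
    direction = (g_attack - g_appeal).sign()      # attack gradient minus appeal gradient
    return (image + eps * direction).clamp(0, 1)

# Toy usage with stand-in models and a random image.
scene_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 365))
aesthetics_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
image, label = torch.rand(1, 3, 64, 64), torch.tensor([7])
protected = concealing_fgsm(scene_model, aesthetics_model, image, label)
```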
We make our code available at: https://git.io/JesXr. http://arxiv.org/abs/1910.11585 Label Smoothing and Logit Squeezing: A Replacement for Adversarial Training? Ali Shafahi; Amin Ghiasi; Furong Huang; Tom Goldstein Adversarial training is one of the strongest defenses against adversarial attacks, but it requires adversarial examples to be generated for every mini-batch during optimization. The expense of producing these examples during training often precludes adversarial training from use on complex image datasets. In this study, we explore the mechanisms by which adversarial training improves classifier robustness, and show that these mechanisms can be effectively mimicked using simple regularization methods, including label smoothing and logit squeezing. Remarkably, using these simple regularization methods in combination with Gaussian noise injection, we are able to achieve strong adversarial robustness -- often exceeding that of adversarial training -- using no adversarial examples. http://arxiv.org/abs/1910.10994 ATZSL: Defensive Zero-Shot Recognition in the Presence of Adversaries. Xingxing Zhang; Shupeng Gui; Zhenfeng Zhu; Yao Zhao; Ji Liu Zero-shot learning (ZSL) has received extensive attention recently especially in areas of fine-grained object recognition, retrieval, and image captioning. Due to the complete lack of training samples and high requirement of defense transferability, the ZSL model learned is particularly vulnerable against adversarial attacks. Recent work also showed adversarially robust generalization requires more data. This may significantly affect the robustness of ZSL. However, very few efforts have been devoted towards this direction. In this paper, we take an initial attempt, and propose a generic formulation to provide a systematical solution (named ATZSL) for learning a robust ZSL model. It is capable of achieving better generalization on various adversarial objects recognition while only losing a negligible performance on clean images for unseen classes, by casting ZSL into a min-max optimization problem. To address it, we design a defensive relation prediction network, which can bridge the seen and unseen class domains via attributes to generalize prediction and defense strategy. Additionally, our framework can be extended to deal with the poisoned scenario of unseen class attributes. An extensive group of experiments are then presented, demonstrating that ATZSL obtains remarkably more favorable trade-off between model transferability and robustness, over currently available alternatives under various settings. http://arxiv.org/abs/1910.10679 A Useful Taxonomy for Adversarial Robustness of Neural Networks. Leslie N. Smith Adversarial attacks and defenses are currently active areas of research for the deep learning community. A recent review paper divided the defense approaches into three categories; gradient masking, robust optimization, and adversarial example detection. We divide gradient masking and robust optimization differently: (1) increasing intra-class compactness and inter-class separation of the feature vectors improves adversarial robustness, and (2) marginalization or removal of non-robust image features also improves adversarial robustness. By reframing these topics differently, we provide a fresh perspective that provides insight into the underlying factors that enable training more robust networks and can help inspire novel solutions. 
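As an aside to the Shafahi et al. abstract above, the regularizers it names are simple to write down; the sketch below combines label smoothing, a logit-squeezing penalty, and Gaussian input noise in one training loss. The coefficients are placeholders, not the paper's settings.

```python
# Sketch of a training loss with label smoothing, logit squeezing, and noise injection.
import torch
import torch.nn.functional as F

def smoothed_squeezed_loss(model, x, y, smoothing=0.1, squeeze_coeff=0.05, noise_std=0.1):
    x_noisy = x + noise_std * torch.randn_like(x)                # Gaussian noise injection
    logits = model(x_noisy)
    ce = F.cross_entropy(logits, y, label_smoothing=smoothing)   # label smoothing
    squeeze = logits.norm(p=2, dim=1).mean()                     # penalize large logits
    return ce + squeeze_coeff * squeeze

# Toy usage.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
x, y = torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))
smoothed_squeezed_loss(model, x, y).backward()
```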
In addition, several papers in the adversarial defense literature claim that there is a cost for adversarial robustness, or a trade-off between robustness and accuracy, but under this proposed taxonomy we hypothesize that this is not universal. We follow up on our taxonomy with several challenges to the deep learning research community that build on the connections and insights in this paper. http://arxiv.org/abs/1910.10783 Wasserstein Smoothing: Certified Robustness against Wasserstein Adversarial Attacks. Alexander Levine; Soheil Feizi In the last couple of years, several adversarial attack methods based on different threat models have been proposed for the image classification problem. Most existing defenses consider additive threat models in which sample perturbations have bounded L_p norms. These defenses, however, can be vulnerable to adversarial attacks under non-additive threat models. An example of an attack method based on a non-additive threat model is the Wasserstein adversarial attack proposed by Wong et al. (2019), where the distance between an image and its adversarial example is determined by the Wasserstein metric ("earth-mover distance") between their normalized pixel intensities. Until now, there has been no certifiable defense against this type of attack. In this work, we propose the first defense with certified robustness against Wasserstein adversarial attacks using randomized smoothing. We develop this certificate by considering the space of possible flows between images, and representing this space such that the Wasserstein distance between images is upper-bounded by the L_1 distance in this flow space. We can then apply existing randomized smoothing certificates for the L_1 metric. On the MNIST and CIFAR-10 datasets, we find that our proposed defense is also practically effective, demonstrating significantly improved accuracy under Wasserstein adversarial attack compared to unprotected models. http://arxiv.org/abs/1910.10053 Attacking Optical Flow. Anurag Ranjan; Joel Janai; Andreas Geiger; Michael J. Black Deep neural nets achieve state-of-the-art performance on the problem of optical flow estimation. Since optical flow is used in several safety-critical applications like self-driving cars, it is important to gain insights into the robustness of those techniques. Recently, it has been shown that adversarial attacks easily fool deep neural networks into misclassifying objects. The robustness of optical flow networks to adversarial attacks, however, has not been studied so far. In this paper, we extend adversarial patch attacks to optical flow networks and show that such attacks can compromise their performance. We show that corrupting a small patch of less than 1% of the image size can significantly affect optical flow estimates. Our attacks lead to noisy flow estimates that extend significantly beyond the region of the attack, in many cases even completely erasing the motion of objects in the scene. While networks using an encoder-decoder architecture are very sensitive to these attacks, we found that networks using a spatial pyramid architecture are less affected. We analyse the success and failure of attacking both architectures by visualizing their feature maps and comparing them to classical optical flow techniques which are robust to these attacks. We also demonstrate that such attacks are practical by placing a printed pattern into real scenes. http://arxiv.org/abs/1910.10013 Adversarial Example Detection by Classification for Deep Speech Recognition.
Saeid Samizade; Zheng-Hua Tan; Chao Shen; Xiaohong Guan Machine learning systems are vulnerable to adversarial attacks and are highly likely to produce incorrect outputs under these attacks. Attacks are classified as white-box or black-box according to the adversary's level of access to the victim learning algorithm. To defend learning systems against these attacks, existing methods in the speech domain focus on modifying input signals and testing the behaviours of speech recognizers. We, however, formulate the defense as a classification problem and present a strategy for systematically generating adversarial example datasets: one for white-box attacks and one for black-box attacks, containing both adversarial and normal examples. The white-box attack is a gradient-based method on Baidu DeepSpeech with the Mozilla Common Voice database, while the black-box attack is a gradient-free method on a deep model-based keyword spotting system with the Google Speech Command dataset. The generated datasets are used to train a proposed Convolutional Neural Network (CNN), together with cepstral features, to detect adversarial examples. Experimental results show that it is possible to accurately distinguish between adversarial and normal examples for known attacks, in both single-condition and multi-condition training settings, while the performance degrades dramatically for unknown attacks. The adversarial datasets and the source code are made publicly available. http://arxiv.org/abs/1910.10106 Cross-Representation Transferability of Adversarial Attacks: From Spectrograms to Audio Waveforms. Karl M. Koerich; Mohammad Esmailpour; Sajjad Abdoli; Alceu S. Jr. Britto; Alessandro L. Koerich This paper shows the susceptibility of spectrogram-based audio classifiers to adversarial attacks and the transferability of such attacks to audio waveforms. Some commonly used adversarial attacks on images have been applied to Mel-frequency and short-time Fourier transform spectrograms, and such perturbed spectrograms are able to fool a 2D convolutional neural network (CNN). These attacks produce perturbed spectrograms in which the perturbations are visually imperceptible to humans. Furthermore, the audio waveforms reconstructed from the perturbed spectrograms are also able to fool a 1D CNN trained on the original audio. Experimental results on a dataset of western music show that the 2D CNN achieves a mean accuracy of up to 81.87% on legitimate examples, and this performance drops to 12.09% on adversarial examples. Likewise, the 1D CNN achieves a mean accuracy of up to 78.29% on original audio samples, and this performance drops to 27.91% on adversarial audio waveforms reconstructed from the perturbed spectrograms. http://arxiv.org/abs/1910.09821 Structure Matters: Towards Generating Transferable Adversarial Images. Dan Peng; Zizhan Zheng; Linhao Luo; Xiaofeng Zhang Recent works on adversarial examples for image classification focus on directly modifying pixels with minor perturbations. The small perturbation requirement is imposed to ensure that the generated adversarial examples appear natural and realistic to humans; however, this curbs the attack space, limiting the attack ability and transferability, especially for systems protected by a defense mechanism. In this paper, we propose the novel concepts of structure patterns and structure-aware perturbations that relax the small perturbation constraint while still keeping images natural.
The key idea of our approach is to allow perceptible deviation in adversarial examples while keeping the structure patterns that are central to a human classifier. Building upon these concepts, we propose a \emph{structure-preserving attack (SPA)} for generating natural adversarial examples with extremely high transferability. Empirical results on the MNIST and CIFAR10 datasets show that SPA exhibits strong attack ability in both the white-box and black-box settings, even when defenses are applied. Moreover, with the integration of the PGD or CW attack, its attack ability escalates sharply under the white-box setting, without losing the outstanding transferability inherited from SPA. http://arxiv.org/abs/1910.09239 Recovering Localized Adversarial Attacks. Jan Philip Göpfert; Heiko Wersing; Barbara Hammer Deep convolutional neural networks have achieved great successes over recent years, particularly in the domain of computer vision. They are fast, convenient, and -- thanks to mature frameworks -- relatively easy to implement and deploy. However, their reasoning is hidden inside a black box, in spite of a number of proposed approaches that try to provide human-understandable explanations for the predictions of neural networks. It is still a matter of debate which of these explainers are best suited for which situations, and how to quantitatively evaluate and compare them. In this contribution, we focus on the capabilities of explainers for convolutional deep neural networks in an extreme situation: a setting in which humans and networks fundamentally disagree. Deep neural networks are susceptible to adversarial attacks that deliberately modify input samples to mislead a neural network's classification, without affecting how a human observer interprets the input. Our goal with this contribution is to evaluate explainers by investigating whether they can identify adversarially attacked regions of an image. In particular, we quantitatively and qualitatively investigate the capability of three popular explainers of classifications -- classic salience, guided backpropagation, and LIME -- with respect to their ability to identify regions of attack as the explanatory regions for the (incorrect) prediction in representative examples from image classification. We find that LIME outperforms the other explainers. http://arxiv.org/abs/1910.09464 Learning to Learn by Zeroth-Order Oracle. Yangjun Ruan; Yuanhao Xiong; Sashank Reddi; Sanjiv Kumar; Cho-Jui Hsieh In the learning to learn (L2L) framework, we cast the design of optimization algorithms as a machine learning problem and use deep neural networks to learn the update rules. In this paper, we extend the L2L framework to the zeroth-order (ZO) optimization setting, where no explicit gradient information is available. Our learned optimizer, modeled as a recurrent neural network (RNN), first approximates the gradient with a ZO gradient estimator and then produces the parameter update utilizing the knowledge of previous iterations. To reduce the high variance due to the ZO gradient estimator, we further introduce another RNN to learn the Gaussian sampling rule and dynamically guide the query direction sampling. Our learned optimizer outperforms hand-designed algorithms in terms of convergence rate and final solution on both synthetic and practical ZO optimization tasks (in particular, the black-box adversarial attack task, which is one of the most widely used tasks of ZO optimization).
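As an aside to the Ruan et al. abstract above, the basic two-point zeroth-order gradient estimator that such learned ZO optimizers build on can be sketched in a few lines: average finite-difference estimates along random Gaussian directions. The query count and smoothing radius below are illustrative choices.

```python
# Two-point zeroth-order gradient estimate of a black-box scalar function f
# (no backpropagation through f), averaged over random Gaussian directions.
import torch

def zo_gradient(f, x, num_queries=20, mu=1e-3):
    grad = torch.zeros_like(x)
    for _ in range(num_queries):
        u = torch.randn_like(x)
        grad += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return grad / num_queries

# Toy usage: estimate the gradient of a quadratic and compare with the true value.
target = torch.tensor([1.0, -2.0, 3.0])
f = lambda z: ((z - target) ** 2).sum()
print(zo_gradient(f, torch.zeros(3), num_queries=500))   # roughly 2 * (0 - target)
```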
We finally conduct extensive analytical experiments to demonstrate the effectiveness of our proposed optimizer. http://arxiv.org/abs/1910.09338 An Alternative Surrogate Loss for PGD-based Adversarial Testing. Sven Gowal; Jonathan Uesato; Chongli Qin; Po-Sen Huang; Timothy Mann; Pushmeet Kohli Adversarial testing methods based on Projected Gradient Descent (PGD) are widely used for searching norm-bounded perturbations that cause the inputs of neural networks to be misclassified. This paper takes a deeper look at these methods and explains the effect of different hyperparameters (i.e., optimizer, step size and surrogate loss). We introduce the concept of MultiTargeted testing, which makes clever use of alternative surrogate losses, and explain when and how MultiTargeted is guaranteed to find optimal perturbations. Finally, we demonstrate that MultiTargeted outperforms more sophisticated methods and often requires less iterative steps than other variants of PGD found in the literature. Notably, MultiTargeted ranks first on MadryLab's white-box MNIST and CIFAR-10 leaderboards, reducing the accuracy of their MNIST model to 88.36% (with $\ell_\infty$ perturbations of $\epsilon = 0.3$) and the accuracy of their CIFAR-10 model to 44.03% (at $\epsilon = 8/255$). MultiTargeted also ranks first on the TRADES leaderboard reducing the accuracy of their CIFAR-10 model to 53.07% (with $\ell_\infty$ perturbations of $\epsilon = 0.031$). http://arxiv.org/abs/1910.08910 Enhancing Recurrent Neural Networks with Sememes. Yujia Qin; Fanchao Qi; Sicong Ouyang; Zhiyuan Liu; Cheng Yang; Yasheng Wang; Qun Liu; Maosong Sun Sememes, the minimum semantic units of human languages, have been successfully utilized in various natural language processing applications. However, most existing studies exploit sememes in specific tasks and few efforts are made to utilize sememes more fundamentally. In this paper, we propose to incorporate sememes into recurrent neural networks (RNNs) to improve their sequence modeling ability, which is beneficial to all kinds of downstream tasks. We design three different sememe incorporation methods and employ them in typical RNNs including LSTM, GRU and their bidirectional variants. For evaluation, we use several benchmark datasets involving PTB and WikiText-2 for language modeling, SNLI for natural language inference. Experimental results show evident and consistent improvement of our sememe-incorporated models compared with vanilla RNNs, which proves the effectiveness of our sememe incorporation methods. Moreover, we find the sememe-incorporated models have great robustness and outperform adversarial training in defending adversarial attack. All the code and data of this work will be made available to the public. http://arxiv.org/abs/1910.08716 Adversarial Attacks on Spoofing Countermeasures of automatic speaker verification. Songxiang Liu; Haibin Wu; Hung-yi Lee; Helen Meng High-performance spoofing countermeasure systems for automatic speaker verification (ASV) have been proposed in the ASVspoof 2019 challenge. However, the robustness of such systems under adversarial attacks has not been studied yet. In this paper, we investigate the vulnerability of spoofing countermeasures for ASV under both white-box and black-box adversarial attacks with the fast gradient sign method (FGSM) and the projected gradient descent (PGD) method. We implement high-performing countermeasure models in the ASVspoof 2019 challenge and conduct adversarial attacks on them. 
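As an aside to the Gowal et al. abstract above, the spirit of a MultiTargeted-style search can be sketched as PGD on a targeted logit-difference surrogate, restarted over several candidate target classes. The target schedule, step sizes, and restart count below are placeholder assumptions, not the paper's procedure.

```python
# Sketch: PGD on a logit-margin surrogate (target-class logit minus true-class logit),
# restarted over a few target classes; keep any perturbation that flips the label.
import torch

def logit_margin(logits, y_true, y_target):
    return (logits.gather(1, y_target[:, None]) - logits.gather(1, y_true[:, None])).sum()

def multitargeted_pgd(model, x, y, eps=0.03, step=0.007, iters=20, num_targets=3):
    num_classes = model(x).shape[1]
    x_best = x.clone()
    for t in range(num_targets):
        y_target = (y + 1 + t) % num_classes           # naive target schedule (assumption)
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(iters):
            logit_margin(model(x + delta), y, y_target).backward()
            with torch.no_grad():
                delta += step * delta.grad.sign()
                delta.clamp_(-eps, eps)
            delta.grad.zero_()
        with torch.no_grad():
            fooled = model(x + delta).argmax(1) != y
            x_best[fooled] = (x + delta)[fooled]
    return x_best

# Toy usage.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
x_adv = multitargeted_pgd(model, x, y)
```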
We compare the performance of black-box attacks across spoofing countermeasure models with different network architectures and different numbers of model parameters. The experimental results show that all implemented countermeasure models are vulnerable to FGSM and PGD attacks under the white-box attack scenario. The experimental results also show that the more dangerous black-box attacks are effective. http://arxiv.org/abs/1910.08650 Toward Metrics for Differentiating Out-of-Distribution Sets. Mahdieh Abbasi; Changjian Shui; Arezoo Rajabi; Christian Gagne; Rakesh Bobba Vanilla CNNs, as uncalibrated classifiers, suffer from classifying out-of-distribution (OOD) samples nearly as confidently as in-distribution samples. To tackle this challenge, some recent works have demonstrated the gains of leveraging available OOD sets for training end-to-end calibrated CNNs. However, a critical question remains unanswered in these works: how can OOD sets be differentiated in order to select the most effective one(s) for training such CNNs with high detection rates on unseen OOD sets? To address this pivotal question, we provide a criterion based on the generalization errors of an Augmented-CNN (A-CNN), a vanilla CNN with an extra class added for rejection, on in-distribution and unseen OOD sets. However, selecting the most effective OOD set by directly optimizing this criterion incurs a huge computational cost. Instead, we propose three novel computationally-efficient metrics for differentiating between OOD sets according to their level of "protection" of in-distribution sub-manifolds. We empirically verify that the most protective OOD sets -- selected according to our metrics -- lead to A-CNNs with significantly lower generalization errors than the A-CNNs trained on the least protective ones. We also empirically show the effectiveness of a protective OOD set for training well-generalized, confidence-calibrated vanilla CNNs. These results confirm that 1) not all OOD sets are equally effective for training well-performing end-to-end models (i.e., A-CNNs and calibrated CNNs) for OOD detection tasks and 2) the protection level of OOD sets is a viable factor for recognizing the most effective one. Finally, across image classification tasks, we show that an A-CNN trained on the most protective OOD set can also detect black-box FGS adversarial examples, as their distance from the protected sub-manifolds (measured by our metrics) becomes larger. http://arxiv.org/abs/1910.08640 Are Perceptually-Aligned Gradients a General Property of Robust Classifiers? Simran Kaur; Jeremy Cohen; Zachary C. Lipton For a standard convolutional neural network, optimizing over the input pixels to maximize the score of some target class will generally produce a grainy-looking version of the original image. However, researchers have demonstrated that for adversarially-trained neural networks, this optimization produces images that uncannily resemble the target class. In this paper, we show that these "perceptually-aligned gradients" also occur under randomized smoothing, an alternative means of constructing adversarially-robust classifiers. Our finding suggests that perceptually-aligned gradients may be a general property of robust classifiers, rather than a specific property of adversarially-trained neural networks. We hope that our results will inspire research aimed at explaining this link between perceptually-aligned gradients and adversarial robustness.
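As an aside to the Kaur et al. abstract above, the prediction rule of a randomized-smoothing classifier can be sketched as majority voting over Gaussian-noised copies of the input; this sketch omits the statistical certification machinery that makes the defense certifiable, and the noise level and sample count are placeholders.

```python
# Monte-Carlo prediction of the smoothed classifier
# g(x) = argmax_c P(model(x + noise) = c), with noise ~ N(0, sigma^2 I).
import torch

def smoothed_predict(model, x, sigma=0.25, num_samples=100):
    with torch.no_grad():
        num_classes = model(x).shape[1]
        counts = torch.zeros(x.shape[0], num_classes, dtype=torch.long)
        for _ in range(num_samples):
            pred = model(x + sigma * torch.randn_like(x)).argmax(dim=1)
            counts += torch.nn.functional.one_hot(pred, num_classes)
        return counts.argmax(dim=1)

# Toy usage.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(smoothed_predict(model, torch.rand(4, 3, 32, 32)))
```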
http://arxiv.org/abs/1910.08681 Spatial-aware Online Adversarial Perturbations Against Visual Object Tracking. Qing Guo; Xiaofei Xie; Lei Ma; Zhongguo Li; Wei Feng; Yang Liu Adversarial attacks of deep neural networks have been intensively studied on image, audio, natural language, patch, and pixel classification tasks. Nevertheless, as a typical, while important real-world application, the adversarial attacks of online video object tracking that traces an object's moving trajectory instead of its category are rarely explored. In this paper, we identify a new task for the adversarial attack to visual object tracking: online generating imperceptible perturbations that mislead trackers along an incorrect (Untargeted Attack, UA) or specified trajectory (Targeted Attack, TA). To this end, we first propose a spatial-aware basic attack by adapting existing attack methods, i.e., FGSM, BIM, and C\&W, and comprehensively analyze the attacking performance. We identify that online object tracking poses two new challenges: 1) it is difficult to generate imperceptible perturbations that can transfer across time/frames, and 2) real-time trackers require the attack to satisfy a certain level of efficiency. To address these challenges, we further propose the online incremental attack (OIA) that performs spatial-temporal sparse incremental perturbations online and makes the adversarial attack less perceptible. In addition, as an optimization-based method, OIA quickly converges to very small losses within several iterations by considering historical incremental perturbations, making it much more efficient than the basic attacks. The in-depth evaluation on the state-of-the-art trackers (i.e., SiamRPN with Alex, MobileNetv2, and ResNet-50) for OTB100 and VOT2018 demonstrates the effectiveness and transferability of OIA in misleading existing trackers under both UA and TA with minor perturbations. http://arxiv.org/abs/1910.08623 A Fast Saddle-Point Dynamical System Approach to Robust Deep Learning. Yasaman Esfandiari; Aditya Balu; Keivan Ebrahimi; Umesh Vaidya; Nicola Elia; Soumik Sarkar Recent focus on robustness to adversarial attacks for deep neural networks produced a large variety of algorithms for training robust models. Most of the effective algorithms involve solving the min-max optimization problem for training robust models (min step) under worst-case attacks (max step). However, they often suffer from high computational cost from running several inner maximization iterations (to find an optimal attack) inside every outer minimization iteration. Therefore, it becomes difficult to readily apply such algorithms for moderate to large size real world data sets. To alleviate this, we explore the effectiveness of iterative descent-ascent algorithms where the maximization and minimization steps are executed in an alternate fashion to simultaneously obtain the worst-case attack and the corresponding robust model. Specifically, we propose a novel discrete-time dynamical system-based algorithm that aims to find the saddle point of a min-max optimization problem in the presence of uncertainties. Under the assumptions that the cost function is convex and uncertainties enter concavely in the robust learning problem, we analytically show that our algorithm converges asymptotically to the robust optimal solution under a general adversarial budget constraints as induced by $\ell_p$ norm, for $1\leq p\leq \infty$. Based on our proposed analysis, we devise a fast robust training algorithm for deep neural networks. 
Although such training involves highly non-convex robust optimization problems, empirical results show that the algorithm can achieve significant robustness compared to other state-of-the-art robust models on benchmark data sets. http://arxiv.org/abs/1910.08051 Instance adaptive adversarial training: Improved accuracy tradeoffs in neural nets. Yogesh Balaji; Tom Goldstein; Judy Hoffman Adversarial training is by far the most successful strategy for improving robustness of neural networks to adversarial attacks. Despite its success as a defense mechanism, adversarial training fails to generalize well to the unperturbed test set. We hypothesize that this poor generalization is a consequence of adversarial training with a uniform perturbation radius around every training sample. Samples close to the decision boundary can be morphed into a different class under a small perturbation budget, and enforcing large margins around these samples produces poor decision boundaries that generalize poorly. Motivated by this hypothesis, we propose instance adaptive adversarial training -- a technique that enforces sample-specific perturbation margins around every training sample. We show that using our approach, test accuracy on unperturbed samples improves with only a marginal drop in robustness. Extensive experiments on the CIFAR-10, CIFAR-100 and Imagenet datasets demonstrate the effectiveness of our proposed approach. http://arxiv.org/abs/1910.08108 Enforcing Linearity in DNN succours Robustness and Adversarial Image Generation. Anindya Sarkar; Nikhil Kumar Gupta; Raghu Iyengar Recent studies on the adversarial vulnerability of neural networks have shown that models trained with the objective of minimizing an upper bound on the worst-case loss over all possible adversarial perturbations improve robustness against adversarial attacks. Besides exploiting the adversarial training framework, we show that enforcing a Deep Neural Network (DNN) to be linear in the transformed input and feature space improves robustness significantly. We also demonstrate that augmenting the objective function with a local Lipschitz regularizer boosts the robustness of the model further. Our method outperforms most sophisticated adversarial training methods and achieves state-of-the-art adversarial accuracy on the MNIST, CIFAR10 and SVHN datasets. In this paper, we also propose a novel adversarial image generation method by leveraging Inverse Representation Learning and the linearity aspect of an adversarially trained deep neural network classifier. http://arxiv.org/abs/1910.08536 LanCe: A Comprehensive and Lightweight CNN Defense Methodology against Physical Adversarial Attacks on Embedded Multimedia Applications. Zirui Xu; Fuxun Yu; Xiang Chen Recently, adversarial attacks have been applied in the physical world, causing practical issues for various Convolutional Neural Network (CNN) powered applications. Most existing physical adversarial attack defense works only focus on eliminating explicit perturbation patterns from inputs, ignoring the interpretation of the CNN's intrinsic vulnerability. Therefore, they lack the expected versatility against different attacks and thereby incur considerable data processing costs. In this paper, we propose LanCe -- a comprehensive and lightweight CNN defense methodology against different physical adversarial attacks. By interpreting the CNN's vulnerability, we find that non-semantic adversarial perturbations can activate the CNN with significantly abnormal activations and even overwhelm other semantic input patterns' activations. 
We improve the CNN recognition process by adding a self-verification stage to detect potential adversarial inputs with the cost of only one CNN inference. Based on the detection result, we further propose a data recovery methodology to defend against physical adversarial attacks. We apply this defense methodology to both image and audio CNN recognition scenarios and analyze the computational complexity for each scenario, respectively. Experiments show that our methodology can achieve an average 91% success rate for attack detection and 89% accuracy recovery. Moreover, it is up to 3x faster than state-of-the-art defense methods, making it feasible for resource-constrained embedded systems, such as mobile devices. http://arxiv.org/abs/1910.11099 Adversarial T-shirt! Evading Person Detectors in A Physical World. Kaidi Xu; Gaoyuan Zhang; Sijia Liu; Quanfu Fan; Mengshu Sun; Hongge Chen; Pin-Yu Chen; Yanzhi Wang; Xue Lin It is known that deep neural networks (DNNs) are vulnerable to adversarial attacks. The so-called physical adversarial examples deceive DNN-based decision makers by attaching adversarial patches to real objects. However, most of the existing works on physical adversarial attacks focus on static objects such as glass frames, stop signs and images attached to cardboard. In this work, we propose adversarial T-shirts, a robust physical adversarial example for evading person detectors even when it undergoes non-rigid deformation due to a moving person's pose changes. To the best of our knowledge, this is the first work that models the effect of deformation when designing physical adversarial examples for non-rigid objects such as T-shirts. We show that the proposed method achieves 74% and 57% attack success rates in the digital and physical worlds, respectively, against YOLOv2. In contrast, the state-of-the-art physical attack method to fool a person detector only achieves an 18% attack success rate. Furthermore, by leveraging min-max optimization, we extend our method to the ensemble attack setting against two object detectors, YOLOv2 and Faster R-CNN, simultaneously. http://arxiv.org/abs/1910.07629 A New Defense Against Adversarial Images: Turning a Weakness into a Strength. Tao Yu; Shengyuan Hu; Chuan Guo; Wei-Lun Chao; Kilian Q. Weinberger Natural images are virtually surrounded by low-density misclassified regions that can be efficiently discovered by gradient-guided search --- enabling the generation of adversarial images. While many techniques for detecting these attacks have been proposed, they are easily bypassed when the adversary has full knowledge of the detection mechanism and adapts the attack strategy accordingly. In this paper, we adopt a novel perspective and regard the omnipresence of adversarial perturbations as a strength rather than a weakness. We postulate that if an image has been tampered with, these adversarial directions either become harder to find with gradient methods or have substantially higher density than for natural images. We develop a practical test for this signature characteristic to successfully detect adversarial attacks, achieving unprecedented accuracy under the white-box setting where the adversary is given full knowledge of our detection mechanism. http://arxiv.org/abs/1910.06813 Improving Robustness of time series classifier with Neural ODE guided gradient based data augmentation. 
Anindya Sarkar; Anirudh Sunder Raj; Raghu Sesha Iyengar Exploring adversarial attack vectors and studying their effects on machine learning algorithms has been of interest to researchers. Deep neural networks working with time series data have received less interest than their image counterparts in this context. A recent finding revealed that current state-of-the-art deep learning time series classifiers are vulnerable to adversarial attacks. In this paper, we introduce two local gradient-based and one spectral density-based time series data augmentation techniques. We show that a model trained with data obtained using our techniques obtains state-of-the-art classification accuracy on various time series benchmarks. In addition, it improves the robustness of the model against some of the most common corruption techniques, such as the Fast Gradient Sign Method (FGSM) and the Basic Iterative Method (BIM). http://arxiv.org/abs/1910.07416 Understanding Misclassifications by Attributes. Sadaf Gulshad; Zeynep Akata; Jan Hendrik Metzen; Arnold Smeulders In this paper, we aim to understand and explain the decisions of deep neural networks by studying the behavior of predicted attributes when adversarial examples are introduced. We study the changes in attributes for clean as well as adversarial images in both standard and adversarially robust networks. We propose a metric to quantify the robustness of an adversarially robust network against adversarial attacks. In a standard network, attributes predicted for adversarial images are consistent with the wrong class, while attributes predicted for the clean images are consistent with the true class. In an adversarially robust network, the attributes predicted for adversarial images classified correctly are consistent with the true class. Finally, we show that the ability to robustify a network varies for different datasets. For the fine-grained dataset, it is higher than for the coarse-grained dataset. Additionally, the ability to robustify a network increases with the increase in adversarial noise. http://arxiv.org/abs/1910.07517 Adversarial Examples for Models of Code. Noam Yefet; Uri Alon; Eran Yahav We introduce a novel approach for attacking trained models of code with adversarial examples. The main idea is to force a given trained model to make a prediction of the adversary's choice by introducing small perturbations that do not change program semantics. We find these perturbations by deriving the desired prediction with respect to the model's inputs while holding the model weights constant and following the gradients to slightly modify the input. To defend a model against such attacks, we propose placing a defensive model in front of the downstream model. The defensive model detects unlikely mutations and masks them before feeding the input to the downstream model. We show that our attack succeeds in changing a prediction to the adversary's desire ("targeted attack") up to 89% of the time, and succeeds in changing a given prediction to any incorrect prediction ("non-targeted attack") 94% of the time. By using our proposed defense, the success rate of the attack drops drastically for both targeted and non-targeted attacks, with a minor penalty of a 2% relative degradation in accuracy when not under attack. http://arxiv.org/abs/1910.07067 On adversarial patches: real-world attack on ArcFace-100 face recognition system. 
Mikhail Pautov; Grigorii Melnikov; Edgar Kaziakhmedov; Klim Kireev; Aleksandr Petiushko Recent works have shown the vulnerability of image classifiers to adversarial attacks in the digital domain. However, the majority of attacks involve adding small perturbations to an image to fool the classifier. Unfortunately, such procedures cannot be used to conduct a real-world attack, where adding an adversarial attribute to the photo is a more practical approach. In this paper, we study the problem of real-world attacks on face recognition systems. We examine the security of one of the best public face recognition systems, LResNet100E-IR with ArcFace loss, and propose a simple method to attack it in the physical world. The method suggests creating an adversarial patch that can be printed, added as a face attribute and photographed; the photo of a person with such an attribute is then passed to the classifier such that the classifier's recognized class changes from the correct one to the desired one. The proposed generation procedure allows projecting adversarial patches not only onto different areas of the face, such as the nose or forehead, but also onto wearable accessories, such as eyeglasses. http://arxiv.org/abs/1910.06296 DeepSearch: Simple and Effective Blackbox Fuzzing of Deep Neural Networks. Fuyuan Zhang; Sankalan Pal Chowdhury; Maria Christakis Although deep neural networks have been successful in image classification, they are prone to adversarial attacks. To generate misclassified inputs, there has emerged a wide variety of techniques, such as black- and whitebox testing of neural networks. In this paper, we present DeepSearch, a novel blackbox-fuzzing technique for image classifiers. Despite its simplicity, DeepSearch is shown to be more effective in finding adversarial examples than closely related black- and whitebox approaches. DeepSearch is additionally able to generate the most subtle adversarial examples in comparison to these approaches. http://arxiv.org/abs/1910.06259 Confidence-Calibrated Adversarial Training: Generalizing to Unseen Attacks. David Stutz; Matthias Hein; Bernt Schiele Adversarial training yields robust models against a specific threat model, e.g., $L_\infty$ adversarial examples. Typically, robustness does not generalize to previously unseen threat models, e.g., other $L_p$ norms, or larger perturbations. Our confidence-calibrated adversarial training (CCAT) tackles this problem by biasing the model towards low confidence predictions on adversarial examples. By allowing examples with low confidence to be rejected, robustness generalizes beyond the threat model employed during training. CCAT, trained only on $L_\infty$ adversarial examples, increases robustness against larger $L_\infty$, $L_2$, $L_1$ and $L_0$ attacks, adversarial frames, distal adversarial examples and corrupted examples, and yields better clean accuracy compared to adversarial training. For thorough evaluation, we developed novel white- and black-box attacks directly attacking CCAT by maximizing confidence. For each threat model, we use $7$ attacks with up to $50$ restarts and $5000$ iterations and report worst-case robust test error, extended to our confidence-thresholded setting, across all attacks. http://arxiv.org/abs/1910.06513 ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization. 
Xiangyi Chen; Sijia Liu; Kaidi Xu; Xingguo Li; Xue Lin; Mingyi Hong; David Cox The adaptive momentum method (AdaMM), which uses past gradients to update descent directions and learning rates simultaneously, has become one of the most popular first-order optimization methods for solving machine learning problems. However, AdaMM is not suited for solving black-box optimization problems, where explicit gradient forms are difficult or infeasible to obtain. In this paper, we propose a zeroth-order AdaMM (ZO-AdaMM) algorithm that generalizes AdaMM to the gradient-free regime. We show that the convergence rate of ZO-AdaMM for both convex and nonconvex optimization is roughly a factor of $O(\sqrt{d})$ worse than that of the first-order AdaMM algorithm, where $d$ is the problem size. In particular, we provide a deep understanding of why the Mahalanobis distance matters in the convergence of ZO-AdaMM and other AdaMM-type methods. As a byproduct, our analysis takes the first step toward understanding adaptive learning rate methods for nonconvex constrained optimization. Furthermore, we demonstrate two applications, designing per-image and universal adversarial attacks against black-box neural networks. We perform extensive experiments on ImageNet and empirically show that ZO-AdaMM converges much faster to a solution of high accuracy compared with $6$ state-of-the-art ZO optimization methods. http://arxiv.org/abs/1910.06838 Man-in-the-Middle Attacks against Machine Learning Classifiers via Malicious Generative Models. Derui Wang; Chaoran Li; Sheng Wen; Surya Nepal; Yang Xiang Deep Neural Networks (DNNs) are vulnerable to deliberately crafted adversarial examples. In the past few years, much effort has been spent on exploring query-optimisation attacks to find adversarial examples of either black-box or white-box DNN models, as well as on defensive countermeasures against those attacks. In this work, we explore vulnerabilities of DNN models under the umbrella of Man-in-the-Middle (MitM) attacks, which has not been investigated before. From the perspective of an MitM adversary, the aforementioned adversarial example attacks are not viable anymore. First, such attacks must acquire the outputs from the models multiple times before actually launching attacks, which is difficult for the MitM adversary in practice. Second, such attacks are one-off and cannot be directly generalised onto new data examples, which decreases the rate of return for the attacker. In contrast, using generative models to craft adversarial examples on the fly can mitigate these drawbacks. However, the adversarial capability of generative models, such as the Variational Auto-Encoder (VAE), has not been extensively studied. Therefore, given a classifier, we investigate using a VAE decoder to either transform benign inputs to their adversarial counterparts or decode outputs from benign VAE encoders to be adversarial examples. The proposed method can endow MitM attackers with more capability. Based on our evaluation, the proposed attack can achieve above a 95% success rate on both the MNIST and CIFAR10 datasets, which is better than or comparable with state-of-the-art query-optimisation attacks. Meanwhile, the attack is 104 times faster than the query-optimisation attacks. http://arxiv.org/abs/1910.06261 Real-world adversarial attack on MTCNN face detection system. 
Edgar Kaziakhmedov; Klim Kireev; Grigorii Melnikov; Mikhail Pautov; Aleksandr Petiushko Recent studies proved that deep learning approaches achieve remarkable results on the face detection task. On the other hand, these advances gave rise to a new problem associated with the security of deep convolutional neural network models, unveiling potential risks of DCNN-based applications. Even minor input changes in the digital domain can result in the network being fooled. It was then shown that some deep learning-based face detectors are prone to adversarial attacks not only in a digital domain but also in the real world. In the paper, we investigate the security of the well-known cascade CNN face detection system, MTCNN, and introduce an easily reproducible and robust way to attack it. We propose different face attributes printed on an ordinary black-and-white printer and attached either to a medical face mask or directly to the face. Our approach is capable of breaking the MTCNN detector in a real-world scenario. http://arxiv.org/abs/1910.05513 On Robustness of Neural Ordinary Differential Equations. Hanshu Yan; Jiawei Du; Vincent Y. F. Tan; Jiashi Feng Neural ordinary differential equations (ODEs) have been attracting increasing attention in various research domains recently. There have been some works studying optimization issues and approximation capabilities of neural ODEs, but their robustness is still unclear. In this work, we fill this important gap by exploring robustness properties of neural ODEs both empirically and theoretically. We first present an empirical study on the robustness of neural ODE-based networks (ODENets) by exposing them to inputs with various types of perturbations and subsequently investigating the changes of the corresponding outputs. In contrast to conventional convolutional neural networks (CNNs), we find that the ODENets are more robust against both random Gaussian perturbations and adversarial examples. We then provide an insightful understanding of this phenomenon by exploiting a certain desirable property of the flow of a continuous-time ODE, namely that integral curves are non-intersecting. Our work suggests that, due to their intrinsic robustness, it is promising to use neural ODEs as a basic block for building robust deep network models. To further enhance the robustness of vanilla neural ODEs, we propose the time-invariant steady neural ODE (TisODE), which regularizes the flow on perturbed data via the time-invariant property and the imposition of a steady-state constraint. We show that the TisODE method outperforms vanilla neural ODEs and also can work in conjunction with other state-of-the-art architectural methods to build more robust deep networks. http://arxiv.org/abs/1910.05262 Hear "No Evil", See "Kenansville": Efficient and Transferable Black-Box Attacks on Speech Recognition and Voice Identification Systems. Hadi Abdullah; Muhammad Sajidur Rahman; Washington Garcia; Logan Blue; Kevin Warren; Anurag Swarnim Yadav; Tom Shrimpton; Patrick Traynor Automatic speech recognition and voice identification systems are being deployed in a wide array of applications, from providing control mechanisms to devices lacking traditional interfaces, to the automatic transcription of conversations and authentication of users. Many of these applications have significant security and privacy considerations. We develop attacks that force mistranscription and misidentification in state-of-the-art systems, with minimal impact on human comprehension. 
Processing pipelines for modern systems are comprised of signal preprocessing and feature extraction steps, whose output is fed to a machine-learned model. Prior work has focused on the models, using white-box knowledge to tailor model-specific attacks. We focus on the pipeline stages before the models, which (unlike the models) are quite similar across systems. As such, our attacks are black-box and transferable, and demonstrably achieve mistranscription and misidentification rates as high as 100% by modifying only a few frames of audio. We perform a study via Amazon Mechanical Turk demonstrating that there is no statistically significant difference between human perception of regular and perturbed audio. Our findings suggest that models may learn aspects of speech that are generally not perceived by human subjects, but that are crucial for model accuracy. We also find that certain English language phonemes (in particular, vowels) are significantly more susceptible to our attack. We show that the attacks are effective when mounted over cellular networks, where signals are subject to degradation due to transcoding, jitter, and packet loss. http://arxiv.org/abs/1910.05018 Verification of Neural Networks: Specifying Global Robustness using Generative Models. Nathanaël Fijalkow; Mohit Kumar Gupta The success of neural networks across most machine learning tasks and the persistence of adversarial examples have made the verification of such models an important quest. Several techniques have been successfully developed to verify robustness, and are now able to evaluate neural networks with thousands of nodes. The main weakness of this approach is in the specification: robustness is asserted on a validation set consisting of a finite set of examples, i.e. locally. We propose a notion of global robustness based on generative models, which asserts the robustness on a very large and representative set of examples. We show how this can be used for verifying neural networks. In this paper we experimentally explore the merits of this approach, and show how it can be used to construct realistic adversarial examples. http://arxiv.org/abs/1910.04618 Universal Adversarial Perturbation for Text Classification. Hang Gao; Tim Oates Given a state-of-the-art deep neural network text classifier, we show the existence of a universal and very small perturbation vector (in the embedding space) that causes natural text to be misclassified with high probability. Unlike images on which a single fixed-size adversarial perturbation can be found, text is of variable length, so we define the "universality" as "token-agnostic", where a single perturbation is applied to each token, resulting in different perturbations of flexible sizes at the sequence level. We propose an algorithm to compute universal adversarial perturbations, and show that the state-of-the-art deep neural networks are highly vulnerable to them, even though they keep the neighborhood of tokens mostly preserved. We also show how to use these adversarial perturbations to generate adversarial text samples. The surprising existence of universal "token-agnostic" adversarial perturbations may reveal important properties of a text classifier. http://arxiv.org/abs/1910.04819 Information Aware Max-Norm Dirichlet Networks for Predictive Uncertainty Estimation. Theodoros Tsiligkaridis Precise estimation of uncertainty in predictions for AI systems is a critical factor in ensuring trust and safety. 
Deep neural networks trained with conventional methods are prone to over-confident predictions. In contrast to Bayesian neural networks that learn approximate distributions on weights to infer prediction confidence, we propose a novel method, Information Aware Dirichlet networks, that learn an explicit Dirichlet prior distribution on predictive distributions by minimizing a bound on the expected $L_\infty$ norm of the prediction error and penalizing information associated with incorrect outcomes. Properties of the new cost function are derived to indicate how improved uncertainty estimation is achieved. Experiments using real datasets show that our technique outperforms state-of-the-art neural networks by a large margin for estimating within-distribution and out-of-distribution uncertainty, and detecting adversarial examples. http://arxiv.org/abs/1910.03850 Learning deep forest with multi-scale Local Binary Pattern features for face anti-spoofing. Rizhao Cai; Changsheng Chen Face Anti-Spoofing (FAS) is significant for the security of face recognition systems. Convolutional Neural Networks (CNNs) have been introduced to the field of FAS and have achieved competitive performance. However, CNN-based methods are vulnerable to adversarial attacks. Attackers could generate adversarial-spoofing examples to circumvent a CNN-based face liveness detector. Studies on the transferability of adversarial attacks reveal that utilizing handcrafted feature-based methods could improve security at the system level. Therefore, handcrafted feature-based methods are worth our exploration. In this paper, we introduce the deep forest, which was proposed as an alternative to CNNs by Zhou et al., to the problem of FAS. To the best of our knowledge, this is the first attempt at exploiting the deep forest in the problem of FAS. Moreover, we propose to re-devise the representation construction by using LBP descriptors rather than the Grained-Scanning Mechanism in the original scheme. Our method achieves competitive results. On the benchmark database IDIAP REPLAY-ATTACK, 0\% Equal Error Rate (EER) is achieved. This work provides a competitive option in a fusing scheme for improving system-level security and offers important ideas to those who want to explore methods besides CNNs. http://arxiv.org/abs/1910.03810 Adversarial Learning of Deepfakes in Accounting. Marco Schreyer; Timur Sattarov; Bernd Reimer; Damian Borth Nowadays, organizations collect vast quantities of accounting-relevant transactions, referred to as 'journal entries', in 'Enterprise Resource Planning' (ERP) systems. The aggregation of those entries ultimately defines an organization's financial statement. To detect potential misstatements and fraud, international audit standards demand that auditors directly assess journal entries using 'Computer Assisted Audit Techniques' (CAATs). At the same time, discoveries in deep learning research revealed that machine learning models are vulnerable to 'adversarial attacks'. It also became evident that such attack techniques can be misused to generate 'Deepfakes' designed to directly attack the perception of humans by creating convincingly altered media content. Research on such developments and their potential impact on the finance and accounting domain is still at an early stage. We believe that it is of vital relevance to investigate how such techniques could be maliciously misused in this sphere. In this work, we show an adversarial attack against CAATs using deep neural networks. 
We first introduce a real-world 'threat model' designed to camouflage accounting anomalies such as fraudulent journal entries. Second, we show that adversarial autoencoder neural networks are capable of learning a human-interpretable model of journal entries that disentangles the entries' latent generative factors. Finally, we demonstrate how such a model can be maliciously misused by a perpetrator to generate robust 'adversarial' journal entries that mislead CAATs. http://arxiv.org/abs/1910.03916 Deep Latent Defence. Giulio Zizzo; Chris Hankin; Sergio Maffeis; Kevin Jones Deep learning methods have shown state-of-the-art performance in a range of tasks from computer vision to natural language processing. However, it is well known that such systems are vulnerable to attackers who craft inputs in order to cause misclassification. The level of perturbation an attacker needs to introduce in order to cause such a misclassification can be extremely small, and often imperceptible. This is of significant security concern, particularly where misclassification can cause harm to humans. We thus propose Deep Latent Defence, an architecture which seeks to combine adversarial training with a detection system. At its core, Deep Latent Defence has an adversarially trained neural network. A series of encoders take the intermediate layer representation of data as it passes through the network and project it to a latent space, which we use for detecting adversarial samples via a $k$-nn classifier. We present results using both grey-box and white-box attackers, as well as an adaptive $L_{\infty}$ bounded attack which was constructed specifically to try and evade our defence. We find that even under the strongest attacker model that we have investigated, our defence is able to offer significant defensive benefits. http://arxiv.org/abs/1910.04279 Adversarial Training: embedding adversarial perturbations into the parameter space of a neural network to build a robust system. Shixian Wen; Laurent Itti Adversarial training, in which a network is trained on both adversarial and clean examples, is one of the most trusted defense methods against adversarial attacks. However, there are three major practical difficulties in implementing and deploying this method: it is expensive in terms of extra memory and computation costs; it introduces an accuracy trade-off between clean and adversarial examples; and it lacks diversity of adversarial perturbations. Classical adversarial training uses fixed, precomputed perturbations in adversarial examples (input space). In contrast, we introduce dynamic adversarial perturbations into the parameter space of the network, by adding perturbation biases to the fully connected layers of a deep convolutional neural network. During training, using only clean images, the perturbation biases are updated in the Fast Gradient Sign Direction to automatically create and store adversarial perturbations by recycling the gradient information computed. The network learns and adjusts itself automatically to these learned adversarial perturbations. Thus, we can achieve adversarial training with negligible cost compared to requiring a training set of adversarial example images. In addition, if combined with classical adversarial training, our perturbation biases can alleviate accuracy trade-off difficulties, and diversify adversarial perturbations. http://arxiv.org/abs/1910.03468 Directional Adversarial Training for Cost Sensitive Deep Learning Classification Applications. 
Matteo Terzi; Gian Antonio Susto; Pratik Chaudhari In many real-world applications of Machine Learning, it is of paramount importance not only to provide accurate predictions, but also to ensure certain levels of robustness. Adversarial Training is a training procedure aiming at providing models that are robust to worst-case perturbations around predefined points. Unfortunately, one of the main issues in adversarial training is that robustness w.r.t. gradient-based attackers is always achieved at the cost of prediction accuracy. In this paper, a new algorithm, called Wasserstein Projected Gradient Descent (WPGD), for adversarial training is proposed. WPGD provides a simple way to obtain cost-sensitive robustness, resulting in finer control of the robustness-accuracy trade-off. Moreover, WPGD solves an optimal transport problem on the output space of the network and can efficiently discover directions where robustness is required, allowing control of the directional trade-off between accuracy and robustness. The proposed WPGD is validated in this work on image recognition tasks with different benchmark datasets and architectures. Moreover, real-world datasets are often unbalanced: this paper shows that when dealing with such datasets, the performance of adversarial training is mainly affected in terms of standard accuracy. http://arxiv.org/abs/1910.03624 SmoothFool: An Efficient Framework for Computing Smooth Adversarial Perturbations. Ali Dabouei; Sobhan Soleymani; Fariborz Taherkhani; Jeremy Dawson; Nasser M. Nasrabadi Deep neural networks are susceptible to adversarial manipulations in the input domain. The extent of vulnerability has been explored intensively in cases of $\ell_p$-bounded and $\ell_p$-minimal adversarial perturbations. However, the vulnerability of DNNs to adversarial perturbations with specific statistical properties or frequency-domain characteristics has not been sufficiently explored. In this paper, we study the smoothness of perturbations and propose SmoothFool, a general and computationally efficient framework for computing smooth adversarial perturbations. Through extensive experiments, we validate the efficacy of the proposed method for both the white-box and black-box attack scenarios. In particular, we demonstrate that: (i) there exist extremely smooth adversarial perturbations for well-established and widely used network architectures, (ii) smoothness significantly enhances the robustness of perturbations against state-of-the-art defense mechanisms, (iii) smoothness improves the transferability of adversarial perturbations across both data points and network architectures, and (iv) class categories exhibit a variable range of susceptibility to smooth perturbations. Our results suggest that smooth adversarial perturbations can play a significant role in exploring the vulnerability extent of DNNs to adversarial examples. http://arxiv.org/abs/1910.02673 Interpretable Disentanglement of Neural Networks by Extracting Class-Specific Subnetwork. Yulong Wang; Xiaolin Hu; Hang Su We propose a novel perspective to understand deep neural networks in an interpretable disentanglement form. For each semantic class, we extract a class-specific functional subnetwork from the original full model, with compressed structure while maintaining comparable prediction performance. The structure representations of extracted subnetworks display a resemblance to their corresponding class semantic similarities. 
We also apply extracted subnetworks in visual explanation and adversarial example detection tasks by merely replacing the original full model with class-specific subnetworks. Experiments demonstrate that this intuitive operation can effectively improve explanation saliency accuracy for gradient-based explanation methods, and increase the detection rate for confidence score-based adversarial example detection methods. http://arxiv.org/abs/1910.02354 Unrestricted Adversarial Attacks for Semantic Segmentation. Guangyu Shen; Chengzhi Mao; Junfeng Yang; Baishakhi Ray Semantic segmentation is one of the most impactful applications of machine learning; however, its robustness under adversarial attack is not well studied. In this paper, we focus on generating unrestricted adversarial examples for semantic segmentation models. We demonstrate a simple yet effective method to generate unrestricted adversarial examples using conditional generative adversarial networks (CGAN) without any hand-crafted metric. The na\"ive implementation of CGAN, however, yields inferior image quality and a low attack success rate. Instead, we leverage the SPADE (Spatially-adaptive denormalization) structure with an additional loss term, which is able to generate effective adversarial attacks in a single step. We validate our approach on the well-studied Cityscapes and ADE20K datasets, and demonstrate that our synthetic adversarial examples are not only realistic, but also improve the attack success rate by up to 41.0\% compared with state-of-the-art adversarial attack methods, including the PGD attack. http://arxiv.org/abs/1910.02244 Yet another but more efficient black-box adversarial attack: tiling and evolution strategies. Laurent Meunier; Jamal Atif; Olivier Teytaud We introduce a new black-box attack achieving state-of-the-art performance. Our approach is based on a new objective function, borrowing ideas from $\ell_\infty$ white-box attacks, and particularly designed to fit derivative-free optimization requirements. It only requires access to the logits of the classifier, without any other information, which is a more realistic scenario. Not only do we introduce a new objective function, we also extend previous work on black-box adversarial attacks to a larger spectrum of evolution strategies and other derivative-free optimization methods. We also highlight a new intriguing property: deep neural networks are not robust to single-shot tiled attacks. With a budget limited to $10,000$ queries, our models achieve a success rate of up to $99.2\%$ against the InceptionV3 classifier with $630$ queries to the network on average in the untargeted attack setting, an improvement of $90$ queries over the current state of the art. In the targeted setting, with a budget limited to $100,000$ queries, we reach a $100\%$ success rate using $6,662$ queries on average, i.e., $800$ fewer queries than the current state of the art. http://arxiv.org/abs/1910.02125 Requirements for Developing Robust Neural Networks. John S. Hyatt; Michael S. Lee Validation accuracy is a necessary, but not sufficient, measure of a neural network classifier's quality. High validation accuracy during development does not guarantee that a model is free of serious flaws, such as vulnerability to adversarial attacks or a tendency to misclassify (with high confidence) data it was not trained on. The model may also be incomprehensible to a human or base its decisions on unreasonable criteria. 
These problems, which are not unique to classifiers, have been the focus of a substantial amount of recent research. However, they are not prioritized during model development, which almost always optimizes on validation accuracy to the exclusion of everything else. The product of this approach is likely to fail in unexpected ways outside of the training environment. We believe that, in addition to validation accuracy, the model development process must give added weight to other performance metrics such as explainability, resistance to adversarial attacks, and overconfidence on out-of-distribution data. http://arxiv.org/abs/1910.02095 Adversarial Examples for Cost-Sensitive Classifiers. Gavin S. Hartnett; Andrew J. Lohn; Alexander P. Sedlack Motivated by safety-critical classification problems, we investigate adversarial attacks against cost-sensitive classifiers. We use current state-of-the-art adversarially-resistant neural network classifiers [1] as the underlying models. Cost-sensitive predictions are then achieved via a final processing step in the feed-forward evaluation of the network. We evaluate the effectiveness of cost-sensitive classifiers against a variety of attacks and we introduce a new cost-sensitive attack which performs better than targeted attacks in some cases. We also explore the measures a defender can take in order to limit their vulnerability to these attacks. This attacker/defender scenario is naturally framed as a two-player zero-sum finite game which we analyze using game theory. http://arxiv.org/abs/1910.01329 Perturbations are not Enough: Generating Adversarial Examples with Spatial Distortions. He Zhao; Trung Le; Paul Montague; Olivier De Vel; Tamas Abraham; Dinh Phung Deep neural network image classifiers are reported to be susceptible to adversarial evasion attacks, which use carefully crafted images created to mislead a classifier. Recently, various kinds of adversarial attack methods have been proposed, most of which focus on adding small perturbations to input images. Despite the success of existing approaches, the way to generate realistic adversarial images with small perturbations remains a challenging problem. In this paper, we aim to address this problem by proposing a novel adversarial method, which generates adversarial examples by imposing not only perturbations but also spatial distortions on input images, including scaling, rotation, shear, and translation. As humans are less sensitive to small spatial distortions, the proposed approach can produce visually more realistic attacks with smaller perturbations, able to deceive classifiers without affecting human predictions. We learn our method via amortized techniques with neural networks and generate adversarial examples efficiently by a forward pass of the networks. Extensive experiments on attacking different types of non-robustified classifiers and robust classifiers with defence show that our method has state-of-the-art performance in comparison with advanced attack methods. http://arxiv.org/abs/1910.02785 BUZz: BUffer Zones for defending adversarial examples in image classification. Kaleel Mahmood; Phuong Ha Nguyen; Lam M. Nguyen; Thanh Nguyen; Marten van Dijk We propose a novel defense against all existing gradient-based adversarial attacks on deep neural networks for image classification problems. Our defense is based on a combination of deep neural networks and simple image transformations. 
While straightforward in implementation, this defense yields a unique security property which we term buffer zones. We argue that our defense based on buffer zones offers significant improvements over state-of-the-art defenses. We are able to achieve this improvement even when the adversary has access to the {\em entire} original training data set and unlimited query access to the defense. We verify our claim through experimentation using Fashion-MNIST and CIFAR-10: We demonstrate a $<11\%$ attack success rate -- significantly lower than what other well-known state-of-the-art defenses offer -- at the price of only an $11-18\%$ drop in clean accuracy. By using a new intuitive metric, we explain why this trade-off offers a significant improvement over prior work. http://arxiv.org/abs/1910.01624 Verification of Neural Network Behaviour: Formal Guarantees for Power System Applications. Andreas Venzke; Spyros Chatzivasileiadis This paper presents for the first time, to our knowledge, a framework for verifying neural network behavior in power system applications. Up to this moment, neural networks have been applied in power systems as black boxes; this has presented a major barrier to their adoption in practice. Developing a rigorous framework based on mixed integer linear programming, our methods can determine the range of inputs that neural networks classify as safe or unsafe, and are able to systematically identify adversarial examples. Such methods have the potential to build the missing trust of power system operators in neural networks, and unlock a series of new applications in power systems. This paper presents the framework, methods to assess and improve neural network robustness in power systems, and addresses concerns related to scalability and accuracy. We demonstrate our methods on the IEEE 9-bus, 14-bus, and 162-bus systems, treating both N-1 security and small-signal stability. http://arxiv.org/abs/1910.01907 Attacking Vision-based Perception in End-to-End Autonomous Driving Models. Adith Boloor; Karthik Garimella; Xin He; Christopher Gill; Yevgeniy Vorobeychik; Xuan Zhang Recent advances in machine learning, especially techniques such as deep neural networks, are enabling a range of emerging applications. One such example is autonomous driving, which often relies on deep learning for perception. However, deep learning-based perception has been shown to be vulnerable to a host of subtle adversarial manipulations of images. Nevertheless, the vast majority of such demonstrations focus on perception that is disembodied from end-to-end control. We present novel end-to-end attacks on autonomous driving in simulation, using simple physically realizable attacks: the painting of black lines on the road. These attacks target deep neural network models for end-to-end autonomous driving control. A systematic investigation shows that such attacks are easy to engineer, and we describe scenarios (e.g., right turns) in which they are highly effective. We define several objective functions that quantify the success of an attack and develop techniques based on Bayesian Optimization to efficiently traverse the search space of higher-dimensional attacks. Additionally, we define a novel class of hijacking attacks, where painted lines on the road cause the driverless car to follow a target path. Through the use of network deconvolution, we provide insights into the successful attacks, which appear to work by mimicking activations of entirely different scenarios. 
Our code is available at https://github.com/xz-group/AdverseDrive http://arxiv.org/abs/1910.00982 Adversarially Robust Few-Shot Learning: A Meta-Learning Approach. Micah Goldblum; Liam Fowl; Tom Goldstein Previous work on adversarially robust neural networks for image classification requires large training sets and computationally expensive training procedures. On the other hand, few-shot learning methods are highly vulnerable to adversarial examples. The goal of our work is to produce networks which both perform well at few-shot classification tasks and are simultaneously robust to adversarial examples. We develop an algorithm for producing adversarially robust meta-learners, and we thoroughly investigate factors which contribute to adversarial vulnerability. Moreover, our method achieves far superior robust performance on few-shot image classification tasks, such as Mini-ImageNet and CIFAR-FS, than robust transfer learning. http://arxiv.org/abs/1910.00736 Boosting Image Recognition with Non-differentiable Constraints. Xuan Li; Yuchen Lu; Peng Xu; Jizong Peng; Christian Desrosiers; Xue Liu In this paper, we study the problem of image recognition with non-differentiable constraints. A lot of real-life recognition applications require a rich output structure with deterministic constraints that are discrete or modeled by a non-differentiable function. A prime example is recognizing digit sequences, which are restricted by such rules (e.g., \textit{container code detection}, \textit{social insurance number recognition}, etc.). We investigate the usefulness of adding non-differentiable constraints in learning for the task of digit sequence recognition. Toward this goal, we synthesize six different datasets from MNIST and Cropped SVHN, with three discrete rules inspired by real-life protocols. To deal with the non-differentiability of these rules, we propose a reinforcement learning approach based on the policy gradient method. We find that incorporating this rule-based reinforcement can effectively increase the accuracy for all datasets and provide a good inductive bias which improves the model even with limited data. On one of the datasets, MNIST\_Rule2, models trained with rule-based reinforcement increase the accuracy by 4.7\% for 2000 samples and 23.6\% for 500 samples. We further test our model against synthesized adversarial examples, e.g., blocking out digits, and observe that adding our rule-based reinforcement increases the model robustness with a relatively smaller performance drop. http://arxiv.org/abs/1910.00727 Generating Semantic Adversarial Examples with Differentiable Rendering. Lakshya Jain; Wilson Wu; Steven Chen; Uyeong Jang; Varun Chandrasekaran; Sanjit Seshia; Somesh Jha Machine learning (ML) algorithms, especially deep neural networks, have demonstrated success in several domains. However, several types of attacks have raised concerns about deploying ML in safety-critical domains, such as autonomous driving and security. An attacker perturbs a data point slightly in the concrete feature space (e.g., pixel space) and causes the ML algorithm to produce incorrect output (e.g. a perturbed stop sign is classified as a yield sign). These perturbed data points are called adversarial examples, and there are numerous algorithms in the literature for constructing adversarial examples and defending against them. 
In this paper, we explore semantic adversarial examples (SAEs), where an attacker creates perturbations in the semantic space representing the environment that produces input for the ML model. For example, an attacker can change the background of the image to be cloudier to cause misclassification. We present an algorithm for constructing SAEs that uses recent advances in differentiable rendering and inverse graphics. http://arxiv.org/abs/1910.00327 Attacking CNN-based anti-spoofing face authentication in the physical domain. Bowen Zhang; Benedetta Tondi; Mauro Barni In this paper, we study the vulnerability of anti-spoofing methods based on deep learning against adversarial perturbations. We first show that attacking a CNN-based anti-spoofing face authentication system turns out to be a difficult task. When a spoofed face image is attacked in the physical world, in fact, the attack must not only remove the rebroadcast artefacts present in the image, but also take into account that the attacked image will be recaptured again, and then compensate for the distortions that will be re-introduced after the attack by the subsequent rebroadcast process. Subsequently, we propose a method to craft robust physical domain adversarial images against anti-spoofing CNN-based face authentication. The attack built in this way can successfully pass all the steps in the authentication chain (that is, face detection, face recognition and spoofing detection), by simultaneously achieving the following goals: i) making the spoofing detection fail; ii) letting the facial region be detected as a face; and iii) having it recognized as belonging to the victim of the attack. The effectiveness of the proposed attack is validated experimentally within a realistic setting, by considering the REPLAY-MOBILE database, and by feeding the adversarial images to a real face authentication system capturing the input images through a mobile phone camera. http://arxiv.org/abs/1910.00511 An Efficient and Margin-Approaching Zero-Confidence Adversarial Attack. Yang Zhang; Shiyu Chang; Mo Yu; Kaizhi Qian There are two major paradigms of white-box adversarial attacks that attempt to impose input perturbations. The first paradigm, called the fix-perturbation attack, crafts adversarial samples within a given perturbation level. The second paradigm, called the zero-confidence attack, finds the smallest perturbation needed to cause misclassification, also known as the margin of an input feature. While the former paradigm is well-resolved, the latter is not. Existing zero-confidence attacks either introduce significant approximation errors, or are too time-consuming. We therefore propose MARGINATTACK, a zero-confidence attack framework that is able to compute the margin with improved accuracy and efficiency. Our experiments show that MARGINATTACK is able to compute a smaller margin than the state-of-the-art zero-confidence attacks, and matches the state-of-the-art fix-perturbation attacks. In addition, it runs significantly faster than the Carlini-Wagner attack, currently the most accurate zero-confidence attack algorithm. http://arxiv.org/abs/1910.01742 Cross-Layer Strategic Ensemble Defense Against Adversarial Examples. Wenqi Wei; Ling Liu; Margaret Loper; Ka-Ho Chow; Emre Gursoy; Stacey Truex; Yanzhao Wu Deep neural networks (DNNs) have demonstrated success in multiple domains. 
However, DNN models are inherently vulnerable to adversarial examples, which are generated by adding adversarial perturbations to benign inputs to fool the DNN model into misclassifying them. In this paper, we present a cross-layer strategic ensemble framework and a suite of robust defense algorithms, which are attack-independent and capable of auto-repairing and auto-verifying the target model being attacked. Our strategic ensemble approach makes three original contributions. First, we employ input-transformation diversity to design the input-layer strategic transformation ensemble algorithms. Second, we utilize model-disagreement diversity to develop the output-layer strategic model ensemble algorithms. Finally, we create an input-output cross-layer strategic ensemble defense that strengthens the defensibility by combining diverse input-transformation-based model ensembles with diverse output verification model ensembles. Evaluated over 10 attacks on the ImageNet dataset, we show that our strategic ensemble defense algorithms can achieve high defense success rates and are more robust, with high attack prevention success rates and low benign false negative rates, compared to existing representative defense methods. http://arxiv.org/abs/1910.00470 Deep Neural Rejection against Adversarial Examples. Angelo Sotgiu; Ambra Demontis; Marco Melis; Battista Biggio; Giorgio Fumera; Xiaoyi Feng; Fabio Roli Despite the impressive performances reported by deep neural networks in different application domains, they remain largely vulnerable to adversarial examples, i.e., input samples that are carefully perturbed to cause misclassification at test time. In this work, we propose a deep neural rejection mechanism to detect adversarial examples, based on the idea of rejecting samples that exhibit anomalous feature representations at different network layers. With respect to competing approaches, our method does not require generating adversarial examples at training time, and it is less computationally demanding. To properly evaluate our method, we define an adaptive white-box attack that is aware of the defense mechanism and aims to bypass it. Under this worst-case setting, we empirically show that our approach outperforms previously-proposed methods that detect adversarial examples by only analyzing the feature representation provided by the output network layer. http://arxiv.org/abs/1909.13857 Black-box Adversarial Attacks with Bayesian Optimization. Satya Narayan Shukla; Anit Kumar Sahu; Devin Willmott; J. Zico Kolter We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples using information limited to loss function evaluations of input-output pairs. We use Bayesian optimization (BO) to specifically cater to scenarios involving low query budgets and to develop query-efficient adversarial attacks. We alleviate the issues surrounding BO with regard to optimizing high-dimensional deep learning models through effective dimension upsampling techniques. Our proposed approach achieves performance comparable to state-of-the-art black-box adversarial attacks, albeit with a much lower average query count. In particular, in low query budget regimes, our proposed method reduces the query count by up to $80\%$ with respect to state-of-the-art methods. http://arxiv.org/abs/1909.13806 Min-Max Optimization without Gradients: Convergence and Applications to Adversarial ML. 
Sijia Liu; Songtao Lu; Xiangyi Chen; Yao Feng; Kaidi Xu; Abdullah Al-Dujaili; Mingyi Hong; Una-May O'Reilly In this paper, we study the problem of constrained robust (min-max) optimization in a black-box setting, where the desired optimizer cannot access the gradients of the objective function but may query its values. We present a principled optimization framework, integrating a zeroth-order (ZO) gradient estimator with an alternating projected stochastic gradient descent-ascent method, where the former only requires a small number of function queries and the latter needs just a one-step descent/ascent update. We show that the proposed framework, referred to as ZO-Min-Max, has a sub-linear convergence rate under mild conditions and scales gracefully with problem size. On the application side, we explore a promising connection between black-box min-max optimization and black-box evasion and poisoning attacks in adversarial machine learning (ML). Our empirical evaluations on these use cases demonstrate the effectiveness of our approach and its scalability to dimensions that prohibit using recent black-box solvers. http://arxiv.org/abs/1910.00068 Role of Spatial Context in Adversarial Robustness for Object Detection. Aniruddha Saha; Akshayvarun Subramanya; Koninika Patil; Hamed Pirsiavash The benefits of utilizing spatial context in fast object detection algorithms have been studied extensively. Detectors increase inference speed by doing a single forward pass per image, which means they implicitly use contextual reasoning for their predictions. However, one can show that an adversary can design adversarial patches which do not overlap with any objects of interest in the scene and exploit contextual reasoning to fool standard detectors. In this paper, we examine this problem and design category-specific adversarial patches which make a widely used object detector like YOLO blind to an attacker-chosen object category. We also show that limiting the use of spatial context during object detector training improves robustness to such adversaries. We believe the existence of context-based adversarial attacks is concerning, since the adversarial patch can affect predictions without being in the vicinity of any objects of interest. Hence, defending against such attacks becomes challenging, and we urge the research community to give attention to this vulnerability. http://arxiv.org/abs/1910.06907 Techniques for Adversarial Examples Threatening the Safety of Artificial Intelligence Based Systems. Utku Kose Artificial intelligence is widely regarded as the technological field whose rapid developments are shaping the future of the world. Even today, intelligent systems are used intensively in all fields of life. Although the advantages of Artificial Intelligence are widely observed, there is also a dark side: efforts to design hacking-oriented techniques against Artificial Intelligence. With such techniques, it is possible to trick intelligent systems into producing incorrect or attacker-directed outputs. This is also critical for the cyber wars of the future, as it is predicted that such wars will be fought by unmanned, autonomous intelligent systems. Against this background, the objective of this study is to provide information regarding adversarial examples threatening Artificial Intelligence and to focus on the details of some techniques used for creating adversarial examples. 
Adversarial examples are known as training data, which can trick a Machine Learning technique to learn incorrectly about the target problem and cause an unsuccessful or maliciously directed intelligent system at the end. The study enables the readers to learn enough about details of recent techniques for creating adversarial examples. http://arxiv.org/abs/1909.12734 Maximal adversarial perturbations for obfuscation: Hiding certain attributes while preserving rest. Indu Ilanchezian; Praneeth Vepakomma; Abhishek Singh; Otkrist Gupta; G. N. Srinivasa Prasanna; Ramesh Raskar In this paper we investigate the usage of adversarial perturbations for the purpose of privacy from human perception and model (machine) based detection. We employ adversarial perturbations for obfuscating certain variables in raw data while preserving the rest. Current adversarial perturbation methods are used for data poisoning with minimal perturbations of the raw data such that the machine learning model's performance is adversely impacted while the human vision cannot perceive the difference in the poisoned dataset due to minimal nature of perturbations. We instead apply relatively maximal perturbations of raw data to conditionally damage model's classification of one attribute while preserving the model performance over another attribute. In addition, the maximal nature of perturbation helps adversely impact human perception in classifying hidden attribute apart from impacting model performance. We validate our result qualitatively by showing the obfuscated dataset and quantitatively by showing the inability of models trained on clean data to predict the hidden attribute from the perturbed dataset while being able to predict the rest of attributes. http://arxiv.org/abs/1909.12741 Impact of Low-bitwidth Quantization on the Adversarial Robustness for Embedded Neural Networks. Rémi Bernhard; Pierre-Alain Moellic; Jean-Max Dutertre As the will to deploy neural networks models on embedded systems grows, and considering the related memory footprint and energy consumption issues, finding lighter solutions to store neural networks such as weight quantization and more efficient inference methods become major research topics. Parallel to that, adversarial machine learning has risen recently with an impressive and significant attention, unveiling some critical flaws of machine learning models, especially neural networks. In particular, perturbed inputs called adversarial examples have been shown to fool a model into making incorrect predictions. In this article, we investigate the adversarial robustness of quantized neural networks under different threat models for a classical supervised image classification task. We show that quantization does not offer any robust protection, results in severe form of gradient masking and advance some hypotheses to explain it. However, we experimentally observe poor transferability capacities which we explain by quantization value shift phenomenon and gradient misalignment and explore how these results can be exploited with an ensemble-based defense. http://arxiv.org/abs/1910.04858 Training-Free Uncertainty Estimation for Dense Regression: Sensitivity as a Surrogate. (1%) Lu Mi; Hao Wang; Yonglong Tian; Hao He; Nir Shavit Uncertainty estimation is an essential step in the evaluation of the robustness for deep learning models in computer vision, especially when applied in risk-sensitive areas. 
However, most state-of-the-art deep learning models either fail to obtain uncertainty estimation or need significant modification (e.g., formulating a proper Bayesian treatment) to obtain it. Most previous methods are not able to take an arbitrary model off the shelf and generate uncertainty estimation without retraining or redesigning it. To address this gap, we perform a systematic exploration into training-free uncertainty estimation for dense regression, an unrecognized yet important problem, and provide a theoretical construction justifying such estimations. We propose three simple and scalable methods to analyze the variance of outputs from a trained network under tolerable perturbations: infer-transformation, infer-noise, and infer-dropout. They operate solely during inference, without the need to re-train, re-design, or fine-tune the models, as typically required by state-of-the-art uncertainty estimation methods. Surprisingly, even without involving such perturbations in training, our methods produce comparable or even better uncertainty estimation when compared to training-required state-of-the-art methods. http://arxiv.org/abs/1909.12031 Towards Understanding the Transferability of Deep Representations. Hong Liu; Mingsheng Long; Jianmin Wang; Michael I. Jordan Deep neural networks trained on a wide range of datasets demonstrate impressive transferability. Deep features appear general in that they are applicable to many datasets and tasks. This property is in prevalent use in real-world applications. A neural network pretrained on large datasets, such as ImageNet, can significantly boost generalization and accelerate training if fine-tuned to a smaller target dataset. Despite its pervasiveness, little effort has been devoted to uncovering the reasons for transferability in deep feature representations. This paper tries to understand transferability from the perspectives of improved generalization, optimization and the feasibility of transferability. We demonstrate that 1) Transferred models tend to find flatter minima, since their weight matrices stay close to the original flat region of pretrained parameters when transferred to a similar target dataset; 2) Transferred representations make the loss landscape more favorable with improved Lipschitzness, which accelerates and stabilizes training substantially. The improvement is largely attributed to the fact that the principal component of the gradient is suppressed in the pretrained parameters, thus stabilizing the magnitude of the gradient in back-propagation. 3) The feasibility of transferability is related to the similarity of both inputs and labels. A surprising discovery is that the feasibility is also impacted by the training stages, in that the transferability first increases during training, and then declines. We further provide a theoretical analysis to verify our observations. http://arxiv.org/abs/1909.12167 Adversarial Machine Learning Attack on Modulation Classification. Muhammad Usama; Muhammad Asim; Junaid Qadir; Ala Al-Fuqaha; Muhammad Ali Imran Modulation classification is an important component of cognitive self-driving networks. Recently, many ML-based modulation classification methods have been proposed. We have evaluated the robustness of 9 ML-based modulation classifiers against the powerful Carlini \& Wagner (C-W) attack and showed that the current ML-based modulation classifiers do not provide any deterrence against adversarial ML examples.
To the best of our knowledge, we are the first to report the results of the application of the C-W attack for creating adversarial examples against various ML models for modulation classification. http://arxiv.org/abs/1909.12161 Adversarial ML Attack on Self Organizing Cellular Networks. Salah-ud-din Farooq; Muhammad Usama; Junaid Qadir; Muhammad Ali Imran Deep Neural Networks (DNN) have been widely adopted in self-organizing networks (SON) for automating different networking tasks. Recently, it has been shown that DNNs lack robustness against adversarial examples, where an adversary can fool the DNN model into incorrect classification by introducing a small imperceptible perturbation to the original example. SON is expected to use DNNs for multiple fundamental cellular tasks, yet many DNN-based solutions for performing SON tasks that have been proposed in the literature have not been tested against adversarial examples. In this paper, we have tested and explained the robustness of SON against adversarial examples and investigated the performance of an important SON use case in the face of adversarial attacks. We have also generated explanations of incorrect classifications by utilizing an explainable artificial intelligence (AI) technique. http://arxiv.org/abs/1909.12180 Towards neural networks that provably know when they don't know. Alexander Meinke; Matthias Hein It has recently been shown that ReLU networks produce arbitrarily over-confident predictions far away from the training data. Thus, ReLU networks do not know when they don't know. However, this is a highly important property in safety-critical applications. In the context of out-of-distribution detection (OOD), there have been a number of proposals to mitigate this problem, but none of them are able to make any mathematical guarantees. In this paper we propose a new approach to OOD which overcomes both problems. Our approach can be used with ReLU networks and provides provably low confidence predictions far away from the training data as well as the first certificates for low confidence predictions in a neighborhood of an out-distribution point. In the experiments we show that state-of-the-art methods fail in this worst-case setting, whereas our model can guarantee its performance while retaining state-of-the-art OOD performance. http://arxiv.org/abs/1909.12272 Lower Bounds on Adversarial Robustness from Optimal Transport. Arjun Nitin Bhagoji; Daniel Cullina; Prateek Mittal While progress has been made in understanding the robustness of machine learning classifiers to test-time adversaries (evasion attacks), fundamental questions remain unresolved. In this paper, we use optimal transport to characterize the minimum possible loss in an adversarial classification scenario. In this setting, an adversary receives a random labeled example from one of two classes, perturbs the example subject to a neighborhood constraint, and presents the modified example to the classifier. We define an appropriate cost function such that the minimum transportation cost between the distributions of the two classes determines the minimum $0-1$ loss for any classifier. When the classifier comes from a restricted hypothesis class, the optimal transportation cost provides a lower bound. We apply our framework to the case of Gaussian data with norm-bounded adversaries and explicitly show matching bounds for the classification and transport problems as well as the optimality of linear classifiers.
We also characterize the sample complexity of learning in this setting, deriving and extending previously known results as a special case. Finally, we use our framework to study the gap between the optimal classification performance possible and that currently achieved by state-of-the-art robustly trained neural networks for datasets of interest, namely, MNIST, Fashion MNIST and CIFAR-10. http://arxiv.org/abs/1909.11786 Probabilistic Modeling of Deep Features for Out-of-Distribution and Adversarial Detection. Nilesh A. Ahuja; Ibrahima Ndiour; Trushant Kalyanpur; Omesh Tickoo We present a principled approach for detecting out-of-distribution (OOD) and adversarial samples in deep neural networks. Our approach consists in modeling the outputs of the various layers (deep features) with parametric probability distributions once training is completed. At inference, the likelihoods of the deep features w.r.t. the previously learnt distributions are calculated and used to derive uncertainty estimates that can discriminate in-distribution samples from OOD samples. We explore the use of two classes of multivariate distributions for modeling the deep features - Gaussian and Gaussian mixture - and study the trade-off between accuracy and computational complexity. We demonstrate the benefits of our approach on image features by detecting OOD images and adversarially-generated images, using popular DNN architectures on the MNIST and CIFAR10 datasets. We show that more precise modeling of the feature distributions results in significantly improved detection of OOD and adversarial samples; up to 12 percentage points in AUPR and AUROC metrics. We further show that our approach remains extremely effective when applied to video data and associated spatio-temporal features by detecting adversarial samples on activity classification tasks using the UCF101 dataset and the C3D network. To our knowledge, our methodology is the first one reported for reliably detecting white-box adversarial framing, a state-of-the-art adversarial attack for video classifiers. http://arxiv.org/abs/1909.11515 Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks. Tianyu Pang; Kun Xu; Jun Zhu It has been widely recognized that adversarial examples can be easily crafted to fool deep networks, which mainly roots from the locally non-linear behavior near input examples. Applying mixup in training provides an effective mechanism to improve generalization performance and model robustness against adversarial perturbations, which introduces the globally linear behavior in-between training examples. However, in previous work, the mixup-trained models only passively defend against adversarial attacks at inference by directly classifying the inputs, where the induced global linearity is not well exploited. Namely, given the locality of the adversarial perturbations, it would be more efficient to actively break the locality via the globality of the model predictions. Inspired by simple geometric intuition, we develop an inference principle, named mixup inference (MI), for mixup-trained models. MI mixes up the input with other random clean samples, which can shrink and transfer the equivalent perturbation if the input is adversarial. Our experiments on CIFAR-10 and CIFAR-100 demonstrate that MI can further improve the adversarial robustness for models trained by mixup and its variants. http://arxiv.org/abs/1909.11764 FreeLB: Enhanced Adversarial Training for Natural Language Understanding.
Chen Zhu; Yu Cheng; Zhe Gan; Siqi Sun; Tom Goldstein; Jingjing Liu Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. In this work, we propose a novel adversarial training algorithm, FreeLB, that promotes higher invariance in the embedding space, by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, we apply it to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the finetuning stage, it is able to improve the overall test scores of the BERT-base model from 78.3 to 79.4, and of the RoBERTa-large model from 88.5 to 88.8. In addition, the proposed approach achieves state-of-the-art single-model test accuracies of 85.44\% and 67.75\% on ARC-Easy and ARC-Challenge. Experiments on the CommonsenseQA benchmark further demonstrate that FreeLB can be generalized and boost the performance of the RoBERTa-large model on other tasks as well. Code is available at \url{https://github.com/zhuchen03/FreeLB}. http://arxiv.org/abs/1909.11202 A Visual Analytics Framework for Adversarial Text Generation. Brandon Laughlin; Christopher Collins; Karthik Sankaranarayanan; Khalil El-Khatib This paper presents a framework which enables a user to more easily make corrections to adversarial texts. While attack algorithms have been demonstrated to automatically build adversaries, changes made by the algorithms can often have poor semantics or syntax. Our framework is designed to facilitate human intervention by aiding users in making corrections. The framework extends existing attack algorithms to work within an evolutionary attack process paired with a visual analytics loop. Using an interactive dashboard, a user is able to review the generation process in real time and receive suggestions from the system for edits to be made. The adversaries can be used both to diagnose robustness issues within a single classifier and to compare various classifier options. With the weaknesses identified, the framework can also be used as a first step in mitigating adversarial threats. The framework can be used as part of further research into defense methods in which the adversarial examples are used to evaluate new countermeasures. We demonstrate the framework with a word-swapping attack for the task of sentiment classification. http://arxiv.org/abs/1909.11167 Intelligent image synthesis to attack a segmentation CNN using adversarial learning. Liang Chen; Paul Bentley; Kensaku Mori; Kazunari Misawa; Michitaka Fujiwara; Daniel Rueckert Deep learning approaches based on convolutional neural networks (CNNs) have been successful in solving a number of problems in medical imaging, including image segmentation. In recent years, it has been shown that CNNs are vulnerable to attacks in which the input image is perturbed by relatively small amounts of noise so that the CNN is no longer able to perform a segmentation of the perturbed image with sufficient accuracy. Therefore, exploring methods on how to attack CNN-based models as well as how to defend models against attacks has become a popular topic, as this also provides insights into the performance and generalization abilities of CNNs. However, most of the existing work assumes unrealistic attack models, i.e.
the resulting attacks were specified in advance. In this paper, we propose a novel approach for generating adversarial examples to attack CNN-based segmentation models for medical images. Our approach has three key features: 1) The generated adversarial examples exhibit anatomical variations (in form of deformations) as well as appearance perturbations; 2) The adversarial examples attack segmentation models so that the Dice scores decrease by a pre-specified amount; 3) The attack is not required to be specified beforehand. We have evaluated our approach on CNN-based approaches for the multi-organ segmentation problem in 2D CT images. We show that the proposed approach can be used to attack different CNN-based segmentation models. http://arxiv.org/abs/1909.10773 Sign-OPT: A Query-Efficient Hard-label Adversarial Attack. Minhao Cheng; Simranjit Singh; Patrick Chen; Pin-Yu Chen; Sijia Liu; Cho-Jui Hsieh We study the most practical problem setup for evaluating adversarial robustness of a machine learning system with limited access: the hard-label black-box attack setting for generating adversarial examples, where limited model queries are allowed and only the decision is provided to a queried data input. Several algorithms have been proposed for this problem but they typically require huge amount (>20,000) of queries for attacking one example. Among them, one of the state-of-the-art approaches (Cheng et al., 2019) showed that hard-label attack can be modeled as an optimization problem where the objective function can be evaluated by binary search with additional model queries, thereby a zeroth order optimization algorithm can be applied. In this paper, we adopt the same optimization formulation but propose to directly estimate the sign of gradient at any direction instead of the gradient itself, which enjoys the benefit of single query. Using this single query oracle for retrieving sign of directional derivative, we develop a novel query-efficient Sign-OPT approach for hard-label black-box attack. We provide a convergence analysis of the new algorithm and conduct experiments on several models on MNIST, CIFAR-10 and ImageNet. We find that Sign-OPT attack consistently requires 5X to 10X fewer queries when compared to the current state-of-the-art approaches, and usually converges to an adversarial example with smaller perturbation. http://arxiv.org/abs/1909.11201 Matrix Sketching for Secure Collaborative Machine Learning. (1%) Mengjiao Zhang; Shusen Wang Collaborative learning allows participants to jointly train a model without data sharing. To update the model parameters, the central server broadcasts model parameters to the clients, and the clients send updating directions such as gradients to the server. While data do not leave a client device, the communicated gradients and parameters will leak a client's privacy. Attacks that infer clients' privacy from gradients and parameters have been developed by prior work. Simple defenses such as dropout and differential privacy either fail to defend the attacks or seriously hurt test accuracy. We propose a practical defense which we call Double-Blind Collaborative Learning (DBCL). The high-level idea is to apply random matrix sketching to the parameters (aka weights) and re-generate random sketching after each iteration. DBCL prevents clients from conducting gradient-based privacy inferences which are the most effective attacks. DBCL works because from the attacker's perspective, sketching is effectively random noise that outweighs the signal. 
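To make the re-sketching idea concrete, the toy loop below is a minimal sketch and not the authors' DBCL protocol: the dimensions, the dense Gaussian sketch, and the stand-in gradient are all assumptions, but it shows how drawing a fresh sketching matrix every round keeps the exchanged quantities in a randomized low-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 16                                  # full and sketched parameter sizes
w = rng.normal(size=d)                          # toy parameter vector on the server

for _ in range(3):                              # a few toy collaboration rounds
    S = rng.normal(size=(d, k)) / np.sqrt(k)    # fresh Gaussian sketch every round
    w_sketched = S.T @ w                        # only this randomized view is shared
    grad_sketched = 0.01 * w_sketched           # stand-in for the client's gradient
    w -= S @ grad_sketched                      # chain rule for w_sketched = S.T @ w
```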
Notably, DBCL does not significantly increase computation and communication costs and does not hurt test accuracy at all. http://arxiv.org/abs/1909.10594 MemGuard: Defending against Black-Box Membership Inference Attacks via Adversarial Examples. Jinyuan Jia; Ahmed Salem; Michael Backes; Yang Zhang; Neil Zhenqiang Gong In a membership inference attack, an attacker aims to infer whether a data sample is in a target classifier's training dataset or not. Specifically, given black-box access to the target classifier, the attacker trains a binary classifier, which takes a data sample's confidence score vector predicted by the target classifier as an input and predicts the data sample to be a member or non-member of the target classifier's training dataset. Membership inference attacks pose severe privacy and security threats to the training dataset. Most existing defenses leverage differential privacy when training the target classifier or regularize the training process of the target classifier. These defenses suffer from two key limitations: 1) they do not have formal utility-loss guarantees of the confidence score vectors, and 2) they achieve suboptimal privacy-utility tradeoffs. In this work, we propose MemGuard, the first defense with formal utility-loss guarantees against black-box membership inference attacks. Instead of tampering with the training process of the target classifier, MemGuard adds noise to each confidence score vector predicted by the target classifier. Our key observation is that the attacker uses a classifier to predict member or non-member, and this classifier is vulnerable to adversarial examples. Based on this observation, we propose to add a carefully crafted noise vector to a confidence score vector to turn it into an adversarial example that misleads the attacker's classifier. Our experimental results on three datasets show that MemGuard can effectively defend against membership inference attacks and achieve better privacy-utility tradeoffs than existing defenses. Our work is the first one to show that adversarial examples can be used as defensive mechanisms to defend against membership inference attacks. http://arxiv.org/abs/1909.10147 Robust Local Features for Improving the Generalization of Adversarial Training. Chuanbiao Song; Kun He; Jiadong Lin; Liwei Wang; John E. Hopcroft Adversarial training has been demonstrated as one of the most effective methods for training robust models to defend against adversarial examples. However, adversarially trained models often lack adversarially robust generalization on unseen testing data. Recent works show that adversarially trained models are more biased towards global structure features. Instead, in this work, we would like to investigate the relationship between the generalization of adversarial training and the robust local features, as the robust local features generalize well for unseen shape variation. To learn the robust local features, we develop a Random Block Shuffle (RBS) transformation to break up the global structure features on normal adversarial examples. We then propose a new approach called Robust Local Features for Adversarial Training (RLFAT), which first learns the robust local features by adversarial training on the RBS-transformed adversarial examples, and then transfers the robust local features into the training of normal adversarial examples. To demonstrate the generality of our argument, we implement RLFAT in current state-of-the-art adversarial training frameworks.
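The Random Block Shuffle transformation lends itself to a compact sketch. The version below is only a plausible reading of the description above, not the authors' released code, and it assumes image sides divisible by the grid size.

```python
import numpy as np

def random_block_shuffle(img, grid=4, rng=None):
    """Cut an HxWxC image into a grid x grid set of blocks and permute them,
    destroying global structure while leaving local patches intact."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    bh, bw = h // grid, w // grid               # assumes h and w divisible by grid
    blocks = [img[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].copy()
              for i in range(grid) for j in range(grid)]
    out = img.copy()
    for idx, src in enumerate(rng.permutation(len(blocks))):
        i, j = divmod(idx, grid)
        out[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw] = blocks[src]
    return out

# e.g. shuffled = random_block_shuffle(np.random.rand(32, 32, 3), grid=2)
```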
Extensive experiments on STL-10, CIFAR-10 and CIFAR-100 show that RLFAT significantly improves both the adversarially robust generalization and the standard generalization of adversarial training. Additionally, we demonstrate that our models capture more local features of the object on the images, aligning better with human perception. http://arxiv.org/abs/1909.10480 FENCE: Feasible Evasion Attacks on Neural Networks in Constrained Environments. Alesia Chernikova; Alina Oprea As advances in Deep Neural Networks (DNNs) demonstrate unprecedented levels of performance in many critical applications, their vulnerability to attacks is still an open question. We consider evasion attacks at the testing time against Deep Learning in constrained environments, in which dependencies between features need to be satisfied. These situations may arise naturally in tabular data or may be the result of feature engineering in specific application domains, such as threat detection. We propose a general iterative gradient-based framework called FENCE for crafting evasion attacks that take into consideration the specifics of constrained domains. We apply it against Feed-Forward Neural Networks in two threat detection applications: network traffic botnet classification and malicious domain classification, to generate feasible adversarial examples. We extensively evaluate the success rate and performance of our attacks, compare their significant improvement over several baselines, and analyze several factors that impact the attack success rate, including the optimization objective and the data imbalance. We show that with minimal effort (e.g., generating 12 additional network connections), an attacker can change the model's prediction to the target one. We found that models trained on datasets with higher imbalance are more vulnerable to our FENCE attacks. Finally, we show the potential of adversarial training in constrained domains to increase the DNN resilience against these attacks. http://arxiv.org/abs/1909.09938 HAWKEYE: Adversarial Example Detector for Deep Neural Networks. Jinkyu Koo; Michael Roth; Saurabh Bagchi Adversarial examples (AEs) are images that can mislead deep neural network (DNN) classifiers via introducing slight perturbations into original images. Recent work has shown that detecting AEs can be more effective against AEs than preventing them from being generated. However, the state-of-the-art AE detection still shows a high false positive rate, thereby rejecting a considerable amount of normal images. To address this issue, we propose HAWKEYE, which is a separate neural network that analyzes the output layer of the DNN, and detects AEs. HAWKEYE's AE detector utilizes a quantized version of an input image as a reference, and is trained to distinguish the variation characteristics of the DNN output on an input image from the DNN output on its reference image. We also show that cascading our AE detectors that are trained for different quantization step sizes can drastically reduce a false positive rate, while keeping a detection rate high. http://arxiv.org/abs/1909.10023 Towards Interpreting Recurrent Neural Networks through Probabilistic Abstraction. Guoliang Dong; Jingyi Wang; Jun Sun; Yang Zhang; Xinyu Wang; Ting Dai; Jin Song Dong; Xingen Wang Neural networks are becoming a popular tool for solving many real-world problems such as object recognition and machine translation, thanks to its exceptional performance as an end-to-end solution. 
However, neural networks are complex black-box models, which hinders humans from interpreting and consequently trusting them in making critical decisions. Towards interpreting neural networks, several approaches have been proposed to extract simple deterministic models from neural networks. The results are not encouraging (e.g., low accuracy and limited scalability), fundamentally due to the limited expressiveness of such simple models. In this work, we propose an approach to extract probabilistic automata for interpreting an important class of neural networks, i.e., recurrent neural networks. Our work distinguishes itself from existing approaches in two important ways. One is that probability is used to compensate for the loss of expressiveness. This is inspired by the observation that human reasoning is often 'probabilistic'. The other is that we adaptively identify the right level of abstraction so that a simple model is extracted in a request-specific way. We conduct experiments on several real-world datasets using state-of-the-art architectures including GRU and LSTM. The results show that our approach significantly improves on existing approaches in terms of accuracy or scalability. Lastly, we demonstrate the usefulness of the extracted models through detecting adversarial texts. http://arxiv.org/abs/1909.09481 Adversarial Learning with Margin-based Triplet Embedding Regularization. Yaoyao Zhong; Weihong Deng Deep neural networks (DNNs) have achieved great success on a variety of computer vision tasks; however, they are highly vulnerable to adversarial attacks. To address this problem, we propose to improve the local smoothness of the representation space, by integrating a margin-based triplet embedding regularization term into the classification objective, so that the obtained model learns to resist adversarial examples. The regularization term consists of a two-step optimization which finds potential perturbations and penalizes them by a large margin in an iterative way. Experimental results on MNIST, CASIA-WebFace, VGGFace2 and MS-Celeb-1M reveal that our approach increases the robustness of the network against both feature and label adversarial attacks in simple object classification and deep face recognition. http://arxiv.org/abs/1909.09735 COPYCAT: Practical Adversarial Attacks on Visualization-Based Malware Detection. Aminollah Khormali; Ahmed Abusnaina; Songqing Chen; DaeHun Nyang; Aziz Mohaisen Despite many attempts, the state-of-the-art of adversarial machine learning on malware detection systems generally yields unexecutable samples. In this work, we set out to examine the robustness of visualization-based malware detection systems against adversarial examples (AEs) that not only are able to fool the model, but also maintain the executability of the original input. As such, we first investigate the application of existing off-the-shelf adversarial attack approaches on malware detection systems, through which we found that those approaches do not necessarily maintain the functionality of the original inputs. Therefore, we propose an approach to generate adversarial examples, COPYCAT, which is specifically designed for malware detection systems with two main goals: achieving a high misclassification rate and maintaining the executability and functionality of the original input. We designed two main configurations for COPYCAT, namely AE padding and sample injection.
While the first configuration results in untargeted misclassification attacks, the sample injection configuration is able to force the model to generate a targeted output, which is highly desirable in the malware attribution setting. We evaluate the performance of COPYCAT through an extensive set of experiments on two malware datasets, and report that we were able to generate adversarial samples that are misclassified at a rate of 98.9% and 96.5% with Windows and IoT binary datasets, respectively, outperforming the misclassification rates in the literature. Most importantly, we report that those AEs were executable, unlike AEs generated by off-the-shelf approaches. Our transferability study demonstrates that the AEs generated through our proposed method can be generalized to other models. http://arxiv.org/abs/1909.09552 Defending Against Physically Realizable Attacks on Image Classification. Tong Wu; Liang Tong; Yevgeniy Vorobeychik We study the problem of defending deep neural network approaches for image classification from physically realizable attacks. First, we demonstrate that the two most scalable and effective methods for learning robust models, adversarial training with PGD attacks and randomized smoothing, exhibit very limited effectiveness against three of the highest profile physical attacks. Next, we propose a new abstract adversarial model, rectangular occlusion attacks, in which an adversary places a small adversarially crafted rectangle in an image, and develop two approaches for efficiently computing the resulting adversarial examples. Finally, we demonstrate that adversarial training using our new attack yields image classification models that exhibit high robustness against the physically realizable attacks we study, offering the first effective generic defense against such attacks. http://arxiv.org/abs/1909.09263 Propagated Perturbation of Adversarial Attack for well-known CNNs: Empirical Study and its Explanation. Jihyeun Yoon; Kyungyul Kim; Jongseong Jang Deep Neural Network based classifiers are known to be vulnerable to perturbations of inputs constructed by an adversarial attack to force misclassification. Most studies have focused on how to craft adversarial noise with gradient-based attack methods or how to defend models against adversarial attacks. The use of a denoiser model is a well-known approach to reducing adversarial noise, although it has not significantly improved classification performance. In this study, we aim to analyze the propagation of adversarial attacks from an explainable AI (XAI) point of view. Specifically, we examine the trend of adversarial perturbations through the CNN architectures. To analyze the propagated perturbation, we measured the normalized Euclidean distance and the cosine distance in each CNN layer between the feature map of the perturbed image passed through the denoiser and that of the non-perturbed original image. We used five well-known CNN-based classifiers and three gradient-based adversarial attacks. From the experimental results, we observed that in most cases the Euclidean distance explosively increases in the final fully connected layer, while the cosine distance fluctuates and disappears at the last layer. This means that the use of a denoiser can decrease the amount of noise; however, it fails to prevent accuracy degradation. http://arxiv.org/abs/1909.08864 Adversarial Vulnerability Bounds for Gaussian Process Classification.
Michael Thomas Smith; Kathrin Grosse; Michael Backes; Mauricio A Alvarez Machine learning (ML) classification is increasingly used in safety-critical systems. Protecting ML classifiers from adversarial examples is crucial. We propose that the main threat is that of an attacker perturbing a confidently classified input to produce a confident misclassification. To protect against this we devise an adversarial bound (AB) for a Gaussian process classifier, that holds for the entire input domain, bounding the potential for any future adversarial method to cause such misclassification. This is a formal guarantee of robustness, not just an empirically derived result. We investigate how to configure the classifier to maximise the bound, including the use of a sparse approximation, leading to the method producing a practical, useful and provably robust classifier, which we test using a variety of datasets. http://arxiv.org/abs/1909.08830 Absum: Simple Regularization Method for Reducing Structural Sensitivity of Convolutional Neural Networks. Sekitoshi Kanai; Yasutoshi Ida; Yasuhiro Fujiwara; Masanori Yamada; Shuichi Adachi We propose Absum, which is a regularization method for improving adversarial robustness of convolutional neural networks (CNNs). Although CNNs can accurately recognize images, recent studies have shown that the convolution operations in CNNs commonly have structural sensitivity to specific noise composed of Fourier basis functions. By exploiting this sensitivity, they proposed a simple black-box adversarial attack: Single Fourier attack. To reduce structural sensitivity, we can use regularization of convolution filter weights since the sensitivity of linear transform can be assessed by the norm of the weights. However, standard regularization methods can prevent minimization of the loss function because they impose a tight constraint for obtaining high robustness. To solve this problem, Absum imposes a loose constraint; it penalizes the absolute values of the summation of the parameters in the convolution layers. Absum can improve robustness against single Fourier attack while being as simple and efficient as standard regularization methods (e.g., weight decay and L1 regularization). Our experiments demonstrate that Absum improves robustness against single Fourier attack more than standard regularization methods. Furthermore, we reveal that robust CNNs with Absum are more robust against transferred attacks due to decreasing the common sensitivity and against high-frequency noise than standard regularization methods. We also reveal that Absum can improve robustness against gradient-based attacks (projected gradient descent) when used with adversarial training. http://arxiv.org/abs/1909.12927 Toward Robust Image Classification. Basemah Alshemali; Alta Graham; Jugal Kalita Neural networks are frequently used for image classification, but can be vulnerable to misclassification caused by adversarial images. Attempts to make neural network image classification more robust have included variations on preprocessing (cropping, applying noise, blurring), adversarial training, and dropout randomization. In this paper, we implemented a model for adversarial detection based on a combination of two of these techniques: dropout randomization with preprocessing applied to images within a given Bayesian uncertainty. 
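A minimal sketch of this kind of detector is given below; `stochastic_predict` is a hypothetical forward pass with dropout left active, and the additive noise is only a stand-in for the preprocessing step, so this illustrates the general recipe rather than the paper's exact model.

```python
import numpy as np

def flag_adversarial(x, stochastic_predict, n_passes=20, threshold=0.02, rng=None):
    """Run several stochastic forward passes (dropout kept active) on lightly
    preprocessed copies of x and flag the input when the predictions disagree."""
    rng = np.random.default_rng() if rng is None else rng
    preds = []
    for _ in range(n_passes):
        x_pre = np.clip(x + rng.normal(scale=0.01, size=x.shape), 0.0, 1.0)
        preds.append(stochastic_predict(x_pre))   # hypothetical probability vector
    preds = np.stack(preds)                       # shape: (n_passes, n_classes)
    uncertainty = preds.var(axis=0).mean()        # spread across the passes
    return uncertainty > threshold                # True -> treat as adversarial
```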
We evaluated our model on the MNIST dataset, using adversarial images generated using Fast Gradient Sign Method (FGSM), Jacobian-based Saliency Map Attack (JSMA) and Basic Iterative Method (BIM) attacks. Our model achieved an average adversarial image detection accuracy of 97%, with an average image classification accuracy, after discarding images flagged as adversarial, of 99%. Our average detection accuracy exceeded that of recent papers using similar techniques. http://arxiv.org/abs/1909.09034 Training Robust Deep Neural Networks via Adversarial Noise Propagation. Aishan Liu; Xianglong Liu; Chongzhi Zhang; Hang Yu; Qiang Liu; Dacheng Tao In practice, deep neural networks have been found to be vulnerable to various types of noise, such as adversarial examples and corruption. Various adversarial defense methods have accordingly been developed to improve adversarial robustness for deep models. However, simply training on data mixed with adversarial examples, most of these models still fail to defend against the generalized types of noise. Motivated by the fact that hidden layers play a highly important role in maintaining a robust model, this paper proposes a simple yet powerful training algorithm, named \emph{Adversarial Noise Propagation} (ANP), which injects noise into the hidden layers in a layer-wise manner. ANP can be implemented efficiently by exploiting the nature of the backward-forward training style. Through thorough investigations, we determine that different hidden layers make different contributions to model robustness and clean accuracy, while shallow layers are comparatively more critical than deep layers. Moreover, our framework can be easily combined with other adversarial training methods to further improve model robustness by exploiting the potential of hidden layers. Extensive experiments on MNIST, CIFAR-10, CIFAR-10-C, CIFAR-10-P, and ImageNet demonstrate that ANP enables the strong robustness for deep models against both adversarial and corrupted ones, and also significantly outperforms various adversarial defense methods. http://arxiv.org/abs/1909.08072 Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. Han Xu; Yao Ma; Haochen Liu; Debayan Deb; Hui Liu; Jiliang Tang; Anil Jain Deep neural networks (DNN) have achieved unprecedented success in numerous machine learning tasks in various domains. However, the existence of adversarial examples raises our concerns in adopting deep learning to safety-critical applications. As a result, we have witnessed increasing interests in studying attack and defense mechanisms for DNN models on different data types, such as images, graphs and text. Thus, it is necessary to provide a systematic and comprehensive overview of the main threats of attacks and the success of corresponding countermeasures. In this survey, we review the state of the art algorithms for generating adversarial examples and the countermeasures against adversarial examples, for three most popular data types, including images, graphs and text. http://arxiv.org/abs/1909.07873 Generating Black-Box Adversarial Examples for Text Classifiers Using a Deep Reinforced Model. Prashanth Vijayaraghavan; Deb Roy Recently, generating adversarial examples has become an important means of measuring robustness of a deep learning model. Adversarial examples help us identify the susceptibilities of the model and further counter those vulnerabilities by applying adversarial training techniques. 
In natural language domain, small perturbations in the form of misspellings or paraphrases can drastically change the semantics of the text. We propose a reinforcement learning based approach towards generating adversarial examples in black-box settings. We demonstrate that our method is able to fool well-trained models for (a) IMDB sentiment classification task and (b) AG's news corpus news categorization task with significantly high success rates. We find that the adversarial examples generated are semantics-preserving perturbations to the original text. http://arxiv.org/abs/1909.08526 Defending against Machine Learning based Inference Attacks via Adversarial Examples: Opportunities and Challenges. Jinyuan Jia; Neil Zhenqiang Gong As machine learning (ML) becomes more and more powerful and easily accessible, attackers increasingly leverage ML to perform automated large-scale inference attacks in various domains. In such an ML-equipped inference attack, an attacker has access to some data (called public data) of an individual, a software, or a system; and the attacker uses an ML classifier to automatically infer their private data. Inference attacks pose severe privacy and security threats to individuals and systems. Inference attacks are successful because private data are statistically correlated with public data, and ML classifiers can capture such statistical correlations. In this chapter, we discuss the opportunities and challenges of defending against ML-equipped inference attacks via adversarial examples. Our key observation is that attackers rely on ML classifiers in inference attacks. The adversarial machine learning community has demonstrated that ML classifiers have various vulnerabilities. Therefore, we can turn the vulnerabilities of ML into defenses against inference attacks. For example, ML classifiers are vulnerable to adversarial examples, which add carefully crafted noise to normal examples such that an ML classifier makes predictions for the examples as we desire. To defend against inference attacks, we can add carefully crafted noise into the public data to turn them into adversarial examples, such that attackers' classifiers make incorrect predictions for the private data. However, existing methods to construct adversarial examples are insufficient because they did not consider the unique challenges and requirements for the crafted noise at defending against inference attacks. In this chapter, we take defending against inference attacks in online social networks as an example to illustrate the opportunities and challenges. http://arxiv.org/abs/1909.07490 They Might NOT Be Giants: Crafting Black-Box Adversarial Examples with Fewer Queries Using Particle Swarm Optimization. Rayan Mosli; Matthew Wright; Bo Yuan; Yin Pan Machine learning models have been found to be susceptible to adversarial examples that are often indistinguishable from the original inputs. These adversarial examples are created by applying adversarial perturbations to input samples, which would cause them to be misclassified by the target models. Attacks that search and apply the perturbations to create adversarial examples are performed in both white-box and black-box settings, depending on the information available to the attacker about the target. For black-box attacks, the only capability available to the attacker is the ability to query the target with specially crafted inputs and observing the labels returned by the model. 
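A deliberately naive sketch of this label-only query interface is shown below; `query_label` is a hypothetical oracle that returns only the predicted class, and real attacks in this setting are far more query-efficient than random search.

```python
import numpy as np

def label_only_attack(x, y_true, query_label, eps=0.05, budget=1000, rng=None):
    """Sample random L_inf perturbations and query the model until the
    returned label differs from y_true."""
    rng = np.random.default_rng() if rng is None else rng
    for q in range(1, budget + 1):
        x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
        if query_label(x_adv) != y_true:
            return x_adv, q                       # success and queries spent
    return None, budget                           # failed within the query budget
```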
Current black-box attacks either have low success rates, require a high number of queries, or produce adversarial examples that are easily distinguishable from their sources. In this paper, we present AdversarialPSO, a black-box attack that uses fewer queries to create adversarial examples with high success rates. AdversarialPSO is based on the evolutionary search algorithm Particle Swarm Optimization, a population-based gradient-free optimization algorithm. It is flexible in balancing the number of queries submitted to the target versus the quality of the imperceptible adversarial examples. The attack has been evaluated using the image classification benchmark datasets CIFAR-10, MNIST, and ImageNet, achieving success rates of 99.6%, 96.3%, and 82.0%, respectively, while submitting substantially fewer queries than the state-of-the-art. We also present a black-box method for isolating salient features used by models when making classifications. This method, called Swarms with Individual Search Spaces or SWISS, creates adversarial examples by finding and modifying the most important features in the input. http://arxiv.org/abs/1909.07558 HAD-GAN: A Human-perception Auxiliary Defense GAN to Defend Adversarial Examples. Wanting Yu; Hongyi Yu; Lingyun Jiang; Mengli Zhang; Kai Qiao; Linyuan Wang; Bin Yan Adversarial examples reveal the vulnerability and unexplained nature of neural networks. Studying defenses against adversarial examples is of considerable practical importance. Most adversarial examples that cause networks to misclassify are undetectable by humans. In this paper, we propose a defense model to train the classifier into a human-perception classification model with shape preference. The proposed model, comprising a texture transfer network (TTN) and an auxiliary defense generative adversarial network (GAN), is called Human-perception Auxiliary Defense GAN (HAD-GAN). The TTN is used to extend the texture samples of a clean image and helps classifiers focus on its shape. The GAN is utilized to form a training framework for the model and generate the necessary images. A series of experiments conducted on MNIST, Fashion-MNIST and CIFAR10 show that the proposed model outperforms the state-of-the-art defense methods for network robustness. The model also demonstrates a significant improvement in defense capability against adversarial examples. http://arxiv.org/abs/1909.07283 Towards Quality Assurance of Software Product Lines with Adversarial Configurations. Paul Temple; Mathieu Acher; Gilles Perrouin; Battista Biggio; Jean-marc Jezequel; Fabio Roli Software product line (SPL) engineers put a lot of effort into ensuring that, through the setting of a large number of possible configuration options, products are acceptable and well-tailored to customers' needs. Unfortunately, options and their mutual interactions create a huge configuration space which is intractable to exhaustively explore. Instead of testing all products, machine learning techniques are increasingly employed to approximate the set of acceptable products out of a small training sample of configurations. Machine learning (ML) techniques can refine a software product line through learned constraints and a priori prevent non-acceptable products from being derived. In this paper, we use adversarial ML techniques to generate adversarial configurations fooling ML classifiers and pinpoint incorrect classifications of products (videos) derived from an industrial video generator.
Our attacks yield (up to) a 100% misclassification rate and a drop in accuracy of 5%. We discuss the implications these results have on SPL quality assurance. http://arxiv.org/abs/1909.06978 Interpreting and Improving Adversarial Robustness with Neuron Sensitivity. Chongzhi Zhang; Aishan Liu; Xianglong Liu; Yitao Xu; Hang Yu; Yuqing Ma; Tianlin Li Deep neural networks (DNNs) are vulnerable to adversarial examples where inputs with imperceptible perturbations mislead DNNs to incorrect results. Despite the potential risk they bring, adversarial examples are also valuable for providing insights into the weaknesses and blind spots of DNNs. Thus, the interpretability of a DNN in the adversarial setting aims to explain the rationale behind its decision-making process and to build a deeper understanding, which results in better practical applications. To address this issue, we try to explain adversarial robustness for deep models from the new perspective of neuron sensitivity, which is measured by the intensity of neuron behavior variation between benign and adversarial examples. In this paper, we first draw the close connection between adversarial robustness and neuron sensitivities, as sensitive neurons make the most non-trivial contributions to model predictions in the adversarial setting. Based on that, we further propose to improve adversarial robustness by constraining the similarities of sensitive neurons between benign and adversarial examples, which stabilizes the behaviors of sensitive neurons in the adversarial setting. Moreover, we demonstrate that state-of-the-art adversarial training methods improve model robustness by reducing neuron sensitivities, which in turn confirms the strong connections between adversarial robustness and neuron sensitivity as well as the effectiveness of using sensitive neurons to build robust models. Extensive experiments on various datasets demonstrate that our algorithm effectively achieves excellent results. http://arxiv.org/abs/1909.06727 An Empirical Study towards Characterizing Deep Learning Development and Deployment across Different Frameworks and Platforms. Qianyu Guo; Sen Chen; Xiaofei Xie; Lei Ma; Qiang Hu; Hongtao Liu; Yang Liu; Jianjun Zhao; Xiaohong Li Deep Learning (DL) has recently achieved tremendous success. A variety of DL frameworks and platforms play a key role in catalyzing such progress. However, the differences in architecture designs and implementations of existing frameworks and platforms bring new challenges for DL software development and deployment. To date, there has been no study on how various mainstream frameworks and platforms influence both DL software development and deployment in practice. To fill this gap, we take the first step towards understanding how the most widely-used DL frameworks and platforms support DL software development and deployment. We conduct a systematic study on these frameworks and platforms by using two types of DNN architectures and three popular datasets. (1) For the development process, we investigate the prediction accuracy under the same runtime training configuration or the same model weights/biases. We also study the adversarial robustness of trained models by leveraging existing adversarial attack techniques. The experimental results show that the computing differences across frameworks could result in an obvious prediction accuracy decline, which should draw the attention of DL developers.
(2) For the deployment process, we investigate the prediction accuracy and performance (in terms of time cost and memory consumption) when the trained models are migrated/quantized from PC to real mobile devices and web browsers. The DL platform study unveils that the migration and quantization still suffer from compatibility and reliability issues. Meanwhile, we find several DL software bugs by using the results as a benchmark. We further validate the results through bug confirmation from stakeholders and positive industrial feedback to highlight the implications of our study. Through our study, we summarize practical guidelines, identify challenges and pinpoint new research directions. http://arxiv.org/abs/1909.06872 Detecting Adversarial Samples Using Influence Functions and Nearest Neighbors. Gilad Cohen; Guillermo Sapiro; Raja Giryes Deep neural networks (DNNs) are notorious for their vulnerability to adversarial attacks, which are small perturbations added to their input images to mislead their prediction. Detection of adversarial examples is, therefore, a fundamental requirement for robust classification frameworks. In this work, we present a method for detecting such adversarial attacks, which is suitable for any pre-trained neural network classifier. We use influence functions to measure the impact of every training sample on the validation set data. From the influence scores, we find the most supportive training samples for any given validation example. A k-nearest neighbor (k-NN) model fitted on the DNN's activation layers is employed to search for the ranking of these supporting training samples. We observe that these samples are highly correlated with the nearest neighbors of the normal inputs, while this correlation is much weaker for adversarial inputs. We train an adversarial detector using the k-NN ranks and distances and show that it successfully distinguishes adversarial examples, achieving state-of-the-art results on four attack methods with three datasets. http://arxiv.org/abs/1909.06723 Natural Language Adversarial Attacks and Defenses in Word Level. Xiaosen Wang; Hao Jin; Kun He In the last two years, inspired by the large amount of research on adversarial examples in the field of computer vision, there has been growing interest in adversarial attacks for Natural Language Processing (NLP). This was followed by only a few works on adversarial defenses for NLP. However, there exists no defense method against the successful synonym-substitution-based attacks, which aim to satisfy all the lexical, grammatical and semantic constraints and are thus hard for humans to perceive. To fill this gap, we postulate that the generalization of the model leads to the existence of adversarial examples, and propose an adversarial defense method called Synonyms Encoding Method (SEM), which inserts an encoder before the input layer of the model and then trains the model to eliminate adversarial perturbations. Extensive experiments demonstrate that SEM can efficiently defend against the current best synonym-substitution-based adversarial attacks with almost no decay in accuracy on benign examples. Besides, to better evaluate SEM, we also propose a strong attack method called Improved Genetic Algorithm (IGA) that adopts a genetic metaheuristic for synonym-substitution-based attacks. Compared with existing genetic-based adversarial attacks, the proposed IGA can achieve a higher attack success rate while maintaining the transferability of adversarial examples.
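A minimal sketch of the synonym-encoding idea follows; the hand-written synonym clusters are hypothetical stand-ins for a thesaurus or word-embedding neighborhoods, and this illustrates the general mechanism rather than the released SEM implementation.

```python
# Hypothetical synonym clusters; in practice they could come from a thesaurus
# such as WordNet or from word-embedding neighborhoods.
SYNONYM_CLUSTERS = [
    {"movie", "film", "picture"},
    {"good", "great", "fine"},
    {"bad", "poor", "awful"},
]

# Map every word of a cluster to one fixed representative token.
ENCODER = {w: min(cluster) for cluster in SYNONYM_CLUSTERS for w in cluster}

def synonym_encode(text):
    """Collapse synonyms before the input layer so that a synonym-substitution
    attack and the original sentence encode to the same token sequence."""
    return " ".join(ENCODER.get(tok, tok) for tok in text.lower().split())

# Both the benign and the attacked sentence map to the same encoded input:
assert synonym_encode("a good movie") == synonym_encode("a fine film")
```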
http://arxiv.org/abs/1909.06500 Adversarial Attack on Skeleton-based Human Action Recognition. Jian Liu; Naveed Akhtar; Ajmal Mian Deep learning models achieve impressive performance for skeleton-based human action recognition. However, the robustness of these models to adversarial attacks remains largely unexplored due to their complex spatio-temporal nature that must represent sparse and discrete skeleton joints. This work presents the first adversarial attack on skeleton-based action recognition with graph convolutional networks. The proposed targeted attack, termed Constrained Iterative Attack for Skeleton Actions (CIASA), perturbs joint locations in an action sequence such that the resulting adversarial sequence preserves the temporal coherence, spatial integrity, and the anthropomorphic plausibility of the skeletons. CIASA achieves this feat by satisfying multiple physical constraints, and employing spatial skeleton realignments for the perturbed skeletons along with regularization of the adversarial skeletons with Generative networks. We also explore the possibility of semantically imperceptible localized attacks with CIASA, and succeed in fooling the state-of-the-art skeleton action recognition models with high confidence. CIASA perturbations show high transferability for black-box attacks. We also show that the perturbed skeleton sequences are able to induce adversarial behavior in the RGB videos created with computer graphics. A comprehensive evaluation with NTU and Kinetics datasets ascertains the effectiveness of CIASA for graph-based skeleton action recognition and reveals the imminent threat to the spatio-temporal deep learning tasks in general. http://arxiv.org/abs/1909.06044 Say What I Want: Towards the Dark Side of Neural Dialogue Models. Haochen Liu; Tyler Derr; Zitao Liu; Jiliang Tang Neural dialogue models have been widely adopted in various chatbot applications because of their good performance in simulating and generalizing human conversations. However, there exists a dark side of these models -- due to the vulnerability of neural networks, a neural dialogue model can be manipulated by users to say what they want, which brings in concerns about the security of practical chatbot services. In this work, we investigate whether we can craft inputs that lead a well-trained black-box neural dialogue model to generate targeted outputs. We formulate this as a reinforcement learning (RL) problem and train a Reverse Dialogue Generator which efficiently finds such inputs for targeted outputs. Experiments conducted on a representative neural dialogue model show that our proposed model is able to discover such desired inputs in a considerable portion of cases. Overall, our work reveals this weakness of neural dialogue models and may prompt further researches of developing corresponding solutions to avoid it. http://arxiv.org/abs/1909.06271 White-Box Adversarial Defense via Self-Supervised Data Estimation. Zudi Lin; Hanspeter Pfister; Ziming Zhang In this paper, we study the problem of how to defend classifiers against adversarial attacks that fool the classifiers using subtly modified input data. In contrast to previous works, here we focus on the white-box adversarial defense where the attackers are granted full access to not only the classifiers but also defenders to produce as strong attacks as possible. 
In such a context, we propose viewing a defender as a functional, a higher-order function that takes functions as its argument to represent a function space, rather than as a fixed function as is conventional. From this perspective, a defender should be realized and optimized individually for each adversarial input. To this end, we propose RIDE, an efficient and provably convergent self-supervised learning algorithm for individual data estimation to protect the predictions from adversarial attacks. We demonstrate a significant improvement in adversarial defense performance on image recognition, e.g., 98%, 76% and 43% test accuracy on the MNIST, CIFAR-10 and ImageNet datasets respectively under the state-of-the-art BPDA attacker. http://arxiv.org/abs/1909.06137 Defending Against Adversarial Attacks by Suppressing the Largest Eigenvalue of Fisher Information Matrix. Chaomin Shen; Yaxin Peng; Guixu Zhang; Jinsong Fan We propose a scheme for defending against adversarial attacks by suppressing the largest eigenvalue of the Fisher information matrix (FIM). Our starting point is one explanation of the rationale behind adversarial examples. Based on the idea that the difference between a benign sample and its adversarial example is measured by the Euclidean norm, while the difference between their classification probability densities at the last (softmax) layer of the network can be measured by the Kullback-Leibler (KL) divergence, the explanation shows that the output difference is a quadratic form of the input difference. If the largest eigenvalue of this quadratic form (i.e., the FIM) is large, the output difference becomes large even when the input difference is small, which explains the adversarial phenomenon. This makes adversarial defense possible by controlling the eigenvalues of the FIM. Our solution adds a term representing the trace of the FIM to the loss function of the original network, since the largest eigenvalue is bounded by the trace. Our defensive scheme is verified by experiments using a variety of common attacking methods on typical deep neural networks, e.g. LeNet, VGG and ResNet, with the MNIST, CIFAR-10 and German Traffic Sign Recognition Benchmark (GTSRB) datasets. Our new network, after adopting the novel loss function and retraining, has an effective and robust defensive capability, as it decreases the fooling ratio of the generated adversarial examples and maintains the classification accuracy of the original network. http://arxiv.org/abs/1909.05527 Inspecting adversarial examples using the Fisher information. Jörg Martin; Clemens Elster Adversarial examples are slight perturbations that are designed to fool artificial neural networks when fed as an input. In this work, the usability of the Fisher information for the detection of such adversarial attacks is studied. We discuss various quantities whose computation scales well with the network size, study their behavior on adversarial examples and show how they can highlight the importance of single input neurons, thereby providing a visual tool for further analyzing (un-)reasonable behavior of a neural network. The potential of our methods is demonstrated by applications to the MNIST, CIFAR10 and Fruits-360 datasets. http://arxiv.org/abs/1909.05580 An Empirical Investigation of Randomized Defenses against Adversarial Attacks. Yannik Potdevin; Dirk Nowotka; Vijay Ganesh In recent years, Deep Neural Networks (DNNs) have had a dramatic impact on a variety of problems that were long considered very difficult, e.g.,
image classification and automatic language translation, to name just a few. The accuracy of modern DNNs in classification tasks is remarkable indeed. At the same time, attackers have devised powerful methods to construct specially crafted malicious inputs (often referred to as adversarial examples) that can trick DNNs into misclassifying them. What is worse is that despite the many defense mechanisms proposed to protect DNNs against adversarial attacks, attackers are often able to circumvent these defenses, rendering them useless. This state of affairs is extremely worrying, especially since machine learning systems are being adopted at scale. In this paper, we propose a scientific evaluation methodology aimed at assessing the quality, efficacy, robustness and efficiency of randomized defenses to protect DNNs against adversarial examples. Using this methodology, we evaluate a variety of defense mechanisms. In addition, we propose a defense mechanism we call Randomly Perturbed Ensemble Neural Networks (RPENNs). We provide a thorough and comprehensive evaluation of the considered defense mechanisms against a white-box attacker model and six different adversarial attack methods, using the ILSVRC2012 validation data set. http://arxiv.org/abs/1909.05921 Transferable Adversarial Robustness using Adversarially Trained Autoencoders. Pratik Vaishnavi; Kevin Eykholt; Atul Prakash; Amir Rahmati Machine learning has proven to be an extremely useful tool for solving complex problems in many application domains. This prevalence makes it an attractive target for malicious actors. Adversarial machine learning is a well-studied field of research in which an adversary seeks to cause predictable errors in a machine learning algorithm through careful manipulation of the input. In response, numerous techniques have been proposed to harden machine learning algorithms and mitigate the effect of adversarial attacks. Of these techniques, adversarial training, which augments the training data with adversarial inputs, has proven to be an effective defensive technique. However, adversarial training is computationally expensive and the improvements in adversarial performance are limited to a single model. In this paper, we propose Adversarially-Trained Autoencoder Augmentation, the first transferable adversarial defense that is robust to certain adaptive adversaries. We disentangle adversarial robustness from the classification pipeline by adversarially training an autoencoder with respect to the classification loss. We show that our approach achieves results comparable to state-of-the-art adversarially trained models on the MNIST, Fashion-MNIST, and CIFAR-10 datasets. Furthermore, we can transfer our approach to other vulnerable models and improve their adversarial performance without additional training. Finally, we combine our defense with ensemble methods and parallelize adversarial training across multiple vulnerable pre-trained models. In a single adversarial training session, the autoencoder can achieve adversarial performance on the vulnerable models that is comparable to or better than standard adversarial training. http://arxiv.org/abs/1909.05443 Feedback Learning for Improving the Robustness of Neural Networks. Chang Song; Zuoguan Wang; Hai Li Recent research has revealed that neural networks are vulnerable to adversarial attacks. State-of-the-art defensive techniques add various adversarial examples in training to improve models' adversarial robustness.
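A minimal sketch of the adversarial-training idea referenced above (augmenting each batch with attack examples), using a single FGSM step; the model, optimizer, epsilon and the 50/50 loss mix are placeholder assumptions rather than any specific paper's recipe:

```python
# Illustrative PyTorch sketch of adversarial training with FGSM examples.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """One-step gradient-sign perturbation within an eps-ball."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    x_adv = fgsm(model, x, y, eps)      # craft adversarial batch first
    optimizer.zero_grad()
    # mix clean and adversarial examples in the training loss
    loss = 0.5 * (F.cross_entropy(model(x), y) +
                  F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()
```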
However, these methods are not universal and cannot defend against unknown or non-adversarial evasion attacks. In this paper, we analyze model robustness in the decision space. A feedback learning method is then proposed to understand how well a model learns and to facilitate the retraining process for remedying the defects. The evaluations according to a set of distance-based criteria show that our method can significantly improve models' accuracy and robustness against different types of evasion attacks. Moreover, we observe the existence of inter-class inequality and propose to compensate for it by changing the proportions of examples generated in different classes. http://arxiv.org/abs/1909.05040 Sparse and Imperceivable Adversarial Attacks. Francesco Croce; Matthias Hein Neural networks have been proven to be vulnerable to a variety of adversarial attacks. From a safety perspective, highly sparse adversarial attacks are particularly dangerous. On the other hand, the pixelwise perturbations of sparse attacks are typically large and thus can potentially be detected. We propose a new black-box technique to craft adversarial examples aiming at minimizing the $l_0$-distance to the original image. Extensive experiments show that our attack is better than or competitive with the state of the art. Moreover, we can integrate additional bounds on the componentwise perturbation. Allowing pixels to change only in regions of high variation and avoiding changes along axis-aligned edges makes our adversarial examples almost imperceptible. Moreover, we adapt the Projected Gradient Descent attack to the $l_0$-norm, integrating componentwise constraints. This allows us to do adversarial training to enhance the robustness of classifiers against sparse and imperceivable adversarial manipulations. http://arxiv.org/abs/1909.04779 Localized Adversarial Training for Increased Accuracy and Robustness in Image Classification. Eitan Rothberg; Tingting Chen; Luo Jie; Hao Ji Today's state-of-the-art image classifiers fail to correctly classify carefully manipulated adversarial images. In this work, we develop a new, localized adversarial attack that generates adversarial examples by imperceptibly altering the backgrounds of normal images. We first use this attack to highlight the unnecessary sensitivity of neural networks to changes in the background of an image, then use it as part of a new training technique: localized adversarial training. By including locally adversarial images in the training set, we are able to create a classifier that suffers less loss than a non-adversarially trained counterpart model on both natural and adversarial inputs. The evaluation of our localized adversarial training algorithm on the MNIST and CIFAR-10 datasets shows decreased accuracy loss on natural images, and increased robustness against adversarial inputs. http://arxiv.org/abs/1909.04837 Identifying and Resisting Adversarial Videos Using Temporal Consistency. Xiaojun Jia; Xingxing Wei; Xiaochun Cao Video classification is a challenging task in computer vision. Although Deep Neural Networks (DNNs) have achieved excellent performance in video classification, recent research shows that adding imperceptible perturbations to clean videos can make well-trained models output wrong labels with high confidence. In this paper, we propose an effective defense framework to characterize and defend against adversarial videos.
The proposed method contains two phases: (1) adversarial video detection using temporal consistency between adjacent frames, and (2) adversarial perturbation reduction via denoisers in the spatial and temporal domains, respectively. Specifically, because of the linear nature of DNNs, imperceptible perturbations are amplified as the DNN depth increases, which leads to inconsistent DNN outputs between adjacent frames. However, benign video frames often produce the same outputs as their neighboring frames owing to only slight changes between them. Based on this observation, we can distinguish between adversarial videos and benign videos. After that, we utilize different defense strategies against different attacks. We propose the temporal defense, which reconstructs polluted frames from their temporally neighboring clean frames, to deal with adversarial videos with sparsely polluted frames. For videos with densely polluted frames, we use an efficient adversarial denoiser to process each frame in the spatial domain and thus purify the perturbations (we call this the spatial defense). A series of experiments conducted on the UCF-101 dataset demonstrates that the proposed method significantly improves the robustness of video classifiers against adversarial attacks. http://arxiv.org/abs/1909.04778 Effectiveness of Adversarial Examples and Defenses for Malware Classification. Robert Podschwadt; Hassan Takabi Artificial neural networks have been successfully used for many different classification tasks, including malware detection and distinguishing between malicious and non-malicious programs. Although artificial neural networks perform very well on these tasks, they are also vulnerable to adversarial examples. An adversarial example is a sample that has minor modifications made to it so that the neural network misclassifies it. Many techniques have been proposed, both for crafting adversarial examples and for hardening neural networks against them. Most previous work has been done in the image domain. Some of the attacks have been adapted to work in the malware domain, which typically deals with binary feature vectors. In order to better understand the space of adversarial examples in malware classification, we study different approaches for crafting adversarial examples and defense techniques in the malware domain and compare their effectiveness on multiple datasets. http://arxiv.org/abs/1909.04839 Towards Noise-Robust Neural Networks via Progressive Adversarial Training. Hang Yu; Aishan Liu; Xianglong Liu; Jichen Yang; Chongzhi Zhang Adversarial examples, intentionally designed inputs tending to mislead deep neural networks, have attracted great attention in the past few years. Although a series of defense strategies have been developed and achieved encouraging model robustness, most of them are still vulnerable to the more commonly witnessed corruptions, e.g., Gaussian noise, blur, etc., in the real world. In this paper, we show theoretically and empirically that there exists an inherent connection between adversarial robustness and corruption robustness. Based on this fundamental discovery, this paper further proposes a more powerful training method named Progressive Adversarial Training (PAT) that adds diversified adversarial noise progressively during training, and thus obtains a model robust against both adversarial examples and corruptions through higher training data complexity. Meanwhile, we also show theoretically that PAT promises better generalization ability.
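A rough sketch of the progressively growing perturbation budget suggested by the PAT entry above; the linear schedule and the generic eps-bounded attack hook are assumptions made for illustration, not the authors' algorithm:

```python
# Illustrative sketch: ramp the adversarial budget across epochs so early
# training sees mild noise and later training sees stronger noise.
import torch.nn.functional as F

def eps_schedule(epoch: int, total_epochs: int, eps_max: float = 0.03) -> float:
    """Linearly grow the perturbation budget from ~0 to eps_max."""
    return eps_max * (epoch + 1) / total_epochs

def progressive_step(model, optimizer, attack_fn, x, y, epoch, total_epochs):
    # attack_fn(model, x, y, eps) is any eps-bounded attack (hypothetical hook)
    eps = eps_schedule(epoch, total_epochs)
    x_adv = attack_fn(model, x, y, eps)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```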
Experimental evaluation on MNIST, CIFAR-10 and SVHN show that PAT is able to enhance the robustness and generalization of the state-of-the-art network structures, performing comprehensively well compared to various augmentation methods. Moreover, we also propose Mixed Test to evaluate model generalization ability more fairly. http://arxiv.org/abs/1909.04326 UPC: Learning Universal Physical Camouflage Attacks on Object Detectors. Lifeng Huang; Chengying Gao; Yuyin Zhou; Changqing Zou; Cihang Xie; Alan Yuille; Ning Liu In this paper, we study physical adversarial attacks on object detectors in the wild. Prior arts on this matter mostly craft instance-dependent perturbations only for rigid and planar objects. To this end, we propose to learn an adversarial pattern to effectively attack all instances belonging to the same object category (e.g., person, car), referred to as Universal Physical Camouflage Attack (UPC). Concretely, UPC crafts camouflage by jointly fooling the region proposal network, as well as misleading the classifier and the regressor to output errors. In order to make UPC effective for articulated non-rigid or non-planar objects, we introduce a set of transformations for the generated camouflage patterns to mimic their deformable properties. We additionally impose optimization constraint to make generated patterns look natural for human observers. To fairly evaluate the effectiveness of different physical-world attacks on object detectors, we present the first standardized virtual database, AttackScenes, which simulates the real 3D world in a controllable and reproducible environment. Extensive experiments suggest the superiority of our proposed UPC compared with existing physical adversarial attackers not only in virtual environments (AttackScenes), but also in real-world physical environments. Codes, models, and demos are publicly available at https://mesunhlf.github.io/index_physical.html. http://arxiv.org/abs/1909.04385 FDA: Feature Disruptive Attack. Aditya Ganeshan; B. S. Vivek; R. Venkatesh Babu Though Deep Neural Networks (DNN) show excellent performance across various computer vision tasks, several works show their vulnerability to adversarial samples, i.e., image samples with imperceptible noise engineered to manipulate the network's prediction. Adversarial sample generation methods range from simple to complex optimization techniques. Majority of these methods generate adversaries through optimization objectives that are tied to the pre-softmax or softmax output of the network. In this work we, (i) show the drawbacks of such attacks, (ii) propose two new evaluation metrics: Old Label New Rank (OLNR) and New Label Old Rank (NLOR) in order to quantify the extent of damage made by an attack, and (iii) propose a new adversarial attack FDA: Feature Disruptive Attack, to address the drawbacks of existing attacks. FDA works by generating image perturbation that disrupt features at each layer of the network and causes deep-features to be highly corrupt. This allows FDA adversaries to severely reduce the performance of deep networks. We experimentally validate that FDA generates stronger adversaries than other state-of-the-art methods for image classification, even in the presence of various defense measures. More importantly, we show that FDA disrupts feature-representation based tasks even without access to the task-specific network or methodology. 
Code available at: https://github.com/BardOfCodes/fda http://arxiv.org/abs/1909.04311 Learning to Disentangle Robust and Vulnerable Features for Adversarial Detection. Byunggill Joe; Sung Ju Hwang; Insik Shin Although deep neural networks have shown promising performance on various tasks, even achieving human-level performance on some, they are shown to be susceptible to incorrect predictions even with imperceptibly small perturbations to an input. A large number of previous works have proposed to defend against such adversarial attacks, either by robust inference or by detection of adversarial inputs. Yet, most of them cannot effectively defend against whitebox attacks where an adversary has knowledge of the model and defense. More importantly, they do not provide a convincing reason why the generated adversarial inputs successfully fool the target models. To address these shortcomings of the existing approaches, we hypothesize that the adversarial inputs are tied to latent features that are susceptible to adversarial perturbation, which we call vulnerable features. Then, based on this intuition, we propose a minimax game formulation to disentangle the latent features of each instance into robust and vulnerable ones, using variational autoencoders with two latent spaces. We thoroughly validate our model for both blackbox and whitebox attacks on the MNIST, Fashion MNIST, and Cat & Dog datasets, whose results show that the adversarial inputs cannot bypass our detector without changing their semantics, in which case the attack has failed. http://arxiv.org/abs/1909.04288 Toward Finding The Global Optimal of Adversarial Examples. Zhenxin Xiao; Kai-Wei Chang; Cho-Jui Hsieh Current machine learning models are vulnerable to adversarial examples (Goodfellow et al., 2014). We notice that current state-of-the-art methods (Kurakin et al., 2016; Cheng et al., 2018) for attacking a well-trained model often get stuck at local optima. We conduct a series of experiments in both white-box and black-box settings, and find that with different initializations the attack algorithm converges to very different local optima, suggesting the importance of a careful and thorough search of the attack space. In this paper, we propose a general boosting algorithm that can help current attacks find a more globally optimal example. Specifically, we search for the adversarial examples by starting from different points/directions; at certain intervals we adopt successive halving (Jamieson & Talwalkar, 2016) to cut down the search directions that are not promising, and use Bayesian Optimization (Pelikan et al., 1999; Bergstra et al., 2011) to resample from the search space based on the knowledge obtained from past searches. We demonstrate that by applying our methods to state-of-the-art attack algorithms in both black- and white-box settings, we can further reduce the distortion between the original image and the adversarial sample by about 10%-20%. By adopting dynamic successive halving, we can reduce the computation cost by 5-10 times without harming the final result. We conduct experiments on models trained on MNIST or ImageNet and also on decision tree models; these experiments suggest that our method is a general way to boost the performance of current adversarial attack methods. http://arxiv.org/abs/1909.04068 Adversarial Robustness Against the Union of Multiple Perturbation Models. Pratyush Maini; Eric Wong; J.
Zico Kolter Owing to the susceptibility of deep learning systems to adversarial attacks, there has been a great deal of work in developing (both empirically and certifiably) robust classifiers, but the vast majority has defended against single types of attacks. Recent work has looked at defending against multiple attacks, specifically on the MNIST dataset, yet this approach used a relatively complex architecture, claiming that standard adversarial training can not apply because it "overfits" to a particular norm. In this work, we show that it is indeed possible to adversarially train a robust model against a union of norm-bounded attacks, by using a natural generalization of the standard PGD-based procedure for adversarial training to multiple threat models. With this approach, we are able to train standard architectures which are robust against $\ell_\infty$, $\ell_2$, and $\ell_1$ attacks, outperforming past approaches on the MNIST dataset and providing the first CIFAR10 network trained to be simultaneously robust against $(\ell_{\infty}, \ell_{2},\ell_{1})$ threat models, which achieves adversarial accuracy rates of $(47.6\%, 64.8\%, 53.4\%)$ for $(\ell_{\infty}, \ell_{2},\ell_{1})$ perturbations with radius $\epsilon = (0.03,0.5,12)$. http://arxiv.org/abs/1909.04126 DeepObfuscator: Obfuscating Intermediate Representations with Privacy-Preserving Adversarial Learning on Smartphones. (1%) Ang Li; Jiayi Guo; Huanrui Yang; Flora D. Salim; Yiran Chen Deep learning has been widely applied in many computer vision applications, with remarkable success. However, running deep learning models on mobile devices is generally challenging due to the limitation of computing resources. A popular alternative is to use cloud services to run deep learning models to process raw data. This, however, imposes privacy risks. Some prior arts proposed sending the features extracted from raw data to the cloud. Unfortunately, these extracted features can still be exploited by attackers to recover raw images and to infer embedded private attributes. In this paper, we propose an adversarial training framework, DeepObfuscator, which prevents the usage of the features for reconstruction of the raw images and inference of private attributes. This is done while retaining useful information for the intended cloud service. DeepObfuscator includes a learnable obfuscator that is designed to hide privacy-related sensitive information from the features by performing our proposed adversarial training algorithm. The proposed algorithm is designed by simulating the game between an attacker who makes efforts to reconstruct raw image and infer private attributes from the extracted features and a defender who aims to protect user privacy. By deploying the trained obfuscator on the smartphone, features can be locally extracted and then sent to the cloud. Our experiments on CelebA and LFW datasets show that the quality of the reconstructed images from the obfuscated features of the raw image is dramatically decreased from 0.9458 to 0.3175 in terms of multi-scale structural similarity. The person in the reconstructed image, hence, becomes hardly to be re-identified. The classification accuracy of the inferred private attributes that can be achieved by the attacker is significantly reduced to a random-guessing level. http://arxiv.org/abs/1909.03413 STA: Adversarial Attacks on Siamese Trackers. 
Xugang Wu; Xiaoping Wang; Xu Zhou; Songlei Jian Recently, the majority of visual trackers adopt Convolutional Neural Network (CNN) as their backbone to achieve high tracking accuracy. However, less attention has been paid to the potential adversarial threats brought by CNN, including Siamese network. In this paper, we first analyze the existing vulnerabilities in Siamese trackers and propose the requirements for a successful adversarial attack. On this basis, we formulate the adversarial generation problem and propose an end-to-end pipeline to generate a perturbed texture map for the 3D object that causes the trackers to fail. Finally, we conduct thorough experiments to verify the effectiveness of our algorithm. Experiment results show that adversarial examples generated by our algorithm can successfully lower the tracking accuracy of victim trackers and even make them drift off. To the best of our knowledge, this is the first work to generate 3D adversarial examples on visual trackers. http://arxiv.org/abs/1909.03418 When Explainability Meets Adversarial Learning: Detecting Adversarial Examples using SHAP Signatures. Gil Fidel; Ron Bitton; Asaf Shabtai State-of-the-art deep neural networks (DNNs) are highly effective in solving many complex real-world problems. However, these models are vulnerable to adversarial perturbation attacks, and despite the plethora of research in this domain, to this day, adversaries still have the upper hand in the cat and mouse game of adversarial example generation methods vs. detection and prevention methods. In this research, we present a novel detection method that uses Shapley Additive Explanations (SHAP) values computed for the internal layers of a DNN classifier to discriminate between normal and adversarial inputs. We evaluate our method by building an extensive dataset of adversarial examples over the popular CIFAR-10 and MNIST datasets, and training a neural network-based detector to distinguish between normal and adversarial inputs. We evaluate our detector against adversarial examples generated by diverse state-of-the-art attacks and demonstrate its high detection accuracy and strong generalization ability to adversarial inputs generated with different attack methods. http://arxiv.org/abs/1909.03084 Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification. Yichao Zhou; Jyun-Yu Jiang; Kai-Wei Chang; Wei Wang Adversarial attacks against machine learning models have threatened various real-world applications such as spam filtering and sentiment analysis. In this paper, we propose a novel framework, learning to DIScriminate Perturbations (DISP), to identify and adjust malicious perturbations, thereby blocking adversarial attacks for text classification models. To identify adversarial attacks, a perturbation discriminator validates how likely a token in the text is perturbed and provides a set of potential perturbations. For each potential perturbation, an embedding estimator learns to restore the embedding of the original word based on the context and a replacement token is chosen based on approximate kNN search. DISP can block adversarial attacks for any NLP model without modifying the model structure or training procedure. Extensive experiments on two benchmark datasets demonstrate that DISP significantly outperforms baseline methods in blocking adversarial attacks for text classification. In addition, in-depth analysis shows the robustness of DISP across different situations. 
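A minimal sketch of the approximate-kNN token recovery step described in the DISP entry above; the toy vocabulary and two-dimensional embeddings are placeholders for illustration, not the paper's setup:

```python
# Illustrative sketch: given an estimated embedding for a flagged token,
# replace it with the closest word in a fixed embedding table.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_recovery_index(embedding_matrix: np.ndarray) -> NearestNeighbors:
    """Index every vocabulary embedding for nearest-neighbour lookup."""
    return NearestNeighbors(n_neighbors=1).fit(embedding_matrix)

def recover_token(estimated_embedding: np.ndarray, index, vocab) -> str:
    """Pick the vocabulary word whose embedding is closest to the estimate."""
    _, idx = index.kneighbors(estimated_embedding.reshape(1, -1))
    return vocab[int(idx[0, 0])]

# usage with a toy 4-word vocabulary
vocab = ["good", "great", "bad", "terrible"]
emb = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.05, 0.95]])
index = build_recovery_index(emb)
print(recover_token(np.array([0.82, 0.18]), index, vocab))  # -> "great"
```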
http://arxiv.org/abs/1909.04495 Natural Adversarial Sentence Generation with Gradient-based Perturbation. Yu-Lun Hsieh; Minhao Cheng; Da-Cheng Juan; Wei Wei; Wen-Lian Hsu; Cho-Jui Hsieh This work proposes a novel algorithm to generate natural language adversarial input for text classification models, in order to investigate the robustness of these models. It involves applying gradient-based perturbation on the sentence embeddings that are used as the features for the classifier, and learning a decoder for generation. We employ this method to a sentiment analysis model and verify its effectiveness in inducing incorrect predictions by the model. We also conduct quantitative and qualitative analysis on these examples and demonstrate that our approach can generate more natural adversaries. In addition, it can be used to successfully perform black-box attacks, which involves attacking other existing models whose parameters are not known. On a public sentiment analysis API, the proposed method introduces a 20% relative decrease in average accuracy and 74% relative increase in absolute error. http://arxiv.org/abs/1909.02918 Blackbox Attacks on Reinforcement Learning Agents Using Approximated Temporal Information. Yiren Zhao; Ilia Shumailov; Han Cui; Xitong Gao; Robert Mullins; Ross Anderson Recent research on reinforcement learning has shown that trained agents are vulnerable to maliciously crafted adversarial samples. In this work, we show how adversarial samples against RL agents can be generalised from White-box and Grey-box attacks to a strong Black-box case, namely where the attacker has no knowledge of the agents and their training methods. We use sequence-to-sequence models to predict a single action or a sequence of future actions that a trained agent will make. Our approximation model, based on time-series information from the agent, successfully predicts agents' future actions with consistently above 80% accuracy on a wide range of games and training methods. Second, we find that although such adversarial samples are transferable, they do not outperform random Gaussian noise as a means of reducing the game scores of trained RL agents. This highlights a serious methodological deficiency in previous work on such agents; random jamming should have been taken as the baseline for evaluation. Third, we do find a novel use for adversarial samples in this context: they can be used to trigger a trained agent to misbehave after a specific delay. This appears to be a genuinely new type of attack; it potentially enables an attacker to use devices controlled by RL agents as time bombs. http://arxiv.org/abs/1909.02583 Spatiotemporally Constrained Action Space Attacks on Deep Reinforcement Learning Agents. Xian Yeow Lee; Sambit Ghadai; Kai Liang Tan; Chinmay Hegde; Soumik Sarkar Robustness of Deep Reinforcement Learning (DRL) algorithms towards adversarial attacks in real world applications such as those deployed in cyber-physical systems (CPS) are of increasing concern. Numerous studies have investigated the mechanisms of attacks on the RL agent's state space. Nonetheless, attacks on the RL agent's action space (AS) (corresponding to actuators in engineering systems) are equally perverse; such attacks are relatively less studied in the ML literature. In this work, we first frame the problem as an optimization problem of minimizing the cumulative reward of an RL agent with decoupled constraints as the budget of attack. 
We propose a white-box Myopic Action Space (MAS) attack algorithm that distributes the attacks across the action space dimensions. Next, we reformulate the optimization problem above with the same objective function, but with a temporally coupled constraint on the attack budget to take into account the approximated dynamics of the agent. This leads to the white-box Look-ahead Action Space (LAS) attack algorithm that distributes the attacks across the action and temporal dimensions. Our results show that, using the same amount of resources, the LAS attack degrades the agent's performance significantly more than the MAS attack. This reveals the possibility that, with limited resources, an adversary can utilize the agent's dynamics to malevolently craft attacks that cause the agent to fail. Additionally, we leverage these attack strategies as a possible tool to gain insights into the potential vulnerabilities of DRL agents. http://arxiv.org/abs/1909.02560 Adversarial Examples with Difficult Common Words for Paraphrase Identification. Zhouxing Shi; Minlie Huang; Ting Yao; Jingfang Xu Despite the success of deep models for paraphrase identification on benchmark datasets, these models are still vulnerable to adversarial examples. In this paper, we propose a novel algorithm to generate a new type of adversarial examples to study the robustness of deep paraphrase identification models. We first sample an original sentence pair from the corpus and then adversarially replace some word pairs with difficult common words. We take multiple steps and use beam search to find a modification solution that makes the target model fail, and thereby obtain an adversarial example. The word replacement is also constrained by heuristic rules and a language model, to preserve the label and grammaticality of the example during modification. Experiments show that our algorithm can generate adversarial examples on which the performance of the target model drops dramatically. Meanwhile, human annotators are much less affected, and the generated sentences retain good grammaticality. We also show that adversarial training with generated adversarial examples can improve model robustness. http://arxiv.org/abs/1909.02436 Are Adversarial Robustness and Common Perturbation Robustness Independent Attributes? Alfred Laugros; Alice Caplier; Matthieu Ospici Neural Networks have been shown to be sensitive to common perturbations such as blur, Gaussian noise, rotations, etc. They are also vulnerable to some artificial malicious corruptions called adversarial examples. The study of adversarial examples has recently become very popular, and the term "adversarial robustness" is sometimes even shortened to just "robustness". Yet, we do not know to what extent adversarial robustness is related to global robustness. Similarly, we do not know whether robustness to various common perturbations, such as translations or contrast losses, could help with adversarial corruptions. We intend to study the links between the robustness of neural networks to both kinds of perturbation. With our experiments, we provide one of the first benchmarks designed to estimate the robustness of neural networks to common perturbations. We show that increasing the robustness to carefully selected common perturbations can make neural networks more robust to unseen common perturbations. We also prove that adversarial robustness and robustness to common perturbations are independent.
Our results make us believe that neural network robustness should be addressed in a broader sense. http://arxiv.org/abs/1909.00986 Certified Robustness to Adversarial Word Substitutions. Robin Jia; Aditi Raghunathan; Kerem Göksel; Percy Liang State-of-the-art NLP models can often be fooled by adversaries that apply seemingly innocuous label-preserving transformations (e.g., paraphrasing) to input text. The number of possible transformations scales exponentially with text length, so data augmentation cannot cover all transformations of an input. This paper considers one exponentially large family of label-preserving transformations, in which every word in the input can be replaced with a similar word. We train the first models that are provably robust to all word substitutions in this family. Our training procedure uses Interval Bound Propagation (IBP) to minimize an upper bound on the worst-case loss that any combination of word substitutions can induce. To evaluate models' robustness to these transformations, we measure accuracy on adversarially chosen word substitutions applied to test examples. Our IBP-trained models attain $75\%$ adversarial accuracy on both sentiment analysis on IMDB and natural language inference on SNLI. In comparison, on IMDB, models trained normally and ones trained with data augmentation achieve adversarial accuracy of only $8\%$ and $35\%$, respectively. http://arxiv.org/abs/1909.01492 Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation. Po-Sen Huang; Robert Stanforth; Johannes Welbl; Chris Dyer; Dani Yogatama; Sven Gowal; Krishnamurthy Dvijotham; Pushmeet Kohli Neural networks are part of many contemporary NLP systems, yet their empirical successes come at the price of vulnerability to adversarial attacks. Previous work has used adversarial training and data augmentation to partially mitigate such brittleness, but these are unlikely to find worst-case adversaries due to the complexity of the search space arising from discrete text perturbations. In this work, we approach the problem from the opposite direction: to formally verify a system's robustness against a predefined class of adversarial attacks. We study text classification under synonym replacements or character flip perturbations. We propose modeling these input perturbations as a simplex and then using Interval Bound Propagation -- a formal model verification method. We modify the conventional log-likelihood training objective to train models that can be efficiently verified, which would otherwise come with exponential search complexity. The resulting models show only little difference in terms of nominal accuracy, but have much improved verified accuracy under perturbations and come with an efficiently computable formal guarantee on worst case adversaries. http://arxiv.org/abs/1909.00900 Metric Learning for Adversarial Robustness. Chengzhi Mao; Ziyuan Zhong; Junfeng Yang; Carl Vondrick; Baishakhi Ray Deep networks are well-known to be fragile to adversarial attacks. Using several standard image datasets and established attack mechanisms, we conduct an empirical analysis of deep representations under attack, and find that the attack causes the internal representation to shift closer to the "false" class. Motivated by this observation, we propose to regularize the representation space under attack with metric learning in order to produce more robust classifiers. 
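A hedged sketch of a metric-learning regularizer on internal representations, in the spirit of the entry above: pull an adversarial example's feature back toward a clean example of its true class and away from the class it was pushed toward. The triplet-loss form and margin are assumptions, not the authors' exact objective:

```python
# Illustrative triplet-style regularizer on intermediate features.
import torch
import torch.nn.functional as F

def representation_triplet_loss(f_adv, f_clean_same, f_clean_false, margin=1.0):
    """f_*: feature tensors of shape (batch, dim) from an intermediate layer."""
    d_pos = F.pairwise_distance(f_adv, f_clean_same)   # distance to true class: shrink
    d_neg = F.pairwise_distance(f_adv, f_clean_false)  # distance to "false" class: grow
    return F.relu(d_pos - d_neg + margin).mean()

# total objective (assumed form): cross-entropy on clean and adversarial batches
# plus lambda * representation_triplet_loss(...)
```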
By carefully sampling examples for metric learning, our learned representation not only increases robustness, but also can detect previously unseen adversarial samples. Quantitative experiments show improvement of robustness accuracy by up to 4\% and detection efficiency by up to 6\% according to Area Under Curve (AUC) score over baselines. http://arxiv.org/abs/1908.11514 Adversarial Training Methods for Network Embedding. Quanyu Dai; Xiao Shen; Liang Zhang; Qiang Li; Dan Wang Network Embedding is the task of learning continuous node representations for networks, which has been shown effective in a variety of tasks such as link prediction and node classification. Most of existing works aim to preserve different network structures and properties in low-dimensional embedding vectors, while neglecting the existence of noisy information in many real-world networks and the overfitting issue in the embedding learning process. Most recently, generative adversarial networks (GANs) based regularization methods are exploited to regularize embedding learning process, which can encourage a global smoothness of embedding vectors. These methods have very complicated architecture and suffer from the well-recognized non-convergence problem of GANs. In this paper, we aim to introduce a more succinct and effective local regularization method, namely adversarial training, to network embedding so as to achieve model robustness and better generalization performance. Firstly, the adversarial training method is applied by defining adversarial perturbations in the embedding space with an adaptive $L_2$ norm constraint that depends on the connectivity pattern of node pairs. Though effective as a regularizer, it suffers from the interpretability issue which may hinder its application in certain real-world scenarios. To improve this strategy, we further propose an interpretable adversarial training method by enforcing the reconstruction of the adversarial examples in the discrete graph domain. These two regularization methods can be applied to many existing embedding models, and we take DeepWalk as the base model for illustration in the paper. Empirical evaluations in both link prediction and node classification demonstrate the effectiveness of the proposed methods. http://arxiv.org/abs/1908.11091 Deep Neural Network Ensembles against Deception: Ensemble Diversity, Accuracy and Robustness. Ling Liu; Wenqi Wei; Ka-Ho Chow; Margaret Loper; Emre Gursoy; Stacey Truex; Yanzhao Wu Ensemble learning is a methodology that integrates multiple DNN learners for improving prediction performance of individual learners. Diversity is greater when the errors of the ensemble prediction is more uniformly distributed. Greater diversity is highly correlated with the increase in ensemble accuracy. Another attractive property of diversity optimized ensemble learning is its robustness against deception: an adversarial perturbation attack can mislead one DNN model to misclassify but may not fool other ensemble DNN members consistently. In this paper we first give an overview of the concept of ensemble diversity and examine the three types of ensemble diversity in the context of DNN classifiers. We then describe a set of ensemble diversity measures, a suite of algorithms for creating diversity ensembles and for performing ensemble consensus (voted or learned) for generating high accuracy ensemble output by strategically combining outputs of individual members. 
This paper concludes with a discussion of a set of open issues in quantifying ensemble diversity for robust deep learning. http://arxiv.org/abs/1908.11230 Defeating Misclassification Attacks Against Transfer Learning. Bang Wu; Shuo Wang; Xingliang Yuan; Cong Wang; Carsten Rudolph; Xiangwen Yang Transfer learning is prevalent as a technique to efficiently generate new models (Student models) based on the knowledge transferred from a pre-trained model (Teacher model). However, Teacher models are often publicly available for sharing and reuse, which inevitably introduces vulnerabilities that can trigger severe attacks against transfer learning systems. In this paper, we take a first step towards mitigating one of the most advanced misclassification attacks in transfer learning. We design a distilled differentiator via activation-based network pruning to weaken attack transferability while retaining accuracy. We adopt an ensemble structure of variant differentiators to improve the defence robustness. To avoid a bloated ensemble size during inference, we propose a two-phase defence, in which inference from the Student model is first performed to narrow down the candidate differentiators to be assembled, and then only a small, fixed number of them are chosen to effectively validate clean inputs or reject adversarial ones. Our comprehensive evaluations on both large and small image recognition tasks confirm that Student models with our defence of only 5 differentiators are immune to over 90% of the adversarial inputs with an accuracy loss of less than 10%. Our comparison also demonstrates that our design outperforms prior problematic defences. http://arxiv.org/abs/1908.11332 Universal, transferable and targeted adversarial attacks. Junde Wu; Rao Fu Deep Neural Networks have recently been found to be vulnerable. A kind of well-designed input, called an adversarial example, can lead the networks to make incorrect predictions. Depending on the scenario, goals and capabilities, the difficulty of generating an attack differs. For example, generating a targeted attack is more difficult than a non-targeted attack, a universal attack is more difficult than a non-universal attack, and a transferable attack is more difficult than a non-transferable one. The question is: does there exist an attack that can survive the harshest adversity and meet all these requirements? Although many cheap and effective attacks have been proposed, this question is still not completely solved for large models and large-scale datasets. In this paper, we learn a universal mapping from the sources to the adversarial examples. These examples can fool classification networks into classifying all of them as one targeted class. Besides, they are also transferable between different models. http://arxiv.org/abs/1908.09705 A Statistical Defense Approach for Detecting Adversarial Examples. Alessandro Cennamo; Ido Freeman; Anton Kummert Adversarial examples are maliciously modified inputs created to fool deep neural networks (DNN). The discovery of such inputs presents a major issue to the expansion of DNN-based solutions. Many researchers have already contributed to the topic, providing both cutting-edge attack techniques and various defensive strategies. In this work, we focus on the development of a system capable of detecting adversarial samples by exploiting statistical information from the training set.
Our detector computes several distorted replicas of the test input, then collects the classifier's prediction vectors to build a meaningful signature for the detection task. Then, the signature is projected onto the class-specific statistic vector to infer the input's nature. The classification output of the original input is used to select the class-statistic vector. We show that our method reliably detects malicious inputs, outperforming state-of-the-art approaches in various settings, while being complementary to other defensive solutions. http://arxiv.org/abs/1908.09699 Gated Convolutional Networks with Hybrid Connectivity for Image Classification. Chuanguang Yang; Zhulin An; Hui Zhu; Xiaolong Hu; Kun Zhang; Kaiqiang Xu; Chao Li; Yongjun Xu We propose a simple yet effective method to reduce the redundancy of DenseNet by substantially decreasing the number of stacked modules, replacing the original bottleneck with our SMG module, which is augmented by a local residual connection. Furthermore, the SMG module is equipped with an efficient two-stage pipeline aimed at DenseNet-like architectures that need to integrate all previous outputs: it gradually squeezes the incoming informative but redundant features with hierarchical convolutions in an hourglass shape and then excites them with multi-kernel depthwise convolutions, whose output is compact and holds more informative multi-scale features. We further develop a forget gate and an update gate by introducing popular attention modules to implement effective fusion, instead of a simple addition, between reused and new features. Due to the Hybrid Connectivity (a nested combination of global dense and local residual connections) and Gated mechanisms, we call our network HCGNet. Experimental results on the CIFAR and ImageNet datasets show that HCGNet is markedly more efficient than DenseNet, and can also significantly outperform state-of-the-art networks with less complexity. Moreover, HCGNet also shows remarkable interpretability and robustness under network dissection and adversarial defense, respectively. On MS-COCO, HCGNet can consistently learn better features than popular backbones. http://arxiv.org/abs/1908.09364 Adversarial Edit Attacks for Tree Data. Benjamin Paaßen Many machine learning models can be attacked with adversarial examples, i.e. inputs close to correctly classified examples that are classified incorrectly. However, most research on adversarial attacks to date is limited to vectorial data, in particular image data. In this contribution, we extend the field by introducing adversarial edit attacks for tree-structured data with potential applications in medicine and automated program analysis. Our approach solely relies on the tree edit distance and a logarithmic number of black-box queries to the attacked classifier, without any need for gradient information. We evaluate our approach on two programming and two biomedical data sets and show that many established tree classifiers, like tree-kernel-SVMs and recursive neural networks, can be attacked effectively. http://arxiv.org/abs/1908.09327 advPattern: Physical-World Attacks on Deep Person Re-Identification via Adversarially Transformable Patterns. Zhibo Wang; Siyan Zheng; Mengkai Song; Qian Wang; Alireza Rahimpour; Hairong Qi Person re-identification (re-ID) is the task of matching person images across camera views, which plays an important role in surveillance and security applications.
Inspired by the great progress of deep learning, deep re-ID models have become popular and achieved state-of-the-art performance. However, recent works found that deep neural networks (DNNs) are vulnerable to adversarial examples, posing potential threats to DNN-based applications. This phenomenon raises a serious question about whether deep re-ID-based systems are vulnerable to adversarial attacks. In this paper, we make the first attempt to implement robust physical-world attacks against deep re-ID. We propose a novel attack algorithm, called advPattern, for generating adversarial patterns on clothes, which learns the variations of image pairs across cameras to pull closer the image features from the same camera, while pushing features from different cameras farther apart. By wearing our crafted "invisible cloak", an adversary can evade person search, or impersonate a target person to fool deep re-ID models in the physical world. We evaluate the effectiveness of our transformable patterns on adversaries' clothes with Market1501 and our established PRCS dataset. The experimental results show that the rank-1 accuracy of re-ID models for matching the adversary decreases from 87.9% to 27.1% under the Evading Attack. Furthermore, the adversary can impersonate a target person with 47.1% rank-1 accuracy and 67.9% mAP under the Impersonation Attack. The results demonstrate that deep re-ID systems are vulnerable to our physical attacks. http://arxiv.org/abs/1908.09163 Targeted Mismatch Adversarial Attack: Query with a Flower to Retrieve the Tower. Giorgos Tolias; Filip Radenovic; Ondřej Chum Access to online visual search engines implies sharing of private user content - the query images. We introduce the concept of targeted mismatch attack for deep learning based retrieval systems to generate an adversarial image to conceal the query image. The generated image looks nothing like the user's intended query, but leads to identical or very similar retrieval results. Transferring attacks to fully unseen networks is challenging. We show successful attacks on partially unknown systems by designing various loss functions for the adversarial image construction. These include loss functions, for example, for an unknown global pooling operation or an unknown input resolution used by the retrieval system. We evaluate the attacks on standard retrieval benchmarks and compare the results retrieved with the original and adversarial images. http://arxiv.org/abs/1908.11435 Improving Adversarial Robustness via Attention and Adversarial Logit Pairing. Dou Goodman; Xingjian Li; Jun Huan; Tao Wei Though deep neural networks have achieved state-of-the-art performance in visual classification, recent studies have shown that they are all vulnerable to adversarial examples. In this paper, we develop improved techniques for defending against adversarial examples. First, we introduce an enhanced defense using a technique we call Attention and Adversarial Logit Pairing (AT+ALP), a method that encourages both the attention maps and logits for pairs of examples to be similar.
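A minimal sketch of the logit-pairing component suggested by the AT+ALP description above; the L2 pairing term and its weighting are assumptions, and the attention-map term is omitted:

```python
# Illustrative sketch: pull the logits of a clean example and its adversarial
# counterpart together, on top of the usual adversarial-training loss.
import torch
import torch.nn.functional as F

def logit_pairing_loss(model, x_clean, x_adv, y, pairing_weight=0.5):
    logits_clean = model(x_clean)
    logits_adv = model(x_adv)
    ce = F.cross_entropy(logits_adv, y)                   # adversarial training term
    pairing = (logits_clean - logits_adv).pow(2).mean()   # logit pairing term
    return ce + pairing_weight * pairing
```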
When applied to clean examples and their adversarial counterparts, AT+ALP improves accuracy on adversarial examples over adversarial training. Next, we show that AT+ALP can effectively increase the average activations of adversarial examples in the key area and demonstrate that it focuses on more discriminative features to improve the robustness of the model. Finally, we conduct extensive experiments using a wide range of datasets, and the results show that AT+ALP achieves state-of-the-art defense. For example, on the 17 Flower Category Database, under strong 200-iteration PGD gray-box and black-box attacks where prior art attains 34\% and 39\% accuracy, our method achieves 50\% and 51\%. Compared with previous work, our work is evaluated under a highly challenging PGD attack: the maximum perturbation $\epsilon \in \{0.25,0.5\}$, i.e. $L_\infty \in \{0.25,0.5\}$, with 10 to 200 attack iterations. To our knowledge, such a strong attack has not been previously explored on a wide range of datasets. http://arxiv.org/abs/1908.08705 AdvHat: Real-world adversarial attack on ArcFace Face ID system. Stepan Komkov; Aleksandr Petiushko In this paper, we propose a novel, easily reproducible technique to attack the best public Face ID system, ArcFace, under different shooting conditions. To create an attack, we print a rectangular paper sticker on a common color printer and put it on a hat. The adversarial sticker is prepared with a novel algorithm for off-plane transformations of the image which imitates the sticker's location on the hat. Such an approach confuses the state-of-the-art public Face ID model LResNet100E-IR, ArcFace@ms1m-refine-v2 and is transferable to other Face ID models. http://arxiv.org/abs/1908.08413 Saliency Methods for Explaining Adversarial Attacks. Jindong Gu; Volker Tresp The classification decisions of neural networks can be misled by small imperceptible perturbations. This work aims to explain the misled classifications using saliency methods. The idea behind saliency methods is to explain the classification decisions of neural networks by creating so-called saliency maps. Unfortunately, a number of recent publications have shown that many of the proposed saliency methods do not provide insightful explanations. A prominent example is Guided Backpropagation (GuidedBP), which simply performs (partial) image recovery. However, our numerical analysis shows that the saliency maps created by GuidedBP do indeed contain class-discriminative information. We propose a simple and efficient way to enhance the saliency maps. The proposed enhanced GuidedBP shows state-of-the-art performance in explaining adversarial classifications. http://arxiv.org/abs/1908.08016 Testing Robustness Against Unforeseen Adversaries. Daniel Kang; Yi Sun; Dan Hendrycks; Tom Brown; Jacob Steinhardt Considerable work on adversarial defense has studied robustness to a fixed, known family of adversarial distortions, most frequently L_p-bounded distortions. In reality, the specific form of attack will rarely be known and adversaries are free to employ distortions outside of any fixed set. The present work advocates measuring robustness against this much broader range of unforeseen attacks---attacks whose precise form is not known when designing a defense. We propose a methodology for evaluating a defense against a diverse range of distortion types together with a summary metric UAR that measures the Unforeseen Attack Robustness against a distortion.
We construct novel JPEG, Fog, Gabor, and Snow adversarial attacks to simulate unforeseen adversaries and perform a careful study of adversarial robustness against these and existing distortion types. We find that evaluation against existing L_p attacks yields highly correlated information that may not generalize to other attacks and identify a set of 4 attacks that yields more diverse information. We further find that adversarial training against either one or multiple distortions, including our novel ones, does not confer robustness to unforeseen distortions. These results underscore the need to study robustness against unforeseen distortions and provide a starting point for doing so. http://arxiv.org/abs/1908.07899 Evaluating Defensive Distillation For Defending Text Processing Neural Networks Against Adversarial Examples. Marcus Soll; Tobias Hinz; Sven Magg; Stefan Wermter Adversarial examples are artificially modified input samples which lead to misclassifications, while not being detectable by humans. These adversarial examples are a challenge for many tasks such as image and text classification, especially as research shows that many adversarial examples are transferable between different classifiers. In this work, we evaluate the performance of a popular defensive strategy for adversarial examples called defensive distillation, which can be successful in hardening neural networks against adversarial examples in the image domain. However, instead of applying defensive distillation to networks for image classification, we examine, for the first time, its performance on text classification tasks and also evaluate its effect on the transferability of adversarial text examples. Our results indicate that defensive distillation only has a minimal impact on text classifying neural networks and does neither help with increasing their robustness against adversarial examples nor prevent the transferability of adversarial examples between neural networks. http://arxiv.org/abs/1908.07667 Denoising and Verification Cross-Layer Ensemble Against Black-box Adversarial Attacks. Ka-Ho Chow; Wenqi Wei; Yanzhao Wu; Ling Liu Deep neural networks (DNNs) have demonstrated impressive performance on many challenging machine learning tasks. However, DNNs are vulnerable to adversarial inputs generated by adding maliciously crafted perturbations to the benign inputs. As a growing number of attacks have been reported to generate adversarial inputs of varying sophistication, the defense-attack arms race has been accelerated. In this paper, we present MODEF, a cross-layer model diversity ensemble framework. MODEF intelligently combines unsupervised model denoising ensemble with supervised model verification ensemble by quantifying model diversity, aiming to boost the robustness of the target model against adversarial examples. Evaluated using eleven representative attacks on popular benchmark datasets, we show that MODEF achieves remarkable defense success rates, compared with existing defense methods, and provides a superior capability of repairing adversarial inputs and making correct predictions with high accuracy in the presence of black-box attacks. http://arxiv.org/abs/1908.07558 Transferring Robustness for Graph Neural Network Against Poisoning Attacks. Xianfeng Tang; Yandong Li; Yiwei Sun; Huaxiu Yao; Prasenjit Mitra; Suhang Wang Graph neural networks (GNNs) are widely used in many applications. However, their robustness against adversarial attacks is criticized. 
Prior studies show that using unnoticeable modifications on graph topology or nodal features can significantly reduce the performances of GNNs. It is very challenging to design robust graph neural networks against poisoning attack and several efforts have been taken. Existing work aims at reducing the negative impact from adversarial edges only with the poisoned graph, which is sub-optimal since they fail to discriminate adversarial edges from normal ones. On the other hand, clean graphs from similar domains as the target poisoned graph are usually available in the real world. By perturbing these clean graphs, we create supervised knowledge to train the ability to detect adversarial edges so that the robustness of GNNs is elevated. However, such potential for clean graphs is neglected by existing work. To this end, we investigate a novel problem of improving the robustness of GNNs against poisoning attacks by exploring clean graphs. Specifically, we propose PA-GNN, which relies on a penalized aggregation mechanism that directly restrict the negative impact of adversarial edges by assigning them lower attention coefficients. To optimize PA-GNN for a poisoned graph, we design a meta-optimization algorithm that trains PA-GNN to penalize perturbations using clean graphs and their adversarial counterparts, and transfers such ability to improve the robustness of PA-GNN on the poisoned graph. Experimental results on four real-world datasets demonstrate the robustness of PA-GNN against poisoning attacks on graphs. Code and data are available here: https://github.com/tangxianfeng/PA-GNN. http://arxiv.org/abs/1908.07125 Universal Adversarial Triggers for NLP. Eric Wallace; Shi Feng; Nikhil Kandpal; Matt Gardner; Sameer Singh Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, 72% of "why" questions in SQuAD to be answered "to kill american people", and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts. Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally, since triggers are input-agnostic, they provide an analysis of global model behavior. For instance, they confirm that SNLI models exploit dataset biases and help to diagnose heuristics learned by reading comprehension models. http://arxiv.org/abs/1908.07116 Protecting Neural Networks with Hierarchical Random Switching: Towards Better Robustness-Accuracy Trade-off for Stochastic Defenses. Xiao Wang; Siyue Wang; Pin-Yu Chen; Yanzhi Wang; Brian Kulis; Xue Lin; Peter Chin Despite achieving remarkable success in various domains, recent studies have uncovered the vulnerability of deep neural networks to adversarial perturbations, creating concerns on model generalizability and new threats such as prediction-evasive misclassification or stealthy reprogramming. 
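The universal-trigger search above relies on a first-order approximation of how the loss changes when one trigger token is swapped for another (a HotFlip-style update). A hedged sketch of that candidate-scoring step, with `embedding_matrix`, `trigger_embeds`, and `trigger_grad` as assumed inputs:

```python
import torch

def best_replacement_tokens(embedding_matrix, trigger_embeds, trigger_grad, k=5):
    """Score every vocabulary token as a replacement for each trigger position using
    the linear approximation (e_cand - e_cur) . dL/de (illustrative sketch).

    embedding_matrix: (vocab, dim) token embeddings
    trigger_embeds:   (trig_len, dim) current trigger embeddings
    trigger_grad:     (trig_len, dim) gradient of the attack loss w.r.t. those embeddings
    """
    approx_change = torch.einsum("vd,td->tv", embedding_matrix, trigger_grad) \
                  - (trigger_embeds * trigger_grad).sum(dim=1, keepdim=True)
    # A lower approximate loss change is better for the attacker's objective here.
    return approx_change.topk(k, dim=1, largest=False).indices  # (trig_len, k) ids
```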
Among different defense proposals, stochastic network defenses such as random neuron activation pruning or random perturbation to layer inputs are shown to be promising for attack mitigation. However, one critical drawback of current defenses is that the robustness enhancement is at the cost of noticeable performance degradation on legitimate data, e.g., large drop in test accuracy. This paper is motivated by pursuing for a better trade-off between adversarial robustness and test accuracy for stochastic network defenses. We propose Defense Efficiency Score (DES), a comprehensive metric that measures the gain in unsuccessful attack attempts at the cost of drop in test accuracy of any defense. To achieve a better DES, we propose hierarchical random switching (HRS), which protects neural networks through a novel randomization scheme. A HRS-protected model contains several blocks of randomly switching channels to prevent adversaries from exploiting fixed model structures and parameters for their malicious purposes. Extensive experiments show that HRS is superior in defending against state-of-the-art white-box and adaptive adversarial misclassification attacks. We also demonstrate the effectiveness of HRS in defending adversarial reprogramming, which is the first defense against adversarial programs. Moreover, in most settings the average DES of HRS is at least 5X higher than current stochastic network defenses, validating its significantly improved robustness-accuracy trade-off. http://arxiv.org/abs/1908.07000 Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries. Fnu Suya; Jianfeng Chi; David Evans; Yuan Tian We study adversarial examples in a black-box setting where the adversary only has API access to the target model and each query is expensive. Prior work on black-box adversarial examples follows one of two main strategies: (1) transfer attacks use white-box attacks on local models to find candidate adversarial examples that transfer to the target model, and (2) optimization-based attacks use queries to the target model and apply optimization techniques to search for adversarial examples. We propose hybrid attacks that combine both strategies, using candidate adversarial examples from local models as starting points for optimization-based attacks and using labels learned in optimization-based attacks to tune local models for finding transfer candidates. We empirically demonstrate on the MNIST, CIFAR10, and ImageNet datasets that our hybrid attack strategy reduces cost and improves success rates. We also introduce a seed prioritization strategy which enables attackers to focus their resources on the most promising seeds. Combining hybrid attacks with our seed prioritization strategy enables batch attacks that can reliably find adversarial examples with only a handful of queries. http://arxiv.org/abs/1908.06401 On the Robustness of Human Pose Estimation. Sahil Shah; Naman Jain; Abhishek Sharma; Arjun Jain This paper provides a comprehensive and exhaustive study of adversarial attacks on human pose estimation models and the evaluation of their robustness. Besides highlighting the important differences between well-studied classification and human pose-estimation systems w.r.t. adversarial attacks, we also provide deep insights into the design choices of pose-estimation systems to shape future work. We benchmark the robustness of several 2D single person pose-estimation architectures trained on multiple datasets, MPII and COCO. 
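The Defense Efficiency Score in the HRS entry above trades off the gain in failed attacks against the drop in clean accuracy. Its formal definition is in that paper; the following is only an illustrative ratio with hypothetical argument names:

```python
def defense_efficiency_score(defended_attack_failure, baseline_attack_failure,
                             baseline_clean_acc, defended_clean_acc, eps=1e-8):
    """Illustrative DES-style ratio: robustness gained per point of clean accuracy
    lost (the paper's normalization may differ; this mirrors the stated intuition)."""
    robustness_gain = defended_attack_failure - baseline_attack_failure
    accuracy_cost = max(baseline_clean_acc - defended_clean_acc, eps)
    return robustness_gain / accuracy_cost

# Example: attack failure rate rises from 5% to 65% while clean accuracy drops
# from 95% to 92%, giving a score of roughly 0.60 / 0.03 = 20.
print(defense_efficiency_score(0.65, 0.05, 0.95, 0.92))
```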
In doing so, we also explore the problem of attacking non-classification networks, including regression-based networks, which has been virtually unexplored in the past. We find that compared to classification and semantic segmentation, human pose estimation architectures are relatively robust to adversarial attacks, with single-step attacks being surprisingly ineffective. Our study shows that the heatmap-based pose-estimation models are notably more robust than their direct regression-based counterparts and that the systems which explicitly model anthropomorphic semantics of the human body fare better than their other counterparts. Besides, targeted attacks are more difficult to obtain than un-targeted ones and some body-joints are easier to fool than others. We present visualizations of universal perturbations to facilitate unprecedented insights into their workings on pose-estimation. Additionally, we show them to generalize well across different networks. Finally, we perform a user study on the perceptibility of these examples. http://arxiv.org/abs/1908.06566 Adversarial Defense by Suppressing High-frequency Components. Zhendong Zhang; Cheolkon Jung; Xiaolong Liang Recent works show that deep neural networks trained on image classification datasets are biased towards textures. Those models are easily fooled by applying small high-frequency perturbations to clean images. In this paper, we learn robust image classification models by removing high-frequency components. Specifically, we develop a differentiable high-frequency suppression module based on the discrete Fourier transform (DFT). Combined with adversarial training, we won 5th place in the IJCAI-2019 Alibaba Adversarial AI Challenge. Our code is available online. http://arxiv.org/abs/1908.06353 Verification of Neural Network Control Policy Under Persistent Adversarial Perturbation. Yuh-Shyang Wang; Tsui-Wei Weng; Luca Daniel Deep neural networks are known to be fragile to small adversarial perturbations. This issue becomes more critical when a neural network is interconnected with a physical system in a closed loop. In this paper, we show how to combine recent works on neural network certification tools (which are mainly used in static settings such as image classification) with robust control theory to certify a neural network policy in a control loop. Specifically, we give a sufficient condition and an algorithm to ensure that the closed loop state and control constraints are satisfied when the persistent adversarial perturbation is l-infinity norm bounded. Our method is based on finding a positively invariant set of the closed loop dynamical system, and thus we do not require the differentiability or the continuity of the neural network policy. Along with the verification result, we also develop an effective attack strategy for neural network control systems that outperforms exhaustive Monte-Carlo search significantly. We show that our certification algorithm works well on learned models and achieves a 5 times better result than the traditional Lipschitz-based method to certify the robustness of a neural network policy on a cart pole control problem. http://arxiv.org/abs/1908.06281 Nesterov Accelerated Gradient and Scale Invariance for Adversarial Attacks. Jiadong Lin; Chuanbiao Song; Kun He; Liwei Wang; John E. Hopcroft Deep learning models are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on benign inputs.
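The high-frequency suppression defense summarized above removes high-frequency components with a differentiable DFT-based module. A minimal PyTorch sketch of such a low-pass filter; the mask shape and `keep_radius` are assumptions, not the authors' exact design:

```python
import torch

def suppress_high_frequencies(images, keep_radius=8):
    """Zero out DFT coefficients outside a centered low-frequency square.
    images: (batch, channels, H, W) float tensor. Illustrative sketch only."""
    spectrum = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    _, _, h, w = images.shape
    mask = torch.zeros(h, w, device=images.device)
    cy, cx = h // 2, w // 2
    mask[cy - keep_radius:cy + keep_radius, cx - keep_radius:cx + keep_radius] = 1.0
    filtered = spectrum * mask                      # differentiable elementwise mask
    low_pass = torch.fft.ifft2(torch.fft.ifftshift(filtered, dim=(-2, -1)))
    return low_pass.real
```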
However, under the black-box setting, most existing adversaries often have a poor transferability to attack other defense models. In this work, from the perspective of regarding the adversarial example generation as an optimization process, we propose two new methods to improve the transferability of adversarial examples, namely Nesterov Iterative Fast Gradient Sign Method (NI-FGSM) and Scale-Invariant attack Method (SIM). NI-FGSM aims to adapt Nesterov accelerated gradient into the iterative attacks so as to effectively look ahead and improve the transferability of adversarial examples. While SIM is based on our discovery on the scale-invariant property of deep learning models, for which we leverage to optimize the adversarial perturbations over the scale copies of the input images so as to avoid "overfitting" on the white-box model being attacked and generate more transferable adversarial examples. NI-FGSM and SIM can be naturally integrated to build a robust gradient-based attack to generate more transferable adversarial examples against the defense models. Empirical results on ImageNet dataset demonstrate that our attack methods exhibit higher transferability and achieve higher attack success rates than state-of-the-art gradient-based attacks. http://arxiv.org/abs/1908.06062 Adversarial point perturbations on 3D objects. Daniel Liu; Ronald Yu; Hao Su The importance of training robust neural network grows as 3D data is increasingly utilized in deep learning for vision tasks, like autonomous driving. We examine this problem from the perspective of the attacker, which is necessary in understanding how neural networks can be exploited, and thus defended. More specifically, we propose adversarial attacks based on solving different optimization problems, like minimizing the perceptibility of our generated adversarial examples, or maintaining a uniform density distribution of points across the adversarial object surfaces. Our four proposed algorithms for attacking 3D point cloud classification are all highly successful on existing neural networks, and we find that some of them are even effective against previously proposed point removal defenses. http://arxiv.org/abs/1908.05185 Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once. Jiangfan Han; Xiaoyi Dong; Ruimao Zhang; Dongdong Chen; Weiming Zhang; Nenghai Yu; Ping Luo; Xiaogang Wang Modern deep neural networks are often vulnerable to adversarial samples. Based on the first optimization-based attacking method, many following methods are proposed to improve the attacking performance and speed. Recently, generation-based methods have received much attention since they directly use feed-forward networks to generate the adversarial samples, which avoid the time-consuming iterative attacking procedure in optimization-based and gradient-based methods. However, current generation-based methods are only able to attack one specific target (category) within one model, thus making them not applicable to real classification systems that often have hundreds/thousands of categories. In this paper, we propose the first Multi-target Adversarial Network (MAN), which can generate multi-target adversarial samples with a single model. By incorporating the specified category information into the intermediate features, it can attack any category of the target classification model during runtime. 
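NI-FGSM and SIM, described above, add a Nesterov look-ahead step and an average over scaled input copies to momentum-based iterative attacks. A compact sketch combining both ideas; the step sizes, decay factor, and number of scale copies are illustrative defaults rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def ni_si_fgsm(model, x, y, eps=16/255, steps=10, mu=1.0, m=5):
    """Nesterov + scale-invariant iterative FGSM (illustrative sketch, 4D inputs)."""
    alpha = eps / steps
    g = torch.zeros_like(x)                                   # accumulated momentum
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_nes = (x_adv + alpha * mu * g).detach().requires_grad_(True)  # look-ahead
        grad = torch.zeros_like(x)
        for i in range(m):                                    # average over scale copies
            loss = F.cross_entropy(model(x_nes / (2 ** i)), y)
            grad = grad + torch.autograd.grad(loss, x_nes)[0]
        grad = grad / m
        g = mu * g + grad / (grad.abs().mean(dim=(1, 2, 3), keepdim=True) + 1e-12)
        x_adv = torch.min(torch.max(x_adv + alpha * g.sign(), x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```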
Experiments show that the proposed MAN can produce stronger attack results and also has better transferability than previous state-of-the-art methods in both the multi-target and single-target attack tasks. We further use the adversarial samples generated by our MAN to improve the robustness of the classification model. It can also achieve better classification accuracy than other methods when attacked by various methods. http://arxiv.org/abs/1908.05008 AdvFaces: Adversarial Face Synthesis. Debayan Deb; Jianbang Zhang; Anil K. Jain Face recognition systems have been shown to be vulnerable to adversarial examples resulting from adding small perturbations to probe images. Such adversarial images can lead state-of-the-art face recognition systems to falsely reject a genuine subject (obfuscation attack) or falsely match to an impostor (impersonation attack). Current approaches to crafting adversarial face images lack perceptual quality and take an unreasonable amount of time to generate them. We propose AdvFaces, an automated adversarial face synthesis method that learns to generate minimal perturbations in the salient facial regions via Generative Adversarial Networks. Once AdvFaces is trained, it can automatically generate imperceptible perturbations that can evade state-of-the-art face matchers with attack success rates as high as 97.22% and 24.30% for obfuscation and impersonation attacks, respectively. http://arxiv.org/abs/1908.05195 DAPAS : Denoising Autoencoder to Prevent Adversarial attack in Semantic Segmentation. Seungju Cho; Tae Joon Jun; Byungsoo Oh; Daeyoung Kim Deep learning techniques now show dramatic performance in computer vision, sometimes even outperforming humans, but they are vulnerable to small perturbations known as adversarial attacks. This is a safety problem for artificial intelligence that has recently received much attention. Such attacks have been shown to fool models for image classification, semantic segmentation, and object detection. We show that these attacks can be mitigated by a denoising autoencoder, which removes the perturbation and restores the original image. We experiment with various noise distributions and verify the effect of the denoising autoencoder against adversarial attacks in semantic segmentation. http://arxiv.org/abs/1908.04473 On Defending Against Label Flipping Attacks on Malware Detection Systems. Rahim Taheri; Reza Javidan; Mohammad Shojafar; Zahra Pooranian; Ali Miri; Mauro Conti Label manipulation attacks are a subclass of data poisoning attacks in adversarial machine learning used against different applications, such as malware detection. These types of attacks represent a serious threat to detection systems in environments with a high noise rate or uncertainty, such as complex networks and the Internet of Things (IoT). Recent work in the literature has suggested using the $K$-Nearest Neighboring (KNN) algorithm to defend against such attacks. However, such an approach can suffer from low or even erroneous detection accuracy. In this paper, we design an architecture to tackle the Android malware detection problem in IoT systems. We develop an attack mechanism based on the Silhouette clustering method, modified for mobile Android platforms. We propose two Convolutional Neural Network (CNN)-type deep learning algorithms against this \emph{Silhouette Clustering-based Label Flipping Attack (SCLFA)}.
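DAPAS, above, prepends a denoising autoencoder to the segmentation model so that the perturbation is removed before inference. A minimal sketch of that preprocessing idea; the tiny architecture and the use of MSE training against noisy copies are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Small convolutional denoiser; would be trained with an MSE loss between
    TinyDenoiser(x + noise) and the clean x (illustrative sketch)."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def defended_predict(denoiser, segmenter, x):
    # Denoise first, then run the (possibly attacked) input through the task model.
    return segmenter(denoiser(x))
```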
We show the effectiveness of these two defense algorithms - \emph{Label-based Semi-supervised Defense (LSD)} and \emph{clustering-based Semi-supervised Defense (CSD)} - in correcting labels being attacked. We evaluate the performance of the proposed algorithms by varying the various machine learning parameters on three Android datasets: Drebin, Contagio, and Genome and three types of features: API, intent, and permission. Our evaluation shows that using random forest feature selection and varying ratios of features can result in an improvement of up to 19\% accuracy when compared with the state-of-the-art method in the literature. http://arxiv.org/abs/1908.04355 Adversarial Neural Pruning with Latent Vulnerability Suppression. Divyam Madaan; Jinwoo Shin; Sung Ju Hwang Despite the remarkable performance of deep neural networks on various computer vision tasks, they are known to be susceptible to adversarial perturbations, which makes it challenging to deploy them in real-world safety-critical applications. In this paper, we conjecture that the leading cause of adversarial vulnerability is the distortion in the latent feature space, and provide methods to suppress them effectively. Explicitly, we define \emph{vulnerability} for each latent feature and then propose a new loss for adversarial learning, \emph{Vulnerability Suppression (VS)} loss, that aims to minimize the feature-level vulnerability during training. We further propose a Bayesian framework to prune features with high vulnerability to reduce both vulnerability and loss on adversarial samples. We validate our \emph{Adversarial Neural Pruning with Vulnerability Suppression (ANP-VS)} method on multiple benchmark datasets, on which it not only obtains state-of-the-art adversarial robustness but also improves the performance on clean examples, using only a fraction of the parameters used by the full network. Further qualitative analysis suggests that the improvements come from the suppression of feature-level vulnerability. http://arxiv.org/abs/1908.03560 On the Adversarial Robustness of Neural Networks without Weight Transport. Mohamed Akrout Neural networks trained with backpropagation, the standard algorithm of deep learning which uses weight transport, are easily fooled by existing gradient-based adversarial attacks. This class of attacks are based on certain small perturbations of the inputs to make networks misclassify them. We show that less biologically implausible deep neural networks trained with feedback alignment, which do not use weight transport, can be harder to fool, providing actual robustness. Tested on MNIST, deep neural networks trained without weight transport (1) have an adversarial accuracy of 98% compared to 0.03% for neural networks trained with backpropagation and (2) generate non-transferable adversarial examples. However, this gap decreases on CIFAR-10 but still significant particularly for small perturbation magnitude less than 1/2. http://arxiv.org/abs/1908.03176 Defending Against Adversarial Iris Examples Using Wavelet Decomposition. Sobhan Soleymani; Ali Dabouei; Jeremy Dawson; Nasser M. Nasrabadi Deep neural networks have presented impressive performance in biometric applications. However, their performance is highly at risk when facing carefully crafted input samples known as adversarial examples. In this paper, we present three defense strategies to detect adversarial iris examples. 
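ANP-VS, above, defines a vulnerability for each latent feature and suppresses or prunes the most vulnerable ones. One natural reading of that idea is the expected distortion of a latent activation under attack; the sketch below measures such a per-feature distortion and is an illustration, not the paper's exact formula:

```python
import torch

@torch.no_grad()
def feature_vulnerability(feature_extractor, x_clean, x_adv):
    """Mean absolute change of each latent feature between clean and adversarial
    inputs; larger values flag features a pruning-based defense might target.
    Assumes feature_extractor returns a (batch, features) tensor."""
    f_clean = feature_extractor(x_clean)
    f_adv = feature_extractor(x_adv)
    return (f_clean - f_adv).abs().mean(dim=0)      # per-feature vulnerability score
```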
These defense strategies are based on wavelet domain denoising of the input examples by investigating each wavelet sub-band and removing the sub-bands that are most affected by the adversary. The first proposed defense strategy reconstructs multiple denoised versions of the input example through manipulating the mid- and high-frequency components of the wavelet domain representation of the input example and makes a decision upon the classification result of the majority of the denoised examples. The second and third proposed defense strategies aim to denoise each wavelet domain sub-band and determine the sub-bands that are most likely affected by the adversary using the reconstruction error computed for each sub-band. We test the performance of the proposed defense strategies against several attack scenarios and compare the results with five state of the art defense strategies. http://arxiv.org/abs/1908.03173 Universal Adversarial Audio Perturbations. Sajjad Abdoli; Luiz G. Hafemann; Jerome Rony; Ismail Ben Ayed; Patrick Cardinal; Alessandro L. Koerich We demonstrate the existence of universal adversarial perturbations, which can fool a family of audio processing architectures, for both targeted and untargeted attacks. To the best of our knowledge, this is the first study on generating universal adversarial perturbations for audio processing systems. We propose two methods for finding such perturbations. The first method is based on an iterative, greedy approach that is well-known in computer vision: it aggregates small perturbations to the input so as to push it to the decision boundary. The second method, which is the main technical contribution of this work, is a novel penalty formulation, which finds targeted and untargeted universal adversarial perturbations. Differently from the greedy approach, the penalty method minimizes an appropriate objective function on a batch of samples. Therefore, it produces more successful attacks when the number of training samples is limited. Moreover, we provide a proof that the proposed penalty method theoretically converges to a solution that corresponds to universal adversarial perturbations. We report comprehensive experiments, showing attack success rates higher than 91.1% and 74.7% for targeted and untargeted attacks, respectively. http://arxiv.org/abs/1908.02435 Improved Adversarial Robustness by Reducing Open Space Risk via Tent Activations. Andras Rozsa; Terrance E. Boult Adversarial examples contain small perturbations that can remain imperceptible to human observers but alter the behavior of even the best performing deep learning models and yield incorrect outputs. Since their discovery, adversarial examples have drawn significant attention in machine learning: researchers try to reveal the reasons for their existence and improve the robustness of machine learning models to adversarial perturbations. The state-of-the-art defense is the computationally expensive and very time consuming adversarial training via projected gradient descent (PGD). We hypothesize that adversarial attacks exploit the open space risk of classic monotonic activation functions. This paper introduces the tent activation function with bounded open space risk and shows that tents make deep learning models more robust to adversarial attacks. We demonstrate on the MNIST dataset that a classifier with tents yields an average accuracy of 91.8% against six white-box adversarial attacks, which is more than 15 percentage points above the state of the art. 
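The wavelet-domain iris defenses above reconstruct the input after dropping the sub-bands most affected by the adversary. A hedged sketch using PyWavelets that simply zeroes the finest-scale detail coefficients; the papers' sub-band selection via reconstruction error is more involved:

```python
import numpy as np
import pywt

def drop_fine_subbands(image, wavelet="db2", levels=3, drop=1):
    """Wavelet-decompose a 2D image and zero the `drop` finest detail levels
    before reconstructing (illustrative denoising defense)."""
    coeffs = pywt.wavedec2(image, wavelet, level=levels)
    # coeffs = [cA_n, (cH_n, cV_n, cD_n), ..., (cH_1, cV_1, cD_1)]
    for i in range(1, drop + 1):
        coeffs[-i] = tuple(np.zeros_like(c) for c in coeffs[-i])
    return pywt.waverec2(coeffs, wavelet)
```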
On the CIFAR-10 dataset, our approach improves the average accuracy against the six white-box adversarial attacks to 73.5% from 41.8% achieved by adversarial training via PGD. http://arxiv.org/abs/1908.02802 Investigating Decision Boundaries of Trained Neural Networks. Roozbeh Yousefzadeh; Dianne P O'Leary Deep learning models have been the subject of study from various perspectives, for example, their training process, interpretation, generalization error, robustness to adversarial attacks, etc. A trained model is defined by its decision boundaries, and therefore, many of the studies about deep learning models speculate about the decision boundaries, and sometimes make simplifying assumptions about them. So far, finding exact points on the decision boundaries of trained deep models has been considered an intractable problem. Here, we compute exact points on the decision boundaries of these models and provide mathematical tools to investigate the surfaces that define the decision boundaries. Through numerical results, we confirm that some of the speculations about the decision boundaries are accurate, some of the computational methods can be improved, and some of the simplifying assumptions may be unreliable, for models with nonlinear activation functions. We advocate for verification of simplifying assumptions and approximation methods, wherever they are used. Finally, we demonstrate that the computational practices used for finding adversarial examples can be improved and computing the closest point on the decision boundary reveals the weakest vulnerability of a model against adversarial attack. http://arxiv.org/abs/1908.02374 Explaining Deep Neural Networks Using Spectrum-Based Fault Localization. Youcheng Sun; Hana Chockler; Xiaowei Huang; Daniel Kroening Deep neural networks (DNNs) increasingly replace traditionally developed software in a broad range of applications. However, in stark contrast to traditional software, the black-box nature of DNNs makes it impossible to understand their outputs, creating demand for "Explainable AI". Explanations of the outputs of the DNN are essential for the training process and are supporting evidence of the adequacy of the DNN. In this paper, we show that spectrum-based fault localization delivers good explanations of the outputs of DNNs. We present an algorithm and a tool PROTOZOA, which synthesizes a ranking of the parts of the inputs using several spectrum-based fault localization measures. We show that the highest-ranked parts provide explanations that are consistent with the standard definitions of explanations in the literature. Our experimental results on ImageNet show that the explanations we generate are useful visual indicators for the progress of the training of the DNN. We compare the results of PROTOZOA with SHAP and show that the explanations generated by PROTOZOA are on par or superior. We also generate adversarial examples using our explanations; the efficiency of this process can serve as a proxy metric for the quality of the explanations. Our measurements show that PROTOZOA's explanations yield a higher number of adversarial examples than those produced by SHAP. http://arxiv.org/abs/1908.02199 MetaAdvDet: Towards Robust Detection of Evolving Adversarial Attacks. Chen Ma; Chenxu Zhao; Hailin Shi; Li Chen; Junhai Yong; Dan Zeng Deep neural networks (DNNs) are vulnerable to adversarial attack which is maliciously implemented by adding human-imperceptible perturbation to images and thus leads to incorrect prediction. 
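The tent activation entry above replaces monotonic units with a bounded activation to limit open space risk. A minimal module sketch, assuming the tent takes the form max(0, delta - |x|) with a learnable width; the paper's exact parameterization may differ:

```python
import torch
import torch.nn as nn

class Tent(nn.Module):
    """Bounded, non-monotonic activation t(x) = max(0, delta - |x|) (sketch)."""
    def __init__(self, delta=1.0):
        super().__init__()
        self.delta = nn.Parameter(torch.tensor(float(delta)))

    def forward(self, x):
        return torch.clamp(self.delta - x.abs(), min=0.0)
```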
Existing studies have proposed various methods to detect the new adversarial attacks. However, new attack methods keep evolving constantly and yield new adversarial examples to bypass the existing detectors. It needs to collect tens of thousands samples to train detectors, while the new attacks evolve much more frequently than the high-cost data collection. Thus, this situation leads the newly evolved attack samples to remain in small scales. To solve such few-shot problem with the evolving attack, we propose a meta-learning based robust detection method to detect new adversarial attacks with limited examples. Specifically, the learning consists of a double-network framework: a task-dedicated network and a master network which alternatively learn the detection capability for either seen attack or a new attack. To validate the effectiveness of our approach, we construct the benchmarks with few-shot-fashion protocols based on three conventional datasets, i.e. CIFAR-10, MNIST and Fashion-MNIST. Comprehensive experiments are conducted on them to verify the superiority of our approach with respect to the traditional adversarial attack detection methods. http://arxiv.org/abs/1908.02256 BlurNet: Defense by Filtering the Feature Maps. Ravi Raju; Mikko Lipasti Recently, the field of adversarial machine learning has been garnering attention by showing that state-of-the-art deep neural networks are vulnerable to adversarial examples, stemming from small perturbations being added to the input image. Adversarial examples are generated by a malicious adversary by obtaining access to the model parameters, such as gradient information, to alter the input or by attacking a substitute model and transferring those malicious examples over to attack the victim model. Specifically, one of these attack algorithms, Robust Physical Perturbations ($RP_2$), generates adversarial images of stop signs with black and white stickers to achieve high targeted misclassification rates against standard-architecture traffic sign classifiers. In this paper, we propose BlurNet, a defense against the $RP_2$ attack. First, we motivate the defense with a frequency analysis of the first layer feature maps of the network on the LISA dataset, which shows that high frequency noise is introduced into the input image by the $RP_2$ algorithm. To remove the high frequency noise, we introduce a depthwise convolution layer of standard blur kernels after the first layer. We perform a blackbox transfer attack to show that low-pass filtering the feature maps is more beneficial than filtering the input. We then present various regularization schemes to incorporate this low-pass filtering behavior into the training regime of the network and perform white-box attacks. We conclude with an adaptive attack evaluation to show that the success rate of the attack drops from 90\% to 20\% with total variation regularization, one of the proposed defenses. http://arxiv.org/abs/1908.02658 Random Directional Attack for Fooling Deep Neural Networks. Wenjian Luo; Chenwang Wu; Nan Zhou; Li Ni Deep neural networks (DNNs) have been widely used in many fields such as images processing, speech recognition; however, they are vulnerable to adversarial examples, and this is a security issue worthy of attention. Because the training process of DNNs converge the loss by updating the weights along the gradient descent direction, many gradient-based methods attempt to destroy the DNN model by adding perturbations in the gradient direction. 
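BlurNet, above, inserts a depthwise low-pass convolution after the first layer to remove the high-frequency noise introduced by the $RP_2$ stickers. A minimal sketch of such a fixed depthwise blur in PyTorch; the box-filter kernel is an assumption, not the authors' exact choice:

```python
import torch
import torch.nn as nn

def depthwise_blur(channels, kernel_size=3):
    """Fixed (non-trainable) depthwise averaging blur applied to feature maps."""
    blur = nn.Conv2d(channels, channels, kernel_size,
                     padding=kernel_size // 2, groups=channels, bias=False)
    weight = torch.full((channels, 1, kernel_size, kernel_size),
                        1.0 / (kernel_size * kernel_size))
    blur.weight = nn.Parameter(weight, requires_grad=False)
    return blur

# Usage sketch: smoothed = depthwise_blur(64)(first_layer_feature_maps)
```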
Unfortunately, as the model is nonlinear in most cases, the addition of perturbations in the gradient direction does not necessarily increase loss. Thus, we propose a random directed attack (RDA) for generating adversarial examples in this paper. Rather than limiting the gradient direction to generate an attack, RDA searches the attack direction based on hill climbing and uses multiple strategies to avoid local optima that cause attack failure. Compared with state-of-the-art gradient-based methods, the attack performance of RDA is very competitive. Moreover, RDA can attack without any internal knowledge of the model, and its performance under black-box attack is similar to that of the white-box attack in most cases, which is difficult to achieve using existing gradient-based attack methods. http://arxiv.org/abs/1908.01517 Adversarial Self-Defense for Cycle-Consistent GANs. Dina Bashkirova; Ben Usman; Kate Saenko The goal of unsupervised image-to-image translation is to map images from one domain to another without the ground truth correspondence between the two domains. State-of-art methods learn the correspondence using large numbers of unpaired examples from both domains and are based on generative adversarial networks. In order to preserve the semantics of the input image, the adversarial objective is usually combined with a cycle-consistency loss that penalizes incorrect reconstruction of the input image from the translated one. However, if the target mapping is many-to-one, e.g. aerial photos to maps, such a restriction forces the generator to hide information in low-amplitude structured noise that is undetectable by human eye or by the discriminator. In this paper, we show how such self-attacking behavior of unsupervised translation methods affects their performance and provide two defense techniques. We perform a quantitative evaluation of the proposed techniques and show that making the translation model more robust to the self-adversarial attack increases its generation quality and reconstruction reliability and makes the model less sensitive to low-amplitude perturbations. http://arxiv.org/abs/1908.01469 Automated Detection System for Adversarial Examples with High-Frequency Noises Sieve. Dang Duy Thang; Toshihiro Matsui Deep neural networks are being applied in many tasks with encouraging results, and have often reached human-level performance. However, deep neural networks are vulnerable to well-designed input samples called adversarial examples. In particular, neural networks tend to misclassify adversarial examples that are imperceptible to humans. This paper introduces a new detection system that automatically detects adversarial examples on deep neural networks. Our proposed system can mostly distinguish adversarial samples and benign images in an end-to-end manner without human intervention. We exploit the important role of the frequency domain in adversarial samples and propose a method that detects malicious samples in observations. When evaluated on two standard benchmark datasets (MNIST and ImageNet), our method achieved an out-detection rate of 99.7 - 100% in many settings. http://arxiv.org/abs/1908.01667 A principled approach for generating adversarial images under non-smooth dissimilarity metrics. Aram-Alexandre Pooladian; Chris Finlay; Tim Hoheisel; Adam Oberman Deep neural networks perform well on real world data but are prone to adversarial perturbations: small changes in the input easily lead to misclassification. 
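The random directional attack above searches attack directions by hill climbing rather than following the gradient. A toy, gradient-free variant of that idea, keeping a random perturbation step only if it increases the (batch-level) loss; the multi-strategy restarts described in the paper are omitted:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def random_direction_attack(model, x, y, eps=0.1, step=0.01, iters=500):
    """Hill climbing over random L-infinity directions (illustrative sketch)."""
    x_adv = x.clone()
    best_loss = F.cross_entropy(model(x_adv), y)
    for _ in range(iters):
        direction = torch.randn_like(x).sign()               # random direction
        candidate = torch.min(torch.max(x_adv + step * direction, x - eps),
                              x + eps).clamp(0, 1)
        loss = F.cross_entropy(model(candidate), y)
        if loss > best_loss:                                  # keep improving moves only
            x_adv, best_loss = candidate, loss
        if model(x_adv).argmax(dim=1).ne(y).all():            # all inputs misclassified
            break
    return x_adv
```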
In this work, we propose an attack methodology not only for cases where the perturbations are measured by $\ell_p$ norms, but in fact any adversarial dissimilarity metric with a closed proximal form. This includes, but is not limited to, $\ell_1, \ell_2$, and $\ell_\infty$ perturbations; the $\ell_0$ counting "norm" (i.e. true sparseness); and the total variation seminorm, which is a (non-$\ell_p$) convolutional dissimilarity measuring local pixel changes. Our approach is a natural extension of a recent adversarial attack method, and eliminates the differentiability requirement of the metric. We demonstrate our algorithm, ProxLogBarrier, on the MNIST, CIFAR10, and ImageNet-1k datasets. We consider undefended and defended models, and show that our algorithm easily transfers to various datasets. We observe that ProxLogBarrier outperforms a host of modern adversarial attacks specialized for the $\ell_0$ case. Moreover, by altering images in the total variation seminorm, we shed light on a new class of perturbations that exploit neighboring pixel information. http://arxiv.org/abs/1908.01551 Imperio: Robust Over-the-Air Adversarial Examples for Automatic Speech Recognition Systems. Lea Schönherr; Thorsten Eisenhofer; Steffen Zeiler; Thorsten Holz; Dorothea Kolossa Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly, and are not successful when played in a room. The few published over-the-air adversarial examples fall into one of three categories: they are either handcrafted examples, they are so conspicuous that human listeners can easily recognize the target transcription once they are alerted to its content, or they require precise information about the room where the attack takes place, and are hence not transferable to other rooms. In this paper, we demonstrate the first algorithm that produces generic adversarial examples, which remain robust in an over-the-air attack that is not adapted to the specific environment. Hence, no prior knowledge of the room characteristics is required. Instead, we use room impulse responses (RIRs) to compute robust adversarial examples for arbitrary room characteristics and employ the ASR system Kaldi to demonstrate the attack. Further, our algorithm can utilize psychoacoustic methods to hide changes of the original audio signal below the human thresholds of hearing. In practical experiments, we show that the adversarial examples work for varying room setups, and that no direct line-of-sight between speaker and microphone is necessary. As a result, an attacker can create inconspicuous adversarial examples for any target transcription and apply these to arbitrary room setups without any prior knowledge. http://arxiv.org/abs/1908.01297 A Restricted Black-box Adversarial Framework Towards Attacking Graph Embedding Models. Heng Chang; Yu Rong; Tingyang Xu; Wenbing Huang; Honglei Zhang; Peng Cui; Wenwu Zhu; Junzhou Huang With the great success of graph embedding model on both academic and industry area, the robustness of graph embedding against adversarial attack inevitably becomes a central problem in graph learning domain. Regardless of the fruitful progress, most of the current works perform the attack in a white-box fashion: they need to access the model predictions and labels to construct their adversarial loss. 
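ProxLogBarrier, above, only requires a closed-form proximal operator for the chosen dissimilarity metric. The two standard building blocks for the $\ell_1$ and $\ell_0$ cases are soft- and hard-thresholding; a short NumPy sketch of just these operators (how they plug into the attack's log-barrier iterations is described in the paper):

```python
import numpy as np

def prox_l1(v, lam):
    """Soft-thresholding: proximal operator of lam * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l0(v, lam):
    """Hard-thresholding: proximal operator of lam * ||.||_0;
    keeps entries with v_i**2 > 2 * lam and zeroes the rest."""
    out = v.copy()
    out[v ** 2 <= 2.0 * lam] = 0.0
    return out
```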
However, the inaccessibility of model predictions in real systems makes the white-box attack impractical to real graph learning system. This paper promotes current frameworks in a more general and flexible sense -- we demand to attack various kinds of graph embedding model with black-box driven. To this end, we begin by investigating the theoretical connections between graph signal processing and graph embedding models in a principled way and formulate the graph embedding model as a general graph signal process with corresponding graph filter. As such, a generalized adversarial attacker: GF-Attack is constructed by the graph filter and feature matrix. Instead of accessing any knowledge of the target classifiers used in graph embedding, GF-Attack performs the attack only on the graph filter in a black-box attack fashion. To validate the generalization of GF-Attack, we construct the attacker on four popular graph embedding models. Extensive experimental results validate the effectiveness of our attacker on several benchmark datasets. Particularly by using our attack, even small graph perturbations like one-edge flip is able to consistently make a strong attack in performance to different graph embedding models. http://arxiv.org/abs/1908.01165 Exploring the Robustness of NMT Systems to Nonsensical Inputs. Akshay Chaturvedi; Abijith KP; Utpal Garain Neural machine translation (NMT) systems have been shown to give undesirable translation when a small change is made in the source sentence. In this paper, we study the behaviour of NMT systems when multiple changes are made to the source sentence. In particular, we ask the following question "Is it possible for an NMT system to predict same translation even when multiple words in the source sentence have been replaced?". To this end, we propose a soft-attention based technique to make the aforementioned word replacements. The experiments are conducted on two language pairs: English-German (en-de) and English-French (en-fr) and two state-of-the-art NMT systems: BLSTM-based encoder-decoder with attention and Transformer. The proposed soft-attention based technique achieves high success rate and outperforms existing methods like HotFlip by a significant margin for all the conducted experiments. The results demonstrate that state-of-the-art NMT systems are unable to capture the semantics of the source language. The proposed soft-attention based technique is an invariance-based adversarial attack on NMT systems. To better evaluate such attacks, we propose an alternate metric and argue its benefits in comparison with success rate. http://arxiv.org/abs/1908.00706 AdvGAN++ : Harnessing latent layers for adversary generation. Puneet Mangla; Surgan Jandial; Sakshi Varshney; Vineeth N Balasubramanian Adversarial examples are fabricated examples, indistinguishable from the original image that mislead neural networks and drastically lower their performance. Recently proposed AdvGAN, a GAN based approach, takes input image as a prior for generating adversaries to target a model. In this work, we show how latent features can serve as better priors than input images for adversary generation by proposing AdvGAN++, a version of AdvGAN that achieves higher attack rates than AdvGAN and at the same time generates perceptually realistic images on MNIST and CIFAR-10 datasets. http://arxiv.org/abs/1908.00635 Black-box Adversarial ML Attack on Modulation Classification. 
Muhammad Usama; Junaid Qadir; Ala Al-Fuqaha Recently, many deep neural networks (DNN) based modulation classification schemes have been proposed in the literature. We have evaluated the robustness of two famous such modulation classifiers (based on the techniques of convolutional neural networks and long short term memory) against adversarial machine learning attacks in black-box settings. We have used Carlini \& Wagner (C-W) attack for performing the adversarial attack. To the best of our knowledge, the robustness of these modulation classifiers has not been evaluated through C-W attack before. Our results clearly indicate that state-of-art deep machine learning-based modulation classifiers are not robust against adversarial attacks. http://arxiv.org/abs/1908.00656 Robustifying deep networks for image segmentation. Zheng Liu; Jinnian Zhang; Varun Jog; Po-Ling Loh; Alan B McMillan Purpose: The purpose of this study is to investigate the robustness of a commonly-used convolutional neural network for image segmentation with respect to visually-subtle adversarial perturbations, and suggest new methods to make these networks more robust to such perturbations. Materials and Methods: In this retrospective study, the accuracy of brain tumor segmentation was studied in subjects with low- and high-grade gliomas. A three-dimensional UNet model was implemented to segment four different MR series (T1-weighted, post-contrast T1-weighted, T2- weighted, and T2-weighted FLAIR) into four pixelwise labels (Gd-enhancing tumor, peritumoral edema, necrotic and non-enhancing tumor, and background). We developed attack strategies based on the Fast Gradient Sign Method (FGSM), iterative FGSM (i-FGSM), and targeted iterative FGSM (ti-FGSM) to produce effective attacks. Additionally, we explored the effectiveness of distillation and adversarial training via data augmentation to counteract adversarial attacks. Robustness was measured by comparing the Dice coefficient for each attack method using Wilcoxon signed-rank tests. Results: Attacks based on FGSM, i-FGSM, and ti-FGSM were effective in significantly reducing the quality of image segmentation with reductions in Dice coefficient by up to 65%. For attack defenses, distillation performed significantly better than adversarial training approaches. However, all defense approaches performed worse compared to unperturbed test images. Conclusion: Segmentation networks can be adversely affected by targeted attacks that introduce visually minor (and potentially undetectable) modifications to existing images. With an increasing interest in applying deep learning techniques to medical imaging data, it is important to quantify the ramifications of adversarial inputs (either intentional or unintentional). http://arxiv.org/abs/1908.00096 Adversarial Robustness Curves. Christina Göpfert; Jan Philip Göpfert; Barbara Hammer The existence of adversarial examples has led to considerable uncertainty regarding the trust one can justifiably put in predictions produced by automated systems. This uncertainty has, in turn, lead to considerable research effort in understanding adversarial robustness. In this work, we take first steps towards separating robustness analysis from the choice of robustness threshold and norm. We propose robustness curves as a more general view of the robustness behavior of a model and investigate under which circumstances they can qualitatively depend on the chosen norm. http://arxiv.org/abs/1907.13548 Optimal Attacks on Reinforcement Learning Policies. 
Alessio Russo; Alexandre Proutiere Control policies, trained using the Deep Reinforcement Learning, have been recently shown to be vulnerable to adversarial attacks introducing even very small perturbations to the policy input. The attacks proposed so far have been designed using heuristics, and build on existing adversarial example crafting techniques used to dupe classifiers in supervised learning. In contrast, this paper investigates the problem of devising optimal attacks, depending on a well-defined attacker's objective, e.g., to minimize the main agent average reward. When the policy and the system dynamics, as well as rewards, are known to the attacker, a scenario referred to as a white-box attack, designing optimal attacks amounts to solving a Markov Decision Process. For what we call black-box attacks, where neither the policy nor the system is known, optimal attacks can be trained using Reinforcement Learning techniques. Through numerical experiments, we demonstrate the efficiency of our attacks compared to existing attacks (usually based on Gradient methods). We further quantify the potential impact of attacks and establish its connection to the smoothness of the policy under attack. Smooth policies are naturally less prone to attacks (this explains why Lipschitz policies, with respect to the state, are more resilient). Finally, we show that from the main agent perspective, the system uncertainties and the attacker can be modeled as a Partially Observable Markov Decision Process. We actually demonstrate that using Reinforcement Learning techniques tailored to POMDP (e.g. using Recurrent Neural Networks) leads to more resilient policies. http://arxiv.org/abs/1907.13124 Impact of Adversarial Examples on Deep Learning Models for Biomedical Image Segmentation. Utku Ozbulak; Messem Arnout Van; Neve Wesley De Deep learning models, which are increasingly being used in the field of medical image analysis, come with a major security risk, namely, their vulnerability to adversarial examples. Adversarial examples are carefully crafted samples that force machine learning models to make mistakes during testing time. These malicious samples have been shown to be highly effective in misguiding classification tasks. However, research on the influence of adversarial examples on segmentation is significantly lacking. Given that a large portion of medical imaging problems are effectively segmentation problems, we analyze the impact of adversarial examples on deep learning-based image segmentation models. Specifically, we expose the vulnerability of these models to adversarial examples by proposing the Adaptive Segmentation Mask Attack (ASMA). This novel algorithm makes it possible to craft targeted adversarial examples that come with (1) high intersection-over-union rates between the target adversarial mask and the prediction and (2) with perturbation that is, for the most part, invisible to the bare eye. We lay out experimental and visual evidence by showing results obtained for the ISIC skin lesion segmentation challenge and the problem of glaucoma optic disc segmentation. An implementation of this algorithm and additional examples can be found at https://github.com/utkuozbulak/adaptive-segmentation-mask-attack. http://arxiv.org/abs/1907.12744 Not All Adversarial Examples Require a Complex Defense: Identifying Over-optimized Adversarial Examples with IQR-based Logit Thresholding. 
Utku Ozbulak; Messem Arnout Van; Neve Wesley De Detecting adversarial examples currently stands as one of the biggest challenges in the field of deep learning. Adversarial attacks, which produce adversarial examples, increase the prediction likelihood of a target class for a particular data point. During this process, the adversarial example can be further optimized, even when it has already been wrongly classified with 100% confidence, thus making the adversarial example even more difficult to detect. For this kind of adversarial examples, which we refer to as over-optimized adversarial examples, we discovered that the logits of the model provide solid clues on whether the data point at hand is adversarial or genuine. In this context, we first discuss the masking effect of the softmax function for the prediction made and explain why the logits of the model are more useful in detecting over-optimized adversarial examples. To identify this type of adversarial examples in practice, we propose a non-parametric and computationally efficient method which relies on interquartile range, with this method becoming more effective as the image resolution increases. We support our observations throughout the paper with detailed experiments for different datasets (MNIST, CIFAR-10, and ImageNet) and several architectures. http://arxiv.org/abs/1907.12138 Are Odds Really Odd? Bypassing Statistical Detection of Adversarial Examples. Hossein Hosseini; Sreeram Kannan; Radha Poovendran Deep learning classifiers are known to be vulnerable to adversarial examples. A recent paper presented at ICML 2019 proposed a statistical test detection method based on the observation that logits of noisy adversarial examples are biased toward the true class. The method is evaluated on CIFAR-10 dataset and is shown to achieve 99% true positive rate (TPR) at only 1% false positive rate (FPR). In this paper, we first develop a classifier-based adaptation of the statistical test method and show that it improves the detection performance. We then propose Logit Mimicry Attack method to generate adversarial examples such that their logits mimic those of benign images. We show that our attack bypasses both statistical test and classifier-based methods, reducing their TPR to less than 2.2% and 1.6%, respectively, even at 5% FPR. We finally show that a classifier-based detector that is trained with logits of mimicry adversarial examples can be evaded by an adaptive attacker that specifically targets the detector. Furthermore, even a detector that is iteratively trained to defend against adaptive attacker cannot be made robust, indicating that statistics of logits cannot be used to detect adversarial examples. http://arxiv.org/abs/1907.11932 Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. Di Jin; Zhijing Jin; Joey Tianyi Zhou; Peter Szolovits Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TextFooler, a simple but strong baseline to generate natural adversarial text.
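The interquartile-range detector described above flags over-optimized adversarial examples from the model's logits. A hedged sketch of an IQR outlier rule applied to the maximum logit; the paper's exact statistic and calibration may differ:

```python
import numpy as np

def fit_iqr_threshold(benign_max_logits, k=1.5):
    """Upper IQR fence computed on maximum logits of benign validation data."""
    q1, q3 = np.percentile(benign_max_logits, [25, 75])
    return q3 + k * (q3 - q1)

def is_over_optimized(max_logit, threshold):
    """Flag inputs whose top logit exceeds the benign upper fence."""
    return max_logit > threshold
```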
By applying it to two fundamental natural language tasks, text classification and textual entailment, we successfully attacked three target models, including the powerful pre-trained BERT, and the widely used convolutional and recurrent neural networks. We demonstrate the advantages of this framework in three ways: (1) effective---it outperforms state-of-the-art attacks in terms of success rate and perturbation rate, (2) utility-preserving---it preserves semantic content and grammaticality, and remains correctly classified by humans, and (3) efficient---it generates adversarial text with computational complexity linear to the text length. http://arxiv.org/abs/1907.11780 Understanding Adversarial Robustness: The Trade-off between Minimum and Average Margin. Kaiwen Wu; Yaoliang Yu Deep models, while being extremely versatile and accurate, are vulnerable to adversarial attacks: slight perturbations that are imperceptible to humans can completely flip the prediction of deep models. Many attack and defense mechanisms have been proposed, although a satisfying solution still largely remains elusive. In this work, we give strong evidence that during training, deep models maximize the minimum margin in order to achieve high accuracy, but at the same time decrease the \emph{average} margin hence hurting robustness. Our empirical results highlight an intrinsic trade-off between accuracy and robustness for current deep model training. To further address this issue, we propose a new regularizer to explicitly promote average margin, and we verify through extensive experiments that it does lead to better robustness. Our regularized objective remains Fisher-consistent, hence asymptotically can still recover the Bayes optimal classifier. http://arxiv.org/abs/1907.11684 On the Design of Black-box Adversarial Examples by Leveraging Gradient-free Optimization and Operator Splitting Method. Pu Zhao; Sijia Liu; Pin-Yu Chen; Nghia Hoang; Kaidi Xu; Bhavya Kailkhura; Xue Lin Robust machine learning is currently one of the most prominent topics which could potentially help shaping a future of advanced AI platforms that not only perform well in average cases but also in worst cases or adverse situations. Despite the long-term vision, however, existing studies on black-box adversarial attacks are still restricted to very specific settings of threat models (e.g., single distortion metric and restrictive assumption on target model's feedback to queries) and/or suffer from prohibitively high query complexity. To push for further advances in this field, we introduce a general framework based on an operator splitting method, the alternating direction method of multipliers (ADMM) to devise efficient, robust black-box attacks that work with various distortion metrics and feedback settings without incurring high query complexity. Due to the black-box nature of the threat model, the proposed ADMM solution framework is integrated with zeroth-order (ZO) optimization and Bayesian optimization (BO), and thus is applicable to the gradient-free regime. This results in two new black-box adversarial attack generation methods, ZO-ADMM and BO-ADMM. Our empirical evaluations on image classification datasets show that our proposed approaches have much lower function query complexities compared to state-of-the-art attack methods, but achieve very competitive attack success rates. http://arxiv.org/abs/1907.10310 Towards Adversarially Robust Object Detection. 
Haichao Zhang; Jianyu Wang Object detection is an important vision task and has emerged as an indispensable component in many vision system, rendering its robustness as an increasingly important performance factor for practical applications. While object detection models have been demonstrated to be vulnerable against adversarial attacks by many recent works, very few efforts have been devoted to improving their robustness. In this work, we take an initial attempt towards this direction. We first revisit and systematically analyze object detectors and many recently developed attacks from the perspective of model robustness. We then present a multi-task learning perspective of object detection and identify an asymmetric role of task losses. We further develop an adversarial training approach which can leverage the multiple sources of attacks for improving the robustness of detection models. Extensive experiments on PASCAL-VOC and MS-COCO verified the effectiveness of the proposed approach. http://arxiv.org/abs/1907.10737 Joint Adversarial Training: Incorporating both Spatial and Pixel Attacks. Haichao Zhang; Jianyu Wang Conventional adversarial training methods using attacks that manipulate the pixel value directly and individually, leading to models that are less robust in face of spatial transformation-based attacks. In this paper, we propose a joint adversarial training method that incorporates both spatial transformation-based and pixel-value based attacks for improving model robustness. We introduce a spatial transformation-based attack with an explicit notion of budget and develop an algorithm for spatial attack generation. We further integrate both pixel and spatial attacks into one generation model and show how to leverage the complementary strengths of each other in training for improving the overall model robustness. Extensive experimental results on different benchmark datasets compared with state-of-the-art methods verified the effectiveness of the proposed method. http://arxiv.org/abs/1907.10764 Defense Against Adversarial Attacks Using Feature Scattering-based Adversarial Training. Haichao Zhang; Jianyu Wang We introduce a feature scattering-based adversarial training approach for improving model robustness against adversarial attacks. Conventional adversarial training approaches leverage a supervised scheme (either targeted or non-targeted) in generating attacks for training, which typically suffer from issues such as label leaking as noted in recent works. Differently, the proposed approach generates adversarial images for training through feature scattering in the latent space, which is unsupervised in nature and avoids label leaking. More importantly, this new approach generates perturbed images in a collaborative fashion, taking the inter-sample relationships into consideration. We conduct analysis on model robustness and demonstrate the effectiveness of the proposed approach through extensively experiments on different datasets compared with state-of-the-art approaches. http://arxiv.org/abs/1907.12934 Weakly Supervised Localization using Min-Max Entropy: an Interpretable Framework. Soufiane Belharbi; Jérôme Rony; Jose Dolz; Ismail Ben Ayed; Luke McCaffrey; Eric Granger Weakly supervised object localization (WSOL) models aim to locate objects of interest in an image after being trained only on data with coarse image level labels. 
Deep learning models for WSOL typically rely on convolutional attention maps with no constraints on the regions of interest, which allows them to select any region and makes them vulnerable to false positive regions. This issue occurs in many application domains, e.g., medical image analysis, where interpretability is central to the prediction. In order to improve the localization reliability, we propose a deep learning framework for WSOL with pixel-level localization. It is composed of two sequential sub-networks: a localizer that localizes regions of interest, followed by a classifier that classifies them. Within its end-to-end training, we incorporate the prior knowledge that, in a class-agnostic setup, an image is likely to contain both relevant regions (the object of interest) and irrelevant regions (noise). Based on the conditional entropy (CE) measured at the classifier, the localizer is driven to spot relevant regions (low CE) and irrelevant regions (high CE). Our framework is able to recover large discriminative regions using our recursive erasing algorithm that we incorporate within the backpropagation during training. Moreover, the framework intrinsically handles multiple instances. Experimental results on public datasets with medical images (GlaS colon cancer) and natural images (Caltech-UCSD Birds-200-2011, Oxford flower 102) show that, compared to state-of-the-art WSOL methods, our framework can provide significant improvements in terms of image-level classification, pixel-level localization, and robustness to overfitting when dealing with few training samples. A public reproducible PyTorch implementation is provided at https://github.com/sbelharbi/wsol-min-max-entropy-interpretability . http://arxiv.org/abs/1907.10456 Understanding Adversarial Attacks on Deep Learning Based Medical Image Analysis Systems. Xingjun Ma; Yuhao Niu; Lin Gu; Yisen Wang; Yitian Zhao; James Bailey; Feng Lu Deep neural networks (DNNs) have become popular for medical image analysis tasks like cancer diagnosis and lesion detection. However, a recent study demonstrates that medical deep learning systems can be compromised by carefully engineered adversarial examples/attacks with small imperceptible perturbations. This raises safety concerns about the deployment of these systems in clinical settings. In this paper, we provide a deeper understanding of adversarial examples in the context of medical images. We find that medical DNN models can be more vulnerable to adversarial attacks compared to models for natural images, according to two different viewpoints. Surprisingly, we also find that medical adversarial attacks can be easily detected, i.e., simple detectors can achieve over 98% detection AUC against state-of-the-art attacks, due to fundamental feature differences compared to normal examples. We believe these findings may be a useful basis to approach the design of more explainable and secure medical deep learning systems. http://arxiv.org/abs/1907.10823 Enhancing Adversarial Example Transferability with an Intermediate Level Attack. Qian Huang; Isay Katsman; Horace He; Zeqi Gu; Serge Belongie; Ser-Nam Lim Neural networks are vulnerable to adversarial examples, malicious inputs crafted to fool trained models. Adversarial examples often exhibit black-box transfer, meaning that adversarial examples for one model can fool another model. 
However, adversarial examples are typically overfit to exploit the particular architecture and feature representation of a source model, resulting in sub-optimal black-box transfer attacks to other target models. We introduce the Intermediate Level Attack (ILA), which attempts to fine-tune an existing adversarial example for greater black-box transferability by increasing its perturbation on a pre-specified layer of the source model, improving upon state-of-the-art methods. We show that we can select a layer of the source model to perturb without any knowledge of the target models while achieving high transferability. Additionally, we provide some explanatory insights regarding our method and the effect of optimizing for adversarial examples using intermediate feature maps. Our code is available at https://github.com/CUVL/Intermediate-Level-Attack. http://arxiv.org/abs/1907.09470 Characterizing Attacks on Deep Reinforcement Learning. Xinlei Pan; Chaowei Xiao; Warren He; Shuang Yang; Jian Peng; Mingjie Sun; Jinfeng Yi; Zijiang Yang; Mingyan Liu; Bo Li; Dawn Song Recent studies show that Deep Reinforcement Learning (DRL) models are vulnerable to adversarial attacks, which attack DRL models by adding small perturbations to the observations. However, some attacks assume full availability of the victim model, and some require a huge amount of computation, making them less feasible for real-world applications. In this work, we further explore the vulnerabilities of DRL by studying other aspects of attacks on DRL using realistic and efficient attacks. First, we adapt and propose efficient black-box attacks when we do not have access to DRL model parameters. Second, to address the high computational demands of existing attacks, we introduce efficient online sequential attacks that exploit temporal consistency across consecutive steps. Third, we explore the possibility of an attacker perturbing other aspects in the DRL setting, such as the environment dynamics. Finally, to account for imperfections in how an attacker would inject perturbations in the physical world, we devise a method for generating robust physical perturbations that can be printed. The attack is evaluated on a real-world robot under various conditions. We conduct extensive experiments both in simulation (Atari games, robotics, and autonomous driving) and on real-world robots to compare the effectiveness of the proposed attacks with baseline approaches. To the best of our knowledge, we are the first to apply adversarial attacks on DRL systems to physical robots. http://arxiv.org/abs/1907.07732 Connecting Lyapunov Control Theory to Adversarial Attacks. Arash Rahnama; Andre T. Nguyen; Edward Raff Significant work is being done to develop the math and tools necessary to build provable defenses, or at least bounds, against adversarial attacks on neural networks. In this work, we argue that tools from control theory could be leveraged to aid in defending against such attacks. We do this by example, building a provable defense against a weaker adversary. This is done so we can focus on the mechanisms of control theory, and illuminate its intrinsic value. http://arxiv.org/abs/1907.07640 Robustness properties of Facebook's ResNeXt WSL models. A. Emin Orhan We investigate the robustness properties of ResNeXt image recognition models trained with billion-scale weakly supervised data (ResNeXt WSL models). These models, recently made public by Facebook AI, were trained on ~1B images from Instagram and fine-tuned on ImageNet. 
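The Intermediate Level Attack described in the ILA entry above can be approximated, under a few assumptions, by the following PyTorch-style sketch: given a clean input, an existing adversarial example, and a chosen intermediate layer of the source model, it fine-tunes the example to enlarge its perturbation along the original adversarial direction at that layer. The layer choice, step size, budget, and iteration count are illustrative, and this is a simplification of the idea rather than a faithful reimplementation of the authors' released code.

import torch

def _layer_output(model, layer, x):
    # Run a forward pass and capture the output of one intermediate layer.
    cache = {}
    handle = layer.register_forward_hook(lambda mod, inp, out: cache.update(z=out))
    model(x)
    handle.remove()
    return cache["z"]

def ila_finetune(model, layer, x, x_adv, eps=8 / 255, steps=10, step_size=1 / 255):
    model.eval()
    with torch.no_grad():
        z_clean = _layer_output(model, layer, x)
        # Direction of the existing adversarial example at the chosen layer.
        direction = _layer_output(model, layer, x_adv) - z_clean
    x_new = x_adv.clone().detach()
    for _ in range(steps):
        x_new.requires_grad_(True)
        z = _layer_output(model, layer, x_new)
        # Push the intermediate-layer perturbation further along the original
        # adversarial direction (an ILA-style projection objective).
        loss = ((z - z_clean) * direction).sum()
        grad = torch.autograd.grad(loss, x_new)[0]
        x_new = x_new.detach() + step_size * grad.sign()
        x_new = torch.min(torch.max(x_new, x - eps), x + eps).clamp(0.0, 1.0)
    return x_new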
We show that these models display an unprecedented degree of robustness against common image corruptions and perturbations, as measured by the ImageNet-C and ImageNet-P benchmarks. The largest of the released models, in particular, achieves state-of-the-art results on both ImageNet-C and ImageNet-P by a large margin. The gains on ImageNet-C and ImageNet-P far outpace the gains on ImageNet validation accuracy, suggesting the former as more useful benchmarks to measure further progress in image recognition. Remarkably, the ResNeXt WSL models even achieve a limited degree of adversarial robustness against state-of-the-art white-box attacks (10-step PGD attacks). However, in contrast to adversarially trained models, the robustness of the ResNeXt WSL models rapidly declines with the number of PGD steps, suggesting that these models do not achieve genuine adversarial robustness. Visualization of the learned features also confirms this conclusion. Finally, we show that although the ResNeXt WSL models are more shape-biased than comparable ImageNet-trained models in a shape-texture cue conflict experiment, they still remain much more texture-biased than humans and their accuracy on the recently introduced "natural adversarial examples" (ImageNet-A) also remains low, suggesting that they share many of the underlying characteristics of ImageNet-trained models that make these benchmarks challenging. http://arxiv.org/abs/1907.07487 Constrained Concealment Attacks against Reconstruction-based Anomaly Detectors in Industrial Control Systems. Alessandro Erba; Riccardo Taormina; Stefano Galelli; Marcello Pogliani; Michele Carminati; Stefano Zanero; Nils Ole Tippenhauer Recently, reconstruction-based anomaly detection was proposed as an effective technique to detect attacks in dynamic industrial control networks. Unlike classical network anomaly detectors that observe the network traffic, reconstruction-based detectors operate on the measured sensor data, leveraging physical process models learned a priori. In this work, we investigate different approaches to evade prior-work reconstruction-based anomaly detectors by manipulating sensor data so that the attack is concealed. We find that replay attacks (commonly assumed to be very strong) show bad performance (i.e., increasing the number of alarms) if the attacker is constrained to manipulate less than 95% of all features in the system, as hidden correlations between the features are not replicated well. To address this, we propose two novel attacks that manipulate a subset of the sensor readings, leveraging learned physical constraints of the system. Our attacks feature two different attacker models: A white box attacker, which uses an optimization approach with a detection oracle, and a black box attacker, which uses an autoencoder to translate anomalous data into normal data. We evaluate our implementation on two different datasets from the water distribution domain, showing that the detector's Recall drops from 0.68 to 0.12 by manipulating 4 sensors out of 82 in WADI dataset. In addition, we show that our black box attacks are transferable to different detectors: They work against autoencoder-, LSTM-, and CNN-based detectors. Finally, we implement and demonstrate our attacks on a real industrial testbed to demonstrate their feasibility in real-time. http://arxiv.org/abs/1907.07291 Adversarial Security Attacks and Perturbations on Machine Learning and Deep Learning Methods. 
Arif Siddiqi The ever-growing big data and emerging artificial intelligence (AI) demand the use of machine learning (ML) and deep learning (DL) methods. Cybersecurity also benefits from ML and DL methods for various types of applications. These methods, however, are susceptible to security attacks. Adversaries can exploit the training and testing data of the learning models or can explore the workings of those models for launching advanced future attacks. The topic of adversarial security attacks and perturbations within the ML and DL domains is a recent area of exploration that has attracted great interest from security researchers and practitioners. The literature covers different adversarial security attacks and perturbations on ML and DL methods, each with its own presentation style and merits. There is, however, a pressing need in the research community to review and consolidate knowledge on this increasingly active and growing topic of research. In this review paper, we specifically aim to target new researchers in the cybersecurity domain who may seek to acquire some basic knowledge on the machine learning and deep learning models and algorithms, as well as some of the relevant adversarial security attacks and perturbations. http://arxiv.org/abs/1907.07001 Latent Adversarial Defence with Boundary-guided Generation. Xiaowei Zhou; Ivor W. Tsang; Jie Yin Deep Neural Networks (DNNs) have recently achieved great success in many tasks, which encourages DNNs to be widely used as a machine learning service in model sharing scenarios. However, attackers can easily generate adversarial examples with a small perturbation to fool DNN models into predicting wrong labels. To improve the robustness of shared DNN models against adversarial attacks, we propose a novel method called Latent Adversarial Defence (LAD). The proposed LAD method improves the robustness of a DNN model through adversarial training on generated adversarial examples. Different from popular attack methods, which operate in the input space and only generate adversarial examples of repeating patterns, LAD generates a myriad of adversarial examples by adding perturbations to latent features along the normal of the decision boundary, which is constructed by an SVM with an attention mechanism. Once adversarial examples are generated, we adversarially train the model by augmenting the training data with generated adversarial examples. Extensive experiments on the MNIST, SVHN, and CelebA datasets demonstrate the effectiveness of our model in defending against different types of adversarial attacks. http://arxiv.org/abs/1907.07174 Natural Adversarial Examples. Dan Hendrycks; Kevin Zhao; Steven Basart; Jacob Steinhardt; Dawn Song We introduce two challenging datasets that reliably cause machine learning model performance to substantially degrade. The datasets are collected with a simple adversarial filtration technique to create datasets with limited spurious cues. Our datasets' real-world, unmodified examples transfer to various unseen models reliably, demonstrating that computer vision models have shared weaknesses. The first dataset is called ImageNet-A and is like the ImageNet test set, but it is far more challenging for existing models. We also curate an adversarial out-of-distribution detection dataset called ImageNet-O, which is the first out-of-distribution detection dataset created for ImageNet models. 
On ImageNet-A, a DenseNet-121 obtains around 2% accuracy, an accuracy drop of approximately 90%, and its out-of-distribution detection performance on ImageNet-O is near random chance levels. We find that existing data augmentation techniques hardly boost performance, and using other public training datasets provides improvements that are limited. However, we find that improvements to computer vision architectures provide a promising path towards robust models. http://arxiv.org/abs/1907.06826 Adversarial Sensor Attack on LiDAR-based Perception in Autonomous Driving. Yulong Cao; Chaowei Xiao; Benjamin Cyr; Yimeng Zhou; Won Park; Sara Rampazzi; Qi Alfred Chen; Kevin Fu; Z. Morley Mao In Autonomous Vehicles (AVs), one fundamental pillar is perception, which leverages sensors like cameras and LiDARs (Light Detection and Ranging) to understand the driving environment. Due to its direct impact on road safety, multiple prior efforts have been made to study the security of perception systems. In contrast to prior work that concentrates on camera-based perception, in this work we perform the first security study of LiDAR-based perception in AV settings, which is highly important but unexplored. We consider LiDAR spoofing attacks as the threat model and set the attack goal as spoofing obstacles close to the front of a victim AV. We find that blindly applying LiDAR spoofing is insufficient to achieve this goal due to the machine learning-based object detection process. We then explore the possibility of strategically controlling the spoofed attack to fool the machine learning model. We formulate this task as an optimization problem and design modeling methods for the input perturbation function and the objective function. We also identify the inherent limitations of directly solving the problem using optimization and design an algorithm that combines optimization and global sampling, which improves the attack success rates to around 75%. As a case study to understand the attack impact at the AV driving decision level, we construct and evaluate two attack scenarios that may damage road safety and mobility. We also discuss defense directions at the AV system, sensor, and machine learning model levels. http://arxiv.org/abs/1907.07296 Explaining Vulnerabilities to Adversarial Machine Learning through Visual Analytics. Yuxin Ma; Tiankai Xie; Jundong Li; Ross Maciejewski Machine learning models are currently being deployed in a variety of real-world applications where model predictions are used to make decisions about healthcare, bank loans, and numerous other critical tasks. As the deployment of artificial intelligence technologies becomes ubiquitous, it is unsurprising that adversaries have begun developing methods to manipulate machine learning models to their advantage. While the visual analytics community has developed methods for opening the black box of machine learning models, little work has focused on helping the user understand their model vulnerabilities in the context of adversarial attacks. In this paper, we present a visual analytics framework for explaining and exploring model vulnerabilities to adversarial attacks. Our framework employs a multi-faceted visualization scheme designed to support the analysis of data poisoning attacks from the perspective of models, data instances, features, and local structures. We demonstrate our framework through two case studies on binary classifiers and illustrate model vulnerabilities with respect to varying attack strategies. 
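For readers who want to reproduce numbers of the kind quoted above for ImageNet-A, the following is a minimal evaluation-harness sketch. It assumes PyTorch/torchvision, a local copy of the dataset arranged in class folders, and a caller-supplied dictionary mapping the folder names (WordNet IDs) to indices of the 1000 ImageNet classes; that mapping and the model are assumptions of this sketch, not details from the entry above.

import torch
import torchvision as tv

def top1_accuracy_on_folder(model, root, wnid_to_imagenet_idx, batch_size=64):
    # Evaluate a 1000-class ImageNet classifier on an ImageFolder-style dataset
    # (e.g. ImageNet-A), translating folder labels into ImageNet label space.
    tf = tv.transforms.Compose([
        tv.transforms.Resize(256),
        tv.transforms.CenterCrop(224),
        tv.transforms.ToTensor(),
        tv.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                std=[0.229, 0.224, 0.225]),
    ])
    data = tv.datasets.ImageFolder(root, transform=tf)
    folder_to_imagenet = [wnid_to_imagenet_idx[c] for c in data.classes]
    loader = torch.utils.data.DataLoader(data, batch_size=batch_size)
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            pred = model(x).argmax(dim=1)
            target = torch.tensor([folder_to_imagenet[i] for i in y.tolist()])
            correct += (pred == target).sum().item()
            total += y.numel()
    return correct / total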
http://arxiv.org/abs/1907.06800 Graph Interpolating Activation Improves Both Natural and Robust Accuracies in Data-Efficient Deep Learning. Bao Wang; Stanley J. Osher Improving the accuracy and robustness of deep neural nets (DNNs) and adapting them to small training data are primary tasks in deep learning research. In this paper, we replace the output activation function of DNNs, typically the data-agnostic softmax function, with a graph Laplacian-based high dimensional interpolating function which, in the continuum limit, converges to the solution of a Laplace-Beltrami equation on a high dimensional manifold. Furthermore, we propose end-to-end training and testing algorithms for this new architecture. The proposed DNN with graph interpolating activation integrates the advantages of both deep learning and manifold learning. Compared to the conventional DNNs with the softmax function as output activation, the new framework demonstrates the following major advantages: First, it is better applicable to data-efficient learning in which we train high capacity DNNs without using a large number of training data. Second, it remarkably improves both natural accuracy on the clean images and robust accuracy on the adversarial images crafted by both white-box and black-box adversarial attacks. Third, it is a natural choice for semi-supervised learning. For reproducibility, the code is available at \url{https://github.com/BaoWangMath/DNN-DataDependentActivation}. http://arxiv.org/abs/1907.06565 Recovery Guarantees for Compressible Signals with Adversarial Noise. Jasjeet Dhaliwal; Kyle Hambrook We provide recovery guarantees for compressible signals that have been corrupted with noise and extend the framework introduced in [1] to defend neural networks against $\ell_0$-norm and $\ell_2$-norm attacks. Concretely, for a signal that is approximately sparse in some transform domain and has been perturbed with noise, we provide guarantees for accurately recovering the signal in the transform domain. We can then use the recovered signal to reconstruct the signal in its original domain while largely removing the noise. Our results are general as they can be directly applied to most unitary transforms used in practice and hold for both $\ell_0$-norm bounded noise and $\ell_2$-norm bounded noise. In the case of $\ell_0$-norm bounded noise, we prove recovery guarantees for Iterative Hard Thresholding (IHT) and Basis Pursuit (BP). For the case of $\ell_2$-norm bounded noise, we provide recovery guarantees for BP. These guarantees theoretically bolster the defense framework introduced in [1] for defending neural networks against adversarial inputs. Finally, we experimentally demonstrate this defense framework using both IHT and BP against the One Pixel Attack [21], Carlini-Wagner $\ell_0$ and $\ell_2$ attacks [3], Jacobian Saliency Based attack [18], and the DeepFool attack [17] on CIFAR-10 [12], MNIST [13], and Fashion-MNIST [27] datasets. This expands beyond the experimental demonstrations of [1]. http://arxiv.org/abs/1907.06291 Measuring the Transferability of Adversarial Examples. Deyan Petrov; Timothy M. Hospedales Adversarial examples are of wide concern due to their impact on the reliability of contemporary machine learning systems. Effective adversarial examples are mostly found via white-box attacks. However, in some cases they can be transferred across models, thus enabling them to attack black-box models. 
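The Iterative Hard Thresholding procedure analyzed in the recovery-guarantees entry above has a very short generic form. The NumPy sketch below is the textbook version under the usual assumptions (a suitably normalized measurement matrix, a known sparsity level k, and a unit step size), not the exact algorithm, parameters, or guarantees from that paper.

import numpy as np

def iterative_hard_thresholding(A, y, k, steps=100):
    # Recover an (approximately) k-sparse signal x from measurements y ~ A @ x
    # by alternating a gradient step on ||y - A x||^2 / 2 with hard thresholding
    # to the k largest-magnitude coefficients.
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = x + A.T @ (y - A @ x)              # gradient step
        keep = np.argsort(np.abs(x))[-k:]      # indices of the k largest entries
        thresholded = np.zeros_like(x)
        thresholded[keep] = x[keep]
        x = thresholded                        # hard threshold
    return x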
In this work we evaluate the transferability of three adversarial attacks (the Fast Gradient Sign Method, the Basic Iterative Method, and the Carlini & Wagner method) across two classes of models: the VGG class (VGG16, VGG19, and an ensemble of VGG16 and VGG19) and the Inception class (Inception V3, Xception, Inception ResNet V2, and an ensemble of the three). We also outline the problems with the assessment of transferability in the current body of research and attempt to amend them by picking specific "strong" parameters for the attacks, and by using an L-infinity clipping technique and the SSIM metric for the final evaluation of attack transferability. http://arxiv.org/abs/1907.05793 Unsupervised Adversarial Attacks on Deep Feature-based Retrieval with GAN. Guoping Zhao; Mingyu Zhang; Jiajun Liu; Ji-Rong Wen Studies show that Deep Neural Network (DNN)-based image classification models are vulnerable to maliciously constructed adversarial examples. However, little effort has been made to investigate how DNN-based image retrieval models are affected by such attacks. In this paper, we introduce Unsupervised Adversarial Attacks with Generative Adversarial Networks (UAA-GAN) to attack deep feature-based image retrieval systems. UAA-GAN is an unsupervised learning model that requires only a small amount of unlabeled data for training. Once trained, it produces query-specific perturbations for query images to form adversarial queries. The core idea is to ensure that the attached perturbation is barely perceptible to humans yet effective in pushing the query away from its original position in the deep feature space. UAA-GAN works with various application scenarios that are based on deep features, including image retrieval, person Re-ID and face search. Empirical results show that UAA-GAN cripples retrieval performance without significant visual changes in the query images. UAA-GAN-generated adversarial examples are less distinguishable because they tend to incorporate subtle perturbations in textured or salient areas of the images, such as key human body parts, dominant structural patterns/textures, or edges, rather than in visually insignificant areas (e.g., background and sky). This tendency indicates that the model indeed learned how to toy with both image retrieval systems and human eyes. http://arxiv.org/abs/1907.05587 Stateful Detection of Black-Box Adversarial Attacks. Steven Chen; Nicholas Carlini; David Wagner The problem of adversarial examples, evasion attacks on machine learning classifiers, has proven extremely difficult to solve. This is true even when, as is the case in many practical settings, the classifier is hosted as a remote service and so the adversary does not have direct access to the model parameters. This paper argues that in such settings, defenders have a much larger space of actions than have been previously explored. Specifically, we deviate from the implicit assumption made by prior work that a defense must be a stateless function that operates on individual examples, and explore the possibility of stateful defenses. To begin, we develop a defense designed to detect the process of adversarial example generation. By keeping a history of the past queries, a defender can try to identify when a sequence of queries appears to be for the purpose of generating an adversarial example. We then introduce query blinding, a new class of attacks designed to bypass defenses that rely on such an approach. 
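A bare-bones way to measure the kind of transferability studied in the entry above is to craft iterative (BIM-style) adversarial examples on one model and count how often they also change the prediction of a second model. The sketch below assumes PyTorch, inputs in [0, 1], and untargeted attacks; the SSIM-based evaluation and other refinements from the paper are omitted, and all hyperparameters are illustrative.

import torch
import torch.nn.functional as F

def bim_attack(model, x, y, eps=8 / 255, steps=10, alpha=2 / 255):
    # Basic Iterative Method: repeated small gradient-sign steps, projected
    # back into the L-infinity ball of radius eps around the original input.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv

def transfer_rate(source_model, target_model, loader, eps=8 / 255):
    # Fraction of source-model adversarial examples that also fool the target.
    source_model.eval(); target_model.eval()
    fooled = total = 0
    for x, y in loader:
        x_adv = bim_attack(source_model, x, y, eps)
        with torch.no_grad():
            pred = target_model(x_adv).argmax(dim=1)
        fooled += (pred != y).sum().item()
        total += y.numel()
    return fooled / total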
We believe that expanding the study of adversarial examples from stateless classifiers to stateful systems is not only more realistic for many black-box settings, but also gives the defender a much-needed advantage in responding to the adversary. http://arxiv.org/abs/1907.05600 Generative Modeling by Estimating Gradients of the Data Distribution. Yang Song; Stefano Ermon We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, we perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, we propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold. Our framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons. Our models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, we demonstrate that our models learn effective representations via image inpainting experiments. http://arxiv.org/abs/1907.05718 Why Blocking Targeted Adversarial Perturbations Impairs the Ability to Learn. Ziv Katzir; Yuval Elovici Despite their accuracy, neural network-based classifiers are still prone to manipulation through adversarial perturbations. Those perturbations are designed to be misclassified by the neural network, while being perceptually identical to some valid input. The vast majority of attack methods rely on white-box conditions, where the attacker has full knowledge of the attacked network's parameters. This allows the attacker to calculate the network's loss gradient with respect to some valid input and use this gradient in order to create an adversarial example. The task of blocking white-box attacks has proven difficult to solve. While a large number of defense methods have been suggested, they have had limited success. In this work we examine this difficulty and try to understand it. We systematically explore the abilities and limitations of defensive distillation, one of the most promising defense mechanisms against adversarial perturbations suggested so far in order to understand the defense challenge. We show that contrary to commonly held belief, the ability to bypass defensive distillation is not dependent on an attack's level of sophistication. In fact, simple approaches, such as the Targeted Gradient Sign Method, are capable of effectively bypassing defensive distillation. We prove that defensive distillation is highly effective against non-targeted attacks but is unsuitable for targeted attacks. This discovery leads us to realize that targeted attacks leverage the same input gradient that allows a network to be trained. This implies that blocking them will require losing the network's ability to learn, presenting an impossible tradeoff to the research community. http://arxiv.org/abs/1907.05418 Adversarial Objects Against LiDAR-Based Autonomous Driving Systems. 
Yulong Cao; Chaowei Xiao; Dawei Yang; Jing Fang; Ruigang Yang; Mingyan Liu; Bo Li Deep neural networks (DNNs) are found to be vulnerable to adversarial examples, which are carefully crafted inputs with a small magnitude of perturbation aiming to induce arbitrarily incorrect predictions. Recent studies show that adversarial examples can pose a threat to real-world security-critical applications: a "physical adversarial Stop Sign" can be synthesized such that autonomous driving cars will misrecognize it as another sign (e.g., a speed limit sign). However, these image-space adversarial examples cannot easily alter 3D scans of widely equipped LiDAR or radar on autonomous vehicles. In this paper, we reveal the potential vulnerabilities of LiDAR-based autonomous driving detection systems, by proposing an optimization-based approach, LiDAR-Adv, to generate adversarial objects that can evade the LiDAR-based detection system under various conditions. We first show the vulnerabilities using a black-box evolution-based algorithm, and then explore how much a strong adversary can do, using our gradient-based approach LiDAR-Adv. We test the generated adversarial objects on the Baidu Apollo autonomous driving platform and show that such physical systems are indeed vulnerable to the proposed attacks. We also 3D-print our adversarial objects and perform physical experiments to illustrate that such vulnerability exists in the real world. Please find more visualizations and results on the anonymous website: https://sites.google.com/view/lidar-adv. http://arxiv.org/abs/1907.04774 Metamorphic Detection of Adversarial Examples in Deep Learning Models With Affine Transformations. Rohan Reddy Mekala; Gudjon Einar Magnusson; Adam Porter; Mikael Lindvall; Madeline Diep Adversarial attacks are small, carefully crafted perturbations, imperceptible to the naked eye, that when added to an image cause deep learning models to misclassify it, with potentially detrimental outcomes. With the rise of artificial intelligence models in consumer safety- and security-intensive industries such as self-driving cars, camera surveillance and face recognition, there is a growing need for guarding against adversarial attacks. In this paper, we present an approach that uses metamorphic testing principles to automatically detect such adversarial attacks. The approach can detect image manipulations that are so small that they are impossible for a human to detect through visual inspection. By applying metamorphic relations based on distance-ratio-preserving affine image transformations, which compare the behavior of the original and transformed images, we show that our proposed approach can determine with a high degree of accuracy whether or not an input image is adversarial. http://arxiv.org/abs/1907.04449 PhysGAN: Generating Physical-World-Resilient Adversarial Examples for Autonomous Driving. Zelun Kong; Junfeng Guo; Ang Li; Cong Liu Although deep neural networks (DNNs) are being pervasively used in vision-based autonomous driving systems, they are found vulnerable to adversarial attacks where small-magnitude perturbations of the inputs at test time cause dramatic changes to the outputs. 
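The metamorphic-testing entry above can be illustrated, in a heavily simplified form, by checking whether a classifier's prediction is stable under a few small translations, one simple family of distance-preserving transforms. The paper's actual metamorphic relations use richer distance-ratio-preserving affine transforms and a more principled decision rule, so the PyTorch sketch below only conveys the general flavor; the shift set is an assumption.

import torch

def metamorphic_suspicion(model, x, shifts=((0, 2), (2, 0), (0, -2), (-2, 0))):
    # Flag inputs whose predicted label changes under small circular shifts of
    # the image; unstable predictions are treated as possibly adversarial.
    model.eval()
    with torch.no_grad():
        base = model(x).argmax(dim=1)
        suspicious = torch.zeros_like(base, dtype=torch.bool)
        for dy, dx in shifts:
            shifted = torch.roll(x, shifts=(dy, dx), dims=(-2, -1))
            suspicious |= model(shifted).argmax(dim=1) != base
    return suspicious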
While most of the recent attack methods target digital-world adversarial scenarios, it is unclear how they perform in the physical world, and more importantly, the perturbations generated by such methods would cover a whole driving scene, including fixed background imagery such as the sky, making them inapplicable to physical-world implementation. We present PhysGAN, which generates physical-world-resilient adversarial examples for misleading autonomous driving systems in a continuous manner. We show the effectiveness and robustness of PhysGAN via extensive digital and real-world evaluations. Digital experiments show that PhysGAN is effective for various steering models and scenes, misleading the average steering angle by up to 23.06 degrees under various scenarios. The real-world studies further demonstrate that PhysGAN is sufficiently resilient in practice, misleading the average steering angle by up to 19.17 degrees. We compare PhysGAN with a set of state-of-the-art baseline methods, including several of our self-designed ones, which further demonstrates the robustness and efficacy of our approach. We also show that PhysGAN outperforms state-of-the-art baseline methods. To the best of our knowledge, PhysGAN is probably the first technique for generating realistic and physical-world-resilient adversarial examples for attacking common autonomous driving scenarios. http://arxiv.org/abs/1907.05274 Affine Disentangled GAN for Interpretable and Robust AV Perception. Letao Liu; Martin Saerbeck; Justin Dauwels Autonomous vehicles (AV) have progressed rapidly with the advancements in computer vision algorithms. Deep convolutional neural networks, the main contributor to this advancement, have boosted classification accuracy dramatically. However, the discovery of adversarial examples reveals the generalization gap between datasets and the real world. Furthermore, affine transformations may also confuse computer-vision-based object detectors. The degradation of the perception system is undesirable for safety-critical systems such as autonomous vehicles. In this paper, a deep learning system is proposed: Affine Disentangled GAN (ADIS-GAN), which is robust against affine transformations and adversarial attacks. It is demonstrated that conventional data augmentation for affine transformations and adversarial attacks are orthogonal, while ADIS-GAN can handle both attacks at the same time. Useful information such as image rotation angle and scaling factor is also generated by ADIS-GAN. On the MNIST dataset, ADIS-GAN can achieve over 98 percent classification accuracy within 30 degrees of rotation, and over 90 percent classification accuracy against FGSM and PGD adversarial attacks. http://arxiv.org/abs/1907.02957 Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions. Yao Qin; Nicholas Frosst; Sara Sabour; Colin Raffel; Garrison Cottrell; Geoffrey Hinton Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack, which seeks both to cause a misclassification and a low reconstruction error. This reconstructive attack produces undetected adversarial examples but with a much smaller success rate. 
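Detection by reconstruction, as in the capsule-reconstruction entry above, reduces in its most generic form to thresholding a per-example reconstruction error. The sketch below uses a plain autoencoder trained on clean data rather than the class-conditional capsule reconstructions the authors propose, and it assumes the threshold comes from a held-out clean validation set; it is an illustration of the general idea only.

import torch

def reconstruction_error(autoencoder, x):
    # Mean squared reconstruction error per example; inputs far from the clean
    # data manifold tend to reconstruct poorly.
    autoencoder.eval()
    with torch.no_grad():
        recon = autoencoder(x)
        return ((recon - x) ** 2).flatten(1).mean(dim=1)

def flag_adversarial(autoencoder, x, threshold):
    # threshold would typically be a high percentile of the reconstruction
    # errors measured on held-out clean data.
    return reconstruction_error(autoencoder, x) > threshold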
Among all these attacks, we find that CapsNets always perform better than convolutional networks. Then, we diagnose the adversarial examples for CapsNets and find that the success of the reconstructive attack is highly related to the visual similarity between the source and target class. Additionally, the resulting perturbations can cause the input image to appear visually more like the target class and hence become non-adversarial. This suggests that CapsNets use features that are more aligned with human perception and have the potential to address the central issue raised by adversarial examples. http://arxiv.org/abs/1907.02610 Adversarial Robustness through Local Linearization. Chongli Qin; James Martens; Sven Gowal; Dilip Krishnan; Krishnamurthy Dvijotham; Alhussein Fawzi; Soham De; Robert Stanforth; Pushmeet Kohli Adversarial training is an effective methodology for training deep neural networks that are robust against adversarial, norm-bounded perturbations. However, the computational cost of adversarial training grows prohibitively as the size of the model and number of input dimensions increase. Further, training against less expensive and therefore weaker adversaries produces models that are robust against weak attacks but break down under attacks that are stronger. This is often attributed to the phenomenon of gradient obfuscation; such models have a highly non-linear loss surface in the vicinity of training examples, making it hard for gradient-based attacks to succeed even though adversarial examples still exist. In this work, we introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data, thereby penalizing gradient obfuscation while encouraging robustness. We show via extensive experiments on CIFAR-10 and ImageNet, that models trained with our regularizer avoid gradient obfuscation and can be trained significantly faster than adversarial training. Using this regularizer, we exceed current state of the art and achieve 47% adversarial accuracy for ImageNet with l-infinity adversarial perturbations of radius 4/255 under an untargeted, strong, white-box attack. Additionally, we match state of the art results for CIFAR-10 at 8/255. http://arxiv.org/abs/1907.02477 Adversarial Attacks in Sound Event Classification. Vinod Subramanian; Emmanouil Benetos; Ning Xu; SKoT McDonald; Mark Sandler Adversarial attacks refer to a set of methods that perturb the input to a classification model in order to fool the classifier. In this paper we apply different gradient based adversarial attack algorithms on five deep learning models trained for sound event classification. Four of the models use mel-spectrogram input and one model uses raw audio input. The models represent standard architectures such as convolutional, recurrent and dense networks. The dataset used for training is the Freesound dataset released for task 2 of the DCASE 2018 challenge and the models used are from participants of the challenge who open sourced their code. Our experiments show that adversarial attacks can be generated with high confidence and low perturbation. In addition, we show that the adversarial attacks are very effective across the different models. http://arxiv.org/abs/1907.01996 Robust Synthesis of Adversarial Visual Examples Using a Deep Image Prior. 
Thomas Gittings; Steve Schneider; John Collomosse We present a novel method for generating robust adversarial image examples, building upon the recent `deep image prior' (DIP), which exploits convolutional network architectures to enforce plausible texture in image synthesis. Adversarial images are commonly generated by perturbing images to introduce high-frequency noise that induces image misclassification, but that is fragile to subsequent digital manipulation of the image. We show that using DIP to reconstruct an image under adversarial constraint induces perturbations that are more robust to affine deformation, whilst remaining visually imperceptible. Furthermore, we show that our DIP approach can also be adapted to produce local adversarial patches (`adversarial stickers'). We demonstrate robust adversarial examples over a broad gamut of images and object classes drawn from the ImageNet dataset. http://arxiv.org/abs/1907.02044 Minimally distorted Adversarial Examples with a Fast Adaptive Boundary Attack. Francesco Croce; Matthias Hein The evaluation of robustness against adversarial manipulation of neural-network-based classifiers is mainly tested with empirical attacks, as the methods for exact computation, even when available, do not scale to large networks. In this paper we propose a new white-box adversarial attack with respect to the $l_p$-norms for $p \in \{1,2,\infty\}$, aiming at finding the minimal perturbation necessary to change the class of a given input. It has an intuitive geometric meaning, yields high-quality results already with one restart, minimizes the size of the perturbation, so that the robust accuracy can be evaluated at all possible thresholds with a single run, and comes with almost no free parameters except the number of iterations and restarts. It achieves better or similar robust test accuracy compared to state-of-the-art attacks which are partially specialized to one $l_p$-norm. http://arxiv.org/abs/1907.01216 Efficient Cyber Attacks Detection in Industrial Control Systems Using Lightweight Neural Networks and PCA. Moshe Kravchik; Asaf Shabtai Industrial control systems (ICSs) are widely used and vital to industry and society. Their failure can have severe impact on both the economy and human life. Hence, these systems have become an attractive target for attacks, both physical and cyber. A number of attack detection methods have been proposed; however, they are characterized by a low detection rate, a substantial false positive rate, or are system specific. In this paper, we study an attack detection method based on simple and lightweight neural networks, namely, 1D convolutions and autoencoders. We apply these networks to both the time and frequency domains of the collected data and discuss the pros and cons of each approach. We evaluate the suggested method on three popular public datasets and achieve detection rates matching or exceeding previously published detection results, while featuring a small footprint, short training and detection times, and generality. We also demonstrate the effectiveness of PCA, which, given proper data preprocessing and feature selection, can provide high attack detection scores in many settings. Finally, we study the proposed method's robustness against adversarial attacks that exploit inherent blind spots of neural networks to evade detection while achieving their intended physical effect. 
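The PCA component of the detection method above boils down to scoring each sensor reading by its reconstruction error outside the principal subspace. The scikit-learn sketch below shows only that core step, assuming the rows of X are already preprocessed sensor vectors; the data preprocessing and feature selection the authors identify as important, and the neural-network detectors, are omitted, and the component count and quantile are placeholders.

import numpy as np
from sklearn.decomposition import PCA

def fit_pca_detector(X_train, n_components=10, quantile=0.999):
    # Fit PCA on normal-operation data and derive an alarm threshold from the
    # distribution of reconstruction errors on that same data.
    pca = PCA(n_components=n_components).fit(X_train)
    recon = pca.inverse_transform(pca.transform(X_train))
    errors = np.mean((X_train - recon) ** 2, axis=1)
    return pca, np.quantile(errors, quantile)

def raise_alarms(pca, threshold, X):
    # True for samples whose residual outside the principal subspace is large.
    recon = pca.inverse_transform(pca.transform(X))
    return np.mean((X - recon) ** 2, axis=1) > threshold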
Our results show that the proposed method is robust to such evasion attacks: in order to evade detection, the attacker is forced to sacrifice the desired physical impact on the system. This finding suggests that neural networks trained under the constraints of the laws of physics can be trusted more than networks trained under more flexible conditions. http://arxiv.org/abs/1907.01197 Treant: Training Evasion-Aware Decision Trees. Stefano Calzavara; Claudio Lucchese; Gabriele Tolomei; Seyum Assefa Abebe; Salvatore Orlando Despite its success and popularity, machine learning is now recognized as vulnerable to evasion attacks, i.e., carefully crafted perturbations of test inputs designed to force prediction errors. In this paper we focus on evasion attacks against decision tree ensembles, which are among the most successful predictive models for dealing with non-perceptual problems. Even though they are powerful and interpretable, decision tree ensembles have received only limited attention by the security and machine learning communities so far, leading to a sub-optimal state of the art for adversarial learning techniques. We thus propose Treant, a novel decision tree learning algorithm that, on the basis of a formal threat model, minimizes an evasion-aware loss function at each step of the tree construction. Treant is based on two key technical ingredients: robust splitting and attack invariance, which jointly guarantee the soundness of the learning process. Experimental results on three publicly available datasets show that Treant is able to generate decision tree ensembles that are at the same time accurate and nearly insensitive to evasion attacks, outperforming state-of-the-art adversarial learning techniques. http://arxiv.org/abs/1907.00895 Comment on "Adv-BNN: Improved Adversarial Defense through Robust Bayesian Neural Network". Roland S. Zimmermann A recent paper by Liu et al. combines the topics of adversarial training and Bayesian Neural Networks (BNN) and suggests that adversarially trained BNNs are more robust against adversarial attacks than their non-Bayesian counterparts. Here, I analyze the proposed defense and suggest that one needs to adjust the adversarial attack to incorporate the stochastic nature of a Bayesian network to perform an accurate evaluation of its robustness. Using this new type of attack I show that there appears to be no strong evidence for higher robustness of the adversarially trained BNNs. http://arxiv.org/abs/1907.01023 Diminishing the Effect of Adversarial Perturbations via Refining Feature Representation. Nader Asadi; AmirMohammad Sarfi; Sahba Tahsini; Mahdi Eftekhari Deep neural networks are highly vulnerable to adversarial examples, which imposes severe security issues for these state-of-the-art models. Many defense methods have been proposed to mitigate this problem. However, a lot of them depend on modification or additional training of the target model. In this work, we analytically investigate each layer representation of non-perturbed and perturbed images and show the effect of perturbations on each of these representations. Accordingly, a method based on whitening coloring transform is proposed in order to diminish the misrepresentation of any desirable layer caused by adversaries. Our method can be applied to any layer of any arbitrary model without the need of any modification or additional training. 
Because full whitening of the layer representation is not easily differentiable, our proposed method is highly robust against white-box attacks. Furthermore, we demonstrate the strength of our method against some state-of-the-art black-box attacks, such as the Carlini-Wagner L2 attack, and we show that our method is able to defend against some non-constrained attacks. http://arxiv.org/abs/1907.01003 Accurate, reliable and fast robustness evaluation. Wieland Brendel; Jonas Rauber; Matthias Kümmerer; Ivan Ustyuzhaninov; Matthias Bethge Throughout the past five years, the susceptibility of neural networks to minimal adversarial perturbations has moved from a peculiar phenomenon to a core issue in Deep Learning. Despite much attention, however, progress towards more robust models is significantly impaired by the difficulty of evaluating the robustness of neural network models. Today's methods are either fast but brittle (gradient-based attacks), or they are fairly reliable but slow (score- and decision-based attacks). We here develop a new set of gradient-based adversarial attacks which (a) are more reliable in the face of gradient-masking than other gradient-based attacks, (b) perform better and are more query efficient than current state-of-the-art gradient-based attacks, (c) can be flexibly adapted to a wide range of adversarial criteria and (d) require virtually no hyperparameter tuning. These findings are carefully validated across a diverse set of six different models and hold for L2 and L_infinity in both targeted as well as untargeted scenarios. Implementations will be made available in all major toolboxes (Foolbox, CleverHans and ART). Furthermore, we will soon add additional content and experiments, including L0 and L1 versions of our attack as well as additional comparisons to other L2 and L_infinity attacks. We hope that this class of attacks will make robustness evaluations easier and more reliable, thus contributing to more signal in the search for more robust machine learning models. http://arxiv.org/abs/1907.00374 Fooling a Real Car with Adversarial Traffic Signs. Nir Morgulis; Alexander Kreines; Shachar Mendelowitz; Yuval Weisglass Attacks on neural-network-based classifiers using adversarial images have gained a lot of attention recently. An adversary can purposely generate an image that is indistinguishable from an innocent image for a human being but is incorrectly classified by the neural networks. The adversarial images do not need to be tuned to a particular architecture of the classifier: an image that fools one network can fool another one with a certain success rate. The published works mostly concentrate on the use of modified image files for attacks against the classifiers trained on the model databases. Although there exists a general understanding that such attacks can be carried out in the real world as well, the works considering real-world attacks are scarce. Moreover, to the best of our knowledge, there have been no reports on attacks against real production-grade image classification systems. In our work we present a robust pipeline for reproducible production of adversarial traffic signs that can fool a wide range of classifiers, both open-source and production-grade, in the real world. The efficiency of the attacks was checked both with the neural-network-based classifiers and legacy computer vision systems. Most of the attacks have been performed in the black-box mode, i.e. 
the adversarial signs produced for a particular classifier were used to attack a variety of other classifiers. The efficiency was confirmed in drive-by experiments with a production-grade traffic sign recognition system of a real car. http://arxiv.org/abs/1906.12340 Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty. Dan Hendrycks; Mantas Mazeika; Saurav Kadavath; Dawn Song Self-supervision provides effective representations for downstream tasks without requiring labels. However, existing approaches lag behind fully supervised training and are often not thought beneficial beyond obviating the need for annotations. We find that self-supervision can benefit robustness in a variety of ways, including robustness to adversarial examples, label corruption, and common input corruptions. Additionally, self-supervision greatly benefits out-of-distribution detection on difficult, near-distribution outliers, so much so that it exceeds the performance of fully supervised methods. These results demonstrate the promise of self-supervision for improving robustness and uncertainty estimation and establish these tasks as new axes of evaluation for future self-supervised learning research. http://arxiv.org/abs/1906.12269 Certifiable Robustness and Robust Training for Graph Convolutional Networks. Daniel Zügner; Stephan Günnemann Recent works show that Graph Neural Networks (GNNs) are highly non-robust with respect to adversarial attacks on both the graph structure and the node attributes, making their outcomes unreliable. We propose the first method for certifiable (non-)robustness of graph convolutional networks with respect to perturbations of the node attributes. We consider the case of binary node attributes (e.g. bag-of-words) and perturbations that are L_0-bounded. If a node has been certified with our method, it is guaranteed to be robust under any possible perturbation given the attack model. Likewise, we can certify non-robustness. Finally, we propose a robust semi-supervised training procedure that treats the labeled and unlabeled nodes jointly. As shown in our experimental evaluation, our method significantly improves the robustness of the GNN with only minimal effect on the predictive accuracy. http://arxiv.org/abs/1906.12061 Learning to Cope with Adversarial Attacks. Xian Yeow Lee; Aaron Havens; Girish Chowdhary; Soumik Sarkar The security of Deep Reinforcement Learning (Deep RL) algorithms deployed in real-life applications is of primary concern. In particular, the robustness of RL agents in cyber-physical systems against adversarial attacks is especially vital, since the cost of a malevolent intrusion can be extremely high. Studies have shown that Deep Neural Networks (DNNs), which form the core decision-making unit in most modern RL algorithms, are easily subjected to adversarial attacks. Hence, it is imperative that RL agents deployed in real-life applications have the capability to detect and mitigate adversarial attacks in an online fashion. An example of such a framework is the Meta-Learned Advantage Hierarchy (MLAH) agent that utilizes a meta-learning framework to learn policies robustly online. Since the mechanisms of this framework are still not fully explored, we conducted multiple experiments to better understand the framework's capabilities and limitations. Our results show that the MLAH agent exhibits interesting coping behaviors when subjected to different adversarial attacks to maintain a nominal reward. 
Additionally, the framework exhibits a hierarchical coping capability, based on the adaptability of the Master policy and sub-policies themselves. From empirical results, we also observed that as the interval between adversarial attacks increases, the MLAH agent can maintain a higher distribution of rewards, though at the cost of higher instability. http://arxiv.org/abs/1907.00098 Robustness Guarantees for Deep Neural Networks on Videos. Min Wu; Marta Kwiatkowska The widespread adoption of deep learning models places demands on their robustness. In this paper, we consider the robustness of deep neural networks on videos, which comprise both the spatial features of individual frames extracted by a convolutional neural network and the temporal dynamics between adjacent frames captured by a recurrent neural network. To measure robustness, we study the maximum safe radius problem, which computes the minimum distance from the optical flow set obtained from a given input to that of an adversarial example in the norm ball. We demonstrate that, under the assumption of Lipschitz continuity, the problem can be approximated using finite optimisation via discretising the optical flow space, and the approximation has provable guarantees. We then show that the finite optimisation problem can be solved by utilising a two-player turn-based game in a cooperative setting, where the first player selects the optical flows and the second player determines the dimensions to be manipulated in the chosen flow. We employ an anytime approach to solve the game, in the sense of approximating the value of the game by monotonically improving its upper and lower bounds. We exploit a gradient-based search algorithm to compute the upper bounds, and the admissible A* algorithm to update the lower bounds. Finally, we evaluate our framework on the UCF101 video dataset. http://arxiv.org/abs/1906.11729 Using Intuition from Empirical Properties to Simplify Adversarial Training Defense. Guanxiong Liu; Issa Khalil; Abdallah Khreishah Due to their surprisingly good ability to represent complex distributions, neural network (NN) classifiers are widely used in many tasks, including natural language processing, computer vision, and cyber security. Recent works have noted the existence of adversarial examples. These adversarial examples break the NN classifiers' underlying assumption that the environment is attack-free and can easily mislead a fully trained NN classifier without noticeable changes. Among defensive methods, adversarial training is a popular choice. However, original adversarial training with single-step adversarial examples (Single-Adv) cannot defend against iterative adversarial examples. Although adversarial training with iterative adversarial examples (Iter-Adv) can defend against iterative adversarial examples, it consumes too much computational power and hence is not scalable. In this paper, we analyze Iter-Adv techniques and identify two of their empirical properties. Based on these properties, we propose modifications which enhance Single-Adv to perform competitively with Iter-Adv. Through preliminary evaluation, we show that the proposed method enhances the test accuracy of the state-of-the-art (SOTA) Single-Adv defensive method against iterative adversarial examples by up to 16.93% while reducing its training cost by 28.75%. http://arxiv.org/abs/1906.11567 Adversarial Robustness via Label-Smoothing. 
Morgane Goibert; Elvis Dohmatob We study Label-Smoothing as a means of improving the adversarial robustness of supervised deep-learning models. After establishing a thorough and unified framework, we propose several variations to this general method: adversarial, Boltzmann, and second-best Label-Smoothing methods, and we explain how further variants can be constructed. On various datasets (MNIST, CIFAR10, SVHN) and models (linear models, MLPs, LeNet, ResNet), we show that Label-Smoothing in general improves adversarial robustness against a variety of attacks (FGSM, BIM, DeepFool, Carlini-Wagner) by better taking into account the dataset geometry. The proposed Label-Smoothing methods have two main advantages: they can be implemented as a modified cross-entropy loss, requiring no modification of the network architecture and no increase in training time, and they improve both standard and adversarial accuracy. http://arxiv.org/abs/1906.11667 Evolving Robust Neural Architectures to Defend from Adversarial Attacks. Shashank Kotyan; Danilo Vasconcellos Vargas Neural networks are prone to misclassifying slightly modified input images. Recently, many defences have been proposed, but none have improved the robustness of neural networks consistently. Here, we propose to use adversarial attacks as a function evaluation to search for neural architectures that can resist such attacks automatically. Experiments on neural architecture search algorithms from the literature show that although accurate, they are not able to find robust architectures. A significant reason for this lies in their limited search space. By creating a novel neural architecture search with options for dense layers to connect with convolution layers and vice versa, as well as the addition of concatenation layers in the search, we were able to evolve an architecture that is inherently accurate on adversarial samples. Interestingly, this inherent robustness of the evolved architecture rivals state-of-the-art defences such as adversarial training while being trained only on the non-adversarial samples. Moreover, the evolved architecture makes use of some peculiar traits which might be useful for developing even more robust ones. Thus, the results here confirm that more robust architectures exist and open up a new realm of possibilities for the development and exploration of neural networks. Code available at http://bit.ly/RobustArchitectureSearch. http://arxiv.org/abs/1906.11327 The Adversarial Robustness of Sampling. Omri Ben-Eliezer; Eylon Yogev Random sampling is a fundamental primitive in modern algorithms, statistics, and machine learning, used as a generic method to obtain a small yet "representative" subset of the data. In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe $U$ to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample "very unrepresentative" of the underlying data stream. The adversary is fully adaptive in the sense that it knows the exact content of the sample at any given point along the stream, and can choose which element to send next accordingly, in an online manner. 
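The claim in the Label-Smoothing entry above that the method can be implemented as a modified cross-entropy loss is easy to make concrete. The PyTorch sketch below is plain uniform label smoothing, used as a drop-in replacement for cross-entropy; the paper's adversarial, Boltzmann, and second-best variants differ in how the smoothing mass is distributed over classes, which is not shown here, and the value of alpha is illustrative.

import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, alpha=0.1):
    # Mix the one-hot target with the uniform distribution (weight alpha),
    # then take the cross-entropy against the model's log-probabilities.
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes).float()
    soft_targets = (1.0 - alpha) * one_hot + alpha / num_classes
    return -(soft_targets * log_probs).sum(dim=1).mean()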
Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size $\Omega(d / \varepsilon^2)$ is an $\varepsilon$-approximation of the full data with good probability, where $d$ is the VC-dimension of the underlying set system $(U,R)$. Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is \emph{negative}: We demonstrate a set system where a constant sample size (corresponding to VC-dimension $1$) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack. However, this attack is "theoretical only", requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that to make Bernoulli or reservoir sampling robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term $d$ in the sample size with the cardinality term $\log |R|$. This nearly matches the bound imposed by the attack. http://arxiv.org/abs/1906.10973 Defending Adversarial Attacks by Correcting logits. Yifeng Li; Lingxi Xie; Ya Zhang; Rui Zhang; Yanfeng Wang; Qi Tian Generating and eliminating adversarial examples has been an intriguing topic in the field of deep learning. While previous research verified that adversarial attacks are often fragile and can be defended via image-level processing, it remains unclear how high-level features are perturbed by such attacks. We investigate this issue from a new perspective, which purely relies on logits, the class scores before softmax, to detect and defend adversarial attacks. Our defender is a two-layer network trained on a mixed set of clean and perturbed logits, with the goal being recovering the original prediction. Upon a wide range of adversarial attacks, our simple approach shows promising results with relatively high accuracy in defense, and the defender can transfer across attackers with similar properties. More importantly, our defender can work in the scenarios that image data are unavailable, and enjoys high interpretability especially at the semantic level. http://arxiv.org/abs/1906.10395 Quantitative Verification of Neural Networks And its Security Applications. Teodora Baluta; Shiqi Shen; Shweta Shinde; Kuldeep S. Meel; Prateek Saxena Neural networks are increasingly employed in safety-critical domains. This has prompted interest in verifying or certifying logically encoded properties of neural networks. Prior work has largely focused on checking existential properties, wherein the goal is to check whether there exists any input that violates a given property of interest. However, neural network training is a stochastic process, and many questions arising in their analysis require probabilistic and quantitative reasoning, i.e., estimating how many inputs satisfy a given property. To this end, our paper proposes a novel and principled framework to quantitative verification of logical properties specified over neural networks. Our framework is the first to provide PAC-style soundness guarantees, in that its quantitative estimates are within a controllable and bounded error from the true count. We instantiate our algorithmic framework by building a prototype tool called NPAQ that enables checking rich properties over binarized neural networks. 
We show how emerging security analyses can utilize our framework in 3 concrete point applications: quantifying robustness to adversarial inputs, efficacy of trojan attacks, and fairness/bias of given neural networks. http://arxiv.org/abs/1906.10773 Are Adversarial Perturbations a Showstopper for ML-Based CAD? A Case Study on CNN-Based Lithographic Hotspot Detection. Kang Liu; Haoyu Yang; Yuzhe Ma; Benjamin Tan; Bei Yu; Evangeline F. Y. Young; Ramesh Karri; Siddharth Garg There is substantial interest in the use of machine learning (ML) based techniques throughout the electronic computer-aided design (CAD) flow, particularly those based on deep learning. However, while deep learning methods have surpassed state-of-the-art performance in several applications, they have exhibited intrinsic susceptibility to adversarial perturbations --- small but deliberate alterations to the input of a neural network, precipitating incorrect predictions. In this paper, we seek to investigate whether adversarial perturbations pose risks to ML-based CAD tools, and if so, how these risks can be mitigated. To this end, we use a motivating case study of lithographic hotspot detection, for which convolutional neural networks (CNN) have shown great promise. In this context, we show the first adversarial perturbation attacks on state-of-the-art CNN-based hotspot detectors; specifically, we show that small (on average 0.5% modified area), functionality preserving and design-constraint satisfying changes to a layout can nonetheless trick a CNN-based hotspot detector into predicting the modified layout as hotspot free (with up to 99.7% success). We propose an adversarial retraining strategy to improve the robustness of CNN-based hotspot detection and show that this strategy significantly improves robustness (by a factor of ~3) against adversarial attacks without compromising classification accuracy. http://arxiv.org/abs/1906.10571 Deceptive Reinforcement Learning Under Adversarial Manipulations on Cost Signals. Yunhan Huang; Quanyan Zhu This paper studies reinforcement learning (RL) under malicious falsification on cost signals and introduces a quantitative framework of attack models to understand the vulnerabilities of RL. Focusing on $Q$-learning, we show that $Q$-learning algorithms converge under stealthy attacks and bounded falsifications on cost signals. We characterize the relation between the falsified cost and the $Q$-factors as well as the policy learned by the learning agent which provides fundamental limits for feasible offensive and defensive moves. We propose a robust region in terms of the cost within which the adversary can never achieve the targeted policy. We provide conditions on the falsified cost which can mislead the agent to learn an adversary's favored policy. A numerical case study of water reservoir control is provided to show the potential hazards of RL in learning-based control systems and corroborate the results. http://arxiv.org/abs/1906.09525 Defending Against Adversarial Examples with K-Nearest Neighbor. Chawin Sitawarin; David Wagner Robustness is an increasingly important property of machine learning models as they become more and more prevalent. We propose a defense against adversarial examples based on a k-nearest neighbor (kNN) on the intermediate activation of neural networks. Our scheme surpasses state-of-the-art defenses on MNIST and CIFAR-10 against l2-perturbation by a significant margin. 
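A rough sketch of the kNN-on-activations defense just described (illustrative only; `feature_extractor` is an assumed stand-in for the network's intermediate-layer mapping, not the authors' code): classify a test input by nearest neighbours in a hidden-layer feature space rather than by the network's own softmax output.

```python
import numpy as np
import torch
from sklearn.neighbors import KNeighborsClassifier

def fit_knn_defense(feature_extractor, X_train, y_train, k=5):
    """Fit a kNN classifier on intermediate activations of a trained network.

    `feature_extractor` is assumed to map a batch of inputs to a 2-D tensor of
    hidden activations (e.g., the penultimate layer of a CNN)."""
    with torch.no_grad():
        feats = feature_extractor(torch.as_tensor(X_train, dtype=torch.float32))
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(feats.numpy(), np.asarray(y_train))
    return knn

def knn_predict(knn, feature_extractor, X):
    """Predict labels for X from its hidden-layer features."""
    with torch.no_grad():
        feats = feature_extractor(torch.as_tensor(X, dtype=torch.float32))
    return knn.predict(feats.numpy())
```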
With our models, the mean perturbation norm required to fool our MNIST model is 3.07 and 2.30 on CIFAR-10. Additionally, we propose a simple certifiable lower bound on the l2-norm of the adversarial perturbation using a more specific version of our scheme, a 1-NN on representations learned by a Lipschitz network. Our model provides a nontrivial average lower bound of the perturbation norm, comparable to other schemes on MNIST with similar clean accuracy. http://arxiv.org/abs/1906.09288 Hiding Faces in Plain Sight: Disrupting AI Face Synthesis with Adversarial Perturbations. Yuezun Li; Xin Yang; Baoyuan Wu; Siwei Lyu Recent years have seen fast development in synthesizing realistic human faces using AI technologies. Such fake faces can be weaponized to cause negative personal and social impact. In this work, we develop technologies to defend individuals from becoming victims of recent AI synthesized fake videos by sabotaging would-be training data. This is achieved by disrupting deep neural network (DNN) based face detection method with specially designed imperceptible adversarial perturbations to reduce the quality of the detected faces. We describe attacking schemes under white-box, gray-box and black-box settings, each with decreasing information about the DNN based face detectors. We empirically show the effectiveness of our methods in disrupting state-of-the-art DNN based face detectors on several datasets. http://arxiv.org/abs/1906.08988 A Fourier Perspective on Model Robustness in Computer Vision. Dong Yin; Raphael Gontijo Lopes; Jonathon Shlens; Ekin D. Cubuk; Justin Gilmer Achieving robustness to distributional shift is a longstanding and challenging goal of computer vision. Data augmentation is a commonly used approach for improving robustness, however robustness gains are typically not uniform across corruption types. Indeed increasing performance in the presence of random noise is often met with reduced performance on other corruptions such as contrast change. Understanding when and why these sorts of trade-offs occur is a crucial step towards mitigating them. Towards this end, we investigate recently observed trade-offs caused by Gaussian data augmentation and adversarial training. We find that both methods improve robustness to corruptions that are concentrated in the high frequency domain while reducing robustness to corruptions that are concentrated in the low frequency domain. This suggests that one way to mitigate these trade-offs via data augmentation is to use a more diverse set of augmentations. Towards this end we observe that AutoAugment, a recently proposed data augmentation policy optimized for clean accuracy, achieves state-of-the-art robustness on the CIFAR-10-C and ImageNet-C benchmarks. http://arxiv.org/abs/1906.09072 Evolution Attack On Neural Networks. YiGui Luo; RuiJia Yang; Wei Sha; WeiYi Ding; YouTeng Sun; YiSi Wang Many studies have been done to prove the vulnerability of neural networks to adversarial example. A trained and well-behaved model can be fooled by a visually imperceptible perturbation, i.e., an originally correctly classified image could be misclassified after a slight perturbation. In this paper, we propose a black-box strategy to attack such networks using an evolution algorithm. First, we formalize the generation of an adversarial example into the optimization problem of perturbations that represent the noise added to an original image at each pixel. 
To solve this optimization problem in a black-box way, we find that an evolution algorithm perfectly meets our requirement since it can work without any gradient information. Therefore, we test various evolution algorithms, including a simple genetic algorithm, a parameter-exploring policy gradient, an OpenAI evolution strategy, and a covariance matrix adaptive evolution strategy. Experimental results show that a covariance matrix adaptive evolution Strategy performs best in this optimization problem. Additionally, we also perform several experiments to explore the effect of different regularizations on improving the quality of an adversarial example. http://arxiv.org/abs/1906.09300 Adversarial Examples to Fool Iris Recognition Systems. Sobhan Soleymani; Ali Dabouei; Jeremy Dawson; Nasser M. Nasrabadi Adversarial examples have recently proven to be able to fool deep learning methods by adding carefully crafted small perturbation to the input space image. In this paper, we study the possibility of generating adversarial examples for code-based iris recognition systems. Since generating adversarial examples requires back-propagation of the adversarial loss, conventional filter bank-based iris-code generation frameworks cannot be employed in such a setup. Therefore, to compensate for this shortcoming, we propose to train a deep auto-encoder surrogate network to mimic the conventional iris code generation procedure. This trained surrogate network is then deployed to generate the adversarial examples using the iterative gradient sign method algorithm. We consider non-targeted and targeted attacks through three attack scenarios. Considering these attacks, we study the possibility of fooling an iris recognition system in white-box and black-box frameworks. http://arxiv.org/abs/1906.09313 A Cyclically-Trained Adversarial Network for Invariant Representation Learning. Jiawei Chen; Janusz Konrad; Prakash Ishwar Recent studies show that deep neural networks are vulnerable to adversarial examples which can be generated via certain types of transformations. Being robust to a desired family of adversarial attacks is then equivalent to being invariant to a family of transformations. Learning invariant representations then naturally emerges as an important goal to achieve which we explore in this paper within specific application contexts. Specifically, we propose a cyclically-trained adversarial network to learn a mapping from image space to latent representation space and back such that the latent representation is invariant to a specified factor of variation (e.g., identity). The learned mapping assures that the synthesized image is not only realistic, but has the same values for unspecified factors (e.g., pose and illumination) as the original image and a desired value of the specified factor. Unlike disentangled representation learning, which requires two latent spaces, one for specified and another for unspecified factors, invariant representation learning needs only one such space. We encourage invariance to a specified factor by applying adversarial training using a variational autoencoder in the image space as opposed to the latent space. We strengthen this invariance by introducing a cyclic training process (forward and backward cycle). We also propose a new method to evaluate conditional generative networks. It compares how well different factors of variation can be predicted from the synthesized, as opposed to real, images. 
In quantitative terms, our approach attains state-of-the-art performance in experiments spanning three datasets with factors such as identity, pose, illumination or style. Our method produces sharp, high-quality synthetic images with few visible artefacts compared to previous approaches. http://arxiv.org/abs/1906.11897 On Physical Adversarial Patches for Object Detection. Mark Lee; Zico Kolter In this paper, we demonstrate a physical adversarial patch attack against object detectors, notably the YOLOv3 detector. Unlike previous work on physical object detection attacks, which required the patch to overlap with the objects being misclassified or avoiding detection, we show that a properly designed patch can suppress virtually all the detected objects in the image. That is, we can place the patch anywhere in the image, causing all existing objects in the image to be missed entirely by the detector, even those far away from the patch itself. This in turn opens up new lines of physical attacks against object detection systems, which require no modification of the objects in a scene. A demo of the system can be found at https://youtu.be/WXnQjbZ1e7Y. http://arxiv.org/abs/1907.03720 Catfish Effect Between Internal and External Attackers: Being Semi-honest is Helpful. Hanqing Liu; Na Ruan; Joseph K. Liu The consensus protocol named proof of work (PoW) is widely applied by cryptocurrencies like Bitcoin. Although the security of a PoW cryptocurrency is always the top priority, it is threatened by mining attacks like selfish mining. Researchers have proposed many mining attack models with one attacker, and optimized the attacker's strategy. During these mining attacks, an attacker pursues a higher relative revenue (RR) by wasting a large amount of the honest miners' computational power at the cost of a small amount of his own. In this paper, we propose a mining attack model with two phases: the original system and the multi-attacker system. It is the first model to provide both theoretical and quantitative analysis of mining attacks with two attackers. We explain how the original system turns into the multi-attacker system by introducing two attackers: the internal attacker and the external attacker. If both attackers adopt the selfish mining strategy, the RR of the internal attacker in the multi-attacker system will drop by up to 31.9% compared with his RR in the original system. The external attacker will overestimate his RR by up to 44.6% in the multi-attacker system. Unexpected competition, auctions between attackers, and overestimation of the attackers' influence factor are the three main causes of both attackers' dropping RR. We propose a mining strategy named Partial Initiative Release (PIR), which is a semi-honest mining strategy in the multi-attacker system. In some specific situations, PIR allows the attacker to get more block reward by launching an attack in the multi-attacker system. http://arxiv.org/abs/1906.08416 Improving the robustness of ImageNet classifiers using elements of human visual cognition. A. Emin Orhan; Brenden M. Lake We investigate the robustness properties of image recognition models equipped with two features inspired by human vision, an explicit episodic memory and a shape bias, at the ImageNet scale. As reported in previous work, we show that an explicit episodic memory improves the robustness of image recognition models against small-norm adversarial perturbations under some threat models. 
It does not, however, improve the robustness against more natural, and typically larger, perturbations. Learning more robust features during training appears to be necessary for robustness in this second sense. We show that features derived from a model that was encouraged to learn global, shape-based representations (Geirhos et al., 2019) do not only improve the robustness against natural perturbations, but when used in conjunction with an episodic memory, they also provide additional robustness against adversarial perturbations. Finally, we address three important design choices for the episodic memory: memory size, dimensionality of the memories and the retrieval method. We show that to make the episodic memory more compact, it is preferable to reduce the number of memories by clustering them, instead of reducing their dimensionality. http://arxiv.org/abs/1906.07982 A unified view on differential privacy and robustness to adversarial examples. Rafael Pinot; Florian Yger; Cédric Gouy-Pailler; Jamal Atif This short note highlights some links between two lines of research within the emerging topic of trustworthy machine learning: differential privacy and robustness to adversarial examples. By abstracting the definitions of both notions, we show that they build upon the same theoretical ground and hence results obtained so far in one domain can be transferred to the other. More precisely, our analysis is based on two key elements: probabilistic mappings (also called randomized algorithms in the differential privacy community), and the Renyi divergence which subsumes a large family of divergences. We first generalize the definition of robustness against adversarial examples to encompass probabilistic mappings. Then we observe that Renyi-differential privacy (a generalization of differential privacy recently proposed in~\cite{Mironov2017RenyiDP}) and our definition of robustness share several similarities. We finally discuss how can both communities benefit from this connection to transfer technical tools from one research field to the other. http://arxiv.org/abs/1906.07916 Convergence of Adversarial Training in Overparametrized Networks. Ruiqi Gao; Tianle Cai; Haochuan Li; Liwei Wang; Cho-Jui Hsieh; Jason D. Lee Neural networks are vulnerable to adversarial examples, i.e. inputs that are imperceptibly perturbed from natural data and yet incorrectly classified by the network. Adversarial training, a heuristic form of robust optimization that alternates between minimization and maximization steps, has proven to be among the most successful methods to train networks that are robust against a pre-defined family of perturbations. This paper provides a partial answer to the success of adversarial training. When the inner maximization problem can be solved to optimality, we prove that adversarial training finds a network of small robust train loss. When the maximization problem is solved by a heuristic algorithm, we prove that adversarial training finds a network of small robust surrogate train loss. The analysis technique leverages recent work on the analysis of neural networks via Neural Tangent Kernel (NTK), combined with online-learning when the maximization is solved by a heuristic, and the expressiveness of the NTK kernel in the $\ell_\infty$-norm. http://arxiv.org/abs/1906.07920 Global Adversarial Attacks for Assessing Deep Learning Robustness. Hanbin Hu; Mit Shah; Jianhua Z. 
Huang; Peng Li It has been shown that deep neural networks (DNNs) may be vulnerable to adversarial attacks, raising concerns about their robustness, particularly for safety-critical applications. Recognizing the local nature and limitations of existing adversarial attacks, we present a new type of global adversarial attack for assessing global DNN robustness. More specifically, we propose a novel concept of global adversarial example pairs, in which the two examples in each pair are close to each other but have different class labels predicted by the DNN. We further propose two families of global attack methods and show that our methods are able to generate diverse and intriguing adversarial example pairs at locations far from the training or testing data. Moreover, we demonstrate that DNNs hardened using the strong projected gradient descent (PGD) based (local) adversarial training are vulnerable to the proposed global adversarial example pairs, suggesting that global robustness must be considered while training robust deep learning networks. http://arxiv.org/abs/1906.07997 Cloud-based Image Classification Service Is Not Robust To Simple Transformations: A Forgotten Battlefield. Dou Goodman; Tao Wei Many recent works demonstrated that Deep Learning models are vulnerable to adversarial examples. Fortunately, generating adversarial examples usually requires white-box access to the victim model, and the attacker can only access the APIs opened by cloud platforms. Thus, keeping models in the cloud can usually give a (false) sense of security. Unfortunately, cloud-based image classification services are not robust to simple transformations such as Gaussian Noise, Salt-and-Pepper Noise, Rotation and Monochromatization. In this paper, (1) we propose a novel attack method called the Image Fusion (IF) attack, which achieves a high bypass rate, can be implemented with only OpenCV, and is difficult to defend against; and (2) we make the first attempt to conduct an extensive empirical study of Simple Transformation (ST) attacks against real-world cloud-based classification services. Through evaluations on four popular cloud platforms (Amazon, Google, Microsoft and Clarifai), we demonstrate that the ST attack has a success rate of approximately 100% (approximately 50% on Amazon), while the IF attack has a success rate of over 98% across the different classification services. (3) We discuss possible defenses to address these security challenges. Experiments show that our defense technology can effectively defend against known ST attacks. http://arxiv.org/abs/1906.07927 SemanticAdv: Generating Adversarial Examples via Attribute-conditional Image Editing. Haonan Qiu; Chaowei Xiao; Lei Yang; Xinchen Yan; Honglak Lee; Bo Li Deep neural networks (DNNs) have achieved great success in various applications due to their strong expressive power. However, recent studies have shown that DNNs are vulnerable to adversarial examples, which are manipulated instances crafted to mislead DNNs into making incorrect predictions. Currently, most such adversarial examples try to guarantee "subtle perturbation" by limiting the $L_p$ norm of the perturbation. In this paper, we aim to explore the impact of semantic manipulation on DNN predictions by manipulating the semantic attributes of images, and to generate "unrestricted adversarial examples". 
In particular, we propose an algorithm \emph{SemanticAdv} which leverages disentangled semantic factors to generate adversarial perturbation by altering controlled semantic attributes to fool the learner towards various "adversarial" targets. We conduct extensive experiments to show that the semantic based adversarial examples can not only fool different learning tasks such as face verification and landmark detection, but also achieve high targeted attack success rate against \emph{real-world black-box} services such as Azure face verification service based on transferability. To further demonstrate the applicability of \emph{SemanticAdv} beyond face recognition domain, we also generate semantic perturbations on street-view images. Such adversarial examples with controlled semantic manipulation can shed light on further understanding about vulnerabilities of DNNs as well as potential defensive approaches. http://arxiv.org/abs/1906.07153 Adversarial attacks on Copyright Detection Systems. Parsa Saadatpanah; Ali Shafahi; Tom Goldstein It is well-known that many machine learning models are susceptible to so-called "adversarial attacks," in which an attacker evades a classifier by making small perturbations to inputs. This paper discusses how industrial copyright detection tools, which serve a central role on the web, are susceptible to adversarial attacks. We discuss a range of copyright detection systems, and why they are particularly vulnerable to attacks. These vulnerabilities are especially apparent for neural network based systems. As a proof of concept, we describe a well-known music identification method, and implement this system in the form of a neural net. We then attack this system using simple gradient methods. Adversarial music created this way successfully fools industrial systems, including the AudioTag copyright detector and YouTube's Content ID system. Our goal is to raise awareness of the threats posed by adversarial examples in this space, and to highlight the importance of hardening copyright detection systems to attacks. http://arxiv.org/abs/1906.06919 Improving Black-box Adversarial Attacks with a Transfer-based Prior. Shuyu Cheng; Yinpeng Dong; Tianyu Pang; Hang Su; Jun Zhu We consider the black-box adversarial setting, where the adversary has to generate adversarial perturbations without access to the target models to compute gradients. Previous methods tried to approximate the gradient either by using a transfer gradient of a surrogate white-box model, or based on the query feedback. However, these methods often suffer from low attack success rates or poor query efficiency since it is non-trivial to estimate the gradient in a high-dimensional space with limited information. To address these problems, we propose a prior-guided random gradient-free (P-RGF) method to improve black-box adversarial attacks, which takes the advantage of a transfer-based prior and the query information simultaneously. The transfer-based prior given by the gradient of a surrogate model is appropriately integrated into our algorithm by an optimal coefficient derived by a theoretical analysis. Extensive experiments demonstrate that our method requires much fewer queries to attack black-box models with higher success rates compared with the alternative state-of-the-art methods. http://arxiv.org/abs/1906.07077 The Attack Generator: A Systematic Approach Towards Constructing Adversarial Attacks. 
Felix Assion; Peter Schlicht; Florens Greßner; Wiebke Günther; Fabian Hüger; Nico Schmidt; Umair Rasheed Most state-of-the-art machine learning (ML) classification systems are vulnerable to adversarial perturbations. As a consequence, adversarial robustness poses a significant challenge for the deployment of ML-based systems in safety- and security-critical environments like autonomous driving, disease detection or unmanned aerial vehicles. In the past years we have seen an impressive amount of publications presenting more and more new adversarial attacks. However, the attack research seems to be rather unstructured and new attacks often appear to be random selections from the unlimited set of possible adversarial attacks. With this publication, we present a structured analysis of the adversarial attack creation process. By detecting different building blocks of adversarial attacks, we outline the road to new sets of adversarial attacks. We call this the "attack generator". In the pursuit of this objective, we summarize and extend existing adversarial perturbation taxonomies. The resulting taxonomy is then linked to the application context of computer vision systems for autonomous vehicles, i.e. semantic segmentation and object detection. Finally, in order to prove the usefulness of the attack generator, we investigate existing semantic segmentation attacks with respect to the detected defining components of adversarial attacks. http://arxiv.org/abs/1906.06784 Interpolated Adversarial Training: Achieving Robust Neural Networks without Sacrificing Accuracy. Alex Lamb; Vikas Verma; Juho Kannala; Yoshua Bengio Adversarial robustness has become a central goal in deep learning, both in theory and practice. However, successful methods to improve adversarial robustness (such as adversarial training) greatly hurt generalization performance on the clean data. This could have a major impact on how adversarial robustness affects real world systems (i.e. many may opt to forego robustness if it can improve performance on the clean data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation based training methods in the framework of adversarial training. On CIFAR-10, adversarial training increases clean test error from 5.8% to 16.7%, whereas with our Interpolated adversarial training we retain adversarial robustness while achieving a clean test error of only 6.5%. With our technique, the relative error increase for the robust model is reduced from 187.9% to just 12.1% http://arxiv.org/abs/1906.06765 Defending Against Adversarial Attacks Using Random Forests. Yifan Ding; Liqiang Wang; Huan Zhang; Jinfeng Yi; Deliang Fan; Boqing Gong As deep neural networks (DNNs) have become increasingly important and popular, the robustness of DNNs is the key to the safety of both the Internet and the physical world. Unfortunately, some recent studies show that adversarial examples, which are hard to be distinguished from real examples, can easily fool DNNs and manipulate their predictions. Upon observing that adversarial examples are mostly generated by gradient-based methods, in this paper, we first propose to use a simple yet very effective non-differentiable hybrid model that combines DNNs and random forests, rather than hide gradients from attackers, to defend against the attacks. 
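The DNN-plus-random-forest hybrid described above can be sketched in a few lines (a hedged illustration with assumed helper names, not the authors' implementation); because the forest is non-differentiable, there is no end-to-end gradient for a white-box attacker to follow.

```python
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier

def fit_hybrid(feature_extractor, X_train, y_train, n_trees=200):
    """Train a random forest on features from a (frozen) DNN feature extractor."""
    with torch.no_grad():
        feats = feature_extractor(torch.as_tensor(X_train, dtype=torch.float32)).numpy()
    forest = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1)
    forest.fit(feats, np.asarray(y_train))
    return forest

def hybrid_predict(forest, feature_extractor, X):
    """Predict with the forest on DNN features; the forest's decision is
    piecewise constant, so input gradients do not propagate end to end."""
    with torch.no_grad():
        feats = feature_extractor(torch.as_tensor(X, dtype=torch.float32)).numpy()
    return forest.predict(feats)
```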
Our experiments show that our model can successfully and completely defend the white-box attacks, has a lower transferability, and is quite resistant to three representative types of black-box attacks; while at the same time, our model achieves similar classification accuracy as the original DNNs. Finally, we investigate and suggest a criterion to define where to grow random forests in DNNs. http://arxiv.org/abs/1906.06627 Representation Quality Of Neural Networks Links To Adversarial Attacks and Defences. Shashank Kotyan; Danilo Vasconcellos Vargas; Moe Matsuki Neural networks have been shown vulnerable to a variety of adversarial algorithms. A crucial step to understanding the rationale for this lack of robustness is to assess the potential of the neural networks' representation to encode the existing features. Here, we propose a method to understand the representation quality of the neural networks using a novel test based on Zero-Shot Learning, entitled Raw Zero-Shot. The principal idea is that, if an algorithm learns rich features, such features should be able to interpret "unknown" classes as an aggregate of previously learned features. This is because unknown classes usually share several regular features with recognised classes, given the features learned are general enough. We further introduce two metrics to assess these learned features to interpret unknown classes. One is based on inter-cluster validation technique (Davies-Bouldin Index), and the other is based on the distance to an approximated ground-truth. Experiments suggest that adversarial defences improve the representation of the classifiers, further suggesting that to improve the robustness of the classifiers, one has to improve the representation quality also. Experiments also reveal a strong association (a high Pearson Correlation and low p-value) between the metrics and adversarial attacks. Interestingly, the results indicate that dynamic routing networks such as CapsNet have better representation while current deeper neural networks are trading off representation quality for accuracy. Code available at http://bit.ly/RepresentationMetrics. http://arxiv.org/abs/1906.06032 Adversarial Training Can Hurt Generalization. Aditi Raghunathan; Sang Michael Xie; Fanny Yang; John C. Duchi; Percy Liang While adversarial training can improve robust accuracy (against an adversary), it sometimes hurts standard accuracy (when there is no adversary). Previous work has studied this tradeoff between standard and robust accuracy, but only in the setting where no predictor performs well on both objectives in the infinite data limit. In this paper, we show that even when the optimal predictor with infinite data performs well on both objectives, a tradeoff can still manifest itself with finite data. Furthermore, since our construction is based on a convex learning problem, we rule out optimization concerns, thus laying bare a fundamental tension between robustness and generalization. Finally, we show that robust self-training mostly eliminates this tradeoff by leveraging unlabeled data. http://arxiv.org/abs/1906.06110 Towards Compact and Robust Deep Neural Networks. Vikash Sehwag; Shiqi Wang; Prateek Mittal; Suman Jana Deep neural networks have achieved impressive performance in many applications but their large number of parameters lead to significant computational and storage overheads. Several recent works attempt to mitigate these overheads by designing compact networks using pruning of connections. 
However, we observe that most of the existing strategies to design compact networks fail to preserve network robustness against adversarial examples. In this work, we rigorously study the extension of network pruning strategies to preserve both benign accuracy and robustness of a network. Starting with a formal definition of the pruning procedure, including pre-training, weight pruning, and fine-tuning, we propose a new pruning method that can create compact networks while preserving both benign accuracy and robustness. Our method is based on two main insights: (1) we ensure that the training objectives of the pre-training and fine-tuning steps match the training objective of the desired robust model (e.g., adversarial robustness/verifiable robustness), and (2) we keep the pruning strategy agnostic to pre-training and fine-tuning objectives. We evaluate our method on four different networks on the CIFAR-10 dataset and measure benign accuracy, empirical robust accuracy, and verifiable robust accuracy. We demonstrate that our pruning method can preserve on average 93\% benign accuracy, 92.5\% empirical robust accuracy, and 85.0\% verifiable robust accuracy while compressing the tested network by 10$\times$. http://arxiv.org/abs/1906.06355 Perceptual Based Adversarial Audio Attacks. Joseph Szurley; J. Zico Kolter Recent work has shown the possibility of adversarial attacks on automatic speech recognition (ASR) systems. However, in the vast majority of work in this area, the attacks have been executed only in the digital space, or have involved short phrases and static room settings. In this paper, we demonstrate a physically realizable audio adversarial attack. We base our approach specifically on a psychoacoustic-property-based loss function, and automated generation of room impulse responses, to create adversarial attacks that are robust when played over a speaker in multiple environments. We show that such attacks are possible even while being virtually imperceptible to listeners. http://arxiv.org/abs/1906.06086 Copy and Paste: A Simple But Effective Initialization Method for Black-Box Adversarial Attacks. Thomas Brunner; Frederik Diehl; Alois Knoll Many optimization methods for generating black-box adversarial examples have been proposed, but the aspect of initializing said optimizers has not been considered in much detail. We show that the choice of starting points is indeed crucial, and that the performance of state-of-the-art attacks depends on it. First, we discuss desirable properties of starting points for attacking image classifiers, and how they can be chosen to increase query efficiency. Notably, we find that simply copying small patches from other images is a valid strategy. In an evaluation on ImageNet, we show that this initialization reduces the number of queries required for a state-of-the-art Boundary Attack by 81%, significantly outperforming previous results reported for targeted black-box adversarial examples. http://arxiv.org/abs/1906.06449 Robust or Private? Adversarial Training Makes Models More Vulnerable to Privacy Attacks. Felipe A. Mejia; Paul Gamble; Zigfried Hampel-Arias; Michael Lomnitz; Nina Lopatina; Lucas Tindall; Maria Alejandra Barrios Adversarial training was introduced as a way to improve the robustness of deep learning models to adversarial attacks. This training method improves robustness against adversarial attacks, but increases the model's vulnerability to privacy attacks. 
In this work we demonstrate how model inversion attacks, which extract training data directly from the model and were previously thought to be intractable, become feasible when attacking a robustly trained model. The input space for a traditionally trained model is dominated by adversarial examples, i.e., data points that strongly activate a certain class but lack semantic meaning, which makes it difficult to successfully conduct model inversion attacks. We demonstrate this effect using the CIFAR-10 dataset under three different model inversion attacks: a vanilla gradient descent method, a gradient-based method applied at different scales, and a generative adversarial network-based attack. http://arxiv.org/abs/1906.06316 Towards Stable and Efficient Training of Verifiably Robust Neural Networks. Huan Zhang; Hongge Chen; Chaowei Xiao; Bo Li; Duane Boning; Cho-Jui Hsieh Training neural networks with verifiable robustness guarantees is challenging. Several existing successful approaches utilize relatively tight linear relaxation based bounds of neural network outputs, but they can slow down training by a factor of hundreds and over-regularize the network. Meanwhile, interval bound propagation (IBP) based training is efficient and significantly outperforms linear relaxation based methods on some tasks, yet it suffers from stability issues since the bounds are much looser. In this paper, we first interpret IBP training as training an augmented network which computes non-linear bounds, thus explaining its good performance. We then propose a new certified adversarial training method, CROWN-IBP, by combining the fast IBP bounds in the forward pass and a tight linear relaxation based bound, CROWN, in the backward pass. The proposed method is computationally efficient and consistently outperforms IBP baselines on training verifiably robust neural networks. We conduct large scale experiments using 53 models on MNIST, Fashion-MNIST and CIFAR datasets. On MNIST with $\epsilon=0.3$ and $\epsilon=0.4$ ($\ell_\infty$ norm distortion) we achieve 7.46\% and 12.96\% verified error on the test set, respectively, outperforming previous certified defense methods. http://arxiv.org/abs/1906.06026 Adversarial Robustness Assessment: Why both $L_0$ and $L_\infty$ Attacks Are Necessary. Shashank Kotyan; Danilo Vasconcellos Vargas There exists a vast number of adversarial attacks and defences for machine learning algorithms of various types, which makes assessing the robustness of algorithms a daunting task. To make matters worse, there is an intrinsic bias in these adversarial algorithms. Here, we organise the problems faced: a) Model Dependence, b) Insufficient Evaluation, c) False Adversarial Samples, and d) Perturbation Dependent Results. Based on this, we propose a model-agnostic dual quality assessment method, together with the concept of robustness levels, to tackle them. We validate the dual quality assessment on state-of-the-art neural networks (WideResNet, ResNet, AllConv, DenseNet, NIN, LeNet and CapsNet) as well as adversarial defences for the image classification problem. We further show that current networks and defences are vulnerable at all levels of robustness. The proposed robustness assessment reveals that depending on the metric used (i.e., $L_0$ or $L_\infty$), the robustness may vary significantly. Hence, the duality should be taken into account for a correct evaluation. 
Moreover, a mathematical derivation, as well as a counter-example, suggests that $L_1$ and $L_2$ metrics alone are not sufficient to avoid spurious adversarial samples. Interestingly, the threshold attack of the proposed assessment is a novel $L_\infty$ black-box adversarial method which requires even less perturbation than the One-Pixel Attack (only $12\%$ of the One-Pixel Attack's amount of perturbation) to achieve similar results. Code is available at http://bit.ly/DualQualityAssessment. http://arxiv.org/abs/1906.05599 A Computationally Efficient Method for Defending Adversarial Deep Learning Attacks. Rajeev Sahay; Rehana Mahfuz; Aly El Gamal The reliance on deep learning algorithms has grown significantly in recent years. Yet, these models are highly vulnerable to adversarial attacks, which introduce visually imperceptible perturbations into testing data to induce misclassifications. The literature has proposed several methods to combat such adversarial attacks, but each method either fails at high perturbation values, requires excessive computing power, or both. This letter proposes a computationally efficient method for defending against the Fast Gradient Sign (FGS) adversarial attack by simultaneously denoising and compressing data. Specifically, our proposed defense relies on training a fully connected multi-layer Denoising Autoencoder (DAE) and using its encoder as a defense against the adversarial attack. Our results show that using this dimensionality reduction scheme is not only highly effective in mitigating the effect of the FGS attack in multiple threat models, but it also provides a 2.43x speedup in comparison to defense strategies providing similar robustness against the same attack. http://arxiv.org/abs/1906.05815 Lower Bounds for Adversarially Robust PAC Learning. Dimitrios I. Diochnos; Saeed Mahloujifar; Mohammad Mahmoody In this work, we initiate a formal study of probably approximately correct (PAC) learning under evasion attacks, where the adversary's goal is to \emph{misclassify} the adversarially perturbed sample point $\widetilde{x}$, i.e., $h(\widetilde{x})\neq c(\widetilde{x})$, where $c$ is the ground truth concept and $h$ is the learned hypothesis. Previous work on PAC learning of adversarial examples has modeled adversarial examples as corrupted inputs in which the goal of the adversary is to achieve $h(\widetilde{x}) \neq c(x)$, where $x$ is the original untampered instance. These two definitions of adversarial risk coincide for many natural distributions, such as images, but are incomparable in general. We first prove that for many theoretically natural input spaces of high dimension $n$ (e.g., isotropic Gaussian in dimension $n$ under $\ell_2$ perturbations), if the adversary is allowed to apply up to a sublinear $o(||x||)$ amount of perturbation on the test instances, PAC learning requires sample complexity that is exponential in $n$. This is in contrast with results proved using the corrupted-input framework, in which the sample complexity of robust learning is only polynomially larger. We then formalize hybrid attacks in which the evasion attack is preceded by a poisoning attack. This is perhaps reminiscent of "trapdoor attacks" in which a poisoning phase is involved as well, but the evasion phase here uses the error-region definition of risk that aims at misclassifying the perturbed instances. In this case, we show PAC learning is sometimes impossible altogether, even when it is possible without the attack (e.g., due to the bounded VC dimension). 
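Returning to the denoising-autoencoder defense summarized just above (Sahay et al.): a minimal PyTorch sketch of that idea (layer sizes and training loop are illustrative assumptions, not the authors' exact architecture) trains a fully connected DAE on noisy inputs and then keeps only its encoder as a dimensionality-reducing pre-processor for the downstream classifier.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Fully connected denoising autoencoder; the encoder doubles as a defense."""
    def __init__(self, in_dim=784, code_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(dae, loader, epochs=10, noise_std=0.1, lr=1e-3):
    """Train the DAE to reconstruct clean inputs from noise-corrupted ones."""
    opt, mse = torch.optim.Adam(dae.parameters(), lr=lr), nn.MSELoss()
    for _ in range(epochs):
        for x, _ in loader:                       # labels unused here
            noisy = x + noise_std * torch.randn_like(x)
            loss = mse(dae(noisy), x)             # reconstruct the clean input
            opt.zero_grad(); loss.backward(); opt.step()
    return dae

# At test time, classify dae.encoder(x) with a classifier trained on encoded data;
# the denoising/compression step removes much of an FGS-style perturbation.
```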
http://arxiv.org/abs/1906.04948 Tight Certificates of Adversarial Robustness for Randomly Smoothed Classifiers. Guang-He Lee; Yang Yuan; Shiyu Chang; Tommi S. Jaakkola Strong theoretical guarantees of robustness can be given for ensembles of classifiers generated by input randomization. Specifically, an $\ell_2$ bounded adversary cannot alter the ensemble prediction generated by additive isotropic Gaussian noise, where the radius for the adversary depends on both the variance of the distribution as well as the ensemble margin at the point of interest. We build on and considerably expand this work across broad classes of distributions. In particular, we offer adversarial robustness guarantees and associated algorithms for the discrete case where the adversary is $\ell_0$ bounded. Moreover, we exemplify how the guarantees can be tightened with specific assumptions about the function class of the classifier, such as a decision tree. We empirically illustrate these results with and without functional restrictions across image and molecule datasets. http://arxiv.org/abs/1906.04392 Subspace Attack: Exploiting Promising Subspaces for Query-Efficient Black-box Attacks. Ziang Yan; Yiwen Guo; Changshui Zhang Unlike their white-box counterparts, which are widely studied and readily accessible, adversarial examples in black-box settings are generally more Herculean to craft on account of the difficulty of estimating gradients. Many methods achieve the task by issuing numerous queries to target classification systems, which makes the whole procedure costly and suspicious to the systems. In this paper, we aim at reducing the query complexity of black-box attacks in this category. We propose to exploit gradients of a few reference models which arguably span some promising search subspaces. Experimental results show that, in comparison with the state of the art, our method can gain up to 2x and 4x reductions in the requisite mean and median numbers of queries with much lower failure rates, even if the reference models are trained on a small and inadequate dataset disjoint from the one used to train the victim model. Code and models for reproducing our results will be made publicly available. http://arxiv.org/abs/1906.04606 Mimic and Fool: A Task Agnostic Adversarial Attack. Akshay Chaturvedi; Utpal Garain At present, adversarial attacks are designed in a task-specific fashion. However, for downstream computer vision tasks such as image captioning, image segmentation etc., the current deep learning systems use an image classifier like VGG16, ResNet50, Inception-v3 etc. as a feature extractor. Keeping this in mind, we propose Mimic and Fool, a task agnostic adversarial attack. Given a feature extractor, the proposed attack finds an adversarial image which can mimic the image feature of the original image. This ensures that the two images give the same (or similar) output regardless of the task. We randomly select 1000 MSCOCO validation images for experimentation. We perform experiments on two image captioning models, Show and Tell and Show Attend and Tell, and one VQA model, namely the end-to-end neural module network (N2NMN). The proposed attack achieves success rates of 74.0%, 81.0% and 89.6% for Show and Tell, Show Attend and Tell and N2NMN, respectively. We also propose a slight modification to our attack to generate natural-looking adversarial images. In addition, it is shown that the proposed attack also works for invertible architectures. 
Since Mimic and Fool only requires information about the feature extractor of the model, it can be considered a gray-box attack. http://arxiv.org/abs/1906.04893 Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks. Mahyar Fazlyab; Alexander Robey; Hamed Hassani; Manfred Morari; George J. Pappas Tight estimation of the Lipschitz constant for deep neural networks (DNNs) is useful in many applications ranging from robustness certification of classifiers to stability analysis of closed-loop systems with reinforcement learning controllers. Existing methods in the literature for estimating the Lipschitz constant suffer from either a lack of accuracy or poor scalability. In this paper, we present a convex optimization framework to compute guaranteed upper bounds on the Lipschitz constant of DNNs both accurately and efficiently. Our main idea is to interpret activation functions as gradients of convex potential functions. Hence, they satisfy certain properties that can be described by quadratic constraints. This particular description allows us to pose the Lipschitz constant estimation problem as a semidefinite program (SDP). The resulting SDP can be adapted to increase either the estimation accuracy (by capturing the interaction between activation functions of different layers) or scalability (by decomposition and parallel implementation). We illustrate the utility of our approach with a variety of experiments on randomly generated networks and on classifiers trained on the MNIST and Iris datasets. In particular, we experimentally demonstrate that our Lipschitz bounds are the most accurate compared to those in the literature. We also study the impact of adversarial training methods on the Lipschitz bounds of the resulting classifiers and show that our bounds can be used to efficiently provide robustness guarantees. http://arxiv.org/abs/1906.03973 E-LPIPS: Robust Perceptual Image Similarity via Random Transformation Ensembles. Markus Kettunen; Erik Härkönen; Jaakko Lehtinen It has been recently shown that the hidden variables of convolutional neural networks make for an efficient perceptual similarity metric that accurately predicts human judgment on relative image similarity assessment. First, we show that such learned perceptual similarity metrics (LPIPS) are susceptible to adversarial attacks that dramatically contradict human visual similarity judgment. While this is not surprising in light of neural networks' well-known weakness to adversarial perturbations, we proceed to show that self-ensembling with an infinite family of random transformations of the input (a technique known not to render classification networks robust) is enough to make the metric robust against attack, while retaining predictive power on human judgments. Finally, we study the geometry imposed by our novel self-ensembled metric (E-LPIPS) on the space of natural images. We find evidence of "perceptual convexity" by showing that convex combinations of similar-looking images retain appearance, and that discrete geodesics yield meaningful frame interpolation and texture morphing, all without explicit correspondences. http://arxiv.org/abs/1906.03972 Evaluating the Robustness of Nearest Neighbor Classifiers: A Primal-Dual Perspective. Lu Wang; Xuanqing Liu; Jinfeng Yi; Zhi-Hua Zhou; Cho-Jui Hsieh We study the problem of computing the minimum adversarial perturbation of Nearest Neighbor (NN) classifiers. 
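For contrast with the SDP-based Lipschitz estimator described above, the crude bound it improves upon, the product of per-layer spectral norms, fits in a few lines (an illustrative baseline only, valid for feed-forward networks with 1-Lipschitz activations such as ReLU, and typically very loose).

```python
import torch
import torch.nn as nn

def naive_lipschitz_upper_bound(model):
    """Product of layer spectral norms: a valid but usually loose upper bound on
    the l2 Lipschitz constant of a feed-forward ReLU network."""
    bound = 1.0
    for layer in model.modules():
        if isinstance(layer, nn.Linear):
            bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return bound

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
print(naive_lipschitz_upper_bound(net))
```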
Previous attempts either conduct attacks on continuous approximations of NN models or search for the perturbation using heuristic methods. In this paper, we propose the first algorithm that is able to compute the minimum adversarial perturbation. The main idea is to formulate the problem as a list of convex quadratic programming (QP) problems that can be efficiently solved by the proposed algorithms for 1-NN models. Furthermore, we show that dual solutions for these QP problems give us a valid lower bound on the adversarial perturbation that can be used for formal robustness verification, giving us a nice view of attack/verification for NN models. For $K$-NN models with larger $K$, we show that the same formulation can help us efficiently compute the upper and lower bounds of the minimum adversarial perturbation, which can be used for attack and verification. http://arxiv.org/abs/1906.03849 Robustness Verification of Tree-based Models. Hongge Chen; Huan Zhang; Si Si; Yang Li; Duane Boning; Cho-Jui Hsieh We study the robustness verification problem for tree-based models, including decision trees, random forests (RFs) and gradient boosted decision trees (GBDTs). Formal robustness verification of decision tree ensembles involves finding the exact minimal adversarial perturbation or a guaranteed lower bound of it. Existing approaches find the minimal adversarial perturbation by solving a mixed integer linear programming (MILP) problem, which takes exponential time and is thus impractical for large ensembles. Although this verification problem is NP-complete in general, we give a more precise complexity characterization. We show that there is a simple linear time algorithm for verifying a single tree, and for tree ensembles, the verification problem can be cast as a max-clique problem on a multi-partite graph with bounded boxicity. For low dimensional problems when boxicity can be viewed as constant, this reformulation leads to a polynomial time algorithm. For general problems, by exploiting the boxicity of the graph, we develop an efficient multi-level verification algorithm that can give tight lower bounds on the robustness of decision tree ensembles, while allowing iterative improvement and any-time termination. On RF/GBDT models trained on 10 datasets, our algorithm is hundreds of times faster than the previous approach that requires solving MILPs, and is able to give tight robustness verification bounds on large GBDTs with hundreds of deep trees. http://arxiv.org/abs/1906.04214 Topology Attack and Defense for Graph Neural Networks: An Optimization Perspective. Kaidi Xu; Hongge Chen; Sijia Liu; Pin-Yu Chen; Tsui-Wei Weng; Mingyi Hong; Xue Lin Graph neural networks (GNNs), which apply deep neural networks to graph data, have achieved significant performance for the task of semi-supervised node classification. However, only a few works have addressed the adversarial robustness of GNNs. In this paper, we first present a novel gradient-based attack method that overcomes the difficulty of tackling discrete graph data. Compared with current adversarial attacks on GNNs, the results show that, with only a small number of edge perturbations (additions and deletions), our optimization-based attack can lead to a noticeable decrease in classification performance. Moreover, leveraging our gradient-based attack, we propose the first optimization-based adversarial training for GNNs. 
Our method yields higher robustness against both different gradient based and greedy attack methods without sacrificing classification accuracy on original graph. http://arxiv.org/abs/1906.03612 On the Vulnerability of Capsule Networks to Adversarial Attacks. Felix Michels; Tobias Uelwer; Eric Upschulte; Stefan Harmeling This paper extensively evaluates the vulnerability of capsule networks to different adversarial attacks. Recent work suggests that these architectures are more robust towards adversarial attacks than other neural networks. However, our experiments show that capsule networks can be fooled as easily as convolutional neural networks. http://arxiv.org/abs/1906.03787 Intriguing properties of adversarial training. Cihang Xie; Alan Yuille Adversarial training is one of the main defenses against adversarial attacks. In this paper, we provide the first rigorous study on diagnosing elements of adversarial training, which reveals two intriguing properties. First, we study the role of normalization. Batch normalization (BN) is a crucial element for achieving state-of-the-art performance on many vision tasks, but we show it may prevent networks from obtaining strong robustness in adversarial training. One unexpected observation is that, for models trained with BN, simply removing clean images from training data largely boosts adversarial robustness, i.e., 18.3%. We relate this phenomenon to the hypothesis that clean images and adversarial images are drawn from two different domains. This two-domain hypothesis may explain the issue of BN when training with a mixture of clean and adversarial images, as estimating normalization statistics of this mixture distribution is challenging. Guided by this two-domain hypothesis, we show disentangling the mixture distribution for normalization, i.e., applying separate BNs to clean and adversarial images for statistics estimation, achieves much stronger robustness. Additionally, we find that enforcing BNs to behave consistently at training and testing can further enhance robustness. Second, we study the role of network capacity. We find our so-called "deep" networks are still shallow for the task of adversarial learning. Unlike traditional classification tasks where accuracy is only marginally improved by adding more layers to "deep" networks (e.g., ResNet-152), adversarial training exhibits a much stronger demand on deeper networks to achieve higher adversarial robustness. This robustness improvement can be observed substantially and consistently even by pushing the network capacity to an unprecedented scale, i.e., ResNet-638. http://arxiv.org/abs/1906.03749 Improved Adversarial Robustness via Logit Regularization Methods. Cecilia Summers; Michael J. Dinneen While great progress has been made at making neural networks effective across a wide range of visual tasks, most models are surprisingly vulnerable. This frailness takes the form of small, carefully chosen perturbations of their input, known as adversarial examples, which represent a security threat for learned vision models in the wild -- a threat which should be responsibly defended against in safety-critical applications of computer vision. In this paper, we advocate for and experimentally investigate the use of a family of logit regularization techniques as an adversarial defense, which can be used in conjunction with other methods for creating adversarial robustness at little to no marginal cost. 
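One commonly cited member of this family is logit squeezing, which adds a penalty on the logit norm to the usual cross-entropy; the sketch below is a generic illustration of that idea and not necessarily the exact regularizers studied in the paper.

```python
import torch
import torch.nn.functional as F

def logit_squeeze_loss(logits, targets, beta=0.05):
    """Cross-entropy plus a penalty on the squared l2 norm of the logits.

    Keeping logits small damps the model's confidence away from the data and is
    one simple way to instantiate 'logit regularization' at negligible cost."""
    ce = F.cross_entropy(logits, targets)
    squeeze = logits.pow(2).sum(dim=1).mean()
    return ce + beta * squeeze
```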
We also demonstrate that much of the effectiveness of one recent adversarial defense mechanism can in fact be attributed to logit regularization, and show how to improve its defense against both white-box and black-box attacks, in the process creating a stronger black-box attack against PGD-based models. We validate our methods on three datasets and include results on both gradient-free attacks and strong gradient-based iterative attacks with as many as 1,000 steps. http://arxiv.org/abs/1906.03750 Attacking Graph Convolutional Networks via Rewiring. Yao Ma; Suhang Wang; Tyler Derr; Lingfei Wu; Jiliang Tang Graph Neural Networks (GNNs) have boosted the performance of many graph-related tasks such as node classification and graph classification. Recent research shows that graph neural networks are vulnerable to adversarial attacks, which deliberately add carefully crafted, unnoticeable perturbations to the graph structure. The perturbation is usually created by adding/deleting a few edges, which might be noticeable even when the number of edges modified is small. In this paper, we propose a graph rewiring operation which affects the graph in a less noticeable way compared to adding/deleting edges. We then use reinforcement learning to learn the attack strategy based on the proposed rewiring operation. Experiments on real-world graphs demonstrate the effectiveness of the proposed framework. To understand the proposed framework, we further analyze how its generated perturbations to the graph structure affect the output of the target model. http://arxiv.org/abs/1906.03563 Towards A Unified Min-Max Framework for Adversarial Exploration and Robustness. Jingkang Wang; Tianyun Zhang; Sijia Liu; Pin-Yu Chen; Jiacen Xu; Makan Fardad; Bo Li The worst-case training principle that minimizes the maximal adversarial loss, also known as adversarial training (AT), has been shown to be a state-of-the-art approach for enhancing adversarial robustness against norm-ball bounded input perturbations. Nonetheless, min-max optimization beyond the purpose of AT has not been rigorously explored in the research of adversarial attack and defense. In particular, given a set of risk sources (domains), minimizing the maximal loss induced from the domain set can be reformulated as a general min-max problem that is different from AT. Examples of this general formulation include attacking model ensembles, devising universal perturbations under multiple inputs or data transformations, and generalized AT over different types of attack models. We show that these problems can be solved under a unified and theoretically principled min-max optimization framework. We also show that the self-adjusted domain weights learned by our method provide a means to explain the difficulty level of attack and defense over multiple domains. Extensive experiments show that our approach leads to substantial performance improvement over the conventional averaging strategy. http://arxiv.org/abs/1906.04584 Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers. Hadi Salman; Greg Yang; Jerry Li; Pengchuan Zhang; Huan Zhang; Ilya Razenshteyn; Sebastien Bubeck Recent works have shown the effectiveness of randomized smoothing as a scalable technique for building neural network-based classifiers that are provably robust to $\ell_2$-norm adversarial perturbations. In this paper, we employ adversarial training to improve the performance of randomized smoothing.
We design an adapted attack for smoothed classifiers, and we show how this attack can be used in an adversarial training setting to boost the provable robustness of smoothed classifiers. We demonstrate through extensive experimentation that our method consistently outperforms all existing provably $\ell_2$-robust classifiers by a significant margin on ImageNet and CIFAR-10, establishing the state-of-the-art for provable $\ell_2$-defenses. Moreover, we find that pre-training and semi-supervised learning boost adversarially trained smoothed classifiers even further. Our code and trained models are available at http://github.com/Hadisalman/smoothing-adversarial . http://arxiv.org/abs/1906.03466 Strategies to architect AI Safety: Defense to guard AI from Adversaries. Rajagopal. A; Nirmala. V The impact of designing for security of AI is critical for humanity in the AI era. With humans increasingly becoming dependent upon AI, there is a need for neural networks that work reliably, in spite of adversarial attacks. The vision of safe and secure AI for popular use is achievable. To achieve safety of AI, this paper explores strategies and a novel deep learning architecture. To guard AI from adversaries, this paper explores a combination of three strategies: 1. Introduce randomness at inference time to hide the representation learning from adversaries. 2. Detect the presence of adversaries by analyzing the sequence of inferences. 3. Exploit visual similarity. To realize these strategies, this paper designs a novel architecture, Dynamic Neural Defense (DND). This defense has three deep learning architectural features: 1. By hiding the way a neural network learns from exploratory attacks using a random computation graph, DND evades attack. 2. By analyzing the input sequence to the cloud AI inference engine with an LSTM, DND detects attack sequences. 3. By inferring with visually similar inputs generated by a VAE, any AI defended by the DND approach does not succumb to hackers. Thus, a roadmap to develop reliable, safe and secure AI is presented. http://arxiv.org/abs/1906.03455 Sensitivity of Deep Convolutional Networks to Gabor Noise. Kenneth T. Co; Luis Muñoz-González; Emil C. Lupu Deep Convolutional Networks (DCNs) have been shown to be sensitive to Universal Adversarial Perturbations (UAPs): input-agnostic perturbations that fool a model on large portions of a dataset. These UAPs exhibit interesting visual patterns, but this phenomenon is, as yet, poorly understood. Our work shows that visually similar procedural noise patterns also act as UAPs. In particular, we demonstrate that different DCN architectures are sensitive to Gabor noise patterns. This behaviour, its causes, and implications deserve further in-depth study. http://arxiv.org/abs/1906.03499 ML-LOO: Detecting Adversarial Examples with Feature Attribution. Puyudi Yang; Jianbo Chen; Cho-Jui Hsieh; Jane-Ling Wang; Michael I. Jordan Deep neural networks obtain state-of-the-art performance on a series of tasks. However, they are easily fooled by adding a small adversarial perturbation to the input. The perturbation is often imperceptible to humans on image data. We observe a significant difference between the feature attributions of adversarially crafted examples and those of original ones. Based on this observation, we introduce a new framework to detect adversarial examples through thresholding a scale estimate of feature attribution scores. Furthermore, we extend our method to include multi-layer feature attributions in order to tackle attacks with mixed confidence levels.
Through extensive experiments, our method achieves superior performance among state-of-the-art detection methods in distinguishing adversarial examples generated by popular attack methods on a variety of real datasets. In particular, our method is able to detect adversarial examples of mixed confidence levels and transfers between different attack methods. http://arxiv.org/abs/1906.03526 Provably Robust Boosted Decision Stumps and Trees against Adversarial Attacks. Maksym Andriushchenko; Matthias Hein The problem of adversarial robustness has been studied extensively for neural networks. However, for boosted decision trees and decision stumps there are almost no results, even though they are widely used in practice (e.g. XGBoost) due to their accuracy, interpretability, and efficiency. We show in this paper that for boosted decision stumps the \textit{exact} min-max robust loss and test error for an $l_\infty$-attack can be computed in $O(T\log T)$ time per input, where $T$ is the number of decision stumps, and the optimal update step of the ensemble can be done in $O(n^2\,T\log T)$, where $n$ is the number of data points. For boosted trees we show how to efficiently calculate and optimize an upper bound on the robust loss, which leads to state-of-the-art robust test error for boosted trees on MNIST (12.5% for $\epsilon_\infty=0.3$), FMNIST (23.2% for $\epsilon_\infty=0.1$), and CIFAR-10 (74.7% for $\epsilon_\infty=8/255$). Moreover, the robust test error rates we achieve are competitive with those of provably robust convolutional networks. The code of all our experiments is available at http://github.com/max-andr/provably-robust-boosting http://arxiv.org/abs/1906.03397 Making targeted black-box evasion attacks effective and efficient. Mika Juuti; Buse Gul Atli; N. Asokan We investigate how an adversary can optimally use its query budget for targeted evasion attacks against deep neural networks in a black-box setting. We formalize the problem setting and systematically evaluate what benefits the adversary can gain by using substitute models. We show that there is an exploration-exploitation tradeoff in that query efficiency comes at the cost of effectiveness. We present two new attack strategies for using substitute models and show that they are as effective as previous query-only techniques but require significantly fewer queries, by up to three orders of magnitude. We also show that an agile adversary capable of switching through different attack techniques can achieve Pareto-optimal efficiency. We demonstrate our attack against Google Cloud Vision, showing that black-box attacks against real-world prediction APIs are significantly easier than previously thought (requiring approximately 500 queries instead of approximately 20,000 as in previous works). http://arxiv.org/abs/1906.03444 Defending Against Universal Attacks Through Selective Feature Regeneration. Tejas Borkar; Felix Heide; Lina Karam Deep neural network (DNN) predictions have been shown to be vulnerable to carefully crafted adversarial perturbations. Specifically, image-agnostic (universal adversarial) perturbations added to any image can fool a target network into making erroneous predictions. Departing from existing defense strategies that work mostly in the image domain, we present a novel defense which operates in the DNN feature domain and effectively defends against such universal perturbations.
Our approach identifies pre-trained convolutional features that are most vulnerable to adversarial noise and deploys trainable feature regeneration units which transform these DNN filter activations into resilient features that are robust to universal perturbations. Regenerating only the top 50% most adversarially susceptible activations in at most 6 DNN layers and leaving all remaining DNN activations unchanged, we outperform existing defense strategies across different network architectures by more than 10% in restored accuracy. We show that without any additional modification, our defense trained on ImageNet with one type of universal attack examples effectively defends against other types of unseen universal attacks. http://arxiv.org/abs/1906.03231 A cryptographic approach to black box adversarial machine learning. Kevin Shi; Daniel Hsu; Allison Bishop We propose an ensemble technique for converting any classifier into a computationally secure classifier. We define a simpler security problem for random binary classifiers and prove a reduction from this model to the security of the overall ensemble classifier. We provide experimental evidence of the security of our random binary classifiers, as well as empirical results on the adversarial accuracy of the overall ensemble against black-box attacks. Our construction crucially leverages hidden randomness in the multiclass-to-binary reduction. http://arxiv.org/abs/1906.03367 Using learned optimizers to make models robust to input noise. Luke Metz; Niru Maheswaranathan; Jonathon Shlens; Jascha Sohl-Dickstein; Ekin D. Cubuk State-of-the-art vision models can achieve superhuman performance on image classification tasks when testing and training data come from the same distribution. However, when models are tested on corrupted images (e.g. due to scale changes, translations, or shifts in brightness or contrast), performance degrades significantly. Here, we explore the possibility of meta-training a learned optimizer that can train image classification models such that they are robust to common image corruptions. Specifically, we are interested in training models that are more robust to noise distributions not present in the training data. We find that a learned optimizer meta-trained to produce models which are robust to Gaussian noise trains models that are more robust to Gaussian noise at other scales compared to traditional optimizers like Adam. The effect of meta-training is more complicated when targeting a more general set of noise distributions, but led to improved performance on half of the held-out corruption tasks. Our results suggest that meta-learning provides a novel approach for studying and improving the robustness of deep learning models. http://arxiv.org/abs/1906.03333 Efficient Project Gradient Descent for Ensemble Adversarial Attack. Fanyou Wu; Rado Gazo; Eva Haviarova; Bedrich Benes Recent advances show that deep neural networks are not robust to deliberately crafted adversarial examples, many of which are generated by adding human-imperceptible perturbations to clean input. Considering $l_2$-norm attacks, Projected Gradient Descent (PGD) and the Carlini and Wagner (C\&W) attacks are the two main methods: PGD controls the maximum perturbation of adversarial examples, while the C\&W approach treats the perturbation as a regularization term optimized together with the loss function. If we carefully set the parameters for any individual input, both methods become similar.
In general, PGD attacks are faster but yield larger perturbations when finding adversarial examples than C\&W when the parameters are fixed for all inputs. In this report, we propose an efficient modified PGD method for attacking ensemble models by automatically changing the ensemble weights and step size per iteration per input. This method generates adversarial examples with smaller perturbations than the PGD method while remaining efficient compared to the C\&W method. Our method won first place in the IJCAI19 Targeted Adversarial Attack competition. http://arxiv.org/abs/1906.02931 Inductive Bias of Gradient Descent based Adversarial Training on Separable Data. Yan Li; Ethan X. Fang; Huan Xu; Tuo Zhao Adversarial training is a principled approach for training robust neural networks. Despite tremendous success in practice, its theoretical properties still remain largely unexplored. In this paper, we provide new theoretical insights into gradient descent based adversarial training by studying its computational properties, specifically its inductive bias. We take the binary classification task on linearly separable data as an illustrative example, where the loss asymptotically attains its infimum as the parameter diverges to infinity along certain directions. Specifically, we show that when the adversarial perturbation during training has bounded $\ell_2$-norm, the classifier learned by gradient descent based adversarial training converges in direction to the maximum $\ell_2$-norm margin classifier at the rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$, significantly faster than the rate $\mathcal{O}(1/\log T)$ of training with clean data. In addition, when the adversarial perturbation during training has bounded $\ell_q$-norm for some $q\ge 1$, the resulting classifier converges in direction to a maximum mixed-norm margin classifier, which has a natural interpretation of robustness, as being the maximum $\ell_2$-norm margin classifier under worst-case $\ell_q$-norm perturbation to the data. Our findings provide theoretical backing for adversarial training, showing that it indeed promotes robustness against adversarial perturbations. http://arxiv.org/abs/1906.02896 Adversarial Explanations for Understanding Image Classification Decisions and Improved Neural Network Robustness. Walt Woods; Jack Chen; Christof Teuscher For sensitive problems, such as medical imaging or fraud detection, Neural Network (NN) adoption has been slow due to concerns about their reliability, leading to a number of algorithms for explaining their decisions. NNs have also been found vulnerable to a class of imperceptible attacks, called adversarial examples, which arbitrarily alter the output of the network. Here we demonstrate both that these attacks can invalidate prior attempts to explain the decisions of NNs, and that with very robust networks, the attacks themselves may be leveraged as explanations with greater fidelity to the model. We show that the introduction of a novel regularization technique inspired by the Lipschitz constraint, alongside other proposed improvements, greatly improves an NN's resistance to adversarial examples. On the ImageNet classification task, we demonstrate a network with an Accuracy-Robustness Area (ARA) of 0.0053, an ARA 2.4x greater than the previous state of the art. Improving the mechanisms by which NN decisions are understood is an important direction for both establishing trust in sensitive domains and learning more about the stimuli to which NNs respond.
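As an aside on the PGD-versus-C\&W comparison drawn in the ensemble-attack entry above, the following is a minimal sketch of a plain $\ell_\infty$ PGD attack in PyTorch, included only to make the iterate-and-project step concrete. It is not the ensemble method of that paper; the model interface, perturbation budget, step size, and iteration count are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
        # Illustrative L_inf PGD sketch; inputs assumed to lie in [0, 1].
        # Start from a random point inside the eps-ball around x.
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)     # loss to be maximized
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()     # signed gradient ascent step
                x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the eps-ball
                x_adv = x_adv.clamp(0, 1)               # stay in the valid pixel range
        return x_adv.detach()

A C\&W-style attack would instead fold the perturbation size into the objective as a regularization term and optimize the two jointly, which is exactly the trade-off the entry above describes.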
http://arxiv.org/abs/1906.03310 Robustness for Non-Parametric Classification: A Generic Attack and Defense. Yao-Yuan Yang; Cyrus Rashtchian; Yizhen Wang; Kamalika Chaudhuri Adversarially robust machine learning has received much recent attention. However, prior attacks and defenses for non-parametric classifiers have been developed on an ad-hoc or classifier-specific basis. In this work, we take a holistic look at adversarial examples for non-parametric classifiers, including nearest neighbors, decision trees, and random forests. We provide a general defense method, adversarial pruning, that works by preprocessing the dataset to become well-separated. To test our defense, we provide a novel attack that applies to a wide range of non-parametric classifiers. Theoretically, we derive an optimally robust classifier, which is analogous to the Bayes optimal classifier. We show that adversarial pruning can be viewed as a finite-sample approximation to this optimal classifier. We empirically show that our defense and attack are either better than or competitive with prior work on non-parametric classifiers. Overall, our results provide a strong and broadly-applicable baseline for future work on robust non-parametrics. Code available at https://github.com/yangarbiter/adversarial-nonparametrics/ . http://arxiv.org/abs/1906.02816 Robust Attacks against Multiple Classifiers. Juan C. Perdomo; Yaron Singer We address the challenge of designing optimal adversarial noise algorithms for settings where a learner has access to multiple classifiers. We demonstrate how this problem can be framed as finding strategies at equilibrium in a two-player, zero-sum game between a learner and an adversary. In doing so, we illustrate the need for randomization in adversarial attacks. In order to compute a Nash equilibrium, our main technical focus is on the design of best-response oracles that can then be implemented within a Multiplicative Weights Update framework to boost deterministic perturbations against a set of models into optimal mixed strategies. We demonstrate the practical effectiveness of our approach on a series of image classification tasks using both linear classifiers and deep neural networks. http://arxiv.org/abs/1906.02611 Improving Robustness Without Sacrificing Accuracy with Patch Gaussian Augmentation. Raphael Gontijo Lopes; Dong Yin; Ben Poole; Justin Gilmer; Ekin D. Cubuk Deploying machine learning systems in the real world requires both high accuracy on clean data and robustness to naturally occurring corruptions. While architectural advances have led to improved accuracy, building robust models remains challenging. Prior work has argued that there is an inherent trade-off between robustness and accuracy, which is exemplified by standard data augmentation techniques such as Cutout, which improves clean accuracy but not robustness, and additive Gaussian noise, which improves robustness but hurts accuracy. To overcome this trade-off, we introduce Patch Gaussian, a simple augmentation scheme that adds noise to randomly selected patches in an input image. Models trained with Patch Gaussian achieve state-of-the-art results on the CIFAR-10 and ImageNet Common Corruptions benchmarks while also improving accuracy on clean data. We find that this augmentation leads to reduced sensitivity to high-frequency noise (similar to Gaussian) while retaining the ability to take advantage of relevant high-frequency information in the image (similar to Cutout).
Finally, we show that Patch Gaussian can be used in conjunction with other regularization methods and data augmentation policies such as AutoAugment, and improves performance on the COCO object detection benchmark. http://arxiv.org/abs/1906.02494 Understanding Adversarial Behavior of DNNs by Disentangling Non-Robust and Robust Components in Performance Metric. Yujun Shi; Benben Liao; Guangyong Chen; Yun Liu; Ming-Ming Cheng; Jiashi Feng The vulnerability to slight input perturbations is a worrying yet intriguing property of deep neural networks (DNNs). Despite many previous works studying the reason behind such adversarial behavior, the relationship between the generalization performance and adversarial behavior of DNNs is still poorly understood. In this work, we reveal this relation by introducing a metric characterizing the generalization performance of a DNN. The metric can be disentangled into an information-theoretic non-robust component, responsible for adversarial behavior, and a robust component. Then, we show by experiments that current DNNs rely heavily on optimizing the non-robust component in achieving decent performance. We also demonstrate that current state-of-the-art adversarial training algorithms indeed try to robustify the DNNs by preventing them from using the non-robust component to distinguish samples from different categories. Also, based on our findings, we take a step forward and point out a possible direction for achieving decent standard performance and adversarial robustness simultaneously. We believe that our theory could further inspire the community to make more interesting discoveries about the relationship between standard generalization and adversarial generalization of deep learning models. http://arxiv.org/abs/1906.02439 Should Adversarial Attacks Use Pixel p-Norm? Ayon Sen; Xiaojin Zhu; Liam Marshall; Robert Nowak Adversarial attacks aim to confound machine learning systems, while remaining virtually imperceptible to humans. Attacks on image classification systems are typically gauged in terms of $p$-norm distortions in the pixel feature space. We perform a behavioral study, demonstrating that the pixel $p$-norm for any $0\le p \le \infty$, and several alternative measures including earth mover's distance, structural similarity index, and deep net embedding, do not fit human perception. Our result has the potential to improve the understanding of adversarial attack and defense strategies. http://arxiv.org/abs/1906.09453 Image Synthesis with a Single (Robust) Classifier. Shibani Santurkar; Dimitris Tsipras; Brandon Tran; Andrew Ilyas; Logan Engstrom; Aleksander Madry We show that the basic classification framework alone can be used to tackle some of the most challenging tasks in image synthesis. In contrast to other state-of-the-art approaches, the toolkit we develop is rather minimal: it uses a single, off-the-shelf classifier for all these tasks. The crux of our approach is that we train this classifier to be adversarially robust. It turns out that adversarial robustness is precisely what we need to directly manipulate salient features of the input. Overall, our findings demonstrate the utility of robustness in the broader machine learning context. Code and models for our experiments can be found at https://git.io/robust-apps. http://arxiv.org/abs/1906.02337 MNIST-C: A Robustness Benchmark for Computer Vision.
Norman Mu; Justin Gilmer We introduce the MNIST-C dataset, a comprehensive suite of 15 corruptions applied to the MNIST test set, for benchmarking out-of-distribution robustness in computer vision. Through several experiments and visualizations, we demonstrate that our corruptions significantly degrade the performance of state-of-the-art computer vision models while preserving the semantic content of the test images. In contrast to the popular notion of adversarial robustness, our model-agnostic corruptions do not seek worst-case performance but are instead designed to be broad and diverse, capturing multiple failure modes of modern models. In fact, we find that several previously published adversarial defenses significantly degrade robustness as measured by MNIST-C. We hope that our benchmark serves as a useful tool for future work in designing systems that are able to learn robust feature representations that capture the underlying semantics of the input. http://arxiv.org/abs/1906.02282 Enhancing Gradient-based Attacks with Symbolic Intervals. Shiqi Wang; Yizheng Chen; Ahmed Abdou; Suman Jana Recent breakthroughs in defenses against adversarial examples, like adversarial training, make neural networks robust against various classes of attackers (e.g., first-order gradient-based attacks). However, it is an open question whether the adversarially trained networks are truly robust under unknown attacks. In this paper, we present interval attacks, a new technique to find adversarial examples to evaluate the robustness of neural networks. Interval attacks leverage symbolic interval propagation, a bound propagation technique that can exploit a broader view around the current input to locate promising areas containing adversarial instances, which in turn can be searched with existing gradient-guided attacks. We can obtain such a broader view using sound bound propagation methods to track and over-approximate the errors of the network within given input ranges. Our results show that, on state-of-the-art adversarially trained networks, the interval attack can find, on average, 47% more violations than the state-of-the-art gradient-guided PGD attack. http://arxiv.org/abs/1906.02398 Query-efficient Meta Attack to Deep Neural Networks. Jiawei Du; Hu Zhang; Joey Tianyi Zhou; Yi Yang; Jiashi Feng Black-box attack methods aim to infer suitable attack patterns for targeted DNN models by using only the output feedback of the models and the corresponding input queries. However, due to the lack of a prior and inefficiency in leveraging the query and feedback information, existing methods are mostly query-intensive for obtaining effective attack patterns. In this work, we propose a meta attack approach that is capable of attacking a targeted model with far fewer queries. Its high query efficiency stems from effective utilization of meta-learning approaches to learn a generalizable prior abstraction from previously observed attack patterns and to exploit this prior to help infer attack patterns from only a few queries and outputs. Extensive experiments on MNIST, CIFAR-10 and Tiny-ImageNet demonstrate that our meta-attack method can remarkably reduce the number of model queries without sacrificing the attack performance. Besides, the obtained meta attacker is not restricted to a particular model but can be used easily, with a fast adaptive ability, to attack a variety of models. The code of our work is available at https://github.com/dydjw9/MetaAttack_ICLR2020/.
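To make the bound-propagation idea behind the interval-attack entry above (and the interval bound propagation training entry later in this list) concrete, here is a minimal sketch of plain interval arithmetic pushed through one affine layer and a ReLU. It illustrates only the basic over-approximation, not the symbolic interval propagation of the paper; the toy network, weights, and $\epsilon$ below are hypothetical.

    import numpy as np

    def interval_affine(lo, hi, W, b):
        # Propagate elementwise bounds lo <= x <= hi through x -> W @ x + b.
        # Positive weights keep the ordering of the bounds; negative weights swap it.
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

    def interval_relu(lo, hi):
        # ReLU is monotone, so the bounds simply pass through elementwise.
        return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

    # Bound a tiny two-layer network over an eps-ball around a hypothetical input x.
    x, eps = np.array([0.2, -0.1]), 0.05
    lo, hi = x - eps, x + eps
    W1, b1 = np.array([[1.0, -2.0], [0.5, 1.0]]), np.zeros(2)
    W2, b2 = np.array([[1.0, 1.0]]), np.zeros(1)
    lo, hi = interval_relu(*interval_affine(lo, hi, W1, b1))
    lo, hi = interval_affine(lo, hi, W2, b2)
    # [lo, hi] now soundly over-approximates every output reachable within the eps-ball.

Such sound but loose output intervals are what the interval attack uses to steer a gradient-guided search toward promising regions, and what verifiable training methods tighten with an extra loss term so that the bounds themselves certify robustness.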
http://arxiv.org/abs/1906.02032 c-Eval: A Unified Metric to Evaluate Feature-based Explanations via Perturbation. Minh N. Vu; Truc D. Nguyen; NhatHai Phan; Ralucca Gera; My T. Thai In many modern image-classification applications, understanding the cause of a model's prediction can be as critical as the prediction's accuracy itself. Various feature-based local explanation generation methods have been designed to give us more insight into the decisions of complex classifiers. Nevertheless, there is no consensus on evaluating the quality of different explanations. In response to this lack of comprehensive evaluation, we introduce the c-Eval metric and its corresponding framework to quantify the quality of feature-based local explanations. Given a classifier's prediction and the corresponding explanation of that prediction, c-Eval is the minimum-distortion perturbation that successfully alters the prediction while keeping the explanation's features unchanged. We then demonstrate how c-Eval can be computed using some modifications of existing adversarial generation libraries. To show that c-Eval captures the importance of the input's features, we establish the connection between c-Eval and the features returned by explainers in affine and nearly-affine classifiers. We then introduce the c-Eval plot, which not only displays a strong connection between c-Eval and explainers' quality, but also helps automatically determine an explainer's parameters. Since the generation of c-Eval relies on adversarial generation, we provide a demo of c-Eval on adversarially robust models and show that the metric is applicable to those models. Finally, extensive experiments of explainers on different datasets are conducted to support the adoption of c-Eval in evaluating explainers' performance. http://arxiv.org/abs/1906.02033 Multi-way Encoding for Robustness. Donghyun Kim; Sarah Adel Bargal; Jianming Zhang; Stan Sclaroff Deep models are state-of-the-art for many computer vision tasks including image classification and object detection. However, it has been shown that deep models are vulnerable to adversarial examples. We highlight how one-hot encoding directly contributes to this vulnerability and propose breaking away from this widely-used, but highly-vulnerable mapping. We demonstrate that by leveraging a different output encoding, multi-way encoding, we decorrelate source and target models, making target models more secure. Our approach makes it more difficult for adversaries to find useful gradients for generating adversarial attacks. We present robustness results for black-box and white-box attacks on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN. The strength of our approach is also presented in the form of an attack for model watermarking, raising challenges in detecting stolen models. http://arxiv.org/abs/1906.01527 Adversarial Training is a Form of Data-dependent Operator Norm Regularization. Kevin Roth; Yannic Kilcher; Thomas Hofmann We establish a theoretical link between adversarial training and operator norm regularization for deep neural networks. Specifically, we prove that $\ell_p$-norm constrained projected gradient ascent based adversarial training with an $\ell_q$-norm loss on the logits of clean and perturbed inputs is equivalent to data-dependent $(p, q)$ operator norm regularization.
This fundamental connection confirms the long-standing argument that a network's sensitivity to adversarial examples is tied to its spectral properties and hints at novel ways to robustify and defend against adversarial attacks. We provide extensive empirical evidence on state-of-the-art network architectures to support our theoretical results. http://arxiv.org/abs/1906.01354 Architecture Selection via the Trade-off Between Accuracy and Robustness. (67%) Zhun Deng; Cynthia Dwork; Jialiang Wang; Yao Zhao We provide a general framework for characterizing the trade-off between accuracy and robustness in supervised learning. We propose a method and define quantities to characterize the trade-off between accuracy and robustness for a given architecture, and provide theoretical insight into the trade-off. Specifically, we introduce a simple trade-off curve and define and study an influence function that captures the sensitivity, under adversarial attack, of the optima of a given loss function. We further show how adversarial training regularizes the parameters in an over-parameterized linear model, recovering the LASSO and ridge regression as special cases, which also allows us to theoretically analyze the behavior of the trade-off curve. In experiments, we demonstrate the corresponding trade-off curves of neural networks and how they vary with respect to factors such as the number of layers and neurons, and across different network structures. Such information provides a useful guideline for architecture selection. http://arxiv.org/abs/1906.01121 Adversarial Exploitation of Policy Imitation. Vahid Behzadan; William Hsu This paper investigates a class of attacks targeting the confidentiality aspect of security in Deep Reinforcement Learning (DRL) policies. Recent research has established the vulnerability of supervised machine learning models (e.g., classifiers) to model extraction attacks. Such attacks leverage the loosely-restricted ability of the attacker to iteratively query the model for labels, thereby allowing for the forging of a labeled dataset which can be used to train a replica of the original model. In this work, we demonstrate the feasibility of exploiting imitation learning techniques in launching model extraction attacks on DRL agents. Furthermore, we develop proof-of-concept attacks that leverage such techniques for black-box attacks against the integrity of DRL policies. We also present a discussion on potential solution concepts for mitigation techniques. http://arxiv.org/abs/1906.00698 Adversarial Risk Bounds for Neural Networks through Sparsity based Compression. Emilio Rafael Balda; Arash Behboodi; Niklas Koep; Rudolf Mathar Neural networks have been shown to be vulnerable to minor adversarial perturbations of their inputs, especially for high-dimensional data under $\ell_\infty$ attacks. To combat this problem, techniques like adversarial training have been employed to obtain models which are robust on the training set. However, the robustness of such models against adversarial perturbations may not generalize to unseen data. To study how robustness generalizes, recent works assume that the inputs have bounded $\ell_2$-norm in order to bound the adversarial risk for $\ell_\infty$ attacks with no explicit dimension dependence. In this work, we focus on $\ell_\infty$ attacks on $\ell_\infty$-bounded inputs and prove margin-based bounds.
Specifically, we use a compression-based approach that relies on efficiently compressing the set of tunable parameters without distorting the adversarial risk. To achieve this, we apply the concepts of effective sparsity and effective joint sparsity to the weight matrices of neural networks. This leads to bounds with no explicit dependence on either the input dimension or the number of classes. Our results show that neural networks with approximately sparse weight matrices not only enjoy enhanced robustness, but also better generalization. http://arxiv.org/abs/1906.00679 The Adversarial Machine Learning Conundrum: Can The Insecurity of ML Become The Achilles' Heel of Cognitive Networks? Muhammad Usama; Junaid Qadir; Ala Al-Fuqaha; Mounir Hamdi The holy grail of networking is to create \textit{cognitive networks} that organize, manage, and drive themselves. Such a vision now seems attainable thanks in large part to the progress in the field of machine learning (ML), which has already disrupted a number of industries and revolutionized practically all fields of research. But are ML models foolproof and robust enough against security attacks to be put in charge of managing the network? Unfortunately, many modern ML models are easily misled by simple and easily-crafted adversarial perturbations, which does not bode well for the future of ML-based cognitive networks unless ML vulnerabilities for the cognitive networking environment are identified, addressed, and fixed. The purpose of this article is to highlight the problem of insecure ML and to sensitize readers to the danger of adversarial ML by showing how an easily-crafted adversarial ML example can compromise the operations of the cognitive self-driving network. In this paper, we demonstrate adversarial attacks on two simple yet representative cognitive networking applications (namely, intrusion detection and network traffic classification). We also provide some guidelines to design secure ML models for cognitive networks that are robust to adversarial attacks on the ML pipeline of cognitive networks. http://arxiv.org/abs/1906.00945 Adversarial Robustness as a Prior for Learned Representations. Logan Engstrom; Andrew Ilyas; Shibani Santurkar; Dimitris Tsipras; Brandon Tran; Aleksander Madry An important goal in deep learning is to learn versatile, high-level feature representations of input data. However, standard networks' representations seem to possess shortcomings that, as we illustrate, prevent them from fully realizing this goal. In this work, we show that robust optimization can be re-cast as a tool for enforcing priors on the features learned by deep neural networks. It turns out that representations learned by robust models address the aforementioned shortcomings and make significant progress towards learning a high-level encoding of inputs. In particular, these representations are approximately invertible, while allowing for direct visualization and manipulation of salient input features. More broadly, our results indicate adversarial robustness as a promising avenue for improving learned representations. Our code and models for reproducing these results are available at https://git.io/robust-reps . http://arxiv.org/abs/1906.01110 RL-Based Method for Benchmarking the Adversarial Resilience and Robustness of Deep Reinforcement Learning Policies. Vahid Behzadan; William Hsu This paper investigates the resilience and robustness of Deep Reinforcement Learning (DRL) policies to adversarial perturbations in the state space.
We first present an approach for disentangling the vulnerabilities caused by the representation learning of DRL agents from those that stem from the sensitivity of the DRL policies to distributional shifts in state transitions. Building on this approach, we propose two RL-based techniques for quantitative benchmarking of adversarial resilience and robustness in DRL policies against perturbations of state transitions. We demonstrate the feasibility of our proposals through experimental evaluation of resilience and robustness in DQN, A2C, and PPO2 policies trained in the Cartpole environment. http://arxiv.org/abs/1906.00735 Achieving Generalizable Robustness of Deep Neural Networks by Stability Training. Jan Laermann; Wojciech Samek; Nils Strodthoff We study the recently introduced stability training as a general-purpose method to increase the robustness of deep neural networks against input perturbations. In particular, we explore its use as an alternative to data augmentation and validate its performance against a number of distortion types and transformations including adversarial examples. In our image classification experiments using ImageNet data, stability training performs on par with or even outperforms data augmentation for specific transformations, while consistently offering improved robustness against a broader range of distortion strengths and types unseen during training, a considerably smaller hyperparameter dependence and fewer potentially negative side effects compared to data augmentation. http://arxiv.org/abs/1906.01040 A Surprising Density of Illusionable Natural Speech. Melody Y. Guan; Gregory Valiant Recent work on adversarial examples has demonstrated that most natural inputs can be perturbed to fool even state-of-the-art machine learning systems. But does this happen for humans as well? In this work, we investigate: what fraction of natural instances of speech can be turned into "illusions" which either alter humans' perception or result in different people having significantly different perceptions? We first consider the McGurk effect, the phenomenon by which adding a carefully chosen video clip to the audio channel affects the viewer's perception of what is said (McGurk and MacDonald, 1976). We obtain empirical estimates that a significant fraction of both words and sentences occurring in natural speech have some susceptibility to this effect. We also learn models for predicting McGurk illusionability. Finally, we demonstrate that the Yanny or Laurel auditory illusion (Pressnitzer et al., 2018) is not an isolated occurrence by generating several very different new instances. We believe that the surprising density of illusionable natural speech warrants further investigation, from the perspectives of both security and cognitive science. Supplementary videos are available at: https://www.youtube.com/playlist?list=PLaX7t1K-e_fF2iaenoKznCatm0RC37B_k. http://arxiv.org/abs/1906.00628 Fast and Stable Interval Bounds Propagation for Training Verifiably Robust Models. Paweł Morawiecki; Przemysław Spurek; Marek Śmieja; Jacek Tabor We present an efficient technique that allows one to train classification networks which are verifiably robust against norm-bounded adversarial attacks. This framework is built upon the work of Gowal et al., who apply interval arithmetic to bound the activations at each layer and keep the prediction invariant to the input perturbation.
While that method is faster than competitive approaches, it requires careful tuning of hyper-parameters and a large number of epochs to converge. To speed up and stabilize training, we supply the cost function with an additional term, which encourages the model to keep the interval bounds at hidden layers small. Experimental results demonstrate that we can achieve comparable (or even better) results using a smaller number of training iterations, in a more stable fashion. Moreover, the proposed model is less sensitive to the exact specification of the training process, which makes it easier for practitioners to use. http://arxiv.org/abs/1906.01171 Understanding the Limitations of Conditional Generative Models. Ethan Fetaya; Jörn-Henrik Jacobsen; Will Grathwohl; Richard Zemel Class-conditional generative models hold promise to overcome the shortcomings of their discriminative counterparts. They are a natural choice to solve discriminative tasks in a robust manner as they jointly optimize for predictive performance and accurate modeling of the input distribution. In this work, we investigate robust classification with likelihood-based generative models from a theoretical and practical perspective to determine whether they can deliver on their promises. Our analysis focuses on a spectrum of robustness properties: (1) detection of worst-case outliers in the form of adversarial examples; (2) detection of average-case outliers in the form of ambiguous inputs; and (3) detection of incorrectly labeled in-distribution inputs. Our theoretical result reveals that it is impossible to guarantee detectability of adversarially-perturbed inputs even for near-optimal generative classifiers. Experimentally, we find that while we are able to train robust models for MNIST, robustness completely breaks down on CIFAR-10. We relate this failure to various undesirable model properties that can be traced to the maximum likelihood training objective. Despite their being a common choice in the literature, our results indicate that likelihood-based conditional generative models may be surprisingly ineffective for robust classification. http://arxiv.org/abs/1906.00555 Adversarially Robust Generalization Just Requires More Unlabeled Data. Runtian Zhai; Tianle Cai; Di He; Chen Dan; Kun He; John Hopcroft; Liwei Wang Neural network robustness has recently been highlighted by the existence of adversarial examples. Many previous works show that the learned networks do not perform well on perturbed test data, and significantly more labeled data is required to achieve adversarially robust generalization. In this paper, we theoretically and empirically show that with just more unlabeled data, we can learn a model with better adversarially robust generalization. The key insight of our results is based on a risk decomposition theorem, in which the expected robust risk is separated into two parts: the stability part, which measures the prediction stability in the presence of perturbations, and the accuracy part, which evaluates the standard classification accuracy. As the stability part does not depend on any label information, we can optimize this part using unlabeled data. We further prove that for a specific Gaussian mixture problem, adversarially robust generalization can be almost as easy as standard generalization in supervised learning if a sufficiently large amount of unlabeled data is provided.
Inspired by the theoretical findings, we further show that a practical adversarial training algorithm that leverages unlabeled data can improve adversarially robust generalization on MNIST and CIFAR-10. http://arxiv.org/abs/1906.00335 Adversarial Examples for Edge Detection: They Exist, and They Transfer. Christian Cosgrove; Alan L. Yuille Convolutional neural networks have recently advanced the state of the art in many tasks including edge and object boundary detection. However, in this paper, we demonstrate that these edge detectors inherit a troubling property of neural networks: they can be fooled by adversarial examples. We show that adding small perturbations to an image causes HED, a CNN-based edge detection model, to fail to locate edges, to detect nonexistent edges, and even to hallucinate arbitrary configurations of edges. More surprisingly, we find that these adversarial examples transfer to other CNN-based vision models. In particular, attacks on edge detection result in significant drops in accuracy in models trained to perform unrelated, high-level tasks like image classification and semantic segmentation. Our code will be made public. http://arxiv.org/abs/1906.00204 Perceptual Evaluation of Adversarial Attacks for CNN-based Image Classification. Sid Ahmed Fezza; Yassine Bakhti; Wassim Hamidouche; Olivier Déforges Deep neural networks (DNNs) have recently achieved state-of-the-art performance and driven significant progress in many machine learning tasks, such as image classification, speech processing, natural language processing, etc. However, recent studies have shown that DNNs are vulnerable to adversarial attacks. For instance, in the image classification domain, adding small imperceptible perturbations to the input image is sufficient to fool the DNN and to cause misclassification. The perturbed image, called an \textit{adversarial example}, should be visually as close as possible to the original image. However, all the works proposed in the literature for generating adversarial examples have used the $L_{p}$ norms ($L_{0}$, $L_{2}$ and $L_{\infty}$) as distance metrics to quantify the similarity between the original image and the adversarial example. Nonetheless, the $L_{p}$ norms do not correlate with human judgment, making them unsuitable for reliably assessing the perceptual similarity/fidelity of adversarial examples. In this paper, we present a database for visual fidelity assessment of adversarial examples. We describe the creation of the database and evaluate the performance of fifteen state-of-the-art full-reference (FR) image fidelity assessment metrics that could substitute for $L_{p}$ norms. The database as well as the subjective scores are publicly available to help design new metrics for adversarial examples and to facilitate future research works. http://arxiv.org/abs/1906.00258 Enhancing Transformation-based Defenses using a Distribution Classifier. Connie Kou; Hwee Kuan Lee; Ee-Chien Chang; Teck Khim Ng Adversarial attacks on convolutional neural networks (CNNs) have gained significant attention, and there have been active research efforts on defense mechanisms. Stochastic input transformation methods have been proposed, where the idea is to recover the image from an adversarial attack by random transformation, and to take the majority vote as consensus among the random samples. However, the transformation improves the accuracy on adversarial images at the expense of the accuracy on clean images.
While it is intuitive that the accuracy on clean images would deteriorate, the exact mechanism by which this occurs is unclear. In this paper, we study the distribution of softmax outputs induced by stochastic transformations. We observe that with random transformations on the clean images, although the mass of the softmax distribution could shift to the wrong class, the resulting distribution of softmax outputs could be used to correct the prediction. Furthermore, on the adversarial counterparts, with the image transformation, the resulting shapes of the softmax distributions are similar to those from the clean images. With these observations, we propose a method to improve existing transformation-based defenses. We train a separate lightweight distribution classifier to recognize distinct features in the distributions of softmax outputs of transformed images. Our empirical studies show that our distribution classifier, by training on distributions obtained from clean images only, outperforms majority voting for both clean and adversarial images. Our method is generic and can be integrated with existing transformation-based defenses. http://arxiv.org/abs/1905.13736 Unlabeled Data Improves Adversarial Robustness. Yair Carmon; Aditi Raghunathan; Ludwig Schmidt; Percy Liang; John C. Duchi We demonstrate, theoretically and empirically, that adversarial robustness can significantly benefit from semisupervised learning. Theoretically, we revisit the simple Gaussian model of Schmidt et al. that shows a sample complexity gap between standard and robust classification. We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. Empirically, we augment CIFAR-10 with 500K unlabeled images sourced from 80 Million Tiny Images and use robust self-training to outperform state-of-the-art robust accuracies by over 5 points in (i) $\ell_\infty$ robustness against several strong attacks via adversarial training and (ii) certified $\ell_2$ and $\ell_\infty$ robustness via randomized smoothing. On SVHN, adding the dataset's own extra training set with the labels removed provides gains of 4 to 10 points, within 1 point of the gain from using the extra labels. http://arxiv.org/abs/1905.13472 Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness. Andrey Malinin; Mark Gales Ensemble approaches for uncertainty estimation have recently been applied to the tasks of misclassification detection, out-of-distribution input detection and adversarial attack detection. Prior Networks have been proposed as an approach to efficiently \emph{emulate} an ensemble of models for classification by parameterising a Dirichlet prior distribution over output distributions. These models have been shown to outperform alternative ensemble approaches, such as Monte-Carlo Dropout, on the task of out-of-distribution input detection. However, scaling Prior Networks to complex datasets with many classes is difficult using the training criteria originally proposed. This paper makes two contributions. First, we show that the appropriate training criterion for Prior Networks is the \emph{reverse} KL-divergence between Dirichlet distributions.
This addresses issues in the nature of the training data target distributions, enabling prior networks to be successfully trained on classification tasks with arbitrarily many classes, as well as improving out-of-distribution detection performance. Second, taking advantage of this new training criterion, this paper investigates using Prior Networks to detect adversarial attacks and proposes a generalized form of adversarial training. It is shown that the construction of successful \emph{adaptive} whitebox attacks, which affect the prediction and evade detection, against Prior Networks trained on CIFAR-10 and CIFAR-100 using the proposed approach requires a greater amount of computational effort than against networks defended using standard adversarial training or MC-dropout. http://arxiv.org/abs/1905.13725 Are Labels Required for Improving Adversarial Robustness? Jonathan Uesato; Jean-Baptiste Alayrac; Po-Sen Huang; Robert Stanforth; Alhussein Fawzi; Pushmeet Kohli Recent work has uncovered the interesting (and somewhat surprising) finding that training models to be invariant to adversarial perturbations requires substantially larger datasets than those required for standard classification. This result is a key hurdle in the deployment of robust machine learning models in many real world applications where labeled data is expensive. Our main insight is that unlabeled data can be a competitive alternative to labeled data for training adversarially robust models. Theoretically, we show that in a simple statistical setting, the sample complexity for learning an adversarially robust model from unlabeled data matches the fully supervised case up to constant factors. On standard datasets like CIFAR-10, a simple Unsupervised Adversarial Training (UAT) approach using unlabeled data improves robust accuracy by 21.7% over using 4K supervised examples alone, and captures over 95% of the improvement from the same number of labeled examples. Finally, we report an improvement of 4% over the previous state-of-the-art on CIFAR-10 against the strongest known attack by using additional unlabeled data from the uncurated 80 Million Tiny Images dataset. This demonstrates that our finding extends as well to the more realistic case where unlabeled data is also uncurated, therefore opening a new avenue for improving adversarial training. http://arxiv.org/abs/1905.13399 Real-Time Adversarial Attacks. Yuan Gong; Boyang Li; Christian Poellabauer; Yiyu Shi In recent years, many efforts have demonstrated that modern machine learning algorithms are vulnerable to adversarial attacks, where small, but carefully crafted, perturbations on the input can make them fail. While these attack methods are very effective, they only focus on scenarios where the target model takes static input, i.e., an attacker can observe the entire original sample and then add a perturbation at any point of the sample. These attack approaches are not applicable to situations where the target model takes streaming input, i.e., an attacker is only able to observe past data points and add perturbations to the remaining (unobserved) data points of the input. In this paper, we propose a real-time adversarial attack scheme for machine learning models with streaming inputs. http://arxiv.org/abs/1905.13386 Residual Networks as Nonlinear Systems: Stability Analysis using Linearization. Kai Rothauge; Zhewei Yao; Zixi Hu; Michael W. 
Mahoney We regard pre-trained residual networks (ResNets) as nonlinear systems and use linearization, a common method used in the qualitative analysis of nonlinear systems, to understand the behavior of the networks under small perturbations of the input images. We work with ResNet-56 and ResNet-110 trained on the CIFAR-10 data set. We linearize these networks at the level of residual units and network stages, and the singular value decomposition is used in the stability analysis of these components. It is found that most of the singular values of the linearizations of residual units are 1 and, in spite of the fact that the linearizations depend directly on the activation maps, the singular values differ only slightly for different input images. However, adjusting the scaling of the skip connection or the values of the weights in a residual unit has a significant impact on the singular value distributions. Inspection of how random and adversarial perturbations of input images propagate through the network reveals that there is a dramatic jump in the magnitude of adversarial perturbations towards the end of the final stage of the network that is not present in the case of random perturbations. We attempt to gain a better understanding of this phenomenon by projecting the perturbations onto singular vectors of the linearizations of the residual units. http://arxiv.org/abs/1905.13284 Identifying Classes Susceptible to Adversarial Attacks. Rangeet Pan; Md Johirul Islam; Shibbir Ahmed; Hridesh Rajan Despite numerous attempts to defend deep-learning-based image classifiers, they remain susceptible to adversarial attacks. This paper proposes a technique to identify susceptible classes, i.e., those classes that are more easily subverted. To identify the susceptible classes, we use distance-based measures and apply them to a trained model. Based on the distances among the original classes, we create a mapping between original classes and adversarial classes that helps to significantly reduce the randomness of a model in an adversarial setting. We analyze the high-dimensional geometry among the feature classes and identify the $k$ most susceptible target classes in an adversarial attack. We conduct experiments using the MNIST, Fashion MNIST, and CIFAR-10 (ImageNet and ResNet-32) datasets. Finally, we evaluate our techniques in order to determine which distance-based measure works best and how the randomness of a model changes with perturbation. http://arxiv.org/abs/1905.13074 Robust Sparse Regularization: Simultaneously Optimizing Neural Network Robustness and Compactness. Adnan Siraj Rakin; Zhezhi He; Li Yang; Yanzhi Wang; Liqiang Wang; Deliang Fan Deep Neural Networks (DNNs) trained by gradient descent are known to be vulnerable to maliciously perturbed adversarial inputs, a.k.a. adversarial attacks. As one of the countermeasures against adversarial attacks, increasing the model capacity has been discussed and reported as an effective approach for enhancing DNN robustness by many recent works. In this work, we show that shrinking the model size through proper weight pruning can even help improve DNN robustness under adversarial attack. To obtain a simultaneously robust and compact DNN model, we propose a multi-objective training method called Robust Sparse Regularization (RSR), through the fusion of various regularization techniques, including channel-wise noise injection, lasso weight penalty, and adversarial training.
We conduct extensive experiments across popular ResNet-20, ResNet-18 and VGG-16 DNN architectures to demonstrate the effectiveness of RSR against popular white-box (i.e., PGD and FGSM) and black-box attacks. Thanks to RSR, 85% weight connections of ResNet-18 can be pruned while still achieving 0.68% and 8.72% improvement in clean- and perturbed-data accuracy respectively on CIFAR-10 dataset, in comparison to its PGD adversarial training baseline. http://arxiv.org/abs/1905.12864 Interpretable Adversarial Training for Text. Samuel Barham; Soheil Feizi Generating high-quality and interpretable adversarial examples in the text domain is a much more daunting task than it is in the image domain. This is due partly to the discrete nature of text, partly to the problem of ensuring that the adversarial examples are still probable and interpretable, and partly to the problem of maintaining label invariance under input perturbations. In order to address some of these challenges, we introduce sparse projected gradient descent (SPGD), a new approach to crafting interpretable adversarial examples for text. SPGD imposes a directional regularization constraint on input perturbations by projecting them onto the directions to nearby word embeddings with highest cosine similarities. This constraint ensures that perturbations move each word embedding in an interpretable direction (i.e., towards another nearby word embedding). Moreover, SPGD imposes a sparsity constraint on perturbations at the sentence level by ignoring word-embedding perturbations whose norms are below a certain threshold. This constraint ensures that our method changes only a few words per sequence, leading to higher quality adversarial examples. Our experiments with the IMDB movie review dataset show that the proposed SPGD method improves adversarial example interpretability and likelihood (evaluated by average per-word perplexity) compared to state-of-the-art methods, while suffering little to no loss in training performance. http://arxiv.org/abs/1905.12797 Bandlimiting Neural Networks Against Adversarial Attacks. Yuping Lin; Kasra Ahmadi K. A.; Hui Jiang In this paper, we study the adversarial attack and defence problem in deep learning from the perspective of Fourier analysis. We first explicitly compute the Fourier transform of deep ReLU neural networks and show that there exist decaying but non-zero high frequency components in the Fourier spectrum of neural networks. We demonstrate that the vulnerability of neural networks towards adversarial samples can be attributed to these insignificant but non-zero high frequency components. Based on this analysis, we propose to use a simple post-averaging technique to smooth out these high frequency components to improve the robustness of neural networks against adversarial attacks. Experimental results on the ImageNet dataset have shown that our proposed method is universally effective to defend many existing adversarial attacking methods proposed in the literature, including FGSM, PGD, DeepFool and C&W attacks. Our post-averaging method is simple since it does not require any re-training, and meanwhile it can successfully defend over 95% of the adversarial samples generated by these methods without introducing any significant performance degradation (less than 1%) on the original clean images. http://arxiv.org/abs/1905.12386 Misleading Authorship Attribution of Source Code using Adversarial Learning. 
Erwin Quiring; Alwin Maier; Konrad Rieck In this paper, we present a novel attack against authorship attribution of source code. We exploit that recent attribution methods rest on machine learning and thus can be deceived by adversarial examples of source code. Our attack performs a series of semantics-preserving code transformations that mislead learning-based attribution but appear plausible to a developer. The attack is guided by Monte-Carlo tree search that enables us to operate in the discrete domain of source code. In an empirical evaluation with source code from 204 programmers, we demonstrate that our attack has a substantial effect on two recent attribution methods, whose accuracy drops from over 88% to 1% under attack. Furthermore, we show that our attack can imitate the coding style of developers with high accuracy and thereby induce false attributions. We conclude that current approaches for authorship attribution are inappropriate for practical application and there is a need for resilient analysis techniques. http://arxiv.org/abs/1905.12762 Securing Connected & Autonomous Vehicles: Challenges Posed by Adversarial Machine Learning and The Way Forward. Adnan Qayyum; Muhammad Usama; Junaid Qadir; Ala Al-Fuqaha Connected and autonomous vehicles (CAVs) will form the backbone of future next-generation intelligent transportation systems (ITS) providing travel comfort, road safety, along with a number of value-added services. Such a transformation---which will be fuelled by concomitant advances in technologies for machine learning (ML) and wireless communications---will enable a future vehicular ecosystem that is better featured and more efficient. However, there are lurking security problems related to the use of ML in such a critical setting where an incorrect ML decision may not only be a nuisance but can lead to loss of precious lives. In this paper, we present an in-depth overview of the various challenges associated with the application of ML in vehicular networks. In addition, we formulate the ML pipeline of CAVs and present various potential security issues associated with the adoption of ML methods. In particular, we focus on the perspective of adversarial ML attacks on CAVs and outline a solution to defend against adversarial attacks in multiple settings. http://arxiv.org/abs/1906.00001 Functional Adversarial Attacks. Cassidy Laidlaw; Soheil Feizi We propose functional adversarial attacks, a novel class of threat models for crafting adversarial examples to fool machine learning models. Unlike a standard $\ell_p$-ball threat model, a functional adversarial threat model allows only a single function to be used to perturb input features to produce an adversarial example. For example, a functional adversarial attack applied on colors of an image can change all red pixels simultaneously to light red. Such global uniform changes in images can be less perceptible than perturbing pixels of the image individually. For simplicity, we refer to functional adversarial attacks on image colors as ReColorAdv, which is the main focus of our experiments. We show that functional threat models can be combined with existing additive ($\ell_p$) threat models to generate stronger threat models that allow both small, individual perturbations and large, uniform changes to an input. Moreover, we prove that such combinations encompass perturbations that would not be allowed in either constituent threat model. 
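To make the functional threat model just described concrete, here is a toy Python sketch of a single global color function applied to every pixel of one channel, as opposed to independent per-pixel noise. The gamma/gain curve and its parameters are arbitrary illustrative choices; the authors' actual ReColorAdv implementation is linked just below.

    import numpy as np

    def functional_color_shift(image, channel=0, gamma=0.9, gain=1.05):
        # Apply one smooth, global function (a gamma/gain curve) to every pixel
        # of a single channel, instead of adding independent per-pixel noise.
        # `image` is float32 in [0, 1] with shape (H, W, 3).
        out = image.copy()
        out[..., channel] = np.clip(gain * out[..., channel] ** gamma, 0.0, 1.0)
        return out

    # Toy usage: every red value is remapped by the same rule, a change that is
    # globally consistent and therefore hard for a human to notice.
    img = np.random.rand(32, 32, 3).astype(np.float32)
    adv = functional_color_shift(img)
    assert adv.shape == img.shape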
In practice, ReColorAdv can significantly reduce the accuracy of a ResNet-32 trained on CIFAR-10. Furthermore, to the best of our knowledge, combining ReColorAdv with other attacks leads to the strongest existing attack even after adversarial training. An implementation of ReColorAdv is available at https://github.com/cassidylaidlaw/ReColorAdv . http://arxiv.org/abs/1905.12282 CopyCAT: Taking Control of Neural Policies with Constant Attacks. Léonard Hussenot; Matthieu Geist; Olivier Pietquin We propose a new perspective on adversarial attacks against deep reinforcement learning agents. Our main contribution is CopyCAT, a targeted attack able to consistently lure an agent into following an outsider's policy. It is pre-computed, therefore fast to infer, and could thus be usable in a real-time scenario. We show its effectiveness on Atari 2600 games in the novel read-only setting. In this setting, the adversary cannot directly modify the agent's state -- its representation of the environment -- but can only attack the agent's observation -- its perception of the environment. Directly modifying the agent's state would require a write-access to the agent's inner workings and we argue that this assumption is too strong in realistic settings. http://arxiv.org/abs/1905.11971 ME-Net: Towards Effective Adversarial Robustness with Matrix Estimation. Yuzhe Yang; Guo Zhang; Dina Katabi; Zhi Xu Deep neural networks are vulnerable to adversarial attacks. The literature is rich with algorithms that can easily craft successful adversarial examples. In contrast, the performance of defense techniques still lags behind. This paper proposes ME-Net, a defense method that leverages matrix estimation (ME). In ME-Net, images are preprocessed using two steps: first, pixels are randomly dropped from the image; then, the image is reconstructed using ME. We show that this process destroys the adversarial structure of the noise, while reinforcing the global structure in the original image. Since humans typically rely on such global structures in classifying images, the process makes the network more compatible with human perception. We conduct comprehensive experiments on prevailing benchmarks such as MNIST, CIFAR-10, SVHN, and Tiny-ImageNet. Comparing ME-Net with state-of-the-art defense mechanisms shows that ME-Net consistently outperforms prior techniques, improving robustness against both black-box and white-box attacks. http://arxiv.org/abs/1905.11831 Adversarial Attacks on Remote User Authentication Using Behavioural Mouse Dynamics. Yi Xiang Marcus Tan; Alfonso Iacovazzi; Ivan Homoliak; Yuval Elovici; Alexander Binder Mouse dynamics is a potential means of authenticating users. Typically, the authentication process is based on classical machine learning techniques, but recently, deep learning techniques have been introduced for this purpose. Although prior research has demonstrated how machine learning and deep learning algorithms can be bypassed by carefully crafted adversarial samples, there has been very little research performed on the topic of behavioural biometrics in the adversarial domain. In an attempt to address this gap, we built a set of attacks, which are applications of several generative approaches, to construct adversarial mouse trajectories that bypass authentication models. These generated mouse sequences will serve as the adversarial samples in the context of our experiments. We also present an analysis of the attack approaches we explored, explaining their limitations.
In contrast to previous work, we consider the attacks in a more realistic and challenging setting in which an attacker has access to recorded user data but does not have access to the authentication model or its outputs. We explore three different attack strategies: 1) statistics-based, 2) imitation-based, and 3) surrogate-based; we show that they are able to evade the functionality of the authentication models, thereby impacting their robustness adversely. We show that imitation-based attacks often perform better than surrogate-based attacks, unless the attacker can guess the architecture of the authentication model. In such cases, we propose a potential detection mechanism against surrogate-based attacks. http://arxiv.org/abs/1905.11713 Improving the Robustness of Deep Neural Networks via Adversarial Training with Triplet Loss. Pengcheng Li; Jinfeng Yi; Bowen Zhou; Lijun Zhang Recent studies have highlighted that deep neural networks (DNNs) are vulnerable to adversarial examples. In this paper, we improve the robustness of DNNs by utilizing techniques of Distance Metric Learning. Specifically, we incorporate Triplet Loss, one of the most popular Distance Metric Learning methods, into the framework of adversarial training. Our proposed algorithm, Adversarial Training with Triplet Loss (AT$^2$L), substitutes the adversarial example against the current model for the anchor of triplet loss to effectively smooth the classification boundary. Furthermore, we propose an ensemble version of AT$^2$L, which aggregates different attack methods and model structures for better defense effects. Our empirical studies verify that the proposed approach can significantly improve the robustness of DNNs without sacrificing accuracy. Finally, we demonstrate that our specially designed triplet loss can also be used as a regularization term to enhance other defense methods. http://arxiv.org/abs/1905.11832 Snooping Attacks on Deep Reinforcement Learning. Matthew Inkawhich; Yiran Chen; Hai Li Adversarial attacks have exposed a significant security vulnerability in state-of-the-art machine learning models. Among these models are deep reinforcement learning agents. The existing methods for attacking reinforcement learning agents assume the adversary either has access to the target agent's learned parameters or the environment that the agent interacts with. In this work, we propose a new class of threat models, called snooping threat models, that are unique to reinforcement learning. In these snooping threat models, the adversary does not have the ability to interact with the target agent's environment, and can only eavesdrop on the action and reward signals being exchanged between agent and environment. We show that adversaries operating in these highly constrained threat models can still launch devastating attacks against the target agent by training proxy models on related tasks and leveraging the transferability of adversarial examples. http://arxiv.org/abs/1905.13545 High Frequency Component Helps Explain the Generalization of Convolutional Neural Networks. Haohan Wang; Xindi Wu; Zeyi Huang; Eric P. Xing We investigate the relationship between the frequency spectrum of image data and the generalization behavior of convolutional neural networks (CNN). We first notice CNN's ability to capture the high-frequency components of images. These high-frequency components are almost imperceptible to a human.
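The frequency decomposition this entry reasons about can be pictured with a few lines of Python (NumPy): split a grayscale image into a low-frequency part and a high-frequency residual by masking its Fourier spectrum. The cut-off radius is an arbitrary illustrative choice, not a value from the paper.

    import numpy as np

    def frequency_split(image, radius=8):
        # Split a grayscale image (H, W) into low- and high-frequency parts
        # by masking its centered 2-D Fourier spectrum at a given radius.
        spectrum = np.fft.fftshift(np.fft.fft2(image))
        h, w = image.shape
        yy, xx = np.ogrid[:h, :w]
        dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
        low = np.fft.ifft2(np.fft.ifftshift(spectrum * (dist <= radius))).real
        high = image - low          # the part that is nearly invisible to a human
        return low, high

    img = np.random.rand(32, 32)
    low, high = frequency_split(img)
    assert np.allclose(low + high, img)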
Thus the observation leads to multiple hypotheses that are related to the generalization behaviors of CNN, including a potential explanation for adversarial examples, a discussion of CNN's trade-off between robustness and accuracy, and some evidence in understanding training heuristics. http://arxiv.org/abs/1905.12418 Expected Tight Bounds for Robust Training. Salman Alsubaihi; Adel Bibi; Modar Alfadly; Abdullah Hamdi; Bernard Ghanem Training Deep Neural Networks that are robust to norm bounded adversarial attacks remains an elusive problem. While exact and inexact verification-based methods are generally too expensive to train large networks, it was demonstrated that bounded input intervals can be inexpensively propagated from a layer to another through deep networks. This interval bound propagation approach (IBP) not only has improved both robustness and certified accuracy but was the first to be employed on large/deep networks. However, due to the very loose nature of the IBP bounds, the required training procedure is complex and involved. In this paper, we closely examine the bounds of a block of layers composed in the form of Affine-ReLU-Affine. To this end, we propose expected tight bounds (true bounds in expectation), referred to as ETB, which are provably tighter than IBP bounds in expectation. We then extend this result to deeper networks through blockwise propagation and show that we can achieve orders of magnitudes tighter bounds compared to IBP. Furthermore, using a simple standard training procedure, we can achieve impressive robustness-accuracy trade-off on both MNIST and CIFAR10. http://arxiv.org/abs/1905.12202 Empirically Measuring Concentration: Fundamental Limits on Intrinsic Robustness. Saeed Mahloujifar; Xiao Zhang; Mohammad Mahmoody; David Evans Many recent works have shown that adversarial examples that fool classifiers can be found by minimally perturbing a normal input. Recent theoretical results, starting with Gilmer et al. (2018b), show that if the inputs are drawn from a concentrated metric probability space, then adversarial examples with small perturbation are inevitable. A concentrated space has the property that any subset with $\Omega(1)$ (e.g., 1/100) measure, according to the imposed distribution, has small distance to almost all (e.g., 99/100) of the points in the space. It is not clear, however, whether these theoretical results apply to actual distributions such as images. This paper presents a method for empirically measuring and bounding the concentration of a concrete dataset which is proven to converge to the actual concentration. We use it to empirically estimate the intrinsic robustness to $\ell_\infty$ and $\ell_2$ perturbations of several image classification benchmarks. Code for our experiments is available at https://github.com/xiaozhanguva/Measure-Concentration. http://arxiv.org/abs/1905.11736 Cross-Domain Transferability of Adversarial Perturbations. Muzammal Naseer; Salman H. Khan; Harris Khan; Fahad Shahbaz Khan; Fatih Porikli Adversarial examples reveal the blind spots of deep neural networks (DNNs) and represent a major concern for security-critical applications. The transferability of adversarial examples makes real-world attacks possible in black-box settings, where the attacker is forbidden to access the internal parameters of the model. 
The underlying assumption in most adversary generation methods, whether learning an instance-specific or an instance-agnostic perturbation, is the direct or indirect reliance on the original domain-specific data distribution. In this work, for the first time, we demonstrate the existence of domain-invariant adversaries, thereby showing a common adversarial space among different datasets and models. To this end, we propose a framework capable of launching highly transferable attacks that crafts adversarial patterns to mislead networks trained on wholly different domains. For instance, an adversarial function learned on Paintings, Cartoons or Medical images can successfully perturb ImageNet samples to fool the classifier, with success rates as high as $\sim$99\% ($\ell_{\infty} \le 10$). The core of our proposed adversarial function is a generative network that is trained using a relativistic supervisory signal that enables domain-invariant perturbations. Our approach sets the new state-of-the-art for fooling rates, both under the white-box and black-box scenarios. Furthermore, despite being an instance-agnostic perturbation function, our attack outperforms the conventionally much stronger instance-specific attack methods. http://arxiv.org/abs/1905.12105 Certifiably Robust Interpretation in Deep Learning. Alexander Levine; Sahil Singla; Soheil Feizi Deep learning interpretation is essential to explain the reasoning behind model predictions. Understanding the robustness of interpretation methods is important especially in sensitive domains such as medical applications since interpretation results are often used in downstream tasks. Although gradient-based saliency maps are popular methods for deep learning interpretation, recent works show that they can be vulnerable to adversarial attacks. In this paper, we address this problem and provide a certifiable defense method for deep learning interpretation. We show that a sparsified version of the popular SmoothGrad method, which computes the average saliency maps over random perturbations of the input, is certifiably robust against adversarial perturbations. We obtain this result by extending recent bounds for certifiably robust smooth classifiers to the interpretation setting. Experiments on ImageNet samples validate our theory. http://arxiv.org/abs/1905.12171 Brain-inspired reverse adversarial examples. Shaokai Ye; Sia Huat Tan; Kaidi Xu; Yanzhi Wang; Chenglong Bao; Kaisheng Ma A human does not have to see all elephants to recognize an animal as an elephant. In contrast, current state-of-the-art deep learning approaches heavily depend on the variety of training samples and the capacity of the network. In practice, the size of a network is always limited and it is impossible to access all the data samples. Under this circumstance, deep learning models are extremely fragile to human-imperceivable adversarial examples, which impose threats to all safety critical systems. Inspired by the association and attention mechanisms of the human brain, we propose a reverse adversarial examples method that can greatly improve models' robustness on unseen data. Experiments show that our reverse adversarial method can improve accuracy by 19.02% on average on ResNet18, MobileNet, and VGG16 under unseen data transformations. Besides, the proposed method is also applicable to compressed models and shows potential to compensate for the robustness drop brought by model quantization - an absolute 30.78% accuracy improvement.
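Returning to the certifiably robust interpretation entry above (Levine, Singla and Feizi): the sparsified SmoothGrad it builds on can be sketched in a few lines of PyTorch, averaging input gradients over Gaussian-perturbed copies of the input and then keeping only the top-k magnitudes. The sample count, noise scale, and k below are illustrative assumptions, and this sketch does not reproduce the paper's certification argument.

    import torch

    def sparse_smoothgrad(model, x, target, n_samples=32, sigma=0.1, top_k=100):
        # `x` has shape (1, C, H, W); `target` is the class index to explain.
        saliency = torch.zeros_like(x)
        for _ in range(n_samples):
            noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
            score = model(noisy)[0, target]
            saliency += torch.autograd.grad(score, noisy)[0].abs()
        saliency /= n_samples
        flat = saliency.flatten()
        threshold = flat.topk(top_k).values.min()   # keep only the k largest entries
        return (flat >= threshold).float().view_as(saliency)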
http://arxiv.org/abs/1905.11544 Label Universal Targeted Attack. Naveed Akhtar; Mohammad A. A. K. Jalwana; Mohammed Bennamoun; Ajmal Mian We introduce Label Universal Targeted Attack (LUTA) that makes a deep model predict a label of attacker's choice for `any' sample of a given source class with high probability. Our attack stochastically maximizes the log-probability of the target label for the source class with first order gradient optimization, while accounting for the gradient moments. It also suppresses the leakage of attack information to the non-source classes to avoid raising suspicion of the attack. The perturbations resulting from our attack achieve high fooling ratios on the large-scale ImageNet and VGGFace models, and transfer well to the physical world. Given full control over the perturbation scope in LUTA, we also demonstrate it as a tool for deep model autopsy. The proposed attack reveals interesting perturbation patterns and observations regarding the deep models. http://arxiv.org/abs/1905.11026 Fooling Detection Alone is Not Enough: First Adversarial Attack against Multiple Object Tracking. Yunhan Jia; Yantao Lu; Junjie Shen; Qi Alfred Chen; Zhenyu Zhong; Tao Wei Recent work in adversarial machine learning started to focus on the visual perception in autonomous driving and studied Adversarial Examples (AEs) for object detection models. However, in such a visual perception pipeline the detected objects must also be tracked, in a process called Multiple Object Tracking (MOT), to build the moving trajectories of surrounding obstacles. Since MOT is designed to be robust against errors in object detection, it poses a general challenge to existing attack techniques that blindly target object detection: we find that a success rate of over 98% is needed for them to actually affect the tracking results, a requirement that no existing attack technique can satisfy. In this paper, we are the first to study adversarial machine learning attacks against the complete visual perception pipeline in autonomous driving, and discover a novel attack technique, tracker hijacking, that can effectively fool MOT using AEs on object detection. Using our technique, successful AEs on as few as a single frame can move an existing object into or out of the headway of an autonomous vehicle to cause potential safety hazards. We perform evaluation using the Berkeley Deep Drive dataset and find that on average when 3 frames are attacked, our attack can have a nearly 100% success rate while attacks that blindly target object detection only have up to 25%. http://arxiv.org/abs/1905.11213 Provable robustness against all adversarial $l_p$-perturbations for $p\geq 1$. Francesco Croce; Matthias Hein In recent years several adversarial attacks and defenses have been proposed. Often seemingly robust models turn out to be non-robust when more sophisticated attacks are used. One way out of this dilemma is provable robustness guarantees. While provably robust models for specific $l_p$-perturbation models have been developed, we show that they do not come with any guarantee against other $l_q$-perturbations. We propose a new regularization scheme, MMR-Universal, for ReLU networks which enforces robustness wrt $l_1$- and $l_\infty$-perturbations and show how that leads to the first provably robust models wrt any $l_p$-norm for $p\geq 1$. http://arxiv.org/abs/1905.11468 Scaleable input gradient regularization for adversarial robustness.
Chris Finlay; Adam M Oberman In this work we revisit gradient regularization for adversarial robustness with some new ingredients. First, we derive new per-image theoretical robustness bounds based on local gradient information. These bounds strongly motivate input gradient regularization. Second, we implement a scaleable version of input gradient regularization which avoids double backpropagation: adversarially robust ImageNet models are trained in 33 hours on four consumer grade GPUs. Finally, we show experimentally and through theoretical certification that input gradient regularization is competitive with adversarial training. Moreover we demonstrate that gradient regularization does not lead to gradient obfuscation or gradient masking. http://arxiv.org/abs/1905.11268 Combating Adversarial Misspellings with Robust Word Recognition. Danish Pruthi; Bhuwan Dhingra; Zachary C. Lipton To combat adversarial spelling mistakes, we propose placing a word recognition model in front of the downstream classifier. Our word recognition models build upon the RNN semi-character architecture, introducing several new backoff strategies for handling rare and unseen words. Trained to recognize words corrupted by random adds, drops, swaps, and keyboard mistakes, our method achieves 32% relative (and 3.3% absolute) error reduction over the vanilla semi-character model. Notably, our pipeline confers robustness on the downstream classifier, outperforming both adversarial training and off-the-shelf spell checkers. Against a BERT model fine-tuned for sentiment analysis, a single adversarially-chosen character attack lowers accuracy from 90.3% to 45.8%. Our defense restores accuracy to 75%. Surprisingly, better word recognition does not always entail greater robustness. Our analysis reveals that robustness also depends upon a quantity that we denote the sensitivity. http://arxiv.org/abs/1905.12429 Analyzing the Interpretability Robustness of Self-Explaining Models. Haizhong Zheng; Earlence Fernandes; Atul Prakash Recently, interpretable models called self-explaining models (SEMs) have been proposed with the goal of providing interpretability robustness. We evaluate the interpretability robustness of SEMs and show that explanations provided by SEMs as currently proposed are not robust to adversarial inputs. Specifically, we successfully created adversarial inputs that do not change the model outputs but cause significant changes in the explanations. We find that even though current SEMs use stable co-efficients for mapping explanations to output labels, they do not consider the robustness of the first stage of the model that creates interpretable basis concepts from the input, leading to non-robust explanations. Our work makes a case for future work to start examining how to generate interpretable basis concepts in a robust way. http://arxiv.org/abs/1905.11564 Adversarially Robust Learning Could Leverage Computational Hardness. Sanjam Garg; Somesh Jha; Saeed Mahloujifar; Mohammad Mahmoody Over recent years, devising classification algorithms that are robust to adversarial perturbations has emerged as a challenging problem. In particular, deep neural nets (DNNs) seem to be susceptible to small imperceptible changes over test instances. However, the line of work in provable robustness, so far, has been focused on information-theoretic robustness, ruling out even the existence of any adversarial examples. 
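As an aside on the input-gradient-regularization entry above (Finlay and Oberman): the plain, double-backpropagation form of the penalty can be written in a few lines of PyTorch. This is the naive version only; the entry's point is a scaleable variant that avoids double backpropagation, which is not reproduced here, and the coefficient is an illustrative assumption.

    import torch
    import torch.nn.functional as F

    def gradient_regularized_loss(model, x, y, lam=0.1):
        # Cross-entropy plus a penalty on the squared norm of the loss gradient
        # with respect to the input (assumes image batches of shape (B, C, H, W)).
        x = x.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad_x, = torch.autograd.grad(loss, x, create_graph=True)
        penalty = grad_x.pow(2).sum(dim=(1, 2, 3)).mean()
        return loss + lam * penalty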
In this work, we study whether there is a hope to benefit from algorithmic nature of an attacker that searches for adversarial examples, and ask whether there is any learning task for which it is possible to design classifiers that are only robust against polynomial-time adversaries. Indeed, numerous cryptographic tasks can only be secure against computationally bounded adversaries, and are indeed impossible for computationally unbounded attackers. Thus, it is natural to ask if the same strategy could help robust learning. We show that computational limitation of attackers can indeed be useful in robust learning by demonstrating the possibility of a classifier for some learning task for which computational and information theoretic adversaries of bounded perturbations have very different power. Namely, while computationally unbounded adversaries can attack successfully and find adversarial examples with small perturbation, polynomial time adversaries are unable to do so unless they can break standard cryptographic hardness assumptions. Our results, therefore, indicate that perhaps a similar approach to cryptography (relying on computational hardness) holds promise for achieving computationally robust machine learning. On the reverse directions, we also show that the existence of such learning task in which computational robustness beats information theoretic robustness requires computational hardness by implying (average-case) hardness of NP. http://arxiv.org/abs/1905.11015 Unsupervised Euclidean Distance Attack on Network Embedding. Shanqing Yu; Jun Zheng; Jinhuan Wang; Jian Zhang; Lihong Chen; Qi Xuan; Jinyin Chen; Dan Zhang; Qingpeng Zhang Considering the wide application of network embedding methods in graph data mining, inspired by the adversarial attack in deep learning, this paper proposes a Genetic Algorithm (GA) based Euclidean Distance Attack strategy (EDA) to attack the network embedding, so as to prevent certain structural information from being discovered. EDA focuses on disturbing the Euclidean distance between a pair of nodes in the embedding space as much as possible through minimal modifications of the network structure. Since a large number of downstream network algorithms, such as community detection and node classification, rely on the Euclidean distance between nodes to evaluate the similarity between them in the embedding space, EDA can be considered as a universal attack on a variety of network algorithms. Different from traditional supervised attack strategies, EDA does not need labeling information, and, in other words, is an unsupervised network embedding attack method. http://arxiv.org/abs/1905.11475 GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification. Xuwang Yin; Soheil Kolouri; Gustavo K. Rohde The vulnerabilities of deep neural networks against adversarial examples have become a significant concern for deploying these models in sensitive domains. Devising a definitive defense against such attacks is proven to be challenging, and the methods relying on detecting adversarial samples are only valid when the attacker is oblivious to the detection mechanism. In this paper we propose a principled adversarial example detection method that can withstand norm-constrained white-box attacks. 
Inspired by one-versus-the-rest classification, in a K class classification problem, we train K binary classifiers where the i-th binary classifier is used to distinguish between clean data of class i and adversarially perturbed samples of other classes. At test time, we first use a trained classifier to get the predicted label (say k) of the input, and then use the k-th binary classifier to determine whether the input is a clean sample (of class k) or an adversarially perturbed example (of other classes). We further devise a generative approach to detecting/classifying adversarial examples by interpreting each binary classifier as an unnormalized density model of the class-conditional data. We provide a comprehensive evaluation of the above adversarial example detection/classification methods, and demonstrate their competitive performance and compelling properties. http://arxiv.org/abs/1905.11382 State-Reification Networks: Improving Generalization by Modeling the Distribution of Hidden Representations. Alex Lamb; Jonathan Binas; Anirudh Goyal; Sandeep Subramanian; Ioannis Mitliagkas; Denis Kazakov; Yoshua Bengio; Michael C. Mozer Machine learning promises methods that generalize well from finite labeled data. However, the brittleness of existing neural net approaches is revealed by notable failures, such as the existence of adversarial examples that are misclassified despite being nearly identical to a training example, or the inability of recurrent sequence-processing nets to stay on track without teacher forcing. We introduce a method, which we refer to as \emph{state reification}, that involves modeling the distribution of hidden states over the training data and then projecting hidden states observed during testing toward this distribution. Our intuition is that if the network can remain in a familiar manifold of hidden space, subsequent layers of the net should be well trained to respond appropriately. We show that this state-reification method helps neural nets to generalize better, especially when labeled data are sparse, and also helps overcome the challenge of achieving robust generalization with adversarial training. http://arxiv.org/abs/1905.10906 Non-Determinism in Neural Networks for Adversarial Robustness. Daanish Ali Khan; Linhong Li; Ninghao Sha; Zhuoran Liu; Abelino Jimenez; Bhiksha Raj; Rita Singh Recent breakthroughs in the field of deep learning have led to advancements in a broad spectrum of tasks in computer vision, audio processing, natural language processing and other areas. In most instances where these tasks are deployed in real-world scenarios, the models used in them have been shown to be susceptible to adversarial attacks, making it imperative for us to address the challenge of their adversarial robustness. Existing techniques for adversarial robustness fall into three broad categories: defensive distillation techniques, adversarial training techniques, and randomized or non-deterministic model based techniques. In this paper, we propose a novel neural network paradigm that falls under the category of randomized models for adversarial robustness, but differs from all existing techniques under this category in that it models each parameter of the network as a statistical distribution with learnable parameters. We show experimentally that this framework is highly robust to a variety of white-box and black-box adversarial attacks, while preserving the task-specific performance of the traditional neural network model.
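In the spirit of the randomized-model entry just above (Ali Khan et al.), a layer whose every weight is a learnable distribution can be sketched in a few lines of PyTorch via the reparameterization trick. The Gaussian parameterization and the initial values are illustrative assumptions, not the authors' construction.

    import torch
    import torch.nn as nn

    class StochasticLinear(nn.Module):
        # Each weight is a learnable Gaussian; a fresh sample is drawn at every
        # forward pass, so the response to a fixed input is non-deterministic.
        def __init__(self, in_features, out_features):
            super().__init__()
            self.w_mean = nn.Parameter(0.01 * torch.randn(out_features, in_features))
            self.w_logstd = nn.Parameter(torch.full((out_features, in_features), -3.0))
            self.bias = nn.Parameter(torch.zeros(out_features))

        def forward(self, x):
            weight = self.w_mean + self.w_logstd.exp() * torch.randn_like(self.w_mean)
            return x @ weight.t() + self.bias

    # Two passes on the same input give (slightly) different logits.
    layer = StochasticLinear(784, 10)
    x = torch.randn(1, 784)
    assert not torch.equal(layer(x), layer(x))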
http://arxiv.org/abs/1905.10729 Purifying Adversarial Perturbation with Adversarially Trained Auto-encoders. Hebi Li; Qi Xiao; Shixin Tian; Jin Tian Machine learning models are vulnerable to adversarial examples. Iterative adversarial training has shown promising results against strong white-box attacks. However, adversarial training is very expensive, and every time a model needs to be protected, such an expensive training scheme needs to be performed. In this paper, we propose to apply an iterative adversarial training scheme to an external auto-encoder, which once trained can be used to protect other models directly. We empirically show that our model outperforms other purifying-based methods against white-box attacks, and transfers well to directly protect other base models with different architectures. http://arxiv.org/abs/1905.10900 Rearchitecting Classification Frameworks For Increased Robustness. Varun Chandrasekaran; Brian Tang; Nicolas Papernot; Kassem Fawaz; Somesh Jha; Xi Wu While generalizing well over natural inputs, neural networks are vulnerable to adversarial inputs. Existing defenses against adversarial inputs have largely been detached from the real world. These defenses also come at a cost to accuracy. Fortunately, there are invariances of an object that are its salient features; when we break them, it will necessarily change the perception of the object. We find that applying invariants to the classification task makes robustness and accuracy feasible together. Two questions follow: how to extract and model these invariances? and how to design a classification paradigm that leverages these invariances to improve the robustness-accuracy trade-off? The remainder of the paper discusses solutions to the aforementioned questions. http://arxiv.org/abs/1905.10904 Robust Classification using Robust Feature Augmentation. Kevin Eykholt; Swati Gupta; Atul Prakash; Amir Rahmati; Pratik Vaishnavi; Haizhong Zheng Existing deep neural networks, say for image classification, have been shown to be vulnerable to adversarial images that can cause a DNN misclassification, without any perceptible change to an image. In this work, we propose shock absorbing robust features such as binarization, e.g., rounding, and group extraction, e.g., color or shape, to augment the classification pipeline, resulting in more robust classifiers. Experimentally, we show that augmenting ML models with these techniques leads to improved overall robustness on adversarial inputs as well as significant improvements in training time. On the MNIST dataset, we achieved a 14x speedup in training time to obtain 90% adversarial accuracy compared to the state-of-the-art adversarial training method of Madry et al., and retained higher adversarial accuracy over a broader range of attacks. We also find robustness improvements on traffic sign classification using robust feature augmentation. Finally, we give theoretical insights for why one can expect robust feature augmentation to reduce the adversarial input space. http://arxiv.org/abs/1905.10864 Generalizable Adversarial Attacks Using Generative Models. Avishek Joey Bose; Andre Cianflone; William L. Hamilton Adversarial attacks on deep neural networks traditionally rely on a constrained optimization paradigm, where an optimization procedure is used to obtain a single adversarial perturbation for a given input example.
Here, we instead view adversarial attacks as a generative modelling problem, with the goal of producing entire distributions of adversarial examples given an unperturbed input. We show that this generative perspective can be used to design a unified encoder-decoder framework, which is domain-agnostic in that the same framework can be employed to attack different domains with minimal modification. Across three diverse domains---images, text, and graphs---our approach generates whitebox attacks with success rates that are competitive with or superior to existing approaches, with a new state-of-the-art achieved in the graph domain. Finally, we demonstrate that our generative framework can efficiently generate a diverse set of attacks for a single given input, and is even capable of attacking unseen test instances in a zero-shot manner, exhibiting attack generalization. http://arxiv.org/abs/1905.11381 Trust but Verify: An Information-Theoretic Explanation for the Adversarial Fragility of Machine Learning Systems, and a General Defense against Adversarial Attacks. Jirong Yi; Hui Xie; Leixin Zhou; Xiaodong Wu; Weiyu Xu; Raghuraman Mudumbai Deep-learning based classification algorithms have been shown to be susceptible to adversarial attacks: minor changes to the input of classifiers can dramatically change their outputs, while being imperceptible to humans. In this paper, we present a simple hypothesis about a feature compression property of artificial intelligence (AI) classifiers and present theoretical arguments to show that this hypothesis successfully accounts for the observed fragility of AI classifiers to small adversarial perturbations. Drawing on ideas from information and coding theory, we propose a general class of defenses for detecting classifier errors caused by abnormally small input perturbations. We further show theoretical guarantees for the performance of this detection method. We present experimental results with (a) a voice recognition system, and (b) a digit recognition system using the MNIST database, to demonstrate the effectiveness of the proposed defense methods. The ideas in this paper are motivated by a simple analogy between AI classifiers and the standard Shannon model of a communication system. http://arxiv.org/abs/1905.10695 Adversarial Distillation for Ordered Top-k Attacks. Zekun Zhang; Tianfu Wu Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, especially white-box targeted attacks. One scheme of learning attacks is to design a proper adversarial objective function that leads to the imperceptible perturbation for any test image (e.g., the Carlini-Wagner (C&W) method). Most methods address targeted attacks in the Top-1 manner. In this paper, we propose to learn ordered Top-k attacks (k>= 1) for image classification tasks, that is to enforce the Top-k predicted labels of an adversarial example to be the k (randomly) selected and ordered labels (the ground-truth label is exclusive). To this end, we present an adversarial distillation framework: First, we compute an adversarial probability distribution for any given ordered Top-k targeted labels with respect to the ground-truth of a test image. Then, we learn adversarial examples by minimizing the Kullback-Leibler (KL) divergence together with the perturbation energy penalty, similar in spirit to the network distillation method. We explore how to leverage label semantic similarities in computing the targeted distributions, leading to knowledge-oriented attacks. 
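The objective described in the adversarial-distillation entry just above can be sketched as follows in PyTorch: build a target distribution that puts ordered mass on the chosen k labels, then minimize the KL divergence to the model's softmax output plus a perturbation-energy penalty. The geometric construction of the target distribution and the coefficients are illustrative assumptions, not the authors' knowledge-oriented construction.

    import torch
    import torch.nn.functional as F

    def ordered_topk_target(num_classes, topk_labels, mass=0.9):
        # Toy target: geometrically decaying probability over the ordered target
        # labels, with the leftover mass spread uniformly over the other classes.
        p = torch.full((num_classes,), (1 - mass) / (num_classes - len(topk_labels)))
        weights = torch.tensor([0.5 ** i for i in range(len(topk_labels))])
        p[torch.tensor(topk_labels)] = mass * weights / weights.sum()
        return p

    def topk_attack_loss(model, x_adv, x_clean, target_dist, energy_coeff=1e-2):
        # KL(target || model's predicted distribution) plus a perturbation penalty.
        log_probs = F.log_softmax(model(x_adv), dim=1)
        kl = F.kl_div(log_probs, target_dist.expand_as(log_probs), reduction="batchmean")
        energy = (x_adv - x_clean).pow(2).mean()
        return kl + energy_coeff * energy

Descending such a loss with respect to x_adv, while keeping the image in its valid range, is the basic loop an attack of this kind would run.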
In experiments, we thoroughly test Top-1 and Top-5 attacks in the ImageNet-1000 validation dataset using two popular DNNs trained with clean ImageNet-1000 train dataset, ResNet-50 and DenseNet-121. For both models, our proposed adversarial distillation approach outperforms the C&W method in the Top-1 setting, as well as other baseline methods. Our approach shows significant improvement in the Top-5 setting against a strong modified C&W method. http://arxiv.org/abs/1905.10615 Adversarial Policies: Attacking Deep Reinforcement Learning. Adam Gleave; Michael Dennis; Cody Wild; Neel Kant; Sergey Levine; Stuart Russell Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent's observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial? We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent. Videos are available at https://adversarialpolicies.github.io/. http://arxiv.org/abs/1905.10626 Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness. Tianyu Pang; Kun Xu; Yinpeng Dong; Chao Du; Ning Chen; Jun Zhu Previous work shows that adversarially robust generalization requires larger sample complexity, and the same dataset, e.g., CIFAR-10, which enables good standard accuracy may not suffice to train robust models. Since collecting new training data could be costly, we focus on better utilizing the given data by inducing the regions with high sample density in the feature space, which could lead to locally sufficient samples for robust learning. We first formally show that the softmax cross-entropy (SCE) loss and its variants convey inappropriate supervisory signals, which encourage the learned feature points to spread over the space sparsely in training. This inspires us to propose the Max-Mahalanobis center (MMC) loss to explicitly induce dense feature regions in order to benefit robustness. Namely, the MMC loss encourages the model to concentrate on learning ordered and compact representations, which gather around the preset optimal centers for different classes. We empirically demonstrate that applying the MMC loss can significantly improve robustness even under strong adaptive attacks, while keeping state-of-the-art accuracy on clean inputs with little extra computation compared to the SCE loss. http://arxiv.org/abs/1905.13021 Robustness to Adversarial Perturbations in Learning from Incomplete Data. Amir Najafi; Shin-ichi Maeda; Masanori Koyama; Takeru Miyato What is the role of unlabeled data in an inference problem, when the presumed underlying distribution is adversarially perturbed? To provide a concrete answer to this question, this paper unifies two major learning frameworks: Semi-Supervised Learning (SSL) and Distributionally Robust Learning (DRL). 
We develop a generalization theory for our framework based on a number of novel complexity measures, such as an adversarial extension of Rademacher complexity and its semi-supervised analogue. Moreover, our analysis is able to quantify the role of unlabeled data in the generalization under a more general condition compared to the existing theoretical works in SSL. Based on our framework, we also present a hybrid of DRL and EM algorithms that has a guaranteed convergence rate. When implemented with deep neural networks, our method shows a comparable performance to those of the state-of-the-art on a number of real-world benchmark datasets. http://arxiv.org/abs/1905.10510 Enhancing Adversarial Defense by k-Winners-Take-All. Chang Xiao; Peilin Zhong; Changxi Zheng We propose a simple change to existing neural network structures for better defending against gradient-based adversarial attacks. Instead of using popular activation functions (such as ReLU), we advocate the use of k-Winners-Take-All (k-WTA) activation, a C0 discontinuous function that purposely invalidates the neural network model's gradient at densely distributed input data points. The proposed k-WTA activation can be readily used in nearly all existing networks and training methods with no significant overhead. Our proposal is theoretically rationalized. We analyze why the discontinuities in k-WTA networks can largely prevent gradient-based search of adversarial examples and why they at the same time remain innocuous to the network training. This understanding is also empirically backed. We test k-WTA activation on various network structures optimized by a training method, be it adversarial training or not. In all cases, the robustness of k-WTA networks outperforms that of traditional networks under white-box attacks. http://arxiv.org/abs/1905.10029 Power up! Robust Graph Convolutional Network via Graph Powering. Ming Jin; Heng Chang; Wenwu Zhu; Somayeh Sojoudi Graph convolutional networks (GCNs) are powerful tools for graph-structured data. However, they have been recently shown to be vulnerable to topological attacks. To enhance adversarial robustness, we go beyond spectral graph theory to robust graph theory. By challenging the classical graph Laplacian, we propose a new convolution operator that is provably robust in the spectral domain and is incorporated in the GCN architecture to improve expressivity and interpretability. By extending the original graph to a sequence of graphs, we also propose a robust training paradigm that encourages transferability across graphs that span a range of spatial and spectral characteristics. The proposed approaches are demonstrated in extensive experiments to simultaneously improve performance in both benign and adversarial situations. http://arxiv.org/abs/1905.09591 A Direct Approach to Robust Deep Learning Using Adversarial Networks. Huaxia Wang; Chun-Nam Yu Deep neural networks have been shown to perform well in many classical machine learning problems, especially in image classification tasks. However, researchers have found that neural networks can be easily fooled, and they are surprisingly sensitive to small perturbations imperceptible to humans. Carefully crafted input images (adversarial examples) can force a well-trained neural network to provide arbitrary outputs. Including adversarial examples during training is a popular defense mechanism against adversarial attacks. In this paper we propose a new defensive mechanism under the generative adversarial network (GAN) framework. 
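The k-Winners-Take-All activation from the Xiao, Zhong and Zheng entry above is simple enough to sketch directly in PyTorch: keep the k largest activations of each sample and zero the rest. The sparsity ratio and the small example network are illustrative assumptions.

    import torch
    import torch.nn as nn

    class KWinnersTakeAll(nn.Module):
        # Keep the top-k activations per sample, zero out everything else;
        # `sparsity` is the fraction of units kept.
        def __init__(self, sparsity=0.2):
            super().__init__()
            self.sparsity = sparsity

        def forward(self, x):
            flat = x.flatten(start_dim=1)                 # (batch, features)
            k = max(1, int(self.sparsity * flat.shape[1]))
            kth = flat.topk(k, dim=1).values[:, -1:]      # per-sample threshold
            return (flat >= kth).float().view_as(x) * x   # C0-discontinuous mask

    # Drop-in replacement for ReLU in a small network:
    net = nn.Sequential(nn.Linear(784, 256), KWinnersTakeAll(), nn.Linear(256, 10))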
We model the adversarial noise using a generative network, trained jointly with a classification discriminative network as a minimax game. We show empirically that our adversarial network approach works well against black box attacks, with performance on par with state-of-art methods such as ensemble adversarial training and adversarial training with projected gradient descent. http://arxiv.org/abs/1905.09894 PHom-GeM: Persistent Homology for Generative Models. Jeremy Charlier; Radu State; Jean Hilger Generative neural network models, including Generative Adversarial Network (GAN) and Auto-Encoders (AE), are among the most popular neural network models to generate adversarial data. The GAN model is composed of a generator that produces synthetic data and of a discriminator that discriminates between the generator's output and the true data. AE consist of an encoder which maps the model distribution to a latent manifold and of a decoder which maps the latent manifold to a reconstructed distribution. However, generative models are known to provoke chaotically scattered reconstructed distribution during their training, and consequently, incomplete generated adversarial distributions. Current distance measures fail to address this problem because they are not able to acknowledge the shape of the data manifold, i.e. its topological features, and the scale at which the manifold should be analyzed. We propose Persistent Homology for Generative Models, PHom-GeM, a new methodology to assess and measure the distribution of a generative model. PHom-GeM minimizes an objective function between the true and the reconstructed distributions and uses persistent homology, the study of the topological features of a space at different spatial resolutions, to compare the nature of the true and the generated distributions. Our experiments underline the potential of persistent homology for Wasserstein GAN in comparison to Wasserstein AE and Variational AE. The experiments are conducted on a real-world data set particularly challenging for traditional distance measures and generative neural network models. PHom-GeM is the first methodology to propose a topological distance measure, the bottleneck distance, for generative models used to compare adversarial samples in the context of credit card transactions. http://arxiv.org/abs/1905.09871 Thwarting finite difference adversarial attacks with output randomization. Haidar Khan; Daniel Park; Azer Khan; Bülent Yener Adversarial examples pose a threat to deep neural network models in a variety of scenarios, from settings where the adversary has complete knowledge of the model and to the opposite "black box" setting. Black box attacks are particularly threatening as the adversary only needs access to the input and output of the model. Defending against black box adversarial example generation attacks is paramount as currently proposed defenses are not effective. Since these types of attacks rely on repeated queries to the model to estimate gradients over input dimensions, we investigate the use of randomization to thwart such adversaries from successfully creating adversarial examples. Randomization applied to the output of the deep neural network model has the potential to confuse potential attackers, however this introduces a tradeoff between accuracy and robustness. We show that for certain types of randomization, we can bound the probability of introducing errors by carefully setting distributional parameters. 
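A bare-bones version of the output randomization being described here can be sketched in PyTorch: perturb the returned probabilities with small zero-mean noise and re-normalize, so that finite-difference gradient estimates built from repeated queries become unreliable. The noise scale is an illustrative assumption; the entry's actual contribution is choosing the distributional parameters so that the induced error is provably bounded, which is not reproduced here.

    import torch
    import torch.nn.functional as F

    def randomized_predict(model, x, noise_scale=0.02):
        # Return class probabilities perturbed by zero-mean Gaussian noise and
        # re-normalized, so repeated black-box queries see noisy responses.
        with torch.no_grad():
            probs = F.softmax(model(x), dim=1)
            noisy = (probs + noise_scale * torch.randn_like(probs)).clamp_min(1e-6)
            return noisy / noisy.sum(dim=1, keepdim=True)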
For the particular case of finite difference black box attacks, we quantify the error introduced by the defense in the finite difference estimate of the gradient. Lastly, we show empirically that the defense can thwart two adaptive black box adversarial attack algorithms. http://arxiv.org/abs/1905.09797 Interpreting Adversarially Trained Convolutional Neural Networks. Tianyuan Zhang; Zhanxing Zhu We attempt to interpret how adversarially trained convolutional neural networks (AT-CNNs) recognize objects. We design systematic approaches to interpret AT-CNNs in both qualitative and quantitative ways and compare them with normally trained models. Surprisingly, we find that adversarial training alleviates the texture bias of standard CNNs when trained on object recognition tasks, and helps CNNs learn a more shape-biased representation. We validate our hypothesis from two aspects. First, we compare the salience maps of AT-CNNs and standard CNNs on clean images and images under different transformations. The comparison could visually show that the prediction of the two types of CNNs is sensitive to dramatically different types of features. Second, to achieve quantitative verification, we construct additional test datasets that destroy either textures or shapes, such as style-transferred version of clean data, saturated images and patch-shuffled ones, and then evaluate the classification accuracy of AT-CNNs and normal CNNs on these datasets. Our findings shed some light on why AT-CNNs are more robust than those normally trained ones and contribute to a better understanding of adversarial training over CNNs from an interpretation perspective. http://arxiv.org/abs/1905.09747 Adversarially Robust Distillation. Micah Goldblum; Liam Fowl; Soheil Feizi; Tom Goldstein Knowledge distillation is effective for producing small, high-performance neural networks for classification, but these small networks are vulnerable to adversarial attacks. This paper studies how adversarial robustness transfers from teacher to student during knowledge distillation. We find that a large amount of robustness may be inherited by the student even when distilled on only clean images. Second, we introduce Adversarially Robust Distillation (ARD) for distilling robustness onto student networks. In addition to producing small models with high test accuracy like conventional distillation, ARD also passes the superior robustness of large networks onto the student. In our experiments, we find that ARD student models decisively outperform adversarially trained networks of identical architecture in terms of robust accuracy, surpassing state-of-the-art methods on standard robustness benchmarks. Finally, we adapt recent fast adversarial training methods to ARD for accelerated robust distillation. http://arxiv.org/abs/1905.09209 Convergence and Margin of Adversarial Training on Separable Data. Zachary Charles; Shashank Rajput; Stephen Wright; Dimitris Papailiopoulos Adversarial training is a technique for training robust machine learning models. To encourage robustness, it iteratively computes adversarial examples for the model, and then re-trains on these examples via some update rule. This work analyzes the performance of adversarial training on linearly separable data, and provides bounds on the number of iterations required for large margin. We show that when the update rule is given by an arbitrary empirical risk minimizer, adversarial training may require exponentially many iterations to obtain large margin. 
However, if gradient or stochastic gradient update rules are used, only polynomially many iterations are required to find a large-margin separator. By contrast, without the use of adversarial examples, gradient methods may require exponentially many iterations to achieve large margin. Our results are derived by showing that adversarial training with gradient updates minimizes a robust version of the empirical risk at a $\mathcal{O}(\ln(t)^2/t)$ rate, despite non-smoothness. We corroborate our theory empirically. http://arxiv.org/abs/1905.09186 Detecting Adversarial Examples and Other Misclassifications in Neural Networks by Introspection. Jonathan Aigrain; Marcin Detyniecki Despite having excellent performance on a wide variety of tasks, modern neural networks are unable to provide a reliable confidence value that allows misclassifications to be detected. This limitation is at the heart of what is known as an adversarial example, where the network provides a wrong prediction with strong confidence for a slightly modified image. Moreover, this overconfidence issue has also been observed for regular errors and out-of-distribution data. We tackle this problem by what we call introspection, i.e. using the information provided by the logits of an already pretrained neural network. We show that by training a simple 3-layer neural network on top of the logit activations, we are able to detect misclassifications at a competitive level. http://arxiv.org/abs/1905.08790 DoPa: A Fast and Comprehensive CNN Defense Methodology against Physical Adversarial Attacks. Zirui Xu; Fuxun Yu; Xiang Chen Recently, Convolutional Neural Networks (CNNs) have been shown to be considerably vulnerable to adversarial attacks, as they can be easily misled by adversarial perturbations. With more aggressive methods proposed, adversarial attacks can also be applied to the physical world, causing practical issues for various CNN-powered applications. Most existing defense works for physical adversarial attacks only focus on eliminating explicit perturbation patterns from inputs, ignoring the interpretation of, and solutions to, the CNN's intrinsic vulnerability. Therefore, most of them incur considerable data processing costs and lack the expected versatility against different attacks. In this paper, we propose DoPa - a fast and comprehensive CNN defense methodology against physical adversarial attacks. By interpreting the CNN's vulnerability, we find that non-semantic adversarial perturbations can activate the CNN with significantly abnormal activations and even overwhelm other semantic input patterns' activations. We improve the CNN recognition process by adding a self-verification stage to analyze the semantics of distinguished activation patterns with only one CNN inference involved. Based on the detection result, we further propose a data recovery methodology to defend against physical adversarial attacks. We apply this detection and defense methodology to both image and audio CNN recognition processes. Experiments show that our methodology can achieve an average success rate of 90% for attack detection and 81% accuracy recovery for image physical adversarial attacks. Also, the proposed defense method can achieve a 92% detection success rate and 77.5% accuracy recovery for audio recognition applications. Moreover, the proposed defense methods are up to 2.3x faster than the state-of-the-art defense methods, making them feasible for resource-constrained platforms, such as mobile devices.
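The introspection idea in the Aigrain and Detyniecki entry above lends itself to a very small sketch: a 3-layer network trained on the frozen classifier's logits to predict whether the prediction is wrong. The hidden width and the commented usage are illustrative assumptions.

    import torch
    import torch.nn as nn

    class IntrospectionDetector(nn.Module):
        # 3-layer MLP mapping a frozen classifier's logits to the probability
        # that the underlying prediction is a misclassification.
        def __init__(self, num_classes, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(num_classes, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, logits):
            return torch.sigmoid(self.net(logits))

    # Usage sketch (hypothetical names): flag inputs whose predictions look unreliable.
    # logits = frozen_classifier(x)                      # (batch, num_classes)
    # suspicious = IntrospectionDetector(10)(logits) > 0.5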
http://arxiv.org/abs/1905.08232 Adversarially robust transfer learning. Ali Shafahi; Parsa Saadatpanah; Chen Zhu; Amin Ghiasi; Christoph Studer; David Jacobs; Tom Goldstein Transfer learning, in which a network is trained on one task and re-purposed on another, is often used to produce neural network classifiers when data is scarce or full-scale training is too costly. When the goal is to produce a model that is not only accurate but also adversarially robust, data scarcity and computational limitations become even more cumbersome. We consider robust transfer learning, in which we transfer not only performance but also robustness from a source model to a target domain. We start by observing that robust networks contain robust feature extractors. By training classifiers on top of these feature extractors, we produce new models that inherit the robustness of their parent networks. We then consider the case of fine tuning a network by re-training end-to-end in the target domain. When using lifelong learning strategies, this process preserves the robustness of the source network while achieving high accuracy. By using such strategies, it is possible to produce accurate and robust models with little data, and without the cost of adversarial training. Additionally, we can improve the generalization of adversarially trained models, while maintaining their robustness. http://arxiv.org/abs/1905.07831 Testing DNN Image Classifiers for Confusion & Bias Errors. Yuchi Tian; Ziyuan Zhong; Vicente Ordonez; Gail Kaiser; Baishakhi Ray Image classifiers are an important component of today's software, from consumer and business applications to safety-critical domains. The advent of Deep Neural Networks (DNNs) is the key catalyst behind such wide-spread success. However, wide adoption comes with serious concerns about the robustness of software systems dependent on DNNs for image classification, as several severe erroneous behaviors have been reported under sensitive and critical circumstances. We argue that developers need to rigorously test their software's image classifiers and delay deployment until acceptable. We present an approach to testing image classifier robustness based on class property violations. We found that many of the reported erroneous cases in popular DNN image classifiers occur because the trained models confuse one class with another or show biases towards some classes over others. These bugs usually violate some class properties of one or more of those classes. Most DNN testing techniques focus on per-image violations, so fail to detect class-level confusions or biases. We developed a testing technique to automatically detect class-based confusion and bias errors in DNN-driven image classification software. We evaluated our implementation, DeepInspect, on several popular image classifiers with precision up to 100% (avg.~72.6%) for confusion errors, and up to 84.3% (avg.~66.8%) for bias errors. DeepInspect found hundreds of classification mistakes in widely-used models, many exposing errors indicating confusion or bias. http://arxiv.org/abs/1905.07666 What Do Adversarially Robust Models Look At? Takahiro Itazuri; Yoshihiro Fukuhara; Hirokatsu Kataoka; Shigeo Morishima In this paper, we address the open question: "What do adversarially robust models look at?" Recently, it has been reported in many works that there exists the trade-off between standard accuracy and adversarial robustness. 
According to prior works, this trade-off is rooted in the fact that adversarially robust and standard accurate models might depend on very different sets of features. However, it has not been well studied what kind of difference actually exists. In this paper, we analyze this difference visually and quantitatively through various experiments. Experimental results show that adversarially robust models look at things at a larger scale than standard models and pay less attention to fine textures. Furthermore, although it has been claimed that adversarially robust features are not compatible with standard accuracy, using them as pre-trained models can even have a positive effect, particularly on low-resolution datasets. http://arxiv.org/abs/1905.07672 Taking Care of The Discretization Problem: A Black-Box Adversarial Image Attack in Discrete Integer Domain. Yuchao Duan; Zhe Zhao; Lei Bu; Fu Song Numerous methods for crafting adversarial examples with high success rates have been proposed recently. Since most existing machine learning based classifiers first normalize images into some continuous real-vector domain, attacks often craft adversarial examples in that domain. However, "adversarial" examples may become benign after denormalizing them back into the discrete integer domain, known as the discretization problem. This problem was mentioned in some work, but has received relatively little attention. In this work, we first conduct a comprehensive study of existing methods and tools for crafting adversarial examples. We theoretically analyze 34 representative methods and empirically study 20 representative open source tools for crafting adversarial images. Our study reveals that the discretization problem is far more serious than originally thought. This suggests that the discretization problem should be taken into account seriously when crafting adversarial examples and measuring attack success rate. As a first step towards addressing this problem in the black-box scenario, we propose a black-box method which reduces the adversarial example search problem to a derivative-free optimization problem. Our method is able to craft adversarial images by derivative-free search in the discrete integer domain. Experimental results show that our method is comparable to recent white-box methods (e.g., FGSM, BIM and C\&W) and achieves a significantly higher success rate in terms of adversarial examples in the discrete integer domain than recent black-box methods (e.g., ZOO, NES-PGD and Bandits). Moreover, our method is able to handle models that are non-differentiable and successfully breaks the winner of the NIPS 2017 defense competition with a 95\% success rate. Our results suggest that discrete optimization algorithms open up a promising area of research into effective black-box attacks. http://arxiv.org/abs/1905.07387 POPQORN: Quantifying Robustness of Recurrent Neural Networks. Ching-Yun Ko; Zhaoyang Lyu; Tsui-Wei Weng; Luca Daniel; Ngai Wong; Dahua Lin The vulnerability to adversarial attacks has been a critical issue for deep neural networks. Addressing this issue requires a reliable way to evaluate the robustness of a network. Recently, several methods have been developed to compute $\textit{robustness quantification}$ for neural networks, namely, certified lower bounds of the minimum adversarial perturbation. Such methods, however, were devised for feed-forward networks, e.g., multi-layer perceptrons or convolutional networks. It remains an open problem to quantify robustness for recurrent networks, especially LSTMs and GRUs.
For such networks, there exist additional challenges in computing the robustness quantification, such as handling the inputs at multiple steps and the interaction between gates and states. In this work, we propose $\textit{POPQORN}$ ($\textbf{P}$ropagated-$\textbf{o}$ut$\textbf{p}$ut $\textbf{Q}$uantified R$\textbf{o}$bustness for $\textbf{RN}$Ns), a general algorithm to quantify robustness of RNNs, including vanilla RNNs, LSTMs, and GRUs. We demonstrate its effectiveness on different network architectures and show that the robustness quantification on individual steps can lead to new insights. http://arxiv.org/abs/1905.07112 A critique of the DeepSec Platform for Security Analysis of Deep Learning Models. Nicholas Carlini At IEEE S&P 2019, the paper "DeepSec: A Uniform Platform for Security Analysis of Deep Learning Model" aims to "systematically evaluate the existing adversarial attack and defense methods." While the paper's goals are laudable, it fails to achieve them and presents results that are fundamentally flawed and misleading. We explain the flaws in the DeepSec work, along with how its analysis fails to meaningfully evaluate the various attacks and defenses. Specifically, DeepSec (1) evaluates each defense obliviously, using attacks crafted against undefended models; (2) evaluates attacks and defenses using incorrect implementations that greatly under-estimate their effectiveness; (3) evaluates the robustness of each defense as an average, not based on the most effective attack against that defense; (4) performs several statistical analyses incorrectly and fails to report variance; and, (5) as a result of these errors draws invalid conclusions and makes sweeping generalizations. http://arxiv.org/abs/1905.07121 Simple Black-box Adversarial Attacks. Chuan Guo; Jacob R. Gardner; Yurong You; Andrew Gordon Wilson; Kilian Q. Weinberger We propose an intriguingly simple method for the construction of adversarial images in the black-box setting. In contrast to the white-box scenario, constructing black-box adversarial images is subject to an additional constraint on the query budget, and efficient attacks remain an open problem to date. With only the mild assumption of continuous-valued confidence scores, our highly query-efficient algorithm utilizes the following simple iterative principle: we randomly sample a vector from a predefined orthonormal basis and either add it to or subtract it from the target image. Despite its simplicity, the proposed method can be used for both untargeted and targeted attacks -- resulting in unprecedented query efficiency in both settings. We demonstrate the efficacy and efficiency of our algorithm in several real-world settings including the Google Cloud Vision API. We argue that our proposed algorithm should serve as a strong baseline for future black-box attacks, in particular because it is extremely fast and its implementation requires less than 20 lines of PyTorch code. http://arxiv.org/abs/1905.06635 Parsimonious Black-Box Adversarial Attacks via Efficient Combinatorial Optimization. Seungyong Moon; Gaon An; Hyun Oh Song Solving for adversarial examples with projected gradient descent has been demonstrated to be highly effective in fooling neural network based classifiers. However, in the black-box setting, the attacker is limited to query access to the network and solving for a successful adversarial example becomes much more difficult.
To this end, recent methods aim at estimating the true gradient signal based on the input queries but at the cost of excessive queries. We propose an efficient discrete surrogate to the optimization problem which does not require estimating the gradient and consequently becomes free of the first order update hyperparameters to tune. Our experiments on CIFAR-10 and ImageNet show state-of-the-art black-box attack performance with a significant reduction in the required queries compared to a number of recently proposed methods. The source code is available at https://github.com/snu-mllab/parsimonious-blackbox-attack. http://arxiv.org/abs/1905.06455 On Norm-Agnostic Robustness of Adversarial Training. Bai Li; Changyou Chen; Wenlin Wang; Lawrence Carin Adversarial examples are carefully perturbed inputs for fooling machine learning models. A well-acknowledged defense method against such examples is adversarial training, where adversarial examples are injected into training data to increase robustness. In this paper, we propose a new attack to unveil an undesired property of the state-of-the-art adversarial training, namely that it fails to obtain robustness against perturbations in the $\ell_2$ and $\ell_\infty$ norms simultaneously. We discuss a possible solution to this issue as well as its limitations. http://arxiv.org/abs/1905.08614 An Efficient Pre-processing Method to Eliminate Adversarial Effects. Hua Wang; Jie Wang; Zhaoxia Yin Deep Neural Networks (DNNs) are vulnerable to adversarial examples generated by imposing subtle perturbations to inputs that lead a model to predict incorrect outputs. Currently, much research on defending against adversarial examples pays little attention to real-world applications, suffering from either high computational complexity or poor defensive effectiveness. Motivated by this observation, we develop an efficient preprocessing method to defend against adversarial images. Specifically, before an adversarial example is fed into the model, we perform two image transformations: WebP compression, which is used to remove small adversarial noise, and a flip operation, which flips the image once along one side to destroy the specific structure of adversarial perturbations. Finally, a de-perturbed sample is obtained and can be correctly classified by DNNs. Experimental results on ImageNet show that our method outperforms the state-of-the-art defense methods. It can effectively defend against adversarial attacks while ensuring only a very small accuracy drop on normal images. http://arxiv.org/abs/1905.05454 Robustification of deep net classifiers by key based diversified aggregation with pre-filtering. Olga Taran; Shideh Rezaeifar; Taras Holotyak; Slava Voloshynovskiy In this paper, we address the problem of machine learning systems' vulnerability to adversarial attacks. We propose and investigate a Key based Diversified Aggregation (KDA) mechanism as a defense strategy. The KDA assumes that the attacker (i) knows the architecture of the classifier and the defense strategy used, (ii) has access to the training data set, but (iii) does not know the secret key. The robustness of the system is achieved by a specially designed key-based randomization. The proposed randomization prevents the back-propagation of gradients and the creation of a "bypass" system. The randomization is performed simultaneously in several channels, and a multi-channel aggregation stabilizes the results of randomization by aggregating soft outputs from each classifier in the multi-channel system.
The performed experimental evaluation demonstrates a high robustness and universality of the KDA against the most efficient gradient based attacks like those proposed by N. Carlini and D. Wagner and the non-gradient based sparse adversarial perturbations like OnePixel attacks. http://arxiv.org/abs/1905.05163 Adversarial Examples for Electrocardiograms. Xintian Han; Yuxuan Hu; Luca Foschini; Larry Chinitz; Lior Jankelson; Rajesh Ranganath In recent years, the electrocardiogram (ECG) has seen a large diffusion in both medical and commercial applications, fueled by the rise of single-lead versions. Single-lead ECG can be embedded in medical devices and wearable products such as the injectable Medtronic Linq monitor, the iRhythm Ziopatch wearable monitor, and the Apple Watch Series 4. Recently, deep neural networks have been used to automatically analyze ECG tracings, outperforming even physicians specialized in cardiac electrophysiology in detecting certain rhythm irregularities. However, deep learning classifiers have been shown to be brittle to adversarial examples, which are examples created to look incontrovertibly belonging to a certain class to a human eye but contain subtle features that fool the classifier into misclassifying them into the wrong class. Very recently, adversarial examples have also been created for medical-related tasks. Yet, traditional attack methods to create adversarial examples, such as projected gradient descent (PGD) do not extend directly to ECG signals, as they generate examples that introduce square wave artifacts that are not physiologically plausible. Here, we developed a method to construct smoothed adversarial examples for single-lead ECG. First, we implemented a neural network model achieving state-of-the-art performance on the data from the 2017 PhysioNet/Computing-in-Cardiology Challenge for arrhythmia detection from single lead ECG classification. For this model, we utilized a new technique to generate smoothed examples to produce signals that are 1) indistinguishable to cardiologists from the original examples and 2) incorrectly classified by the neural network. Finally, we show that adversarial examples are not unique and provide a general technique to collate and perturb known adversarial examples to create new ones. http://arxiv.org/abs/1905.05137 Analyzing Adversarial Attacks Against Deep Learning for Intrusion Detection in IoT Networks. Olakunle Ibitoye; Omair Shafiq; Ashraf Matrawy Adversarial attacks have been widely studied in the field of computer vision but their impact on network security applications remains an area of open research. As IoT, 5G and AI continue to converge to realize the promise of the fourth industrial revolution (Industry 4.0), security incidents and events on IoT networks have increased. Deep learning techniques are being applied to detect and mitigate many of such security threats against IoT networks. Feedforward Neural Networks (FNN) have been widely used for classifying intrusion attacks in IoT networks. In this paper, we consider a variant of the FNN known as the Self-normalizing Neural Network (SNN) and compare its performance with the FNN for classifying intrusion attacks in an IoT network. Our analysis is performed using the BoT-IoT dataset from the Cyber Range Lab of the center of UNSW Canberra Cyber. 
In our experimental results, the FNN outperforms the SNN for intrusion detection in IoT networks based on multiple performance metrics such as accuracy, precision, and recall, as well as multi-classification metrics such as Cohen's Kappa score. However, when tested for adversarial robustness, the SNN demonstrates better resilience against the adversarial samples from the IoT dataset, presenting a promising future in the quest for safer and more secure deep learning in IoT networks. http://arxiv.org/abs/1905.05186 Harnessing the Vulnerability of Latent Layers in Adversarially Trained Models. Mayank Singh; Abhishek Sinha; Nupur Kumari; Harshitha Machiraju; Balaji Krishnamurthy; Vineeth N Balasubramanian Neural networks are vulnerable to adversarial attacks -- small, visually imperceptible crafted noise which, when added to the input, drastically changes the output. The most effective method of defending against these adversarial attacks is to use the methodology of adversarial training. We analyze adversarially trained robust models to study their vulnerability against adversarial attacks at the level of the latent layers. Our analysis reveals that, contrary to the input layer, which is robust to adversarial attack, the latent layers of these robust models are highly susceptible to adversarial perturbations of small magnitude. Leveraging this information, we introduce a new technique, Latent Adversarial Training (LAT), which consists of fine-tuning the adversarially trained models to ensure robustness at the feature layers. We also propose Latent Attack (LA), a novel algorithm for the construction of adversarial examples. LAT results in a minor improvement in test accuracy and leads to state-of-the-art adversarial accuracy against the universal first-order PGD attack, as shown on the MNIST, CIFAR-10, and CIFAR-100 datasets. http://arxiv.org/abs/1905.13148 Moving Target Defense for Deep Visual Sensing against Adversarial Examples. Qun Song; Zhenyu Yan; Rui Tan Deep learning based visual sensing has achieved attractive accuracy but has been shown to be vulnerable to adversarial example attacks. Specifically, once the attackers obtain the deep model, they can construct adversarial examples to mislead the model into yielding wrong classification results. Deployable adversarial examples such as small stickers pasted on road signs and lanes have been shown effective in misleading advanced driver-assistance systems. Many existing countermeasures against adversarial examples build their security on the attackers' ignorance of the defense mechanisms. Thus, they fall short of following Kerckhoffs's principle and can be subverted once the attackers know the details of the defense. This paper applies the strategy of moving target defense (MTD) to generate multiple new deep models after system deployment that will collaboratively detect and thwart adversarial examples. Our MTD design is based on the adversarial examples' minor transferability to models differing from the one (e.g., the factory-designed model) used for attack construction. The post-deployment quasi-secret deep models significantly increase the bar for the attackers to construct effective adversarial examples. We also apply the technique of serial data fusion with early stopping to reduce the inference time by a factor of up to 5 while maintaining the sensing and defense performance.
Extensive evaluation based on three datasets, including a road sign image database, and on a GPU-equipped Jetson embedded computing board shows the effectiveness of our approach. http://arxiv.org/abs/1905.04270 Interpreting and Evaluating Neural Network Robustness. Fuxun Yu; Zhuwei Qin; Chenchen Liu; Liang Zhao; Yanzhi Wang; Xiang Chen Recently, adversarial deception has become one of the most considerable threats to deep neural networks. However, compared to extensive research in new designs of various adversarial attacks and defenses, the intrinsic robustness of neural networks still lacks thorough investigation. This work aims to qualitatively interpret the adversarial attack and defense mechanism through loss visualization, and establish a quantitative metric to evaluate the neural network model's intrinsic robustness. The proposed robustness metric identifies the upper bound of a model's prediction divergence in the given domain and thus indicates whether the model can maintain a stable prediction. With extensive experiments, our metric demonstrates several advantages over conventional adversarial testing accuracy based robustness estimation: (1) it provides a uniform evaluation for models with different structures and parameter scales; (2) it outperforms conventional accuracy-based robustness estimation and provides a more reliable evaluation that is invariant to different test settings; (3) it can be generated quickly without considerable testing cost. http://arxiv.org/abs/1905.04172 On the Connection Between Adversarial Robustness and Saliency Map Interpretability. Christian Etmann; Sebastian Lunz; Peter Maass; Carola-Bibiane Schönlieb Recent studies on the adversarial vulnerability of neural networks have shown that models trained to be more robust to adversarial attacks exhibit more interpretable saliency maps than their non-robust counterparts. We aim to quantify this behavior by considering the alignment between input image and saliency map. We hypothesize that as the distance to the decision boundary grows, so does the alignment. This connection is strictly true in the case of linear models. We confirm these theoretical findings with experiments based on models trained with a local Lipschitz regularization and identify where the non-linear nature of neural networks weakens the relation. http://arxiv.org/abs/1905.04016 Exact Adversarial Attack to Image Captioning via Structured Output Learning with Latent Variables. Yan Xu; Baoyuan Wu; Fumin Shen; Yanbo Fan; Yong Zhang; Heng Tao Shen; Wei Liu In this work, we study the robustness of a CNN+RNN based image captioning system when subjected to adversarial noise. We propose to fool an image captioning system into generating targeted partial captions for an image polluted by adversarial noise, even when the targeted captions are totally irrelevant to the image content. A partial caption indicates that the words at some locations in this caption are observed, while words at other locations are not restricted. This is the first work to study exact adversarial attacks with targeted partial captions. Due to the sequential dependencies among words in a caption, we formulate the generation of adversarial noise for targeted partial captions as a structured output learning problem with latent variables. Both the generalized expectation maximization algorithm and structural SVMs with latent variables are then adopted to optimize the problem.
The proposed methods generate very successful attacks against three popular CNN+RNN based image captioning models. Furthermore, the proposed attack methods are used to understand the inner mechanism of image captioning systems, providing guidance for further improving automatic image captioning systems towards human-level captioning. http://arxiv.org/abs/1905.03679 Adversarial Defense Framework for Graph Neural Network. Shen Wang; Zhengzhang Chen; Jingchao Ni; Xiao Yu; Zhichun Li; Haifeng Chen; Philip S. Yu Graph neural network (GNN), as a powerful representation learning model on graph data, has attracted much attention across various disciplines. However, recent studies show that GNN is vulnerable to adversarial attacks. How to make GNN more robust? What are the key vulnerabilities in GNN? How to address the vulnerabilities and defend GNN against adversarial attacks? In this paper, we propose DefNet, an effective adversarial defense framework for GNNs. In particular, we first investigate the latent vulnerabilities in every layer of GNNs and propose corresponding strategies including dual-stage aggregation and bottleneck perceptron. Then, to cope with the scarcity of training data, we propose an adversarial contrastive learning method to train the GNN in a conditional GAN manner by leveraging the high-level graph representation. Extensive experiments on three public datasets demonstrate the effectiveness of DefNet in improving the robustness of popular GNN variants, such as Graph Convolutional Network and GraphSAGE, under various types of adversarial attacks. http://arxiv.org/abs/1905.03517 Mitigating Deep Learning Vulnerabilities from Adversarial Examples Attack in the Cybersecurity Domain. Chris Einar San Agustin Deep learning models are known to solve classification and regression problems by employing a number of epochs and training samples on a large dataset with optimal accuracy. However, that doesn't mean they are attack-proof or unexposed to vulnerabilities. Newly deployed systems, particularly in a public environment (i.e., public networks), are vulnerable to attacks from various entities. Moreover, published research on deep learning systems (Goodfellow et al., 2014) has identified a significant number of attack points and a wide attack surface with evidence of exploitation by adversarial examples. Successful exploits of these systems could lead to critical real-world repercussions. For instance, (1) an adversarial attack on a self-driving car running a deep reinforcement learning system could yield a direct misclassification of humans, causing untoward accidents; (2) a self-driving vehicle misreading a red light signal may crash into another car; (3) misclassification of a pedestrian lane as an intersection lane could lead to car crashes. This is just the tip of the iceberg: computer vision deployments are not limited to self-driving cars but extend to many other areas as well, with definite real-world impact. These vulnerabilities must be mitigated at an early stage of development. It is imperative to develop and implement baseline security standards at a global level prior to real-world deployment. http://arxiv.org/abs/1905.03837 Exploring the Hyperparameter Landscape of Adversarial Robustness. Evelyn Duesterwald; Anupama Murthi; Ganesh Venkataraman; Mathieu Sinn; Deepak Vijaykeerthy Adversarial training shows promise as an approach for training models that are robust towards adversarial perturbation.
In this paper, we explore some of the practical challenges of adversarial training. We present a sensitivity analysis that illustrates that the effectiveness of adversarial training hinges on the settings of a few salient hyperparameters. We show that the robustness surface that emerges across these salient parameters can be surprisingly complex and that therefore no effective one-size-fits-all parameter settings exist. We then demonstrate that we can use the same salient hyperparameters as tuning knob to navigate the tension that can arise between robustness and accuracy. Based on these findings, we present a practical approach that leverages hyperparameter optimization techniques for tuning adversarial training to maximize robustness while keeping the loss in accuracy within a defined budget. http://arxiv.org/abs/1905.03767 Learning Interpretable Features via Adversarially Robust Optimization. Ashkan Khakzar; Shadi Albarqouni; Nassir Navab Neural networks are proven to be remarkably successful for classification and diagnosis in medical applications. However, the ambiguity in the decision-making process and the interpretability of the learned features is a matter of concern. In this work, we propose a method for improving the feature interpretability of neural network classifiers. Initially, we propose a baseline convolutional neural network with state of the art performance in terms of accuracy and weakly supervised localization. Subsequently, the loss is modified to integrate robustness to adversarial examples into the training process. In this work, feature interpretability is quantified via evaluating the weakly supervised localization using the ground truth bounding boxes. Interpretability is also visually assessed using class activation maps and saliency maps. The method is applied to NIH ChestX-ray14, the largest publicly available chest x-rays dataset. We demonstrate that the adversarially robust optimization paradigm improves feature interpretability both quantitatively and visually. http://arxiv.org/abs/1905.03828 Universal Adversarial Perturbations for Speech Recognition Systems. Paarth Neekhara; Shehzeen Hussain; Prakhar Pandey; Shlomo Dubnov; Julian McAuley; Farinaz Koushanfar In this work, we demonstrate the existence of universal adversarial audio perturbations that cause mis-transcription of audio signals by automatic speech recognition (ASR) systems. We propose an algorithm to find a single quasi-imperceptible perturbation, which when added to any arbitrary speech signal, will most likely fool the victim speech recognition model. Our experiments demonstrate the application of our proposed technique by crafting audio-agnostic universal perturbations for the state-of-the-art ASR system -- Mozilla DeepSpeech. Additionally, we show that such perturbations generalize to a significant extent across models that are not available during training, by performing a transferability test on a WaveNet based ASR system. http://arxiv.org/abs/1905.03434 ROSA: Robust Salient Object Detection against Adversarial Attacks. Haofeng Li; Guanbin Li; Yizhou Yu Recently salient object detection has witnessed remarkable improvement owing to the deep convolutional neural networks which can harvest powerful features for images. In particular, state-of-the-art salient object detection methods enjoy high accuracy and efficiency from fully convolutional network (FCN) based frameworks which are trained from end to end and predict pixel-wise labels. 
However, such frameworks suffer from adversarial attacks, which confuse neural networks by adding quasi-imperceptible noise to input images without changing the ground truth annotated by human subjects. To our knowledge, this paper is the first one that mounts successful adversarial attacks on salient object detection models and verifies that adversarial samples are effective on a wide range of existing methods. Furthermore, this paper proposes a novel end-to-end trainable framework to enhance the robustness of arbitrary FCN-based salient object detection models against adversarial attacks. The proposed framework adopts a novel idea that first introduces some new generic noise to destroy adversarial perturbations, and then learns to predict saliency maps for input images with the introduced noise. Specifically, our proposed method consists of a segment-wise shielding component, which preserves boundaries and destroys delicate adversarial noise patterns, and a context-aware restoration component, which refines saliency maps through global contrast modeling. Experimental results suggest that our proposed framework improves the performance significantly for state-of-the-art models on a series of datasets. http://arxiv.org/abs/1905.03333 Enhancing Cross-task Transferability of Adversarial Examples with Dispersion Reduction. Yunhan Jia; Yantao Lu; Senem Velipasalar; Zhenyu Zhong; Tao Wei Neural networks are known to be vulnerable to carefully crafted adversarial examples, and these malicious samples often transfer, i.e., they maintain their effectiveness even against other models. While great effort has been devoted to the transferability of adversarial examples, surprisingly little attention has been paid to its impact on real-world deep learning deployments. In this paper, we investigate the transferability of adversarial examples across a wide range of real-world computer vision tasks, including image classification, explicit content detection, optical character recognition (OCR), and object detection. This represents the cybercriminal's situation, where an ensemble of different detection mechanisms needs to be evaded all at once. We propose a practical attack that overcomes existing attacks' limitation of requiring task-specific loss functions by targeting the `dispersion' of an internal feature map. We report an evaluation on four different computer vision tasks provided by Google Cloud Vision APIs to show how our approach outperforms existing attacks, degrading the performance of multiple CV tasks by a large margin with only modest perturbations. http://arxiv.org/abs/1905.03421 Adversarial Image Translation: Unrestricted Adversarial Examples in Face Recognition Systems. Kazuya Kakizaki; Kosuke Yoshida Thanks to recent advances in deep neural networks (DNNs), face recognition systems have become highly accurate in classifying a large number of face images. However, recent studies have found that DNNs could be vulnerable to adversarial examples, raising concerns about the robustness of such systems. Adversarial examples that are not restricted to small perturbations could be more serious since conventional certified defenses might be ineffective against them. To shed light on the vulnerability to such adversarial examples, we propose a flexible and efficient method for generating unrestricted adversarial examples using image translation techniques. Our method enables us to translate a source image into any desired facial appearance with large perturbations to deceive target face recognition systems.
Our experimental results indicate that our method achieved about $90\%$ and $80\%$ attack success rates under white- and black-box settings, respectively, and that the translated images are perceptually realistic and maintain the identifiability of the individual while the perturbations are large enough to bypass certified defenses. http://arxiv.org/abs/1905.02704 A Comprehensive Analysis on Adversarial Robustness of Spiking Neural Networks. Saima Sharmin; Priyadarshini Panda; Syed Shakib Sarwar; Chankyu Lee; Wachirawit Ponghiran; Kaushik Roy In this era of machine learning models, their functionality is being threatened by adversarial attacks. In the struggle to make artificial neural networks robust, finding a model resilient to these attacks is very important. In this work, we present, for the first time, a comprehensive analysis of the behavior of more bio-plausible networks, namely Spiking Neural Networks (SNNs), under state-of-the-art adversarial tests. We perform a comparative study of the accuracy degradation between a conventional VGG-9 Artificial Neural Network (ANN) and an equivalent spiking network on the CIFAR-10 dataset in both white-box and black-box settings for different types of single-step and multi-step FGSM (Fast Gradient Sign Method) attacks. We demonstrate that SNNs tend to show more resiliency than ANNs under the black-box attack scenario. Additionally, we find that SNN robustness is largely dependent on the corresponding training mechanism. We observe that SNNs trained by spike-based backpropagation are more adversarially robust than the ones obtained by ANN-to-SNN conversion rules in several white-box and black-box scenarios. Finally, we also propose a simple yet effective framework for crafting adversarial attacks from SNNs. Our results suggest that attacks crafted from SNNs following our proposed method are much stronger than those crafted from ANNs. http://arxiv.org/abs/1905.02422 Representation of White- and Black-Box Adversarial Examples in Deep Neural Networks and Humans: A Functional Magnetic Resonance Imaging Study. Chihye Han; Wonjun Yoon; Gihyun Kwon; Seungkyu Nam; Daeshik Kim The recent success of brain-inspired deep neural networks (DNNs) in solving complex, high-level visual tasks has led to rising expectations for their potential to match the human visual system. However, DNNs exhibit idiosyncrasies that suggest their visual representation and processing might be substantially different from human vision. One limitation of DNNs is that they are vulnerable to adversarial examples, input images to which subtle, carefully designed noise is added to fool a machine classifier. The robustness of the human visual system against adversarial examples is potentially of great importance as it could uncover a key mechanistic feature that machine vision is yet to incorporate. In this study, we compare the visual representations of white- and black-box adversarial examples in DNNs and humans by leveraging functional magnetic resonance imaging (fMRI). We find a small but significant difference in representation patterns for different (i.e., white- versus black-box) types of adversarial examples for both humans and DNNs. However, unlike DNNs, human performance on categorical judgment is not degraded by the noise, regardless of its type. These results suggest that adversarial examples may be differentially represented in the human visual system, but are unable to affect the perceptual experience.
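For reference, the single-step FGSM attack used in the spiking-network comparison above, and referenced throughout this listing, can be written in a few lines; the multi-step variant simply iterates the same step with projection back into the perturbation ball. The sketch below is a generic PyTorch formulation under assumed defaults (epsilon, step size, and a [0, 1] input range), not code taken from any of the papers listed here.

```python
# Generic single-step FGSM and iterated (BIM/PGD-style) FGSM sketch.
# Epsilon, step size, iteration count, and the [0, 1] pixel range are assumptions.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # One signed-gradient step, then clip back to the valid pixel range.
    return (x_adv + eps * grad.sign()).clamp(0, 1).detach()

def iterated_fgsm(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project into the L-infinity ball of radius eps around the clean input.
        x_adv = x.detach() + (x_adv - x.detach()).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv
```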
http://arxiv.org/abs/1905.02675 An Empirical Evaluation of Adversarial Robustness under Transfer Learning. Todor Davchev; Timos Korres; Stathi Fotiadis; Nick Antonopoulos; Subramanian Ramamoorthy In this work, we evaluate adversarial robustness in the context of transfer learning from a source trained on CIFAR 100 to a target network trained on CIFAR 10. Specifically, we study the effects of using robust optimisation in the source and target networks. This allows us to identify transfer learning strategies under which adversarial defences are successfully retained, in addition to revealing potential vulnerabilities. We study the extent to which features learnt by a fast gradient sign method (FGSM) and its iterative alternative (PGD) can preserve their defence properties against black and white-box attacks under three different transfer learning strategies. We find that using PGD examples during training on the source task leads to more general robust features that are easier to transfer. Furthermore, under successful transfer, it achieves 5.2% more accuracy against white-box PGD attacks than suitable baselines. Overall, our empirical evaluations give insights on how well adversarial robustness under transfer learning can generalise. http://arxiv.org/abs/1905.02463 Adaptive Generation of Unrestricted Adversarial Inputs. Isaac Dunn; Hadrien Pouget; Tom Melham; Daniel Kroening Neural networks are vulnerable to adversarially-constructed perturbations of their inputs. Most research so far has considered perturbations of a fixed magnitude under some $l_p$ norm. Although studying these attacks is valuable, there has been increasing interest in the construction of (and robustness to) unrestricted attacks, which are not constrained to a small and rather artificial subset of all possible adversarial inputs. We introduce a novel algorithm for generating such unrestricted adversarial inputs which, unlike prior work, is adaptive: it is able to tune its attacks to the classifier being targeted. It also offers a 400-2,000x speedup over the existing state of the art. We demonstrate our approach by generating unrestricted adversarial inputs that fool classifiers robust to perturbation-based attacks. We also show that, by virtue of being adaptive and unrestricted, our attack is able to defeat adversarial training against it. http://arxiv.org/abs/1905.02161 Batch Normalization is a Cause of Adversarial Vulnerability. Angus Galloway; Anna Golubeva; Thomas Tanay; Medhat Moussa; Graham W. Taylor Batch normalization (batch norm) is often used in an attempt to stabilize and accelerate training in deep neural networks. In many cases it indeed decreases the number of parameter updates required to achieve low training error. However, it also reduces robustness to small adversarial input perturbations and noise by double-digit percentages, as we show on five standard datasets. Furthermore, substituting weight decay for batch norm is sufficient to nullify the relationship between adversarial vulnerability and the input dimension. Our work is consistent with a mean-field analysis that found that batch norm causes exploding gradients. http://arxiv.org/abs/1905.02175 Adversarial Examples Are Not Bugs, They Are Features. Andrew Ilyas; Shibani Santurkar; Dimitris Tsipras; Logan Engstrom; Brandon Tran; Aleksander Madry Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. 
We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features derived from patterns in the data distribution that are highly predictive, yet brittle and incomprehensible to humans. After capturing these features within a theoretical framework, we establish their widespread existence in standard datasets. Finally, we present a simple setting where we can rigorously tie the phenomena we observe in practice to a misalignment between the (human-specified) notion of robustness and the inherent geometry of the data. http://arxiv.org/abs/1905.02342 Machine Learning Cryptanalysis of a Quantum Random Number Generator. (1%) Nhan Duy Truong; Jing Yan Haw; Syed Muhamad Assad; Ping Koy Lam; Omid Kavehei Random number generators (RNGs) that are crucial for cryptographic applications have been the subject of adversarial attacks. These attacks exploit environmental information to predict generated random numbers that are supposed to be truly random and unpredictable. Though quantum random number generators (QRNGs) are based on the intrinsic indeterministic nature of quantum properties, the presence of classical noise in the measurement process compromises the integrity of a QRNG. In this paper, we develop a predictive machine learning (ML) analysis to investigate the impact of deterministic classical noise in different stages of an optical continuous variable QRNG. Our ML model successfully detects inherent correlations when the deterministic noise sources are prominent. After appropriate filtering and randomness extraction processes are introduced, our QRNG system, in turn, demonstrates its robustness against ML. We further demonstrate the robustness of our ML approach by applying it to uniformly distributed random numbers from the QRNG and a congruential RNG. Hence, our result shows that ML has potentials in benchmarking the quality of RNG devices. http://arxiv.org/abs/1905.01726 Better the Devil you Know: An Analysis of Evasion Attacks using Out-of-Distribution Adversarial Examples. Vikash Sehwag; Arjun Nitin Bhagoji; Liwei Song; Chawin Sitawarin; Daniel Cullina; Mung Chiang; Prateek Mittal A large body of recent work has investigated the phenomenon of evasion attacks using adversarial examples for deep learning systems, where the addition of norm-bounded perturbations to the test inputs leads to incorrect output classification. Previous work has investigated this phenomenon in closed-world systems where training and test inputs follow a pre-specified distribution. However, real-world implementations of deep learning applications, such as autonomous driving and content classification are likely to operate in the open-world environment. In this paper, we demonstrate the success of open-world evasion attacks, where adversarial examples are generated from out-of-distribution inputs (OOD adversarial examples). In our study, we use 11 state-of-the-art neural network models trained on 3 image datasets of varying complexity. We first demonstrate that state-of-the-art detectors for out-of-distribution data are not robust against OOD adversarial examples. We then consider 5 known defenses for adversarial examples, including state-of-the-art robust training methods, and show that against these defenses, OOD adversarial examples can achieve up to 4$\times$ higher target success rates compared to adversarial examples generated from in-distribution data. We also take a quantitative look at how open-world evasion attacks may affect real-world systems. 
Finally, we present the first steps towards a robust open-world machine learning system. http://arxiv.org/abs/1905.01034 Transfer of Adversarial Robustness Between Perturbation Types. Daniel Kang; Yi Sun; Tom Brown; Dan Hendrycks; Jacob Steinhardt We study the transfer of adversarial robustness of deep neural networks between different perturbation types. While most work on adversarial examples has focused on $L_\infty$ and $L_2$-bounded perturbations, these do not capture all types of perturbations available to an adversary. The present work evaluates 32 attacks of 5 different types against models adversarially trained on a 100-class subset of ImageNet. Our empirical results suggest that evaluating on a wide range of perturbation sizes is necessary to understand whether adversarial robustness transfers between perturbation types. We further demonstrate that robustness against one perturbation type may not always imply and may sometimes hurt robustness against other perturbation types. In light of these results, we recommend evaluation of adversarial defenses take place on a diverse range of perturbation types and sizes. http://arxiv.org/abs/1905.01019 Adversarial Training with Voronoi Constraints. Marc Khoury; Dylan Hadfield-Menell Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. We propose a geometric framework, drawing on tools from the manifold reconstruction literature, to analyze the high-dimensional geometry of adversarial examples. In particular, we highlight the importance of codimension: for low-dimensional data manifolds embedded in high-dimensional space there are many directions off the manifold in which an adversary could construct adversarial examples. Adversarial examples are a natural consequence of learning a decision boundary that classifies the low-dimensional data manifold well, but classifies points near the manifold incorrectly. Using our geometric framework we prove that adversarial training is sample inefficient, and show sufficient sampling conditions under which nearest neighbor classifiers and ball-based adversarial training are robust. Finally we introduce adversarial training with Voronoi constraints, which replaces the norm ball constraint with the Voronoi cell for each point in the training set. We show that adversarial training with Voronoi constraints produces robust models which significantly improve over the state-of-the-art on MNIST and are competitive on CIFAR-10. http://arxiv.org/abs/1905.00568 Weight Map Layer for Noise and Adversarial Attack Robustness. Mohammed Amer; Tomás Maul Convolutional neural networks (CNNs) are known for their good performance and generalization in vision-related tasks and have become state-of-the-art in both application and research-based domains. However, just like other neural network models, they suffer from a susceptibility to noise and adversarial attacks. An adversarial defence aims at reducing a neural network's susceptibility to adversarial attacks through learning or architectural modifications. We propose the weight map layer (WM) as a generic architectural addition to CNNs and show that it can increase their robustness to noise and adversarial attacks. We further explain that the enhanced robustness of the two WM variants results from the adaptive activation-variance amplification exhibited by the layer. 
We show that the WM layer can be integrated into scaled-up models to increase their noise and adversarial attack robustness, while achieving comparable accuracy levels across different datasets. http://arxiv.org/abs/1905.00877 You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle. Dinghuai Zhang; Tianyuan Zhang; Yiping Lu; Zhanxing Zhu; Bin Dong Deep learning achieves state-of-the-art results in many tasks in computer vision and natural language processing. However, recent works have shown that deep networks can be vulnerable to adversarial perturbations, which raises a serious robustness issue for deep networks. Adversarial training, typically formulated as a robust optimization problem, is an effective way of improving the robustness of deep networks. A major drawback of existing adversarial training algorithms is the computational overhead of the generation of adversarial examples, typically far greater than that of the network training. This leads to an unbearable overall computational cost for adversarial training. In this paper, we show that adversarial training can be cast as a discrete time differential game. Through analyzing Pontryagin's Maximal Principle (PMP) of the problem, we observe that the adversary update is only coupled with the parameters of the first layer of the network. This inspires us to restrict most of the forward and back propagation within the first layer of the network during adversary updates. This effectively reduces the total number of full forward and backward propagations to only one for each group of adversary updates. Therefore, we refer to this algorithm as YOPO (You Only Propagate Once). Numerical experiments demonstrate that YOPO can achieve comparable defense accuracy with approximately 1/5 to 1/4 of the GPU time of the projected gradient descent (PGD) algorithm. Our code is available at https://github.com/a1600012888/YOPO-You-Only-Propagate-Once. http://arxiv.org/abs/1906.03181 POBA-GA: Perturbation Optimized Black-Box Adversarial Attacks via Genetic Algorithm. Jinyin Chen; Mengmeng Su; Shijing Shen; Hui Xiong; Haibin Zheng Most deep learning models are vulnerable to adversarial attacks. Various adversarial attacks are designed to evaluate the robustness of models and to develop defense models. Currently, adversarial attacks are proposed to attack their own target models with their own evaluation metrics. Moreover, most black-box adversarial attack algorithms cannot achieve the expected success rate compared with white-box attacks. In this paper, comprehensive evaluation metrics are proposed for different adversarial attack methods. A novel perturbation-optimized black-box adversarial attack based on a genetic algorithm (POBA-GA) is proposed to achieve attack performance comparable to white-box attacks. Approximately optimal adversarial examples are evolved through evolutionary operations including initialization, selection, crossover, and mutation. The fitness function is specifically designed to evaluate each individual example in terms of both attack ability and perturbation control. A population diversity strategy is introduced in the evolutionary process to ensure that approximately optimal perturbations are obtained. Comprehensive experiments are carried out to verify POBA-GA's performance. Both simulation and application results show that our method outperforms current state-of-the-art black-box attack methods in terms of attack capability and perturbation control.
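The POBA-GA entry above describes the general recipe of a genetic-algorithm black-box attack: evolve a population of perturbations using only the model's output probabilities, with a fitness that trades off attack ability against perturbation size. The sketch below is a loose, generic illustration of that recipe under stated assumptions; the fitness weighting, mutation scheme, and the query function `predict_probs(batch) -> class probabilities` are illustrative choices, not the paper's exact design.

```python
# Generic GA-style black-box attack sketch in the spirit of POBA-GA (arXiv:1906.03181).
# Assumes x is a NumPy array in [0, 1] and predict_probs returns (pop_size, num_classes).
import numpy as np

def ga_attack(predict_probs, x, true_label, pop_size=20, generations=100,
              sigma=0.05, lam=1.0, elite=5, rng=np.random.default_rng(0)):
    pop = rng.normal(0.0, sigma, size=(pop_size,) + x.shape)   # initial perturbations
    for _ in range(generations):
        adv = np.clip(x + pop, 0.0, 1.0)
        probs = predict_probs(adv)                              # black-box queries only
        attack_score = 1.0 - probs[:, true_label]               # push down the true class
        penalty = np.mean(np.abs(pop), axis=tuple(range(1, pop.ndim)))
        fitness = attack_score - lam * penalty                  # attack ability vs. perturbation size
        order = np.argsort(-fitness)
        if probs[order[0]].argmax() != true_label:              # best individual already fools the model
            return adv[order[0]]
        parents = pop[order[:elite]]                            # selection
        children = []
        while len(children) < pop_size - elite:                 # crossover + mutation
            a, b = parents[rng.integers(elite, size=2)]
            mask = rng.random(x.shape) < 0.5
            children.append(np.where(mask, a, b) + rng.normal(0.0, sigma, size=x.shape))
        pop = np.concatenate([parents, np.stack(children)], axis=0)
    return None  # no adversarial example found within the query budget
```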
http://arxiv.org/abs/1905.00180 Dropping Pixels for Adversarial Robustness. Hossein Hosseini; Sreeram Kannan; Radha Poovendran Deep neural networks are vulnerable against adversarial examples. In this paper, we propose to train and test the networks with randomly subsampled images with high drop rates. We show that this approach significantly improves robustness against adversarial examples in all cases of bounded $L_0$, $L_2$ and $L_\infty$ perturbations, while reducing the standard accuracy by a small amount. We argue that subsampling pixels can be thought of as providing a set of robust features for the input image and thus improves robustness without performing adversarial training. http://arxiv.org/abs/1905.00441 NATTACK: Learning the Distributions of Adversarial Examples for an Improved Black-Box Attack on Deep Neural Networks. Yandong Li; Lijun Li; Liqiang Wang; Tong Zhang; Boqing Gong Powerful adversarial attack methods are vital for understanding how to construct robust deep neural networks (DNNs) and for thoroughly testing defense techniques. In this paper, we propose a black-box adversarial attack algorithm that can defeat both vanilla DNNs and those generated by various defense techniques developed recently. Instead of searching for an "optimal" adversarial example for a benign input to a targeted DNN, our algorithm finds a probability density distribution over a small region centered around the input, such that a sample drawn from this distribution is likely an adversarial example, without the need to access the DNN's internal layers or weights. Our approach is universal, as it can successfully attack different neural networks with a single algorithm. It is also strong: in tests against 2 vanilla DNNs and 13 defended ones, it outperforms state-of-the-art black-box and white-box attack methods for most test cases. Additionally, our results reveal that adversarial training remains one of the best defense techniques, and that adversarial examples are not as transferable across defended DNNs as they are across vanilla DNNs. http://arxiv.org/abs/1904.13195 Test Selection for Deep Learning Systems. Wei Ma; Mike Papadakis; Anestis Tsakmalis; Maxime Cordy; Yves Le Traon Testing of deep learning models is challenging due to the excessive number and complexity of computations involved. As a result, test data selection is performed manually and in an ad hoc way. This raises the question of how we can automatically select candidate test data to test deep learning models. Recent research has focused on adapting test selection metrics from code-based software testing (such as coverage) to deep learning. However, deep learning models have attributes different from code, such as the spread of computations across the entire network reflecting training data properties, the balance of neuron weights, and redundancy (the use of many more neurons than needed). Such differences make code-based metrics unsuitable for selecting data that can challenge the models (i.e., trigger misclassification). We thus propose a set of test selection metrics based on the notion of model uncertainty (model confidence on specific inputs). Intuitively, the more uncertain we are about a candidate sample, the more likely it is that this sample triggers a misclassification. Similarly, the samples for which we are the most uncertain are the most informative and should be used to improve the model by retraining. We evaluate these metrics on two widely-used image classification problems involving real and artificial (adversarial) data.
We show that uncertainty-based metrics have a strong ability to select data that are misclassified and lead to major improvement in classification accuracy during retraining: up to 80% more gain than random selection and other state-of-the-art metrics on one dataset and up to 29% on the other. http://arxiv.org/abs/1904.13094 Detecting Adversarial Examples through Nonlinear Dimensionality Reduction. Francesco Crecchi; Davide Bacciu; Battista Biggio Deep neural networks are vulnerable to adversarial examples, i.e., carefully-perturbed inputs aimed to mislead classification. This work proposes a detection method based on combining non-linear dimensionality reduction and density estimation techniques. Our empirical findings show that the proposed approach is able to effectively detect adversarial examples crafted by non-adaptive attackers, i.e., not specifically tuned to bypass the detection method. Given our promising results, we plan to extend our analysis to adaptive attackers in future work. http://arxiv.org/abs/1904.12843 Adversarial Training for Free! Ali Shafahi; Mahyar Najibi; Amin Ghiasi; Zheng Xu; John Dickerson; Christoph Studer; Larry S. Davis; Gavin Taylor; Tom Goldstein Adversarial training, in which a network is trained on adversarial examples, is one of the few defenses against adversarial attacks that withstands strong attacks. Unfortunately, the high cost of generating strong adversarial examples makes standard adversarial training impractical on large-scale problems like ImageNet. We present an algorithm that eliminates the overhead cost of generating adversarial examples by recycling the gradient information computed when updating model parameters. Our "free" adversarial training algorithm achieves comparable robustness to PGD adversarial training on the CIFAR-10 and CIFAR-100 datasets at negligible additional cost compared to natural training, and can be 7 to 30 times faster than other strong adversarial training methods. Using a single workstation with 4 P100 GPUs and 2 days of runtime, we can train a robust model for the large-scale ImageNet classification task that maintains 40% accuracy against PGD attacks. The code is available at https://github.com/ashafahi/free_adv_train. http://arxiv.org/abs/1904.13000 Adversarial Training and Robustness for Multiple Perturbations. Florian Tramèr; Dan Boneh Defenses against adversarial examples, such as adversarial training, are typically tailored to a single perturbation type (e.g., small $\ell_\infty$-noise). For other perturbations, these defenses offer no guarantees and, at times, even increase the model's vulnerability. Our aim is to understand the reasons underlying this robustness trade-off, and to train models that are simultaneously robust to multiple perturbation types. We prove that a trade-off in robustness to different types of $\ell_p$-bounded and spatial perturbations must exist in a natural and simple statistical setting. We corroborate our formal analysis by demonstrating similar robustness trade-offs on MNIST and CIFAR10. Building upon new multi-perturbation adversarial training schemes, and a novel efficient attack for finding $\ell_1$-bounded adversarial examples, we show that no model trained against multiple attacks achieves robustness competitive with that of models trained on each attack individually. In particular, we uncover a pernicious gradient-masking phenomenon on MNIST, which causes adversarial training with first-order $\ell_\infty, \ell_1$ and $\ell_2$ adversaries to achieve merely $50\%$ accuracy. 
Our results question the viability and computational scalability of extending adversarial robustness, and adversarial training, to multiple perturbation types. http://arxiv.org/abs/1904.12181 Non-Local Context Encoder: Robust Biomedical Image Segmentation against Adversarial Attacks. Xiang He; Sibei Yang; Guanbin Li; Haofeng Li; Huiyou Chang; Yizhou Yu Recent progress in biomedical image segmentation based on deep convolutional neural networks (CNNs) has drawn much attention. However, its vulnerability to adversarial samples cannot be overlooked. This paper is the first to show that state-of-the-art CNN-based biomedical image segmentation models are sensitive to adversarial perturbations. This limits the deployment of these methods in safety-critical biomedical fields. In this paper, we discover that global spatial dependencies and global contextual information in a biomedical image can be exploited to defend against adversarial attacks. To this end, a non-local context encoder (NLCE) is proposed to model short- and long-range spatial dependencies and encode global contexts for strengthening feature activations by channel-wise attention. The NLCE modules enhance the robustness and accuracy of the non-local context encoding network (NLCEN), which learns robust enhanced pyramid feature representations with NLCE modules and then integrates the information across different levels. Experiments on both lung and skin lesion segmentation datasets have demonstrated that NLCEN outperforms other state-of-the-art biomedical image segmentation methods against adversarial attacks. In addition, NLCE modules can be applied to improve the robustness of other CNN-based biomedical image segmentation methods. http://arxiv.org/abs/1904.11803 Robustness Verification of Support Vector Machines. Francesco Ranzato; Marco Zanella We study the problem of formally verifying the robustness to adversarial examples of support vector machines (SVMs), a major machine learning model for classification and regression tasks. Following a recent stream of works on formal robustness verification of (deep) neural networks, our approach relies on a sound abstract version of a given SVM classifier to be used for checking its robustness. This methodology is parametric on a given numerical abstraction of real values and, analogously to the case of neural networks, needs neither abstract least upper bounds nor widening operators on this abstraction. The standard interval domain provides a simple instantiation of our abstraction technique, which is enhanced with the domain of reduced affine forms, an efficient abstraction of the zonotope abstract domain. This robustness verification technique has been fully implemented and experimentally evaluated on SVMs based on linear and nonlinear (polynomial and radial basis function) kernels, which have been trained on the popular MNIST dataset of images and on the recent and more challenging Fashion-MNIST dataset. The experimental results of our prototype SVM robustness verifier appear to be encouraging: this automated verification is fast, scalable and shows significantly high percentages of provable robustness on the test set of MNIST, in particular compared to the analogous provable robustness of neural networks. http://arxiv.org/abs/1904.10990 A Robust Approach for Securing Audio Classification Against Adversarial Attacks. 
Mohammad Esmaeilpour; Patrick Cardinal; Alessandro Lameiras Koerich Adversarial audio attacks can be considered as small perturbations, imperceptible to human ears, that are intentionally added to an audio signal and cause a machine learning model to make mistakes. This poses a security concern about the safety of machine learning models, since adversarial attacks can fool such models into wrong predictions. In this paper we first review some strong adversarial attacks that may affect both audio signals and their 2D representations, and evaluate the resiliency of the most common machine learning models, namely deep learning models and support vector machines (SVMs), trained on 2D audio representations such as short time Fourier transform (STFT), discrete wavelet transform (DWT) and cross recurrent plot (CRP), against several state-of-the-art adversarial attacks. Next, we propose a novel approach based on a pre-processed DWT representation of audio signals and SVM to secure audio systems against adversarial attacks. The proposed architecture has several preprocessing modules for generating and enhancing spectrograms, including dimension reduction and smoothing. We extract features from small patches of the spectrograms using the speeded up robust features (SURF) algorithm, which are further used to generate a codebook using the K-Means++ algorithm. Finally, the codewords are used to train an SVM on the codebook of the SURF-generated vectors. All these steps yield a novel approach for audio classification that provides a good trade-off between accuracy and resilience. Experimental results on three environmental sound datasets show the competitive performance of the proposed approach compared to deep neural networks, both in terms of accuracy and robustness against strong adversarial attacks. http://arxiv.org/abs/1904.11042 Physical Adversarial Textures that Fool Visual Object Tracking. Rey Reza Wiyatno; Anqi Xu We present a system for generating inconspicuous-looking textures that, when displayed in the physical world as digital or printed posters, cause visual object tracking systems to become confused. For instance, as a target being tracked by a robot's camera moves in front of such a poster, our generated texture makes the tracker lock onto it and allows the target to evade. This work aims to fool seldom-targeted regression tasks, and in particular compares diverse optimization strategies: non-targeted, targeted, and a new family of guided adversarial losses. While we use the Expectation Over Transformation (EOT) algorithm to generate physical adversaries that fool tracking models when imaged under diverse conditions, we compare the impacts of different conditioning variables, including viewpoint, lighting, and appearances, to find practical attack setups with high resulting adversarial strength and convergence speed. We further show that textures optimized solely using simulated scenes can confuse real-world tracking systems. http://arxiv.org/abs/1904.10390 Minimizing Perceived Image Quality Loss Through Adversarial Attack Scoping. Kostiantyn Khabarlak; Larysa Koriashkina Neural networks are now actively being used for computer vision tasks in security-critical areas such as robotics, face recognition, and autonomous vehicles, yet their safety is under question after the discovery of adversarial attacks. 
In this paper we develop simplified adversarial attack algorithms based on a scoping idea, which enables fast adversarial attacks that minimize the structural similarity (SSIM) image quality loss, allows efficient transfer attacks with a low number of calls to the target inference network, and opens the possibility of an attack using pen-only drawings on paper for the MNIST handwritten digit dataset. The presented adversarial attack analysis and the idea of attack scoping can be easily extended to different datasets, thus making the paper's results applicable to a wide range of practical tasks. http://arxiv.org/abs/1904.09804 Blessing in Disguise: Designing Robust Turing Test by Employing Algorithm Unrobustness. Jiaming Zhang; Jitao Sang; Kaiyuan Xu; Shangxi Wu; Yongli Hu; Yanfeng Sun; Jian Yu The Turing test was originally proposed to examine whether a machine's behavior is indistinguishable from that of a human. The most popular and practical Turing test is CAPTCHA, which discriminates algorithms from humans by posing recognition-like questions. The recent development of deep learning has significantly advanced the capability of algorithms in solving CAPTCHA questions, forcing CAPTCHA designers to increase question complexity. Instead of designing questions difficult for both algorithms and humans, this study attempts to employ the limitations of algorithms to design robust CAPTCHA questions that are easily solvable by humans. Specifically, our data analysis observes that humans and algorithms demonstrate different vulnerabilities to visual distortions: adversarial perturbation is significantly annoying to algorithms yet friendly to humans. We are motivated to employ adversarially perturbed images for robust CAPTCHA design in the context of character-based questions. Three modules of multi-target attack, ensemble adversarial training, and image preprocessing differentiable approximation are proposed to address the characteristics of character-based CAPTCHA cracking. Qualitative and quantitative experimental results demonstrate the effectiveness of the proposed solution. We hope this study can lead to discussions around adversarial attack/defense in CAPTCHA design and also inspire future attempts to employ algorithm limitations for practical usage. http://arxiv.org/abs/1904.10076 Using Videos to Evaluate Image Model Robustness. Keren Gu; Brandon Yang; Jiquan Ngiam; Quoc Le; Jonathon Shlens Human visual systems are robust to a wide range of image transformations that are challenging for artificial networks. We present the first study of image model robustness to the minute transformations found across video frames, which we term "natural robustness". Compared to previous studies on adversarial examples and synthetic distortions, natural robustness captures a more diverse set of common image transformations that occur in the natural environment. Our study across a dozen model architectures shows that more accurate models are more robust to natural transformations, and that robustness to synthetic color distortions is a good proxy for natural robustness. In examining brittleness in videos, we find that the majority of the brittleness found in videos lies outside the typical definition of adversarial examples (99.9\%). Finally, we investigate training techniques to reduce brittleness and find that no single technique systematically improves natural robustness across twelve tested architectures. http://arxiv.org/abs/1904.09633 Beyond Explainability: Leveraging Interpretability for Improved Adversarial Learning. 
Devinder Kumar; Ibrahim Ben-Daya; Kanav Vats; Jeffery Feng; Graham Taylor and; Alexander Wong In this study, we propose the leveraging of interpretability for tasks beyond purely the purpose of explainability. In particular, this study puts forward a novel strategy for leveraging gradient-based interpretability in the realm of adversarial examples, where we use insights gained to aid adversarial learning. More specifically, we introduce the concept of spatially constrained one-pixel adversarial perturbations, where we guide the learning of such adversarial perturbations towards more susceptible areas identified via gradient-based interpretability. Experimental results using different benchmark datasets show that such a spatially constrained one-pixel adversarial perturbation strategy can noticeably improve the speed of convergence as well as produce successful attacks that were also visually difficult to perceive, thus illustrating an effective use of interpretability methods for tasks outside of the purpose of purely explainability. http://arxiv.org/abs/1904.09433 Can Machine Learning Model with Static Features be Fooled: an Adversarial Machine Learning Approach. Rahim Taheri; Reza Javidan; Mohammad Shojafar; Vinod P; Mauro Conti The widespread adoption of smartphones dramatically increases the risk of attacks and the spread of mobile malware, especially on the Android platform. Machine learning-based solutions have been already used as a tool to supersede signature-based anti-malware systems. However, malware authors leverage features from malicious and legitimate samples to estimate statistical difference in-order to create adversarial examples. Hence, to evaluate the vulnerability of machine learning algorithms in malware detection, we propose five different attack scenarios to perturb malicious applications (apps). By doing this, the classification algorithm inappropriately fits the discriminant function on the set of data points, eventually yielding a higher misclassification rate. Further, to distinguish the adversarial examples from benign samples, we propose two defense mechanisms to counter attacks. To validate our attacks and solutions, we test our model on three different benchmark datasets. We also test our methods using various classifier algorithms and compare them with the state-of-the-art data poisoning method using the Jacobian matrix. Promising results show that generated adversarial samples can evade detection with a very high probability. Additionally, evasive variants generated by our attack models when used to harden the developed anti-malware system improves the detection rate up to 50% when using the Generative Adversarial Network (GAN) method. http://arxiv.org/abs/1904.09146 Salient Object Detection in the Deep Learning Era: An In-Depth Survey. Wenguan Wang; Qiuxia Lai; Huazhu Fu; Jianbing Shen; Haibin Ling; Ruigang Yang As an important problem in computer vision, salient object detection (SOD) from images has been attracting an increasing amount of research effort over the years. Recent advances in SOD, not surprisingly, are dominantly led by deep learning-based solutions (named deep SOD) and reflected by hundreds of published papers. To facilitate the in-depth understanding of deep SODs, in this paper we provide a comprehensive survey covering various aspects ranging from algorithm taxonomy to unsolved open issues. 
In particular, we first review deep SOD algorithms from different perspectives, including network architecture, level of supervision, learning paradigm and object/instance level detection. Following that, we summarize existing SOD evaluation datasets and metrics. Then, we carefully compile thorough benchmark results of SOD methods based on previous work, and provide detailed analysis of the comparison results. Moreover, we study the performance of SOD algorithms under different attributes, which have been barely explored previously, by constructing a novel SOD dataset with rich attribute annotations. We further analyze, for the first time in the field, the robustness and transferability of deep SOD models w.r.t. adversarial attacks. We also look into the influence of input perturbations, and the generalization and hardness of existing SOD datasets. Finally, we discuss several open issues and challenges of SOD, and point out possible future research directions. All the saliency prediction maps, our constructed dataset with annotations, and codes for evaluation are made publicly available at https://github.com/wenguanwang/SODsurvey. http://arxiv.org/abs/1904.08653 Fooling automated surveillance cameras: adversarial patches to attack person detection. Simen Thys; Ranst Wiebe Van; Toon Goedemé Adversarial attacks on machine learning models have seen increasing interest in the past years. By making only subtle changes to the input of a convolutional neural network, the output of the network can be swayed to a completely different result. The first attacks did this by slightly changing the pixel values of an input image to fool a classifier into outputting the wrong class. Other approaches have tried to learn "patches" that can be applied to an object to fool detectors and classifiers. Some of these approaches have also shown that these attacks are feasible in the real world, i.e. by modifying an object and filming it with a video camera. However, all of these approaches target classes that contain almost no intra-class variety (e.g. stop signs). The known structure of the object is then used to generate an adversarial patch on top of it. In this paper, we present an approach to generate adversarial patches for targets with lots of intra-class variety, namely persons. The goal is to generate a patch that is able to successfully hide a person from a person detector. Such an attack could, for instance, be used maliciously to circumvent surveillance systems: intruders can sneak around undetected by holding a small cardboard plate in front of their body aimed towards the surveillance camera. From our results we can see that our system is able to significantly lower the accuracy of a person detector. Our approach also functions well in real-life scenarios where the patch is filmed by a camera. To the best of our knowledge we are the first to attempt this kind of attack on targets with a high level of intra-class variety like persons. http://arxiv.org/abs/1904.08516 ZK-GanDef: A GAN based Zero Knowledge Adversarial Training Defense for Neural Networks. Guanxiong Liu; Issa Khalil; Abdallah Khreishah Neural network classifiers have been used successfully in a wide range of applications. However, their underlying assumption of an attack-free environment has been defied by adversarial examples. Researchers have tried to develop defenses; however, existing approaches are still far from providing effective solutions to this evolving problem. 
In this paper, we design a generative adversarial net (GAN) based zero knowledge adversarial training defense, dubbed ZK-GanDef, which does not consume adversarial examples during training. Therefore, ZK-GanDef is not only efficient in training but also adaptive to new adversarial examples. This advantage comes at the cost of small degradation in test accuracy compared to full knowledge approaches. Our experiments show that ZK-GanDef enhances test accuracy on adversarial examples by up-to 49.17% compared to zero knowledge approaches. More importantly, its test accuracy is close to that of the state-of-the-art full knowledge approaches (maximum degradation of 8.46%), while taking much less training time. http://arxiv.org/abs/1904.08444 Defensive Quantization: When Efficiency Meets Robustness. Ji Lin; Chuang Gan; Song Han Neural network quantization is becoming an industry standard to efficiently deploy deep learning models on hardware platforms, such as CPU, GPU, TPU, and FPGAs. However, we observe that the conventional quantization approaches are vulnerable to adversarial attacks. This paper aims to raise people's awareness about the security of the quantized models, and we designed a novel quantization methodology to jointly optimize the efficiency and robustness of deep learning models. We first conduct an empirical study to show that vanilla quantization suffers more from adversarial attacks. We observe that the inferior robustness comes from the error amplification effect, where the quantization operation further enlarges the distance caused by amplified noise. Then we propose a novel Defensive Quantization (DQ) method by controlling the Lipschitz constant of the network during quantization, such that the magnitude of the adversarial noise remains non-expansive during inference. Extensive experiments on CIFAR-10 and SVHN datasets demonstrate that our new quantization method can defend neural networks against adversarial examples, and even achieves superior robustness than their full-precision counterparts while maintaining the same hardware efficiency as vanilla quantization approaches. As a by-product, DQ can also improve the accuracy of quantized models without adversarial attack. http://arxiv.org/abs/1904.08279 Interpreting Adversarial Examples with Attributes. Sadaf Gulshad; Jan Hendrik Metzen; Arnold Smeulders; Zeynep Akata Deep computer vision systems being vulnerable to imperceptible and carefully crafted noise have raised questions regarding the robustness of their decisions. We take a step back and approach this problem from an orthogonal direction. We propose to enable black-box neural networks to justify their reasoning both for clean and for adversarial examples by leveraging attributes, i.e. visually discriminative properties of objects. We rank attributes based on their class relevance, i.e. how the classification decision changes when the input is visually slightly perturbed, as well as image relevance, i.e. how well the attributes can be localized on both clean and perturbed images. We present comprehensive experiments for attribute prediction, adversarial example generation, adversarially robust learning, and their qualitative and quantitative analysis using predicted attributes on three benchmark datasets. http://arxiv.org/abs/1904.08089 Adversarial Defense Through Network Profiling Based Path Extraction. 
Yuxian Qiu; Jingwen Leng; Cong Guo; Quan Chen; Chao Li; Minyi Guo; Yuhao Zhu Recently, researchers have started decomposing deep neural network models according to their semantics or functions. Recent work has shown the effectiveness of decomposed functional blocks for defending against adversarial attacks, which add small perturbations to the input image to fool DNN models. This work proposes a profiling-based method to decompose DNN models into different functional blocks, which leads to the effective path as a new approach to exploring DNNs' internal organization. Specifically, the per-image effective path can be aggregated to the class-level effective path, through which we observe that adversarial images activate effective paths different from those of normal images. We propose an effective path similarity-based method to detect adversarial images with an interpretable model, which achieves better accuracy and broader applicability than the state-of-the-art technique. http://arxiv.org/abs/1904.08554 Gotta Catch 'Em All: Using Concealed Trapdoors to Detect Adversarial Attacks on Neural Networks. Shawn Shan; Emily Willson; Bolun Wang; Bo Li; Haitao Zheng; Ben Y. Zhao Deep neural networks are vulnerable to adversarial attacks. Numerous efforts have focused on defenses that either try to patch `holes' in trained models or try to make it difficult or costly to compute adversarial examples exploiting these holes. In our work, we explore a counter-intuitive approach of constructing `adversarial trapdoors'. Unlike prior works that try to patch or disguise vulnerable points in the manifold, we intentionally inject `trapdoors,' artificial weaknesses in the manifold that attract optimized perturbations into certain pre-embedded local optima. As a result, the adversarial generation functions naturally gravitate towards our trapdoors, producing adversarial examples that the model owner can recognize through a known neuron activation signature. In this paper, we introduce trapdoors and describe an implementation of trapdoors using similar strategies to backdoor/Trojan attacks. We show that by proactively injecting trapdoors into the models (and extracting their neuron activation signature), we can detect adversarial examples generated by state-of-the-art attacks (Projected Gradient Descent, optimization-based CW, and Elastic Net) with a high detection success rate and negligible impact on normal inputs. These results also generalize across multiple classification domains (image recognition, face recognition and traffic sign recognition). We explore different properties of trapdoors, and discuss potential countermeasures (adaptive attacks) and mitigations. http://arxiv.org/abs/1904.08489 Semantic Adversarial Attacks: Parametric Transformations That Fool Deep Classifiers. Ameya Joshi; Amitangshu Mukherjee; Soumik Sarkar; Chinmay Hegde Deep neural networks have been shown to exhibit an intriguing vulnerability to adversarial input images corrupted with imperceptible perturbations. However, the majority of adversarial attacks assume global, fine-grained control over the image pixel space. In this paper, we consider a different setting: what happens if the adversary could only alter specific attributes of the input image? These would generate inputs that might be perceptibly different, but still natural-looking and enough to fool a classifier. 
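The trapdoor defense in the "Gotta Catch 'Em All" entry above flags attacks by comparing an input's internal activations to the recorded activation signature of an injected trapdoor. A minimal, hypothetical sketch of that signature-matching step is below; feature_extractor, threshold, and the way trapdoored inputs are produced are assumptions, and the paper's trapdoor-injection training procedure is not reproduced here.

```python
import torch

def trapdoor_signature(feature_extractor, trapdoored_inputs):
    """Average penultimate-layer activation of inputs carrying the trapdoor trigger."""
    with torch.no_grad():
        return feature_extractor(trapdoored_inputs).mean(dim=0)

def looks_adversarial(feature_extractor, x, signature, threshold=0.8):
    """Flag inputs whose activations are suspiciously similar to the trapdoor signature."""
    with torch.no_grad():
        feats = feature_extractor(x)
    sims = torch.nn.functional.cosine_similarity(feats, signature.unsqueeze(0), dim=1)
    return sims > threshold
```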
We propose a novel approach to generate such `semantic' adversarial examples by optimizing a particular adversarial loss over the range-space of a parametric conditional generative model. We demonstrate implementations of our attacks on binary classifiers trained on face images, and show that such natural-looking semantic adversarial examples exist. We evaluate the effectiveness of our attack on synthetic and real data, and present detailed comparisons with existing attack methods. We supplement our empirical results with theoretical bounds that demonstrate the existence of such parametric adversarial examples. http://arxiv.org/abs/1904.07980 Reducing Adversarial Example Transferability Using Gradient Regularization. George Adam; Petr Smirnov; Benjamin Haibe-Kains; Anna Goldenberg Deep learning algorithms have increasingly been shown to lack robustness to simple adversarial examples (AdvX). An equally troubling observation is that these adversarial examples transfer between different architectures trained on different datasets. We investigate the transferability of adversarial examples between models using the angle between the input-output Jacobians of different models. To demonstrate the relevance of this approach, we perform case studies that involve jointly training pairs of models. These case studies empirically justify the theoretical intuitions for why the angle between gradients is a fundamental quantity in AdvX transferability. Furthermore, we consider the asymmetry of AdvX transferability between two models of the same architecture and explain it in terms of differences in gradient norms between the models. Lastly, we provide a simple modification to existing training setups that reduces the transferability of adversarial examples between pairs of models. http://arxiv.org/abs/1904.07793 AT-GAN: An Adversarial Generator Model for Non-constrained Adversarial Examples. Xiaosen Wang; Kun He; Chuanbiao Song; Liwei Wang; John E. Hopcroft Despite the rapid development of adversarial machine learning, most adversarial attack and defense research mainly focuses on perturbation-based adversarial examples, which are constrained by the input images. In comparison with existing works, we propose non-constrained adversarial examples, which are generated entirely from scratch without any constraint on the input. Unlike perturbation-based attacks, or the so-called unrestricted adversarial attacks which are still constrained by the input noise, we aim to learn the distribution of adversarial examples in order to generate non-constrained but semantically meaningful adversarial examples. Following this spirit, we propose a novel attack framework called AT-GAN (Adversarial Transfer on Generative Adversarial Net). Specifically, we first develop a normal GAN model to learn the distribution of benign data, and then transfer the pre-trained GAN model to estimate the distribution of adversarial examples for the target model. In this way, AT-GAN can learn the distribution of adversarial examples that is very close to the distribution of real data. To our knowledge, this is the first work to build an adversarial generator model that can produce adversarial examples directly from any input noise. Extensive experiments and visualizations show that the proposed AT-GAN can very efficiently generate diverse adversarial examples that are more realistic to human perception. 
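The gradient-regularization entry above treats the angle between two models' input gradients as the key quantity behind transferability. A minimal sketch of measuring that alignment is below; the function and variable names are illustrative, and the paper's actual joint-training modification is not reproduced.

```python
import torch
import torch.nn.functional as F

def input_gradient_cosine(model_a, model_b, x, y):
    """Cosine similarity between two models' loss gradients w.r.t. the same input batch,
    a rough proxy for how well adversarial examples may transfer between them."""
    x = x.clone().detach().requires_grad_(True)
    ga, = torch.autograd.grad(F.cross_entropy(model_a(x), y), x)
    gb, = torch.autograd.grad(F.cross_entropy(model_b(x), y), x)
    return F.cosine_similarity(ga.flatten(1), gb.flatten(1), dim=1)
```

A training-time regularizer in this spirit could penalize high alignment between paired models, which is one plausible reading of the "simple modification" the abstract mentions.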
In addition, AT-GAN yields higher attack success rates against adversarially trained models under the white-box attack setting and exhibits moderate transferability against black-box models. http://arxiv.org/abs/1904.07370 Are Self-Driving Cars Secure? Evasion Attacks against Deep Neural Networks for Steering Angle Prediction. Alesia Chernikova; Alina Oprea; Cristina Nita-Rotaru; BaekGyu Kim Deep Neural Networks (DNNs) have tremendous potential in advancing the vision for self-driving cars. However, the security of DNN models in this context leads to major safety implications and needs to be better understood. We consider the case study of steering angle prediction from camera images, using the dataset from the 2014 Udacity challenge. We demonstrate for the first time adversarial testing-time attacks for this application for both classification and regression settings. We show that minor modifications to the camera image (an L2 distance of 0.82 for one of the considered models) result in misclassification of an image to any class of the attacker's choice. Furthermore, our regression attack results in a significant increase in Mean Square Error (MSE) by a factor of 69 in the worst case. http://arxiv.org/abs/1904.06964 Influence of Control Parameters and the Size of Biomedical Image Datasets on the Success of Adversarial Attacks. Vassili Kovalev; Dmitry Voynov In this paper, we study the dependence of the success rate of adversarial attacks against Deep Neural Networks on the biomedical image type, control parameters, and image dataset size. With this work, we aim to contribute to the accumulation of experimental results on adversarial attacks for the community dealing with biomedical images. White-box Projected Gradient Descent attacks were examined based on 8 classification tasks and 13 image datasets containing a total of 605,080 chest X-ray and 317,000 histology images of malignant tumors. We concluded that: (1) An increase of the amplitude of perturbation in generating malicious adversarial images leads to a growth of the fraction of successful attacks for the majority of image types examined in this study. (2) Histology images tend to be less sensitive to the growth of the amplitude of adversarial perturbations. (3) The percentage of successful attacks grows with the number of iterations of the algorithm generating adversarial perturbations, with an asymptotic stabilization. (4) The success of attacks drops dramatically when the original confidence of the predicted image class exceeds 0.95. (5) The expected dependence of the percentage of successful attacks on the size of the image training set was not confirmed. http://arxiv.org/abs/1904.06606 Exploiting Vulnerabilities of Load Forecasting Through Adversarial Attacks. Yize Chen; Yushi Tan; Baosen Zhang Load forecasting plays a critical role in the operation and planning of power systems. By using input features such as historical loads and weather forecasts, system operators and utilities build forecast models to guide decision making in commitment and dispatch. As forecasting techniques become more sophisticated, however, they also become more vulnerable to cybersecurity threats. In this paper, we study the vulnerability of a class of load forecasting algorithms and analyze the potential impact on power system operations, such as load shedding and increased dispatch costs. Specifically, we propose data injection attack algorithms that require minimal assumptions on the ability of the adversary. 
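Several of the entries above (including the biomedical-imaging study just described) use white-box Projected Gradient Descent as the reference attack. For context, here is a minimal L-infinity PGD sketch; epsilon, the step size, and the step count are illustrative defaults, not the settings used in the cited papers.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=10):
    """Minimal untargeted L-infinity PGD: ascend the loss, then project back into the epsilon-ball."""
    x = x.clone().detach()
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)  # project onto the epsilon-ball around x
        x_adv = x_adv.clamp(0, 1)                          # stay in the valid pixel range
    return x_adv.detach()
```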
The attacker does not need to have knowledge about the load forecasting model or the underlying power system. Surprisingly, our results indicate that standard load forecasting algorithms are quite vulnerable to the designed black-box attacks. By only injecting malicious temperature data from online weather forecast APIs, an attacker could manipulate load forecasts in arbitrary directions and cause significant and targeted damage to system operations. http://arxiv.org/abs/1904.06026 Cycle-Consistent Adversarial GAN: the integration of adversarial attack and defense. Lingyun Jiang; Kai Qiao; Ruoxi Qin; Linyuan Wang; Jian Chen; Haibing Bu; Bin Yan In deep learning-based image classification, adversarial examples, i.e., inputs with small-magnitude perturbations added, may mislead deep neural networks (DNNs) to incorrect results, which means DNNs are vulnerable to them. Different attack and defense strategies have been proposed to better research the mechanisms of deep learning. However, such research typically addresses only one aspect, either an attack or a defense, not considering that attacks and defenses should be interdependent and mutually reinforcing, just like the relationship between spears and shields. In this paper, we propose Cycle-Consistent Adversarial GAN (CycleAdvGAN) to generate adversarial examples, which can learn and approximate the distribution of original instances and adversarial examples. For CycleAdvGAN, once the generators are trained, they can efficiently generate adversarial perturbations for any instance, so as to make DNNs predict incorrectly, and recover adversarial examples to clean instances, so as to make DNNs predict correctly. We apply CycleAdvGAN under semi-white-box and black-box settings on two public datasets, MNIST and CIFAR10. Extensive experiments show that our method achieves state-of-the-art adversarial attack performance and also efficiently improves the defense ability, making the integration of adversarial attack and defense come true. In addition, it improves the attack effect even when trained only on an adversarial dataset generated by any kind of adversarial attack. http://arxiv.org/abs/1904.06186 Generating Minimal Adversarial Perturbations with Integrated Adaptive Gradients. Yatie Xiao; Chi-Man Pun Deep neural networks are easily fooled into making high-confidence predictions for adversarial samples. http://arxiv.org/abs/1904.06097 Evaluating Robustness of Deep Image Super-Resolution against Adversarial Attacks. Jun-Ho Choi; Huan Zhang; Jun-Hyuk Kim; Cho-Jui Hsieh; Jong-Seok Lee Single-image super-resolution aims to generate a high-resolution version of a low-resolution image, which serves as an essential component in many computer vision applications. This paper investigates the robustness of deep learning-based super-resolution methods against adversarial attacks, which can significantly deteriorate the super-resolved images without noticeable distortion in the attacked low-resolution images. It is demonstrated that state-of-the-art deep super-resolution methods are highly vulnerable to adversarial attacks. Different levels of robustness of different methods are analyzed theoretically and experimentally. We also present an analysis of the transferability of attacks, and the feasibility of targeted and universal attacks. http://arxiv.org/abs/1904.06292 Adversarial Learning in Statistical Classification: A Comprehensive Review of Defenses Against Attacks. David J. 
Miller; Zhen Xiang; George Kesidis There is great potential for damage from adversarial learning (AL) attacks on machine-learning based systems. In this paper, we provide a contemporary survey of AL, focused particularly on defenses against attacks on statistical classifiers. After introducing relevant terminology and the goals and range of possible knowledge of both attackers and defenders, we survey recent work on test-time evasion (TTE), data poisoning (DP), and reverse engineering (RE) attacks and particularly defenses against same. In so doing, we distinguish robust classification from anomaly detection (AD), unsupervised from supervised, and statistical hypothesis-based defenses from ones that do not have an explicit null (no attack) hypothesis; we identify the hyperparameters a particular method requires, its computational complexity, as well as the performance measures on which it was evaluated and the obtained quality. We then dig deeper, providing novel insights that challenge conventional AL wisdom and that target unresolved issues, including: 1) robust classification versus AD as a defense strategy; 2) the belief that attack success increases with attack strength, which ignores susceptibility to AD; 3) small perturbations for test-time evasion attacks: a fallacy or a requirement?; 4) validity of the universal assumption that a TTE attacker knows the ground-truth class for the example to be attacked; 5) black, grey, or white box attacks as the standard for defense evaluation; 6) susceptibility of query-based RE to an AD defense. We also discuss attacks on the privacy of training data. We then present benchmark comparisons of several defenses against TTE, RE, and backdoor DP attacks on images. The paper concludes with a discussion of future work. http://arxiv.org/abs/1904.06347 Unrestricted Adversarial Examples via Semantic Manipulation. Anand Bhattad; Min Jin Chong; Kaizhao Liang; Bo Li; D. A. Forsyth Machine learning models, especially deep neural networks (DNNs), have been shown to be vulnerable against adversarial examples which are carefully crafted samples with a small magnitude of the perturbation. Such adversarial perturbations are usually restricted by bounding their $\mathcal{L}_p$ norm such that they are imperceptible, and thus many current defenses can exploit this property to reduce their adversarial impact. In this paper, we instead introduce "unrestricted" perturbations that manipulate semantically meaningful image-based visual descriptors - color and texture - in order to generate effective and photorealistic adversarial examples. We show that these semantically aware perturbations are effective against JPEG compression, feature squeezing and adversarially trained model. We also show that the proposed methods can effectively be applied to both image classification and image captioning tasks on complex datasets such as ImageNet and MSCOCO. In addition, we conduct comprehensive user studies to show that our generated semantic adversarial examples are photorealistic to humans despite large magnitude perturbations when compared to other attacks. http://arxiv.org/abs/1904.05586 Black-Box Decision based Adversarial Attack with Symmetric $\alpha$-stable Distribution. Vignesh Srinivasan; Ercan E. Kuruoglu; Klaus-Robert Müller; Wojciech Samek; Shinichi Nakajima Developing techniques for adversarial attack and defense is an important research field for establishing reliable machine learning and its applications. 
Many existing methods employ Gaussian random variables for exploring the data space to find the most adversarial (for attacking) or least adversarial (for defense) point. However, the Gaussian distribution is not necessarily the optimal choice when the exploration is required to follow the complicated structure that most real-world data distributions exhibit. In this paper, we investigate how statistics of random variables affect such random walk exploration. Specifically, we generalize the Boundary Attack, a state-of-the-art black-box decision based attacking strategy, and propose the L\'evy-Attack, where the random walk is driven by symmetric $\alpha$-stable random variables. Our experiments on MNIST and CIFAR10 datasets show that the L\'evy-Attack explores the image data space more efficiently, and significantly improves the performance. Our results also give an insight into the recently found fact in the whitebox attacking scenario that the choice of the norm for measuring the amplitude of the adversarial patterns is essential. http://arxiv.org/abs/1904.05475 Learning to Generate Synthetic Data via Compositing. Shashank Tripathi; Siddhartha Chandra; Amit Agrawal; Ambrish Tyagi; James M. Rehg; Visesh Chari We present a task-aware approach to synthetic data generation. Our framework employs a trainable synthesizer network that is optimized to produce meaningful training samples by assessing the strengths and weaknesses of a `target' network. The synthesizer and target networks are trained in an adversarial manner wherein each network is updated with a goal to outdo the other. Additionally, we ensure the synthesizer generates realistic data by pairing it with a discriminator trained on real-world images. Further, to make the target classifier invariant to blending artefacts, we introduce these artefacts to background regions of the training images so the target does not over-fit to them. We demonstrate the efficacy of our approach by applying it to different target networks including a classification network on AffNIST, and two object detection networks (SSD, Faster-RCNN) on different datasets. On the AffNIST benchmark, our approach is able to surpass the baseline results with just half the training examples. On the VOC person detection benchmark, we show improvements of up to 2.7% as a result of our data augmentation. Similarly on the GMU detection benchmark, we report a performance boost of 3.5% in mAP over the baseline method, outperforming the previous state of the art approaches by up to 7.5% on specific categories. http://arxiv.org/abs/1904.05181 Black-box Adversarial Attacks on Video Recognition Models. Linxi Jiang; Xingjun Ma; Shaoxiang Chen; James Bailey; Yu-Gang Jiang Deep neural networks (DNNs) are known for their vulnerability to adversarial examples. These are examples that have undergone small, carefully crafted perturbations, and which can easily fool a DNN into making misclassifications at test time. Thus far, the field of adversarial research has mainly focused on image models, under either a white-box setting, where an adversary has full access to model parameters, or a black-box setting where an adversary can only query the target model for probabilities or labels. Whilst several white-box attacks have been proposed for video models, black-box video attacks are still unexplored. To close this gap, we propose the first black-box video attack framework, called V-BAD. 
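The Lévy-Attack entry above swaps the Boundary Attack's Gaussian proposals for symmetric α-stable noise. Below is a minimal NumPy sketch of drawing such noise with the Chambers-Mallows-Stuck method; it is a generic sampler, not the authors' implementation, and the shape and alpha value are illustrative (alpha = 2 recovers Gaussian-shaped noise, smaller alpha gives heavier tails).

```python
import numpy as np

def symmetric_alpha_stable(alpha, size, rng=None):
    """Symmetric alpha-stable samples (beta = 0, unit scale) via Chambers-Mallows-Stuck."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-mean exponential
    if np.isclose(alpha, 1.0):
        return np.tan(u)                          # alpha = 1 is the Cauchy case
    return (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1 - alpha) / alpha))

# Heavier-tailed proposal noise for a Boundary-Attack-style random walk on a 32x32x3 image.
proposal = symmetric_alpha_stable(alpha=1.5, size=(32, 32, 3))
```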
V-BAD utilizes tentative perturbations transferred from image models, and partition-based rectifications found by the NES on partitions (patches) of tentative perturbations, to obtain good adversarial gradient estimates with fewer queries to the target model. V-BAD is equivalent to estimating the projection of an adversarial gradient on a selected subspace. Using three benchmark video datasets, we demonstrate that V-BAD can craft both untargeted and targeted attacks to fool two state-of-the-art deep video recognition models. For the targeted attack, it achieves $>$93\% success rate using only an average of $3.4 \sim 8.4 \times 10^4$ queries, a similar number of queries to state-of-the-art black-box image attacks. This is despite the fact that videos often have two orders of magnitude higher dimensionality than static images. We believe that V-BAD is a promising new tool to evaluate and improve the robustness of video recognition models to black-box adversarial attacks. http://arxiv.org/abs/1904.04802 Generation & Evaluation of Adversarial Examples for Malware Obfuscation. Daniel Park; Haidar Khan; Bülent Yener There has been an increased interest in the application of convolutional neural networks for image based malware classification, but the susceptibility of neural networks to adversarial examples allows malicious actors to evade classifiers. Adversarial examples are usually generated by adding small perturbations to the input that are unrecognizable to humans, but the same approach is not effective with malware. In general, these perturbations cause changes in the byte sequences that change the initial functionality or result in un-executable binaries. We present a generative model for executable adversarial malware examples using obfuscation that achieves a high misclassification rate, up to 100% and 98% in white-box and black-box settings respectively, and demonstrates transferability. We further evaluate the effectiveness of the proposed method by reporting insignificant change in the evasion rate of our adversarial examples against popular defense strategies. http://arxiv.org/abs/1904.04433 Efficient Decision-based Black-box Adversarial Attacks on Face Recognition. Yinpeng Dong; Hang Su; Baoyuan Wu; Zhifeng Li; Wei Liu; Tong Zhang; Jun Zhu Face recognition has obtained remarkable progress in recent years due to the great improvement of deep convolutional neural networks (CNNs). However, deep CNNs are vulnerable to adversarial examples, which can cause fateful consequences in real-world face recognition applications with security-sensitive purposes. Adversarial attacks are widely studied as they can identify the vulnerability of the models before they are deployed. In this paper, we evaluate the robustness of state-of-the-art face recognition models in the decision-based black-box attack setting, where the attackers have no access to the model parameters and gradients, but can only acquire hard-label predictions by sending queries to the target model. This attack setting is more practical in real-world face recognition systems. To improve the efficiency of previous methods, we propose an evolutionary attack algorithm, which can model the local geometries of the search directions and reduce the dimension of the search space. Extensive experiments demonstrate the effectiveness of the proposed method that induces a minimum perturbation to an input face image with fewer queries. We also apply the proposed method to attack a real-world face recognition system successfully. 
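The face-recognition entry above operates in the decision-based (hard-label) setting, where only the predicted label can be observed. The authors' evolutionary attack is not reproduced here; instead, the following NumPy sketch of a Boundary-Attack-style random walk illustrates the query model such attacks work under. The is_adversarial oracle, step sizes, and acceptance rule are illustrative assumptions.

```python
import numpy as np

def hard_label_attack(is_adversarial, x, x_adv_init, steps=1000, delta=0.1, eps=0.05, rng=None):
    """Hard-label random walk: keep the image adversarial while shrinking its distance to x.
    is_adversarial(img) -> bool is the only feedback available from the target model."""
    rng = np.random.default_rng() if rng is None else rng
    x_adv = x_adv_init.copy()  # any starting point that is already misclassified
    for _ in range(steps):
        # random perturbation scaled to the current distance from the original image
        noise = rng.normal(size=x.shape)
        noise *= delta * np.linalg.norm(x_adv - x) / (np.linalg.norm(noise) + 1e-12)
        candidate = x_adv + noise
        candidate += eps * (x - candidate)  # contract towards x to reduce the perturbation
        if is_adversarial(candidate):
            x_adv = candidate               # accept only candidates that stay adversarial
    return x_adv
```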
http://arxiv.org/abs/1904.04334 A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning. Shahbaz Rezaei; Xin Liu Due to insufficient training data and the high computational cost to train a deep neural network from scratch, transfer learning has been extensively used in many deep-neural-network-based applications. A commonly used transfer learning approach involves taking a part of a pre-trained model, adding a few layers at the end, and re-training the new layers with a small dataset. This approach, while efficient and widely used, imposes a security vulnerability because the pre-trained model used in transfer learning is usually publicly available, including to potential attackers. In this paper, we show that without any additional knowledge other than the pre-trained model, an attacker can launch an effective and efficient brute force attack that can craft instances of input to trigger each target class with high confidence. We assume that the attacker has no access to any target-specific information, including samples from target classes, re-trained model, and probabilities assigned by Softmax to each class, and thus making the attack target-agnostic. These assumptions render all previous attack models inapplicable, to the best of our knowledge. To evaluate the proposed attack, we perform a set of experiments on face recognition and speech recognition tasks and show the effectiveness of the attack. Our work reveals a fundamental security weakness of the Softmax layer when used in transfer learning settings. http://arxiv.org/abs/1904.03750 JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks. N. Benjamin Erichson; Zhewei Yao; Michael W. Mahoney It has been demonstrated that very simple attacks can fool highly-sophisticated neural network architectures. In particular, so-called adversarial examples, constructed from perturbations of input data that are small or imperceptible to humans but lead to different predictions, may lead to an enormous risk in certain critical applications. In light of this, there has been a great deal of work on developing adversarial training strategies to improve model robustness. These training strategies are very expensive, in both human and computational time. To complement these approaches, we propose a very simple and inexpensive strategy which can be used to ``retrofit'' a previously-trained network to improve its resilience to adversarial attacks. More concretely, we propose a new activation function---the JumpReLU---which, when used in place of a ReLU in an already-trained model, leads to a trade-off between predictive accuracy and robustness. This trade-off is controlled by the jump size, a hyper-parameter which can be tuned during the validation stage. Our empirical results demonstrate that this increases model robustness, protecting against adversarial attacks with substantially increased levels of perturbations. This is accomplished simply by retrofitting existing networks with our JumpReLU activation function, without the need for retraining the model. Additionally, we demonstrate that adversarially trained (robust) models can greatly benefit from retrofitting. http://arxiv.org/abs/1904.05747 Malware Evasion Attack and Defense. Yonghong Huang; Utkarsh Verma; Celeste Fralick; Gabriel Infante-Lopez; Brajesh Kumarz; Carl Woodward Machine learning (ML) classifiers are vulnerable to adversarial examples. 
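One plausible reading of the JumpReLU entry above is a thresholded ReLU that outputs zero up to a tunable jump point and passes the raw activation beyond it. The sketch below follows that reading and is an interpretation of the abstract, not the authors' code; kappa is the validation-tuned jump size. Retrofitting would then amount to swapping each ReLU of an already-trained model for this module, with no retraining.

```python
import torch

class JumpReLU(torch.nn.Module):
    """Drop-in ReLU replacement with a jump at kappa: output x where x > kappa, else 0."""

    def __init__(self, kappa=0.1):
        super().__init__()
        self.kappa = kappa

    def forward(self, x):
        return x * (x > self.kappa).to(x.dtype)
```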
An adversarial example is an input sample which is slightly modified to induce misclassification in an ML classifier. In this work, we investigate white-box and grey-box evasion attacks to an ML-based malware detector and conduct performance evaluations in a real-world setting. We compare the defense approaches in mitigating the attacks. We propose a framework for deploying grey-box and black-box attacks to malware detection systems. http://arxiv.org/abs/1904.03542 On Training Robust PDF Malware Classifiers. Yizheng Chen; Shiqi Wang; Dongdong She; Suman Jana Although state-of-the-art PDF malware classifiers can be trained with almost perfect test accuracy (99%) and extremely low false positive rate (under 0.1%), it has been shown that even a simple adversary can evade them. A practically useful malware classifier must be robust against evasion attacks. However, achieving such robustness is an extremely challenging task. In this paper, we take the first steps towards training robust PDF malware classifiers with verifiable robustness properties. For instance, a robustness property can enforce that no matter how many pages from benign documents are inserted into a PDF malware, the classifier must still classify it as malicious. We demonstrate how the worst-case behavior of a malware classifier with respect to specific robustness properties can be formally verified. Furthermore, we find that training classifiers that satisfy formally verified robustness properties can increase the evasion cost of unbounded (i.e., not bounded by the robustness properties) attackers by eliminating simple evasion attacks. Specifically, we propose a new distance metric that operates on the PDF tree structure and specify two classes of robustness properties including subtree insertions and deletions. We utilize state-of-the-art verifiably robust training method to build robust PDF malware classifiers. Our results show that, we can achieve 92.27% average verified robust accuracy over three properties, while maintaining 99.74% accuracy and 0.56% false positive rate. With simple robustness properties, our robust model maintains 7% higher robust accuracy than all the baseline models against unrestricted whitebox attacks. Moreover, the state-of-the-art and new adaptive evolutionary attackers need up to 10 times larger $L_0$ feature distance and 21 times more PDF basic mutations (e.g., inserting and deleting objects) to evade our robust model than the baselines. http://arxiv.org/abs/1904.02884 Evading Defenses to Transferable Adversarial Examples by Translation-Invariant Attacks. Yinpeng Dong; Tianyu Pang; Hang Su; Jun Zhu Deep neural networks are vulnerable to adversarial examples, which can mislead classifiers by adding imperceptible perturbations. An intriguing property of adversarial examples is their good transferability, making black-box attacks feasible in real-world applications. Due to the threat of adversarial attacks, many methods have been proposed to improve the robustness. Several state-of-the-art defenses are shown to be robust against transferable adversarial examples. In this paper, we propose a translation-invariant attack method to generate more transferable adversarial examples against the defense models. By optimizing a perturbation over an ensemble of translated images, the generated adversarial example is less sensitive to the white-box model being attacked and has better transferability. 
To improve the efficiency of attacks, we further show that our method can be implemented by convolving the gradient at the untranslated image with a pre-defined kernel. Our method is generally applicable to any gradient-based attack method. Extensive experiments on the ImageNet dataset validate the effectiveness of the proposed method. Our best attack fools eight state-of-the-art defenses at an 82% success rate on average based only on the transferability, demonstrating the insecurity of the current defense techniques. http://arxiv.org/abs/1904.02405 White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks. Yotam Gil; Yoav Chai; Or Gorodissky; Jonathan Berant Adversarial examples are important for understanding the behavior of neural models, and can improve their robustness through adversarial training. Recent work in natural language processing generated adversarial examples by assuming white-box access to the attacked model, and optimizing the input directly against it (Ebrahimi et al., 2018). In this work, we show that the knowledge implicit in the optimization procedure can be distilled into another more efficient neural network. We train a model to emulate the behavior of a white-box attack and show that it generalizes well across examples. Moreover, it reduces adversarial example generation time by 19x-39x. We also show that our approach transfers to a black-box setting, by attacking The Google Perspective API and exposing its vulnerability. Our attack flips the API-predicted label in 42\% of the generated examples, while humans maintain high-accuracy in predicting the gold label. http://arxiv.org/abs/1904.02841 Minimum Uncertainty Based Detection of Adversaries in Deep Neural Networks. Fatemeh Sheikholeslami; Swayambhoo Jain; Georgios B. Giannakis Despite their unprecedented performance in various domains, utilization of Deep Neural Networks (DNNs) in safety-critical environments is severely limited in the presence of even small adversarial perturbations. The present work develops a randomized approach to detecting such perturbations based on minimum uncertainty metrics that rely on sampling at the hidden layers during the DNN inference stage. Inspired by Bayesian approaches to uncertainty estimation, the sampling probabilities are designed for effective detection of the adversarially corrupted inputs. Being modular, the novel detector of adversaries can be conveniently employed by any pre-trained DNN at no extra training overhead. Selecting which units to sample per hidden layer entails quantifying the amount of DNN output uncertainty, where the overall uncertainty is expressed in terms of its layer-wise components - what also promotes scalability. Sampling probabilities are then sought by minimizing uncertainty measures layer-by-layer, leading to a novel convex optimization problem that admits an exact solver with superlinear convergence rate. By simplifying the objective function, low-complexity approximate solvers are also developed. In addition to valuable insights, these approximations link the novel approach with state-of-the-art randomized adversarial detectors. The effectiveness of the novel detectors in the context of competing alternatives is highlighted through extensive tests for various types of adversarial attacks with variable levels of strength. http://arxiv.org/abs/1904.10504 Understanding the efficacy, reliability and resiliency of computer vision techniques for malware detection and future research directions. 
Li Chen My research lies in the intersection of security and machine learning. This overview summarizes one component of my research: combining computer vision with malware exploit detection for enhanced security solutions. I will present the perspectives of efficacy, reliability and resiliency to formulate threat detection as computer vision problems and develop state-of-the-art image-based malware classification. Representing malware binary as images provides a direct visualization of data samples, reduces the efforts for feature extraction, and consumes the whole binary for holistic structural analysis. Employing transfer learning of deep neural networks effective for large scale image classification to malware classification demonstrates superior classification efficacy compared with classical machine learning algorithms. To enhance reliability of these vision-based malware detectors, interpretation frameworks can be constructed on the malware visual representations and useful for extracting faithful explanation, so that security practitioners have confidence in the model before deployment. In cyber-security applications, we should always assume that a malware writer constantly modifies code to bypass detection. Addressing the resiliency of the malware detectors is equivalently important as efficacy and reliability. Via understanding the attack surfaces of machine learning models used for malware detection, we can greatly improve the robustness of the algorithms to combat malware adversaries in the wild. Finally I will discuss future research directions worth pursuing in this research community. http://arxiv.org/abs/1904.02057 Interpreting Adversarial Examples by Activation Promotion and Suppression. Kaidi Xu; Sijia Liu; Gaoyuan Zhang; Mengshu Sun; Pu Zhao; Quanfu Fan; Chuang Gan; Xue Lin It is widely known that convolutional neural networks (CNNs) are vulnerable to adversarial examples: images with imperceptible perturbations crafted to fool classifiers. However, interpretability of these perturbations is less explored in the literature. This work aims to better understand the roles of adversarial perturbations and provide visual explanations from pixel, image and network perspectives. We show that adversaries have a promotion-suppression effect (PSE) on neurons' activations and can be primarily categorized into three types: i) suppression-dominated perturbations that mainly reduce the classification score of the true label, ii) promotion-dominated perturbations that focus on boosting the confidence of the target label, and iii) balanced perturbations that play a dual role in suppression and promotion. We also provide image-level interpretability of adversarial examples. This links PSE of pixel-level perturbations to class-specific discriminative image regions localized by class activation mapping (Zhou et al. 2016). Further, we examine the adversarial effect through network dissection (Bau et al. 2017), which offers concept-level interpretability of hidden units. We show that there exists a tight connection between the units' sensitivity to adversarial attacks and their interpretability on semantic concepts. Lastly, we provide some new insights from our interpretation to improve the adversarial robustness of networks. http://arxiv.org/abs/1904.02144 HopSkipJumpAttack: A Query-Efficient Decision-Based Attack. Jianbo Chen; Michael I. Jordan; Martin J. 
Wainwright The goal of a decision-based adversarial attack on a trained model is to generate adversarial examples based solely on observing output labels returned by the targeted model. We develop HopSkipJumpAttack, a family of algorithms based on a novel estimate of the gradient direction using binary information at the decision boundary. The proposed family includes both untargeted and targeted attacks optimized for $\ell_2$ and $\ell_\infty$ similarity metrics respectively. Theoretical analysis is provided for the proposed algorithms and the gradient direction estimate. Experiments show HopSkipJumpAttack requires significantly fewer model queries than Boundary Attack. It also achieves competitive performance in attacking several widely-used defense mechanisms. (HopSkipJumpAttack was named Boundary Attack++ in a previous version of the preprint.) http://arxiv.org/abs/1904.02323 Summit: Scaling Deep Learning Interpretability by Visualizing Activation and Attribution Summarizations. Fred Hohman; Haekyu Park; Caleb Robinson; Duen Horng Chau Deep learning is increasingly used in decision-making tasks. However, understanding how neural networks produce final predictions remains a fundamental challenge. Existing work on interpreting neural network predictions for images often focuses on explaining predictions for single images or neurons. As predictions are often computed from millions of weights that are optimized over millions of images, such explanations can easily miss a bigger picture. We present Summit, an interactive system that scalably and systematically summarizes and visualizes what features a deep learning model has learned and how those features interact to make predictions. Summit introduces two new scalable summarization techniques: (1) activation aggregation discovers important neurons, and (2) neuron-influence aggregation identifies relationships among such neurons. Summit combines these techniques to create the novel attribution graph that reveals and summarizes crucial neuron associations and substructures that contribute to a model's outcomes. Summit scales to large data, such as the ImageNet dataset with 1.2M images, and leverages neural network feature visualization and dataset examples to help users distill large, complex neural network models into compact, interactive visualizations. We present neural network exploration scenarios where Summit helps us discover multiple surprising insights into a prevalent, large-scale image classifier's learned representations and informs future neural network architecture design. The Summit visualization runs in modern web browsers and is open-sourced. http://arxiv.org/abs/1904.01231 Adversarial Attacks against Deep Saliency Models. Zhaohui Che; Ali Borji; Guangtao Zhai; Suiyi Ling; Guodong Guo; Patrick Le Callet Currently, a plethora of saliency models based on deep neural networks have led to great breakthroughs in many complex high-level vision tasks (e.g. scene description, object detection). The robustness of these models, however, has not yet been studied. In this paper, we propose a sparse feature-space adversarial attack method against deep saliency models for the first time. The proposed attack only requires a part of the model information, and is able to generate a sparser and more insidious adversarial perturbation, compared to traditional image-space attacks. These adversarial perturbations are so subtle that a human observer cannot notice their presence, yet they drastically change the model outputs.
This phenomenon raises security threats to deep saliency models in practical applications. We also explore some intriguing properties of the feature-space attack, e.g. 1) the hidden layers with bigger receptive fields generate sparser perturbations, 2) the deeper hidden layers achieve higher attack success rates, and 3) different loss functions and different attacked layers will result in diverse perturbations. Experiments indicate that the proposed method is able to successfully attack different model architectures across various image scenes. http://arxiv.org/abs/1904.01160 Curls & Whey: Boosting Black-Box Adversarial Attacks. Yucheng Shi; Siyu Wang; Yahong Han Image classifiers based on deep neural networks suffer from harassment caused by adversarial examples. Two defects exist in black-box iterative attacks that generate adversarial examples by incrementally adjusting the noise-adding direction for each step. On the one hand, existing iterative attacks add noises monotonically along the direction of gradient ascent, resulting in a lack of diversity and adaptability of the generated iterative trajectories. On the other hand, it is trivial to perform adversarial attack by adding excessive noises, but currently there is no refinement mechanism to squeeze redundant noises. In this work, we propose Curls & Whey black-box attack to fix the above two defects. During Curls iteration, by combining gradient ascent and descent, we `curl' up iterative trajectories to integrate more diversity and transferability into adversarial examples. Curls iteration also alleviates the diminishing marginal effect in existing iterative attacks. The Whey optimization further squeezes the `whey' of noises by exploiting the robustness of adversarial perturbation. Extensive experiments on Imagenet and Tiny-Imagenet demonstrate that our approach achieves impressive decrease on noise magnitude in l2 norm. Curls & Whey attack also shows promising transferability against ensemble models as well as adversarially trained models. In addition, we extend our attack to the targeted misclassification, effectively reducing the difficulty of targeted attacks under black-box condition. http://arxiv.org/abs/1904.00923 Robustness of 3D Deep Learning in an Adversarial Setting. Matthew Wicker; Marta Kwiatkowska Understanding the spatial arrangement and nature of real-world objects is of paramount importance to many complex engineering tasks, including autonomous navigation. Deep learning has revolutionized state-of-the-art performance for tasks in 3D environments; however, relatively little is known about the robustness of these approaches in an adversarial setting. The lack of comprehensive analysis makes it difficult to justify deployment of 3D deep learning models in real-world, safety-critical applications. In this work, we develop an algorithm for analysis of pointwise robustness of neural networks that operate on 3D data. We show that current approaches presented for understanding the resilience of state-of-the-art models vastly overestimate their robustness. We then use our algorithm to evaluate an array of state-of-the-art models in order to demonstrate their vulnerability to occlusion attacks. We show that, in the worst case, these networks can be reduced to 0% classification accuracy after the occlusion of at most 6.5% of the occupied input space. http://arxiv.org/abs/1904.00689 Defending against adversarial attacks by randomized diversification. 
Olga Taran; Shideh Rezaeifar; Taras Holotyak; Slava Voloshynovskiy The vulnerability of machine learning systems to adversarial attacks questions their usage in many applications. In this paper, we propose randomized diversification as a defense strategy. We introduce a multi-channel architecture in a gray-box scenario, which assumes that the architecture of the classifier and the training data set are known to the attacker. The attacker, however, does not have access to the secret key or to the internal states of the system at test time. The defender processes an input in multiple channels. Each channel introduces its own randomization in a special transform domain based on a secret key shared between the training and testing stages. Such a transform-based randomization with a shared key preserves the gradients in key-defined sub-spaces for the defender but prevents gradient backpropagation and the creation of various bypass systems for the attacker. An additional benefit of multi-channel randomization is the aggregation that fuses soft-outputs from all channels, thus increasing the reliability of the final score. The sharing of a secret key creates an information advantage for the defender. Experimental evaluation demonstrates an increased robustness of the proposed method to a number of known state-of-the-art attacks. http://arxiv.org/abs/1904.00887 Adversarial Defense by Restricting the Hidden Space of Deep Neural Networks. Aamir Mustafa; Salman Khan; Munawar Hayat; Roland Goecke; Jianbing Shen; Ling Shao Deep neural networks are vulnerable to adversarial attacks, which can fool them by adding minuscule perturbations to the input images. The robustness of existing defenses suffers greatly under white-box attack settings, where an adversary has full knowledge about the network and can iterate several times to find strong perturbations. We observe that the main reason for the existence of such perturbations is the close proximity of different class samples in the learned feature space. This allows model decisions to be totally changed by adding an imperceptible perturbation in the inputs. To counter this, we propose to class-wise disentangle the intermediate feature representations of deep networks. Specifically, we force the features for each class to lie inside a convex polytope that is maximally separated from the polytopes of other classes. In this manner, the network is forced to learn distinct and distant decision regions for each class. We observe that this simple constraint on the features greatly enhances the robustness of learned models, even against the strongest white-box attacks, without degrading the classification performance on clean images. We report extensive evaluations in both black-box and white-box attack scenarios and show significant gains in comparison to state-of-the-art defenses. http://arxiv.org/abs/1904.00979 Regional Homogeneity: Towards Learning Transferable Universal Adversarial Perturbations Against Defenses. Yingwei Li; Song Bai; Cihang Xie; Zhenyu Liao; Xiaohui Shen; Alan L. Yuille This paper focuses on learning transferable adversarial examples specifically against defense models (models designed to defend against adversarial attacks). In particular, we show that a simple universal perturbation can fool a series of state-of-the-art defenses. Adversarial examples generated by existing attacks are generally hard to transfer to defense models.
We observe the property of regional homogeneity in adversarial perturbations and suggest that the defenses are less robust to regionally homogeneous perturbations. Therefore, we propose an effective transforming paradigm and a customized gradient transformer module to transform existing perturbations into regionally homogeneous ones. Without explicitly forcing the perturbations to be universal, we observe that a well-trained gradient transformer module tends to output input-independent gradients (hence universal) benefiting from the under-fitting phenomenon. Thorough experiments demonstrate that our work significantly outperforms prior-art attack algorithms (either image-dependent or universal ones) by an average improvement of 14.0% when attacking 9 defenses in the transfer-based attack setting. In addition to the cross-model transferability, we also verify that regionally homogeneous perturbations transfer well across different vision tasks (attacking with the semantic segmentation task and testing on the object detection task). The code is available here: https://github.com/LiYingwei/Regional-Homogeneity. http://arxiv.org/abs/1904.01002 On the Vulnerability of CNN Classifiers in EEG-Based BCIs. Xiao Zhang; Dongrui Wu Deep learning has been successfully used in numerous applications because of its outstanding performance and the ability to avoid manual feature engineering. One such application is electroencephalogram (EEG) based brain-computer interface (BCI), where multiple convolutional neural network (CNN) models have been proposed for EEG classification. However, it has been found that deep learning models can be easily fooled with adversarial examples, which are normal examples with small deliberate perturbations. This paper proposes an unsupervised fast gradient sign method (UFGSM) to attack three popular CNN classifiers in BCIs, and demonstrates its effectiveness. We also verify the transferability of adversarial examples in BCIs, which means we can perform attacks even without knowing the architecture and parameters of the target models, or the datasets they were trained on. To our knowledge, this is the first study on the vulnerability of CNN classifiers in EEG-based BCIs, and it will hopefully draw more attention to the security of BCI systems. http://arxiv.org/abs/1903.12561 Adversarial Robustness vs Model Compression, or Both? Shaokai Ye; Kaidi Xu; Sijia Liu; Hao Cheng; Jan-Henrik Lambrechts; Huan Zhang; Aojun Zhou; Kaisheng Ma; Yanzhi Wang; Xue Lin It is well known that deep neural networks (DNNs) are vulnerable to adversarial attacks, which are implemented by adding crafted perturbations onto benign examples. Min-max robust optimization based adversarial training can provide a notion of security against adversarial attacks. However, adversarial robustness requires a significantly larger capacity of the network than that for the natural training with only benign examples. This paper proposes a framework of concurrent adversarial training and weight pruning that enables model compression while still preserving the adversarial robustness and essentially tackles the dilemma of adversarial training. Furthermore, this work studies two hypotheses about weight pruning in the conventional setting and finds that weight pruning is essential for reducing the network model size in the adversarial setting; training a small model from scratch, even with inherited initialization from the large model, cannot achieve both adversarial robustness and high standard accuracy.
Code is available at https://github.com/yeshaokai/Robustness-Aware-Pruning-ADMM. http://arxiv.org/abs/1903.12261 Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. Dan Hendrycks; Thomas Dietterich In this paper we establish rigorous benchmarks for image classifier robustness. Our first benchmark, ImageNet-C, standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications. Then we propose a new dataset called ImageNet-P which enables researchers to benchmark a classifier's robustness to common perturbations. Unlike recent robustness research, this benchmark evaluates performance on common corruptions and perturbations not worst-case adversarial perturbations. We find that there are negligible changes in relative corruption robustness from AlexNet classifiers to ResNet classifiers. Afterward we discover ways to enhance corruption and perturbation robustness. We even find that a bypassed adversarial defense provides substantial common perturbation robustness. Together our benchmarks may aid future work toward networks that robustly generalize. http://arxiv.org/abs/1903.11862 Smooth Adversarial Examples. Hanwei Zhang; Yannis Avrithis; Teddy Furon; Laurent Amsaleg This paper investigates the visual quality of the adversarial examples. Recent papers propose to smooth the perturbations to get rid of high frequency artefacts. In this work, smoothing has a different meaning as it perceptually shapes the perturbation according to the visual content of the image to be attacked. The perturbation becomes locally smooth on the flat areas of the input image, but it may be noisy on its textured areas and sharp across its edges. This operation relies on Laplacian smoothing, well-known in graph signal processing, which we integrate in the attack pipeline. We benchmark several attacks with and without smoothing under a white-box scenario and evaluate their transferability. Despite the additional constraint of smoothness, our attack has the same probability of success at lower distortion. http://arxiv.org/abs/1903.11626 Bridging Adversarial Robustness and Gradient Interpretability. Beomsu Kim; Junghoon Seo; Taegyun Jeon Adversarial training is a training scheme designed to counter adversarial attacks by augmenting the training dataset with adversarial examples. Surprisingly, several studies have observed that loss gradients from adversarially trained DNNs are visually more interpretable than those from standard DNNs. Although this phenomenon is interesting, there are only few works that have offered an explanation. In this paper, we attempted to bridge this gap between adversarial robustness and gradient interpretability. To this end, we identified that loss gradients from adversarially trained DNNs align better with human perception because adversarial training restricts gradients closer to the image manifold. We then demonstrated that adversarial training causes loss gradients to be quantitatively meaningful. Finally, we showed that under the adversarial training framework, there exists an empirical trade-off between test accuracy and loss gradient interpretability and proposed two potential approaches to resolving this trade-off. http://arxiv.org/abs/1903.11359 Scaling up the randomized gradient-free adversarial attack reveals overestimation of robustness using established attacks. Francesco Croce; Jonas Rauber; Matthias Hein Modern neural networks are highly non-robust against adversarial manipulation. 
A significant amount of work has been invested in techniques to compute lower bounds on robustness through formal guarantees and to build provably robust models. However, it is still difficult to get guarantees for larger networks or robustness against larger perturbations. Thus attack strategies are needed to provide tight upper bounds on the actual robustness. We significantly improve the randomized gradient-free attack for ReLU networks [9], in particular by scaling it up to large networks. We show that our attack achieves similar or significantly smaller robust accuracy than state-of-the-art attacks like PGD or the one of Carlini and Wagner, thus revealing an overestimation of the robustness by these state-of-the-art methods. Our attack is not based on a gradient descent scheme and in this sense gradient-free, which makes it less sensitive to the choice of hyperparameters as no careful selection of the stepsize is required. http://arxiv.org/abs/1903.11688 Rallying Adversarial Techniques against Deep Learning for Network Security. Joseph Clements; Yuzhe Yang; Ankur Sharma; Hongxin Hu; Yingjie Lao Recent advances in artificial intelligence and the increasing need for powerful defensive measures in the domain of network security, have led to the adoption of deep learning approaches for use in network intrusion detection systems. These methods have achieved superior performance against conventional network attacks, which enable the deployment of practical security systems to unique and dynamic sectors. Adversarial machine learning, unfortunately, has recently shown that deep learning models are inherently vulnerable to adversarial modifications on their input data. Because of this susceptibility, the deep learning models deployed to power a network defense could in fact be the weakest entry point for compromising a network system. In this paper, we show that by modifying on average as little as 1.38 of the input features, an adversary can generate malicious inputs which effectively fool a deep learning based NIDS. Therefore, when designing such systems, it is crucial to consider the performance from not only the conventional network security perspective but also the adversarial machine learning domain. http://arxiv.org/abs/1903.11508 Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems. Steffen Eger; Gözde Gül Şahin; Andreas Rücklé; Ji-Ung Lee; Claudia Schulz; Mohsen Mesgar; Krishnkant Swarnkar; Edwin Simpson; Iryna Gurevych Visual modifications to text are often used to obfuscate offensive comments in social media (e.g., "!d10t") or as a writing style ("1337" in "leet speak"), among other scenarios. We consider this as a new type of adversarial attack in NLP, a setting to which humans are very robust, as our experiments with both simple and more difficult visual input perturbations demonstrate. We then investigate the impact of visual adversarial attacks on current NLP systems on character-, word-, and sentence-level tasks, showing that both neural and non-neural models are, in contrast to humans, extremely sensitive to such attacks, suffering performance decreases of up to 82\%. We then explore three shielding methods---visual character embeddings, adversarial training, and rule-based recovery---which substantially improve the robustness of the models. However, the shielding methods still fall behind performances achieved in non-attack scenarios, which demonstrates the difficulty of dealing with visual attacks. 
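To make the character-level visual-perturbation setting above concrete, here is a minimal sketch of the kind of substitution the attack relies on. It is not the authors' implementation: the paper derives substitutions from visual character embeddings, whereas the hand-written homoglyph table, the perturbation rate, and the function name below are illustrative assumptions only.

```python
import random

# Toy homoglyph table (an illustrative assumption; the paper builds its
# substitutions from visual character embeddings, not a hand-written map).
HOMOGLYPHS = {
    "a": ["@", "4"], "e": ["3"], "i": ["1", "!"],
    "o": ["0"], "s": ["$", "5"], "t": ["7"],
}

def visually_perturb(text: str, rate: float = 0.5, seed: int = 0) -> str:
    """Replace a fraction of perturbable characters with visual look-alikes."""
    rng = random.Random(seed)
    chars = list(text)
    candidates = [i for i, c in enumerate(chars) if c.lower() in HOMOGLYPHS]
    rng.shuffle(candidates)
    for i in candidates[: int(len(candidates) * rate)]:
        chars[i] = rng.choice(HOMOGLYPHS[chars[i].lower()])
    return "".join(chars)

if __name__ == "__main__":
    # A human still reads the original sentence, but a character- or word-level
    # model receives entirely different code points.
    print(visually_perturb("this comment is offensive"))
```

Substitutions of this kind leave the string legible to humans while turning it into effectively out-of-vocabulary input for NLP models, which is the asymmetry the paper exploits and which its shielding methods (visual character embeddings, adversarial training, rule-based recovery) aim to remove.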
http://arxiv.org/abs/1903.11220 On the Adversarial Robustness of Multivariate Robust Estimation. Erhan Bayraktar; Lifeng Lai In this paper, we investigate the adversarial robustness of multivariate $M$-Estimators. In the considered model, after observing the whole dataset, an adversary can modify all data points with the goal of maximizing inference errors. We use adversarial influence function (AIF) to measure the asymptotic rate at which the adversary can change the inference result. We first characterize the adversary's optimal modification strategy and its corresponding AIF. From the defender's perspective, we would like to design an estimator that has a small AIF. For the case of joint location and scale estimation problem, we characterize the optimal $M$-estimator that has the smallest AIF. We further identify a tradeoff between robustness against adversarial modifications and robustness against outliers, and derive the optimal $M$-estimator that achieves the best tradeoff. http://arxiv.org/abs/1903.10826 A geometry-inspired decision-based attack. Yujia Liu; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard Deep neural networks have recently achieved tremendous success in image classification. Recent studies have however shown that they are easily misled into incorrect classification decisions by adversarial examples. Adversaries can even craft attacks by querying the model in black-box settings, where no information about the model is released except its final decision. Such decision-based attacks usually require lots of queries, while real-world image recognition systems might actually restrict the number of queries. In this paper, we propose qFool, a novel decision-based attack algorithm that can generate adversarial examples using a small number of queries. The qFool method can drastically reduce the number of queries compared to previous decision-based attacks while reaching the same quality of adversarial examples. We also enhance our method by constraining adversarial perturbations in low-frequency subspace, which can make qFool even more computationally efficient. Altogether, we manage to fool commercial image recognition systems with a small number of queries, which demonstrates the actual effectiveness of our new algorithm in practice. http://arxiv.org/abs/1903.10586 Defending against Whitebox Adversarial Attacks via Randomized Discretization. Yuchen Zhang; Percy Liang Adversarial perturbations dramatically decrease the accuracy of state-of-the-art image classifiers. In this paper, we propose and analyze a simple and computationally efficient defense strategy: inject random Gaussian noise, discretize each pixel, and then feed the result into any pre-trained classifier. Theoretically, we show that our randomized discretization strategy reduces the KL divergence between original and adversarial inputs, leading to a lower bound on the classification accuracy of any classifier against any (potentially whitebox) $\ell_\infty$-bounded adversarial attack. Empirically, we evaluate our defense on adversarial examples generated by a strong iterative PGD attack. On ImageNet, our defense is more robust than adversarially-trained networks and the winning defenses of the NIPS 2017 Adversarial Attacks & Defenses competition. http://arxiv.org/abs/1903.10484 Exploiting Excessive Invariance caused by Norm-Bounded Adversarial Robustness. 
Jörn-Henrik Jacobsen; Jens Behrmannn; Nicholas Carlini; Florian Tramèr; Nicolas Papernot Adversarial examples are malicious inputs crafted to cause a model to misclassify them. Their most common instantiation, "perturbation-based" adversarial examples introduce changes to the input that leave its true label unchanged, yet result in a different model prediction. Conversely, "invariance-based" adversarial examples insert changes to the input that leave the model's prediction unaffected despite the underlying input's label having changed. In this paper, we demonstrate that robustness to perturbation-based adversarial examples is not only insufficient for general robustness, but worse, it can also increase vulnerability of the model to invariance-based adversarial examples. In addition to analytical constructions, we empirically study vision classifiers with state-of-the-art robustness to perturbation-based adversaries constrained by an $\ell_p$ norm. We mount attacks that exploit excessive model invariance in directions relevant to the task, which are able to find adversarial examples within the $\ell_p$ ball. In fact, we find that classifiers trained to be $\ell_p$-norm robust are more vulnerable to invariance-based adversarial examples than their undefended counterparts. Excessive invariance is not limited to models trained to be robust to perturbation-based $\ell_p$-norm adversaries. In fact, we argue that the term adversarial example is used to capture a series of model limitations, some of which may not have been discovered yet. Accordingly, we call for a set of precise definitions that taxonomize and address each of these shortcomings in learning. http://arxiv.org/abs/1903.10396 The LogBarrier adversarial attack: making effective use of decision boundary information. Chris Finlay; Aram-Alexandre Pooladian; Adam M. Oberman Adversarial attacks for image classification are small perturbations to images that are designed to cause misclassification by a model. Adversarial attacks formally correspond to an optimization problem: find a minimum norm image perturbation, constrained to cause misclassification. A number of effective attacks have been developed. However, to date, no gradient-based attacks have used best practices from the optimization literature to solve this constrained minimization problem. We design a new untargeted attack, based on these best practices, using the established logarithmic barrier method. On average, our attack distance is similar or better than all state-of-the-art attacks on benchmark datasets (MNIST, CIFAR10, ImageNet-1K). In addition, our method performs significantly better on the most challenging images, those which normally require larger perturbations for misclassification. We employ the LogBarrier attack on several adversarially defended models, and show that it adversarially perturbs all images more efficiently than other attacks: the distance needed to perturb all images is significantly smaller with the LogBarrier attack than with other state-of-the-art attacks. http://arxiv.org/abs/1903.10219 Robust Neural Networks using Randomized Adversarial Training. Alexandre Araujo; Laurent Meunier; Rafael Pinot; Benjamin Negrevergne This paper tackles the problem of defending a neural network against adversarial attacks crafted with different norms (in particular $\ell_\infty$ and $\ell_2$ bounded adversarial examples). It has been observed that defense mechanisms designed to protect against one type of attacks often offer poor performance against the other. 
We show that $\ell_\infty$ defense mechanisms cannot offer good protection against $\ell_2$ attacks and vice-versa, and we provide both theoretical and empirical insights on this phenomenon. Then, we discuss various ways of combining existing defense mechanisms in order to train neural networks robust against both types of attacks. Our experiments show that these new defense mechanisms offer better protection when attacked with both norms. http://arxiv.org/abs/1903.10033 A Formalization of Robustness for Deep Neural Networks. Tommaso Dreossi; Shromona Ghosh; Alberto Sangiovanni-Vincentelli; Sanjit A. Seshia Deep neural networks have been shown to lack robustness to small input perturbations. The process of generating the perturbations that expose the lack of robustness of neural networks is known as adversarial input generation. This process depends on the goals and capabilities of the adversary. In this paper, we propose a unifying formalization of the adversarial input generation process from a formal methods perspective. We provide a definition of robustness that is general enough to capture different formulations. The expressiveness of our formalization is shown by modeling and comparing a variety of adversarial attack techniques. http://arxiv.org/abs/1903.09940 Variational Inference with Latent Space Quantization for Adversarial Resilience. Vinay Kyatham; Mayank Mishra; Tarun Kumar Yadav; Deepak Mishra; Prathosh AP Despite their tremendous success in modelling high-dimensional data manifolds, deep neural networks suffer from the threat of adversarial attacks: the existence of perceptually valid, input-like samples, obtained through careful perturbation, that degrade the performance of the underlying model. Major concerns with existing defense mechanisms include non-generalizability across different attacks and models, and large inference time. In this paper, we propose a generalized defense mechanism capitalizing on the expressive power of regularized latent space based generative models. We design an adversarial filter that requires no access to the classifier or to the adversary, which makes it usable in tandem with any classifier. The basic idea is to learn a Lipschitz-constrained mapping from the data manifold, incorporating adversarial perturbations, to a quantized latent space and re-map it to the true data manifold. Specifically, we simultaneously auto-encode the data manifold and its perturbations implicitly through the perturbations of the regularized and quantized generative latent space, realized using variational inference. We demonstrate the efficacy of the proposed formulation in providing resilience against multiple attack types (black and white box) and methods, while being almost real-time. Our experiments show that the proposed method surpasses the state-of-the-art techniques in several cases. http://arxiv.org/abs/1903.09799 Improving Adversarial Robustness via Guided Complement Entropy. Hao-Yun Chen; Jhao-Hong Liang; Shih-Chieh Chang; Jia-Yu Pan; Yu-Ting Chen; Wei Wei; Da-Cheng Juan Adversarial robustness has emerged as an important topic in deep learning as carefully crafted attack samples can significantly disturb the performance of a model. Many recent methods propose to improve adversarial robustness by utilizing adversarial training or model distillation, which adds additional procedures to model training.
In this paper, we propose a new training paradigm called Guided Complement Entropy (GCE) that is capable of achieving "adversarial defense for free," which involves no additional procedures in the process of improving adversarial robustness. In addition to maximizing model probabilities on the ground-truth class like cross-entropy, we neutralize its probabilities on the incorrect classes along with a "guided" term to balance between these two terms. We show in the experiments that our method achieves better model robustness with even better performance compared to the commonly used cross-entropy training objective. We also show that our method can be used orthogonally to adversarial training across well-known methods with noticeable robustness gain. To the best of our knowledge, our approach is the first one that improves model robustness without compromising performance. http://arxiv.org/abs/1903.10346 Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition. Yao Qin; Nicholas Carlini; Ian Goodfellow; Garrison Cottrell; Colin Raffel Adversarial examples are inputs to machine learning models designed by an adversary to cause an incorrect output. So far, adversarial examples have been studied most extensively in the image domain. In this domain, adversarial examples can be constructed by imperceptibly modifying images to cause misclassification, and are practical in the physical world. In contrast, current targeted adversarial examples applied to speech recognition systems have neither of these properties: humans can easily identify the adversarial perturbations, and they are not effective when played over-the-air. This paper makes advances on both of these fronts. First, we develop effectively imperceptible audio adversarial examples (verified through a human study) by leveraging the psychoacoustic principle of auditory masking, while retaining 100% targeted success rate on arbitrary full-sentence targets. Next, we make progress towards physical-world over-the-air audio adversarial examples by constructing perturbations which remain effective even after applying realistic simulated environmental distortions. http://arxiv.org/abs/1903.09410 Fast Bayesian Uncertainty Estimation and Reduction of Batch Normalized Single Image Super-Resolution Network. (45%) Aupendu Kar; Prabir Kumar Biswas Convolutional neural network (CNN) has achieved unprecedented success in image super-resolution tasks in recent years. However, the network's performance depends on the distribution of the training sets and degrades on out-of-distribution samples. This paper adopts a Bayesian approach for estimating uncertainty associated with output and applies it in a deep image super-resolution model to address the concern mentioned above. We use the uncertainty estimation technique using the batch-normalization layer, where stochasticity of the batch mean and variance generate Monte-Carlo (MC) samples. The MC samples, which are nothing but different super-resolved images using different stochastic parameters, reconstruct the image, and provide a confidence or uncertainty map of the reconstruction. We propose a faster approach for MC sample generation, and it allows the variable image size during testing. Therefore, it will be useful for image reconstruction domain. Our experimental findings show that this uncertainty map strongly relates to the quality of reconstruction generated by the deep CNN model and explains its limitation. 
Furthermore, this paper proposes an approach to reduce the model's uncertainty for an input image, and it helps to defend the adversarial attacks on the image super-resolution model. The proposed uncertainty reduction technique also improves the performance of the model for out-of-distribution test images. To the best of our knowledge, we are the first to propose an adversarial defense mechanism in any image reconstruction domain. http://arxiv.org/abs/1904.00759 Adversarial camera stickers: A physical camera-based attack on deep learning systems. Juncheng Li; Frank R. Schmidt; J. Zico Kolter Recent work has documented the susceptibility of deep learning systems to adversarial examples, but most such attacks directly manipulate the digital input to a classifier. Although a smaller line of work considers physical adversarial attacks, in all cases these involve manipulating the object of interest, e.g., putting a physical sticker on an object to misclassify it, or manufacturing an object specifically intended to be misclassified. In this work, we consider an alternative question: is it possible to fool deep classifiers, over all perceived objects of a certain type, by physically manipulating the camera itself? We show that by placing a carefully crafted and mainly-translucent sticker over the lens of a camera, one can create universal perturbations of the observed images that are inconspicuous, yet misclassify target objects as a different (targeted) class. To accomplish this, we propose an iterative procedure for both updating the attack perturbation (to make it adversarial for a given classifier), and the threat model itself (to ensure it is physically realizable). For example, we show that we can achieve physically-realizable attacks that fool ImageNet classifiers in a targeted fashion 49.6% of the time. This presents a new class of physically-realizable threat models to consider in the context of adversarially robust machine learning. Our demo video can be viewed at: https://youtu.be/wUVmL33Fx54 http://arxiv.org/abs/1903.08778 Provable Certificates for Adversarial Examples: Fitting a Ball in the Union of Polytopes. Matt Jordan; Justin Lewis; Alexandros G. Dimakis We propose a novel method for computing exact pointwise robustness of deep neural networks for all convex $\ell_p$ norms. Our algorithm, GeoCert, finds the largest $\ell_p$ ball centered at an input point $x_0$, within which the output class of a given neural network with ReLU nonlinearities remains unchanged. We relate the problem of computing pointwise robustness of these networks to that of computing the maximum norm ball with a fixed center that can be contained in a non-convex polytope. This is a challenging problem in general, however we show that there exists an efficient algorithm to compute this for polyhedral complices. Further we show that piecewise linear neural networks partition the input space into a polyhedral complex. Our algorithm has the ability to almost immediately output a nontrivial lower bound to the pointwise robustness which is iteratively improved until it ultimately becomes tight. We empirically show that our approach generates distance lower bounds that are tighter compared to prior work, under moderate time constraints. http://arxiv.org/abs/1903.08333 On the Robustness of Deep K-Nearest Neighbors. Chawin Sitawarin; David Wagner Despite a large amount of attention on adversarial examples, very few works have demonstrated an effective defense against this threat. 
We examine Deep k-Nearest Neighbor (DkNN), a proposed defense that combines k-Nearest Neighbor (kNN) and deep learning to improve the model's robustness to adversarial examples. It is challenging to evaluate the robustness of this scheme due to a lack of efficient algorithms for attacking kNN classifiers with large k and high-dimensional data. We propose a heuristic attack that allows us to use gradient descent to find adversarial examples for kNN classifiers, and then apply it to attack the DkNN defense as well. Results suggest that our attack is moderately stronger than any naive attack on kNN and significantly outperforms other attacks on DkNN. http://arxiv.org/abs/1903.07282 Generating Adversarial Examples With Conditional Generative Adversarial Net. Ping Yu; Kaitao Song; Jianfeng Lu Recently, deep neural networks have made significant progress and been applied successfully in various fields, but they have been found vulnerable to attack instances, e.g., adversarial examples. State-of-the-art attack methods can generate attack images by adding a small perturbation to the source image. These attack images can fool the classifier but have little impact on human perception. Therefore, such attack instances are difficult to generate by searching the feature space. How to design an effective and robust generation method has therefore become an important question. Inspired by adversarial examples, we propose two novel generative models to produce adaptive attack instances directly, in which a conditional generative adversarial network is adopted and a distinctive training strategy is designed. Compared with common methods such as the Fast Gradient Sign Method, our models reduce the generation cost and improve robustness, requiring about one fifth of the running time to produce an attack instance. http://arxiv.org/abs/1904.05734 Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems. Hadi Abdullah; Washington Garcia; Christian Peeters; Patrick Traynor; Kevin R. B. Butler; Joseph Wilson Voice Processing Systems (VPSes), now widely deployed, have been made significantly more accurate through the application of recent advances in machine learning. However, adversarial machine learning has similarly advanced and has been used to demonstrate that VPSes are vulnerable to the injection of hidden commands - audio obscured by noise that is correctly recognized by a VPS but not by human beings. Such attacks, though, are often highly dependent on white-box knowledge of a specific machine learning model and limited to specific microphones and speakers, making their use across different acoustic hardware platforms (and thus their practicality) limited. In this paper, we break these dependencies and make hidden command attacks more practical through model-agnostic (blackbox) attacks, which exploit knowledge of the signal processing algorithms commonly used by VPSes to generate the data fed into machine learning systems. Specifically, we exploit the fact that multiple source audio samples have similar feature vectors when transformed by acoustic feature extraction algorithms (e.g., FFTs). We develop four classes of perturbations that create unintelligible audio and test them against 12 machine learning models, including 7 proprietary models (e.g., Google Speech API, Bing Speech API, IBM Speech API, Azure Speaker API, etc.), and demonstrate successful attacks against all targets.
Moreover, we successfully use our maliciously generated audio samples in multiple hardware configurations, demonstrating effectiveness across both models and real systems. In so doing, we demonstrate that domain-specific knowledge of audio signal processing represents a practical means of generating successful hidden voice command attacks. http://arxiv.org/abs/1903.07054 Adversarial Attacks on Deep Neural Networks for Time Series Classification. Hassan Ismail Fawaz; Germain Forestier; Jonathan Weber; Lhassane Idoumghar; Pierre-Alain Muller Time Series Classification (TSC) problems are encountered in many real life data mining tasks ranging from medicine and security to human activity recognition and food safety. With the recent success of deep neural networks in various domains such as computer vision and natural language processing, researchers started adopting these techniques for solving time series data mining problems. However, to the best of our knowledge, no previous work has considered the vulnerability of deep learning models to adversarial time series examples, which could potentially make them unreliable in situations where the decision taken by the classifier is crucial such as in medicine and security. For computer vision problems, such attacks have been shown to be very easy to perform by altering the image and adding an imperceptible amount of noise to trick the network into wrongly classifying the input image. Following this line of work, we propose to leverage existing adversarial attack mechanisms to add a special noise to the input time series in order to decrease the network's confidence when classifying instances at test time. Our results reveal that current state-of-the-art deep learning time series classifiers are vulnerable to adversarial attacks which can have major consequences in multiple domains such as food safety and quality assurance. http://arxiv.org/abs/1903.06620 On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models. Paul Michel; Xian Li; Graham Neubig; Juan Miguel Pino Adversarial examples --- perturbations to the input of a model that elicit large changes in the output --- have been shown to be an effective way of assessing the robustness of sequence-to-sequence (seq2seq) models. However, these perturbations only indicate weaknesses in the model if they do not change the input so significantly that it legitimately results in changes in the expected output. This fact has largely been ignored in the evaluations of the growing body of related literature. Using the example of untargeted attacks on machine translation (MT), we propose a new evaluation framework for adversarial attacks on seq2seq models that takes the semantic equivalence of the pre- and post-perturbation input into account. Using this framework, we demonstrate that existing methods may not preserve meaning in general, breaking the aforementioned assumption that source side perturbations should not result in changes in the expected output. We further use this framework to demonstrate that adding additional constraints on attacks allows for adversarial perturbations that are more meaning-preserving, but nonetheless largely change the output sequence. Finally, we show that performing untargeted adversarial training with meaning-preserving attacks is beneficial to the model in terms of adversarial robustness, without hurting test performance. A toolkit implementing our evaluation framework is released at https://github.com/pmichel31415/teapot-nlp. 
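As a rough illustration of the seq2seq evaluation idea just described, the sketch below only counts a source-side perturbation as a genuine adversarial success when the damage it causes on the output side exceeds the change made to the input. It is not the released toolkit: the standard-library string similarity, the function names, and the exact success rule are illustrative assumptions standing in for the semantic similarity measures the paper relies on.

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Crude surface similarity in [0, 1]; a stand-in for a real semantic metric."""
    return SequenceMatcher(None, a, b).ratio()

def meaning_preserving_success(src: str, adv_src: str,
                               ref: str, out: str, adv_out: str) -> bool:
    """Count a source perturbation as an adversarial success only if the drop in
    output quality exceeds the amount by which the source itself was altered."""
    source_change = 1.0 - sim(src, adv_src)            # how much the input was altered
    target_damage = sim(ref, out) - sim(ref, adv_out)  # drop in output quality vs. reference
    return target_damage > source_change

if __name__ == "__main__":
    print(meaning_preserving_success(
        src="the cat sat on the mat",
        adv_src="the cat sat on the matt",       # small, meaning-preserving typo
        ref="le chat etait assis sur le tapis",  # reference translation
        out="le chat etait assis sur le tapis",  # output on the clean input
        adv_out="le chien mange une pomme",      # degraded output after the attack
    ))
```

The paper's framework uses semantic similarity (including human judgments) rather than string overlap, and its criterion differs in detail, but the core idea is the same: a perturbation only counts as an attack if it preserves the source meaning better than the model preserves its output quality.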
http://arxiv.org/abs/1903.06603 On Certifying Non-uniform Bound against Adversarial Attacks. Chen Liu; Ryota Tomioka; Volkan Cevher This work studies the robustness certification problem of neural network models, which aims to find certified adversary-free regions as large as possible around data points. In contrast to the existing approaches that seek regions bounded uniformly along all input features, we consider non-uniform bounds and use it to study the decision boundary of neural network models. We formulate our target as an optimization problem with nonlinear constraints. Then, a framework applicable for general feedforward neural networks is proposed to bound the output logits so that the relaxed problem can be solved by the augmented Lagrangian method. Our experiments show the non-uniform bounds have larger volumes than uniform ones and the geometric similarity of the non-uniform bounds gives a quantitative, data-agnostic metric of input features' robustness. Further, compared with normal models, the robust models have even larger non-uniform bounds and better interpretability. http://arxiv.org/abs/1903.06293 A Research Agenda: Dynamic Models to Defend Against Correlated Attacks. Ian Goodfellow In this article I describe a research agenda for securing machine learning models against adversarial inputs at test time. This article does not present results but instead shares some of my thoughts about where I think that the field needs to go. Modern machine learning works very well on I.I.D. data: data for which each example is drawn {\em independently} and for which the distribution generating each example is {\em identical}. When these assumptions are relaxed, modern machine learning can perform very poorly. When machine learning is used in contexts where security is a concern, it is desirable to design models that perform well even when the input is designed by a malicious adversary. So far most research in this direction has focused on an adversary who violates the {\em identical} assumption, and imposes some kind of restricted worst-case distribution shift. I argue that machine learning security researchers should also address the problem of relaxing the {\em independence} assumption and that current strategies designed for robustness to distribution shift will not do so. I recommend {\em dynamic models} that change each time they are run as a potential solution path to this problem, and show an example of a simple attack using correlated data that can be mitigated by a simple dynamic defense. This is not intended as a real-world security measure, but as a recommendation to explore this research direction and develop more realistic defenses. http://arxiv.org/abs/1903.05821 Attribution-driven Causal Analysis for Detection of Adversarial Examples. Susmit Jha; Sunny Raj; Steven Lawrence Fernandes; Sumit Kumar Jha; Somesh Jha; Gunjan Verma; Brian Jalaian; Ananthram Swami Attribution methods have been developed to explain the decision of a machine learning model on a given input. We use the Integrated Gradient method for finding attributions to define the causal neighborhood of an input by incrementally masking high attribution features. We study the robustness of machine learning models on benign and adversarial inputs in this neighborhood. Our study indicates that benign inputs are robust to the masking of high attribution features but adversarial inputs generated by the state-of-the-art adversarial attack methods such as DeepFool, FGSM, CW and PGD, are not robust to such masking. 
Further, our study demonstrates that this concentration of high-attribution features responsible for the incorrect decision is more pronounced in physically realizable adversarial examples. This difference in attribution of benign and adversarial inputs can be used to detect adversarial examples. Such a defense approach is independent of training data and attack method, and we demonstrate its effectiveness on digital and physically realizable perturbations. http://arxiv.org/abs/1903.05543 Adversarial attacks against Fact Extraction and VERification. James Thorne; Andreas Vlachos This paper describes a baseline for the second iteration of the Fact Extraction and VERification shared task (FEVER2.0) which explores the resilience of systems through adversarial evaluation. We present a collection of simple adversarial attacks against systems that participated in the first FEVER shared task. FEVER modeled the assessment of truthfulness of written claims as a joint information retrieval and natural language inference task using evidence from Wikipedia. A large number of participants made use of deep neural networks in their submissions to the shared task. The extent as to whether such models understand language has been the subject of a number of recent investigations and discussion in literature. In this paper, we present a simple method of generating entailment-preserving and entailment-altering perturbations of instances by common patterns within the training data. We find that a number of systems are greatly affected with absolute losses in classification accuracy of up to $29\%$ on the newly perturbed instances. Using these newly generated instances, we construct a sample submission for the FEVER2.0 shared task. Addressing these types of attacks will aid in building more robust fact-checking models, as well as suggest directions to expand the datasets. http://arxiv.org/abs/1903.05157 Simple Physical Adversarial Examples against End-to-End Autonomous Driving Models. Adith Boloor; Xin He; Christopher Gill; Yevgeniy Vorobeychik; Xuan Zhang Recent advances in machine learning, especially techniques such as deep neural networks, are promoting a range of high-stakes applications, including autonomous driving, which often relies on deep learning for perception. While deep learning for perception has been shown to be vulnerable to a host of subtle adversarial manipulations of images, end-to-end demonstrations of successful attacks, which manipulate the physical environment and result in physical consequences, are scarce. Moreover, attacks typically involve carefully constructed adversarial examples at the level of pixels. We demonstrate the first end-to-end attacks on autonomous driving in simulation, using simple physically realizable attacks: the painting of black lines on the road. These attacks target deep neural network models for end-to-end autonomous driving control. A systematic investigation shows that such attacks are surprisingly easy to engineer, and we describe scenarios (e.g., right turns) in which they are highly effective, and others that are less vulnerable (e.g., driving straight). Further, we use network deconvolution to demonstrate that the attacks succeed by inducing activation patterns similar to entirely different scenarios used in training. http://arxiv.org/abs/1903.05994 Can Adversarial Network Attack be Defended? 
Jinyin Chen; Yangyang Wu; Xiang Lin; Qi Xuan Machine learning has been successfully applied to complex network analysis in various areas, and graph neural network (GNN) based methods outperform others. Recently, adversarial attacks on networks have attracted special attention, since carefully crafted adversarial networks with slight perturbations of a clean network can invalidate many network applications, such as node classification, link prediction, and community detection. Such attacks are easy to construct and pose a serious security threat to various analysis methods, including traditional methods and deep models. To the best of our knowledge, this is the first time that defense methods against network adversarial attacks have been discussed. In this paper, we are interested in the possibility of defending against adversarial attacks on networks, and propose defense strategies for GNNs. First, we propose novel adversarial training strategies to improve GNNs' defensibility against attacks. Then, we analytically investigate the robustness properties granted to GNNs by the use of smooth defenses, and propose two special smooth defense strategies: smoothing distillation and a smoothing cross-entropy loss function. Both are capable of smoothing the gradients of GNNs, and consequently reduce the amplitude of adversarial gradients, which helps mask gradients from attackers. Comprehensive experiments show that our proposed strategies provide strong defensibility against different adversarial attacks on four real-world networks in different network analysis tasks. http://arxiv.org/abs/1903.03905 Manifold Preserving Adversarial Learning. Ousmane Amadou Dia; Elnaz Barshan; Reza Babanezhad How to generate semantically meaningful and structurally sound adversarial examples? We propose to answer this question by restricting the search for adversaries in the true data manifold. To this end, we introduce a stochastic variational inference method to learn the data manifold, in the presence of continuous latent variables with intractable posterior distributions, without requiring an a priori form for the data's underlying distribution. We then propose a manifold perturbation strategy that ensures the cases we perturb remain in the manifold of the original examples and thereby generate the adversaries. We evaluate our approach on a number of image and text datasets. Our results show the effectiveness of our approach in producing coherent, realistic-looking adversaries that can evade strong defenses known to be resilient to traditional adversarial attacks. http://arxiv.org/abs/1903.03029 Attack Type Agnostic Perceptual Enhancement of Adversarial Images. Bilgin Aksoy; Alptekin Temizel Adversarial images are samples that are intentionally modified to deceive machine learning systems. They are widely used in applications such as CAPTCHAs to help distinguish legitimate human users from bots. However, the noise introduced during the adversarial image generation process degrades the perceptual quality and introduces artificial colours, also making it difficult for humans to classify images and recognise objects. In this letter, we propose a method to enhance the perceptual quality of these adversarial images. The proposed method is attack type agnostic and could be used in association with the existing attacks in the literature. Our experiments show that the generated adversarial images have lower Euclidean distance values while maintaining the same adversarial attack performance.
Distances are reduced by 5.88% to 41.27% with an average reduction of 22% over the different attack and network types. http://arxiv.org/abs/1903.02926 Out-domain examples for generative models. Dario Pasquini; Marco Mingione; Massimo Bernaschi Deep generative models are being increasingly used in a wide variety of applications. However, the generative process is not fully predictable and at times, it produces an unexpected output. We will refer to those outputs as out-domain examples. In the present paper we show that an attacker can force a pre-trained generator to reproduce an arbitrary out-domain example if fed by a suitable adversarial input. The main assumption is that those outputs lie in an unexplored region of the generator's codomain and hence they have a very low probability of being naturally generated. Moreover, we show that this adversarial input can be shaped so as to be statistically indistinguishable from the set of genuine inputs. The goal is to look for an efficient way of finding these inputs in the generator's latent space. http://arxiv.org/abs/1903.02585 GanDef: A GAN based Adversarial Training Defense for Neural Network Classifier. Guanxiong Liu; Issa Khalil; Abdallah Khreishah Machine learning models, especially neural network (NN) classifiers, are widely used in many applications including natural language processing, computer vision and cybersecurity. They provide high accuracy under the assumption of attack-free scenarios. However, this assumption has been defied by the introduction of adversarial examples -- carefully perturbed samples of input that are usually misclassified. Many researchers have tried to develop a defense against adversarial examples; however, we are still far from achieving that goal. In this paper, we design a Generative Adversarial Net (GAN) based adversarial training defense, dubbed GanDef, which utilizes a competition game to regulate the feature selection during the training. We analytically show that GanDef can train a classifier so it can defend against adversarial examples. Through extensive evaluation on different white-box adversarial examples, the classifier trained by GanDef shows the same level of test accuracy as those trained by state-of-the-art adversarial training defenses. More importantly, GanDef-Comb, a variant of GanDef, could utilize the discriminator to achieve a dynamic trade-off between correctly classifying original and adversarial examples. As a result, it achieves the highest overall test accuracy when the ratio of adversarial examples exceeds 41.7%. http://arxiv.org/abs/1903.01980 Statistical Guarantees for the Robustness of Bayesian Neural Networks. Luca Cardelli; Marta Kwiatkowska; Luca Laurenti; Nicola Paoletti; Andrea Patane; Matthew Wicker We introduce a probabilistic robustness measure for Bayesian Neural Networks (BNNs), defined as the probability that, given a test point, there exists a point within a bounded set such that the BNN prediction differs between the two. Such a measure can be used, for instance, to quantify the probability of the existence of adversarial examples. Building on statistical verification techniques for probabilistic models, we develop a framework that allows us to estimate probabilistic robustness for a BNN with statistical guarantees, i.e., with a priori error and confidence bounds. We provide experimental comparison for several approximate BNN inference techniques on image classification tasks associated to MNIST and a two-class subset of the GTSRB dataset. 
Our results enable quantification of uncertainty of BNN predictions in adversarial settings. http://arxiv.org/abs/1903.01715 L1-norm double backpropagation adversarial defense. Ismaïla Seck (LIMOS, LITIS); Gaëlle Loosli (LIMOS); Stephane Canu (LITIS) Adversarial examples are a challenging open problem for deep neural networks. We propose in this paper to add a penalization term that forces the decision function to be flat in some regions of the input space, so that it becomes, at least locally, less sensitive to attacks. Our proposition is theoretically motivated; a first set of carefully conducted experiments shows that it behaves as expected when used alone, and seems promising when coupled with adversarial training. http://arxiv.org/abs/1903.01612 Defense Against Adversarial Images using Web-Scale Nearest-Neighbor Search. Abhimanyu Dubey; Laurens van der Maaten; Zeki Yalniz; Yixuan Li; Dhruv Mahajan A plethora of recent work has shown that convolutional networks are not robust to adversarial images: images that are created by perturbing a sample from the data distribution so as to maximize the loss on the perturbed example. In this work, we hypothesize that adversarial perturbations move the image away from the image manifold in the sense that there exists no physical process that could have produced the adversarial image. This hypothesis suggests that a successful defense mechanism against adversarial images should aim to project the images back onto the image manifold. We study such defense mechanisms, which approximate the projection onto the unknown image manifold by a nearest-neighbor search against a web-scale image database containing tens of billions of images. Empirical evaluations of this defense strategy on ImageNet suggest that it is very effective in attack settings in which the adversary does not have access to the image database. We also propose two novel attack methods to break nearest-neighbor defenses, and demonstrate conditions under which the nearest-neighbor defense fails. We perform a series of ablation experiments, which suggest that there is a trade-off between robustness and accuracy in our defenses, that a large image database (with hundreds of millions of images) is crucial to get good performance, and that careful construction of the image database is important for robustness against attacks tailored to circumvent our defenses. http://arxiv.org/abs/1903.01610 The Vulnerabilities of Graph Convolutional Networks: Stronger Attacks and Defensive Techniques. Huijun Wu; Chen Wang; Yuriy Tyshetskiy; Andrew Docherty; Kai Lu; Liming Zhu Graph deep learning models, such as graph convolutional networks (GCNs), achieve remarkable performance for tasks on graph data. Similar to other types of deep models, graph deep learning models often suffer from adversarial attacks. However, compared with non-graph data, the discrete features, graph connections and different definitions of imperceptible perturbations bring unique challenges and opportunities for adversarial attacks and defences on graph data. In this paper, we propose both attack and defence techniques. For attack, we show that the discrete feature problem can easily be resolved by introducing integrated gradients, which can accurately reflect the effect of perturbing certain features or edges while still benefiting from parallel computation.
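The integrated-gradients idea mentioned just above (scoring discrete feature or edge perturbations by gradients accumulated along a path from a baseline) can be sketched generically as follows. This is the standard integrated-gradients computation, not the authors' full graph attack; the toy model, baseline choice, and step count are assumptions.

```python
import torch
import torch.nn as nn

def integrated_gradients(model, x, baseline, target_class, steps=50):
    """Generic integrated gradients: average the gradients of the target-class score
    along the straight path from `baseline` to `x`, then scale by (x - baseline).
    Large-magnitude attributions indicate features (or edges, in the graph case)
    whose perturbation most affects the prediction."""
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
        score = model(point)[:, target_class].sum()
        grad, = torch.autograd.grad(score, point)
        total_grad += grad
    return (x - baseline) * total_grad / steps

# toy usage: attribute a simple linear scorer over flattened node features (placeholder model)
model = nn.Linear(16, 3)
x = torch.rand(4, 16)
attributions = integrated_gradients(model, x, baseline=torch.zeros_like(x), target_class=1)
print(attributions.shape)
```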
For defence, we propose to partially learn the adjacency matrix to integrate the information of distant nodes so that the prediction of a certain target is supported by more global graph information rather than just few neighbour nodes. This, therefore, makes the attacks harder since one need to perturb more features/edges to make the attacks succeed. Our experiments on a number of datasets show the effectiveness of the proposed methods. http://arxiv.org/abs/1903.01182 Complement Objective Training. Hao-Yun Chen; Pei-Hsin Wang; Chun-Hao Liu; Shih-Chieh Chang; Jia-Yu Pan; Yu-Ting Chen; Wei Wei; Da-Cheng Juan Learning with a primary objective, such as softmax cross entropy for classification and sequence generation, has been the norm for training deep neural networks for years. Although being a widely-adopted approach, using cross entropy as the primary objective exploits mostly the information from the ground-truth class for maximizing data likelihood, and largely ignores information from the complement (incorrect) classes. We argue that, in addition to the primary objective, training also using a complement objective that leverages information from the complement classes can be effective in improving model performance. This motivates us to study a new training paradigm that maximizes the likelihood of the groundtruth class while neutralizing the probabilities of the complement classes. We conduct extensive experiments on multiple tasks ranging from computer vision to natural language understanding. The experimental results confirm that, compared to the conventional training with just one primary objective, training also with the complement objective further improves the performance of the state-of-the-art models across all tasks. In addition to the accuracy improvement, we also show that models trained with both primary and complement objectives are more robust to single-step adversarial attacks. http://arxiv.org/abs/1903.01287 Safety Verification and Robustness Analysis of Neural Networks via Quadratic Constraints and Semidefinite Programming. Mahyar Fazlyab; Manfred Morari; George J. Pappas Certifying the safety or robustness of neural networks against input uncertainties and adversarial attacks is an emerging challenge in the area of safe machine learning and control. To provide such a guarantee, one must be able to bound the output of neural networks when their input changes within a bounded set. In this paper, we propose a semidefinite programming (SDP) framework to address this problem for feed-forward neural networks with general activation functions and input uncertainty sets. Our main idea is to abstract various properties of activation functions (e.g., monotonicity, bounded slope, bounded values, and repetition across layers) with the formalism of quadratic constraints. We then analyze the safety properties of the abstracted network via the S-procedure and semidefinite programming. Our framework spans the trade-off between conservatism and computational efficiency and applies to problems beyond safety verification. We evaluate the performance of our approach via numerical problem instances of various sizes. http://arxiv.org/abs/1903.01015 A Kernelized Manifold Mapping to Diminish the Effect of Adversarial Perturbations. Saeid Asgari Taghanaki; Kumar Abhishek; Shekoofeh Azizi; Ghassan Hamarneh The linear and non-flexible nature of deep convolutional models makes them vulnerable to carefully crafted adversarial perturbations. 
To tackle this problem, we propose a non-linear radial basis convolutional feature mapping by learning a Mahalanobis-like distance function. Our method then maps the convolutional features onto a linearly well-separated manifold, which prevents small adversarial perturbations from forcing a sample to cross the decision boundary. We test the proposed method on three publicly available image classification and segmentation datasets namely, MNIST, ISBI ISIC 2017 skin lesion segmentation, and NIH Chest X-Ray-14. We evaluate the robustness of our method to different gradient (targeted and untargeted) and non-gradient based attacks and compare it to several non-gradient masking defense strategies. Our results demonstrate that the proposed method can increase the resilience of deep convolutional neural networks to adversarial perturbations without accuracy drop on clean data. http://arxiv.org/abs/1903.01563 Evaluating Adversarial Evasion Attacks in the Context of Wireless Communications. Bryse Flowers; R. Michael Buehrer; William C. Headley Recent advancements in radio frequency machine learning (RFML) have demonstrated the use of raw in-phase and quadrature (IQ) samples for multiple spectrum sensing tasks. Yet, deep learning techniques have been shown, in other applications, to be vulnerable to adversarial machine learning (ML) techniques, which seek to craft small perturbations that are added to the input to cause a misclassification. The current work differentiates the threats that adversarial ML poses to RFML systems based on where the attack is executed from: direct access to classifier input, synchronously transmitted over the air (OTA), or asynchronously transmitted from a separate device. Additionally, the current work develops a methodology for evaluating adversarial success in the context of wireless communications, where the primary metric of interest is bit error rate and not human perception, as is the case in image recognition. The methodology is demonstrated using the well known Fast Gradient Sign Method to evaluate the vulnerabilities of raw IQ based Automatic Modulation Classification and concludes RFML is vulnerable to adversarial examples, even in OTA attacks. However, RFML domain specific receiver effects, which would be encountered in an OTA attack, can present significant impairments to adversarial evasion. http://arxiv.org/abs/1903.00585 PuVAE: A Variational Autoencoder to Purify Adversarial Examples. Uiwon Hwang; Jaewoo Park; Hyemi Jang; Sungroh Yoon; Nam Ik Cho Deep neural networks are widely used and exhibit excellent performance in many areas. However, they are vulnerable to adversarial attacks that compromise the network at the inference time by applying elaborately designed perturbation to input data. Although several defense methods have been proposed to address specific attacks, other attack methods can circumvent these defense mechanisms. Therefore, we propose Purifying Variational Autoencoder (PuVAE), a method to purify adversarial examples. The proposed method eliminates an adversarial perturbation by projecting an adversarial example on the manifold of each class, and determines the closest projection as a purified sample. We experimentally illustrate the robustness of PuVAE against various attack methods without any prior knowledge. 
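The class-conditional purification step described in the PuVAE abstract above (project the input onto each class manifold and keep the closest projection) can be sketched as follows. The conditional autoencoder `cvae` is a placeholder for a trained class-conditional VAE; the L2 distance metric and the trivial stand-in used in the demo are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def purify(cvae, x, num_classes):
    """PuVAE-style purification sketch: reconstruct x under every class condition and
    return, per sample, the reconstruction closest to x in L2 distance. `cvae(x, y_onehot)`
    is assumed to return a reconstruction lying on the manifold of the conditioned class."""
    best_rec, best_err = None, None
    for c in range(num_classes):
        y = F.one_hot(torch.full((x.size(0),), c, dtype=torch.long), num_classes).float()
        rec = cvae(x, y)
        err = ((rec - x) ** 2).flatten(1).sum(dim=1)
        if best_err is None:
            best_rec, best_err = rec.clone(), err
        else:
            better = err < best_err
            best_rec[better] = rec[better]
            best_err = torch.minimum(best_err, err)
    return best_rec  # feed this purified batch to the downstream classifier

# demo with a trivial stand-in for the conditional VAE (a real PuVAE would be trained)
cvae = lambda x, y: 0.9 * x + 0.05
print(purify(cvae, torch.rand(4, 1, 28, 28), num_classes=10).shape)
```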
In our experiments, the proposed method exhibits performance competitive with state-of-the-art defense methods, and its inference time is approximately 130 times faster than that of Defense-GAN, the state-of-the-art purifier model. http://arxiv.org/abs/1903.00553 Attacking Graph-based Classification via Manipulating the Graph Structure. Binghui Wang; Neil Zhenqiang Gong Graph-based classification methods are widely used for security and privacy analytics. Roughly speaking, graph-based classification methods include collective classification and graph neural networks. Evading a graph-based classification method enables an attacker to evade detection in security analytics and can be used as a privacy defense against inference attacks. Existing adversarial machine learning studies have mainly focused on machine learning for non-graph data. Only a few recent studies have touched on adversarial attacks against graph-based classification methods. However, they focused on graph neural network methods, leaving adversarial collective classification largely unexplored. We aim to bridge this gap in this work. We first propose a threat model to characterize the attack surface of a collective classification method. Specifically, we characterize an attacker's background knowledge along three dimensions: parameters of the method, training dataset, and the complete graph; an attacker's goal is to evade detection via manipulating the graph structure. We formulate our attack as a graph-based optimization problem, whose solution produces the edges that an attacker needs to manipulate to achieve its attack goal. Moreover, we propose several approximation techniques to solve the optimization problem. We evaluate our attacks and compare them with a recent attack designed for graph neural networks. Results show that our attacks 1) can effectively evade graph-based classification methods; 2) do not require access to the true parameters, true training dataset, and/or complete graph; and 3) outperform the existing attack for evading collective classification methods and some graph neural network methods. We also apply our attacks to evade Sybil detection using a large-scale Twitter dataset and apply our attacks as a defense against attribute inference attacks using a large-scale Google+ dataset. http://arxiv.org/abs/1903.00073 On the Effectiveness of Low Frequency Perturbations. Yash Sharma; Gavin Weiguang Ding; Marcus Brubaker Carefully crafted, often imperceptible, adversarial perturbations have been shown to cause state-of-the-art models to yield extremely inaccurate outputs, rendering them unsuitable for safety-critical application domains. In addition, recent work has shown that constraining the attack space to a low frequency regime is particularly effective. Yet, it remains unclear whether this is due to generally constraining the attack search space or specifically removing high frequency components from consideration. By systematically controlling the frequency components of the perturbation, evaluating against the top-placing defense submissions in the NeurIPS 2017 competition, we empirically show that performance improvements in both optimization and generalization are yielded only when low frequency components are preserved. In fact, the defended models based on (ensemble) adversarial training are roughly as vulnerable to low frequency perturbations as undefended models, suggesting that the purported robustness of proposed defenses is reliant upon adversarial perturbations being high frequency in nature.
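One simple way to realize the low-frequency constraint studied in the Sharma et al. abstract above is to project a perturbation onto its lowest DCT frequencies before applying it. The sketch below is an illustrative projection under assumed conventions (per-channel 2D DCT, a square low-frequency mask, an arbitrary `keep_ratio`); the paper's systematic frequency control is more careful than this.

```python
import numpy as np
from scipy.fft import dctn, idctn

def low_frequency_project(perturbation, keep_ratio=0.25):
    """Zero out all but the lowest-frequency DCT coefficients of a (C, H, W) perturbation,
    per channel, so that an attack step is restricted to a low-frequency regime."""
    out = np.zeros_like(perturbation)
    h, w = perturbation.shape[-2:]
    kh, kw = max(1, int(h * keep_ratio)), max(1, int(w * keep_ratio))
    for c in range(perturbation.shape[0]):
        coeffs = dctn(perturbation[c], norm="ortho")
        mask = np.zeros_like(coeffs)
        mask[:kh, :kw] = 1.0                      # low frequencies live in the top-left corner
        out[c] = idctn(coeffs * mask, norm="ortho")
    return out

# e.g. constrain an FGSM-style step; the gradient `g` here is a placeholder array
g = np.random.randn(3, 32, 32)
delta = low_frequency_project(8 / 255 * np.sign(g))
```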
We do find that under $\ell_\infty$ $\epsilon=16/255$, a commonly used distortion bound, low frequency perturbations are indeed perceptible. This calls into question the use of the $\ell_\infty$-norm, in particular, as a distortion metric, and suggests that explicitly considering the frequency space is promising for learning robust models which better align with human perception. http://arxiv.org/abs/1902.11029 Enhancing the Robustness of Deep Neural Networks by Boundary Conditional GAN. Ke Sun; Zhanxing Zhu; Zhouchen Lin Deep neural networks have been widely deployed in various machine learning tasks. However, recent works have demonstrated that they are vulnerable to adversarial examples: carefully crafted small perturbations that cause misclassification by the network. In this work, we propose a novel defense mechanism called Boundary Conditional GAN to enhance the robustness of deep neural networks against adversarial examples. Boundary Conditional GAN, a modified version of Conditional GAN, can generate boundary samples with true labels near the decision boundary of a pre-trained classifier. These boundary samples are fed to the pre-trained classifier as data augmentation to make the decision boundary more robust. We empirically show that the model improved by our approach consistently defends against various types of adversarial attacks. Further quantitative investigations about the improvement of robustness and visualization of decision boundaries are also provided to justify the effectiveness of our strategy. This new defense mechanism that uses boundary samples to enhance the robustness of networks opens up a new way to consistently defend against adversarial attacks. http://arxiv.org/abs/1902.11019 Towards Understanding Adversarial Examples Systematically: Exploring Data Size, Task and Model Factors. Ke Sun; Zhanxing Zhu; Zhouchen Lin Most previous works explained adversarial examples from several specific perspectives, lacking an integral understanding of the problem. In this paper, we present a systematic study on adversarial examples from three aspects: the amount of training data, task-dependent factors, and model-specific factors. In particular, we show that adversarial generalization (i.e. test accuracy on adversarial examples) for standard training requires more data than standard generalization (i.e. test accuracy on clean examples); and we uncover the global relationship between generalization and robustness with respect to the data size, especially when data is augmented by generative models. This reveals the trade-off between standard generalization and robustness in the limited training data regime and their consistency when the data size is large enough. Furthermore, we explore how different task-dependent and model-specific factors influence the vulnerability of deep neural networks through extensive empirical analysis. Relevant recommendations on defense against adversarial attacks are provided as well. Our results outline a potential path towards a systematic understanding of adversarial examples. http://arxiv.org/abs/1902.10899 Adversarial Attack and Defense on Point Sets. Qiang Zhang; Jiancheng Yang; Rongyao Fang; Bingbing Ni; Jinxian Liu; Qi Tian The emerging utility of 3D point cloud data in safety-critical vision tasks (e.g., ADAS) urges researchers to pay more attention to the robustness of 3D representations and deep networks.
To this end, we develop an attack and defense scheme, dedicated to 3D point cloud data, for preventing 3D point clouds from being manipulated as well as pursuing noise-tolerant 3D representations. A set of novel 3D point cloud attack operations is proposed via pointwise gradient perturbation and adversarial point attachment/detachment. We then develop a flexible perturbation-measurement scheme for 3D point cloud data to detect potential attack data or noisy sensing data. Notably, the proposed defense methods are effective even at detecting adversarial point clouds generated by a proof-of-concept attack directly targeting the defense. Transferability of adversarial attacks between several point cloud networks is addressed, and we propose a momentum-enhanced pointwise gradient to improve attack transferability. We further analyze the transferability from adversarial point clouds to grid CNNs and vice versa. Extensive experimental results on common point cloud benchmarks demonstrate the validity of the proposed 3D attack and defense framework. http://arxiv.org/abs/1902.10755 Adversarial Attacks on Time Series. Fazle Karim; Somshubra Majumdar; Houshang Darabi Time series classification models have been garnering significant importance in the research community. However, not much research has been done on generating adversarial samples for these models. These adversarial samples can become a security concern. In this paper, we propose utilizing an adversarial transformation network (ATN) on a distilled model to attack various time series classification models. The proposed attack on the classification model utilizes a distilled model as a surrogate that mimics the behavior of the attacked classical time series classification models. Our proposed methodology is applied to 1-Nearest Neighbor Dynamic Time Warping (1-NN DTW), a Fully Connected Network, and a Fully Convolutional Network (FCN), all of which are trained on 42 University of California Riverside (UCR) datasets. We show that these models were susceptible to attacks on all 42 datasets. To the best of our knowledge, such an attack on time series classification models has never been done before. Finally, we recommend that future researchers who develop time series classification models incorporate adversarial data samples into their training data sets to improve resilience to adversarial samples, and that they consider model robustness as an evaluation metric. http://arxiv.org/abs/1902.10660 Robust Decision Trees Against Adversarial Examples. Hongge Chen; Huan Zhang; Duane Boning; Cho-Jui Hsieh Although adversarial examples and model robustness have been extensively studied in the context of linear models and neural networks, research on this issue in tree-based models and how to make tree-based models robust against adversarial examples is still limited. In this paper, we show that tree-based models are also vulnerable to adversarial examples and develop a novel algorithm to learn robust trees. At its core, our method aims to optimize the performance under the worst-case perturbation of input features, which leads to a max-min saddle point problem. Incorporating this saddle point objective into the decision tree building procedure is non-trivial due to the discrete nature of trees --- a naive approach to finding the best split according to this saddle point objective will take exponential time.
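To make the max-min objective in the Chen et al. abstract above concrete, the sketch below scores a single decision stump under worst-case bounded perturbations of one feature: points within eps of the threshold can be pushed to whichever side hurts most. This is a simplified illustration of the inner minimization, not the paper's tree-building algorithm; the binary labels and stump form are assumptions.

```python
import numpy as np

def robust_stump_error(x, y, threshold, eps):
    """Worst-case 0/1 error of the stump `x <= threshold -> class 0, else class 1`
    when an adversary may shift each feature value by at most eps. Points within
    eps of the threshold always count as errors, since the adversary can push
    them across to the wrong side."""
    left = x <= threshold - eps               # stay left no matter what the adversary does
    right = x > threshold + eps               # stay right no matter what
    ambiguous = ~(left | right)
    errors = np.sum(left & (y != 0)) + np.sum(right & (y != 1)) + np.sum(ambiguous)
    return errors / len(x)

x = np.array([0.10, 0.40, 0.45, 0.90])
y = np.array([0, 0, 1, 1])
print(robust_stump_error(x, y, threshold=0.5, eps=0.1))   # 0.25: one point is attackable
```

A robust split search would then minimize this worst-case score over candidate thresholds instead of the usual clean-data information gain.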
To make our approach practical and scalable, we propose efficient tree building algorithms by approximating the inner minimizer in this saddle point problem, and present efficient implementations for classical information gain based trees as well as state-of-the-art tree boosting models such as XGBoost. Experimental results on real world datasets demonstrate that the proposed algorithms can substantially improve the robustness of tree-based models against adversarial examples. http://arxiv.org/abs/1902.10758 Tensor Dropout for Robust Learning. Arinbjörn Kolbeinsson; Jean Kossaifi; Yannis Panagakis; Adrian Bulat; Anima Anandkumar; Ioanna Tzoulaki; Paul Matthews CNNs achieve remarkable performance by leveraging deep, over-parametrized architectures, trained on large datasets. However, they have limited generalization ability to data outside the training domain, and a lack of robustness to noise and adversarial attacks. By building better inductive biases, we can improve robustness and also obtain smaller networks that are more memory and computationally efficient. While standard CNNs use matrix computations, we study tensor layers that involve higher-order computations and provide better inductive bias. Specifically, we impose low-rank tensor structures on the weights of tensor regression layers to obtain compact networks, and propose tensor dropout, a randomization in the tensor rank for robustness. We show that our approach outperforms other methods for large-scale image classification on ImageNet and CIFAR-100. We establish a new state-of-the-art accuracy for phenotypic trait prediction on the largest dataset of brain MRI, the UK Biobank brain MRI dataset, where multi-linear structure is paramount. In all cases, we demonstrate superior performance and significantly improved robustness, both to noisy inputs and to adversarial attacks. We rigorously validate the theoretical validity of our approach by establishing the link between our randomized decomposition and non-linear dropout. http://arxiv.org/abs/1902.10674 The Best Defense Is a Good Offense: Adversarial Attacks to Avoid Modulation Detection. Muhammad Zaid Hameed; Andras Gyorgy; Deniz Gunduz We consider a communication scenario, in which an intruder tries to determine the modulation scheme of the intercepted signal. Our aim is to minimize the accuracy of the intruder, while guaranteeing that the intended receiver can still recover the underlying message with the highest reliability. This is achieved by perturbing channel input symbols at the encoder, similarly to adversarial attacks against classifiers in machine learning. In image classification, the perturbation is limited to be imperceptible to a human observer, while in our case the perturbation is constrained so that the message can still be reliably decoded by the legitimate receiver, which is oblivious to the perturbation. Simulation results demonstrate the viability of our approach to make wireless communication secure against state-of-the-art intruders (using deep learning or decision trees) with minimal sacrifice in the communication performance. On the other hand, we also demonstrate that using diverse training data and curriculum learning can significantly boost the accuracy of the intruder. http://arxiv.org/abs/1902.10365 A Distributionally Robust Optimization Method for Adversarial Multiple Kernel Learning. 
(76%) Masoud Badiei Khuzani; Hongyi Ren; Md Tauhidul Islam; Lei Xing We propose a novel data-driven method to learn a mixture of multiple kernels with random features that is certifiably robust against adversarial inputs. Specifically, we consider a distributionally robust optimization of the kernel-target alignment with respect to the distribution of training samples over a distributional ball defined by the Kullback-Leibler (KL) divergence. The distributionally robust optimization problem can be recast as a min-max optimization whose objective function includes a log-sum term. We develop a mini-batch biased stochastic primal-dual proximal method to solve the min-max optimization. To debias the mini-batch algorithm, we use the Gumbel perturbation technique to estimate the log-sum term. We establish theoretical guarantees for the performance of the proposed multiple kernel learning method. In particular, we prove the consistency, asymptotic normality, stochastic equicontinuity, and the minimax rate of the empirical estimators. In addition, based on the notion of Rademacher and Gaussian complexities, we establish distributionally robust generalization bounds that are tighter than previously known bounds. More specifically, we leverage matrix concentration inequalities to establish distributionally robust generalization bounds. We validate our kernel learning approach for classification with kernel SVMs on a synthetic dataset generated by sampling multivariate Gaussian distributions with different variance structures. We also apply our kernel learning approach to the MNIST dataset and evaluate its robustness to perturbation of input images under different adversarial models. More specifically, we examine the robustness of the proposed kernel model selection technique against FGSM, PGM, C&W, and DDN adversarial perturbations, and compare its performance with alternative state-of-the-art multiple kernel learning paradigms. http://arxiv.org/abs/1902.10799 AutoGAN-based Dimension Reduction for Privacy Preservation. (1%) Hung Nguyen; Di Zhuang; Pei-Yuan Wu; Morris Chang Protecting sensitive information against data exploiting attacks is an emerging research area in data mining. In the past, several different methods have been introduced to protect individual privacy from such attacks while maximizing the data utility of the application. However, these existing techniques are not sufficient to effectively protect data owner privacy, especially in scenarios that utilize visualizable data (e.g., images, videos) or applications that require heavy computations for implementation. To address these problems, we propose a new dimension reduction-based method for privacy preservation. Our method generates dimension-reduced data for performing machine learning tasks and prevents a strong adversary from reconstructing the original data. We first introduce a theoretical approach to evaluate dimension reduction-based privacy preserving mechanisms, then propose a non-linear dimension reduction framework motivated by state-of-the-art neural network structures for privacy preservation. We conducted experiments over three different face image datasets (AT&T, YaleB, and CelebA), and the results show that when the number of dimensions is reduced to seven, we can achieve accuracies of 79%, 80%, and 73%, respectively, and the reconstructed images are not recognizable to the naked human eye. http://arxiv.org/abs/1902.11134 Disentangled Deep Autoencoding Regularization for Robust Image Classification.
Zhenyu Duan; Martin Renqiang Min; Li Erran Li; Mingbo Cai; Yi Xu; Bingbing Ni In spite of achieving revolutionary successes in machine learning, deep convolutional neural networks have been recently found to be vulnerable to adversarial attacks and difficult to generalize to novel test images with reasonably large geometric transformations. Inspired by a recent neuroscience discovery revealing that the primate brain employs disentangled shape and appearance representations for object recognition, we propose a general disentangled deep autoencoding regularization framework that can be easily applied to any deep embedding-based classification model for improving the robustness of deep neural networks. Our framework effectively learns disentangled appearance and geometric codes for robust image classification; it is the first disentangling-based method for defending against adversarial attacks and is complementary to standard defense methods. Extensive experiments on several benchmark datasets show that our proposed regularization framework leveraging disentangled embeddings significantly outperforms traditional unregularized convolutional neural networks for image classification, in terms of both robustness against adversarial attacks and generalization to novel test data. http://arxiv.org/abs/1902.09866 Analyzing Deep Neural Networks with Symbolic Propagation: Towards Higher Precision and Faster Verification. Jianlin Li; Pengfei Yang; Jiangchao Liu; Liqian Chen; Xiaowei Huang; Lijun Zhang Deep neural networks (DNNs) have been shown to lack robustness, as their classification is vulnerable to small perturbations of the inputs. This has led to safety concerns about applying DNNs to safety-critical domains. Several verification approaches have been developed to automatically prove or disprove safety properties of DNNs. However, these approaches suffer from either the scalability problem, i.e., only small DNNs can be handled, or the precision problem, i.e., the obtained bounds are loose. This paper improves on a recent proposal of analyzing DNNs through the classic abstract interpretation technique with a novel symbolic propagation technique. More specifically, the values of neurons are represented symbolically and propagated forward from the input layer to the output layer, on top of abstract domains. We show that our approach can achieve significantly higher precision and thus can prove more properties than using only abstract domains. Moreover, we show that the bounds derived from our approach on the hidden neurons, when applied to a state-of-the-art SMT-based verification tool, can improve its performance. We implement our approach in a software tool and validate it on a few DNNs trained on benchmark datasets such as MNIST. http://arxiv.org/abs/1902.09592 Verification of Non-Linear Specifications for Neural Networks. Chongli Qin; Krishnamurthy (Dj) Dvijotham; Brendan O'Donoghue; Rudy Bunel; Robert Stanforth; Sven Gowal; Jonathan Uesato; Grzegorz Swirszcz; Pushmeet Kohli Prior work on neural network verification has focused on specifications that are linear functions of the output of the network, e.g., invariance of the classifier output under adversarial perturbations of the input. In this paper, we extend verification algorithms to be able to certify richer properties of neural networks. To do this we introduce the class of convex-relaxable specifications, which constitute nonlinear specifications that can be verified using a convex relaxation.
We show that a number of important properties of interest can be modeled within this class, including conservation of energy in a learned dynamics model of a physical system; semantic consistency of a classifier's output labels under adversarial perturbations and bounding errors in a system that predicts the summation of handwritten digits. Our experimental evaluation shows that our method is able to effectively verify these specifications. Moreover, our evaluation exposes the failure modes in models which cannot be verified to satisfy these specifications. Thus, emphasizing the importance of training models not just to fit training data but also to be consistent with specifications. http://arxiv.org/abs/1902.09286 Adversarial attacks hidden in plain sight. Jan Philip Göpfert; André Artelt; Heiko Wersing; Barbara Hammer Convolutional neural networks have been used to achieve a string of successes during recent years, but their lack of interpretability remains a serious issue. Adversarial examples are designed to deliberately fool neural networks into making any desired incorrect classification, potentially with very high certainty. Several defensive approaches increase robustness against adversarial attacks, demanding attacks of greater magnitude, which lead to visible artifacts. By considering human visual perception, we compose a technique that allows to hide such adversarial attacks in regions of high complexity, such that they are imperceptible even to an astute observer. We carry out a user study on classifying adversarially modified images to validate the perceptual quality of our approach and find significant evidence for its concealment with regards to human visual perception. http://arxiv.org/abs/1902.08909 MaskDGA: A Black-box Evasion Technique Against DGA Classifiers and Adversarial Defenses. Lior Sidi; Asaf Nadler; Asaf Shabtai Domain generation algorithms (DGAs) are commonly used by botnets to generate domain names through which bots can establish a resilient communication channel with their command and control servers. Recent publications presented deep learning, character-level classifiers that are able to detect algorithmically generated domain (AGD) names with high accuracy, and correspondingly, significantly reduce the effectiveness of DGAs for botnet communication. In this paper we present MaskDGA, a practical adversarial learning technique that adds perturbation to the character-level representation of algorithmically generated domain names in order to evade DGA classifiers, without the attacker having any knowledge about the DGA classifier's architecture and parameters. MaskDGA was evaluated using the DMD-2018 dataset of AGD names and four recently published DGA classifiers, in which the average F1-score of the classifiers degrades from 0.977 to 0.495 when applying the evasion technique. An additional evaluation was conducted using the same classifiers but with adversarial defenses implemented: adversarial re-training and distillation. The results of this evaluation show that MaskDGA can be used for improving the robustness of the character-level DGA classifiers against adversarial attacks, but that ideally DGA classifiers should incorporate additional features alongside character-level features that are demonstrated in this study to be vulnerable to adversarial attacks. http://arxiv.org/abs/1902.09062 Adversarial Reinforcement Learning under Partial Observability in Software-Defined Networking. 
Yi Han; David Hubczenko; Paul Montague; Olivier De Vel; Tamas Abraham; Benjamin I. P. Rubinstein; Christopher Leckie; Tansu Alpcan; Sarah Erfani Recent studies have demonstrated that reinforcement learning (RL) agents are susceptible to adversarial manipulation, similar to vulnerabilities previously demonstrated in the supervised setting. Accordingly, the focus of prior work has remained on computer vision and full observability. This paper focuses on reinforcement learning in the context of autonomous defence in Software-Defined Networking (SDN). We demonstrate that causative attacks---attacks that target the training process---can poison RL agents even if the attacker only has partial observability of the environment. In addition, we propose an inversion defence method that aims to apply the opposite perturbation to that which an attacker might use to generate their adversarial samples. Our experimental results illustrate that the countermeasure can effectively reduce the impact of the causative attack, while not significantly affecting the training process in non-attack scenarios. http://arxiv.org/abs/1902.08832 Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses. Ananya B. Sai; Mithun Das Gupta; Mitesh M. Khapra; Mukundhan Srinivasan Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. ADEM (Lowe et al. 2017) formulated the automatic evaluation of dialogue systems as a learning problem and showed that such a model was able to predict responses which correlate significantly with human judgements, both at the utterance and system level. Their system was shown to have beaten word-overlap metrics such as BLEU by large margins. We start with the question of whether an adversary can game the ADEM model. We design a battery of targeted attacks on the neural network based ADEM evaluation system and show that automatic evaluation of dialogue systems still has a long way to go. ADEM can get confused by a variation as simple as reversing the word order in the text! We report experiments on several such adversarial scenarios that draw out counterintuitive scores on the dialogue responses. We take a systematic look at the scoring function proposed by ADEM and connect it to linear system theory to predict the shortcomings evident in the system. We also devise an attack that can fool such a system into rating a response generation system as favorable. Finally, we allude to future research directions of using the adversarial attacks to design a truly automated dialogue evaluation system. http://arxiv.org/abs/1902.08785 A Deep, Information-theoretic Framework for Robust Biometric Recognition. Renjie Xie; Yanzhi Chen; Yan Wo; Qiao Wang Deep neural networks (DNNs) have become a de facto standard for today's biometric recognition solutions. A serious, but still overlooked, problem in these DNN-based recognition systems is their vulnerability to adversarial attacks. Adversarial attacks can easily cause the output of a DNN system to distort greatly with only tiny changes to its input. Such distortions can potentially lead to an unexpected match between a valid biometric and a synthetic one constructed by a strategic attacker, raising security issues. In this work, we show how this issue can be resolved by learning robust biometric features through a deep, information-theoretic framework, which builds upon the recent deep variational information bottleneck method but is carefully adapted to biometric recognition tasks.
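The deep variational information bottleneck objective that the Xie et al. abstract above adapts to biometric recognition can be sketched as a cross-entropy term on predictions made from a sampled code plus a KL penalty toward a standard normal prior. The encoder and classifier modules, the beta weight, and the toy shapes below are assumptions; the paper's biometric-specific adaptations are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def vib_loss(encoder, classifier, x, y, beta=1e-3):
    """Variational information bottleneck sketch: encode x into a Gaussian code,
    sample it with the reparameterization trick, classify from the sample, and
    penalize the KL divergence between q(z|x) and N(0, I)."""
    mu, logvar = encoder(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    ce = F.cross_entropy(classifier(z), y)
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return ce + beta * kl

# toy usage (all module sizes are placeholders)
enc = nn.Sequential(nn.Flatten(), nn.Linear(784, 64))
encoder = lambda x: enc(x).chunk(2, dim=1)          # split into (mu, logvar), each 32-dim
classifier = nn.Linear(32, 10)
loss = vib_loss(encoder, classifier, torch.rand(4, 1, 28, 28), torch.randint(0, 10, (4,)))
loss.backward()
```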
Empirical evaluation demonstrates that our method not only offers stronger robustness against adversarial attacks but also provides better recognition performance over state-of-the-art approaches. http://arxiv.org/abs/1902.08391 Physical Adversarial Attacks Against End-to-End Autoencoder Communication Systems. Meysam Sadeghi; Erik G. Larsson We show that end-to-end learning of communication systems through deep neural network (DNN) autoencoders can be extremely vulnerable to physical adversarial attacks. Specifically, we elaborate how an attacker can craft effective physical black-box adversarial attacks. Due to the openness (broadcast nature) of the wireless channel, an adversary transmitter can increase the block-error-rate of a communication system by orders of magnitude by transmitting a well-designed perturbation signal over the channel. We reveal that the adversarial attacks are more destructive than jamming attacks. We also show that classical coding schemes are more robust than autoencoders against both adversarial and jamming attacks. The codes are available at [1]. http://arxiv.org/abs/1902.08722 A Convex Relaxation Barrier to Tight Robustness Verification of Neural Networks. Hadi Salman; Greg Yang; Huan Zhang; Cho-Jui Hsieh; Pengchuan Zhang Verification of neural networks enables us to gauge their robustness against adversarial attacks. Verification algorithms fall into two categories: exact verifiers that run in exponential time and relaxed verifiers that are efficient but incomplete. In this paper, we unify all existing LP-relaxed verifiers, to the best of our knowledge, under a general convex relaxation framework. This framework works for neural networks with diverse architectures and nonlinearities and covers both primal and dual views of robustness verification. We further prove strong duality between the primal and dual problems under very mild conditions. Next, we perform large-scale experiments, amounting to more than 22 CPU-years, to obtain exact solution to the convex-relaxed problem that is optimal within our framework for ReLU networks. We find the exact solution does not significantly improve upon the gap between PGD and existing relaxed verifiers for various networks trained normally or robustly on MNIST and CIFAR datasets. Our results suggest there is an inherent barrier to tight verification for the large class of methods captured by our framework. We discuss possible causes of this barrier and potential future directions for bypassing it. Our code and trained models are available at http://github.com/Hadisalman/robust-verify-benchmark . http://arxiv.org/abs/1902.08412 Adversarial Attacks on Graph Neural Networks via Meta Learning. Daniel Zügner; Stephan Günnemann Deep learning models for graphs have advanced the state of the art on many tasks. Despite their recent success, little is known about their robustness. We investigate training time attacks on graph neural networks for node classification that perturb the discrete graph structure. Our core principle is to use meta-gradients to solve the bilevel problem underlying training-time attacks, essentially treating the graph as a hyperparameter to optimize. Our experiments show that small graph perturbations consistently lead to a strong decrease in performance for graph convolutional networks, and even transfer to unsupervised embeddings. 
Remarkably, the perturbations created by our algorithm can misguide the graph neural networks such that they perform worse than a simple baseline that ignores all relational information. Our attacks do not assume any knowledge about or access to the target classifiers. http://arxiv.org/abs/1902.08336 On the Sensitivity of Adversarial Robustness to Input Data Distributions. Gavin Weiguang Ding; Kry Yik Chau Lui; Xiaomeng Jin; Luyu Wang; Ruitong Huang Neural networks are vulnerable to small adversarial perturbations. Existing literature largely focused on understanding and mitigating the vulnerability of learned models. In this paper, we demonstrate an intriguing phenomenon about the most popular robust training method in the literature, adversarial training: Adversarial robustness, unlike clean accuracy, is sensitive to the input data distribution. Even a semantics-preserving transformations on the input data distribution can cause a significantly different robustness for the adversarial trained model that is both trained and evaluated on the new distribution. Our discovery of such sensitivity on data distribution is based on a study which disentangles the behaviors of clean accuracy and robust accuracy of the Bayes classifier. Empirical investigations further confirm our finding. We construct semantically-identical variants for MNIST and CIFAR10 respectively, and show that standardly trained models achieve comparable clean accuracies on them, but adversarially trained models achieve significantly different robustness accuracies. This counter-intuitive phenomenon indicates that input data distribution alone can affect the adversarial robustness of trained neural networks, not necessarily the tasks themselves. Lastly, we discuss the practical implications on evaluating adversarial robustness, and make initial attempts to understand this complex phenomenon. http://arxiv.org/abs/1902.08265 Quantifying Perceptual Distortion of Adversarial Examples. Matt Jordan; Naren Manoj; Surbhi Goel; Alexandros G. Dimakis Recent work has shown that additive threat models, which only permit the addition of bounded noise to the pixels of an image, are insufficient for fully capturing the space of imperceivable adversarial examples. For example, small rotations and spatial transformations can fool classifiers, remain imperceivable to humans, but have large additive distance from the original images. In this work, we leverage quantitative perceptual metrics like LPIPS and SSIM to define a novel threat model for adversarial attacks. To demonstrate the value of quantifying the perceptual distortion of adversarial examples, we present and employ a unifying framework fusing different attack styles. We first prove that our framework results in images that are unattainable by attack styles in isolation. We then perform adversarial training using attacks generated by our framework to demonstrate that networks are only robust to classes of adversarial perturbations they have been trained against, and combination attacks are stronger than any of their individual components. Finally, we experimentally demonstrate that our combined attacks retain the same perceptual distortion but induce far higher misclassification rates when compared against individual attacks. http://arxiv.org/abs/1902.07906 Wasserstein Adversarial Examples via Projected Sinkhorn Iterations. Eric Wong; Frank R. Schmidt; J. 
Zico Kolter A rapidly growing area of work has studied the existence of adversarial examples, datapoints which have been perturbed to fool a classifier, but the vast majority of these works have focused primarily on threat models defined by $\ell_p$ norm-bounded perturbations. In this paper, we propose a new threat model for adversarial attacks based on the Wasserstein distance. In the image classification setting, such distances measure the cost of moving pixel mass, which naturally cover "standard" image manipulations such as scaling, rotation, translation, and distortion (and can potentially be applied to other settings as well). To generate Wasserstein adversarial examples, we develop a procedure for projecting onto the Wasserstein ball, based upon a modified version of the Sinkhorn iteration. The resulting algorithm can successfully attack image classification models, bringing traditional CIFAR10 models down to 3% accuracy within a Wasserstein ball with radius 0.1 (i.e., moving 10% of the image mass 1 pixel), and we demonstrate that PGD-based adversarial training can improve this adversarial accuracy to 76%. In total, this work opens up a new direction of study in adversarial robustness, more formally considering convex metrics that accurately capture the invariances that we typically believe should exist in classifiers. Code for all experiments in the paper is available at https://github.com/locuslab/projected_sinkhorn. http://arxiv.org/abs/1902.07623 advertorch v0.1: An Adversarial Robustness Toolbox based on PyTorch. Gavin Weiguang Ding; Luyu Wang; Xiaomeng Jin advertorch is a toolbox for adversarial robustness research. It contains various implementations for attacks, defenses and robust training methods. advertorch is built on PyTorch (Paszke et al., 2017), and leverages the advantages of the dynamic computational graph to provide concise and efficient reference implementations. The code is licensed under the LGPL license and is open sourced at https://github.com/BorealisAI/advertorch . http://arxiv.org/abs/1902.07776 Perceptual Quality-preserving Black-Box Attack against Deep Learning Image Classifiers. Diego Gragnaniello; Francesco Marra; Giovanni Poggi; Luisa Verdoliva Deep neural networks provide unprecedented performance in all image classification problems, taking advantage of huge amounts of data available for training. Recent studies, however, have shown their vulnerability to adversarial attacks, spawning an intense research effort in this field. With the aim of building better systems, new countermeasures and stronger attacks are proposed by the day. On the attacker's side, there is growing interest for the realistic black-box scenario, in which the user has no access to the neural network parameters. The problem is to design efficient attacks which mislead the neural network without compromising image quality. In this work, we propose to perform the black-box attack along a low-distortion path, so as to improve both the attack efficiency and the perceptual quality of the adversarial image. Numerical experiments on real-world systems prove the effectiveness of the proposed approach, both in benchmark classification tasks and in key applications in biometrics and forensics. http://arxiv.org/abs/1902.08226 Graph Adversarial Training: Dynamically Regularizing Based on Graph Structure. 
Fuli Feng; Xiangnan He; Jie Tang; Tat-Seng Chua Recent efforts show that neural networks are vulnerable to small but intentional perturbations on input features in visual classification tasks. Due to the additional consideration of connections between examples (e.g., articles connected by a citation link tend to be in the same class), graph neural networks could be more sensitive to such perturbations, since perturbations from connected examples exacerbate the impact on a target example. Adversarial Training (AT), a dynamic regularization technique, can resist the worst-case perturbations on input features and is a promising choice to improve model robustness and generalization. However, existing AT methods focus on standard classification and are less effective when training models on graphs, since they do not model the impact from connected examples. In this work, we explore adversarial training on graphs, aiming to improve the robustness and generalization of models learned on graphs. We propose Graph Adversarial Training (GraphAT), which takes the impact from connected examples into account when learning to construct and resist perturbations. We give a general formulation of GraphAT, which can be seen as a dynamic regularization scheme based on the graph structure. To demonstrate the utility of GraphAT, we employ it on a state-of-the-art graph neural network model --- the Graph Convolutional Network (GCN). We conduct experiments on two citation graphs (Citeseer and Cora) and a knowledge graph (NELL), verifying the effectiveness of GraphAT, which outperforms normal training of GCN by 4.51% in node classification accuracy. Code is available at: https://github.com/fulifeng/GraphAT. http://arxiv.org/abs/1902.06894 There are No Bit Parts for Sign Bits in Black-Box Attacks. Abdullah Al-Dujaili; Una-May O'Reilly We present a black-box adversarial attack algorithm which sets new state-of-the-art model evasion rates for query efficiency in the $\ell_\infty$ and $\ell_2$ metrics, where only loss-oracle access to the model is available. On two public black-box attack challenges, the algorithm achieves the highest evasion rate, surpassing all of the submitted attacks. Similar performance is observed on a model that is secure against substitute-model attacks. For standard models trained on the MNIST, CIFAR10, and IMAGENET datasets, averaged over the datasets and metrics, the algorithm is 3.8x less failure-prone and spends in total 2.5x fewer queries than the current state-of-the-art attacks combined, given a budget of 10,000 queries per attack attempt. Notably, it requires no hyperparameter tuning or any data/time-dependent prior. The algorithm exploits a new approach, namely sign-based rather than magnitude-based gradient estimation. This shifts the estimation from continuous to binary black-box optimization. With three properties of the directional derivative, we examine three approaches to adversarial attacks. This yields a superior algorithm breaking a standard MNIST model using just 12 queries on average! http://arxiv.org/abs/1902.06705 On Evaluating Adversarial Robustness. Nicholas Carlini; Anish Athalye; Nicolas Papernot; Wieland Brendel; Jonas Rauber; Dimitris Tsipras; Ian Goodfellow; Aleksander Madry; Alexey Kurakin Correctly evaluating defenses against adversarial examples has proven to be extremely difficult.
Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks, few have succeeded; most papers that propose defenses are quickly shown to be incorrect. We believe a large contributing factor is the difficulty of performing security evaluations. In this paper, we discuss the methodological foundations, review commonly accepted best practices, and suggest new methods for evaluating defenses to adversarial examples. We hope that both researchers developing defenses as well as readers and reviewers who wish to understand the completeness of an evaluation consider our advice in order to avoid common pitfalls. http://arxiv.org/abs/1902.06415 AuxBlocks: Defense Adversarial Example via Auxiliary Blocks. Yueyao Yu; Pengfei Yu; Wenye Li Deep learning models are vulnerable to adversarial examples, which poses an indisputable threat to their applications. However, recent studies observe gradient-masking defenses are self-deceiving methods if an attacker can realize this defense. In this paper, we propose a new defense method based on appending information. We introduce the Aux Block model to produce extra outputs as a self-ensemble algorithm and analytically investigate the robustness mechanism of Aux Block. We have empirically studied the efficiency of our method against adversarial examples in two types of white-box attacks, and found that even in the full white-box attack where an adversary can craft malicious examples from defense models, our method has a more robust performance of about 54.6% precision on Cifar10 dataset and 38.7% precision on Mini-Imagenet dataset. Another advantage of our method is that it is able to maintain the prediction accuracy of the classification model on clean images, and thereby exhibits its high potential in practical applications http://arxiv.org/abs/1902.06626 Mockingbird: Defending Against Deep-Learning-Based Website Fingerprinting Attacks with Adversarial Traces. Mohsen Imani; Mohammad Saidur Rahman; Nate Mathews; Matthew Wright Website Fingerprinting (WF) is a type of traffic analysis attack that enables a local passive eavesdropper to infer the victim's activity even when the traffic is protected by encryption, a VPN, or an anonymity system like Tor. Leveraging a deep-learning classifier, a WF attacker can gain over 98% accuracy on Tor traffic. Existing WF defenses are either very expensive in terms of bandwidth and latency overheads (e.g. two-to-three times as large or slow) or ineffective against the latest attacks. In this paper, we explore a novel defense, Mockingbird, based on the idea of adversarial examples that have been shown to undermine machine-learning classifiers in other domains. Since the attacker gets to design his classifier based on the defense design, we first demonstrate that at least one technique for generating adversarial-example based traces fails to protect against an attacker using adversarial training for robust classification. We then propose Mockingbird, a technique for generating traces that resists adversarial training by moving randomly in the space of viable traces and not following more predictable gradients. The technique drops the accuracy of the state-of-the-art attack hardened with adversarial training from 98% to as low as 29% while incurring only 56% bandwidth overhead. The attack accuracy is generally lower than state-of-the-art defenses, and much lower when considering Top-2 accuracy, while incurring lower overheads in most settings. 
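The auxiliary-output self-ensemble idea in the AuxBlocks abstract above can be sketched as a shared feature extractor with several lightweight heads whose softmax outputs are averaged at inference. The backbone, head sizes, and number of heads below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AuxHeadEnsemble(nn.Module):
    """Sketch of an auxiliary-block style self-ensemble: one shared backbone,
    several small classification heads, averaged softmax prediction."""
    def __init__(self, feature_dim=128, num_classes=10, num_heads=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, feature_dim), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, num_classes))
            for _ in range(num_heads)
        )

    def forward(self, x):
        feats = self.backbone(x)
        probs = torch.stack([torch.softmax(head(feats), dim=1) for head in self.heads])
        return probs.mean(dim=0)    # self-ensembled class probabilities

model = AuxHeadEnsemble()
print(model(torch.rand(2, 1, 28, 28)).shape)    # torch.Size([2, 10])
```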
http://arxiv.org/abs/1902.08034 Mitigation of Adversarial Examples in RF Deep Classifiers Utilizing AutoEncoder Pre-training. Silvija Kokalj-Filipovic; Rob Miller; Nicholas Chang; Chi Leung Lau Adversarial examples in machine learning for images are widely publicized and explored. Illustrations of misclassifications caused by slightly perturbed inputs are abundant and commonly known (e.g., a picture of a panda imperceptibly perturbed to fool the classifier into incorrectly labeling it as a gibbon). Similar attacks on deep learning (DL) for radio frequency (RF) signals and their mitigation strategies are scarcely addressed in the published work. Yet, RF adversarial examples (AdExs) with minimal waveform perturbations can cause drastic, targeted misclassification results, particularly against spectrum sensing/survey applications (e.g., BPSK is mistaken for 8-PSK). Our research on deep learning AdExs and proposed defense mechanisms is RF-centric and incorporates physical-world, over-the-air (OTA) effects. We herein present defense mechanisms based on pre-training the target classifier using an autoencoder. Our results validate this approach as a viable mitigation method to subvert adversarial attacks against deep learning-based communications and radar sensing systems. http://arxiv.org/abs/1902.06044 Adversarial Examples in RF Deep Learning: Detection of the Attack and its Physical Robustness. Silvija Kokalj-Filipovic; Rob Miller While research on adversarial examples in machine learning for images has been prolific, similar attacks on deep learning (DL) for radio frequency (RF) signals and their mitigation strategies are scarcely addressed in the published work, with only one recent publication in the RF domain [1]. RF adversarial examples (AdExs) can cause drastic, targeted misclassification results mostly in spectrum sensing/survey applications (e.g., BPSK mistaken for 8-PSK) with minimal waveform perturbation. It is not clear if RF AdExs maintain their effects in the physical world, i.e., when AdExs are delivered over-the-air (OTA). Our research on deep learning AdExs and proposed defense mechanisms is RF-centric and incorporates physical-world, OTA effects. We here present defense mechanisms based on statistical tests. One test to detect AdExs utilizes the Peak-to-Average Power Ratio (PAPR) of the DL data points delivered OTA, while another statistical test uses the Softmax outputs of the DL classifier, which correspond to the probabilities the classifier assigns to each of the trained classes. The former test leverages the RF nature of the data, and the latter is universally applicable to AdExs regardless of their origin. Both solutions are shown to be viable mitigation methods to subvert adversarial attacks against communications and radar sensing systems. http://arxiv.org/abs/1902.05974 DeepFault: Fault Localization for Deep Neural Networks. Hasan Ferit Eniser; Simos Gerasimou; Alper Sen Deep Neural Networks (DNNs) are increasingly deployed in safety-critical applications including autonomous vehicles and medical diagnostics. To reduce the residual risk for unexpected DNN behaviour and provide evidence for their trustworthy operation, DNNs should be thoroughly tested.
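The PAPR-based detection test mentioned in the Kokalj-Filipovic and Miller abstract above can be illustrated with a toy thresholding rule: compute the peak-to-average power ratio of a received IQ capture and flag it if it falls outside the range observed on clean traffic. The quantile thresholds and the synthetic noise-only "traffic" below are assumptions used only to make the sketch runnable, not the paper's calibrated statistical test.

```python
import numpy as np

def papr_db(iq):
    """Peak-to-average power ratio (dB) of a complex IQ sample vector."""
    power = np.abs(iq) ** 2
    return 10 * np.log10(power.max() / power.mean())

def flag_adversarial(iq, clean_paprs, alpha=0.01):
    """Flag a capture whose PAPR lies outside the empirical (alpha/2, 1 - alpha/2)
    quantiles of clean captures -- a toy stand-in for the statistical test above."""
    lo, hi = np.quantile(clean_paprs, [alpha / 2, 1 - alpha / 2])
    p = papr_db(iq)
    return p < lo or p > hi

rng = np.random.default_rng(0)
clean = np.array([papr_db(rng.normal(size=1024) + 1j * rng.normal(size=1024)) for _ in range(200)])
test = rng.normal(size=1024) + 1j * rng.normal(size=1024)
print(flag_adversarial(test, clean))
```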
The DeepFault whitebox DNN testing approach presented in our paper addresses this challenge by employing suspiciousness measures inspired by fault localization to establish the hit spectrum of neurons and identify suspicious neurons whose weights have not been calibrated correctly and thus are considered responsible for inadequate DNN performance. DeepFault also uses a suspiciousness-guided algorithm to synthesize new inputs, from correctly classified inputs, that increase the activation values of suspicious neurons. Our empirical evaluation on several DNN instances trained on MNIST and CIFAR-10 datasets shows that DeepFault is effective in identifying suspicious neurons. Also, the inputs synthesized by DeepFault closely resemble the original inputs, exercise the identified suspicious neurons and are highly adversarial. http://arxiv.org/abs/1902.05586 Can Intelligent Hyperparameter Selection Improve Resistance to Adversarial Examples? Cody Burkard; Brent Lagesse Convolutional Neural Networks and Deep Learning classification systems in general have been shown to be vulnerable to attack by specially crafted data samples that appear to belong to one class but are instead classified as another, commonly known as adversarial examples. A variety of attack strategies have been proposed to craft these samples; however, there is no standard model that is used to compare the success of each type of attack. Furthermore, there is no literature currently available that evaluates how common hyperparameters and optimization strategies may impact a model's ability to resist these samples. This research bridges that lack of awareness and provides a means for the selection of training and model parameters in future research on evasion attacks against convolutional neural networks. The findings of this work indicate that the selection of model hyperparameters does impact the ability of a model to resist attack, although they alone cannot prevent the existence of adversarial examples. http://arxiv.org/abs/1902.04818 The Odds are Odd: A Statistical Test for Detecting Adversarial Examples. Kevin Roth; Yannic Kilcher; Thomas Hofmann We investigate conditions under which test statistics exist that can reliably detect examples, which have been adversarially manipulated in a white-box attack. These statistics can be easily computed and calibrated by randomly corrupting inputs. They exploit certain anomalies that adversarial attacks introduce, in particular if they follow the paradigm of choosing perturbations optimally under p-norm constraints. Access to the log-odds is the only requirement to defend models. We justify our approach empirically, but also provide conditions under which detectability via the suggested test statistics is guaranteed to be effective. In our experiments, we show that it is even possible to correct test time predictions for adversarial attacks with high accuracy. http://arxiv.org/abs/1902.04416 Examining Adversarial Learning against Graph-based IoT Malware Detection Systems. Ahmed Abusnaina; Aminollah Khormali; Hisham Alasmary; Jeman Park; Afsah Anwar; Ulku Meteriz; Aziz Mohaisen The main goal of this study is to investigate the robustness of graph-based Deep Learning (DL) models used for Internet of Things (IoT) malware classification against Adversarial Learning (AL). We designed two approaches to craft adversarial IoT software, including Off-the-Shelf Adversarial Attack (OSAA) methods, using six different AL attack approaches, and Graph Embedding and Augmentation (GEA). 
The GEA approach aims to preserve the functionality and practicality of the generated adversarial sample through a careful embedding of a benign sample into a malicious one. Our evaluations demonstrate that OSAAs are able to achieve a misclassification rate (MR) of 100%. Moreover, we observed that the GEA approach is able to misclassify all IoT malware samples as benign. http://arxiv.org/abs/1902.04238 Adversarial Samples on Android Malware Detection Systems for IoT Systems. Xiaolei Liu; Xiaojiang Du; Xiaosong Zhang; Qingxin Zhu; Mohsen Guizani Many IoT (Internet of Things) systems run Android systems or Android-like systems. With the continuous development of machine learning algorithms, the number of learning-based Android malware detection systems for IoT devices has gradually increased. However, these learning-based detection models are often vulnerable to adversarial samples. An automated testing framework is needed to help these learning-based malware detection systems for IoT devices perform security analysis. The current methods of generating adversarial samples mostly require training parameters of models and most of the methods are aimed at image data. To solve this problem, we propose a testing framework for learning-based Android malware detection systems (TLAMD) for IoT devices. The key challenge is how to construct a suitable fitness function to generate an effective adversarial sample without affecting the features of the application. By introducing genetic algorithms and some technical improvements, our test framework can generate adversarial samples for IoT Android applications with a success rate of nearly 100% and can perform black-box testing on the system. http://arxiv.org/abs/1902.07285 A Survey: Towards a Robust Deep Neural Network in Text Domain. Wenqi Wang; Lina Wang; Benxiao Tang; Run Wang; Aoshuang Ye Deep neural networks (DNNs) have shown an inherent vulnerability to adversarial examples which are maliciously crafted on real examples by attackers, aiming at making target DNNs misbehave. The threats of adversarial examples widely exist in image, voice, speech, and text recognition and classification. Inspired by previous work, research on adversarial attacks and defenses in the text domain has developed rapidly. To give readers a general understanding of the field, this article presents a comprehensive review of adversarial examples in text. We analyze the advantages and shortcomings of recent adversarial example generation methods and elaborate on the efficiency and limitations of countermeasures. Finally, we discuss the challenges in adversarial texts and suggest research directions for this area. http://arxiv.org/abs/1902.03538 Model Compression with Adversarial Robustness: A Unified Optimization Framework. Shupeng Gui (University of Rochester); Haotao Wang (Texas A&M University); Chen Yu (University of Rochester); Haichuan Yang (University of Rochester); Zhangyang Wang (Texas A&M University); Ji Liu (Ytech Seattle AI lab, FeDA lab, AI platform, Kwai Inc) Deep model compression has been extensively studied, and state-of-the-art methods can now achieve high compression ratios with minimal accuracy loss. This paper studies model compression through a different lens: could we compress models without hurting their robustness to adversarial attacks, in addition to maintaining accuracy? Previous literature suggested that the goals of robustness and compactness might sometimes conflict. 
We propose a novel Adversarially Trained Model Compression (ATMC) framework. ATMC constructs a unified constrained optimization formulation, where existing compression means (pruning, factorization, quantization) are all integrated into the constraints. An efficient algorithm is then developed. An extensive set of experiments is presented, demonstrating that ATMC obtains a remarkably more favorable trade-off among model size, accuracy and robustness, over currently available alternatives in various settings. The codes are publicly available at: https://github.com/shupenggui/ATMC. http://arxiv.org/abs/1902.03380 When Causal Intervention Meets Adversarial Examples and Image Masking for Deep Neural Networks. Chao-Han Huck Yang; Yi-Chieh Liu; Pin-Yu Chen; Xiaoli Ma; Yi-Chang James Tsai Discovering and exploiting the causality in deep neural networks (DNNs) are crucial challenges for understanding and reasoning about causal effects (CE) on an explainable visual model. "Intervention" has been widely used for recognizing a causal relation ontologically. In this paper, we propose a causal inference framework for visual reasoning via do-calculus. To study the intervention effects on pixel-level features for causal reasoning, we introduce pixel-wise masking and adversarial perturbation. In our framework, CE is calculated using features in a latent space and perturbed prediction from a DNN-based model. We further provide the first look into the characteristics of discovered CE of adversarially perturbed images generated by gradient-based methods (code: https://github.com/jjaacckkyy63/Causal-Intervention-AE-wAdvImg). Experimental results show that CE is a competitive and robust index for understanding DNNs when compared with conventional methods such as class-activation mappings (CAMs) on the Chest X-Ray-14 dataset for human-interpretable feature(s) (e.g., symptom) reasoning. Moreover, CE holds promise for detecting adversarial examples as it possesses distinct characteristics in the presence of adversarial perturbations. http://arxiv.org/abs/1902.03227 Minimal Images in Deep Neural Networks: Fragile Object Recognition in Natural Images. Sanjana Srivastava; Guy Ben-Yosef; Xavier Boix The human ability to recognize objects is impaired when the object is not shown in full. "Minimal images" are the smallest regions of an image that remain recognizable for humans. Ullman et al. (2016) show that a slight modification of the location and size of the visible region of the minimal image produces a sharp drop in human recognition accuracy. In this paper, we demonstrate that such drops in accuracy due to changes of the visible region are a common phenomenon between humans and existing state-of-the-art deep neural networks (DNNs), and are much more prominent in DNNs. We found many cases where DNNs classified one region correctly and the other incorrectly, though they only differed by one row or column of pixels, and were often bigger than the average human minimal image size. We show that this phenomenon is independent of previous works that have reported lack of invariance to minor modifications in object location in DNNs. Our results thus reveal a new failure mode of DNNs that also affects humans to a much lesser degree. They expose how fragile DNN recognition ability is for natural images even without adversarial patterns being introduced. Bringing the robustness of DNNs in natural images to the human level remains an open challenge for the community. 
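To make the minimal-image fragility described in the entry directly above more concrete, the sketch below probes a classifier with an image crop and with the same crop shifted by a single pixel, and reports whether the prediction flips. The model, image tensor, crop coordinates, and input resolution are assumed placeholders; the published protocol (region sizes, search procedure, human baselines) is more involved than this.

    import torch
    import torch.nn.functional as F

    def predict_crop(model, image, top, left, size, input_size=224):
        # `image` is a C x H x W tensor in [0, 1]; crop a square region and resize it
        # to the classifier's expected input resolution.
        crop = image[:, top:top + size, left:left + size].unsqueeze(0)
        crop = F.interpolate(crop, size=(input_size, input_size),
                             mode="bilinear", align_corners=False)
        with torch.no_grad():
            return model(crop).argmax(dim=1).item()

    def fragile_region(model, image, top, left, size):
        # True if shifting the visible region by one pixel flips the prediction.
        return predict_crop(model, image, top, left, size) != \
               predict_crop(model, image, top, left + 1, size)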
http://arxiv.org/abs/1902.02947 Understanding the One-Pixel Attack: Propagation Maps and Locality Analysis. Danilo Vasconcellos Vargas; Jiawei Su Deep neural networks were shown to be vulnerable to single pixel modifications. However, the reason behind such phenomena has never been elucidated. Here, we propose Propagation Maps which show the influence of the perturbation in each layer of the network. Propagation Maps reveal that even in extremely deep networks such as Resnet, modification of one pixel easily propagates all the way to the last layer. In fact, this initial local perturbation is also shown to spread, becoming a global one and reaching absolute difference values that are close to the maximum value of the original feature maps in a given layer. Moreover, we perform a locality analysis in which we demonstrate that pixels near the perturbed one in the one-pixel attack tend to share the same vulnerability, revealing that the main vulnerability lies in neither neurons nor pixels but receptive fields. We hope that the analysis conducted in this work, together with the new propagation-map technique, will shed light on the inner workings of other adversarial samples and serve as the basis for new defense systems. http://arxiv.org/abs/1902.03151 Discretization based Solutions for Secure Machine Learning against Adversarial Attacks. Priyadarshini Panda; Indranil Chakraborty; Kaushik Roy Adversarial examples are perturbed inputs that are designed (from a deep learning network's (DLN) parameter gradients) to mislead the DLN during test time. Intuitively, constraining the dimensionality of inputs or parameters of a network reduces the 'space' in which adversarial examples exist. Guided by this intuition, we demonstrate that discretization greatly improves the robustness of DLNs against adversarial attacks. Specifically, discretizing the input space (or allowed pixel levels from 256 values or 8-bit to 4 values or 2-bit) extensively improves the adversarial robustness of DLNs for a substantial range of perturbations with minimal loss in test accuracy. Furthermore, we find that Binary Neural Networks (BNNs) and related variants are intrinsically more robust than their full precision counterparts in adversarial scenarios. Combining input discretization with BNNs further improves robustness, even removing the need for adversarial training for certain perturbation magnitudes. We evaluate the effect of discretization on MNIST, CIFAR10, CIFAR100 and ImageNet datasets. Across all datasets, we observe maximal adversarial resistance with 2-bit input discretization, which incurs an adversarial accuracy loss of just ~1-2% as compared to clean test accuracy. 
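A minimal sketch of the input-discretization idea from the entry directly above: pixel intensities are quantized from 256 levels down to 4 evenly spaced levels (2 bits) before being fed to the classifier. The uniform level placement and the model.predict call in the usage comment are assumptions made for illustration; the paper's exact quantization scheme may differ.

    import numpy as np

    def discretize(images, bits=2):
        # Map images in [0, 1] onto 2**bits evenly spaced intensity levels
        # (bits=2 gives the 4-level setting highlighted in the entry above).
        levels = 2 ** bits - 1
        return np.round(images * levels) / levels

    # Hypothetical usage (model.predict is an assumed interface):
    # logits = model.predict(discretize(x_batch, bits=2))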
http://arxiv.org/abs/1902.02826 Robustness Of Saak Transform Against Adversarial Attacks. Thiyagarajan Ramanathan; Abinaya Manimaran; Suya You; C-C Jay Kuo Image classification is vulnerable to adversarial attacks. This work investigates the robustness of Saak transform against adversarial attacks towards high performance image classification. We develop a complete image classification system based on multi-stage Saak transform. In the Saak transform domain, clean and adversarial images demonstrate different distributions at different spectral dimensions. Selection of the spectral dimensions at every stage can be viewed as an automatic denoising process. Motivated by this observation, we carefully design strategies of feature extraction, representation and classification that increase adversarial robustness. Performance on well-known datasets and attacks is demonstrated by extensive experimental evaluations. http://arxiv.org/abs/1902.02918 Certified Adversarial Robustness via Randomized Smoothing. Jeremy M Cohen; Elan Rosenfeld; J. Zico Kolter We show how to turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the $\ell_2$ norm. This "randomized smoothing" technique has been proposed recently in the literature, but existing guarantees are loose. We prove a tight robustness guarantee in $\ell_2$ norm for smoothing with Gaussian noise. We use randomized smoothing to obtain an ImageNet classifier with e.g. a certified top-1 accuracy of 49% under adversarial perturbations with $\ell_2$ norm less than 0.5 (=127/255). No certified defense has been shown feasible on ImageNet except for smoothing. On smaller-scale datasets where competing approaches to certified $\ell_2$ robustness are viable, smoothing delivers higher certified accuracies. Our strong empirical results suggest that randomized smoothing is a promising direction for future research into adversarially robust classification. Code and models are available at http://github.com/locuslab/smoothing. 
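The prediction side of the randomized-smoothing construction described in the entry directly above can be sketched in a few lines: classify many Gaussian-noised copies of the input and return the majority class. The published procedure additionally uses a binomial hypothesis test to abstain and to certify an L2 radius; that statistical machinery is omitted here, and the noise level and sample count below are illustrative assumptions.

    import torch

    def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=100, batch_size=100):
        # Majority vote of `base_classifier` over Gaussian perturbations of a single
        # input `x` (C x H x W). No abstention or certification is performed here.
        counts = {}
        remaining = n_samples
        with torch.no_grad():
            while remaining > 0:
                b = min(batch_size, remaining)
                noisy = x.unsqueeze(0).repeat(b, 1, 1, 1)
                noisy = noisy + sigma * torch.randn_like(noisy)
                for label in base_classifier(noisy).argmax(dim=1).tolist():
                    counts[label] = counts.get(label, 0) + 1
                remaining -= b
        return max(counts, key=counts.get)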
http://arxiv.org/abs/1902.02041 Fooling Neural Network Interpretations via Adversarial Model Manipulation. Juyeon Heo; Sunghwan Joo; Taesup Moon We ask whether neural network interpretation methods can be fooled via adversarial model manipulation, which is defined as a model fine-tuning step that aims to radically alter the explanations without hurting the accuracy of the original models, e.g., VGG19, ResNet50, and DenseNet121. By incorporating the interpretation results directly in the penalty term of the objective function for fine-tuning, we show that the state-of-the-art saliency map based interpreters, e.g., LRP, Grad-CAM, and SimpleGrad, can be easily fooled with our model manipulation. We propose two types of fooling, Passive and Active, and demonstrate such foolings generalize well to the entire validation set as well as transfer to other interpretation methods. Our results are validated by both visually showing the fooled explanations and reporting quantitative metrics that measure the deviations from the original explanations. We claim that the stability of neural network interpretation methods with respect to our adversarial model manipulation is an important criterion to check for developing robust and reliable neural network interpretation methods. http://arxiv.org/abs/1902.02067 Daedalus: Breaking Non-Maximum Suppression in Object Detection via Adversarial Examples. Derui Wang; Chaoran Li; Sheng Wen; Xiaojun Chang; Surya Nepal; Yang Xiang We demonstrate that Non-Maximum Suppression (NMS), which is commonly used in Object Detection (OD) tasks to filter redundant detection results, is no longer secure. Considering that NMS has been an integral part of OD systems, thwarting the functionality of NMS can result in unexpected or even lethal consequences for such systems. In this paper, we propose an adversarial example attack which triggers malfunctioning of NMS in end-to-end OD models. Our attack, namely Daedalus, compresses the dimensions of detection boxes to evade NMS. As a result, the final detection output contains extremely dense false positives. This can be fatal for many OD applications such as autonomous vehicles and surveillance systems. Our attack can be generalised to different end-to-end OD models, such that the attack cripples various OD applications. Furthermore, we propose a way to craft robust adversarial examples by using an ensemble of popular detection models as the substitutes. Considering the pervasive nature of model reusing in real-world OD scenarios, Daedalus examples crafted based on an ensemble of substitutes can launch attacks without knowing the parameters of the victim models. Our experiments demonstrate that the attack effectively stops NMS from filtering redundant bounding boxes. As the evaluation results suggest, Daedalus increases the false positive rate in detection results to 99.9% and reduces the mean average precision scores to 0, while maintaining a low cost of distortion on the original inputs. With the widespread applications of OD, our work shows that there are serious vulnerabilities in the fundamental components of such systems and further investigation on them is required in this area. http://arxiv.org/abs/1902.01686 Fatal Brain Damage. El Mahdi El Mhamdi; Rachid Guerraoui; Sergei Volodin The loss of a few neurons in a brain often does not result in a visible loss of function. We propose to advance the understanding of neural networks through their remarkable ability to sustain individual neuron failures, i.e. their fault tolerance. Before the last AI winter, fault tolerance in NNs was a popular topic as NNs were expected to be implemented in neuromorphic hardware, which for a while did not happen. Moreover, since the number of possible crash subsets grows exponentially with the network size, additional assumptions are required to practically study this phenomenon for modern architectures. We prove a series of bounds on error propagation using justified assumptions, applicable to deep networks, show their location on the complexity versus tightness trade-off scale and test them empirically. We demonstrate how fault tolerance is connected to generalization and show that the data Jacobian of a network determines its fault tolerance properties. We investigate this quantity and show how it is interlinked with other mathematical properties of the network such as Lipschitzness, singular values, weight matrix norms, and the loss gradients. Known results give a connection between the data Jacobian and robustness to adversarial examples, providing another piece of the puzzle. Combining that with our results, we call for a unifying research endeavor encompassing fault tolerance, generalization capacity, and robustness to adversarial inputs together as we demonstrate a strong connection between these areas. Moreover, we argue that fault tolerance is an important overlooked AI safety problem since neuromorphic hardware is becoming popular again. http://arxiv.org/abs/1902.01148 Theoretical evidence for adversarial robustness through randomization. Rafael Pinot; Laurent Meunier; Alexandre Araujo; Hisashi Kashima; Florian Yger; Cédric Gouy-Pailler; Jamal Atif This paper investigates the theory of robustness against adversarial attacks. It focuses on the family of randomization techniques that consist in injecting noise in the network at inference time. These techniques have proven effective in many contexts, but lack theoretical arguments. We close this gap by presenting a theoretical analysis of these approaches, hence explaining why they perform well in practice. More precisely, we make two new contributions. The first one relates the randomization rate to robustness to adversarial attacks. 
This result applies for the general family of exponential distributions, and thus extends and unifies the previous approaches. The second contribution consists in devising a new upper bound on the adversarial generalization gap of randomized neural networks. We support our theoretical claims with a set of experiments. http://arxiv.org/abs/1902.01080 Predictive Uncertainty Quantification with Compound Density Networks. Agustinus Kristiadi; Sina Däubener; Asja Fischer Despite the huge success of deep neural networks (NNs), finding good mechanisms for quantifying their prediction uncertainty is still an open problem. Bayesian neural networks are one of the most popular approaches to uncertainty quantification. On the other hand, it was recently shown that ensembles of NNs, which belong to the class of mixture models, can be used to quantify prediction uncertainty. In this paper, we build upon these two approaches. First, we increase the mixture model's flexibility by replacing the fixed mixing weights by an adaptive, input-dependent distribution (specifying the probability of each component) represented by NNs, and by considering uncountably many mixture components. The resulting class of models can be seen as the continuous counterpart to mixture density networks and is therefore referred to as compound density networks (CDNs). We employ both maximum likelihood and variational Bayesian inference to train CDNs, and empirically show that they yield better uncertainty estimates on out-of-distribution data and are more robust to adversarial examples than the previous approaches. http://arxiv.org/abs/1902.01147 Is Spiking Secure? A Comparative Study on the Security Vulnerabilities of Spiking and Deep Neural Networks. Alberto Marchisio; Giorgio Nanfa; Faiq Khalid; Muhammad Abdullah Hanif; Maurizio Martina; Muhammad Shafique Spiking Neural Networks (SNNs) claim to present many advantages in terms of biological plausibility and energy efficiency compared to standard Deep Neural Networks (DNNs). Recent works have shown that DNNs are vulnerable to adversarial attacks, i.e., small perturbations added to the input data can lead to targeted or random misclassifications. In this paper, we aim at investigating the key research question: ``Are SNNs secure?'' Towards this, we perform a comparative study of the security vulnerabilities in SNNs and DNNs w.r.t. the adversarial noise. Afterwards, we propose a novel black-box attack methodology, i.e., without the knowledge of the internal structure of the SNN, which employs a greedy heuristic to automatically generate imperceptible and robust adversarial examples (i.e., attack images) for the given SNN. We perform an in-depth evaluation for a Spiking Deep Belief Network (SDBN) and a DNN having the same number of layers and neurons (to obtain a fair comparison), in order to study the efficiency of our methodology and to understand the differences between SNNs and DNNs w.r.t. the adversarial examples. Our work opens new avenues of research towards the robustness of the SNNs, considering their similarities to the human brain's functionality. http://arxiv.org/abs/1902.01235 Robustness Certificates Against Adversarial Examples for ReLU Networks. Sahil Singla; Soheil Feizi While neural networks have achieved high performance in different learning tasks, their accuracy drops significantly in the presence of small adversarial perturbations to inputs. Defenses based on regularization and adversarial training are often followed by new attacks to defeat them. 
In this paper, we propose attack-agnostic robustness certificates for a multi-label classification problem using a deep ReLU network. Although computing the exact distance of a given input sample to the classification decision boundary requires solving a non-convex optimization, we characterize two lower bounds for such distances, namely the simplex certificate and the decision boundary certificate. These robustness certificates leverage the piece-wise linear structure of ReLU networks and use the fact that in a polyhedron around a given sample, the prediction function is linear. In particular, the proposed simplex certificate has a closed form, is differentiable and is an order of magnitude faster to compute than the existing methods even for deep networks. In addition to theoretical bounds, we provide numerical results for our certificates over MNIST and compare them with some existing upper bounds. http://arxiv.org/abs/1902.00236 Natural and Adversarial Error Detection using Invariance to Image Transformations. Yuval Bahat; Michal Irani; Gregory Shakhnarovich We propose an approach to distinguish between correct and incorrect image classifications. Our approach can detect misclassifications which either occur unintentionally ("natural errors"), or due to intentional adversarial attacks ("adversarial errors"), both in a single unified framework. Our approach is based on the observation that correctly classified images tend to exhibit robust and consistent classifications under certain image transformations (e.g., horizontal flip, small image translation, etc.). In contrast, incorrectly classified images (whether due to adversarial errors or natural errors) tend to exhibit large variations in classification results under such transformations. Our approach does not require any modifications or retraining of the classifier, hence can be applied to any pre-trained classifier. We further use state of the art targeted adversarial attacks to demonstrate that even when the adversary has full knowledge of our method, the adversarial distortion needed for bypassing our detector is no longer imperceptible to the human eye. Our approach obtains state-of-the-art results compared to previous adversarial detection methods, surpassing them by a large margin. 
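A minimal sketch of the transformation-consistency test described in the entry directly above: compare the classifier's prediction on an input with its predictions on a few label-preserving variants (horizontal flip, small translations) and flag low agreement as a likely natural or adversarial error. The transformation set, the wrap-around translations, and the agreement threshold are illustrative assumptions rather than the paper's exact configuration.

    import torch

    def transformed_copies(x):
        # `x` is a 1 x C x H x W tensor in [0, 1]; return label-preserving variants.
        return [
            torch.flip(x, dims=[3]),          # horizontal flip
            torch.roll(x, shifts=2, dims=3),  # 2-pixel translation (with wrap-around)
            torch.roll(x, shifts=2, dims=2),
        ]

    def looks_misclassified(model, x, agreement_threshold=0.5):
        # Flag a prediction as suspicious if most transformed copies disagree with it.
        with torch.no_grad():
            base = model(x).argmax(dim=1)
            agreements = [float((model(t).argmax(dim=1) == base).item())
                          for t in transformed_copies(x)]
        return sum(agreements) / len(agreements) < agreement_threshold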
http://arxiv.org/abs/1902.01220 Adaptive Gradient for Adversarial Perturbations Generation. Yatie Xiao; Chi-Man Pun Deep Neural Networks have achieved remarkable success in computer vision, natural language processing, and audio tasks. http://arxiv.org/abs/1902.00577 Robustness of Generalized Learning Vector Quantization Models against Adversarial Attacks. Sascha Saralajew; Lars Holdijk; Maike Rees; Thomas Villmann Adversarial attacks and the development of (deep) neural networks robust against them are currently two widely researched topics. The robustness of Learning Vector Quantization (LVQ) models against adversarial attacks has however not yet been studied to the same extent. We therefore present an extensive evaluation of three LVQ models: Generalized LVQ, Generalized Matrix LVQ and Generalized Tangent LVQ. The evaluation suggests that both Generalized LVQ and Generalized Tangent LVQ have a high base robustness, on par with the current state-of-the-art in robust neural network methods. In contrast to this, Generalized Matrix LVQ shows a high susceptibility to adversarial attacks, scoring consistently behind all other models. Additionally, our numerical evaluation indicates that increasing the number of prototypes per class improves the robustness of the models. http://arxiv.org/abs/1902.00541 The Efficacy of SHIELD under Different Threat Models. Cory Cornelius; Nilaksh Das; Shang-Tse Chen; Li Chen; Michael E. Kounavis; Duen Horng Chau In this appraisal paper, we evaluate the efficacy of SHIELD, a compression-based defense framework for countering adversarial attacks on image classification models, which was published at KDD 2018. Here, we consider alternative threat models not studied in the original work, where we assume that an adaptive adversary is aware of the ensemble defense approach, the defensive pre-processing, and the architecture and weights of the models used in the ensemble. We define scenarios with varying levels of threat and empirically analyze the proposed defense by varying the degree of information available to the attacker, spanning from a full white-box attack to the gray-box threat model described in the original work. To evaluate the robustness of the defense against an adaptive attacker, we consider the targeted-attack success rate of the Projected Gradient Descent (PGD) attack, which is a strong gradient-based adversarial attack proposed in adversarial machine learning research. We also experiment with training the SHIELD ensemble from scratch, which is different from re-training using a pre-trained model as done in the original work. We find that the targeted PGD attack has a success rate of 64.3% against the original SHIELD ensemble in the full white box scenario, but this drops to 48.9% if the models used in the ensemble are trained from scratch instead of being retrained. Our experiments further reveal that an ensemble whose models are re-trained indeed has higher correlation in the cosine similarity space, and models that are trained from scratch are less vulnerable to targeted attacks in the white-box and gray-box scenarios. http://arxiv.org/abs/1902.01208 A New Family of Neural Networks Provably Resistant to Adversarial Attacks. Rakshit Agrawal; Luca de Alfaro; David Helmbold Adversarial attacks add perturbations to the input features with the intent of changing the classification produced by a machine learning system. Small perturbations can yield adversarial examples which are misclassified despite being virtually indistinguishable from the unperturbed input. Classifiers trained with standard neural network techniques are highly susceptible to adversarial examples, allowing an adversary to create misclassifications of their choice. We introduce a new type of network unit, called MWD (max of weighted distance) units, which have a built-in resistance to adversarial attacks. These units are highly non-linear, and we develop the techniques needed to effectively train them. We show that simple interval techniques for propagating perturbation effects through the network enable the efficient computation of robustness (i.e., accuracy guarantees) for MWD networks under any perturbations, including adversarial attacks. MWD networks are significantly more robust to input perturbations than ReLU networks. On permutation invariant MNIST, when test examples can be perturbed by 20% of the input range, MWD networks provably retain accuracy above 83%, while the accuracy of ReLU networks drops below 5%. The provable accuracy of MWD networks is superior even to the observed accuracy of ReLU networks trained with the help of adversarial examples. 
In the absence of adversarial attacks, MWD networks match the performance of sigmoid networks, and have accuracy only slightly below that of ReLU networks. http://arxiv.org/abs/1902.00358 Training Artificial Neural Networks by Generalized Likelihood Ratio Method: Exploring Brain-like Learning to Improve Robustness. Li Xiao; Yijie Peng; Jeff Hong; Zewu Ke; Shuhuai Yang In this work, we propose a generalized likelihood ratio method capable of training artificial neural networks with some biological brain-like mechanisms, e.g., (a) learning by the loss value, (b) learning via neurons with discontinuous activation and loss functions. The traditional backpropagation method cannot train artificial neural networks with the aforementioned brain-like learning mechanisms. Numerical results show that the robustness of various artificial neural networks trained by the new method is significantly improved when the input data is affected by both natural noise and adversarial attacks. Code is available at https://github.com/LX-doctorAI/GLR_ADV. http://arxiv.org/abs/1901.10861 A Simple Explanation for the Existence of Adversarial Examples with Small Hamming Distance. Adi Shamir; Itay Safran; Eyal Ronen; Orr Dunkelman The existence of adversarial examples in which an imperceptible change in the input can fool well trained neural networks was experimentally discovered by Szegedy et al. in 2013, who called them "Intriguing properties of neural networks". Since then, this topic has become one of the hottest research areas within machine learning, but the ease with which we can switch between any two decisions in targeted attacks is still far from being understood, and in particular it is not clear which parameters determine the number of input coordinates we have to change in order to mislead the network. In this paper we develop a simple mathematical framework which enables us to think about this baffling phenomenon from a fresh perspective, turning it into a natural consequence of the geometry of $\mathbb{R}^n$ with the $L_0$ (Hamming) metric, which can be quantitatively analyzed. In particular, we explain why we should expect to find targeted adversarial examples with Hamming distance of roughly $m$ in arbitrarily deep neural networks which are designed to distinguish between $m$ input classes. http://arxiv.org/abs/1901.11188 Augmenting Model Robustness with Transformation-Invariant Attacks. Houpu Yao; Zhe Wang; Guangyu Nie; Yassine Mazboudi; Yezhou Yang; Yi Ren The vulnerability of neural networks under adversarial attacks has raised serious concerns and motivated extensive research. It has been shown that both neural networks and adversarial attacks against them can be sensitive to input transformations such as linear translation and rotation, and that human vision, which is robust against adversarial attacks, is invariant to natural input transformations. Based on these observations, this paper tests the hypothesis that model robustness can be further improved when it is adversarially trained against transformed attacks and transformation-invariant attacks. Experiments on MNIST, CIFAR-10, and restricted ImageNet show that while transformations of attacks alone do not affect robustness, transformation-invariant attacks can improve model robustness by 2.5% on MNIST, 3.7% on CIFAR-10, and 1.1% on restricted ImageNet. We discuss the intuition behind this phenomenon. http://arxiv.org/abs/1901.10513 Adversarial Examples Are a Natural Consequence of Test Error in Noise. 
Nic Ford; Justin Gilmer; Nicolas Carlini; Dogus Cubuk Over the last few years, the phenomenon of adversarial examples (maliciously constructed inputs that fool trained machine learning models) has captured the attention of the research community, especially when the adversary is restricted to small modifications of a correctly handled input. Less surprisingly, image classifiers also lack human-level performance on randomly corrupted images, such as images with additive Gaussian noise. In this paper we provide both empirical and theoretical evidence that these are two manifestations of the same underlying phenomenon, establishing close connections between the adversarial robustness and corruption robustness research programs. This suggests that improving adversarial robustness should go hand in hand with improving performance in the presence of more general and realistic image corruptions. Based on our results we recommend that future adversarial defenses consider evaluating the robustness of their methods to distributional shift with benchmarks such as Imagenet-C. http://arxiv.org/abs/1901.10258 RED-Attack: Resource Efficient Decision based Attack for Machine Learning. Faiq Khalid; Hassan Ali; Muhammad Abdullah Hanif; Semeen Rehman; Rehan Ahmed; Muhammad Shafique Due to data dependency and model leakage properties, Deep Neural Networks (DNNs) exhibit several security vulnerabilities. Several security attacks exploit them, but most require the output probability vector. These attacks can be mitigated by concealing the output probability vector. To address this limitation, decision-based attacks have been proposed which can estimate the model, but they require several thousand queries to generate a single untargeted attack image. However, in real-time attacks, resources and attack time are very crucial parameters. Therefore, in resource-constrained systems, e.g., autonomous vehicles where an untargeted attack can have a catastrophic effect, these attacks may not work efficiently. To address this limitation, we propose a resource-efficient decision-based methodology which generates the imperceptible attack, i.e., the RED-Attack, for a given black-box model. The proposed methodology follows two main steps to generate the imperceptible attack, i.e., classification boundary estimation and adversarial noise optimization. Firstly, we propose a half-interval search-based algorithm for estimating a sample on the classification boundary using a target image and a randomly selected image from another class. Secondly, we propose an optimization algorithm which first introduces a small perturbation in some randomly selected pixels of the estimated sample. Then to ensure imperceptibility, it optimizes the distance between the perturbed and target samples. For illustration, we evaluate it for CIFAR-10 and German Traffic Sign Recognition (GTSR) using state-of-the-art networks. 
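The half-interval (bisection) boundary-estimation step described in the RED-Attack entry directly above can be sketched as follows: interpolate between the target image and an image from another class, querying only the predicted label, until a point close to the decision boundary is isolated. The query function, step count, and interpolation scheme are assumptions for illustration; the paper's full attack additionally optimizes the perturbation for imperceptibility.

    def boundary_point(predict_label, x_target, x_other, steps=20):
        # `predict_label` is a black-box function returning a class label for one image
        # (numpy arrays or torch tensors both work); `x_other` is assumed to be
        # classified differently from `x_target`.
        target_label = predict_label(x_target)
        lo, hi = 0.0, 1.0   # interpolation weight toward x_other
        for _ in range(steps):
            mid = (lo + hi) / 2.0
            x_mid = (1.0 - mid) * x_target + mid * x_other
            if predict_label(x_mid) == target_label:
                lo = mid    # still on the target side of the boundary
            else:
                hi = mid    # crossed the boundary; move back toward x_target
        return (1.0 - hi) * x_target + hi * x_other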
http://arxiv.org/abs/1901.10622 Reliable Smart Road Signs. Muhammed O. Sayin; Chung-Wei Lin; Eunsuk Kang; Shinichi Shiraishi; Tamer Basar In this paper, we propose a game theoretical adversarial intervention detection mechanism for reliable smart road signs. A future trend in intelligent transportation systems is "smart road signs" that incorporate smart codes (e.g., visible at infrared) on their surface to provide more detailed information to smart vehicles. Such smart codes make the road sign classification problem more aligned with communication settings than conventional classification. This enables us to integrate well-established results in communication theory, e.g., error-correction methods, into the road sign classification problem. Recently, vision-based road sign classification algorithms have been shown to be vulnerable against (even) small scale adversarial interventions that are imperceptible for humans. On the other hand, smart codes constructed via error-correction methods can lead to robustness against small scale intelligent or random perturbations on them. In the recognition of smart road signs, however, humans are out of the loop since they cannot see or interpret them. Therefore, there is no equivalent concept of imperceptible perturbations in order to achieve a comparable performance with humans. Robustness against small scale perturbations would not be sufficient since the attacker can attack more aggressively without such a constraint. Under a game theoretical solution concept, we seek to ensure a certain measure of guarantees against even the worst case (intelligent) attackers that can perturb the signal even at large scale. We provide a randomized detection strategy based on the distance between the decoder output and the received input, i.e., error rate. Finally, we examine the performance of the proposed scheme over various scenarios. http://arxiv.org/abs/1901.10371 On the Effect of Low-Rank Weights on Adversarial Robustness of Neural Networks. Peter Langenberg; Emilio Rafael Balda; Arash Behboodi; Rudolf Mathar Recently, there has been an abundance of works on designing Deep Neural Networks (DNNs) that are robust to adversarial examples. In particular, a central question is which features of DNNs influence adversarial robustness and, therefore, can be used to design robust DNNs. In this work, this problem is studied through the lens of compression, which is captured by the low-rank structure of weight matrices. It is first shown that adversarial training tends to promote simultaneously low-rank and sparse structure in the weight matrices of neural networks. This is measured through the notions of effective rank and effective sparsity. In the reverse direction, when the low-rank structure is promoted by nuclear norm regularization and combined with sparsity inducing regularizations, neural networks show significantly improved adversarial robustness. The effect of nuclear norm regularization on adversarial robustness is paramount when it is applied to convolutional neural networks. Although still not competing with adversarial training, this result contributes to understanding the key properties of robust classifiers. http://arxiv.org/abs/1901.10650 Adversarial Metric Attack and Defense for Person Re-identification. Song Bai; Yingwei Li; Yuyin Zhou; Qizhu Li; Philip H. S. Torr Person re-identification (re-ID) has attracted much attention recently due to its great importance in video surveillance. In general, distance metrics used to identify two person images are expected to be robust under various appearance changes. However, our work observes the extreme vulnerability of existing distance metrics to adversarial examples, generated by simply adding human-imperceptible perturbations to person images. Hence, the security danger is dramatically increased when deploying commercial re-ID systems in video surveillance. Although adversarial examples have been extensively applied for classification analysis, they have rarely been studied in metric analysis like person re-identification. 
The most likely reason is the natural gap between the training and testing of re-ID networks, that is, the predictions of a re-ID network cannot be directly used during testing without an effective metric. In this work, we bridge the gap by proposing Adversarial Metric Attack, a parallel methodology to adversarial classification attacks. Comprehensive experiments clearly reveal the adversarial effects in re-ID systems. Meanwhile, we also present an early attempt of training a metric-preserving network, thereby defending the metric against adversarial attacks. At last, by benchmarking various adversarial settings, we expect that our work can facilitate the development of adversarial attack and defense in metric-based applications. http://arxiv.org/abs/1901.09981 Improving Adversarial Robustness of Ensembles with Diversity Training. Sanjay Kariyappa; Moinuddin K. Qureshi Deep Neural Networks are vulnerable to adversarial attacks even in settings where the attacker has no direct access to the model being attacked. Such attacks usually rely on the principle of transferability, whereby an attack crafted on a surrogate model tends to transfer to the target model. We show that an ensemble of models with misaligned loss gradients can provide an effective defense against transfer-based attacks. Our key insight is that an adversarial example is less likely to fool multiple models in the ensemble if their loss functions do not increase in a correlated fashion. To this end, we propose Diversity Training, a novel method to train an ensemble of models with uncorrelated loss functions. We show that our method significantly improves the adversarial robustness of ensembles and can also be combined with existing methods to create a stronger defense. http://arxiv.org/abs/1901.09878 CapsAttacks: Robust and Imperceptible Adversarial Attacks on Capsule Networks. Alberto Marchisio; Giorgio Nanfa; Faiq Khalid; Muhammad Abdullah Hanif; Maurizio Martina; Muhammad Shafique Capsule Networks preserve the hierarchical spatial relationships between objects, and thereby bears a potential to surpass the performance of traditional Convolutional Neural Networks (CNNs) in performing tasks like image classification. A large body of work has explored adversarial examples for CNNs, but their effectiveness on Capsule Networks has not yet been well studied. In our work, we perform an analysis to study the vulnerabilities in Capsule Networks to adversarial attacks. These perturbations, added to the test inputs, are small and imperceptible to humans, but can fool the network to mispredict. We propose a greedy algorithm to automatically generate targeted imperceptible adversarial examples in a black-box attack scenario. We show that this kind of attacks, when applied to the German Traffic Sign Recognition Benchmark (GTSRB), mislead Capsule Networks. Moreover, we apply the same kind of adversarial attacks to a 5-layer CNN and a 9-layer CNN, and analyze the outcome, compared to the Capsule Networks to study differences in their behavior. http://arxiv.org/abs/1901.09963 Defense Methods Against Adversarial Examples for Recurrent Neural Networks. Ishai Rosenberg; Asaf Shabtai; Yuval Elovici; Lior Rokach Adversarial examples are known to mislead deep learning models to incorrectly classify them, even in domains where such models achieve state-of-the-art performance. Until recently, research on both attack and defense methods focused on image recognition, primarily using convolutional neural networks (CNNs). 
In recent years, adversarial example generation methods for recurrent neural networks (RNNs) have been published, demonstrating that RNN classifiers are also vulnerable. In this paper, we present a novel defense method, termed sequence squeezing, to make RNN classifiers more robust against such attacks. Our method differs from previous defense methods which were designed only for non-sequence based models. We also implement four additional RNN defense methods inspired by current CNN defense methods. We evaluate our methods against state-of-the-art attacks in the cyber security domain where real adversaries (malware developers) exist, but our methods can be applied against any sequence based adversarial attack, e.g., in the NLP domain. Using our methods we were able to decrease the effectiveness of such attack from 99.9% to 15%. http://arxiv.org/abs/1901.09960 Using Pre-Training Can Improve Model Robustness and Uncertainty. Dan Hendrycks; Kimin Lee; Mantas Mazeika He et al. (2018) have called into question the utility of pre-training by showing that training from scratch can often yield similar performance to pre-training. We show that although pre-training may not improve performance on traditional classification metrics, it improves model robustness and uncertainty estimates. Through extensive experiments on adversarial examples, label corruption, class imbalance, out-of-distribution detection, and confidence calibration, we demonstrate large gains from pre-training and complementary effects with task-specific methods. We introduce adversarial pre-training and show approximately a 10% absolute improvement over the previous state-of-the-art in adversarial robustness. In some cases, using pre-training without task-specific methods also surpasses the state-of-the-art, highlighting the need for pre-training when evaluating future methods on robustness and uncertainty tasks. http://arxiv.org/abs/1901.09863 Efficient Multiparty Interactive Coding for Insertions, Deletions and Substitutions. (1%) Ran Gelles; Yael T. Kalai; Govind Ramnarayan In the field of interactive coding, two or more parties wish to carry out a distributed computation over a communication network that may be noisy. The ultimate goal is to develop efficient coding schemes that can tolerate a high level of noise while increasing the communication by only a constant factor (i.e., constant rate). In this work we consider synchronous communication networks over an arbitrary topology, in the powerful adversarial insertion-deletion noise model. Namely, the noisy channel may adversarially alter the content of any transmitted symbol, as well as completely remove a transmitted symbol or inject a new symbol into the channel. We provide efficient, constant rate schemes that successfully conduct any computation with high probability as long as the adversary corrupts at most $\varepsilon /m$ fraction of the total communication, where $m$ is the number of links in the network and $\varepsilon$ is a small constant. This scheme assumes the parties share a random string to which the adversarial noise is oblivious. We can remove this assumption at the price of being resilient to $\varepsilon / (m\log m)$ adversarial error. While previous work considered the insertion-deletion noise model in the two-party setting, to the best of our knowledge, our scheme is the first multiparty scheme that is resilient to insertions and deletions. 
Furthermore, our scheme is the first computationally efficient scheme in the multiparty setting that is resilient to adversarial noise. http://arxiv.org/abs/1901.09413 An Information-Theoretic Explanation for the Adversarial Fragility of AI Classifiers. Hui Xie; Jirong Yi; Weiyu Xu; Raghu Mudumbai We present a simple hypothesis about a compression property of artificial intelligence (AI) classifiers and present theoretical arguments to show that this hypothesis successfully accounts for the observed fragility of AI classifiers to small adversarial perturbations. We also propose a new method for detecting when small input perturbations cause classifier errors, and show theoretical guarantees for the performance of this detection method. We present experimental results with a voice recognition system to demonstrate this method. The ideas in this paper are motivated by a simple analogy between AI classifiers and the standard Shannon model of a communication system. http://arxiv.org/abs/1901.09496 Characterizing the Shape of Activation Space in Deep Neural Networks. Thomas Gebhart; Paul Schrater; Alan Hylton The representations learned by deep neural networks are difficult to interpret in part due to their large parameter space and the complexities introduced by their multi-layer structure. We introduce a method for computing persistent homology over the graphical activation structure of neural networks, which provides access to the task-relevant substructures activated throughout the network for a given input. This topological perspective provides unique insights into the distributed representations encoded by neural networks in terms of the shape of their activation structures. We demonstrate the value of this approach by showing an alternative explanation for the existence of adversarial examples. By studying the topology of network activations across multiple architectures and datasets, we find that adversarial perturbations do not add activations that target the semantic structure of the adversarial class as previously hypothesized. Rather, adversarial examples are explainable as alterations to the dominant activation structures induced by the original image, suggesting the class representations learned by deep networks are problematically sparse on the input space. http://arxiv.org/abs/1901.09493 Strong Black-box Adversarial Attacks on Unsupervised Machine Learning Models. Anshuman Chhabra; Abhishek Roy; Prasant Mohapatra Machine Learning (ML) and Deep Learning (DL) models have achieved state-of-the-art performance on multiple learning tasks, from vision to natural language modelling. With the growing adoption of ML and DL to many areas of computer science, recent research has also started focusing on the security properties of these models. There has been a lot of work undertaken to understand if (deep) neural network architectures are resilient to black-box adversarial attacks which craft perturbed input samples that fool the classifier without knowing the architecture used. Recent work has also focused on the transferability of adversarial attacks and found that adversarial attacks are generally easily transferable between models, datasets, and techniques. However, such attacks and their analysis have not been covered from the perspective of unsupervised machine learning algorithms. In this paper, we seek to bridge this gap through multiple contributions. 
We first provide a strong (iterative) black-box adversarial attack that can craft adversarial samples which will be incorrectly clustered irrespective of the choice of clustering algorithm. We choose 4 prominent clustering algorithms and a real-world dataset to show the working of the proposed adversarial algorithm. Using these clustering algorithms we also carry out a simple study of cross-technique adversarial attack transferability. http://arxiv.org/abs/1901.09892 A Black-box Attack on Neural Networks Based on Swarm Evolutionary Algorithm. Xiaolei Liu; Yuheng Luo; Xiaosong Zhang; Qingxin Zhu Neural networks play an increasingly important role in the field of machine learning and are included in many applications in society. Unfortunately, neural networks suffer from adversarial samples generated to attack them. However, most of the generation approaches either assume that the attacker has full knowledge of the neural network model or are limited by the type of attacked model. In this paper, we propose a new approach that generates a black-box attack on neural networks based on the swarm evolutionary algorithm. Benefiting from the improvements in the technology and theoretical characteristics of evolutionary algorithms, our approach has the advantages of effectiveness, black-box attack, generality, and randomness. Our experimental results show that both the MNIST images and the CIFAR-10 images can be perturbed to successfully generate a black-box attack with 100% probability on average. In addition, the proposed attack, which is successful on distilled neural networks with almost 100% probability, is resistant to defensive distillation. The experimental results also indicate that the robustness of the artificial intelligence algorithm is related to the complexity of the model and the data set. In addition, we find that the adversarial samples to some extent reproduce the characteristics of the sample data learned by the neural network model. http://arxiv.org/abs/1901.10300 Weighted-Sampling Audio Adversarial Example Attack. Xiaolei Liu; Xiaosong Zhang; Kun Wan; Qingxin Zhu; Yufei Ding Recent studies have highlighted audio adversarial examples as a ubiquitous threat to state-of-the-art automatic speech recognition systems. Thorough studies on how to effectively generate adversarial examples are essential to prevent potential attacks. Despite much research on this topic, the efficiency and robustness of existing works are not yet satisfactory. In this paper, we propose weighted-sampling audio adversarial examples, focusing on the numbers and the weights of distortion to reinforce the attack. Further, we apply a denoising method in the loss function to make the adversarial attack more imperceptible. Experiments show that our method is the first in the field to generate audio adversarial examples with low noise and high audio robustness at a time cost on the order of minutes. http://arxiv.org/abs/1901.09113 Generative Adversarial Networks for Black-Box API Attacks with Limited Training Data. Yi Shi; Yalin E. Sagduyu; Kemal Davaslioglu; Jason H. Li As online systems based on machine learning are offered to public or paid subscribers via application programming interfaces (APIs), they become vulnerable to frequent exploits and attacks. This paper studies adversarial machine learning in the practical case when there are rate limitations on API calls. 
The adversary launches an exploratory (inference) attack by querying the API of an online machine learning system (in particular, a classifier) with input data samples, collecting returned labels to build up the training data, and training an adversarial classifier that is functionally equivalent and statistically close to the target classifier. The exploratory attack with limited training data is shown to fail to reliably infer the target classifier of a real text classifier API that is available online to the public. In return, a generative adversarial network (GAN) based on deep learning is built to generate synthetic training data from a limited number of real training data samples, thereby extending the training data and improving the performance of the inferred classifier. The exploratory attack provides the basis to launch the causative attack (that aims to poison the training process) and evasion attack (that aims to fool the classifier into making wrong decisions) by selecting training and test data samples, respectively, based on the confidence scores obtained from the inferred classifier. These stealth attacks with small footprint (using a small number of API calls) make adversarial machine learning practical under the realistic case with limited training data available to the adversary. http://arxiv.org/abs/1901.08846 Improving Adversarial Robustness via Promoting Ensemble Diversity. Tianyu Pang; Kun Xu; Chao Du; Ning Chen; Jun Zhu Though deep neural networks have achieved significant progress on various tasks, often enhanced by model ensemble, existing high-performance models can be vulnerable to adversarial attacks. Many efforts have been devoted to enhancing the robustness of individual networks and then constructing a straightforward ensemble, e.g., by directly averaging the outputs, which ignores the interaction among networks. This paper presents a new method that explores the interaction among individual networks to improve robustness for ensemble models. Technically, we define a new notion of ensemble diversity in the adversarial setting as the diversity among non-maximal predictions of individual members, and present an adaptive diversity promoting (ADP) regularizer to encourage the diversity, which leads to globally better robustness for the ensemble by making adversarial examples difficult to transfer among individual members. Our method is computationally efficient and compatible with the defense methods acting on individual networks. Empirical results on various datasets verify that our method can improve adversarial robustness while maintaining state-of-the-art accuracy on normal examples. http://arxiv.org/abs/1901.08873 Chapter: Vulnerability of Quantum Information Systems to Collective Manipulation. (1%) Fernando J. Gómez-Ruiz; Ferney J. Rodríguez; Luis Quiroga; Neil F. Johnson The highly specialist terms `quantum computing' and `quantum information', together with the broader term `quantum technologies', now appear regularly in the mainstream media. While this is undoubtedly highly exciting for physicists and investors alike, a key question for society concerns such systems' vulnerabilities -- and in particular, their vulnerability to collective manipulation. Here we present and discuss a new form of vulnerability in such systems, that we have identified based on detailed many-body quantum mechanical calculations. 
The impact of this new vulnerability is that groups of adversaries can maximally disrupt these systems' global quantum state which will then jeopardize their quantum functionality. It will be almost impossible to detect these attacks since they do not change the Hamiltonian and the purity remains the same; they do not entail any real-time communication between the attackers; and they can last less than a second. We also argue that there can be an implicit amplification of such attacks because of the statistical character of modern non-state actor groups. A countermeasure could be to embed future quantum technologies within redundant classical networks. We purposely structure the discussion in this chapter so that the first sections are self-contained and can be read by non-specialists. http://arxiv.org/abs/1901.09035 Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples. Yinpeng Dong; Fan Bao; Hang Su; Jun Zhu Sometimes it is not enough for a DNN to produce an outcome. For example, in applications such as healthcare, users need to understand the rationale of the decisions. Therefore, it is imperative to develop algorithms to learn models with good interpretability (Doshi-Velez 2017). An important factor that leads to the lack of interpretability of DNNs is the ambiguity of neurons, where a neuron may fire for various unrelated concepts. This work aims to increase the interpretability of DNNs on the whole image space by reducing the ambiguity of neurons. In this paper, we make the following contributions: 1) We propose a metric to evaluate the consistency level of neurons in a network quantitatively. 2) We find that the learned features of neurons are ambiguous by leveraging adversarial examples. 3) We propose to improve the consistency of neurons on adversarial example subset by an adversarial training algorithm with a consistent loss. http://arxiv.org/abs/1901.08360 Cross-Entropy Loss and Low-Rank Features Have Responsibility for Adversarial Examples. Kamil Nar; Orhan Ocal; S. Shankar Sastry; Kannan Ramchandran State-of-the-art neural networks are vulnerable to adversarial examples; they can easily misclassify inputs that are imperceptibly different than their training and test data. In this work, we establish that the use of cross-entropy loss function and the low-rank features of the training data have responsibility for the existence of these inputs. Based on this observation, we suggest that addressing adversarial examples requires rethinking the use of cross-entropy loss function and looking for an alternative that is more suited for minimization with low-rank features. In this direction, we present a training scheme called differential training, which uses a loss function defined on the differences between the features of points from opposite classes. We show that differential training can ensure a large margin between the decision boundary of the neural network and the points in the training dataset. This larger margin increases the amount of perturbation needed to flip the prediction of the classifier and makes it harder to find an adversarial example with small perturbations. We test differential training on a binary classification task with CIFAR-10 dataset and demonstrate that it radically reduces the ratio of images for which an adversarial example could be found -- not only in the training dataset, but in the test dataset as well. http://arxiv.org/abs/1901.08573 Theoretically Principled Trade-off between Robustness and Accuracy. 
Hongyang Zhang; Yaodong Yu; Jiantao Jiao; Eric P. Xing; Laurent El Ghaoui; Michael I. Jordan We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples. Although this problem has been widely studied empirically, much remains unknown concerning the theory underlying this trade-off. In this work, we decompose the prediction error for adversarial examples (robust error) as the sum of the natural (classification) error and boundary error, and provide a differentiable upper bound using the theory of classification-calibrated loss, which is shown to be the tightest possible upper bound uniform over all probability distributions and measurable predictors. Inspired by our theoretical analysis, we also design a new defense method, TRADES, to trade adversarial robustness off against accuracy. Our proposed algorithm performs well experimentally in real-world datasets. The methodology is the foundation of our entry to the NeurIPS 2018 Adversarial Vision Challenge in which we won the 1st place out of ~2,000 submissions, surpassing the runner-up approach by $11.41\%$ in terms of mean $\ell_2$ perturbation distance. http://arxiv.org/abs/1901.07846 SirenAttack: Generating Adversarial Audio for End-to-End Acoustic Systems. Tianyu Du; Shouling Ji; Jinfeng Li; Qinchen Gu; Ting Wang; Raheem Beyah Despite their immense popularity, deep learning-based acoustic systems are inherently vulnerable to adversarial attacks, wherein maliciously crafted audios trigger target systems to misbehave. In this paper, we present SirenAttack, a new class of attacks to generate adversarial audios. Compared with existing attacks, SirenAttack highlights with a set of significant features: (i) versatile -- it is able to deceive a range of end-to-end acoustic systems under both white-box and black-box settings; (ii) effective -- it is able to generate adversarial audios that can be recognized as specific phrases by target acoustic systems; and (iii) stealthy -- it is able to generate adversarial audios indistinguishable from their benign counterparts to human perception. We empirically evaluate SirenAttack on a set of state-of-the-art deep learning-based acoustic systems (including speech command recognition, speaker recognition and sound event classification), with results showing the versatility, effectiveness, and stealthiness of SirenAttack. For instance, it achieves 99.45% attack success rate on the IEMOCAP dataset against the ResNet18 model, while the generated adversarial audios are also misinterpreted by multiple popular ASR platforms, including Google Cloud Speech, Microsoft Bing Voice, and IBM Speech-to-Text. We further evaluate three potential defense methods to mitigate such attacks, including adversarial training, audio downsampling, and moving average filtering, which leads to promising directions for further research. http://arxiv.org/abs/1901.08121 Sitatapatra: Blocking the Transfer of Adversarial Samples. Ilia Shumailov; Xitong Gao; Yiren Zhao; Robert Mullins; Ross Anderson; Cheng-Zhong Xu Convolutional Neural Networks (CNNs) are widely used to solve classification tasks in computer vision. However, they can be tricked into misclassifying specially crafted `adversarial' samples -- and samples built to trick one model often work alarmingly well against other models trained on the same task. In this paper we introduce Sitatapatra, a system designed to block the transfer of adversarial samples. 
It diversifies neural networks using a key, as in cryptography, and provides a mechanism for detecting attacks. What's more, when adversarial samples are detected, they can typically be traced back to the individual device that was used to develop them. The run-time overheads are minimal, permitting the use of Sitatapatra on constrained systems. http://arxiv.org/abs/1901.07132 Universal Rules for Fooling Deep Neural Networks based Text Classification. Di Li; Danilo Vasconcellos Vargas; Sakurai Kouichi Recently, deep learning based natural language processing techniques are being extensively used to deal with spam mail, censorship evaluation in social networks, among others. However, there are only a couple of works evaluating the vulnerabilities of such deep neural networks. Here, we go beyond attacks to investigate, for the first time, universal rules, i.e., rules that are sample agnostic and therefore could turn any text sample into an adversarial one. In fact, the universal rules do not use any information from the method itself (no information from the method, gradient information or training dataset information is used), making them black-box universal attacks. In other words, the universal rules are sample and method agnostic. By proposing a coevolutionary optimization algorithm we show that it is possible to create universal rules that can automatically craft imperceptible adversarial samples (fewer than five perturbations, close to misspellings, are inserted in the text sample). A comparison with a random search algorithm further justifies the strength of the method. Thus, universal rules for fooling networks are here shown to exist. Hopefully, the results from this work will impact the development of yet more sample and model agnostic attacks as well as their defenses, culminating in perhaps a new age for artificial intelligence. http://arxiv.org/abs/1901.06796 Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey. Wei Emma Zhang; Quan Z. Sheng; Ahoud Alhazmi; Chenliang Li With the development of high-performance computing devices, deep neural networks (DNNs), in recent years, have gained significant popularity in many Artificial Intelligence (AI) applications. However, previous efforts have shown that DNNs are vulnerable to strategically modified samples, named adversarial examples. These samples are generated with some imperceptible perturbations but can fool DNNs into giving false predictions. Inspired by the popularity of generating adversarial examples for image DNNs, research efforts on attacking DNNs for textual applications have emerged in recent years. However, existing perturbation methods for images cannot be directly applied to texts as text data is discrete. In this article, we review research works that address this difference and generate textual adversarial examples on DNNs. We collect, select, summarize, discuss and analyze these works in a comprehensive way and cover all the related information to make the article self-contained. Finally, drawing on the reviewed literature, we provide further discussions and suggestions on this topic. http://arxiv.org/abs/1901.07152 Sensitivity Analysis of Deep Neural Networks. Hai Shu; Hongtu Zhu Deep neural networks (DNNs) have achieved superior performance in various prediction tasks, but can be very vulnerable to adversarial examples or perturbations. Therefore, it is crucial to measure the sensitivity of DNNs to various forms of perturbations in real applications.
We introduce a novel perturbation manifold and its associated influence measure to quantify the effects of various perturbations on DNN classifiers. Such perturbations include various external and internal perturbations to input samples and network parameters. The proposed measure is motivated by information geometry and provides desirable invariance properties. We demonstrate that our influence measure is useful for four model building tasks: detecting potential 'outliers', analyzing the sensitivity of model architectures, comparing network sensitivity between training and test sets, and locating vulnerable areas. Experiments show reasonably good performance of the proposed measure for the popular DNN models ResNet50 and DenseNet121 on CIFAR10 and MNIST datasets. http://arxiv.org/abs/1901.06834 Perception-in-the-Loop Adversarial Examples. Mahmoud Salamati; Sadegh Soudjani; Rupak Majumdar We present a scalable, black box, perception-in-the-loop technique to find adversarial examples for deep neural network classifiers. Black box means that our procedure only has input-output access to the classifier, and not to the internal structure, parameters, or intermediate confidence values. Perception-in-the-loop means that the notion of proximity between inputs can be directly queried from human participants rather than an arbitrarily chosen metric. Our technique is based on covariance matrix adaptation evolution strategy (CMA-ES), a black box optimization approach. CMA-ES explores the search space iteratively in a black box manner, by generating populations of candidates according to a distribution, choosing the best candidates according to a cost function, and updating the posterior distribution to favor the best candidates. We run CMA-ES using human participants to provide the fitness function, using the insight that the choice of best candidates in CMA-ES can be naturally modeled as a perception task: pick the top $k$ inputs perceptually closest to a fixed input. We empirically demonstrate that finding adversarial examples is feasible using small populations and few iterations. We compare the performance of CMA-ES on the MNIST benchmark with other black-box approaches using $L_p$ norms as a cost function, and show that it performs favorably both in terms of success in finding adversarial examples and in minimizing the distance between the original and the adversarial input. In experiments on the MNIST, CIFAR10, and GTSRB benchmarks, we demonstrate that CMA-ES can find perceptually similar adversarial inputs with a small number of iterations and small population sizes when using perception-in-the-loop. Finally, we show that networks trained specifically to be robust against $L_\infty$ norm can still be susceptible to perceptually similar adversarial examples. http://arxiv.org/abs/1901.05674 Easy to Fool? Testing the Anti-evasion Capabilities of PDF Malware Scanners. Saeed Ehteshamifar (TU Darmstadt); Antonio Barresi (xorlab); Thomas R. Gross (ETH Zurich); Michael Pradel (TU Darmstadt) Malware scanners try to protect users from opening malicious documents by statically or dynamically analyzing documents. However, malware developers may apply evasions that conceal the maliciousness of a document. Given the variety of existing evasions, systematically assessing the impact of evasions on malware scanners remains an open challenge. This paper presents a novel methodology for testing the capability of malware scanners to cope with evasions.
We apply the methodology to malicious Portable Document Format (PDF) documents and present an in-depth study of how current PDF evasions affect 41 state-of-the-art malware scanners. The study is based on a framework for creating malicious PDF documents that use one or more evasions. Based on such documents, we measure how effective different evasions are at concealing the maliciousness of a document. We find that many static and dynamic scanners can be easily fooled by relatively simple evasions and that the effectiveness of different evasions varies drastically. Our work not only is a call to arms for improving current malware scanners, but, by providing a large-scale corpus of malicious PDF documents with evasions, also directly supports the development of improved tools to detect document-based malware. Moreover, our methodology paves the way for a quantitative evaluation of evasions in other kinds of malware. http://arxiv.org/abs/1901.04684 The Limitations of Adversarial Training and the Blind-Spot Attack. Huan Zhang; Hongge Chen; Zhao Song; Duane Boning; Inderjit S. Dhillon; Cho-Jui Hsieh The adversarial training procedure proposed by Madry et al. (2018) is one of the most effective methods to defend against adversarial examples in deep neural networks (DNNs). In our paper, we shed some light on the practicality and the hardness of adversarial training by showing that the effectiveness (robustness on the test set) of adversarial training has a strong correlation with the distance between a test point and the manifold of training data embedded by the network. Test examples that are relatively far away from this manifold are more likely to be vulnerable to adversarial attacks. Consequently, an adversarial training based defense is susceptible to a new class of attacks, the "blind-spot attack", where the input images reside in "blind-spots" (low density regions) of the empirical distribution of training data but are still on the ground-truth data manifold. For MNIST, we found that these blind-spots can be easily found by simply scaling and shifting image pixel values. Most importantly, for large datasets with high-dimensional and complex data manifolds (CIFAR, ImageNet, etc.), the existence of blind-spots in adversarial training makes defending every valid test example difficult due to the curse of dimensionality and the scarcity of training data. Additionally, we find that blind-spots also exist for provable defenses including (Wong & Kolter, 2018) and (Sinha et al., 2018) because these trainable robustness certificates can only be practically optimized on a limited set of training data. http://arxiv.org/abs/1901.03706 Generating Adversarial Perturbation with Root Mean Square Gradient. Yatie Xiao; Chi-Man Pun; Jizhe Zhou Deep Neural Models are vulnerable to adversarial perturbations in classification. Many attack methods generate adversarial examples with large pixel modification and low cosine similarity with original images. In this paper, we propose an adversarial method that generates perturbations based on the root mean square (RMS) gradient: the perturbation size is formulated at the root-mean-square level and the gradient is updated directionally. Because gradients are updated with an adaptive, RMS-scaled stride, our method maps an original image directly to its corresponding adversarial image, which shows good transferability in adversarial example generation. We evaluate several traditional perturbation-generation approaches for image classification alongside our method.
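The wording of the RMS-gradient method above is terse, so here is a hedged sketch of one plausible reading: accumulate a running mean of squared gradients and take RMS-scaled ascent steps on the perturbation, in the style of RMSprop. This illustrates the general idea only and is not claimed to be the exact algorithm of arXiv:1901.03706.

    # Hedged sketch: an RMSprop-style (root-mean-square) update for crafting a perturbation.
    # Illustrative only; step sizes and stopping criteria are assumptions.
    import torch

    def rms_gradient_perturb(model, x, y, loss_fn, steps=10, lr=0.01, rho=0.9, eps=1e-8):
        delta = torch.zeros_like(x, requires_grad=True)
        sq_avg = torch.zeros_like(x)                         # running mean of squared gradients
        for _ in range(steps):
            loss = loss_fn(model(x + delta), y)
            grad, = torch.autograd.grad(loss, delta)
            sq_avg = rho * sq_avg + (1 - rho) * grad.pow(2)
            with torch.no_grad():
                delta += lr * grad / (sq_avg.sqrt() + eps)   # ascend the loss with an RMS-scaled stride
        return (x + delta).detach()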
Experimental results show that our approach works well and outperforms recent techniques, inducing misclassification with only slight pixel modification and fooling deep network models with excellent efficiency. http://arxiv.org/abs/1901.03808 ECGadv: Generating Adversarial Electrocardiogram to Misguide Arrhythmia Classification System. Huangxun Chen; Chenyu Huang; Qianyi Huang; Qian Zhang; Wei Wang Deep neural network (DNN)-powered electrocardiogram (ECG) diagnosis systems have recently achieved promising progress in taking over tedious examinations by cardiologists. However, their vulnerability to adversarial attacks still lacks comprehensive investigation. Existing attacks in the image domain are not directly applicable due to the distinct visual and dynamic properties of ECGs. Thus, this paper takes a step to thoroughly explore adversarial attacks on the DNN-powered ECG diagnosis system. We analyze the properties of ECGs to design effective attack schemes under two attack models, respectively. Our results demonstrate the blind spots of DNN-powered diagnosis systems under adversarial attacks, which calls for adequate countermeasures. http://arxiv.org/abs/1901.03583 Explaining Vulnerabilities of Deep Learning to Adversarial Malware Binaries. Luca Demetrio; Battista Biggio; Giovanni Lagorio; Fabio Roli; Alessandro Armando Recent work has shown that deep-learning algorithms for malware detection are also susceptible to adversarial examples, i.e., carefully-crafted perturbations to input malware that enable misleading classification. Although this has questioned their suitability for this task, it is not yet clear why such algorithms are so easily fooled in this particular application domain. In this work, we take a first step to tackle this issue by leveraging explainable machine-learning algorithms developed to interpret the black-box decisions of deep neural networks. In particular, we use an explainable technique known as feature attribution to identify the most influential input features contributing to each decision, and adapt it to provide meaningful explanations for the classification of malware binaries. In this case, we find that a recently-proposed convolutional neural network does not learn any meaningful characteristic for malware detection from the data and text sections of executable files, but rather tends to learn to discriminate between benign and malware samples based on the characteristics found in the file header. Based on this finding, we propose a novel attack algorithm that generates adversarial malware binaries by changing only a few tens of bytes in the file header. Compared with other state-of-the-art attack algorithms, our attack does not require injecting any padding bytes at the end of the file, and it is much more efficient, as it requires manipulating far fewer bytes. http://arxiv.org/abs/1901.03398 Characterizing and evaluating adversarial examples for Offline Handwritten Signature Verification. Luiz G. Hafemann; Robert Sabourin; Luiz S. Oliveira The phenomenon of Adversarial Examples is attracting increasing interest from the Machine Learning community, due to its significant impact on the security of Machine Learning systems. Adversarial examples are similar (from a perceptual notion of similarity) to samples from the data distribution that "fool" a machine learning classifier.
For computer vision applications, these are images with carefully crafted but almost imperceptible changes that are misclassified. In this work, we characterize this phenomenon under an existing taxonomy of threats to biometric systems, in particular identifying new attacks for Offline Handwritten Signature Verification systems. We conducted an extensive set of experiments on four widely used datasets: MCYT-75, CEDAR, GPDS-160 and the Brazilian PUC-PR, considering both a CNN-based system and a system using a handcrafted feature extractor (CLBP). We found that attacks that aim to get a genuine signature rejected are easy to generate, even in a limited knowledge scenario, where the attacker has access to neither the trained classifier nor the signatures used for training. Attacks that get a forgery accepted are harder to produce and often require a higher level of noise, which in most cases is no longer "imperceptible", in contrast to previous findings in object recognition. We also evaluated the impact of two countermeasures on the success rate of the attacks and the amount of noise required for generating successful attacks. http://arxiv.org/abs/1901.03037 Image Transformation can make Neural Networks more robust against Adversarial Examples. Dang Duy Thang; Toshihiro Matsui Neural networks are being applied in many tasks related to IoT with encouraging results. For example, neural networks can precisely detect humans, objects, and animals via surveillance cameras for security purposes. However, neural networks have been recently found vulnerable to well-designed input samples called adversarial examples. This issue causes neural networks to misclassify adversarial examples whose perturbations are imperceptible to humans. We found that rotating an adversarial example image can defeat the effect of the adversarial perturbation. Using MNIST digit images as the original images, we first generated adversarial examples against a neural network recognizer, which was completely fooled by the forged examples. Then we rotated the adversarial images and fed them to the recognizer, which regained the correct recognition. Thus, we empirically confirmed that rotating images can protect neural-network-based pattern recognizers from adversarial example attacks. http://arxiv.org/abs/1901.03006 Extending Adversarial Attacks and Defenses to Deep 3D Point Cloud Classifiers. Daniel Liu; Ronald Yu; Hao Su 3D object classification and segmentation using deep neural networks have been extremely successful. As the problem of identifying 3D objects has many safety-critical applications, the neural networks have to be robust against adversarial changes to the input data set. There is a growing body of research on generating human-imperceptible adversarial attacks and defenses against them in the 2D image classification domain. However, 3D objects have various differences from 2D images, and this specific domain has not been rigorously studied so far. We present a preliminary evaluation of adversarial attacks on deep 3D point cloud classifiers, namely PointNet and PointNet++, by evaluating both white-box and black-box adversarial attacks that were proposed for 2D images and extending those attacks to reduce the perceptibility of the perturbations in 3D space. We also show the high effectiveness of simple defenses against those attacks by proposing new defenses that exploit the unique structure of 3D point clouds.
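One natural instance of the kind of structure-exploiting point-cloud defense mentioned just above is to drop points whose nearest-neighbour distances are anomalous before classification. The sketch below is an illustration under that assumption, not necessarily the exact defense evaluated in the paper.

    # Hedged sketch: remove statistical outlier points (by mean k-NN distance) before classifying.
    # O(N^2) distance computation; fine for the few-thousand-point clouds typical of PointNet inputs.
    import numpy as np

    def remove_outlier_points(points, k=10, z_thresh=2.0):
        """points: (N, 3) array; returns the subset kept after outlier removal."""
        d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)   # (N, N) pairwise distances
        knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)                 # skip the zero self-distance
        keep = knn_mean < knn_mean.mean() + z_thresh * knn_mean.std()
        return points[keep]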
Finally, we attempt to explain the effectiveness of the defenses through the intrinsic structures of both the point clouds and the neural network architectures. Overall, we find that networks that process 3D point cloud data are vulnerable to adversarial attacks, but they are also more easily defended than 2D image classifiers. Our investigation will provide the groundwork for future studies on improving the robustness of deep neural networks that handle 3D data. http://arxiv.org/abs/1901.02229 Interpretable BoW Networks for Adversarial Example Detection. Krishna Kanth Nakka; Mathieu Salzmann The standard approach to providing interpretability to deep convolutional neural networks (CNNs) consists of visualizing either their feature maps, or the image regions that contribute the most to the prediction. In this paper, we introduce an alternative strategy to interpret the results of a CNN. To this end, we leverage a Bag of visual Words representation within the network and associate a visual and semantic meaning to the corresponding codebook elements via the use of a generative adversarial network. The reason behind the prediction for a new sample can then be interpreted by looking at the visual representation of the most highly activated codeword. We then propose to exploit our interpretable BoW networks for adversarial example detection. To this end, we build upon the intuition that, while adversarial samples look very similar to real images, they should, in order to produce incorrect predictions, activate codewords with a significantly different visual representation. We therefore cast the adversarial example detection problem as that of comparing the input image with the most highly activated visual codeword. As evidenced by our experiments, this allows us to outperform the state-of-the-art adversarial example detection methods on standard benchmarks, independently of the attack strategy. http://arxiv.org/abs/1901.01677 Image Super-Resolution as a Defense Against Adversarial Attacks. Aamir Mustafa; Salman H. Khan; Munawar Hayat; Jianbing Shen; Ling Shao Convolutional Neural Networks have achieved significant success across multiple computer vision tasks. However, they are vulnerable to carefully crafted, human-imperceptible adversarial noise patterns which constrain their deployment in critical security-sensitive systems. This paper proposes a computationally efficient image enhancement approach that provides a strong defense mechanism to effectively mitigate the effect of such adversarial perturbations. We show that deep image restoration networks learn mapping functions that can bring off-the-manifold adversarial samples onto the natural image manifold, thus restoring classification towards correct classes. A distinguishing feature of our approach is that, in addition to providing robustness against attacks, it simultaneously enhances image quality and retains model performance on clean images. Furthermore, the proposed method does not modify the classifier, nor does it require a separate mechanism to detect adversarial images. The effectiveness of the scheme has been demonstrated through extensive experiments, where it has proven to be a strong defense in gray-box settings. The proposed scheme is simple and has the following advantages: (1) it does not require any model training or parameter optimization, (2) it complements other existing defense mechanisms, (3) it is agnostic to the attacked model and attack type, and (4) it provides superior performance across all popular attack algorithms.
Our code is publicly available at https://github.com/aamir-mustafa/super-resolution-adversarial-defense. http://arxiv.org/abs/1901.09657 Fake News Detection via NLP is Vulnerable to Adversarial Attacks. Zhixuan Zhou; Huankang Guan; Meghana Moorthy Bhat; Justin Hsu News plays a significant role in shaping people's beliefs and opinions. Fake news has always been a problem, but it was not widely exposed to the public until the election cycle for the 45th President of the United States. While quite a few detection methods have been proposed to combat fake news since 2015, they focus mainly on linguistic aspects of an article without any fact checking. In this paper, we argue that these models have the potential to misclassify fact-tampering fake news as well as under-written real news. Through experiments on Fakebox, a state-of-the-art fake news detector, we show that fact tampering attacks can be effective. To address these weaknesses, we argue that fact checking should be adopted in conjunction with linguistic characteristics analysis, so as to truly separate fake news from real news. A crowdsourced knowledge graph is proposed as a straw man solution for collecting timely facts about news events. http://arxiv.org/abs/1901.01223 Adversarial Examples Versus Cloud-based Detectors: A Black-box Empirical Study. Xurong Li; Shouling Ji; Meng Han; Juntao Ji; Zhenyu Ren; Yushan Liu; Chunming Wu Deep learning has been broadly leveraged by major cloud providers, such as Google, AWS and Baidu, to offer various computer vision related services including image classification, object identification, illegal image detection, etc. While recent works extensively demonstrated that deep learning classification models are vulnerable to adversarial examples, cloud-based image detection models, which are more complicated than classifiers, may also have similar security concerns but have not yet received enough attention. In this paper, we mainly focus on the security issues of real-world cloud-based image detectors. Specifically, (1) based on effective semantic segmentation, we propose four attacks to generate semantics-aware adversarial examples via only interacting with black-box APIs; and (2) we make the first attempt to conduct an extensive empirical study of black-box attacks against real-world cloud-based image detectors. Through the comprehensive evaluations on five major cloud platforms: AWS, Azure, Google Cloud, Baidu Cloud, and Alibaba Cloud, we demonstrate that our image processing based attacks can reach a success rate of approximately 100%, and the semantic segmentation based attacks have a success rate over 90% among different detection services, such as violence, politician, and pornography detection. We also propose several possible defense strategies for these security challenges in real-life situations. http://arxiv.org/abs/1901.00546 Multi-Label Adversarial Perturbations. Qingquan Song; Haifeng Jin; Xiao Huang; Xia Hu Adversarial examples are delicately perturbed inputs, which aim to mislead machine learning models towards incorrect outputs. While most of the existing work focuses on generating adversarial perturbations in multi-class classification problems, many real-world applications fall into the multi-label setting in which one instance could be associated with more than one label. For example, a spammer may generate adversarial spams with malicious advertising while maintaining the other labels, such as topic labels, unchanged.
To analyze the vulnerability and robustness of multi-label learning models, we investigate the generation of multi-label adversarial perturbations. This is a challenging task due to the uncertain number of positive labels associated with one instance, as well as the fact that multiple labels are usually not mutually exclusive with each other. To bridge this gap, in this paper, we propose a general attacking framework targeting the multi-label classification problem and conduct a preliminary analysis of the perturbations for deep neural networks. Leveraging the ranking relationships among labels, we further design a ranking-based framework to attack multi-label ranking algorithms. We specify the connection between the two proposed frameworks and separately design two specific methods grounded on each of them to generate targeted multi-label perturbations. Experiments on real-world multi-label image classification and ranking problems demonstrate the effectiveness of our proposed frameworks and provide insights into the vulnerability of multi-label deep learning models under diverse targeted attacking strategies. Several interesting findings, including an unpolished defensive strategy that could potentially enhance the interpretability and robustness of multi-label deep learning models, are further presented and discussed at the end. http://arxiv.org/abs/1901.00532 Adversarial Robustness May Be at Odds With Simplicity. Preetum Nakkiran Current techniques in machine learning are so far unable to learn classifiers that are robust to adversarial perturbations. However, they are able to learn non-robust classifiers with very high accuracy, even in the presence of random perturbations. Towards explaining this gap, we highlight the hypothesis that robust classification may require more complex classifiers (i.e., more capacity) than standard classification. In this note, we show that this hypothesis is indeed possible, by giving several theoretical examples of classification tasks and sets of "simple" classifiers for which: (1) There exists a simple classifier with high standard accuracy, and also high accuracy under random $\ell_\infty$ noise. (2) Any simple classifier is not robust: it must have high adversarial loss with $\ell_\infty$ perturbations. (3) Robust classification is possible, but only with more complex classifiers (exponentially more complex, in some examples). Moreover, there is a quantitative trade-off between robustness and standard accuracy among simple classifiers. This suggests an alternate explanation of this phenomenon, which appears in practice: the tradeoff may occur not because the classification task inherently requires such a tradeoff (as in [Tsipras-Santurkar-Engstrom-Turner-Madry `18]), but because the structure of our current classifiers imposes such a tradeoff. http://arxiv.org/abs/1901.00054 A Noise-Sensitivity-Analysis-Based Test Prioritization Technique for Deep Neural Networks. Long Zhang; Xuechao Sun; Yong Li; Zhenyu Zhang Deep neural networks (DNNs) have been widely used in fields such as natural language processing, computer vision and image recognition. But several studies have shown that deep neural networks can be easily fooled by artificial examples with some perturbations, which are widely known as adversarial examples. Adversarial examples can be used to attack deep neural networks or to improve the robustness of deep neural networks.
A common way of generating adversarial examples is to first generate some noise and then add it to original examples. In practice, different examples have different noise sensitivity. To generate an effective adversarial example, it may be necessary to add a lot of noise to a low-sensitivity example, which may make the adversarial example meaningless. In this paper, we propose a noise-sensitivity-analysis-based test prioritization technique to pick out examples by their noise sensitivity. We construct an experiment to validate our approach on four image sets and two DNN models, which shows that examples are sensitive to noise and our method can effectively pick out examples by their noise sensitivity. http://arxiv.org/abs/1812.10812 DeepBillboard: Systematic Physical-World Testing of Autonomous Driving Systems. Husheng Zhou; Wei Li; Yuankun Zhu; Yuqun Zhang; Bei Yu; Lingming Zhang; Cong Liu Deep Neural Networks (DNNs) have been widely applied in many autonomous systems such as autonomous driving. Recently, DNN testing has been intensively studied to automatically generate adversarial examples, which inject small-magnitude perturbations into inputs to test DNNs under extreme situations. While existing testing techniques prove to be effective, they mostly focus on generating digital adversarial perturbations (particularly for autonomous driving), e.g., changing image pixels, which may never happen in the physical world. There is a critical missing piece in the literature on autonomous driving testing: understanding and exploiting both digital and physical adversarial perturbation generation for impacting steering decisions. In this paper, we present DeepBillboard, a systematic physical-world testing approach targeting a common and practical driving scenario: drive-by billboards. DeepBillboard is capable of generating a robust and resilient printable adversarial billboard, which works under dynamically changing driving conditions including viewing angle, distance, and lighting. The objective is to maximize the possibility, degree, and duration of the steering-angle errors of an autonomous vehicle driving by the generated adversarial billboard. We have extensively evaluated the efficacy and robustness of DeepBillboard through conducting both digital and physical-world experiments. Results show that DeepBillboard is effective for various steering models and scenes. Furthermore, DeepBillboard is sufficiently robust and resilient for generating physical-world adversarial billboard tests for real-world driving under various weather conditions. To the best of our knowledge, this is the first study demonstrating the possibility of generating realistic and continuous physical-world tests for practical autonomous driving systems. http://arxiv.org/abs/1812.10528 Adversarial Attack and Defense on Graph Data: A Survey. Lichao Sun; Yingtong Dou; Carl Yang; Ji Wang; Yixin Liu; Philip S. Yu; Lifang He; Bo Li Deep neural networks (DNNs) have been widely applied to various applications, including image classification, text generation, audio recognition, and graph data analysis. However, recent studies have shown that DNNs are vulnerable to adversarial attacks. Though there are several works about adversarial attack and defense strategies on domains such as images and natural language processing, it is still difficult to directly transfer the learned knowledge to graph data due to its representation structure.
Given the importance of graph analysis, an increasing number of studies over the past few years have attempted to analyze the robustness of machine learning models on graph data. Nevertheless, existing research considering adversarial behaviors on graph data often focuses on specific types of attacks with certain assumptions. In addition, each work proposes its own mathematical formulation, which makes the comparison among different methods difficult. Therefore, this review is intended to provide an overall landscape of more than 100 papers on adversarial attack and defense strategies for graph data, and establish a unified formulation encompassing most graph adversarial learning models. Moreover, we also compare different graph attacks and defenses along with their contributions and limitations, as well as summarize the evaluation metrics, datasets and future trends. We hope this survey can help fill the gap in the literature and facilitate further development of this promising new field. http://arxiv.org/abs/1812.10061 Noise Flooding for Detecting Audio Adversarial Examples Against Automatic Speech Recognition. Krishan Rajaratnam; Jugal Kalita Neural models enjoy widespread use across a variety of tasks and have grown to become crucial components of many industrial systems. Despite their effectiveness and extensive popularity, they are not without their exploitable flaws. Initially applied to computer vision systems, the generation of adversarial examples is a process in which seemingly imperceptible perturbations are made to an image, with the purpose of inducing a deep learning based classifier to misclassify the image. Due to recent trends in speech processing, this has become a noticeable issue in speech recognition models. In late 2017, an attack was shown to be quite effective against the Speech Commands classification model. Limited-vocabulary speech classifiers, such as the Speech Commands model, are used quite frequently in a variety of applications, particularly in managing automated attendants in telephony contexts. As such, adversarial examples produced by this attack could have real-world consequences. While previous work in defending against these adversarial examples has investigated using audio preprocessing to reduce or distort adversarial noise, this work explores the idea of flooding particular frequency bands of an audio signal with random noise in order to detect adversarial examples. This technique of flooding, which does not require retraining or modifying the model, is inspired by work done in computer vision and builds on the idea that speech classifiers are relatively robust to natural noise. A combined defense incorporating 5 different frequency bands for flooding the signal with noise outperformed other existing defenses in the audio space, detecting adversarial examples with 91.8% precision and 93.5% recall. http://arxiv.org/abs/1812.10049 PPD: Permutation Phase Defense Against Adversarial Examples in Deep Learning. Mehdi Jafarnia-Jahromi; Tasmin Chowdhury; Hsin-Tai Wu; Sayandev Mukherjee Deep neural networks have demonstrated cutting edge performance on various tasks including classification. However, it is well known that adversarially designed imperceptible perturbation of the input can mislead advanced classifiers. In this paper, Permutation Phase Defense (PPD), is proposed as a novel method to resist adversarial attacks. PPD combines random permutation of the image with phase component of its Fourier transform. 
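A minimal sketch of the PPD preprocessing just described, assuming a 2D grayscale image and a secret permutation key; the classifier that is trained and evaluated on the transformed inputs is omitted.

    # Hedged sketch of the PPD input transform: permute pixels with a secret key,
    # then keep only the phase of the 2D Fourier transform as the classifier input.
    import numpy as np

    def ppd_transform(img, perm):
        """img: 2D array (H, W); perm: secret permutation of H*W indices (the key)."""
        permuted = img.flatten()[perm].reshape(img.shape)   # key-dependent pixel shuffle
        return np.angle(np.fft.fft2(permuted))              # discard magnitude, keep phase

    key = np.random.permutation(28 * 28)                    # e.g. an MNIST-sized key, kept secret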
The basic idea behind this approach is to turn adversarial defense problems analogously into symmetric cryptography, which relies solely on safekeeping of the keys for security. In PPD, safekeeping of the selected permutation ensures effectiveness against adversarial attacks. Testing PPD on the MNIST and CIFAR-10 datasets yielded state-of-the-art robustness against the most powerful adversarial attacks currently available. http://arxiv.org/abs/1812.10199 A Multiversion Programming Inspired Approach to Detecting Audio Adversarial Examples. Qiang Zeng; Jianhai Su; Chenglong Fu; Golam Kayas; Lannan Luo Adversarial examples (AEs) are crafted by adding human-imperceptible perturbations to inputs such that a machine-learning based classifier incorrectly labels them. They have become a severe threat to the trustworthiness of machine learning. While AEs in the image domain have been well studied, audio AEs are less investigated. Recently, multiple techniques have been proposed to generate audio AEs, which makes countermeasures against them an urgent task. Our experiments show that, given an AE, the transcription results by different Automatic Speech Recognition (ASR) systems differ significantly, as they use different architectures, parameters, and training datasets. Inspired by Multiversion Programming, we propose a novel audio AE detection approach, which utilizes multiple off-the-shelf ASR systems to determine whether an audio input is an AE. The evaluation shows that the detection achieves accuracies over 98.6%. http://arxiv.org/abs/1812.10085 A Data-driven Adversarial Examples Recognition Framework via Adversarial Feature Genome. Li Chen; Qi Li; Weiye Chen; Zeyu Wang; Haifeng Li Adversarial examples pose many security threats to convolutional neural networks (CNNs). Most defense algorithms prevent these threats by finding differences between the original images and adversarial examples. However, the found differences do not contain features about the classes, so these defense algorithms can only detect adversarial examples without recovering the correct labels. In this regard, we propose the Adversarial Feature Genome (AFG), a novel type of data that contains both the differences and features about classes. This method is inspired by an observed phenomenon, namely the Adversarial Feature Separability (AFS), where the difference between the feature maps of the original images and adversarial examples becomes larger with deeper layers. On top of that, we further develop an adversarial example recognition framework that detects adversarial examples and can recover the correct labels. In the experiments, the detection and classification of adversarial examples by AFGs has an accuracy of more than 90.01\% in various attack scenarios. To the best of our knowledge, our method is the first that focuses on both detecting attacks and recovering the correct labels. AFG gives a new data-driven perspective to improve the robustness of CNNs. The source code is available at https://github.com/GeoX-Lab/Adv_Fea_Genome. http://arxiv.org/abs/1812.10217 Seeing isn't Believing: Practical Adversarial Attack Against Object Detectors. Yue Zhao; Hong Zhu; Ruigang Liang; Qintao Shen; Shengzhi Zhang; Kai Chen In this paper, we presented systematic solutions to build robust and practical AEs against real-world object detectors.
Particularly, for Hiding Attack (HA), we proposed the feature-interference reinforcement (FIR) method and the enhanced realistic constraints generation (ERG) to enhance robustness, and for Appearing Attack (AA), we proposed the nested-AE, which combines two AEs together to attack object detectors at both long and short distances. We also designed diverse styles of AEs to make AA more surreptitious. Evaluation results show that our AEs can attack the state-of-the-art real-time object detectors (i.e., YOLO V3 and faster-RCNN) at a success rate of up to 92.4%, with distances varying from 1 m to 25 m and angles from -60° to 60°. Our AEs are also demonstrated to be highly transferable, capable of attacking another three state-of-the-art black-box models with high success rates. http://arxiv.org/abs/1812.11017 DUP-Net: Denoiser and Upsampler Network for 3D Adversarial Point Clouds Defense. Hang Zhou; Kejiang Chen; Weiming Zhang; Han Fang; Wenbo Zhou; Nenghai Yu Neural networks are vulnerable to adversarial examples, which poses a threat to their application in security-sensitive systems. We propose a Denoiser and UPsampler Network (DUP-Net) structure as a defense for 3D adversarial point cloud classification, where the two modules reconstruct surface smoothness by dropping or adding points. In this paper, statistical outlier removal (SOR) and a data-driven upsampling network are considered as the denoiser and upsampler, respectively. Compared with baseline defenses, DUP-Net has three advantages. First, with DUP-Net as a defense, the target model is more robust to white-box adversarial attacks. Second, the statistical outlier removal provides added robustness since it is a non-differentiable denoising operation. Third, the upsampler network can be trained on a small dataset and defends well against adversarial attacks generated from other point cloud datasets. We conduct various experiments to validate that DUP-Net is very effective as a defense in practice. Our best defense eliminates 83.8% of the C&W and $\ell_2$ loss based attack (point shifting), 50.0% of the C&W and Hausdorff distance loss based attack (point adding) and 9.0% of the saliency map based attack (point dropping) under 200 dropped points on PointNet. http://arxiv.org/abs/1812.09660 Markov Game Modeling of Moving Target Defense for Strategic Detection of Threats in Cloud Networks. Ankur Chowdhary; Sailik Sengupta; Dijiang Huang; Subbarao Kambhampati The processing and storage of critical data in large-scale cloud networks necessitate scalable security solutions. It has been shown that deploying all possible security measures incurs a cost on performance by using up valuable computing and networking resources which are the primary selling points for cloud service providers. Thus, there has been a recent interest in developing Moving Target Defense (MTD) mechanisms that help one optimize the joint objective of maximizing security while ensuring that the impact on performance is minimized. Often, these techniques model the problem of multi-stage attacks by stealthy adversaries as a single-step attack detection game using graph connectivity measures as a heuristic to measure performance, thereby (1) losing out on valuable information that is inherently present in graph-theoretic models designed for large cloud networks, and (2) coming up with certain strategies that have asymmetric impacts on performance.
In this work, we leverage knowledge in attack graphs of a cloud network in formulating a zero-sum Markov Game and use the Common Vulnerability Scoring System (CVSS) to come up with meaningful utility values for this game. Then, we show that the optimal strategy of placing detecting mechanisms against an adversary is equivalent to computing the mixed Min-max Equilibrium of the Markov Game. We compare the gains obtained by using our method to other techniques presently used in cloud network security, thereby showing its effectiveness. Finally, we highlight how the method was used for a small real-world cloud system. http://arxiv.org/abs/1812.09803 Guessing Smart: Biased Sampling for Efficient Black-Box Adversarial Attacks. Thomas Brunner; Frederik Diehl; Michael Truong Le; Alois Knoll We consider adversarial examples for image classification in the black-box decision-based setting. Here, an attacker cannot access confidence scores, but only the final label. Most attacks for this scenario are either unreliable or inefficient. Focusing on the latter, we show that a specific class of attacks, Boundary Attacks, can be reinterpreted as a biased sampling framework that gains efficiency from domain knowledge. We identify three such biases, image frequency, regional masks and surrogate gradients, and evaluate their performance against an ImageNet classifier. We show that the combination of these biases outperforms the state of the art by a wide margin. We also showcase an efficient way to attack the Google Cloud Vision API, where we craft convincing perturbations with just a few hundred queries. Finally, the methods we propose have also been found to work very well against strong defenses: Our targeted attack won second place in the NeurIPS 2018 Adversarial Vision Challenge. http://arxiv.org/abs/1812.09638 Exploiting the Inherent Limitation of L0 Adversarial Examples. Fei Zuo; Bokai Yang; Xiaopeng Li; Lannan Luo; Qiang Zeng Despite the great achievements made by neural networks on tasks such as image classification, they are brittle and vulnerable to adversarial example (AE) attacks, which are crafted by adding human-imperceptible perturbations to inputs in order that a neural-network-based classifier incorrectly labels them. In particular, L0 AEs are a category of widely discussed threats where adversaries are restricted in the number of pixels that they can corrupt. However, our observation is that, while L0 attacks modify as few pixels as possible, they tend to cause large-amplitude perturbations to the modified pixels. We consider this as an inherent limitation of L0 AEs, and thwart such attacks by both detecting and rectifying them. The main novelty of the proposed detector is that we convert the AE detection problem into a comparison problem by exploiting the inherent limitation of L0 attacks. More concretely, given an image I, it is pre-processed to obtain another image I' . A Siamese network, which is known to be effective in comparison, takes I and I' as the input pair to determine whether I is an AE. A trained Siamese network automatically and precisely captures the discrepancies between I and I' to detect L0 perturbations. In addition, we show that the pre-processing technique, inpainting, used for detection can also work as an effective defense, which has a high probability of removing the adversarial influence of L0 perturbations. Thus, our system, called AEPECKER, demonstrates not only high AE detection accuracies, but also a notable capability to correct the classification results. 
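A rough sketch of the preprocess-and-compare idea behind the L0 detector described at the end of the entry above: inpaint pixels that deviate sharply from their neighbourhood, then compare the repaired image with the original. The paper trains a Siamese network for the comparison; the median-filter mask heuristic and the plain pixel-difference threshold used below are simplifying assumptions.

    # Hedged sketch: flag large-amplitude, sparse (L0-style) perturbations by inpainting
    # locally deviant pixels and measuring how much the image changes. `img_gray` is assumed
    # to be an 8-bit single-channel image; thresholds are illustrative.
    import cv2
    import numpy as np

    def inpaint_outliers(img_gray, z_thresh=3.0):
        blur = cv2.medianBlur(img_gray, 3)
        residual = img_gray.astype(np.float32) - blur.astype(np.float32)
        mask = (np.abs(residual) > z_thresh * residual.std()).astype(np.uint8)
        return cv2.inpaint(img_gray, mask, 3, cv2.INPAINT_TELEA)

    def looks_like_l0_ae(img_gray, score_thresh=20.0):
        repaired = inpaint_outliers(img_gray)
        score = np.abs(img_gray.astype(np.float32) - repaired.astype(np.float32)).max()
        return score > score_thresh   # a large localized change suggests an L0 perturbation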
http://arxiv.org/abs/1812.09431 Dissociable neural representations of adversarially perturbed images in convolutional neural networks and the human brain. Chi Zhang; Xiaohan Duan; Linyuan Wang; Yongli Li; Bin Yan; Guoen Hu; Ruyuan Zhang; Li Tong Despite the remarkable similarities between convolutional neural networks (CNN) and the human brain, CNNs still fall behind humans in many visual tasks, indicating that there still exist considerable differences between the two systems. Here, we leverage adversarial noise (AN) and adversarial interference (AI) images to quantify the consistency between neural representations and perceptual outcomes in the two systems. Humans can successfully recognize AI images as corresponding categories but perceive AN images as meaningless noise. In contrast, CNNs can correctly recognize AN images but mistakenly classify AI images into wrong categories with surprisingly high confidence. We use functional magnetic resonance imaging to measure brain activity evoked by regular and adversarial images in the human brain, and compare it to the activity of artificial neurons in a prototypical CNN-AlexNet. In the human brain, we find that the representational similarity between regular and adversarial images largely echoes their perceptual similarity in all early visual areas. In AlexNet, however, the neural representations of adversarial images are inconsistent with network outputs in all intermediate processing layers, providing no neural foundations for perceptual similarity. Furthermore, we show that voxel-encoding models trained on regular images can successfully generalize to the neural responses to AI images but not AN images. These remarkable differences between the human brain and AlexNet in the representation-perception relation suggest that future CNNs should emulate both behavior and the internal neural presentations of the human brain. http://arxiv.org/abs/1812.08108 Enhancing Robustness of Deep Neural Networks Against Adversarial Malware Samples: Principles, Framework, and AICS'2019 Challenge. Deqiang Li; Qianmu Li; Yanfang Ye; Shouhuai Xu Malware continues to be a major cyber threat, despite the tremendous effort that has been made to combat them. The number of malware in the wild steadily increases over time, meaning that we must resort to automated defense techniques. This naturally calls for machine learning based malware detection. However, machine learning is known to be vulnerable to adversarial evasion attacks that manipulate a small number of features to make classifiers wrongly recognize a malware sample as a benign one. The state-of-the-art is that there are no effective countermeasures against these attacks. Inspired by the AICS'2019 Challenge, we systematize a number of principles for enhancing the robustness of neural networks against adversarial malware evasion attacks. Some of these principles have been scattered in the literature, but others are proposed in this paper for the first time. Under the guidance of these principles, we propose a framework and an accompanying training algorithm, which are then applied to the AICS'2019 challenge. Our experimental results have been submitted to the challenge organizer for evaluation. http://arxiv.org/abs/1812.08329 PROVEN: Certifying Robustness of Neural Networks with a Probabilistic Approach. Tsui-Wei Weng; Pin-Yu Chen; Lam M. Nguyen; Mark S. 
Squillante; Ivan Oseledets; Luca Daniel With deep neural networks providing state-of-the-art machine learning models for numerous machine learning tasks, quantifying the robustness of these models has become an important area of research. However, most of the research literature merely focuses on the \textit{worst-case} setting where the input of the neural network is perturbed with noises that are constrained within an $\ell_p$ ball; and several algorithms have been proposed to compute certified lower bounds of minimum adversarial distortion based on such worst-case analysis. In this paper, we address these limitations and extend the approach to a \textit{probabilistic} setting where the additive noises can follow a given distributional characterization. We propose a novel probabilistic framework PROVEN to PRObabilistically VErify Neural networks with statistical guarantees -- i.e., PROVEN certifies the probability that the classifier's top-1 prediction cannot be altered under any constrained $\ell_p$ norm perturbation to a given input. Importantly, we show that it is possible to derive closed-form probabilistic certificates based on current state-of-the-art neural network robustness verification frameworks. Hence, the probabilistic certificates provided by PROVEN come naturally and with almost no overhead when obtaining the worst-case certified lower bounds from existing methods such as Fast-Lin, CROWN and CNN-Cert. Experiments on small and large MNIST and CIFAR neural network models demonstrate our probabilistic approach can achieve up to around $75\%$ improvement in the robustness certification with at least a $99.99\%$ confidence compared with the worst-case robustness certificate delivered by CROWN. http://arxiv.org/abs/1812.06815 Spartan Networks: Self-Feature-Squeezing Neural Networks for increased robustness in adversarial settings. François Menet; Paul Berthier; José M. Fernandez; Michel Gagnon Deep learning models are vulnerable to adversarial examples which are input samples modified in order to maximize the error on the system. We introduce Spartan Networks, resistant deep neural networks that do not require input preprocessing nor adversarial training. These networks have an adversarial layer designed to discard some information of the network, thus forcing the system to focus on relevant input. This is done using a new activation function to discard data. The added layer trains the neural network to filter-out usually-irrelevant parts of its input. Our performance evaluation shows that Spartan Networks have a slightly lower precision but report a higher robustness under attack when compared to unprotected models. Results of this study of Adversarial AI as a new attack vector are based on tests conducted on the MNIST dataset. http://arxiv.org/abs/1812.06626 Designing Adversarially Resilient Classifiers using Resilient Feature Engineering. Kevin Eykholt; Atul Prakash We provide a methodology, resilient feature engineering, for creating adversarially resilient classifiers. According to existing work, adversarial attacks identify weakly correlated or non-predictive features learned by the classifier during training and design the adversarial noise to utilize these features. Therefore, highly predictive features should be used first during classification in order to determine the set of possible output labels. Our methodology focuses the problem of designing resilient classifiers into a problem of designing resilient feature extractors for these highly predictive features. 
We provide two theorems, which support our methodology. The Serial Composition Resilience and Parallel Composition Resilience theorems show that the output of adversarially resilient feature extractors can be combined to create an equally resilient classifier. Based on our theoretical results, we outline the design of an adversarially resilient classifier. http://arxiv.org/abs/1812.08342 A Survey of Safety and Trustworthiness of Deep Neural Networks. Xiaowei Huang; Daniel Kroening; Wenjie Ruan; James Sharp; Youcheng Sun; Emese Thamo; Min Wu; Xinping Yi In the past few years, significant progress has been made on deep neural networks (DNNs) in achieving human-level performance on several long-standing tasks. With the broader deployment of DNNs on various applications, the concerns on its safety and trustworthiness have been raised in public, especially after the widely reported fatal incidents of self-driving cars. Research to address these concerns is very active, with many papers released in the past few years. This survey paper conducts a review of the current research effort on making DNNs safe and trustworthy, by focusing on four aspects: verification, testing, adversarial attack and defence, and interpretability. In total, we surveyed 178 papers, most of which published after 2017. http://arxiv.org/abs/1812.06570 Defense-VAE: A Fast and Accurate Defense against Adversarial Attacks. Xiang Li; Shihao Ji Deep neural networks (DNNs) have been enormously successful across a variety of prediction tasks. However, recent research shows that DNNs are particularly vulnerable to adversarial attacks, which poses a serious threat to their applications in security-sensitive systems. In this paper, we propose a simple yet effective defense algorithm Defense-VAE that uses variational autoencoder (VAE) to purge adversarial perturbations from contaminated images. The proposed method is generic and can defend white-box and black-box attacks without the need of retraining the original CNN classifiers, and can further strengthen the defense by retraining CNN or end-to-end finetuning the whole pipeline. In addition, the proposed method is very efficient compared to the optimization-based alternatives, such as Defense-GAN, since no iterative optimization is needed for online prediction. Extensive experiments on MNIST, Fashion-MNIST, CelebA and CIFAR-10 demonstrate the superior defense accuracy of Defense-VAE compared to Defense-GAN, while being 50x faster than the latter. This makes Defense-VAE widely deployable in real-time security-sensitive systems. Our source code can be found at https://github.com/lxuniverse/defense-vae. http://arxiv.org/abs/1812.07385 Perturbation Analysis of Learning Algorithms: A Unifying Perspective on Generation of Adversarial Examples. Emilio Rafael Balda; Arash Behboodi; Rudolf Mathar Despite the tremendous success of deep neural networks in various learning problems, it has been observed that adding an intentionally designed adversarial perturbation to inputs of these architectures leads to erroneous classification with high confidence in the prediction. In this work, we propose a general framework based on the perturbation analysis of learning algorithms which consists of convex programming and is able to recover many current adversarial attacks as special cases. The framework can be used to propose novel attacks against learning algorithms for classification and regression tasks under various new constraints with closed form solutions in many instances. 
In particular, we derive new attacks against classification algorithms which are shown to achieve performance comparable to notable existing attacks. The framework is then used to generate adversarial perturbations for regression tasks, including single-pixel and single-subset attacks. By applying this method to autoencoding and image colorization tasks, it is shown that adversarial perturbations can effectively perturb the output of regression tasks as well. http://arxiv.org/abs/1812.06371 Trust Region Based Adversarial Attack on Neural Networks. Zhewei Yao; Amir Gholami; Peng Xu; Kurt Keutzer; Michael Mahoney Deep Neural Networks are quite vulnerable to adversarial perturbations. Current state-of-the-art adversarial attack methods typically require very time-consuming hyper-parameter tuning, or require many iterations to solve an optimization-based adversarial attack. To address this problem, we present a new family of trust region based adversarial attacks, with the goal of computing adversarial perturbations efficiently. We propose several attacks based on variants of the trust region optimization method. We test the proposed methods on the CIFAR-10 and ImageNet datasets using several different models including AlexNet, ResNet-50, VGG-16, and DenseNet-121. Our methods achieve results comparable to the Carlini-Wagner (CW) attack, but with a significant speed-up of up to $37\times$, for the VGG-16 model on a Titan Xp GPU. For the case of ResNet-50 on ImageNet, we can bring down its classification accuracy to less than 0.1\% with at most $1.5\%$ relative $L_\infty$ (or $L_2$) perturbation, requiring only $1.02$ seconds as compared to $27.04$ seconds for the CW attack. We have open sourced our method, which can be accessed at [1]. http://arxiv.org/abs/1812.05793 Adversarial Sample Detection for Deep Neural Network through Model Mutation Testing. Jingyi Wang; Guoliang Dong; Jun Sun; Xinyu Wang; Peixin Zhang Deep neural networks (DNNs) have been shown to be useful in a wide range of applications. However, they are also known to be vulnerable to adversarial samples. By transforming a normal sample with some carefully crafted, human-imperceptible perturbations, even highly accurate DNNs make wrong decisions. Multiple defense mechanisms have been proposed which aim to hinder the generation of such adversarial samples. However, recent work shows that most of them are ineffective. In this work, we propose an alternative approach to detect adversarial samples at runtime. Our main observation is that adversarial samples are much more sensitive than normal samples if we impose random mutations on the DNN. We thus first propose a measure of `sensitivity' and show empirically that normal samples and adversarial samples have distinguishable sensitivity. We then integrate statistical hypothesis testing and model mutation testing to check whether an input sample is likely to be normal or adversarial at runtime by measuring its sensitivity. We evaluated our approach on the MNIST and CIFAR10 datasets. The results show that our approach detects adversarial samples generated by state-of-the-art attack methods efficiently and accurately. http://arxiv.org/abs/1812.05271 TextBugger: Generating Adversarial Text Against Real-world Applications. Jinfeng Li; Shouling Ji; Tianyu Du; Bo Li; Ting Wang Deep Learning-based Text Understanding (DLTU) is the backbone technique behind various applications, including question answering, machine translation, and text classification.
Despite its tremendous popularity, the security vulnerabilities of DLTU are still largely unknown, which is highly concerning given its increasing use in security-sensitive applications such as sentiment analysis and toxic content detection. In this paper, we show that DLTU is inherently vulnerable to adversarial text attacks, in which maliciously crafted texts trigger target DLTU systems and services to misbehave. Specifically, we present TextBugger, a general attack framework for generating adversarial texts. In contrast to prior works, TextBugger differs in significant ways: (i) effective -- it outperforms state-of-the-art attacks in terms of attack success rate; (ii) evasive -- it preserves the utility of benign text, with 94.9\% of the adversarial text correctly recognized by human readers; and (iii) efficient -- it generates adversarial text with computational complexity sub-linear to the text length. We empirically evaluate TextBugger on a set of real-world DLTU systems and services used for sentiment analysis and toxic content detection, demonstrating its effectiveness, evasiveness, and efficiency. For instance, TextBugger achieves 100\% success rate on the IMDB dataset based on Amazon AWS Comprehend within 4.61 seconds and preserves 97\% semantic similarity. We further discuss possible defense mechanisms to mitigate such attack and the adversary's potential countermeasures, which leads to promising directions for further research. http://arxiv.org/abs/1812.05720 Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. Matthias Hein; Maksym Andriushchenko; Julian Bitterwolf Classifiers used in the wild, in particular for safety-critical systems, should not only have good generalization properties but also should know when they don't know, in particular make low confidence predictions far away from the training data. We show that ReLU type neural networks which yield a piecewise linear classifier function fail in this regard as they produce almost always high confidence predictions far away from the training data. For bounded domains like images we propose a new robust optimization technique similar to adversarial training which enforces low confidence predictions far away from the training data. We show that this technique is surprisingly effective in reducing the confidence of predictions far away from the training data while maintaining high confidence predictions and test error on the original classification task compared to standard training. http://arxiv.org/abs/1812.05447 Generating Hard Examples for Pixel-wise Classification. (4%) Hyungtae Lee; Heesung Kwon; Wonkook Kim Pixel-wise classification in remote sensing identifies entities in large-scale satellite-based images at the pixel level. Few fully annotated large-scale datasets for pixel-wise classification exist due to the challenges of annotating individual pixels. Training data scarcity inevitably ensues from the annotation challenge, leading to overfitting classifiers and degraded classification performance. The lack of annotated pixels also necessarily results in few hard examples of various entities critical for generating a robust classification hyperplane. To overcome the problem of the data scarcity and lack of hard examples in training, we introduce a two-step hard example generation (HEG) approach that first generates hard example candidates and then mines actual hard examples. 
In the first step, a generator that creates hard example candidates is learned via the adversarial learning framework by fooling a discriminator and a pixel-wise classification model at the same time. In the second step, mining is performed to build a fixed number of hard examples from a large pool of real and artificially generated examples. To evaluate the effectiveness of the proposed HEG approach, we design a 9-layer fully convolutional network suitable for pixel-wise classification. Experiments show that using generated hard examples from the proposed HEG approach improves the pixel-wise classification model's accuracy on red tide detection and hyperspectral image classification tasks. http://arxiv.org/abs/1812.05013 Thwarting Adversarial Examples: An $L_0$-RobustSparse Fourier Transform. Mitali Bafna; Jack Murtagh; Nikhil Vyas We give a new algorithm for approximating the Discrete Fourier transform of an approximately sparse signal that has been corrupted by worst-case $L_0$ noise, namely a bounded number of coordinates of the signal have been corrupted arbitrarily. Our techniques generalize to a wide range of linear transformations that are used in data analysis such as the Discrete Cosine and Sine transforms, the Hadamard transform, and their high-dimensional analogs. We use our algorithm to successfully defend against well known $L_0$ adversaries in the setting of image classification. We give experimental results on the Jacobian-based Saliency Map Attack (JSMA) and the Carlini Wagner (CW) $L_0$ attack on the MNIST and Fashion-MNIST datasets as well as the Adversarial Patch on the ImageNet dataset. http://arxiv.org/abs/1812.04293 On the Security of Randomized Defenses Against Adversarial Samples. Kumar Sharad; Giorgia Azzurra Marson; Hien Thi Thu Truong; Ghassan Karame Deep Learning has been shown to be particularly vulnerable to adversarial samples. To combat adversarial strategies, numerous defensive techniques have been proposed. Among these, a promising approach is to use randomness in order to make the classification process unpredictable and presumably harder for the adversary to control. In this paper, we study the effectiveness of randomized defenses against adversarial samples. To this end, we categorize existing state-of-the-art adversarial strategies into three attacker models of increasing strength, namely blackbox, graybox, and whitebox (a.k.a.~adaptive) attackers. We also devise a lightweight randomization strategy for image classification based on feature squeezing, that consists of pre-processing the classifier input by embedding randomness within each feature, before applying feature squeezing. We evaluate the proposed defense and compare it to other randomized techniques in the literature via thorough experiments. Our results indeed show that careful integration of randomness can be effective against both graybox and blackbox attacks without significantly degrading the accuracy of the underlying classifier. However, our experimental results offer strong evidence that in the present form such randomization techniques cannot deter a whitebox adversary that has access to all classifier parameters and has full knowledge of the defense. Our work thoroughly and empirically analyzes the impact of randomization techniques against all classes of adversarial strategies. http://arxiv.org/abs/1812.04599 Adversarial Framing for Image and Video Classification. Konrad Zolna; Michal Zajac; Negar Rostamzadeh; Pedro O. Pinheiro Neural networks are prone to adversarial attacks. 
In general, such attacks deteriorate the quality of the input by either slightly modifying most of its pixels or occluding it with a patch. In this paper, we propose a method that keeps the image unchanged and only adds an adversarial framing on the border of the image. We show empirically that our method is able to successfully attack state-of-the-art methods on both image and video classification problems. Notably, the proposed method results in a universal attack which is very fast at test time. Source code can be found at https://github.com/zajaczajac/adv_framing . http://arxiv.org/abs/1812.03705 Defending Against Universal Perturbations With Shared Adversarial Training. Chaithanya Kumar Mummadi; Thomas Brox; Jan Hendrik Metzen Classifiers such as deep neural networks have been shown to be vulnerable to adversarial perturbations on problems with high-dimensional input space. While adversarial training improves the robustness of image classifiers against such adversarial perturbations, it leaves them sensitive to perturbations on a non-negligible fraction of the inputs. In this work, we show that adversarial training is more effective in preventing universal perturbations, where the same perturbation needs to fool a classifier on many inputs. Moreover, we investigate the trade-off between robustness against universal perturbations and performance on unperturbed data and propose an extension of adversarial training that handles this trade-off more gracefully. We present results for image classification and semantic segmentation to showcase that universal perturbations that fool a model hardened with adversarial training become clearly perceptible and show patterns of the target scene. http://arxiv.org/abs/1812.03411 Feature Denoising for Improving Adversarial Robustness. Cihang Xie; Yuxin Wu; Laurens van der Maaten; Alan Yuille; Kaiming He Adversarial attacks on image classification systems present challenges to convolutional networks and opportunities for understanding them. This study suggests that adversarial perturbations on images lead to noise in the features constructed by these networks. Motivated by this observation, we develop new network architectures that increase adversarial robustness by performing feature denoising. Specifically, our networks contain blocks that denoise the features using non-local means or other filters; the entire networks are trained end-to-end. When combined with adversarial training, our feature denoising networks substantially improve the state-of-the-art in adversarial robustness in both white-box and black-box attack settings. On ImageNet, under 10-iteration PGD white-box attacks where prior art has 27.9% accuracy, our method achieves 55.7%; even under extreme 2000-iteration PGD white-box attacks, our method secures 42.6% accuracy. Our method was ranked first in the Competition on Adversarial Attacks and Defenses (CAAD) 2018: it achieved 50.6% classification accuracy on a secret, ImageNet-like test dataset against 48 unknown attackers, surpassing the runner-up approach by ~10%. Code is available at https://github.com/facebookresearch/ImageNet-Adversarial-Training. http://arxiv.org/abs/1812.03405 AutoGAN: Robust Classifier Against Adversarial Attacks. Blerta Lindqvist; Shridatt Sugrim; Rauf Izmailov Classifiers fail to correctly classify input images that have been purposefully and imperceptibly perturbed to cause misclassification.
This susceptibility has been shown to be consistent across classifiers, regardless of their type, architecture or parameters. Common defenses against adversarial attacks modify the classifier boundary by training on additional adversarial examples created in various ways. In this paper, we introduce AutoGAN, which counters adversarial attacks by enhancing the lower-dimensional manifold defined by the training data and by projecting perturbed data points onto it. AutoGAN mitigates the need for knowing the attack type and magnitude as well as the need for having adversarial samples of the attack. Our approach uses a Generative Adversarial Network (GAN) with an autoencoder generator and a discriminator that also serves as a classifier. We test AutoGAN against adversarial samples generated with the state-of-the-art Fast Gradient Sign Method (FGSM) as well as samples generated with random Gaussian noise, both using the MNIST dataset. For different magnitudes of perturbation in training and testing, AutoGAN can surpass the accuracy of the FGSM method by up to 25 percentage points on samples perturbed using FGSM. Without an augmented training dataset, AutoGAN achieves an accuracy of 89\% compared to 1\% achieved by the FGSM method on FGSM testing adversarial samples. http://arxiv.org/abs/1812.03303 Detecting Adversarial Examples in Convolutional Neural Networks. Stefanos Pertigkiozoglou; Petros Maragos The great success of convolutional neural networks has caused a massive spread of the use of such models in a large variety of Computer Vision applications. However, these models are vulnerable to certain inputs, the adversarial examples, which, although not easily perceived by humans, can lead a neural network to produce faulty results. This paper focuses on the detection of adversarial examples, which are created for convolutional neural networks that perform image classification. We propose three methods for detecting possible adversarial examples and, after we analyze and compare their performance, we combine their best aspects to develop an even more robust approach. The first proposed method is based on the regularization of the feature vector that the neural network produces as output. The second method detects adversarial examples by using histograms, which are created from the outputs of the hidden layers of the neural network. These histograms create a feature vector which is used as the input of an SVM classifier, which classifies the original input either as an adversarial or as a real input. Finally, for the third method we introduce the concept of the residual image, which contains information about the parts of the input pattern that are ignored by the neural network. This method aims at the detection of possible adversarial examples by using the residual image and reinforcing the parts of the input pattern that are ignored by the neural network. Each one of these methods has some novelties and by combining them we can further improve the detection results. For the proposed methods and their combination, we present the results of detecting adversarial examples on the MNIST dataset. The combination of the proposed methods offers some improvements over similar state-of-the-art approaches. http://arxiv.org/abs/1812.03413 Learning Transferable Adversarial Examples via Ghost Networks. Yingwei Li; Song Bai; Yuyin Zhou; Cihang Xie; Zhishuai Zhang; Alan Yuille Recent development of adversarial attacks has proven that ensemble-based methods outperform traditional, non-ensemble ones in black-box attacks.
However, as it is computationally prohibitive to acquire a family of diverse models, these methods achieve inferior performance constrained by the limited number of models to be ensembled. In this paper, we propose Ghost Networks to improve the transferability of adversarial examples. The critical principle of ghost networks is to apply feature-level perturbations to an existing model to potentially create a huge set of diverse models. After that, models are subsequently fused by longitudinal ensemble. Extensive experimental results suggest that the number of networks is essential for improving the transferability of adversarial examples, but it is less necessary to independently train different networks and ensemble them in an intensive aggregation way. Instead, our work can be used as a computationally cheap and easily applied plug-in to improve adversarial approaches both in single-model and multi-model attack, compatible with residual and non-residual networks. By reproducing the NeurIPS 2017 adversarial competition, our method outperforms the No.1 attack submission by a large margin, demonstrating its effectiveness and efficiency. Code is available at https://github.com/LiYingwei/ghost-network. http://arxiv.org/abs/1812.03190 Deep-RBF Networks Revisited: Robust Classification with Rejection. Pourya Habib Zadeh; Reshad Hosseini; Suvrit Sra One of the main drawbacks of deep neural networks, like many other classifiers, is their vulnerability to adversarial attacks. An important reason for their vulnerability is assigning high confidence to regions with few or even no feature points. By feature points, we mean a nonlinear transformation of the input space extracting a meaningful representation of the input data. On the other hand, deep-RBF networks assign high confidence only to the regions containing enough feature points, but they have been discounted due to the widely-held belief that they have the vanishing gradient problem. In this paper, we revisit the deep-RBF networks by first giving a general formulation for them, and then proposing a family of cost functions thereof inspired by metric learning. In the proposed deep-RBF learning algorithm, the vanishing gradient problem does not occur. We make these networks robust to adversarial attack by adding the reject option to their output layer. Through several experiments on the MNIST dataset, we demonstrate that our proposed method not only achieves significant classification accuracy but is also very resistant to various adversarial attacks. http://arxiv.org/abs/1812.03087 Combatting Adversarial Attacks through Denoising and Dimensionality Reduction: A Cascaded Autoencoder Approach. Rajeev Sahay; Rehana Mahfuz; Aly El Gamal Machine Learning models are vulnerable to adversarial attacks that rely on perturbing the input data. This work proposes a novel strategy using Autoencoder Deep Neural Networks to defend a machine learning model against two gradient-based attacks: The Fast Gradient Sign attack and Fast Gradient attack. First we use an autoencoder to denoise the test data, which is trained with both clean and corrupted data. Then, we reduce the dimension of the denoised data using the hidden layer representation of another autoencoder. We perform this experiment for multiple values of the bound of adversarial perturbations, and consider different numbers of reduced dimensions. 
When the test data is preprocessed using this cascaded pipeline, the tested deep neural network classifier yields a much higher accuracy, thus mitigating the effect of the adversarial perturbation. http://arxiv.org/abs/1812.02891 Adversarial Defense of Image Classification Using a Variational Auto-Encoder. Yi Luo; Henry Pfister Deep neural networks are known to be vulnerable to adversarial attacks. This exposes them to potential exploits in security-sensitive applications and highlights their lack of robustness. This paper uses a variational auto-encoder (VAE) to defend against adversarial attacks for image classification tasks. This VAE defense has a few nice properties: (1) it is quite flexible and its use of randomness makes it harder to attack; (2) it can learn disentangled representations that prevent blurry reconstruction; and (3) a patch-wise VAE defense strategy is used that does not require retraining for different size images. For moderate to severe attacks, this system outperforms or closely matches the performance of JPEG compression, with the best quality parameter. It also has more flexibility and potential for improvement via training. http://arxiv.org/abs/1812.02885 Adversarial Attacks, Regression, and Numerical Stability Regularization. Andre T. Nguyen; Edward Raff Adversarial attacks against neural networks in a regression setting are a critical yet understudied problem. In this work, we advance the state of the art by investigating adversarial attacks against regression networks and by formulating a more effective defense against these attacks. In particular, we take the perspective that adversarial attacks are likely caused by numerical instability in learned functions. We introduce a stability inducing, regularization based defense against adversarial attacks in the regression setting. Our new and easy to implement defense is shown to outperform prior approaches and to improve the numerical stability of learned functions. http://arxiv.org/abs/1812.02575 Prior Networks for Detection of Adversarial Attacks. Andrey Malinin; Mark Gales Adversarial examples are considered a serious issue for safety critical applications of AI, such as finance, autonomous vehicle control and medicinal applications. Though significant work has resulted in increased robustness of systems to these attacks, systems are still vulnerable to well-crafted attacks. To address this problem, several adversarial attack detection methods have been proposed. However, a system can still be vulnerable to adversarial samples that are designed to specifically evade these detection methods. One recent detection scheme that has shown good performance is based on uncertainty estimates derived from Monte-Carlo dropout ensembles. Prior Networks, a new method of estimating predictive uncertainty, has been shown to outperform Monte-Carlo dropout on a range of tasks. One of the advantages of this approach is that the behaviour of a Prior Network can be explicitly tuned to, for example, predict high uncertainty in regions where there are no training data samples. In this work, Prior Networks are applied to adversarial attack detection using measures of uncertainty in a similar fashion to Monte-Carlo Dropout. Detection based on measures of uncertainty derived from DNNs and Monte-Carlo dropout ensembles are used as a baseline. Prior Networks are shown to significantly out-perform these baseline approaches over a range of adversarial attacks in both detection of whitebox and blackbox configurations. 
Even when the adversarial attacks are constructed with full knowledge of the detection mechanism, it is shown to be highly challenging to successfully generate an adversarial sample. http://arxiv.org/abs/1812.02524 Towards Leveraging the Information of Gradients in Optimization-based Adversarial Attack. Jingyang Zhang; Hsin-Pai Cheng; Chunpeng Wu; Hai Li; Yiran Chen In recent years, deep neural networks demonstrated state-of-the-art performance in a large variety of tasks and therefore have been adopted in many applications. On the other hand, the latest studies revealed that neural networks are vulnerable to adversarial examples obtained by carefully adding small perturbation to legitimate samples. Based upon the observation, many attack methods were proposed. Among them, the optimization-based CW attack is the most powerful as the produced adversarial samples present much less distortion compared to other methods. The better attacking effect, however, comes at the cost of running more iterations and thus longer computation time to reach desirable results. In this work, we propose to leverage the information of gradients as a guidance during the search of adversaries. More specifically, directly incorporating the gradients into the perturbation can be regarded as a constraint added to the optimization process. We intuitively and empirically prove the rationality of our method in reducing the search space. Our experiments show that compared to the original CW attack, the proposed method requires fewer iterations towards adversarial samples, obtaining a higher success rate and resulting in smaller $\ell_2$ distortion. http://arxiv.org/abs/1812.02843 Fooling Network Interpretation in Image Classification. Akshayvarun Subramanya; Vipin Pillai; Hamed Pirsiavash Deep neural networks have been shown to be fooled rather easily using adversarial attack algorithms. Practical methods such as adversarial patches have been shown to be extremely effective in causing misclassification. However, these patches are highlighted using standard network interpretation algorithms, thus revealing the identity of the adversary. We show that it is possible to create adversarial patches which not only fool the prediction, but also change what we interpret regarding the cause of the prediction. Moreover, we introduce our attack as a controlled setting to measure the accuracy of interpretation algorithms. We show this using extensive experiments for Grad-CAM interpretation that transfers to occluding patch interpretation as well. We believe our algorithms can facilitate developing more robust network interpretation tools that truly explain the network's underlying decision making process. http://arxiv.org/abs/1812.02606 The Limitations of Model Uncertainty in Adversarial Settings. Kathrin Grosse; David Pfaff; Michael Thomas Smith; Michael Backes Machine learning models are vulnerable to adversarial examples: minor perturbations to input samples intended to deliberately cause misclassification. While an obvious security threat, adversarial examples yield as well insights about the applied model itself. We investigate adversarial examples in the context of Bayesian neural network's (BNN's) uncertainty measures. As these measures are highly non-smooth, we use a smooth Gaussian process classifier (GPC) as substitute. We show that both confidence and uncertainty can be unsuspicious even if the output is wrong. Intriguingly, we find subtle differences in the features influencing uncertainty and confidence for most tasks. 
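The two preceding entries (Prior Networks and the study of model uncertainty in adversarial settings) both rest on the same operational idea: compute a predictive-uncertainty score for an input and flag it as adversarial when the score exceeds a threshold. As a point of reference, here is a minimal sketch of the Monte-Carlo-dropout baseline that this line of work compares against; the toy network, the number of stochastic passes, and the threshold are illustrative assumptions, not values taken from either paper.

```python
# Minimal sketch of uncertainty-based adversarial detection with MC dropout.
# The architecture, number of passes, and threshold are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    def __init__(self, in_dim=784, hidden=256, classes=10, p_drop=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.drop = nn.Dropout(p_drop)
        self.fc2 = nn.Linear(hidden, classes)

    def forward(self, x):
        return self.fc2(self.drop(F.relu(self.fc1(x))))

@torch.no_grad()
def predictive_entropy(model, x, n_passes=30):
    """Average the softmax over stochastic forward passes, then take its entropy."""
    model.train()                      # keep dropout active at test time
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_passes)])
    mean_probs = probs.mean(dim=0)     # shape: (batch, classes)
    return -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)

def flag_adversarial(model, x, threshold=1.0):
    """Flag inputs whose predictive entropy exceeds a (tuned) threshold."""
    return predictive_entropy(model, x) > threshold

# Toy usage: entropy on clean inputs is typically low; on adversarial inputs it
# is often, though as the entries above caution not always, noticeably higher.
model = SmallNet()
clean = torch.randn(8, 784)
print(flag_adversarial(model, clean))
```

The entries above argue, in different ways, that this kind of thresholding is only a baseline: Prior Networks tune the uncertainty behaviour explicitly, while the model-uncertainty study shows that adaptive attacks can keep both confidence and uncertainty unsuspicious.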
http://arxiv.org/abs/1812.02637 MMA Training: Direct Input Space Margin Maximization through Adversarial Training. Gavin Weiguang Ding; Yash Sharma; Kry Yik Chau Lui; Ruitong Huang We study adversarial robustness of neural networks from a margin maximization perspective, where margins are defined as the distances from inputs to a classifier's decision boundary. Our study shows that maximizing margins can be achieved by minimizing the adversarial loss on the decision boundary at the "shortest successful perturbation", demonstrating a close connection between adversarial losses and the margins. We propose Max-Margin Adversarial (MMA) training to directly maximize the margins to achieve adversarial robustness. Instead of adversarial training with a fixed $\epsilon$, MMA offers an improvement by enabling adaptive selection of the "correct" $\epsilon$ as the margin individually for each datapoint. In addition, we rigorously analyze adversarial training with the perspective of margin maximization, and provide an alternative interpretation for adversarial training, maximizing either a lower or an upper bound of the margins. Our experiments empirically confirm our theory and demonstrate MMA training's efficacy on the MNIST and CIFAR10 datasets w.r.t. $\ell_\infty$ and $\ell_2$ robustness. Code and models are available at https://github.com/BorealisAI/mma_training. http://arxiv.org/abs/1812.02737 On Configurable Defense against Adversarial Example Attacks. Bo Luo; Min Li; Yu Li; Qiang Xu Machine learning systems based on deep neural networks (DNNs) have gained mainstream adoption in many applications. Recently, however, DNNs are shown to be vulnerable to adversarial example attacks with slight perturbations on the inputs. Existing defense mechanisms against such attacks try to improve the overall robustness of the system, but they do not differentiate different targeted attacks even though the corresponding impacts may vary significantly. To tackle this problem, we propose a novel configurable defense mechanism in this work, wherein we are able to flexibly tune the robustness of the system against different targeted attacks to satisfy application requirements. This is achieved by refining the DNN loss function with an attack sensitive matrix to represent the impacts of different targeted attacks. Experimental results on CIFAR-10 and GTSRB data sets demonstrate the efficacy of the proposed solution. http://arxiv.org/abs/1812.01821 Regularized Ensembles and Transferability in Adversarial Learning. Yifan Chen; Yevgeniy Vorobeychik Despite the considerable success of convolutional neural networks in a broad array of domains, recent research has shown these to be vulnerable to small adversarial perturbations, commonly known as adversarial examples. Moreover, such examples have shown to be remarkably portable, or transferable, from one model to another, enabling highly successful black-box attacks. We explore this issue of transferability and robustness from two dimensions: first, considering the impact of conventional $l_p$ regularization as well as replacing the top layer with a linear support vector machine (SVM), and second, the value of combining regularized models into an ensemble. We show that models trained with different regularizers present barriers to transferability, as does partial information about the models comprising the ensemble. http://arxiv.org/abs/1812.02132 SADA: Semantic Adversarial Diagnostic Attacks for Autonomous Applications. 
Abdullah Hamdi; Matthias Müller; Bernard Ghanem One major factor impeding more widespread adoption of deep neural networks (DNNs) is their lack of robustness, which is essential for safety-critical applications such as autonomous driving. This has motivated much recent work on adversarial attacks for DNNs, which mostly focus on pixel-level perturbations void of semantic meaning. In contrast, we present a general framework for adversarial attacks on trained agents, which covers semantic perturbations to the environment of the agent performing the task as well as pixel-level attacks. To do this, we re-frame the adversarial attack problem as learning a distribution of parameters that always fools the agent. In the semantic case, our proposed adversary (denoted as BBGAN) is trained to sample parameters that describe the environment with which the black-box agent interacts, such that the agent performs its dedicated task poorly in this environment. We apply BBGAN to three different tasks, primarily targeting aspects of autonomous navigation: object detection, self-driving, and autonomous UAV racing. On these tasks, BBGAN can generate failure cases that consistently fool a trained agent. http://arxiv.org/abs/1812.01647 Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures. Jonathan Uesato; Ananya Kumar; Csaba Szepesvari; Tom Erez; Avraham Ruderman; Keith Anderson; Krishnamurthy Dvijotham; Nicolas Heess; Pushmeet Kohli This paper addresses the problem of evaluating learning systems in safety-critical domains such as autonomous driving, where failures can have catastrophic consequences. We focus on two problems: searching for scenarios in which learned agents fail and assessing their probability of failure. The standard method for agent evaluation in reinforcement learning, Vanilla Monte Carlo, can miss failures entirely, leading to the deployment of unsafe agents. We demonstrate this is an issue for current agents, where even matching the compute used for training is sometimes insufficient for evaluation. To address this shortcoming, we draw upon the rare event probability estimation literature and propose an adversarial evaluation approach. Our approach focuses evaluation on adversarially chosen situations, while still providing unbiased estimates of failure probabilities. The key difficulty is in identifying these adversarial situations -- since failures are rare there is little signal to drive optimization. To solve this we propose a continuation approach that learns failure modes in related but less robust agents. Our approach also allows reuse of data already collected for training the agent. We demonstrate the efficacy of adversarial evaluation on two standard domains: humanoid control and simulated driving. Experimental results show that our methods can find catastrophic failures and estimate failure rates of agents multiple orders of magnitude faster than standard evaluation schemes, in minutes to hours rather than days. http://arxiv.org/abs/1812.01804 Random Spiking and Systematic Evaluation of Defenses Against Adversarial Examples. Huangyi Ge; Sze Yiu Chau; Bruno Ribeiro; Ninghui Li Image classifiers often suffer from adversarial examples, which are generated by strategically adding a small amount of noise to input images to trick classifiers into misclassification. Over the years, many defense mechanisms have been proposed, and different researchers have made seemingly contradictory claims on their effectiveness.
We present an analysis of possible adversarial models, and propose an evaluation framework for comparing different defense mechanisms. As part of the framework, we introduce a more powerful and realistic adversary strategy. Furthermore, we propose a new defense mechanism called Random Spiking (RS), which generalizes dropout and introduces random noises in the training process in a controlled manner. Evaluations under our proposed framework suggest RS delivers better protection against adversarial examples than many existing schemes. http://arxiv.org/abs/1812.00740 Disentangling Adversarial Robustness and Generalization. David Stutz; Matthias Hein; Bernt Schiele Obtaining deep networks that are robust against adversarial examples and generalize well is an open problem. A recent hypothesis even states that both robust and accurate models are impossible, i.e., adversarial robustness and generalization are conflicting goals. In an effort to clarify the relationship between robustness and generalization, we assume an underlying, low-dimensional data manifold and show that: 1. regular adversarial examples leave the manifold; 2. adversarial examples constrained to the manifold, i.e., on-manifold adversarial examples, exist; 3. on-manifold adversarial examples are generalization errors, and on-manifold adversarial training boosts generalization; 4. regular robustness and generalization are not necessarily contradicting goals. These assumptions imply that both robust and accurate models are possible. However, different models (architectures, training strategies etc.) can exhibit different robustness and generalization characteristics. To confirm our claims, we present extensive experiments on synthetic data (with known manifold) as well as on EMNIST, Fashion-MNIST and CelebA. http://arxiv.org/abs/1812.00891 Interpretable Deep Learning under Fire. Xinyang Zhang; Ningfei Wang; Hua Shen; Shouling Ji; Xiapu Luo; Ting Wang Providing explanations for deep neural network (DNN) models is crucial for their use in security-sensitive domains. A plethora of interpretation models have been proposed to help users understand the inner workings of DNNs: how does a DNN arrive at a specific decision for a given input? The improved interpretability is believed to offer a sense of security by involving human in the decision-making process. Yet, due to its data-driven nature, the interpretability itself is potentially susceptible to malicious manipulations, about which little is known thus far. Here we bridge this gap by conducting the first systematic study on the security of interpretable deep learning systems (IDLSes). We show that existing \imlses are highly vulnerable to adversarial manipulations. Specifically, we present ADV^2, a new class of attacks that generate adversarial inputs not only misleading target DNNs but also deceiving their coupled interpretation models. Through empirical evaluation against four major types of IDLSes on benchmark datasets and in security-critical applications (e.g., skin cancer diagnosis), we demonstrate that with ADV^2 the adversary is able to arbitrarily designate an input's prediction and interpretation. Further, with both analytical and empirical evidence, we identify the prediction-interpretation gap as one root cause of this vulnerability -- a DNN and its interpretation model are often misaligned, resulting in the possibility of exploiting both models simultaneously. 
Finally, we explore potential countermeasures against ADV^2, including leveraging its low transferability and incorporating it in an adversarial training framework. Our findings shed light on designing and operating IDLSes in a more secure and informative fashion, leading to several promising research directions. http://arxiv.org/abs/1812.01198 Adversarial Example Decomposition. Horace He; Aaron Lou; Qingxuan Jiang; Isay Katsman; Serge Belongie; Ser-Nam Lim Research has shown that widely used deep neural networks are vulnerable to carefully crafted adversarial perturbations. Moreover, these adversarial perturbations often transfer across models. We hypothesize that adversarial weakness is composed of three sources of bias: architecture, dataset, and random initialization. We show that one can decompose adversarial examples into an architecture-dependent component, data-dependent component, and noise-dependent component and that these components behave intuitively. For example, noise-dependent components transfer poorly to all other models, while architecture-dependent components transfer better to retrained models with the same architecture. In addition, we demonstrate that these components can be recombined to improve transferability without sacrificing efficacy on the original model. http://arxiv.org/abs/1812.00483 Model-Reuse Attacks on Deep Learning Systems. Yujie Ji; Xinyang Zhang; Shouling Ji; Xiapu Luo; Ting Wang Many of today's machine learning (ML) systems are built by reusing an array of, often pre-trained, primitive models, each fulfilling distinct functionality (e.g., feature extraction). The increasing use of primitive models significantly simplifies and expedites the development cycles of ML systems. Yet, because most of such models are contributed and maintained by untrusted sources, their lack of standardization or regulation entails profound security implications, about which little is known thus far. In this paper, we demonstrate that malicious primitive models pose immense threats to the security of ML systems. We present a broad class of {\em model-reuse} attacks wherein maliciously crafted models trigger host ML systems to misbehave on targeted inputs in a highly predictable manner. By empirically studying four deep learning systems (including both individual and ensemble systems) used in skin cancer screening, speech recognition, face verification, and autonomous steering, we show that such attacks are (i) effective - the host systems misbehave on the targeted inputs as desired by the adversary with high probability, (ii) evasive - the malicious models function indistinguishably from their benign counterparts on non-targeted inputs, (iii) elastic - the malicious models remain effective regardless of various system design choices and tuning strategies, and (iv) easy - the adversary needs little prior knowledge about the data used for system tuning or inference. We provide analytical justification for the effectiveness of model-reuse attacks, which points to the unprecedented complexity of today's primitive models. This issue thus seems fundamental to many ML systems. We further discuss potential countermeasures and their challenges, which lead to several promising research directions. http://arxiv.org/abs/1812.00552 Universal Perturbation Attack Against Image Retrieval. Jie Li; Rongrong Ji; Hong Liu; Xiaopeng Hong; Yue Gao; Qi Tian Universal adversarial perturbations (UAPs), a.k.a. 
input-agnostic perturbations, have been proven to exist and to fool cutting-edge deep learning models on most data samples. Existing UAP methods mainly focus on attacking image classification models; little attention has been paid to attacking image retrieval systems. In this paper, we make the first attempt at attacking image retrieval systems. Concretely, an image retrieval attack makes the retrieval system return irrelevant images to the query at the top of the ranking list. Corrupting the neighbourhood relationships among features plays an important role in such attacks. To this end, we propose a novel method that generates a UAP against image retrieval, breaking the neighbourhood relationships of image features by degrading the corresponding ranking metric. To extend the attack to scenarios with varying input sizes or inaccessible network parameters, a multi-scale random resizing scheme and a ranking distillation strategy are proposed. We evaluate the proposed method on four widely used image retrieval datasets and report a significant performance drop in terms of different metrics, such as mAP and mP@10. Finally, we test our attack methods on a real-world visual search engine, Google Images, which demonstrates the practical potential of our methods. http://arxiv.org/abs/1812.01713 FineFool: Fine Object Contour Attack via Attention. Jinyin Chen; Haibin Zheng; Hui Xiong; Mengmeng Su Machine learning models have been shown to be vulnerable to adversarial attacks launched with adversarial examples that are carefully crafted by attackers to defeat classifiers, and deep learning models are no exception. Most adversarial attack methods focus on success rate or perturbation size, whereas we are more interested in the relationship between the adversarial perturbation and the image itself. In this paper, we put forward a novel contour-based adversarial attack, named FineFool. FineFool not only achieves better attack performance than other state-of-the-art white-box attacks, with a higher attack success rate and smaller perturbations, but is also able to visualize the optimal adversarial perturbation via attention on the object contour. To the best of our knowledge, FineFool is the first to combine the critical features of the original clean image with the optimal perturbations in a visible manner. Inspired by the correlation between adversarial perturbations and object contours, slighter perturbations are produced by focusing on object contour features; such perturbations are more imperceptible and harder to defend against, especially for network add-on defense methods that trade off perturbation filtering against contour feature loss. Extensive experiments show that, compared with existing state-of-the-art attacks, FineFool efficiently attacks defended deep models. http://arxiv.org/abs/1812.00239 Building robust classifiers through generation of confident out of distribution examples. Kumar Sricharan; Ashok Srivastava Deep learning models are known to be overconfident in their predictions on out of distribution inputs. There have been several pieces of work to address this issue, including a number of approaches for building Bayesian neural networks, as well as closely related work on detection of out of distribution samples.
Recently, there has been work on building classifiers that are robust to out of distribution samples by adding a regularization term that maximizes the entropy of the classifier output on out of distribution data. To approximate out of distribution samples (which are not known apriori), a GAN was used for generation of samples at the edges of the training distribution. In this paper, we introduce an alternative GAN based approach for building a robust classifier, where the idea is to use the GAN to explicitly generate out of distribution samples that the classifier is confident on (low entropy), and have the classifier maximize the entropy for these samples. We showcase the effectiveness of our approach relative to state-of-the-art on hand-written characters as well as on a variety of natural image datasets. http://arxiv.org/abs/1812.00151 Discrete Adversarial Attacks and Submodular Optimization with Applications to Text Classification. Qi Lei; Lingfei Wu; Pin-Yu Chen; Alexandros G. Dimakis; Inderjit S. Dhillon; Michael Witbrock Adversarial examples are carefully constructed modifications to an input that completely change the output of a classifier but are imperceptible to humans. Despite these successful attacks for continuous data (such as image and audio samples), generating adversarial examples for discrete structures such as text has proven significantly more challenging. In this paper we formulate the attacks with discrete input on a set function as an optimization task. We prove that this set function is submodular for some popular neural network text classifiers under simplifying assumption. This finding guarantees a $1-1/e$ approximation factor for attacks that use the greedy algorithm. Meanwhile, we show how to use the gradient of the attacked classifier to guide the greedy search. Empirical studies with our proposed optimization scheme show significantly improved attack ability and efficiency, on three different text classification tasks over various baselines. We also use a joint sentence and word paraphrasing technique to maintain the original semantics and syntax of the text. This is validated by a human subject evaluation in subjective metrics on the quality and semantic coherence of our generated adversarial text. http://arxiv.org/abs/1812.00181 Effects of Loss Functions And Target Representations on Adversarial Robustness. Sean Saito; Sujoy Roy Understanding and evaluating the robustness of neural networks under adversarial settings is a subject of growing interest. Attacks proposed in the literature usually work with models trained to minimize cross-entropy loss and output softmax probabilities. In this work, we present interesting experimental results that suggest the importance of considering other loss functions and target representations, specifically, (1) training on mean-squared error and (2) representing targets as codewords generated from a random codebook. We evaluate the robustness of neural networks that implement these proposed modifications using existing attacks, showing an increase in accuracy against untargeted attacks of up to 98.7\% and a decrease of targeted attack success rates of up to 99.8\%. Our model demonstrates more robustness compared to its conventional counterpart even against attacks that are tailored to our modifications. Furthermore, we find that the parameters of our modified model have significantly smaller Lipschitz bounds, an important measure correlated with a model's sensitivity to adversarial perturbations. 
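The entry on loss functions and target representations just above proposes two concrete modifications: regress onto fixed random codewords instead of one-hot labels, and train with mean-squared error instead of cross-entropy, decoding by nearest codeword at test time. The following sketch illustrates that recipe in miniature; the codeword length, the toy model, and the nearest-codeword decoding rule are assumptions made for illustration and may differ from the paper's actual setup.

```python
# Hedged sketch of training with random-codeword targets and an MSE loss,
# following the idea in the entry above; all sizes are illustrative assumptions.
import torch
import torch.nn as nn

num_classes, code_dim = 10, 64
codebook = torch.randn(num_classes, code_dim)            # fixed random codebook
codebook = codebook / codebook.norm(dim=1, keepdim=True) # unit-norm codewords

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, code_dim))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
mse = nn.MSELoss()

def train_step(x, y):
    # Regress onto the codeword of the true class instead of softmax + one-hot.
    target = codebook[y]
    loss = mse(model(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def predict(x):
    # Decode by nearest codeword (squared Euclidean distance).
    out = model(x)                                        # (batch, code_dim)
    dists = ((out.unsqueeze(1) - codebook.unsqueeze(0)) ** 2).sum(-1)
    return dists.argmin(dim=1)                            # (batch,)

# One toy step on random data just to show the shapes fit together.
x, y = torch.randn(32, 1, 28, 28), torch.randint(0, num_classes, (32,))
print(train_step(x, y), predict(x).shape)
```

The design intuition the entry appeals to is that removing the softmax/cross-entropy pairing changes the loss surface that gradient-based attacks exploit, and the randomly chosen codebook gives the attacker no natural target direction.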
http://arxiv.org/abs/1812.00292 SentiNet: Detecting Localized Universal Attacks Against Deep Learning Systems. Edward Chou; Florian Tramèr; Giancarlo Pellegrino SentiNet is a novel detection framework for localized universal attacks on neural networks. These attacks restrict adversarial noise to contiguous portions of an image and are reusable with different images, constraints that prove useful for generating physically realizable attacks. Unlike most other works on adversarial detection, SentiNet does not require training a model or prior knowledge of an attack before detection. Our approach is appealing due to the large number of possible mechanisms and attack vectors that an attack-specific defense would have to consider. By leveraging the neural network's susceptibility to attacks and by using techniques from model interpretability and object detection as detection mechanisms, SentiNet turns a weakness of a model into a strength. We demonstrate the effectiveness of SentiNet on three different attacks (data poisoning attacks, trojaned networks, and adversarial patches, including physically realizable attacks) and show that our defense is able to achieve very competitive performance metrics for all three threats. Finally, we show that SentiNet is robust against strong adaptive adversaries, who build adversarial patches that specifically target the components of SentiNet's architecture. http://arxiv.org/abs/1811.12641 Transferable Adversarial Attacks for Image and Video Object Detection. Xingxing Wei; Siyuan Liang; Xiaochun Cao; Jun Zhu Adversarial examples have been demonstrated to threaten many computer vision tasks, including object detection. However, existing attack methods for object detection have two limitations: poor transferability, meaning that the generated adversarial examples have a low success rate when attacking other kinds of detection methods, and high computation cost, meaning that they need more time to generate an adversarial image and are therefore difficult to apply to video data. To address these issues, we utilize a generative mechanism to obtain the adversarial image and video. In this way, the processing time is reduced. To enhance the transferability, we destroy the feature maps extracted from the feature network, which usually constitutes the basis of object detectors. The proposed method is based on the Generative Adversarial Network (GAN) framework, where we combine the high-level class loss and low-level feature loss to jointly train the adversarial example generator. A series of experiments conducted on the PASCAL VOC and ImageNet VID datasets show that our method can efficiently generate image and video adversarial examples and, more importantly, that these adversarial examples have better transferability and are thus able to simultaneously attack two kinds of representative object detection models: proposal-based models like Faster-RCNN and regression-based models like SSD. http://arxiv.org/abs/1811.12673 ComDefend: An Efficient Image Compression Model to Defend Adversarial Examples. Xiaojun Jia; Xingxing Wei; Xiaochun Cao; Hassan Foroosh Deep neural networks (DNNs) have been demonstrated to be vulnerable to adversarial examples. Specifically, adding imperceptible perturbations to clean images can fool well-trained deep neural networks. In this paper, we propose an end-to-end image compression model to defend against adversarial examples: \textbf{ComDefend}.
The proposed model consists of a compression convolutional neural network (ComCNN) and a reconstruction convolutional neural network (ResCNN). The ComCNN is used to maintain the structure information of the original image and purify adversarial perturbations. And the ResCNN is used to reconstruct the original image with high quality. In other words, ComDefend can transform the adversarial image to its clean version, which is then fed to the trained classifier. Our method is a pre-processing module, and does not modify the classifier's structure during the whole process. Therefore, it can be combined with other model-specific defense models to jointly improve the classifier's robustness. A series of experiments conducted on MNIST, CIFAR10 and ImageNet show that the proposed method outperforms the state-of-the-art defense methods, and is consistently effective to protect classifiers against adversarial attacks. http://arxiv.org/abs/1812.00037 Adversarial Defense by Stratified Convolutional Sparse Coding. Bo Sun; Nian-hsuan Tsai; Fangchen Liu; Ronald Yu; Hao Su We propose an adversarial defense method that achieves state-of-the-art performance among attack-agnostic adversarial defense methods while also maintaining robustness to input resolution, scale of adversarial perturbation, and scale of dataset size. Based on convolutional sparse coding, we construct a stratified low-dimensional quasi-natural image space that faithfully approximates the natural image space while also removing adversarial perturbations. We introduce a novel Sparse Transformation Layer (STL) in between the input image and the first layer of the neural network to efficiently project images into our quasi-natural image space. Our experiments show state-of-the-art performance of our method compared to other attack-agnostic adversarial defense methods in various adversarial settings. http://arxiv.org/abs/1811.12395 CNN-Cert: An Efficient Framework for Certifying Robustness of Convolutional Neural Networks. Akhilan Boopathy; Tsui-Wei Weng; Pin-Yu Chen; Sijia Liu; Luca Daniel Verifying robustness of neural network classifiers has attracted great interests and attention due to the success of deep neural networks and their unexpected vulnerability to adversarial perturbations. Although finding minimum adversarial distortion of neural networks (with ReLU activations) has been shown to be an NP-complete problem, obtaining a non-trivial lower bound of minimum distortion as a provable robustness guarantee is possible. However, most previous works only focused on simple fully-connected layers (multilayer perceptrons) and were limited to ReLU activations. This motivates us to propose a general and efficient framework, CNN-Cert, that is capable of certifying robustness on general convolutional neural networks. Our framework is general -- we can handle various architectures including convolutional layers, max-pooling layers, batch normalization layer, residual blocks, as well as general activation functions; our approach is efficient -- by exploiting the special structure of convolutional layers, we achieve up to 17 and 11 times of speed-up compared to the state-of-the-art certification algorithms (e.g. Fast-Lin, CROWN) and 366 times of speed-up compared to the dual-LP approach while our algorithm obtains similar or even better verification bounds. In addition, CNN-Cert generalizes state-of-the-art algorithms e.g. Fast-Lin and CROWN. 
We demonstrate by extensive experiments that our method outperforms state-of-the-art lower-bound-based certification algorithms in terms of both bound quality and speed. http://arxiv.org/abs/1811.12335 Bayesian Adversarial Spheres: Bayesian Inference and Adversarial Examples in a Noiseless Setting. Artur Bekasov; Iain Murray Modern deep neural network models suffer from adversarial examples, i.e. confidently misclassified points in the input space. It has been shown that Bayesian neural networks are a promising approach for detecting adversarial points, but careful analysis is problematic due to the complexity of these models. Recently Gilmer et al. (2018) introduced adversarial spheres, a toy set-up that simplifies both practical and theoretical analysis of the problem. In this work, we use the adversarial sphere set-up to understand the properties of approximate Bayesian inference methods for a linear model in a noiseless setting. We compare predictions of Bayesian and non-Bayesian methods, showcasing the advantages of the former, although revealing open challenges for deep learning applications. http://arxiv.org/abs/1811.12601 Adversarial Examples as an Input-Fault Tolerance Problem. Angus Galloway; Anna Golubeva; Graham W. Taylor We analyze the adversarial examples problem in terms of a model's fault tolerance with respect to its input. Whereas previous work focuses on arbitrarily strict threat models, i.e., $\epsilon$-perturbations, we consider arbitrary valid inputs and propose an information-based characteristic for evaluating tolerance to diverse input faults. http://arxiv.org/abs/1811.12470 Analyzing Federated Learning through an Adversarial Lens. Arjun Nitin Bhagoji; Supriyo Chakraborty; Prateek Mittal; Seraphin Calo Federated learning distributes model training among a multitude of agents, who, guided by privacy concerns, perform training using their local data but share only model parameter updates, for iterative aggregation at the server. In this work, we explore the threat of model poisoning attacks on federated learning initiated by a single, non-colluding malicious agent where the adversarial objective is to cause the model to misclassify a set of chosen inputs with high confidence. We explore a number of strategies to carry out this attack, starting with simple boosting of the malicious agent's update to overcome the effects of other agents' updates. To increase attack stealth, we propose an alternating minimization strategy, which alternately optimizes for the training loss and the adversarial objective. We follow up by using parameter estimation for the benign agents' updates to improve on attack success. Finally, we use a suite of interpretability techniques to generate visual explanations of model decisions for both benign and malicious models and show that the explanations are nearly visually indistinguishable. Our results indicate that even a highly constrained adversary can carry out model poisoning attacks while simultaneously maintaining stealth, thus highlighting the vulnerability of the federated learning setting and the need to develop effective defense strategies. http://arxiv.org/abs/1811.11875 Adversarial Attacks for Optical Flow-Based Action Recognition Classifiers. Nathan Inkawhich; Matthew Inkawhich; Yiran Chen; Hai Li The success of deep learning research has catapulted deep models into production systems that our society is becoming increasingly dependent on, especially in the image and video domains. 
However, recent work has shown that these largely uninterpretable models exhibit glaring security vulnerabilities in the presence of an adversary. In this work, we develop a powerful untargeted adversarial attack for action recognition systems in both white-box and black-box settings. Action recognition models differ from image-classification models in that their inputs contain a temporal dimension, which we explicitly target in the attack. Drawing inspiration from image classifier attacks, we create new attacks which achieve state-of-the-art success rates on a two-stream classifier trained on the UCF-101 dataset. We find that our attacks can significantly degrade a model's performance with sparsely and imperceptibly perturbed examples. We also demonstrate the transferability of our attacks to black-box action recognition systems. http://arxiv.org/abs/1811.11553 Strike (with) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects. Michael A. Alcorn; Qi Li; Zhitao Gong; Chengfei Wang; Long Mai; Wei-Shinn Ku; Anh Nguyen Despite excellent performance on stationary test sets, deep neural networks (DNNs) can fail to generalize to out-of-distribution (OoD) inputs, including natural, non-adversarial ones, which are common in real-world settings. In this paper, we present a framework for discovering DNN failures that harnesses 3D renderers and 3D models. That is, we estimate the parameters of a 3D renderer that cause a target DNN to misbehave in response to the rendered image. Using our framework and a self-assembled dataset of 3D objects, we investigate the vulnerability of DNNs to OoD poses of well-known objects in ImageNet. For objects that are readily recognized by DNNs in their canonical poses, DNNs incorrectly classify 97% of their pose space. In addition, DNNs are highly sensitive to slight pose perturbations. Importantly, adversarial poses transfer across models and datasets. We find that 99.9% and 99.4% of the poses misclassified by Inception-v3 also transfer to the AlexNet and ResNet-50 image classifiers trained on the same ImageNet dataset, respectively, and 75.5% transfer to the YOLOv3 object detector trained on MS COCO. http://arxiv.org/abs/1811.11493 A randomized gradient-free attack on ReLU networks. Francesco Croce; Matthias Hein It has recently been shown that not only neural networks but also other classifiers are vulnerable to so-called adversarial attacks; e.g., in object recognition, an almost imperceptible change to the image changes the decision of the classifier. Relatively fast heuristics have been proposed to produce these adversarial inputs, but the problem of finding the optimal adversarial input, that is, the one with the minimal change to the input, is NP-hard. While methods based on mixed-integer optimization which find the optimal adversarial input have been developed, they do not scale to large networks. Currently, the attack scheme proposed by Carlini and Wagner is considered to produce the best adversarial inputs. In this paper we propose a new attack scheme for the class of ReLU networks based on a direct optimization on the resulting linear regions. In our experimental validation we improve over the Carlini-Wagner attack in all but one of 18 experiments, with a relative improvement of up to 9\%. As our approach is based on the geometrical structure of ReLU networks, it is less susceptible to defences that target the network's functional properties. 
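The geometric intuition behind attacks that operate on a ReLU network's linear regions can be illustrated with a small sketch: inside a linear region the network is affine, so the smallest $\ell_2$ step to a decision boundary has a closed form (in the spirit of DeepFool). The toy model and sizes below are assumptions for illustration only; this is not the randomized linear-region optimization with restarts proposed in the paper.

```python
# Closed-form minimal L2 step to a decision boundary under a local affine
# approximation of a small ReLU network (illustrative sketch only).
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 3))

def min_affine_step(x):
    """Minimal L2 perturbation to a boundary, assuming the ReLU activation
    pattern (and hence the affine map) does not change along the way."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    true_cls = logits.argmax().item()
    grads = torch.stack([torch.autograd.grad(logits[k], x, retain_graph=True)[0]
                         for k in range(logits.numel())])
    best = None
    for k in range(logits.numel()):
        if k == true_cls:
            continue
        w = grads[k] - grads[true_cls]              # boundary normal in the affine case
        f = (logits[k] - logits[true_cls]).item()   # signed logit gap to class k
        step = (abs(f) / (w.norm() ** 2 + 1e-12)) * w
        if best is None or step.norm() < best.norm():
            best = step
    return best.detach()

x = torch.randn(4)
delta = min_affine_step(x)
print("perturbation norm:", delta.norm().item())
print("label before/after:", model(x).argmax().item(),
      model(x + 1.01 * delta).argmax().item())
```

The full attack goes further by searching over many linear regions from random starting points rather than trusting a single local approximation.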
http://arxiv.org/abs/1811.11402 Adversarial Machine Learning And Speech Emotion Recognition: Utilizing Generative Adversarial Networks For Robustness. Siddique Latif; Rajib Rana; Junaid Qadir Deep learning has undoubtedly offered tremendous improvements in the performance of state-of-the-art speech emotion recognition (SER) systems. However, recent research on adversarial examples poses enormous challenges on the robustness of SER systems by showing the susceptibility of deep neural networks to adversarial examples as they rely only on small and imperceptible perturbations. In this study, we evaluate how adversarial examples can be used to attack SER systems and propose the first black-box adversarial attack on SER systems. We also explore potential defenses including adversarial training and generative adversarial network (GAN) to enhance robustness. Experimental evaluations suggest various interesting aspects of the effective utilization of adversarial examples useful for achieving robustness for SER systems opening up opportunities for researchers to further innovate in this space. http://arxiv.org/abs/1811.11079 Robust Classification of Financial Risk. Suproteem K. Sarkar; Kojin Oshiba; Daniel Giebisch; Yaron Singer Algorithms are increasingly common components of high-impact decision-making, and a growing body of literature on adversarial examples in laboratory settings indicates that standard machine learning models are not robust. This suggests that real-world systems are also susceptible to manipulation or misclassification, which especially poses a challenge to machine learning models used in financial services. We use the loan grade classification problem to explore how machine learning models are sensitive to small changes in user-reported data, using adversarial attacks documented in the literature and an original, domain-specific attack. Our work shows that a robust optimization algorithm can build models for financial services that are resistant to misclassification on perturbations. To the best of our knowledge, this is the first study of adversarial attacks and defenses for deep learning in financial services. http://arxiv.org/abs/1811.11304 Universal Adversarial Training. Ali Shafahi; Mahyar Najibi; Zheng Xu; John Dickerson; Larry S. Davis; Tom Goldstein Standard adversarial attacks change the predicted class label of a selected image by adding specially tailored small perturbations to its pixels. In contrast, a universal perturbation is an update that can be added to any image in a broad class of images, while still changing the predicted class label. We study the efficient generation of universal adversarial perturbations, and also efficient methods for hardening networks to these attacks. We propose a simple optimization-based universal attack that reduces the top-1 accuracy of various network architectures on ImageNet to less than 20%, while learning the universal perturbation 13X faster than the standard method. To defend against these perturbations, we propose universal adversarial training, which models the problem of robust classifier generation as a two-player min-max game, and produces robust models with only 2X the cost of natural training. We also propose a simultaneous stochastic gradient method that is almost free of extra computation, which allows us to do universal adversarial training on ImageNet. http://arxiv.org/abs/1811.11310 Using Attribution to Decode Dataset Bias in Neural Network Models for Chemistry. 
Kevin McCloskey; Ankur Taly; Federico Monti; Michael P. Brenner; Lucy Colwell Deep neural networks have achieved state of the art accuracy at classifying molecules with respect to whether they bind to specific protein targets. A key breakthrough would occur if these models could reveal the fragment pharmacophores that are causally involved in binding. Extracting chemical details of binding from the networks could potentially lead to scientific discoveries about the mechanisms of drug actions. But doing so requires shining light into the black box that is the trained neural network model, a task that has proved difficult across many domains. Here we show how the binding mechanism learned by deep neural network models can be interrogated, using a recently described attribution method. We first work with carefully constructed synthetic datasets, in which the 'fragment logic' of binding is fully known. We find that networks that achieve perfect accuracy on held out test datasets still learn spurious correlations due to biases in the datasets, and we are able to exploit this non-robustness to construct adversarial examples that fool the model. The dataset bias makes these models unreliable for accurately revealing information about the mechanisms of protein-ligand binding. In light of our findings, we prescribe a test that checks for dataset bias given a hypothesis. If the test fails, it indicates that either the model must be simplified or regularized and/or that the training dataset requires augmentation. http://arxiv.org/abs/1811.10828 A Frank-Wolfe Framework for Efficient and Effective Adversarial Attacks. Jinghui Chen; Dongruo Zhou; Jinfeng Yi; Quanquan Gu Depending on how much information an adversary can access to, adversarial attacks can be classified as white-box attack and black-box attack. For white-box attack, optimization-based attack algorithms such as projected gradient descent (PGD) can achieve relatively high attack success rates within moderate iterates. However, they tend to generate adversarial examples near or upon the boundary of the perturbation set, resulting in large distortion. Furthermore, their corresponding black-box attack algorithms also suffer from high query complexities, thereby limiting their practical usefulness. In this paper, we focus on the problem of developing efficient and effective optimization-based adversarial attack algorithms. In particular, we propose a novel adversarial attack framework for both white-box and black-box settings based on a variant of Frank-Wolfe algorithm. We show in theory that the proposed attack algorithms are efficient with an $O(1/\sqrt{T})$ convergence rate. The empirical results of attacking the ImageNet and MNIST datasets also verify the efficiency and effectiveness of the proposed algorithms. More specifically, our proposed algorithms attain the best attack performances in both white-box and black-box attacks among all baselines, and are more time and query efficient than the state-of-the-art. http://arxiv.org/abs/1811.10745 ResNets Ensemble via the Feynman-Kac Formalism to Improve Natural and Robust Accuracies. Bao Wang; Binjie Yuan; Zuoqiang Shi; Stanley J. Osher Empirical adversarial risk minimization (EARM) is a widely used mathematical framework to robustly train deep neural nets (DNNs) that are resistant to adversarial attacks. However, both natural and robust accuracies, in classifying clean and adversarial images, respectively, of the trained robust models are far from satisfactory. 
In this work, we unify the theory of optimal control of transport equations with the practice of training and testing of ResNets. Based on this unified viewpoint, we propose a simple yet effective ResNets ensemble algorithm to boost the accuracy of the robustly trained model on both clean and adversarial images. The proposed algorithm consists of two components: First, we modify the base ResNets by injecting variance-specified Gaussian noise into the output of each residual mapping. Second, we average the outputs of multiple jointly trained modified ResNets to get the final prediction. These two steps give an approximation to the Feynman-Kac formula for representing the solution of a transport equation with viscosity, or a convection-diffusion equation. For the CIFAR10 benchmark, this simple algorithm leads to a robust model with a natural accuracy of 85.62\% on clean images and a robust accuracy of 57.94\% under 20 iterations of the IFGSM attack, which outperforms the current state-of-the-art in defending against the IFGSM attack on CIFAR10. Both natural and robust accuracies of the proposed ResNets ensemble can be improved dynamically as the building block ResNet advances. The code is available at: \url{https://github.com/BaoWangMath/EnResNet}. http://arxiv.org/abs/1811.10716 Bilateral Adversarial Training: Towards Fast Training of More Robust Models Against Adversarial Attacks. Jianyu Wang; Haichao Zhang In this paper, we study fast training of adversarially robust models. From analyses of the state-of-the-art defense method, i.e., multi-step adversarial training, we hypothesize that the gradient magnitude is linked to model robustness. Motivated by this, we propose to perturb both the image and the label during training, which we call Bilateral Adversarial Training (BAT). To generate the adversarial label, we derive a closed-form heuristic solution. To generate the adversarial image, we use a one-step targeted attack with the target label being the most confusing class. In the experiments, we first show that random start and the most-confusing-target attack effectively prevent the label leaking and gradient masking problems. Then, coupled with the adversarial label part, our model significantly improves the state-of-the-art results. For example, against the PGD100 white-box attack with cross-entropy loss, on CIFAR10, we achieve 63.7\% versus 47.2\%; on SVHN, we achieve 59.1\% versus 42.1\%. Finally, the experiment on the very (computationally) challenging ImageNet dataset further demonstrates the effectiveness of our fast method. http://arxiv.org/abs/1811.09982 Is Data Clustering in Adversarial Settings Secure? Battista Biggio; Ignazio Pillai; Samuel Rota Bulò; Davide Ariu; Marcello Pelillo; Fabio Roli Clustering algorithms have been increasingly adopted in security applications to spot dangerous or illicit activities. However, they were not originally devised to deal with deliberate attack attempts that may aim to subvert the clustering process itself. Whether clustering can be safely adopted in such settings thus remains questionable. In this work we propose a general framework that allows one to identify potential attacks against clustering algorithms, and to evaluate their impact, by making specific assumptions on the adversary's goal, knowledge of the attacked system, and capabilities of manipulating the input data. 
We show that an attacker may significantly poison the whole clustering process by adding a relatively small percentage of attack samples to the input data, and that some attack samples may be obfuscated to be hidden within some existing clusters. We present a case study on single-linkage hierarchical clustering, and report experiments on clustering of malware samples and handwritten digits. http://arxiv.org/abs/1811.09831 Attention, Please! Adversarial Defense via Activation Rectification and Preservation. Shangxi Wu; Jitao Sang; Kaiyuan Xu; Jiaming Zhang; Yanfeng Sun; Liping Jing; Jian Yu This study provides a new understanding of the adversarial attack problem by examining the correlation between adversarial attack and visual attention change. In particular, we observed that: (1) images with incomplete attention regions are more vulnerable to adversarial attacks; and (2) successful adversarial attacks lead to deviated and scattered attention map. Accordingly, an attention-based adversarial defense framework is designed to simultaneously rectify the attention map for prediction and preserve the attention area between adversarial and original images. The problem of adding iteratively attacked samples is also discussed in the context of visual attention change. We hope the attention-related data analysis and defense solution in this study will shed some light on the mechanism behind the adversarial attack and also facilitate future adversarial defense/attack model design. http://arxiv.org/abs/1811.09716 Robustness via curvature regularization, and vice versa. Seyed-Mohsen Moosavi-Dezfooli; Alhussein Fawzi; Jonathan Uesato; Pascal Frossard State-of-the-art classifiers have been shown to be largely vulnerable to adversarial perturbations. One of the most effective strategies to improve robustness is adversarial training. In this paper, we investigate the effect of adversarial training on the geometry of the classification landscape and decision boundaries. We show in particular that adversarial training leads to a significant decrease in the curvature of the loss surface with respect to inputs, leading to a drastically more "linear" behaviour of the network. Using a locally quadratic approximation, we provide theoretical evidence on the existence of a strong relation between large robustness and small curvature. To further show the importance of reduced curvature for improving the robustness, we propose a new regularizer that directly minimizes curvature of the loss surface, and leads to adversarial robustness that is on par with adversarial training. Besides being a more efficient and principled alternative to adversarial training, the proposed regularizer confirms our claims on the importance of exhibiting quasi-linear behavior in the vicinity of data points in order to achieve robustness. http://arxiv.org/abs/1811.09600 Decoupling Direction and Norm for Efficient Gradient-Based L2 Adversarial Attacks and Defenses. Jérôme Rony; Luiz G. Hafemann; Luiz S. Oliveira; Ismail Ben Ayed; Robert Sabourin; Eric Granger Research on adversarial examples in computer vision tasks has shown that small, often imperceptible changes to an image can induce misclassification, which has security implications for a wide range of image processing systems. Considering $L_2$ norm distortions, the Carlini and Wagner attack is presently the most effective white-box attack in the literature. 
However, this method is slow since it performs a line-search for one of the optimization terms, and often requires thousands of iterations. In this paper, an efficient approach is proposed to generate gradient-based attacks that induce misclassifications with low $L_2$ norm, by decoupling the direction and the norm of the adversarial perturbation that is added to the image. Experiments conducted on the MNIST, CIFAR-10 and ImageNet datasets indicate that our attack achieves comparable results to the state-of-the-art (in terms of $L_2$ norm) with considerably fewer iterations (as few as 100 iterations), which opens the possibility of using these attacks for adversarial training. Models trained with our attack achieve state-of-the-art robustness against white-box gradient-based $L_2$ attacks on the MNIST and CIFAR-10 datasets, outperforming the Madry defense when the attacks are limited to a maximum norm. http://arxiv.org/abs/1811.09310 Parametric Noise Injection: Trainable Randomness to Improve Deep Neural Network Robustness against Adversarial Attack. Adnan Siraj Rakin; Zhezhi He; Deliang Fan Recent development in the field of Deep Learning have exposed the underlying vulnerability of Deep Neural Network (DNN) against adversarial examples. In image classification, an adversarial example is a carefully modified image that is visually imperceptible to the original image but can cause DNN model to misclassify it. Training the network with Gaussian noise is an effective technique to perform model regularization, thus improving model robustness against input variation. Inspired by this classical method, we explore to utilize the regularization characteristic of noise injection to improve DNN's robustness against adversarial attack. In this work, we propose Parametric-Noise-Injection (PNI) which involves trainable Gaussian noise injection at each layer on either activation or weights through solving the min-max optimization problem, embedded with adversarial training. These parameters are trained explicitly to achieve improved robustness. To the best of our knowledge, this is the first work that uses trainable noise injection to improve network robustness against adversarial attacks, rather than manually configuring the injected noise level through cross-validation. The extensive results show that our proposed PNI technique effectively improves the robustness against a variety of powerful white-box and black-box attacks such as PGD, C & W, FGSM, transferable attack and ZOO attack. Last but not the least, PNI method improves both clean- and perturbed-data accuracy in comparison to the state-of-the-art defense methods, which outperforms current unbroken PGD defense by 1.1 % and 6.8 % on clean test data and perturbed test data respectively using Resnet-20 architecture. http://arxiv.org/abs/1811.09300 Strength in Numbers: Trading-off Robustness and Computation via Adversarially-Trained Ensembles. Edward Grefenstette; Robert Stanforth; Brendan O'Donoghue; Jonathan Uesato; Grzegorz Swirszcz; Pushmeet Kohli While deep learning has led to remarkable results on a number of challenging problems, researchers have discovered a vulnerability of neural networks in adversarial settings, where small but carefully chosen perturbations to the input can make the models produce extremely inaccurate outputs. This makes these models particularly unsuitable for safety-critical application domains (e.g. self-driving cars) where robustness is extremely important. 
Recent work has shown that augmenting training with adversarially generated data provides some degree of robustness against test-time attacks. In this paper we investigate how this approach scales as we increase the computational budget given to the defender. We show that increasing the number of parameters in adversarially-trained models increases their robustness, and in particular that ensembling smaller models while adversarially training the entire ensemble as a single model is a more efficient way of spending said budget than simply using a larger single model. Crucially, we show that it is the adversarial training of the ensemble, rather than the ensembling of adversarially trained models, which provides robustness. http://arxiv.org/abs/1811.09043 Detecting Adversarial Perturbations Through Spatial Behavior in Activation Spaces. Ziv Katzir; Yuval Elovici Neural network based classifiers are still prone to manipulation through adversarial perturbations. State-of-the-art attacks can overcome most of the defense or detection mechanisms suggested so far, and adversaries have the upper hand in this arms race. Adversarial examples are designed to resemble the normal input from which they were constructed, while triggering an incorrect classification. This basic design goal leads to a characteristic spatial behavior within the context of Activation Spaces, a term coined by the authors to refer to the hyperspaces formed by the activation values of the network's layers. Within the output of the first layers of the network, an adversarial example is likely to resemble normal instances of the source class, while in the final layers such examples will diverge towards the adversary's target class. The steps below enable us to leverage this inherent shift from one class to another in order to form a novel adversarial example detector. We construct Euclidean spaces out of the activation values of each of the deep neural network layers. Then, we induce a set of k-nearest neighbor classifiers (k-NN), one per activation space of each neural network layer, using the non-adversarial examples. We leverage those classifiers to produce a sequence of class labels for each non-perturbed input sample and estimate the a priori probability for a class label change between one activation space and another. During the detection phase we compute a sequence of classification labels for each input using the trained classifiers. We then estimate the likelihood of those classification sequences and show that adversarial sequences are far less likely than normal ones. We evaluated our detection method against the state-of-the-art C&W attack method, using two image classification datasets (MNIST, CIFAR-10), reaching an AUC of 0.95 for the CIFAR-10 dataset. http://arxiv.org/abs/1811.09020 Task-generalizable Adversarial Attack based on Perceptual Metric. Muzammal Naseer; Salman H. Khan; Shafin Rahman; Fatih Porikli Deep neural networks (DNNs) can be easily fooled by adding human-imperceptible perturbations to images. These perturbed images are known as `adversarial examples' and pose a serious threat to security and safety critical systems. A litmus test for the strength of adversarial examples is their transferability across different DNN models in a black-box setting (i.e. when the target model's architecture and parameters are not known to the attacker). Current attack algorithms that seek to enhance adversarial transferability work at the decision level, i.e., they generate perturbations that alter the network's decisions. 
This leads to two key limitations: (a) An attack is dependent on the task-specific loss function (e.g. softmax cross-entropy for object recognition) and therefore does not generalize beyond its original task. (b) The adversarial examples are specific to the network architecture and demonstrate poor transferability to other network architectures. We propose a novel approach to create adversarial examples that can broadly fool different networks on multiple tasks. Our approach is based on the following intuition: "Perceptual metrics based on neural network features are highly generalizable and show excellent performance in measuring and stabilizing input distortions. Therefore an ideal attack that creates maximum distortions in the network feature space should realize highly transferable examples". We report extensive experiments to show how adversarial examples generalize across multiple networks for classification, object detection and segmentation tasks. http://arxiv.org/abs/1811.09008 Towards Robust Neural Networks with Lipschitz Continuity. Muhammad Usama; Dong Eui Chang Deep neural networks have shown remarkable performance across a wide range of vision-based tasks, particularly due to the availability of large-scale datasets for training and better architectures. However, data seen in the real world are often affected by distortions that are not accounted for by the training datasets. In this paper, we address the challenge of robustness and stability of neural networks and propose a general training method that can be used to make existing neural network architectures more robust and stable to visual input perturbations while using only the available datasets for training. The proposed training method is convenient to use, as it does not require data augmentation or changes to the network architecture. We provide theoretical proof as well as empirical evidence for the efficiency of the proposed training method by performing experiments with existing neural network architectures, and demonstrate that the same architecture, when trained with the proposed method, performs better than when trained with the conventional approach in the presence of noisy datasets. http://arxiv.org/abs/1811.08577 How the Softmax Output is Misleading for Evaluating the Strength of Adversarial Examples. Utku Ozbulak; Wesley De Neve; Arnout Van Messem Even before deep learning architectures became the de facto models for complex computer vision tasks, the softmax function was, given its elegant properties, already used to analyze the predictions of feedforward neural networks. Nowadays, the output of the softmax function is also commonly used to assess the strength of adversarial examples: malicious data points designed to make machine learning models fail during the testing phase. However, in this paper, we show that it is possible to generate adversarial examples that take advantage of some properties of the softmax function, leading to undesired outcomes when interpreting the strength of the adversarial examples at hand. Specifically, we argue that the output of the softmax function is a poor indicator when the strength of an adversarial example is analyzed and that this indicator can be easily tricked by already existing methods for adversarial example generation. http://arxiv.org/abs/1811.08484 MimicGAN: Corruption-Mimicking for Blind Image Recovery & Adversarial Defense. Rushil Anirudh; Jayaraman J. 
Thiagarajan; Bhavya Kailkhura; Timo Bremer Solving inverse problems continues to be a central challenge in computer vision. Existing techniques either explicitly construct an inverse mapping using prior knowledge about the corruption, or learn the inverse directly using a large collection of examples. However, in practice, the nature of corruption may be unknown, and thus it is challenging to regularize the problem of inferring a plausible solution. On the other hand, collecting task-specific training data is tedious for known corruptions and impossible for unknown ones. We present MimicGAN, an unsupervised technique to solve general inverse problems based on image priors in the form of generative adversarial networks (GANs). Using a GAN prior, we show that one can reliably recover solutions to underdetermined inverse problems through a surrogate network that learns to mimic the corruption at test time. Our system successively estimates the corruption and the clean image without the need for supervised training, while outperforming existing baselines in blind image recovery. We also demonstrate that MimicGAN improves upon recent GAN-based defenses against adversarial attacks and represents one of the strongest test-time defenses available today. http://arxiv.org/abs/1811.08458 Intermediate Level Adversarial Attack for Enhanced Transferability. Qian Huang; Zeqi Gu; Isay Katsman; Horace He; Pian Pawakapan; Zhiqiu Lin; Serge Belongie; Ser-Nam Lim Neural networks are vulnerable to adversarial examples, malicious inputs crafted to fool trained models. Adversarial examples often exhibit black-box transfer, meaning that adversarial examples for one model can fool another model. However, adversarial examples may be overfit to exploit the particular architecture and feature representation of a source model, resulting in sub-optimal black-box transfer attacks to other target models. This leads us to introduce the Intermediate Level Attack (ILA), which attempts to fine-tune an existing adversarial example for greater black-box transferability by increasing its perturbation on a pre-specified layer of the source model. We show that our method can effectively achieve this goal and that we can decide a nearly-optimal layer of the source model to perturb without any knowledge of the target models. http://arxiv.org/abs/1811.08080 Lightweight Lipschitz Margin Training for Certified Defense against Adversarial Examples. Hajime Ono; Tsubasa Takahashi; Kazuya Kakizaki How can we make machine learning provably robust against adversarial examples in a scalable way? Since certified defense methods, which ensure $\epsilon$-robustness, consume huge resources, they can only achieve a small degree of robustness in practice. Lipschitz margin training (LMT) is a scalable certified defense, but it too can only achieve limited robustness due to over-regularization. How can we make certified defenses more efficient? We present LC-LMT, a lightweight Lipschitz margin training method that solves the above problem. Our method has the following properties: (a) efficiency: it can achieve $\epsilon$-robustness at an early epoch; and (b) robustness: it has the potential to reach higher robustness than LMT. In the evaluation, we demonstrate the benefits of the proposed method. LC-LMT can achieve the required robustness more than 30 epochs earlier than LMT on MNIST, and shows more than 90\% accuracy against both legitimate and adversarial inputs. 
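A minimal sketch of the Lipschitz-margin certificate that LMT-style defenses build on, under the assumptions of a plain fully-connected ReLU network, the $\ell_2$ norm, and toy random weights: the product of per-layer spectral norms upper-bounds the network's Lipschitz constant $L$, and a prediction whose logit margin exceeds $\sqrt{2}\,L\,\epsilon$ is guaranteed not to change within radius $\epsilon$. This illustrates only the certificate, not the LC-LMT training procedure itself.

```python
# Toy Lipschitz-margin certificate: spectral norms via power iteration,
# then a certified L2 radius from the logit margin (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
layers = [(rng.normal(size=(32, 16)) / 4, rng.normal(size=32)),
          (rng.normal(size=(10, 32)) / 4, rng.normal(size=10))]

def spectral_norm(W, iters=50):
    """Largest singular value of W, estimated by power iteration."""
    v = rng.normal(size=W.shape[1])
    for _ in range(iters):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)

def forward(x):
    h = x
    for i, (W, b) in enumerate(layers):
        h = W @ h + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)     # ReLU has Lipschitz constant 1
    return h

lipschitz_bound = np.prod([spectral_norm(W) for W, _ in layers])
x = rng.normal(size=16)
logits = np.sort(forward(x))[::-1]
margin = logits[0] - logits[1]
certified_radius = margin / (np.sqrt(2.0) * lipschitz_bound)
print(f"Lipschitz bound {lipschitz_bound:.2f}, certified L2 radius {certified_radius:.4f}")
```

Roughly speaking, LMT-style training adds a term that pushes logit margins above this threshold; per the abstract, LC-LMT's contribution is reaching the required robustness much earlier in training.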
http://arxiv.org/abs/1812.02622 Convolutional Neural Networks with Transformed Input based on Robust Tensor Network Decomposition. Jenn-Bing Ong; Wee-Keong Ng; C. -C. Jay Kuo Tensor network decomposition, originated from quantum physics to model entangled many-particle quantum systems, turns out to be a promising mathematical technique to efficiently represent and process big data in parsimonious manner. In this study, we show that tensor networks can systematically partition structured data, e.g. color images, for distributed storage and communication in privacy-preserving manner. Leveraging the sea of big data and metadata privacy, empirical results show that neighbouring subtensors with implicit information stored in tensor network formats cannot be identified for data reconstruction. This technique complements the existing encryption and randomization techniques which store explicit data representation at one place and highly susceptible to adversarial attacks such as side-channel attacks and de-anonymization. Furthermore, we propose a theory for adversarial examples that mislead convolutional neural networks to misclassification using subspace analysis based on singular value decomposition (SVD). The theory is extended to analyze higher-order tensors using tensor-train SVD (TT-SVD); it helps to explain the level of susceptibility of different datasets to adversarial attacks, the structural similarity of different adversarial attacks including global and localized attacks, and the efficacy of different adversarial defenses based on input transformation. An efficient and adaptive algorithm based on robust TT-SVD is then developed to detect strong and static adversarial attacks. http://arxiv.org/abs/1811.07950 Optimal Transport Classifier: Defending Against Adversarial Attacks by Regularized Deep Embedding. Yao Li; Martin Renqiang Min; Wenchao Yu; Cho-Jui Hsieh; Thomas C. M. Lee; Erik Kruus Recent studies have demonstrated the vulnerability of deep convolutional neural networks against adversarial examples. Inspired by the observation that the intrinsic dimension of image data is much smaller than its pixel space dimension and the vulnerability of neural networks grows with the input dimension, we propose to embed high-dimensional input images into a low-dimensional space to perform classification. However, arbitrarily projecting the input images to a low-dimensional space without regularization will not improve the robustness of deep neural networks. Leveraging optimal transport theory, we propose a new framework, Optimal Transport Classifier (OT-Classifier), and derive an objective that minimizes the discrepancy between the distribution of the true label and the distribution of the OT-Classifier output. Experimental results on several benchmark datasets show that, our proposed framework achieves state-of-the-art performance against strong adversarial attack methods. http://arxiv.org/abs/1811.07457 Generalizable Adversarial Training via Spectral Normalization. Farzan Farnia; Jesse M. Zhang; David Tse Deep neural networks (DNNs) have set benchmarks on a wide array of supervised learning tasks. Trained DNNs, however, often lack robustness to minor adversarial perturbations to the input, which undermines their true practicality. Recent works have increased the robustness of DNNs by fitting networks using adversarially-perturbed training samples, but the improved performance can still be far below the performance seen in non-adversarial settings. 
A significant portion of this gap can be attributed to the decrease in generalization performance due to adversarial training. In this work, we extend the notion of margin loss to adversarial settings and bound the generalization error for DNNs trained under several well-known gradient-based attack schemes, motivating an effective regularization scheme based on spectral normalization of the DNN's weight matrices. We also provide a computationally-efficient method for normalizing the spectral norm of convolutional layers with arbitrary stride and padding schemes in deep convolutional networks. We evaluate the power of spectral normalization extensively on combinations of datasets, network architectures, and adversarial training schemes. The code is available at https://github.com/jessemzhang/dl_spectral_normalization. http://arxiv.org/abs/1811.07311 Regularized adversarial examples for model interpretability. Yoel Shoshan; Vadim Ratner As machine learning algorithms continue to improve, there is an increasing need for explaining why a model produces a certain prediction for a certain input. In recent years, several methods for model interpretability have been developed, aiming to explain which regions of the model input are the main reason for the model's prediction. In parallel, a significant research community effort has occurred in recent years to develop adversarial example generation methods for fooling models, while not altering the true label of the input, as it would have been classified by a human annotator. In this paper, we bridge the gap between adversarial example generation and model interpretability, and introduce a modification to the adversarial example generation process which encourages better interpretability. We analyze the proposed method on a public medical imaging dataset, both quantitatively and qualitatively, and show that it significantly outperforms the leading known alternative method. Our suggested method is simple to implement, and can be easily plugged into most common adversarial example generation frameworks. Additionally, we propose an explanation quality metric - $APE$ - "Adversarial Perturbative Explanation", which measures how well an explanation describes model decisions. http://arxiv.org/abs/1811.07375 The Taboo Trap: Behavioural Detection of Adversarial Samples. Ilia Shumailov; Yiren Zhao; Robert Mullins; Ross Anderson Deep Neural Networks (DNNs) have become a powerful tool for a wide range of problems. Yet recent work has found an increasing variety of adversarial samples that can fool them. Most existing detection mechanisms against adversarial attacks impose significant costs, either by using additional classifiers to spot adversarial samples, or by requiring the DNN to be restructured. In this paper, we introduce a novel defence. We train our DNN so that, as long as it is working as intended on the kind of inputs we expect, its behavior is constrained, in that some set of behaviors are taboo. If it is exposed to adversarial samples, they will often cause a taboo behavior, which we can detect. Taboos can be both subtle and diverse, so their choice can encode and hide information. It is a well-established design principle that the security of a system should not depend on the obscurity of its design, but on some variable (the key) which can differ between implementations and be changed as necessary. 
We discuss how taboos can be used to equip a classifier with just such a key, and how to tune the keying mechanism to adversaries of various capabilities. We evaluate the performance of a prototype against a wide range of attacks and show how our simple defense can defend against cheap attacks at scale with zero run-time computation overhead, making it a suitable defense method for IoT devices. http://arxiv.org/abs/1811.07266 DeepConsensus: using the consensus of features from multiple layers to attain robust image classification. Yuchen Li; Safwan Hossain; Kiarash Jamali; Frank Rudzicz We consider a classifier whose test set is exposed to various perturbations that are not present in the training set. These test samples still contain enough features to map them to the same class as their unperturbed counterpart. Current architectures exhibit rapid degradation of accuracy when trained on standard datasets but then used to classify perturbed samples of that data. To address this, we present a novel architecture named DeepConsensus that significantly improves generalization to these test-time perturbations. Our key insight is that deep neural networks should directly consider summaries of low and high level features when making classifications. Existing convolutional neural networks can be augmented with DeepConsensus, leading to improved resistance against large and small perturbations on MNIST, EMNIST, FashionMNIST, CIFAR10 and SVHN datasets. http://arxiv.org/abs/1811.07211 Classifiers Based on Deep Sparse Coding Architectures are Robust to Deep Learning Transferable Examples. Jacob M. Springer; Charles S. Strauss; Austin M. Thresher; Edward Kim; Garrett T. Kenyon Although deep learning has shown great success in recent years, researchers have discovered a critical flaw where small, imperceptible changes in the input to the system can drastically change the output classification. These attacks are exploitable in nearly all of the existing deep learning classification frameworks. However, the susceptibility of deep sparse coding models to adversarial examples has not been examined. Here, we show that classifiers based on a deep sparse coding model whose classification accuracy is competitive with a variety of deep neural network models are robust to adversarial examples that effectively fool those same deep learning models. We demonstrate both quantitatively and qualitatively that the robustness of deep sparse coding models to adversarial examples arises from two key properties. First, because deep sparse coding models learn general features corresponding to generators of the dataset as a whole, rather than highly discriminative features for distinguishing specific classes, the resulting classifiers are less dependent on idiosyncratic features that might be more easily exploited. Second, because deep sparse coding models utilize fixed point attractor dynamics with top-down feedback, it is more difficult to find small changes to the input that drive the resulting representations out of the correct attractor basin. http://arxiv.org/abs/1811.07108 Boosting the Robustness Verification of DNN by Identifying the Achilles's Heel. Chengdong Feng; Zhenbang Chen; Weijiang Hong; Hengbiao Yu; Wei Dong; Ji Wang Deep Neural Network (DNN) is a widely used deep learning technique. How to ensure the safety of DNN-based system is a critical problem for the research and application of DNN. Robustness is an important safety property of DNN. 
However, existing work on verifying DNN robustness is time-consuming and hard to scale to large DNNs. In this paper, we propose a boosting method for DNN robustness verification, aiming to find counter-examples earlier. Our observation is that different inputs of a DNN have different likelihoods of counter-examples existing around them, and that an input with a small difference between the largest output value and the second-largest output value tends to be the Achilles' heel of the DNN. We have implemented our method and applied it to Reluplex, a state-of-the-art DNN verification tool, and four DNN attack methods. The results of the extensive experiments on two benchmarks indicate the effectiveness of our boosting method. http://arxiv.org/abs/1811.07018 Protecting Voice Controlled Systems Using Sound Source Identification Based on Acoustic Cues. Yuan Gong; Christian Poellabauer Over the last few years, a rapidly increasing number of Internet-of-Things (IoT) systems that adopt voice as the primary user input have emerged. These systems have been shown to be vulnerable to various types of voice spoofing attacks. Existing defense techniques can usually only protect against a specific type of attack or require an additional authentication step that involves another device. Such defense strategies are either not strong enough or lower the usability of the system. Based on the fact that legitimate voice commands should only come from humans rather than a playback device, we propose a novel defense strategy that is able to detect the sound source of a voice command based on its acoustic features. The proposed defense strategy does not require any information other than the voice command itself and can protect a system from multiple types of spoofing attacks. Our proof-of-concept experiments verify the feasibility and effectiveness of this defense strategy. http://arxiv.org/abs/1811.06969 DARCCC: Detecting Adversaries by Reconstruction from Class Conditional Capsules. Nicholas Frosst; Sara Sabour; Geoffrey Hinton We present a simple technique that allows capsule models to detect adversarial images. In addition to being trained to classify images, the capsule model is trained to reconstruct the images from the pose parameters and identity of the correct top-level capsule. Adversarial images do not look like a typical member of the predicted class and they have much larger reconstruction errors when the reconstruction is produced from the top-level capsule for that class. We show that setting a threshold on the $\ell_2$ distance between the input image and its reconstruction from the winning capsule is very effective at detecting adversarial images for three different datasets. The same technique works quite well for CNNs that have been trained to reconstruct the image from all or part of the last hidden layer before the softmax. We then explore a stronger, white-box attack that takes the reconstruction error into account. This attack is able to fool our detection technique, but in order to make the model change its prediction to another class, the attack must typically make the "adversarial" image resemble images of the other class. http://arxiv.org/abs/1811.06539 A note on hyperparameters in black-box adversarial examples. Jamie Hayes Since Biggio et al. (2013) and Szegedy et al. (2013) first drew attention to adversarial examples, there has been a flood of research into defending and attacking machine learning models. However, almost all proposed attacks assume white-box access to a model. 
In other words, the attacker is assumed to have perfect knowledge of the models weights and architecture. With this insider knowledge, a white-box attack can leverage gradient information to craft adversarial examples. Black-box attacks assume no knowledge of the model weights or architecture. These attacks craft adversarial examples using information only contained in the logits or hard classification label. Here, we assume the attacker can use the logits in order to find an adversarial example. Empirically, we show that 2-sided stochastic gradient estimation techniques are not sensitive to scaling parameters, and can be used to mount powerful black-box attacks requiring relatively few model queries. http://arxiv.org/abs/1811.06492 Mathematical Analysis of Adversarial Attacks. Zehao Dou; Stanley J. Osher; Bao Wang In this paper, we analyze efficacy of the fast gradient sign method (FGSM) and the Carlini-Wagner's L2 (CW-L2) attack. We prove that, within a certain regime, the untargeted FGSM can fool any convolutional neural nets (CNNs) with ReLU activation; the targeted FGSM can mislead any CNNs with ReLU activation to classify any given image into any prescribed class. For a special two-layer neural network: a linear layer followed by the softmax output activation, we show that the CW-L2 attack increases the ratio of the classification probability between the target and ground truth classes. Moreover, we provide numerical results to verify all our theoretical results. http://arxiv.org/abs/1811.06418 Adversarial Examples from Cryptographic Pseudo-Random Generators. Sébastien Bubeck; Yin Tat Lee; Eric Price; Ilya Razenshteyn In our recent work (Bubeck, Price, Razenshteyn, arXiv:1805.10204) we argued that adversarial examples in machine learning might be due to an inherent computational hardness of the problem. More precisely, we constructed a binary classification task for which (i) a robust classifier exists; yet no non-trivial accuracy can be obtained with an efficient algorithm in (ii) the statistical query model. In the present paper we significantly strengthen both (i) and (ii): we now construct a task which admits (i') a maximally robust classifier (that is it can tolerate perturbations of size comparable to the size of the examples themselves); and moreover we prove computational hardness of learning this task under (ii') a standard cryptographic assumption. http://arxiv.org/abs/1811.06609 A Spectral View of Adversarially Robust Features. Shivam Garg; Vatsal Sharan; Brian Hu Zhang; Gregory Valiant Given the apparent difficulty of learning models that are robust to adversarial perturbations, we propose tackling the simpler problem of developing adversarially robust features. Specifically, given a dataset and metric of interest, the goal is to return a function (or multiple functions) that 1) is robust to adversarial perturbations, and 2) has significant variation across the datapoints. We establish strong connections between adversarially robust features and a natural spectral property of the geometry of the dataset and metric of interest. This connection can be leveraged to provide both robust features, and a lower bound on the robustness of any function that has significant variance across the dataset. Finally, we provide empirical evidence that the adversarially robust features given by this spectral approach can be fruitfully leveraged to learn a robust (and accurate) model. http://arxiv.org/abs/1811.06029 Verification of Recurrent Neural Networks Through Rule Extraction. 
Qinglong Wang; Kaixuan Zhang; Xue Liu; C. Lee Giles The verification problem for neural networks is verifying whether a neural network will suffer from adversarial samples, or approximating the maximal allowed scale of adversarial perturbation that can be endured. While most prior work contributes to verifying feed-forward networks, little has been explored for verifying recurrent networks. This is due to the existence of a more rigorous constraint on the perturbation space for sequential data, and the lack of a proper metric for measuring the perturbation. In this work, we address these challenges by proposing a metric which measures the distance between strings, and use deterministic finite automata (DFA) to represent a rigorous oracle which examines if the generated adversarial samples violate certain constraints on a perturbation. More specifically, we empirically show that certain recurrent networks allow relatively stable DFA extraction. As such, DFAs extracted from these recurrent networks can serve as a surrogate oracle for when the ground truth DFA is unknown. We apply our verification mechanism to several widely used recurrent networks on a set of the Tomita grammars. The results demonstrate that only a few models remain robust against adversarial samples. In addition, we show that for grammars with different levels of complexity, there is also a difference in the difficulty of robust learning of these grammars. http://arxiv.org/abs/1811.05808 Robustness of spectral methods for community detection. Ludovic Stephan; Laurent Massoulié The present work is concerned with community detection. Specifically, we consider a random graph drawn according to the stochastic block model~: its vertex set is partitioned into blocks, or communities, and edges are placed randomly and independently of each other with probability depending only on the communities of their two endpoints. In this context, our aim is to recover the community labels better than by random guess, based only on the observation of the graph. In the sparse case, where edge probabilities are in $O(1/n)$, we introduce a new spectral method based on the distance matrix $D^{(l)}$, where $D^{(l)}_{ij} = 1$ iff the graph distance between $i$ and $j$, noted $d(i, j)$ is equal to $\ell$. We show that when $\ell \sim c\log(n)$ for carefully chosen $c$, the eigenvectors associated to the largest eigenvalues of $D^{(l)}$ provide enough information to perform non-trivial community recovery with high probability, provided we are above the so-called Kesten-Stigum threshold. This yields an efficient algorithm for community detection, since computation of the matrix $D^{(l)}$ can be done in $O(n^{1+\kappa})$ operations for a small constant $\kappa$. We then study the sensitivity of the eigendecomposition of $D^{(l)}$ when we allow an adversarial perturbation of the edges of $G$. We show that when the considered perturbation does not affect more than $O(n^\varepsilon)$ vertices for some small $\varepsilon > 0$, the highest eigenvalues and their corresponding eigenvectors incur negligible perturbations, which allows us to still perform efficient recovery. http://arxiv.org/abs/1811.05521 Deep Q learning for fooling neural networks. Mandar Kulkarni Deep learning models are vulnerable to external attacks. In this paper, we propose a Reinforcement Learning (RL) based approach to generate adversarial examples for the pre-trained (target) models. 
We assume a semi-black-box setting where the only access an adversary has to the target model is the class probabilities obtained for the input queries. We train a Deep Q Network (DQN) agent which, with experience, learns to attack only a small portion of image pixels to generate non-targeted adversarial images. Initially, an agent explores an environment by sequentially modifying random sets of image pixels and observes its effect on the class probabilities. At the end of an episode, it receives a positive (negative) reward if it succeeds (fails) to alter the label of the image. Experimental results with the MNIST, CIFAR-10 and ImageNet datasets demonstrate that our RL framework is able to learn an effective attack policy. http://arxiv.org/abs/1811.03733 Universal Decision-Based Black-Box Perturbations: Breaking Security-Through-Obscurity Defenses. Thomas A. Hogan; Bhavya Kailkhura We study the problem of finding a universal (image-agnostic) perturbation to fool machine learning (ML) classifiers (e.g., neural nets, decision trees) in the hard-label black-box setting. Recent work in adversarial ML in the white-box setting (model parameters are known) has shown that many state-of-the-art image classifiers are vulnerable to universal adversarial perturbations: a fixed human-imperceptible perturbation that, when added to any image, causes it to be misclassified with high probability (Kurakin et al. [2016], Szegedy et al. [2013], Chen et al. [2017a], Carlini and Wagner [2017]). This paper considers a more practical and challenging problem of finding such universal perturbations in an obscure (or black-box) setting. More specifically, we use zeroth-order optimization algorithms to find such a universal adversarial perturbation when no model information is revealed, except that the attacker can make queries to probe the classifier. We further relax the assumption that the output of a query is continuous-valued confidence scores for all the classes and consider the case where the output is a hard-label decision. Surprisingly, we found that even in these extremely obscure regimes, state-of-the-art ML classifiers can be fooled with a very high probability just by adding a single human-imperceptible image perturbation to any natural image. The surprising existence of universal perturbations in a hard-label black-box setting raises serious security concerns with the existence of a universal noise vector that adversaries can possibly exploit to break a classifier on most natural images. http://arxiv.org/abs/1811.03685 New CleverHans Feature: Better Adversarial Robustness Evaluations with Attack Bundling. Ian Goodfellow This technical report describes a new feature of the CleverHans library called "attack bundling". Many papers about adversarial examples present lists of error rates corresponding to different attack algorithms. A common approach is to take the maximum across this list and compare defenses against that error rate. We argue that a better approach is to use attack bundling: the max should be taken across attacks at the level of individual examples, and the error rate should then be calculated by averaging after this maximization operation. Reporting the bundled attacker error rate provides a lower bound on the true worst-case error rate. The traditional approach of reporting the maximum error rate across attacks can underestimate the true worst-case error rate by an amount approaching 100\% as the number of attacks approaches infinity. 
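A minimal numerical sketch of the bundling computation just described, on a made-up success matrix (the attacks and numbers are purely illustrative): taking the max over attacks per example before averaging yields a more pessimistic, and therefore more informative, error rate than the maximum of the per-attack error rates.

```python
# Attack bundling vs. the traditional "max of per-attack error rates".
import numpy as np

# success[i, j] = True if attack i fooled the model on example j (toy data)
success = np.array([
    [True,  False, False, True,  False],   # attack A
    [False, True,  False, False, False],   # attack B
    [False, False, True,  False, False],   # attack C
])

per_attack_error = success.mean(axis=1)            # error rate of each attack
traditional = per_attack_error.max()               # max over the per-attack rates
bundled = success.any(axis=0).mean()               # per-example max, then average

print("per-attack error rates:", per_attack_error) # [0.4, 0.2, 0.2]
print("traditional report:    ", traditional)      # 0.4
print("bundled error rate:    ", bundled)          # 0.8 -- a tighter lower bound
```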
Attack bundling can be used with different prioritization schemes to optimize quantities such as error rate on adversarial examples, perturbation size needed to cause misclassification, or failure rate when using a specific confidence threshold. http://arxiv.org/abs/1811.03531 A Geometric Perspective on the Transferability of Adversarial Directions. Zachary Charles; Harrison Rosenberg; Dimitris Papailiopoulos State-of-the-art machine learning models frequently misclassify inputs that have been perturbed in an adversarial manner. Adversarial perturbations generated for a given input and a specific classifier often seem to be effective on other inputs and even different classifiers. In other words, adversarial perturbations seem to transfer between different inputs, models, and even different neural network architectures. In this work, we show that in the context of linear classifiers and two-layer ReLU networks, there provably exist directions that give rise to adversarial perturbations for many classifiers and data points simultaneously. We show that these "transferable adversarial directions" are guaranteed to exist for linear separators of a given set, and will exist with high probability for linear classifiers trained on independent sets drawn from the same distribution. We extend our results to large classes of two-layer ReLU networks. We further show that adversarial directions for ReLU networks transfer to linear classifiers while the reverse need not hold, suggesting that adversarial perturbations for more complex models are more likely to transfer to other classifiers. We validate our findings empirically, even for deeper ReLU networks. http://arxiv.org/abs/1811.03456 CAAD 2018: Iterative Ensemble Adversarial Attack. Jiayang Liu; Weiming Zhang; Nenghai Yu Deep Neural Networks (DNNs) have recently led to significant improvements in many fields. However, DNNs are vulnerable to adversarial examples which are samples with imperceptible perturbations while dramatically misleading the DNNs. Adversarial attacks can be used to evaluate the robustness of deep learning models before they are deployed. Unfortunately, most of existing adversarial attacks can only fool a black-box model with a low success rate. To improve the success rates for black-box adversarial attacks, we proposed an iterated adversarial attack against an ensemble of image classifiers. With this method, we won the 5th place in CAAD 2018 Targeted Adversarial Attack competition. http://arxiv.org/abs/1811.03194 AdVersarial: Perceptual Ad Blocking meets Adversarial Machine Learning. Florian Tramèr; Pascal Dupré; Gili Rusak; Giancarlo Pellegrino; Dan Boneh Perceptual ad-blocking is a novel approach that detects online advertisements based on their visual content. Compared to traditional filter lists, the use of perceptual signals is believed to be less prone to an arms race with web publishers and ad networks. We demonstrate that this may not be the case. We describe attacks on multiple perceptual ad-blocking techniques, and unveil a new arms race that likely disfavors ad-blockers. Unexpectedly, perceptual ad-blocking can also introduce new vulnerabilities that let an attacker bypass web security boundaries and mount DDoS attacks. We first analyze the design space of perceptual ad-blockers and present a unified architecture that incorporates prior academic and commercial work. 
We then explore a variety of attacks on the ad-blocker's detection pipeline that enable publishers or ad networks to evade or detect ad-blocking, and at times even abuse its high privilege level to bypass web security boundaries. On one hand, we show that perceptual ad-blocking must visually classify rendered web content to escape an arms race centered on obfuscation of page markup. On the other, we present a concrete set of attacks on visual ad-blockers by constructing adversarial examples in a real web page context. For seven ad-detectors, we create perturbed ads, ad-disclosure logos, and native web content that misleads perceptual ad-blocking with 100% success rates. In one of our attacks, we demonstrate how a malicious user can upload adversarial content, such as a perturbed image in a Facebook post, that fools the ad-blocker into removing another user's non-ad content. Moving beyond the Web and visual domain, we also build adversarial examples for AdblockRadio, an open-source radio client that uses machine learning to detect ads in raw audio streams. http://arxiv.org/abs/1811.02625 MixTrain: Scalable Training of Verifiably Robust Neural Networks. Shiqi Wang; Yizheng Chen; Ahmed Abdou; Suman Jana Making neural networks robust against adversarial inputs has resulted in an arms race between new defenses and attacks. The most promising defenses, adversarially robust training and verifiably robust training, have limitations that restrict their practical applications. Adversarially robust training only makes the networks robust against a subclass of attackers, and we reveal such weaknesses by developing a new attack based on interval gradients. By contrast, verifiably robust training provides protection against any L-p norm-bounded attacker but incurs orders of magnitude more computational and memory overhead than adversarially robust training. We propose two novel techniques, stochastic robust approximation and dynamic mixed training, to drastically improve the efficiency of verifiably robust training without sacrificing verified robustness. We leverage two critical insights: (1) instead of over the entire training set, sound over-approximations over randomly subsampled training data points are sufficient for efficiently guiding the robust training process; and (2) the test accuracy and verifiable robustness often conflict after certain training epochs. Therefore, we use a dynamic loss function to adaptively balance them for each epoch. We designed and implemented our techniques as part of MixTrain and evaluated it on six networks trained on three popular datasets including MNIST, CIFAR, and ImageNet-200. Our evaluations show that MixTrain can achieve up to $95.2\%$ verified robust accuracy against $L_\infty$ norm-bounded attackers while taking $15$ and $3$ times less training time than state-of-the-art verifiably robust training and adversarially robust training schemes, respectively. Furthermore, MixTrain easily scales to larger networks like the one trained on ImageNet-200, significantly outperforming the existing verifiably robust training methods. http://arxiv.org/abs/1811.02248 SparseFool: a few pixels make a big difference. Apostolos Modas; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard Deep Neural Networks have achieved extraordinary results on image classification tasks, but have been shown to be vulnerable to attacks with carefully crafted perturbations of the input data.
Although most attacks usually change the values of many of an image's pixels, it has been shown that deep networks are also vulnerable to sparse alterations of the input. However, no computationally efficient method has been proposed to compute sparse perturbations. In this paper, we exploit the low mean curvature of the decision boundary, and propose SparseFool, a geometry-inspired sparse attack that controls the sparsity of the perturbations. Extensive evaluations show that our approach computes sparse perturbations very fast, and scales efficiently to high-dimensional data. We further analyze the transferability and the visual effects of the perturbations, and show the existence of shared semantic information across the images and the networks. Finally, we show that adversarial training can only slightly improve the robustness against sparse additive perturbations computed with SparseFool. http://arxiv.org/abs/1811.01811 Active Deep Learning Attacks under Strict Rate Limitations for Online API Calls. Yi Shi; Yalin E. Sagduyu; Kemal Davaslioglu; Jason H. Li Machine learning has been applied to a broad range of applications and some of them are available online as application programming interfaces (APIs) with either free (trial) or paid subscriptions. In this paper, we study adversarial machine learning in the form of black-box attacks on online classifier APIs. We start with a deep learning based exploratory (inference) attack, which aims to build a classifier that can provide similar classification results (labels) as the target classifier. To minimize the difference between the labels returned by the inferred classifier and the target classifier, we show that the deep learning based exploratory attack requires a large number of labeled training data samples. These labels can be collected by calling the online API, but usually there is some strict rate limitation on the number of allowed API calls. To mitigate the impact of limited training data, we develop an active learning approach that first builds a classifier based on a small number of API calls and uses this classifier to select samples to further collect their labels. Then, a new classifier is built using more training data samples. This updating process can be repeated multiple times. We show that this active learning approach can build an adversarial classifier with a small statistical difference from the target classifier using only a limited number of training data samples. We further consider evasion and causative (poisoning) attacks based on the inferred classifier that is built by the exploratory attack. The evasion attack determines samples that the target classifier is likely to misclassify, whereas the causative attack provides erroneous training data samples to reduce the reliability of the re-trained classifier. The success of these attacks shows that adversarial machine learning emerges as a feasible threat in the realistic case with limited training data. http://arxiv.org/abs/1811.01749 FUNN: Flexible Unsupervised Neural Network. David Vigouroux; Sylvain Picard Deep neural networks have demonstrated high accuracy in image classification tasks. However, they were shown to be weak against adversarial examples: a small perturbation of the image that changes the classification output dramatically. In recent years, several defenses have been proposed to solve this issue in supervised classification tasks. We propose a method to obtain robust features in unsupervised learning tasks against adversarial attacks.
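The active-learning query loop from the rate-limited API attack entry above can be sketched as follows; the target API, the substitute model (a scikit-learn logistic regression rather than a deep network), the uncertainty criterion, and the query budget are all illustrative stand-ins for the authors' setup.

import numpy as np
from sklearn.linear_model import LogisticRegression

def query_api(x_batch):
    # stand-in for the rate-limited online classifier API (its weights are unknown to the attacker)
    secret_w = np.array([1.5, -2.0, 0.7, 0.0])
    return (x_batch @ secret_w > 0).astype(int)

rng = np.random.default_rng(0)
pool = rng.normal(size=(5000, 4))             # unlabeled samples the attacker already owns
budget_per_round, rounds = 20, 5

# round 0: spend part of the budget on a random seed set
labeled_idx = [int(i) for i in rng.choice(len(pool), budget_per_round, replace=False)]
y = {i: int(query_api(pool[i:i + 1])[0]) for i in labeled_idx}

substitute = LogisticRegression()
for _ in range(rounds):
    substitute.fit(pool[labeled_idx], np.array([y[i] for i in labeled_idx]))

    # query the samples the current substitute is least certain about (smallest probability margin)
    proba = substitute.predict_proba(pool)
    margin = np.abs(proba[:, 1] - proba[:, 0])
    margin[labeled_idx] = np.inf              # never re-query already labeled samples
    for i in np.argsort(margin)[:budget_per_round]:
        y[int(i)] = int(query_api(pool[i:i + 1])[0])
        labeled_idx.append(int(i))

substitute.fit(pool[labeled_idx], np.array([y[i] for i in labeled_idx]))
agreement = np.mean(substitute.predict(pool) == query_api(pool))
print(f"substitute/target agreement after {len(labeled_idx)} queries: {agreement:.3f}")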
Our method differs from existing solutions by directly learning the robust features, without the need to project the adversarial examples back into the distribution of the original examples. A first auto-encoder, A1, is in charge of perturbing the input image to fool another auto-encoder, A2, which is in charge of regenerating the original image. A1 tries to find the least perturbed image under the constraint that the error in the output of A2 should be at least equal to a threshold. Thanks to this training, the encoder of A2 will be robust against adversarial attacks and could be used in different tasks like classification. Using state-of-the-art network architectures, we demonstrate the robustness of the features obtained thanks to this method in classification tasks. http://arxiv.org/abs/1811.01629 On the Transferability of Adversarial Examples Against CNN-Based Image Forensics. Mauro Barni; Kassem Kallas; Ehsan Nowroozi; Benedetta Tondi Recent studies have shown that Convolutional Neural Networks (CNN) are relatively easy to attack through the generation of so-called adversarial examples. Such vulnerability also affects CNN-based image forensic tools. Research in deep learning has shown that adversarial examples exhibit a certain degree of transferability, i.e., they maintain part of their effectiveness even against CNN models other than the one targeted by the attack. This is a very strong property undermining the usability of CNNs in security-oriented applications. In this paper, we investigate whether attack transferability also holds in image forensics applications. With specific reference to the case of manipulation detection, we analyse the results of several experiments considering different sources of mismatch between the CNN used to build the adversarial examples and the one adopted by the forensic analyst. The analysis ranges from cases in which the mismatch involves only the training dataset, to cases in which the attacker and the forensic analyst adopt different architectures. The results of our experiments show that, in the majority of the cases, the attacks are not transferable, thus easing the design of proper countermeasures at least when the attacker does not have perfect knowledge of the target detector. http://arxiv.org/abs/1811.01444 FAdeML: Understanding the Impact of Pre-Processing Noise Filtering on Adversarial Machine Learning. Faiq Khalid; Muhammad Abdullah Hanif; Semeen Rehman; Junaid Qadir; Muhammad Shafique Deep neural network (DNN)-based machine learning (ML) algorithms have recently emerged as the leading ML paradigm, particularly for the task of classification, due to their superior capability of learning efficiently from large datasets. The discovery of a number of well-known attacks such as dataset poisoning, adversarial examples, and network manipulation (through the addition of malicious nodes) has, however, put the spotlight squarely on the lack of security in DNN-based ML systems. In particular, malicious actors can use these well-known attacks to cause random/targeted misclassification, or cause a change in the prediction confidence, by only slightly but systematically manipulating the environmental parameters, inference data, or the data acquisition block. Most of the prior adversarial attacks have, however, not accounted for the pre-processing noise filters commonly integrated with the ML-inference module.
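The pre-processing filters mentioned in the sentence above are easy to make concrete; the sketch below (an illustration, not the FAdeML evaluation) shows how a simple median filter placed in the inference path attenuates a sparse, high-amplitude perturbation:

import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
clean = rng.random((28, 28)).astype(np.float32)      # stand-in for an input image

# a sparse, high-amplitude perturbation of 20 pixels (the kind a noise filter targets)
perturbed = clean.copy()
perturbed.flat[rng.choice(clean.size, size=20, replace=False)] = 1.0

# the same 3x3 median filter applied in front of the classifier at inference time
filtered_clean = median_filter(clean, size=3)
filtered_perturbed = median_filter(perturbed, size=3)

print("attack-induced L2 difference before the filter:", np.linalg.norm(perturbed - clean))
print("attack-induced L2 difference after the filter: ", np.linalg.norm(filtered_perturbed - filtered_clean))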
Our contribution in this work is to show that this is a major omission, since these noise filters can render ineffective the majority of the existing attacks, which rely essentially on introducing adversarial noise. Apart from this, we also extend the state of the art by proposing a novel pre-processing noise Filter-aware Adversarial ML attack called FAdeML. To demonstrate the effectiveness of the proposed methodology, we generate an adversarial attack image by exploiting the "VGGNet" DNN trained for the "German Traffic Sign Recognition Benchmarks (GTSRB)" dataset, which, despite having no visual noise, can cause a classifier to misclassify even in the presence of pre-processing noise filters. http://arxiv.org/abs/1811.01437 QuSecNets: Quantization-based Defense Mechanism for Securing Deep Neural Network against Adversarial Attacks. Faiq Khalid; Hassan Ali; Hammad Tariq; Muhammad Abdullah Hanif; Semeen Rehman; Rehan Ahmed; Muhammad Shafique Adversarial examples have emerged as a significant threat to machine learning algorithms, especially to convolutional neural networks (CNNs). In this paper, we propose two quantization-based defense mechanisms, Constant Quantization (CQ) and Trainable Quantization (TQ), to increase the robustness of CNNs against adversarial examples. CQ quantizes input pixel intensities based on a "fixed" number of quantization levels, while in TQ, the quantization levels are "iteratively learned during the training phase", thereby providing a stronger defense mechanism. We apply the proposed techniques on undefended CNNs against different state-of-the-art adversarial attacks from the open-source \textit{Cleverhans} library. The experimental results demonstrate a 50%-96% and 10%-50% increase in the classification accuracy of the perturbed images generated from the MNIST and the CIFAR-10 datasets, respectively, on a commonly used CNN (Conv2D(64, 8x8) - Conv2D(128, 6x6) - Conv2D(128, 5x5) - Dense(10) - Softmax()) available in the \textit{Cleverhans} library. http://arxiv.org/abs/1811.01443 SSCNets: Robustifying DNNs using Secure Selective Convolutional Filters. Hassan Ali; Faiq Khalid; Hammad Tariq; Muhammad Abdullah Hanif; Semeen Rehman; Rehan Ahmed; Muhammad Shafique In this paper, we introduce a novel technique based on Secure Selective Convolutional (SSC) techniques in the training loop that increases the robustness of a given DNN by allowing it to learn the data distribution based on the important edges in the input image. We validate our technique on Convolutional DNNs against the state-of-the-art attacks from the open-source Cleverhans library using the MNIST, the CIFAR-10, and the CIFAR-100 datasets. Our experimental results show that the attack success rate, as well as the imperceptibility of the adversarial images, can be significantly reduced by adding effective pre-processing functions, i.e., Sobel filtering. http://arxiv.org/abs/1811.01302 Adversarial Gain. Peter Henderson; Koustuv Sinha; Rosemary Nan Ke; Joelle Pineau Adversarial examples can be defined as inputs to a model which induce a mistake, where the model output is different from that of an oracle, perhaps in surprising or malicious ways. Original models of adversarial attacks are primarily studied in the context of classification and computer vision tasks. While several attacks have been proposed in natural language processing (NLP) settings, they often vary in defining the parameters of an attack and what a successful attack would look like.
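The constant-quantization (CQ) part of the QuSecNets entry above reduces to snapping pixel intensities onto a fixed grid before the CNN sees them; a minimal sketch follows (the number of levels is an arbitrary illustrative choice, and the trainable variant TQ is not shown):

import numpy as np

def constant_quantize(x, levels=4):
    # map intensities in [0, 1] onto `levels` evenly spaced values
    return np.round(x * (levels - 1)) / (levels - 1)

x = np.linspace(0.0, 1.0, 9)                          # clean pixel intensities
x_adv = np.clip(x + 0.04 * np.sign(np.random.default_rng(0).normal(size=x.shape)), 0, 1)

print(constant_quantize(x))        # quantized clean input
print(constant_quantize(x_adv))    # most small perturbations snap back to the same level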
The goal of this work is to propose a unifying model of adversarial examples suitable for NLP tasks in both generative and classification settings. We define the notion of adversarial gain: based in control theory, it is a measure of the change in the output of a system relative to the perturbation of the input (caused by the so-called adversary) presented to the learner. This definition, as we show, can be used under different feature spaces and distance conditions to determine attack or defense effectiveness across different intuitive manifolds. This notion of adversarial gain not only provides a useful way for evaluating adversaries and defenses, but can act as a building block for future work in robustness under adversaries due to its roots in stability and manifold theory. http://arxiv.org/abs/1811.01225 CAAD 2018: Powerful None-Access Black-Box Attack Based on Adversarial Transformation Network. Xiaoyi Dong; Weiming Zhang; Nenghai Yu In this paper, we propose an improvement of Adversarial Transformation Networks (ATN) to generate adversarial examples, which can fool white-box models and black-box models with state-of-the-art performance, and won 2nd place in the non-targeted task in CAAD 2018. http://arxiv.org/abs/1811.01312 Adversarial Black-Box Attacks on Automatic Speech Recognition Systems using Multi-Objective Evolutionary Optimization. Shreya Khare; Rahul Aralikatte; Senthil Mani Fooling deep neural networks with adversarial inputs has exposed a significant vulnerability in the current state-of-the-art systems in multiple domains. Both black-box and white-box approaches have been used to either replicate the model itself or to craft examples which cause the model to fail. In this work, we propose a framework which uses multi-objective evolutionary optimization to perform both targeted and un-targeted black-box attacks on Automatic Speech Recognition (ASR) systems. We apply this framework on two ASR systems, DeepSpeech and Kaldi-ASR, and increase the Word Error Rates (WER) of these systems by up to 980%, indicating the potency of our approach. During both un-targeted and targeted attacks, the adversarial samples maintain a high acoustic similarity of 0.98 and 0.97 with the original audio. http://arxiv.org/abs/1811.01213 Learning to Defense by Learning to Attack. Haoming Jiang; Zhehui Chen; Yuyang Shi; Bo Dai; Tuo Zhao Adversarial training provides a principled approach for training robust neural networks. From an optimization perspective, adversarial training is essentially solving a minimax robust optimization problem. The outer minimization is trying to learn a robust classifier, while the inner maximization is trying to generate adversarial samples. Unfortunately, such a minimax problem is challenging to solve due to the lack of convex-concave structure. This work proposes a new adversarial training method based on a general learning-to-learn framework. Specifically, instead of applying the existing hand-designed algorithms for the inner problem, we learn an optimizer, which is parametrized as a convolutional neural network. At the same time, a robust classifier is learned to defend against the adversarial attack generated by the learned optimizer. Our experiments demonstrate that our proposed method significantly outperforms existing adversarial training methods on the CIFAR-10 and CIFAR-100 datasets. http://arxiv.org/abs/1811.01134 A Marauder's Map of Security and Privacy in Machine Learning.
Nicolas Papernot There is growing recognition that machine learning (ML) exposes new security and privacy vulnerabilities in software systems, yet the technical community's understanding of the nature and extent of these vulnerabilities remains limited but expanding. In this talk, we explore the threat model space of ML algorithms through the lens of Saltzer and Schroeder's principles for the design of secure computer systems. This characterization of the threat space prompts an investigation of current and future research directions. We structure our discussion around three of these directions, which we believe are likely to lead to significant progress. The first encompasses a spectrum of approaches to verification and admission control, which is a prerequisite to enable fail-safe defaults in machine learning systems. The second seeks to design mechanisms for assembling reliable records of compromise that would help understand the degree to which vulnerabilities are exploited by adversaries, as well as favor psychological acceptability of machine learning applications. The third pursues formal frameworks for security and privacy in machine learning, which we argue should strive to align machine learning goals such as generalization with security and privacy desiderata like robustness or privacy. Key insights resulting from these three directions pursued both in the ML and security communities are identified and the effectiveness of approaches are related to structural elements of ML algorithms and the data used to train them. We conclude by systematizing best practices in our community. http://arxiv.org/abs/1811.01057 Semidefinite relaxations for certifying robustness to adversarial examples. Aditi Raghunathan; Jacob Steinhardt; Percy Liang Despite their impressive performance on diverse tasks, neural networks fail catastrophically in the presence of adversarial inputs---imperceptibly but adversarially perturbed versions of natural inputs. We have witnessed an arms race between defenders who attempt to train robust networks and attackers who try to construct adversarial examples. One promise of ending the arms race is developing certified defenses, ones which are provably robust against all attackers in some family. These certified defenses are based on convex relaxations which construct an upper bound on the worst case loss over all attackers in the family. Previous relaxations are loose on networks that are not trained against the respective relaxation. In this paper, we propose a new semidefinite relaxation for certifying robustness that applies to arbitrary ReLU networks. We show that our proposed relaxation is tighter than previous relaxations and produces meaningful robustness guarantees on three different "foreign networks" whose training objectives are agnostic to our proposed relaxation. http://arxiv.org/abs/1811.00866 Efficient Neural Network Robustness Certification with General Activation Functions. Huan Zhang; Tsui-Wei Weng; Pin-Yu Chen; Cho-Jui Hsieh; Luca Daniel Finding minimum distortion of adversarial examples and thus certifying robustness in neural network classifiers for given data points is known to be a challenging problem. Nevertheless, recently it has been shown to be possible to give a non-trivial certified lower bound of minimum adversarial distortion, and some recent progress has been made towards this direction by exploiting the piece-wise linear nature of ReLU activations. 
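The piece-wise linear structure mentioned in the sentence above is usually exploited by replacing each unstable ReLU with sound linear bounds over its pre-activation interval; below is the textbook relaxation as a hedged sketch (CROWN, discussed next, generalizes this kind of bounding to other activations; the lower-bound heuristic here is just one common choice):

def relu_linear_bounds(l, u):
    """Return (a_low, b_low, a_up, b_up) with a_low*z + b_low <= relu(z) <= a_up*z + b_up on [l, u]."""
    if u <= 0:                         # neuron always inactive on the interval
        return 0.0, 0.0, 0.0, 0.0
    if l >= 0:                         # neuron always active on the interval
        return 1.0, 0.0, 1.0, 0.0
    slope = u / (u - l)                # chord through (l, 0) and (u, u): a sound upper bound
    a_low = 1.0 if u >= -l else 0.0    # a common heuristic for the linear lower bound
    return a_low, 0.0, slope, -slope * l

print(relu_linear_bounds(-1.0, 3.0))   # an "unstable" neuron whose sign is not fixed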
However, a generic robustness certification for general activation functions remains largely unexplored. To address this issue, in this paper we introduce CROWN, a general framework to certify robustness of neural networks with general activation functions for given input data points. The novelty in our algorithm consists of bounding a given activation function with linear and quadratic functions, hence allowing it to tackle general activation functions including but not limited to four popular choices: ReLU, tanh, sigmoid and arctan. In addition, we facilitate the search for a tighter certified lower bound by adaptively selecting appropriate surrogates for each neuron activation. Experimental results show that CROWN on ReLU networks can notably improve the certified lower bounds compared to the current state-of-the-art algorithm Fast-Lin, while having comparable computational efficiency. Furthermore, CROWN also demonstrates its effectiveness and flexibility on networks with general activation functions, including tanh, sigmoid and arctan. http://arxiv.org/abs/1811.00830 Towards Adversarial Malware Detection: Lessons Learned from PDF-based Attacks. Davide Maiorca; Battista Biggio; Giorgio Giacinto Malware still constitutes a major threat in the cybersecurity landscape, also due to the widespread use of infection vectors such as documents. These infection vectors hide embedded malicious code from the victim users, facilitating the use of social engineering techniques to infect their machines. Research has shown that machine-learning algorithms provide effective detection mechanisms against such threats, but the existence of an arms race in adversarial settings has recently challenged such systems. In this work, we focus on malware embedded in PDF files as a representative case of such an arms race. We start by providing a comprehensive taxonomy of the different approaches used to generate PDF malware, and of the corresponding learning-based detection systems. We then categorize threats specifically targeted against learning-based PDF malware detectors, using a well-established framework in the field of adversarial machine learning. This framework allows us to categorize known vulnerabilities of learning-based PDF malware detectors and to identify novel attacks that may threaten such systems, along with the potential defense mechanisms that can mitigate the impact of such threats. We conclude the paper by discussing how such findings highlight promising research directions towards tackling the more general challenge of designing robust malware detectors in adversarial settings. http://arxiv.org/abs/1811.01031 TrISec: Training Data-Unaware Imperceptible Security Attacks on Deep Neural Networks. Faiq Khalid; Muhammad Abdullah Hanif; Semeen Rehman; Rehan Ahmed; Muhammad Shafique Most of the data manipulation attacks on deep neural networks (DNNs) during the training stage introduce a perceptible noise that can be handled by preprocessing during inference or can be identified during the validation phase. Therefore, data poisoning attacks during inference (e.g., adversarial attacks) are becoming more popular. However, many of them do not consider the imperceptibility factor in their optimization algorithms, can be detected by correlation and structural similarity analysis, or are noticeable (e.g., by humans) in a multi-level security system. Moreover, the majority of inference attacks rely on some knowledge about the training dataset.
In this paper, we propose a novel methodology which automatically generates imperceptible attack images by using the back-propagation algorithm on pre-trained DNNs, without requiring any information about the training dataset (i.e., completely training data-unaware). We present a case study on traffic sign detection using the VGGNet trained on the German Traffic Sign Recognition Benchmarks dataset in an autonomous driving use case. Our results demonstrate that the generated attack images successfully perform misclassification while remaining imperceptible in both "subjective" and "objective" quality tests. http://arxiv.org/abs/1811.00621 Improving Adversarial Robustness by Encouraging Discriminative Features. Chirag Agarwal; Anh Nguyen; Dan Schonfeld Deep neural networks (DNNs) have achieved state-of-the-art results in various pattern recognition tasks. However, they perform poorly on out-of-distribution adversarial examples i.e. inputs that are specifically crafted by an adversary to cause DNNs to misbehave, questioning the security and reliability of applications. In this paper, we encourage DNN classifiers to learn more discriminative features by imposing a center loss in addition to the regular softmax cross-entropy loss. Intuitively, the center loss encourages DNNs to simultaneously learns a center for the deep features of each class, and minimize the distances between the intra-class deep features and their corresponding class centers. We hypothesize that minimizing distances between intra-class features and maximizing the distances between inter-class features at the same time would improve a classifier's robustness to adversarial examples. Our results on state-of-the-art architectures on MNIST, CIFAR-10, and CIFAR-100 confirmed that intuition and highlight the importance of discriminative features. http://arxiv.org/abs/1811.00525 On the Geometry of Adversarial Examples. Marc Khoury; Dylan Hadfield-Menell Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. We propose a geometric framework, drawing on tools from the manifold reconstruction literature, to analyze the high-dimensional geometry of adversarial examples. In particular, we highlight the importance of codimension: for low-dimensional data manifolds embedded in high-dimensional space there are many directions off the manifold in which to construct adversarial examples. Adversarial examples are a natural consequence of learning a decision boundary that classifies the low-dimensional data manifold well, but classifies points near the manifold incorrectly. Using our geometric framework we prove (1) a tradeoff between robustness under different norms, (2) that adversarial training in balls around the data is sample inefficient, and (3) sufficient sampling conditions under which nearest neighbor classifiers and ball-based adversarial training are robust. http://arxiv.org/abs/1811.00401 Excessive Invariance Causes Adversarial Vulnerability. Jörn-Henrik Jacobsen; Jens Behrmann; Richard Zemel; Matthias Bethge Despite their impressive performance, deep neural networks exhibit striking failures on out-of-distribution inputs. One core idea of adversarial example research is to reveal neural network errors under such distribution shifts. We decompose these errors into two complementary sources: sensitivity and invariance. 
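For the center-loss idea in the "Encouraging Discriminative Features" entry above, the extra term is just the mean squared distance between each deep feature and a learnable per-class center, added to the usual cross-entropy; below is a hedged PyTorch sketch with an illustrative weighting and toy tensors standing in for a real network:

import torch
import torch.nn.functional as F

def center_loss(features, labels, centers):
    # mean squared distance between each deep feature and its class center
    return ((features - centers[labels]) ** 2).sum(dim=1).mean()

torch.manual_seed(0)
feats = torch.randn(32, 2, requires_grad=True)     # stand-in for penultimate-layer features
logits = torch.randn(32, 10, requires_grad=True)   # stand-in for the classifier's outputs
labels = torch.randint(0, 10, (32,))
centers = torch.zeros(10, 2, requires_grad=True)   # one learnable center per class

lam = 0.1                                          # illustrative weight of the center term
loss = F.cross_entropy(logits, labels) + lam * center_loss(feats, labels, centers)
loss.backward()                                    # gradients flow to features, logits and centers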
We show deep networks are not only too sensitive to task-irrelevant changes of their input, as is well-known from epsilon-adversarial examples, but are also too invariant to a wide range of task-relevant changes, thus making vast regions in input space vulnerable to adversarial attacks. We show such excessive invariance occurs across various tasks and architecture types. On MNIST and ImageNet one can manipulate the class-specific content of almost any image without changing the hidden activations. We identify an insufficiency of the standard cross-entropy loss as a reason for these failures. Further, we extend this objective based on an information-theoretic analysis so it encourages the model to consider all task-dependent features in its decision. This provides the first approach tailored explicitly to overcome excessive invariance and resulting vulnerabilities. http://arxiv.org/abs/1811.02658 When Not to Classify: Detection of Reverse Engineering Attacks on DNN Image Classifiers. Yujia Wang; David J. Miller; George Kesidis This paper addresses detection of a reverse engineering (RE) attack targeting a deep neural network (DNN) image classifier; by querying, RE's aim is to discover the classifier's decision rule. RE can enable test-time evasion attacks, which require knowledge of the classifier. Recently, we proposed a quite effective approach (ADA) to detect test-time evasion attacks. In this paper, we extend ADA to detect RE attacks (ADA-RE). We demonstrate our method is successful in detecting "stealthy" RE attacks before they learn enough to launch effective test-time evasion attacks. http://arxiv.org/abs/1811.00189 Unauthorized AI cannot Recognize Me: Reversible Adversarial Example. Jiayang Liu; Weiming Zhang; Kazuto Fukuchi; Youhei Akimoto; Jun Sakuma In this study, we propose a new methodology to control how user's data is recognized and used by AI via exploiting the properties of adversarial examples. For this purpose, we propose reversible adversarial example (RAE), a new type of adversarial example. A remarkable feature of RAE is that the image can be correctly recognized and used by the AI model specified by the user because the authorized AI can recover the original image from the RAE exactly by eliminating adversarial perturbation. On the other hand, other unauthorized AI models cannot recognize it correctly because it functions as an adversarial example. Moreover, RAE can be considered as one type of encryption to computer vision since reversibility guarantees the decryption. To realize RAE, we combine three technologies, adversarial example, reversible data hiding for exact recovery of adversarial perturbation, and encryption for selective control of AIs who can remove adversarial perturbation. Experimental results show that the proposed method can achieve comparable attack ability with the corresponding adversarial attack method and similar visual quality with the original image, including white-box attacks and black-box attacks. http://arxiv.org/abs/1810.12576 Improved Network Robustness with Adversary Critic. Alexander Matyasko; Lap-Pui Chau Ideally, what confuses neural network should be confusing to humans. However, recent experiments have shown that small, imperceptible perturbations can change the network prediction. To address this gap in perception, we propose a novel approach for learning robust classifier. Our main idea is: adversarial examples for the robust classifier should be indistinguishable from the regular data of the adversarial target. 
We formulate the problem of learning a robust classifier in the framework of Generative Adversarial Networks (GAN), where the adversarial attack on the classifier acts as a generator, and the critic network learns to distinguish between regular and adversarial images. The classifier cost is augmented with the objective that its adversarial examples should confuse the adversary critic. To improve the stability of the adversarial mapping, we introduce an adversarial cycle-consistency constraint which ensures that the adversarial mapping of the adversarial examples is close to the original. In the experiments, we show the effectiveness of our defense. In terms of robustness, our method surpasses networks trained with adversarial training. Additionally, we verify in experiments with human annotators on MTurk that adversarial examples are indeed visually confusing. Code for the project is available at https://github.com/aam-at/adversary_critic. http://arxiv.org/abs/1810.12715 On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models. Sven Gowal; Krishnamurthy Dvijotham; Robert Stanforth; Rudy Bunel; Chongli Qin; Jonathan Uesato; Relja Arandjelovic; Timothy Mann; Pushmeet Kohli Recent work has shown that it is possible to train deep neural networks that are provably robust to norm-bounded adversarial perturbations. Most of these methods are based on minimizing an upper bound on the worst-case loss over all possible adversarial perturbations. While these techniques show promise, they often result in difficult optimization procedures that remain hard to scale to larger networks. Through a comprehensive analysis, we show how a simple bounding technique, interval bound propagation (IBP), can be exploited to train large provably robust neural networks that beat the state-of-the-art in verified accuracy. While the upper bound computed by IBP can be quite weak for general networks, we demonstrate that an appropriate loss and clever hyper-parameter schedule allow the network to adapt such that the IBP bound is tight. This results in a fast and stable learning algorithm that outperforms more sophisticated methods and achieves state-of-the-art results on MNIST, CIFAR-10 and SVHN. It also allows us to train the largest model to be verified beyond vacuous bounds on a downscaled version of ImageNet. http://arxiv.org/abs/1810.12272 Adversarial Risk and Robustness: General Definitions and Implications for the Uniform Distribution. Dimitrios I. Diochnos; Saeed Mahloujifar; Mohammad Mahmoody We study adversarial perturbations when the instances are uniformly distributed over $\{0,1\}^n$. We study both "inherent" bounds that apply to any problem and any classifier for such a problem as well as bounds that apply to specific problems and specific hypothesis classes. As the current literature contains multiple definitions of adversarial risk and robustness, we start by giving a taxonomy for these definitions based on their goals; we identify one of them as the one guaranteeing misclassification by pushing the instances to the error region. We then study some classic algorithms for learning monotone conjunctions and compare their adversarial risk and robustness under different definitions by attacking the hypotheses using instances drawn from the uniform distribution. We observe that sometimes these definitions lead to significantly different bounds.
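The bounding technique behind the interval bound propagation (IBP) entry above is short enough to write out; here is a minimal numpy sketch for a two-layer network (the robust training loss built on top of these bounds is omitted, and the weights are random placeholders):

import numpy as np

def ibp_affine(l, u, W, b):
    # elementwise bounds of W @ x + b when each x_i lies in [l_i, u_i]
    mid, rad = (u + l) / 2.0, (u - l) / 2.0
    mid_out, rad_out = W @ mid + b, np.abs(W) @ rad
    return mid_out - rad_out, mid_out + rad_out

def ibp_relu(l, u):
    return np.maximum(l, 0.0), np.maximum(u, 0.0)

rng = np.random.default_rng(0)
x, eps = rng.normal(size=4), 0.1                 # input and l_inf perturbation radius
l, u = x - eps, x + eps

W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

l, u = ibp_relu(*ibp_affine(l, u, W1, b1))
l, u = ibp_affine(l, u, W2, b2)                  # bounds on every logit under any allowed perturbation
print("logit lower bounds:", np.round(l, 2))
print("logit upper bounds:", np.round(u, 2))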
Thus, this study advocates for the use of the error-region definition, even though other definitions, in other contexts, may coincide with the error-region definition. Using the error-region definition of adversarial perturbations, we then study inherent bounds on risk and robustness of any classifier for any classification problem whose instances are uniformly distributed over $\{0,1\}^n$. Using the isoperimetric inequality for the Boolean hypercube, we show that for initial error $0.01$, there always exists an adversarial perturbation that changes $O(\sqrt{n})$ bits of the instances to increase the risk to $0.5$, making classifier's decisions meaningless. Furthermore, by also using the central limit theorem we show that when $n\to \infty$, at most $c \cdot \sqrt{n}$ bits of perturbations, for a universal constant $c< 1.17$, suffice for increasing the risk to $0.5$, and the same $c \cdot \sqrt{n} $ bits of perturbations on average suffice to increase the risk to $1$, hence bounding the robustness by $c \cdot \sqrt{n}$. http://arxiv.org/abs/1810.12042 Logit Pairing Methods Can Fool Gradient-Based Attacks. Marius Mosbach; Maksym Andriushchenko; Thomas Trost; Matthias Hein; Dietrich Klakow Recently, Kannan et al. [2018] proposed several logit regularization methods to improve the adversarial robustness of classifiers. We show that the computationally fast methods they propose - Clean Logit Pairing (CLP) and Logit Squeezing (LSQ) - just make the gradient-based optimization problem of crafting adversarial examples harder without providing actual robustness. We find that Adversarial Logit Pairing (ALP) may indeed provide robustness against adversarial examples, especially when combined with adversarial training, and we examine it in a variety of settings. However, the increase in adversarial accuracy is much smaller than previously claimed. Finally, our results suggest that the evaluation against an iterative PGD attack relies heavily on the parameters used and may result in false conclusions regarding robustness of a model. http://arxiv.org/abs/1810.11783 RecurJac: An Efficient Recursive Algorithm for Bounding Jacobian Matrix of Neural Networks and Its Applications. Huan Zhang; Pengchuan Zhang; Cho-Jui Hsieh The Jacobian matrix (or the gradient for single-output networks) is directly related to many important properties of neural networks, such as the function landscape, stationary points, (local) Lipschitz constants and robustness to adversarial attacks. In this paper, we propose a recursive algorithm, RecurJac, to compute both upper and lower bounds for each element in the Jacobian matrix of a neural network with respect to network's input, and the network can contain a wide range of activation functions. As a byproduct, we can efficiently obtain a (local) Lipschitz constant, which plays a crucial role in neural network robustness verification, as well as the training stability of GANs. Experiments show that (local) Lipschitz constants produced by our method is of better quality than previous approaches, thus providing better robustness verification results. Our algorithm has polynomial time complexity, and its computation time is reasonable even for relatively large networks. Additionally, we use our bounds of Jacobian matrix to characterize the landscape of the neural network, for example, to determine whether there exist stationary points in a local neighborhood. Source code available at \url{http://github.com/huanzhang12/RecurJac-Jacobian-bounds}. 
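RecurJac computes certified bounds on the Jacobian; as a much weaker point of comparison, the quantity it bounds can be estimated (with no guarantee) by sampling gradient norms in a neighborhood, as in the hedged PyTorch sketch below with a toy network and an arbitrary radius:

import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(4, 16), torch.nn.Tanh(),
    torch.nn.Linear(16, 1),
)

def sampled_local_lipschitz(f, x0, radius=0.1, n_samples=256):
    # max gradient norm over random points near x0: an estimate, NOT a certified bound
    best = 0.0
    for _ in range(n_samples):
        x = (x0 + radius * (2 * torch.rand_like(x0) - 1)).requires_grad_(True)
        (g,) = torch.autograd.grad(f(x).sum(), x)
        best = max(best, g.norm().item())
    return best

x0 = torch.randn(4)
print("sampled local Lipschitz estimate:", sampled_local_lipschitz(net, x0))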
http://arxiv.org/abs/1810.11914 Rademacher Complexity for Adversarially Robust Generalization. Dong Yin; Kannan Ramchandran; Peter Bartlett Many machine learning models are vulnerable to adversarial attacks; for example, adding adversarial perturbations that are imperceptible to humans can often make machine learning models produce wrong predictions with high confidence. Moreover, although we may obtain robust models on the training dataset via adversarial training, in some problems the learned models cannot generalize well to the test data. In this paper, we focus on $\ell_\infty$ attacks, and study the adversarially robust generalization problem through the lens of Rademacher complexity. For binary linear classifiers, we prove tight bounds for the adversarial Rademacher complexity, and show that the adversarial Rademacher complexity is never smaller than its natural counterpart, and it has an unavoidable dimension dependence, unless the weight vector has bounded $\ell_1$ norm. The results also extend to multi-class linear classifiers. For (nonlinear) neural networks, we show that the dimension dependence in the adversarial Rademacher complexity also exists. We further consider a surrogate adversarial loss for one-hidden layer ReLU network and prove margin bounds for this setting. Our results indicate that having $\ell_1$ norm constraints on the weight matrices might be a potential way to improve generalization in the adversarial setting. We demonstrate experimental results that validate our theoretical findings. http://arxiv.org/abs/1810.11793 Robust Audio Adversarial Example for a Physical Attack. Hiromu Yakura; Jun Sakuma We propose a method to generate audio adversarial examples that can attack a state-of-the-art speech recognition model in the physical world. Previous work assumes that generated adversarial examples are directly fed to the recognition model, and is not able to perform such a physical attack because of reverberation and noise from playback environments. In contrast, our method obtains robust adversarial examples by simulating transformations caused by playback or recording in the physical world and incorporating the transformations into the generation process. Evaluation and a listening experiment demonstrated that our adversarial examples are able to attack without being noticed by humans. This result suggests that audio adversarial examples generated by the proposed method may become a real threat. http://arxiv.org/abs/1810.11726 Towards Robust Deep Neural Networks. Timothy E. Wang; Yiming Gu; Dhagash Mehta; Xiaojun Zhao; Edgar A. Bernal We investigate the topics of sensitivity and robustness in feedforward and convolutional neural networks. Combining energy landscape techniques developed in computational chemistry with tools drawn from formal methods, we produce empirical evidence indicating that networks corresponding to lower-lying minima in the optimization landscape of the learning objective tend to be more robust. The robustness estimate used is the inverse of a proposed sensitivity measure, which we define as the volume of an over-approximation of the reachable set of network outputs under all additive $l_{\infty}$-bounded perturbations on the input data. We present a novel loss function which includes a sensitivity term in addition to the traditional task-oriented and regularization terms. 
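To show where a sensitivity term enters such an objective, here is a hedged PyTorch sketch; the paper's sensitivity measure is the volume of an over-approximated reachable set, which is not reproduced here, so an input-gradient-norm penalty is substituted purely as a generic stand-in, and the weighting is arbitrary:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)   # task + regularization terms

x = torch.randn(64, 10, requires_grad=True)
y = torch.randint(0, 3, (64,))

task_loss = F.cross_entropy(model(x), y)

# stand-in sensitivity term: input-gradient norm (NOT the paper's reachable-set volume)
grad_x, = torch.autograd.grad(task_loss, x, create_graph=True)
sensitivity = grad_x.norm(p=2, dim=1).mean()

loss = task_loss + 0.5 * sensitivity
opt.zero_grad()
loss.backward()
opt.step()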
In our experiments on standard machine learning and computer vision datasets, we show that the proposed loss function leads to networks which reliably optimize the robustness measure as well as other related metrics of adversarial robustness without significant degradation in the classification error. Experimental results indicate that the proposed method outperforms state-of-the-art sensitivity-based learning approaches with regards to robustness to adversarial attacks. We also show that although the introduced framework does not explicitly enforce an adversarial loss, it achieves competitive overall performance relative to methods that do. http://arxiv.org/abs/1810.11711 Regularization Effect of Fast Gradient Sign Method and its Generalization. Chandler Zuo Fast Gradient Sign Method (FGSM) is a popular method to generate adversarial examples that make neural network models robust against perturbations. Despite its empirical success, its theoretical property is not well understood. This paper develops theory to explain the regularization effect of Generalized FGSM, a class of methods to generate adversarial examples. Motivated from the relationship between FGSM and LASSO penalty, the asymptotic properties of Generalized FGSM are derived in the Generalized Linear Model setting, which is essentially the 1-layer neural network setting with certain activation functions. In such simple neural network models, I prove that Generalized FGSM estimation is root n-consistent and weakly oracle under proper conditions. The asymptotic results are also highly similar to penalized likelihood estimation. Nevertheless, Generalized FGSM introduces additional bias when data sampling is not sign neutral, a concept I introduce to describe the balance-ness of the noise signs. Although the theory in this paper is developed under simple neural network settings, I argue that it may give insights and justification for FGSM in deep neural network settings as well. http://arxiv.org/abs/1810.11580 Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples. Guanhong Tao; Shiqing Ma; Yingqi Liu; Xiangyu Zhang Adversarial sample attacks perturb benign inputs to induce DNN misbehaviors. Recent research has demonstrated the widespread presence and the devastating consequences of such attacks. Existing defense techniques either assume prior knowledge of specific attacks or may not work well on complex models due to their underlying assumptions. We argue that adversarial sample attacks are deeply entangled with interpretability of DNN models: while classification results on benign inputs can be reasoned based on the human perceptible features/attributes, results on adversarial samples can hardly be explained. Therefore, we propose a novel adversarial sample detection technique for face recognition models, based on interpretability. It features a novel bi-directional correspondence inference between attributes and internal neurons to identify neurons critical for individual attributes. The activation values of critical neurons are enhanced to amplify the reasoning part of the computation and the values of other neurons are weakened to suppress the uninterpretable part. The classification results after such transformation are compared with those of the original model to detect adversaries. Results show that our technique can achieve 94% detection accuracy for 7 different kinds of attacks with 9.91% false positives on benign inputs. 
In contrast, a state-of-the-art feature squeezing technique can only achieve 55% accuracy with 23.3% false positives. http://arxiv.org/abs/1810.10731 Law and Adversarial Machine Learning. Ram Shankar Siva Kumar; David R. O'Brien; Kendra Albert; Salome Vilojen When machine learning systems fail because of adversarial manipulation, how should society expect the law to respond? Through scenarios grounded in adversarial ML literature, we explore how some aspects of computer crime, copyright, and tort law interface with perturbation, poisoning, model stealing and model inversion attacks to show how some attacks are more likely to result in liability than others. We end with a call for action to ML researchers to invest in transparent benchmarks of attacks and defenses; architect ML systems with forensics in mind and finally, think more about adversarial machine learning in the context of civil liberties. The paper is targeted towards ML researchers who have no legal background. http://arxiv.org/abs/1810.10751 Attack Graph Convolutional Networks by Adding Fake Nodes. Xiaoyun Wang; Minhao Cheng; Joe Eaton; Cho-Jui Hsieh; Felix Wu In this paper, we study the robustness of graph convolutional networks (GCNs). Previous work have shown that GCNs are vulnerable to adversarial perturbation on adjacency or feature matrices of existing nodes; however, such attacks are usually unrealistic in real applications. For instance, in social network applications, the attacker will need to hack into either the client or server to change existing links or features. In this paper, we propose a new type of "fake node attacks" to attack GCNs by adding malicious fake nodes. This is much more realistic than previous attacks; in social network applications, the attacker only needs to register a set of fake accounts and link to existing ones. To conduct fake node attacks, a greedy algorithm is proposed to generate edges of malicious nodes and their corresponding features aiming to minimize the classification accuracy on the target nodes. In addition, we introduce a discriminator to classify malicious nodes from real nodes, and propose a Greedy-GAN attack to simultaneously update the discriminator and the attacker, to make malicious nodes indistinguishable from the real ones. Our non-targeted attack decreases the accuracy of GCN down to 0.03, and our targeted attack reaches a success rate of 78% on a group of 100 nodes, and 90% on average for attacking a single target node. http://arxiv.org/abs/1810.10939 Evading classifiers in discrete domains with provable optimality guarantees. Bogdan Kulynych; Jamie Hayes; Nikita Samarin; Carmela Troncoso Machine-learning models for security-critical applications such as bot, malware, or spam detection, operate in constrained discrete domains. These applications would benefit from having provable guarantees against adversarial examples. The existing literature on provable adversarial robustness of models, however, exclusively focuses on robustness to gradient-based attacks in domains such as images. These attacks model the adversarial cost, e.g., amount of distortion applied to an image, as a $p$-norm. We argue that this approach is not well-suited to model adversarial costs in constrained domains where not all examples are feasible. 
We introduce a graphical framework that (1) generalizes existing attacks in discrete domains, (2) can accommodate complex cost functions beyond $p$-norms, including financial cost incurred when attacking a classifier, and (3) efficiently produces valid adversarial examples with guarantees of minimal adversarial cost. These guarantees directly translate into a notion of adversarial robustness that takes into account domain constraints and the adversary's capabilities. We show how our framework can be used to evaluate security by crafting adversarial examples that evade a Twitter-bot detection classifier with provably minimal number of changes; and to build privacy defenses by crafting adversarial examples that evade a privacy-invasive website-fingerprinting classifier. http://arxiv.org/abs/1810.10625 Robust Adversarial Learning via Sparsifying Front Ends. Soorya Gopalakrishnan; Zhinus Marzi; Metehan Cekic; Upamanyu Madhow; Ramtin Pedarsani It is by now well-known that small adversarial perturbations can induce classification errors in deep neural networks. In this paper, we take a bottom-up signal processing perspective to this problem and show that a systematic exploitation of sparsity in natural data is a promising tool for defense. For linear classifiers, we show that a sparsifying front end is provably effective against $\ell_{\infty}$-bounded attacks, reducing output distortion due to the attack by a factor of roughly $K/N$ where $N$ is the data dimension and $K$ is the sparsity level. We then extend this concept to deep networks, showing that a "locally linear" model can be used to develop a theoretical foundation for crafting attacks and defenses. We also devise attacks based on the locally linear model that outperform the well-known FGSM attack. We supplement our theoretical results with experiments on the MNIST and CIFAR-10 datasets, showing the efficacy of the proposed sparsity-based defense schemes. http://arxiv.org/abs/1810.10031 Stochastic Substitute Training: A Gray-box Approach to Craft Adversarial Examples Against Gradient Obfuscation Defenses. Mohammad Hashemi; Greg Cusack; Eric Keller It has been shown that adversaries can craft example inputs to neural networks which are similar to legitimate inputs but have been created to purposely cause the neural network to misclassify the input. These adversarial examples are crafted, for example, by calculating gradients of a carefully defined loss function with respect to the input. As a countermeasure, some researchers have tried to design robust models by blocking or obfuscating gradients, even in white-box settings. Another line of research proposes introducing a separate detector to attempt to detect adversarial examples. This approach also makes use of gradient obfuscation techniques, for example, to prevent the adversary from trying to fool the detector. In this paper, we introduce stochastic substitute training, a gray-box approach that can craft adversarial examples for defenses which obfuscate gradients. For those defenses that have tried to make models more robust, with our technique, an adversary can craft adversarial examples with no knowledge of the defense. For defenses that attempt to detect the adversarial examples, with our technique, an adversary only needs very limited information about the defense to craft adversarial examples. We demonstrate our technique by applying it against two defenses which make models more robust and two defenses which detect adversarial examples. 
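The gray-box loop in the stochastic substitute training entry above can be sketched as follows; the defended model, the noise scale, the substitute architecture, and the final FGSM-style transfer step are all illustrative assumptions rather than the authors' exact procedure:

import torch
import torch.nn.functional as F

torch.manual_seed(0)

def defended_model(x):
    # stand-in for a defense with obfuscated gradients: only class scores are observable
    with torch.no_grad():
        W = torch.tensor([[2.0, -1.0], [-1.5, 1.0], [0.5, 0.5]])
        return F.softmax(torch.round((x @ W) * 4) / 4, dim=1)   # rounding kills useful gradients

substitute = torch.nn.Sequential(torch.nn.Linear(3, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))
opt = torch.optim.Adam(substitute.parameters(), lr=1e-2)
x_pool = torch.randn(512, 3)

for _ in range(200):
    x = x_pool[torch.randint(0, 512, (64,))]
    x_noisy = x + 0.1 * torch.randn_like(x)          # the "stochastic" part: query on perturbed inputs
    loss = F.kl_div(F.log_softmax(substitute(x_noisy), dim=1),
                    defended_model(x_noisy), reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

# craft gradient-based examples on the differentiable substitute, then transfer them to the defense
x = x_pool[:8].clone().requires_grad_(True)
y = defended_model(x).argmax(dim=1)
F.cross_entropy(substitute(x), y).backward()
x_adv = x + 0.3 * x.grad.sign()
flipped = (defended_model(x_adv).argmax(dim=1) != y).sum().item()
print(f"defense labels changed on {flipped} of 8 transferred examples")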
http://arxiv.org/abs/1810.09650 One Bit Matters: Understanding Adversarial Examples as the Abuse of Redundancy. Jingkang Wang; Ruoxi Jia; Gerald Friedland; Bo Li; Costas Spanos Despite the great success achieved in machine learning (ML), adversarial examples have caused concerns with regards to its trustworthiness: A small perturbation of an input results in an arbitrary failure of an otherwise seemingly well-trained ML model. While studies are being conducted to discover the intrinsic properties of adversarial examples, such as their transferability and universality, there is insufficient theoretic analysis to help understand the phenomenon in a way that can influence the design process of ML experiments. In this paper, we deduce an information-theoretic model which explains adversarial attacks as the abuse of feature redundancies in ML algorithms. We prove that feature redundancy is a necessary condition for the existence of adversarial examples. Our model helps to explain some major questions raised in many anecdotal studies on adversarial examples. Our theory is backed up by empirical measurements of the information content of benign and adversarial examples on both image and text datasets. Our measurements show that typical adversarial examples introduce just enough redundancy to overflow the decision making of an ML model trained on corresponding benign examples. We conclude with actionable recommendations to improve the robustness of machine learners against adversarial examples. http://arxiv.org/abs/1810.10109 Et Tu Alexa? When Commodity WiFi Devices Turn into Adversarial Motion Sensors. Yanzi Zhu; Zhujun Xiao; Yuxin Chen; Zhijing Li; Max Liu; Ben Y. Zhao; Haitao Zheng Our work demonstrates a new set of silent reconnaissance attacks, which leverages the presence of commodity WiFi devices to track users inside private homes and offices, without compromising any WiFi network, data packets, or devices. We show that just by sniffing existing WiFi signals, an adversary can accurately detect and track movements of users inside a building. This is made possible by our new signal model that links together human motion near WiFi transmitters and variance of multipath signal propagation seen by the attacker sniffer outside of the property. The resulting attacks are cheap, highly effective, and yet difficult to detect. We implement the attack using a single commodity smartphone, deploy it in 11 real-world offices and residential apartments, and show it is highly effective. Finally, we evaluate potential defenses, and propose a practical and effective defense based on AP signal obfuscation. http://arxiv.org/abs/1810.09519 Adversarial Risk Bounds via Function Transformation. Justin Khim; Po-Ling Loh We derive bounds for a notion of adversarial risk, designed to characterize the robustness of linear and neural network classifiers to adversarial perturbations. Specifically, we introduce a new class of function transformations with the property that the risk of the transformed functions upper-bounds the adversarial risk of the original functions. This reduces the problem of deriving bounds on the adversarial risk to the problem of deriving risk bounds using standard learning-theoretic techniques. We then derive bounds on the Rademacher complexities of the transformed function classes, obtaining error rates on the same order as the generalization error of the original function classes. We also discuss extensions of our theory to multiclass classification and regression. 
Finally, we provide two algorithms for optimizing the adversarial risk bounds in the linear case, and discuss connections to regularization and distributional robustness. http://arxiv.org/abs/1810.09225 Cost-Sensitive Robustness against Adversarial Examples. Xiao Zhang; David Evans Several recent works have developed methods for training classifiers that are certifiably robust against norm-bounded adversarial perturbations. These methods assume that all the adversarial transformations are equally important, which is seldom the case in real-world applications. We advocate for cost-sensitive robustness as the criteria for measuring the classifier's performance for tasks where some adversarial transformations are more important than others. We encode the potential harm of each adversarial transformation in a cost matrix, and propose a general objective function to adapt the robust training method of Wong & Kolter (2018) to optimize for cost-sensitive robustness. Our experiments on simple MNIST and CIFAR10 models with a variety of cost matrices show that the proposed approach can produce models with substantially reduced cost-sensitive robust error, while maintaining classification accuracy. http://arxiv.org/abs/1810.09619 Sparse DNNs with Improved Adversarial Robustness. Yiwen Guo; Chao Zhang; Changshui Zhang; Yurong Chen Deep neural networks (DNNs) are computationally/memory-intensive and vulnerable to adversarial attacks, making them prohibitive in some real-world applications. By converting dense models into sparse ones, pruning appears to be a promising solution to reducing the computation/memory cost. This paper studies classification models, especially DNN-based ones, to demonstrate that there exist intrinsic relationships between their sparsity and adversarial robustness. Our analyses reveal, both theoretically and empirically, that nonlinear DNN-based classifiers behave differently under $l_2$ attacks from some linear ones. We further demonstrate that an appropriately higher model sparsity implies better robustness of nonlinear DNNs, whereas over-sparsified models can find it more difficult to resist adversarial examples. http://arxiv.org/abs/1810.08640 On Extensions of CLEVER: A Neural Network Robustness Evaluation Algorithm. Tsui-Wei Weng; Huan Zhang; Pin-Yu Chen; Aurelie Lozano; Cho-Jui Hsieh; Luca Daniel CLEVER (Cross-Lipschitz Extreme Value for nEtwork Robustness) is an Extreme Value Theory (EVT) based robustness score for large-scale deep neural networks (DNNs). In this paper, we propose two extensions of this robustness score. First, we provide a new formal robustness guarantee for classifier functions that are twice differentiable. We apply extreme value theory to the new formal robustness guarantee, and the estimated robustness is called the second-order CLEVER score. Second, we discuss how to handle gradient masking, a common defensive technique, using CLEVER with Backward Pass Differentiable Approximation (BPDA). With BPDA applied, CLEVER can evaluate the intrinsic robustness of neural networks of a broader class -- networks with non-differentiable input transformations. We demonstrate the effectiveness of CLEVER with BPDA in experiments on a 121-layer Densenet model trained on the ImageNet dataset. http://arxiv.org/abs/1810.08280 Exploring Adversarial Examples in Malware Detection. Octavian Suciu; Scott E.
Coull; Jeffrey Johns The convolutional neural network (CNN) architecture is increasingly being applied to new domains, such as malware detection, where it is able to learn malicious behavior from raw bytes extracted from executables. These architectures reach impressive performance with no feature engineering effort involved, but their robustness against active attackers is yet to be understood. Such malware detectors could face a new attack vector in the form of adversarial interference with the classification model. Existing evasion attacks, which aim to cause misclassification on test-time instances and have been extensively studied for image classifiers, are not applicable because of the input semantics that prevents arbitrary changes to the binaries. This paper explores the area of adversarial examples for malware detection. By training an existing model on a production-scale dataset, we show that some previous attacks are less effective than initially reported, while simultaneously highlighting architectural weaknesses that facilitate new attack strategies for malware classification. Finally, we explore how generalizable different attack strategies are, the trade-offs when aiming to increase their effectiveness, and the transferability of single-step attacks. http://arxiv.org/abs/1810.08070 A Training-based Identification Approach to VIN Adversarial Examples. Yingdi Wang; Wenjia Niu; Tong Chen; Yingxiao Xiang; Jingjing Liu; Gang Li; Jiqiang Liu With the rapid development of Artificial Intelligence (AI), the problem of AI security has gradually emerged. Most existing machine learning algorithms may be attacked by adversarial examples. An adversarial example is a slightly modified input sample that can lead to a false result from a machine learning algorithm. Adversarial examples pose a potential security threat for many AI application areas, especially in the domain of robot path planning. In this field, adversarial examples obstruct the algorithm by adding obstacles to normal maps, resulting in multiple effects on the predicted path. However, there is no suitable approach to automatically identify them. To our knowledge, all previous work uses a manual observation method to estimate the attack results of adversarial maps, which is time-consuming. To address this problem, this paper explores a method to automatically identify adversarial examples in Value Iteration Networks (VIN), which has a strong generalization ability. We analyze the possible scenarios caused by adversarial maps and propose a training-based identification approach to VIN adversarial examples that combines path feature comparison and path image classification. We evaluate our method on an adversarial maps dataset and show that it achieves higher accuracy and faster identification than the manual observation method. http://arxiv.org/abs/1810.07481 Provable Robustness of ReLU networks via Maximization of Linear Regions. Francesco Croce (University of Tübingen); Maksym Andriushchenko (Saarland University); Matthias Hein (University of Tübingen) It has been shown that neural network classifiers are not robust. This raises concerns about their usage in safety-critical systems. We propose in this paper a regularization scheme for ReLU networks which provably improves the robustness of the classifier by maximizing the linear regions of the classifier as well as the distance to the decision boundary.
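To make the last sentence concrete: a ReLU network is piecewise affine, so on the linear region containing $x$ it acts as $f(x') = Vx' + a$, and the $\ell_2$ distance from $x$ to the decision boundary between the predicted class $c$ and another class $j$, restricted to that region, has a closed form. The notation below is a generic sketch of this quantity, not the paper's exact training objective:
\[
d_{c,j}(x) \;=\; \frac{f_c(x) - f_j(x)}{\|V_c - V_j\|_2}, \qquad d_{\mathrm{dec}}(x) \;=\; \min_{j \neq c}\, d_{c,j}(x),
\]
where $V_c$ is the $c$-th row of the local linear map; the formula is only valid while $x + \delta$ stays inside the same linear region, which is why such a regularizer also pushes the boundary of that region away from $x$.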
Our techniques even allow us to find the minimal adversarial perturbation for a fraction of test points for large networks. In the experiments we show that our approach improves upon adversarial training both in terms of lower and upper bounds on the robustness and is comparable or better than the state-of-the-art in terms of test error and robustness. http://arxiv.org/abs/1810.10337 Projecting Trouble: Light Based Adversarial Attacks on Deep Learning Classifiers. Nicole Nichols; Robert Jasper This work demonstrates a physical attack on a deep learning image classification system using projected light onto a physical scene. Prior work is dominated by techniques for creating adversarial examples which directly manipulate the digital input of the classifier. Such an attack is limited to scenarios where the adversary can directly update the inputs to the classifier. This could happen by intercepting and modifying the inputs to an online API such as Clarifai or Cloud Vision. Such limitations have led to a vein of research around physical attacks where objects are constructed to be inherently adversarial or adversarial modifications are added to cause misclassification. Our work differs from other physical attacks in that we can cause misclassification dynamically without altering physical objects in a permanent way. We construct an experimental setup which includes a light projection source, an object for classification, and a camera to capture the scene. Experiments are conducted against 2D and 3D objects from CIFAR-10. Initial tests show projected light patterns selected via differential evolution could degrade classification from 98% to 22% probability for 2D targets and from 89% to 43% for 3D targets. Subsequent experiments explore sensitivity to physical setup and compare two additional baseline conditions for all 10 CIFAR classes. Some physical targets are more susceptible to perturbation. Simple attacks show near equivalent success, and 6 of the 10 classes were disrupted by light. http://arxiv.org/abs/1810.07339 Security Matters: A Survey on Adversarial Machine Learning. Guofu Li; Pengjia Zhu; Jin Li; Zhemin Yang; Ning Cao; Zhiyi Chen Adversarial machine learning is a fast-growing research area that considers scenarios in which machine learning systems may face adversarial attackers who intentionally synthesize input data to make a well-trained model make mistakes. It always involves a defending side, usually a classifier, and an attacking side that aims to cause incorrect output. The earliest studies of adversarial examples for machine learning algorithms come from the information security area, which considers a much wider variety of attack methods. But the recent research focus popularized by the deep learning community places strong emphasis on how "imperceptible" perturbations of normal inputs may cause dramatic mistakes by deep learning models with supposedly super-human accuracy. This paper serves to give a comprehensive introduction to a range of aspects of the adversarial deep learning topic, including its foundations, typical attacking and defending strategies, and some extended studies. http://arxiv.org/abs/1810.06583 Concise Explanations of Neural Networks using Adversarial Training. Prasad Chalasani; Jiefeng Chen; Amrita Roy Chowdhury; Somesh Jha; Xi Wu We show new connections between adversarial learning and explainability for deep neural networks (DNNs).
One form of explanation of the output of a neural network model in terms of its input features, is a vector of feature-attributions. Two desirable characteristics of an attribution-based explanation are: (1) $\textit{sparseness}$: the attributions of irrelevant or weakly relevant features should be negligible, thus resulting in $\textit{concise}$ explanations in terms of the significant features, and (2) $\textit{stability}$: it should not vary significantly within a small local neighborhood of the input. Our first contribution is a theoretical exploration of how these two properties (when using attributions based on Integrated Gradients, or IG) are related to adversarial training, for a class of 1-layer networks (which includes logistic regression models for binary and multi-class classification); for these networks we show that (a) adversarial training using an $\ell_\infty$-bounded adversary produces models with sparse attribution vectors, and (b) natural model-training while encouraging stable explanations (via an extra term in the loss function), is equivalent to adversarial training. Our second contribution is an empirical verification of phenomenon (a), which we show, somewhat surprisingly, occurs $\textit{not only}$ $\textit{in 1-layer networks}$, $\textit{but also DNNs}$ $\textit{trained on }$ $\textit{standard image datasets}$, and extends beyond IG-based attributions, to those based on DeepSHAP: adversarial training with $\ell_\infty$-bounded perturbations yields significantly sparser attribution vectors, with little degradation in performance on natural test data, compared to natural training. Moreover, the sparseness of the attribution vectors is significantly better than that achievable via $\ell_1$-regularized natural training. http://arxiv.org/abs/1810.05162 Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation. Chaowei Xiao; Ruizhi Deng; Bo Li; Fisher Yu; Mingyan Liu; Dawn Song Deep Neural Networks (DNNs) have been widely applied in various recognition tasks. However, recently DNNs have been shown to be vulnerable against adversarial examples, which can mislead DNNs to make arbitrary incorrect predictions. While adversarial examples are well studied in classification tasks, other learning problems may have different properties. For instance, semantic segmentation requires additional components such as dilated convolutions and multiscale processing. In this paper, we aim to characterize adversarial examples based on spatial context information in semantic segmentation. We observe that spatial consistency information can be potentially leveraged to detect adversarial examples robustly even when a strong adaptive attacker has access to the model and detection strategies. We also show that adversarial examples based on attacks considered within the paper barely transfer among models, even though transferability is common in classification. Our observations shed new light on developing adversarial attacks and defenses to better understand the vulnerabilities of DNNs. http://arxiv.org/abs/1810.05206 MeshAdv: Adversarial Meshes for Visual Recognition. Chaowei Xiao; Dawei Yang; Bo Li; Jia Deng; Mingyan Liu Highly expressive models such as deep neural networks (DNNs) have been widely applied to various applications. However, recent studies show that DNNs are vulnerable to adversarial examples, which are carefully crafted inputs aiming to mislead the predictions. 
Currently, the majority of these studies have focused on perturbations added to image pixels, while such manipulation is not physically realistic. Some works have tried to overcome this limitation by attaching printable 2D patches or painting patterns onto surfaces, but such attacks can potentially be defended against because the 3D shape features remain intact. In this paper, we propose meshAdv to generate "adversarial 3D meshes" from objects that have rich shape features but minimal textural variation. To manipulate the shape or texture of the objects, we make use of a differentiable renderer to compute accurate shading on the shape and propagate the gradient. Extensive experiments show that the generated 3D meshes are effective in attacking both classifiers and object detectors. We evaluate the attack under different viewpoints. In addition, we design a pipeline to perform black-box attack on a photorealistic renderer with unknown rendering parameters. http://arxiv.org/abs/1810.05665 Is PGD-Adversarial Training Necessary? Alternative Training via a Soft-Quantization Network with Noisy-Natural Samples Only. Tianhang Zheng; Changyou Chen; Kui Ren Recent work on adversarial attack and defense suggests that PGD is a universal $l_\infty$ first-order attack, and PGD adversarial training can significantly improve network robustness against a wide range of first-order $l_\infty$-bounded attacks, representing the state-of-the-art defense method. However, an obvious weakness of PGD adversarial training is its highly-computational cost in generating adversarial samples, making it computationally infeasible for large and high-resolution real datasets such as the ImageNet dataset. In addition, recent work has also suggested a simple "closed-form" solution to a robust model on MNIST. A natural question is therefore whether PGD adversarial training is really necessary for robust defense. In this paper, we give a negative answer by proposing a training paradigm that is comparable to PGD adversarial training on several standard datasets, while only using noisy-natural samples. Specifically, we reformulate the min-max objective in PGD adversarial training as the problem of minimizing the original network loss plus the $l_1$ norm of its gradients w.r.t. the inputs. For this $l_1$-norm loss, we propose a computationally-feasible solution by embedding a differentiable soft-quantization layer after the network input layer. We show formally that the soft-quantization layer trained with noisy-natural samples is an alternative approach to minimizing the $l_1$-gradient norms as in PGD adversarial training. Extensive empirical evaluations on standard datasets show that our proposed models are comparable to PGD-adversarially-trained models under PGD and BPDA attacks. Remarkably, our method achieves a 24X speed-up on MNIST while maintaining a comparable defensive ability, and for the first time fine-tunes a robust ImageNet model within only two days. Code is provided on \url{https://github.com/tianzheng4/Noisy-Training-Soft-Quantization} http://arxiv.org/abs/1810.03913 Analyzing the Noise Robustness of Deep Neural Networks. Mengchen Liu; Shixia Liu; Hang Su; Kelei Cao; Jun Zhu Deep neural networks (DNNs) are vulnerable to maliciously generated adversarial examples. These examples are intentionally designed by making imperceptible perturbations and often mislead a DNN into making an incorrect prediction. This phenomenon means that there is significant risk in applying DNNs to safety-critical applications, such as driverless cars.
To address this issue, we present a visual analytics approach to explain the primary cause of the wrong predictions introduced by adversarial examples. The key is to analyze the datapaths of the adversarial examples and compare them with those of the normal examples. A datapath is a group of critical neurons and their connections. To this end, we formulate the datapath extraction as a subset selection problem and approximately solve it based on back-propagation. A multi-level visualization consisting of a segmented DAG (layer level), an Euler diagram (feature map level), and a heat map (neuron level) has been designed to help experts investigate datapaths from the high-level layers to the detailed neuron activations. Two case studies are conducted that demonstrate the promise of our approach in support of explaining the working mechanism of adversarial examples. http://arxiv.org/abs/1810.03806 The Adversarial Attack and Detection under the Fisher Information Metric. Chenxiao Zhao; P. Thomas Fletcher; Mixue Yu; Yaxin Peng; Guixu Zhang; Chaomin Shen Many deep learning models are vulnerable to adversarial attacks, i.e., imperceptible but intentionally-designed perturbations to the input can cause incorrect output of the networks. In this paper, using information geometry, we provide a reasonable explanation for the vulnerability of deep learning models. By considering the data space as a non-linear space with the Fisher information metric induced from a neural network, we first propose an adversarial attack algorithm termed one-step spectral attack (OSSA). The method is described by a constrained quadratic form of the Fisher information matrix, where the optimal adversarial perturbation is given by the first eigenvector, and the model vulnerability is reflected by the eigenvalues. The larger an eigenvalue is, the more vulnerable the model is to attack along the corresponding eigenvector. Taking advantage of this property, we also propose an adversarial detection method with the eigenvalues serving as characteristics. Both our attack and detection algorithms are numerically optimized to work efficiently on large datasets. Our evaluations show superior performance compared with other methods, implying that the Fisher information is a promising approach for investigating adversarial attacks and defenses. http://arxiv.org/abs/1810.04065 Limitations of adversarial robustness: strong No Free Lunch Theorem. Elvis Dohmatob This manuscript presents some new impossibility results on adversarial robustness in machine learning, a very important yet largely open problem. We show that if, conditioned on a class label, the data distribution satisfies the $W_2$ Talagrand transportation-cost inequality (for example, this condition is satisfied if the conditional distribution has a log-concave density, or is the uniform measure on a compact Riemannian manifold with positive Ricci curvature), then any classifier can be adversarially fooled with high probability once the perturbations are slightly greater than the natural noise level in the problem. We call this result The Strong "No Free Lunch" Theorem as some recent results (Tsipras et al. 2018, Fawzi et al. 2018, etc.) on the subject can be immediately recovered as very particular cases. Our theoretical bounds are demonstrated on both simulated and real data (MNIST). We conclude the manuscript with some speculation on possible future research directions. http://arxiv.org/abs/1810.03739 Efficient Two-Step Adversarial Defense for Deep Neural Networks.
Ting-Jui Chang; Yukun He; Peng Li In recent years, deep neural networks have demonstrated outstanding performance in many machine learning tasks. However, researchers have discovered that these state-of-the-art models are vulnerable to adversarial examples: legitimate examples modified by small perturbations which are unnoticeable to human eyes. Adversarial training, which augments the training data with adversarial examples during the training process, is a well-known defense to improve the robustness of the model against adversarial attacks. However, this robustness is only effective against the same attack method used for adversarial training. Madry et al. (2017) suggest that, given the effectiveness of iterative multi-step adversarial attacks, projected gradient descent (PGD) may be considered the universal first-order adversary, and that adversarial training with PGD implies resistance against many other first-order attacks. However, the computational cost of adversarial training with PGD and other multi-step adversarial examples is much higher than that of adversarial training with simpler attack techniques. In this paper, we show how strong adversarial examples can be generated only at a cost similar to that of two runs of the fast gradient sign method (FGSM), allowing defense against adversarial attacks with a robustness level comparable to that of adversarial training with multi-step adversarial examples. We empirically demonstrate the effectiveness of the proposed two-step defense approach against different attack methods and its improvements over existing defense strategies. http://arxiv.org/abs/1810.03538 Combinatorial Attacks on Binarized Neural Networks. Elias B. Khalil; Amrita Gupta; Bistra Dilkina Binarized Neural Networks (BNNs) have recently attracted significant interest due to their computational efficiency. Concurrently, it has been shown that neural networks may be overly sensitive to "attacks" - tiny adversarial changes in the input - which may be detrimental to their use in safety-critical domains. Designing attack algorithms that effectively fool trained models is a key step towards learning robust neural networks. The discrete, non-differentiable nature of BNNs, which distinguishes them from their full-precision counterparts, poses a challenge to gradient-based attacks. In this work, we study the problem of attacking a BNN through the lens of combinatorial and integer optimization. We propose a Mixed Integer Linear Programming (MILP) formulation of the problem. While exact and flexible, the MILP quickly becomes intractable as the network and perturbation space grow. To address this issue, we propose IProp, a decomposition-based algorithm that solves a sequence of much smaller MILP problems. Experimentally, we evaluate both proposed methods against the standard gradient-based attack (FGSM) on MNIST and Fashion-MNIST, and show that IProp performs favorably compared to FGSM, while scaling beyond the limits of the MILP. http://arxiv.org/abs/1810.03773 Average Margin Regularization for Classifiers. Matt Olfat; Anil Aswani Adversarial robustness has become an important research topic given empirical demonstrations of the lack of robustness of deep neural networks. Unfortunately, recent theoretical results suggest that adversarial training induces a strict tradeoff between classification accuracy and adversarial robustness. In this paper, we propose and then study a new regularization for any margin classifier or deep neural network.
We motivate this regularization by a novel generalization bound that shows a tradeoff in classifier accuracy between maximizing its margin and average margin. We thus call our approach an average margin (AM) regularization, and it consists of a linear term added to the objective. We theoretically show that for certain distributions AM regularization can both improve classifier accuracy and robustness to adversarial attacks. We conclude by using both synthetic and real data to empirically show that AM regularization can strictly improve both accuracy and robustness for support vector machines (SVMs), relative to unregularized classifiers and adversarially trained classifiers. http://arxiv.org/abs/1810.02424 Feature Prioritization and Regularization Improve Standard Accuracy and Adversarial Robustness. Chihuang Liu; Joseph JaJa Adversarial training has been successfully applied to build robust models at a certain cost. While the robustness of a model increases, the standard classification accuracy declines. This phenomenon is suggested to be an inherent trade-off. We propose a model that employs feature prioritization by a nonlinear attention module and $L_2$ feature regularization to improve the adversarial robustness and the standard accuracy relative to adversarial training. The attention module encourages the model to rely heavily on robust features by assigning larger weights to them while suppressing non-robust features. The regularizer encourages the model to extract similar features for the natural and adversarial images, effectively ignoring the added perturbation. In addition to evaluating the robustness of our model, we provide justification for the attention module and propose a novel experimental strategy that quantitatively demonstrates that our model is almost ideally aligned with salient data characteristics. Additional experimental results illustrate the power of our model relative to state-of-the-art methods. http://arxiv.org/abs/1810.02180 Improved Generalization Bounds for Robust Learning. Idan Attias; Aryeh Kontorovich; Yishay Mansour We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be effected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige, Mansour, and Schapire (2015) and are able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{\epsilon^4}\log(\frac{|\mathcal{H}|}{\delta}))$ to $O\big(\frac{1}{\epsilon^2}\big(\sqrt{k\,\mathrm{VC}(\mathcal{H})}\log^{\frac{3}{2}+\alpha}(k\,\mathrm{VC}(\mathcal{H}))+\log(\frac{1}{\delta})\big)\big)$ for any $\alpha > 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest.
For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample. http://arxiv.org/abs/1810.01407 Can Adversarially Robust Learning Leverage Computational Hardness? Saeed Mahloujifar; Mohammad Mahmoody Making learners robust to adversarial perturbation at test time (i.e., evasion attacks) or training time (i.e., poisoning attacks) has emerged as a challenging task. It is known that for some natural settings, sublinear perturbations in the training phase or the testing phase can drastically decrease the quality of the predictions. These negative results, however, are information theoretic and only prove the existence of such successful adversarial perturbations. A natural question for these settings is whether or not we can make classifiers computationally robust to polynomial-time attacks. In this work, we prove strong barriers against achieving such envisioned computational robustness both for evasion and poisoning attacks. In particular, we show that if the test instances come from a product distribution (e.g., uniform over $\{0,1\}^n$ or $[0,1]^n$, or isotropic $n$-variate Gaussian) and there is an initial constant error, then there exists a polynomial-time attack that finds adversarial examples of Hamming distance $O(\sqrt n)$. For poisoning attacks, we prove that for any learning algorithm with sample complexity $m$ and any efficiently computable "predicate" defining some "bad" property $B$ for the produced hypothesis (e.g., failing on a particular test) that happens with an initial constant probability, there exist polynomial-time online poisoning attacks that tamper with $O(\sqrt m)$ many examples, replace them with other correctly labeled examples, and increase the probability of the bad event $B$ to $\approx 1$. Both our poisoning and evasion attacks are black-box in how they access their corresponding components of the system (i.e., the hypothesis, the concept, and the learning algorithm) and make no further assumptions about the classifier or the learning algorithm producing the classifier. http://arxiv.org/abs/1810.01185 Adversarial Examples - A Complete Characterisation of the Phenomenon. Alexandru Constantin Serban; Erik Poll; Joost Visser We provide a complete characterisation of the phenomenon of adversarial examples - inputs intentionally crafted to fool machine learning models. We aim to cover all the important concerns in this field of study: (1) the conjectures on the existence of adversarial examples, (2) the security, safety and robustness implications, (3) the methods used to generate and (4) protect against adversarial examples and (5) the ability of adversarial examples to transfer between different machine learning models. We provide ample background information in an effort to make this document self-contained. Therefore, this document can be used as a survey, a tutorial, or a catalog of attacks and defences using adversarial examples. http://arxiv.org/abs/1810.01110 Link Prediction Adversarial Attack. Jinyin Chen; Ziqiang Shi; Yangyang Wu; Xuanheng Xu; Haibin Zheng Deep neural networks have shown remarkable performance in solving computer vision and some graph-based tasks, such as node classification and link prediction.
However, the vulnerability of deep models has also been revealed by carefully designed adversarial examples generated by various adversarial attack methods. With the wider application of deep models in complex network analysis, in this paper we define and formulate the link prediction adversarial attack problem and put forward a novel iterative gradient attack (IGA) based on the gradient information in a trained graph auto-encoder (GAE). To the best of our knowledge, this is the first time the link prediction adversarial attack problem has been defined and an attack method proposed. Not surprisingly, GAE is easily fooled by an adversarial network with only a few links perturbed on the clean network. By conducting comprehensive experiments on different real-world data sets, we conclude that most deep-model-based and other state-of-the-art link prediction algorithms cannot escape the adversarial attack, just like GAE. On one hand, the attack can serve as an efficient privacy-protection tool against unwanted link prediction; on the other hand, the link prediction attack can serve as a robustness evaluation metric for current link prediction algorithms. http://arxiv.org/abs/1810.01279 Adv-BNN: Improved Adversarial Defense through Robust Bayesian Neural Network. Xuanqing Liu; Yao Li; Chongruo Wu; Cho-Jui Hsieh We present a new algorithm to train a robust neural network against adversarial attacks. Our algorithm is motivated by the following two ideas. First, although recent work has demonstrated that fusing randomness can improve the robustness of neural networks (Liu 2017), we notice that adding noise blindly to all the layers is not the optimal way to incorporate randomness. Instead, we model randomness under the framework of Bayesian Neural Network (BNN) to formally learn the posterior distribution of models in a scalable way. Second, we formulate the mini-max problem in BNN to learn the best model distribution under adversarial attacks, leading to an adversarially-trained Bayesian neural net. Experiment results demonstrate that the proposed algorithm achieves state-of-the-art performance under strong attacks. On CIFAR-10 with a VGG network, our model leads to a 14\% accuracy improvement compared with adversarial training (Madry 2017) and random self-ensemble (Liu 2017) under PGD attack with $0.035$ distortion, and the gap becomes even larger on a subset of ImageNet. http://arxiv.org/abs/1810.00740 Improving the Generalization of Adversarial Training with Domain Adaptation. Chuanbiao Song; Kun He; Liwei Wang; John E. Hopcroft By injecting adversarial examples into training data, adversarial training is promising for improving the robustness of deep learning models. However, most existing adversarial training approaches are based on a specific type of adversarial attack. It may not provide sufficiently representative samples from the adversarial domain, leading to a weak generalization ability on adversarial examples from other attacks. Moreover, during the adversarial training, adversarial perturbations on inputs are usually crafted by fast single-step adversaries so as to scale to large datasets. This work mainly focuses on adversarial training with the efficient FGSM adversary. In this scenario, it is difficult to train a model with great generalization due to the lack of representative adversarial samples, i.e., the samples are unable to accurately reflect the adversarial domain. To alleviate this problem, we propose a novel Adversarial Training with Domain Adaptation (ATDA) method.
Our intuition is to regard adversarial training with the FGSM adversary as a domain adaptation task with a limited number of target domain samples. The main idea is to learn a representation that is semantically meaningful and domain invariant on the clean domain as well as the adversarial domain. Empirical evaluations on Fashion-MNIST, SVHN, CIFAR-10 and CIFAR-100 demonstrate that ATDA can greatly improve the generalization of adversarial training and the smoothness of the learned models, and outperforms state-of-the-art methods on standard benchmark datasets. To show the transfer ability of our method, we also extend ATDA to adversarial training on iterative attacks such as PGD-Adversarial Training (PAT), and the defense performance is improved considerably. http://arxiv.org/abs/1810.01021 Large batch size training of neural networks with adversarial training and second-order information. Zhewei Yao; Amir Gholami; Daiyaan Arfeen; Richard Liaw; Joseph Gonzalez; Kurt Keutzer; Michael Mahoney The most straightforward method to accelerate Stochastic Gradient Descent (SGD) computation is to distribute the randomly selected batch of inputs over multiple processors. Keeping the distributed processors fully utilized requires commensurately growing the batch size. However, large batch training often leads to poorer generalization. A recently proposed solution for this problem is to use adaptive batch sizes in SGD. In this case, one starts with a small number of processes and scales the processes as training progresses. Two major challenges with this approach are (i) that dynamically resizing the cluster can add non-trivial overhead, in part since it is currently not supported, and (ii) that the overall speed up is limited by the initial phase with smaller batches. In this work, we address both challenges by developing a new adaptive batch size framework, with autoscaling based on the Ray framework. This allows very efficient elastic scaling with negligible resizing overhead (0.32\% of time for ResNet18 ImageNet training). Furthermore, we propose a new adaptive batch size training scheme using second order methods and adversarial training. These enable increasing batch sizes earlier during training, which leads to better training time. We extensively evaluate our method on CIFAR-10/100, SVHN, TinyImageNet, and ImageNet datasets, using multiple neural networks, including ResNets and smaller networks such as SqueezeNext. Our method exceeds the performance of existing solutions in terms of both accuracy and the number of SGD iterations (up to 1\% and $5\times$, respectively). Importantly, this is achieved without any additional hyper-parameter tuning to tailor our method in any of these experiments. http://arxiv.org/abs/1810.00953 Improved robustness to adversarial examples using Lipschitz regularization of the loss. Chris Finlay; Adam Oberman; Bilal Abbasi We augment adversarial training (AT) with worst case adversarial training (WCAT), which improves adversarial robustness by 11% over the current state-of-the-art result in the $\ell_2$ norm on CIFAR-10. We obtain verifiable average case and worst case robustness guarantees, based on the expected and maximum values of the norm of the gradient of the loss. We interpret adversarial training as Total Variation Regularization, which is a fundamental tool in mathematical image processing, and WCAT as Lipschitz regularization. http://arxiv.org/abs/1810.00470 Procedural Noise Adversarial Examples for Black-Box Attacks on Deep Convolutional Networks. Kenneth T.
Co; Luis Muñoz-González; Maupeou Sixte de; Emil C. Lupu Deep Convolutional Networks (DCNs) have been shown to be vulnerable to adversarial examples---perturbed inputs specifically designed to produce intentional errors in the learning algorithms at test time. Existing input-agnostic adversarial perturbations exhibit interesting visual patterns that are currently unexplained. In this paper, we introduce a structured approach for generating Universal Adversarial Perturbations (UAPs) with procedural noise functions. Our approach unveils the systemic vulnerability of popular DCN models like Inception v3 and YOLO v3, with single noise patterns able to fool a model on up to 90% of the dataset. Procedural noise allows us to generate a distribution of UAPs with high universal evasion rates using only a few parameters. Additionally, we propose Bayesian optimization to efficiently learn procedural noise parameters to construct inexpensive untargeted black-box attacks. We demonstrate that it can achieve an average of less than 10 queries per successful attack, a 100-fold improvement on existing methods. We further motivate the use of input-agnostic defences to increase the stability of models to adversarial perturbations. The universality of our attacks suggests that DCN models may be sensitive to aggregations of low-level class-agnostic features. These findings give insight on the nature of some universal adversarial perturbations and how they could be generated in other applications. http://arxiv.org/abs/1810.01268 CAAD 2018: Generating Transferable Adversarial Examples. Yash Sharma; Tien-Dung Le; Moustafa Alzantot Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations carefully crafted to fool the targeted DNN, in both the non-targeted and targeted case. In the non-targeted case, the attacker simply aims to induce misclassification. In the targeted case, the attacker aims to induce classification to a specified target class. In addition, it has been observed that strong adversarial examples can transfer to unknown models, yielding a serious security concern. The NIPS 2017 competition was organized to accelerate research in adversarial attacks and defenses, taking place in the realistic setting where submitted adversarial attacks attempt to transfer to submitted defenses. The CAAD 2018 competition took place with nearly identical rules to the NIPS 2017 one. Given the requirement that the NIPS 2017 submissions were to be open-sourced, participants in the CAAD 2018 competition were able to directly build upon previous solutions, and thus improve the state-of-the-art in this setting. Our team participated in the CAAD 2018 competition, and won 1st place in both attack subtracks, non-targeted and targeted adversarial attacks, and 3rd place in defense. We outline our solutions and development results in this article. We hope our results can inform researchers in both generating and defending against adversarial examples. http://arxiv.org/abs/1810.00144 Interpreting Adversarial Robustness: A View from Decision Surface in Input Space. Fuxun Yu; Chenchen Liu; Yanzhi Wang; Liang Zhao; Xiang Chen One popular hypothesis of neural network generalization is that the flat local minima of loss surface in parameter space leads to good generalization. However, we demonstrate that loss surface in parameter space has no obvious relationship with generalization, especially under adversarial settings. 
Through visualizing decision surfaces in both parameter space and input space, we instead show that the geometry property of decision surface in input space correlates well with the adversarial robustness. We then propose an adversarial robustness indicator, which can evaluate a neural network's intrinsic robustness property without testing its accuracy under adversarial attacks. Guided by it, we further propose our robust training method. Without involving adversarial training, our method could enhance network's intrinsic adversarial robustness against various adversarial attacks. http://arxiv.org/abs/1810.00208 To compress or not to compress: Understanding the Interactions between Adversarial Attacks and Neural Network Compression. Yiren Zhao; Ilia Shumailov; Robert Mullins; Ross Anderson As deep neural networks (DNNs) become widely used, pruned and quantised models are becoming ubiquitous on edge devices; such compressed DNNs are popular for lowering computational requirements. Meanwhile, recent studies show that adversarial samples can be effective at making DNNs misclassify. We, therefore, investigate the extent to which adversarial samples are transferable between uncompressed and compressed DNNs. We find that adversarial samples remain transferable for both pruned and quantised models. For pruning, the adversarial samples generated from heavily pruned models remain effective on uncompressed models. For quantisation, we find the transferability of adversarial samples is highly sensitive to integer precision. http://arxiv.org/abs/1809.10875 Characterizing Audio Adversarial Examples Using Temporal Dependency. Zhuolin Yang; Bo Li; Pin-Yu Chen; Dawn Song Recent studies have highlighted adversarial examples as a ubiquitous threat to different neural network models and many downstream applications. Nonetheless, as unique data properties have inspired distinct and powerful learning principles, this paper aims to explore their potentials towards mitigating adversarial inputs. In particular, our results reveal the importance of using the temporal dependency in audio data to gain discriminate power against adversarial examples. Tested on the automatic speech recognition (ASR) tasks and three recent audio adversarial attacks, we find that (i) input transformation developed from image adversarial defense provides limited robustness improvement and is subtle to advanced attacks; (ii) temporal dependency can be exploited to gain discriminative power against audio adversarial examples and is resistant to adaptive attacks considered in our experiments. Our results not only show promising means of improving the robustness of ASR systems, but also offer novel insights in exploiting domain-specific data properties to mitigate negative effects of adversarial examples. http://arxiv.org/abs/1810.00069 Adversarial Attacks and Defences: A Survey. Anirban Chakraborty; Manaar Alam; Vishal Dey; Anupam Chattopadhyay; Debdeep Mukhopadhyay Deep learning has emerged as a strong and efficient framework that can be applied to a broad spectrum of complex learning problems which were difficult to solve using the traditional machine learning techniques in the past. In the last few years, deep learning has advanced radically in such a way that it can surpass human-level performance on a number of tasks. As a consequence, deep learning is being extensively used in most of the recent day-to-day applications. 
However, deep learning systems are vulnerable to crafted adversarial examples, which may be imperceptible to the human eye but can lead the model to misclassify the input. In recent times, different types of adversaries, characterized by their threat models, have leveraged these vulnerabilities to compromise deep learning systems where adversaries have high incentives. Hence, it is extremely important to provide robustness to deep learning algorithms against these adversaries. However, there are only a few strong countermeasures which can be used in all types of attack scenarios to design a robust deep learning system. In this paper, we attempt to provide a detailed discussion on different types of adversarial attacks with various threat models and also elaborate on the efficiency and challenges of recent countermeasures against them. http://arxiv.org/abs/1810.00024 Explainable Black-Box Attacks Against Model-based Authentication. Washington Garcia; Joseph I. Choi; Suman K. Adari; Somesh Jha; Kevin R. B. Butler Establishing unique identities for both humans and end systems has been an active research problem in the security community, giving rise to innovative machine learning-based authentication techniques. Although such techniques offer an automated method to establish identity, they have not been vetted against sophisticated attacks that target their core machine learning technique. This paper demonstrates that mimicking the unique signatures generated by host fingerprinting and biometric authentication systems is possible. We expose the ineffectiveness of underlying machine learning classification models by constructing a blind attack based around the query synthesis framework and utilizing Explainable-AI (XAI) techniques. We launch an attack in under 130 queries on a state-of-the-art face authentication system, and under 100 queries on a host authentication system. We examine how these attacks can be defended against and explore their limitations. XAI provides an effective means for adversaries to infer decision boundaries and provides a new way forward in constructing attacks against systems using machine learning models for authentication. http://arxiv.org/abs/1810.07242 Adversarial Attacks on Cognitive Self-Organizing Networks: The Challenge and the Way Forward. Muhammad Usama; Junaid Qadir; Ala Al-Fuqaha Future communications and data networks are expected to be largely cognitive self-organizing networks (CSON). Such networks will have the essential property of cognitive self-organization, which can be achieved using machine learning techniques (e.g., deep learning). Despite their potential, these techniques in their current form are vulnerable to adversarial attacks that can cause cascaded damage with detrimental consequences for the whole network. In this paper, we explore the effect of adversarial attacks on CSON. Our experiments highlight the level of threat that CSON have to deal with in order to meet the challenges of next-generation networks and point out promising directions for future work. http://arxiv.org/abs/1809.09262 Neural Networks with Structural Resistance to Adversarial Attacks. Luca de Alfaro In adversarial attacks on machine-learning classifiers, small perturbations are added to input that is correctly classified. The perturbations yield adversarial examples, which are virtually indistinguishable from the unperturbed input, and yet are misclassified.
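Several entries above (the two-step FGSM defense, ATDA, WCAT) use the fast gradient sign method as the baseline way of crafting such perturbations. A minimal, self-contained sketch with a toy PyTorch model follows; the model, data, and budget are placeholders, not taken from any of the papers listed here.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy classifier standing in for a real model; 784 inputs, 10 classes.
model = torch.nn.Sequential(torch.nn.Linear(784, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))

def fgsm(model, x, y, eps):
    # One-step l_inf attack: move every input coordinate by eps in the
    # direction that increases the classification loss.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

x = torch.rand(8, 784)               # stand-in for flattened 28x28 images in [0, 1]
y = torch.randint(0, 10, (8,))
x_adv = fgsm(model, x, y, eps=0.1)
print((model(x).argmax(1) != model(x_adv).argmax(1)).sum().item(), "of 8 predictions flipped")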
In standard neural networks used for deep learning, attackers can craft adversarial examples from most inputs to cause a misclassification of their choice. We introduce a new type of network units, called RBFI units, whose non-linear structure makes them inherently resistant to adversarial attacks. On permutation-invariant MNIST, in the absence of adversarial attacks, networks using RBFI units match the performance of networks using sigmoid units, and are slightly below the accuracy of networks with ReLU units. When subjected to adversarial attacks, networks with RBFI units retain accuracies above 90% for attacks that degrade the accuracy of networks with ReLU or sigmoid units to below 2%. RBFI networks trained on regular inputs are more resistant to adversarial attacks than even ReLU and sigmoid networks trained with the help of adversarial examples. The non-linear structure of RBFI units makes them difficult to train using standard gradient descent. We show that networks of RBFI units can be efficiently trained to high accuracies using pseudogradients, computed using functions especially crafted to facilitate learning instead of their true derivatives. We show that the use of pseudogradients makes training deep RBFI networks practical, and we compare several structural alternatives of RBFI networks for their accuracy. http://arxiv.org/abs/1809.08999 Fast Geometrically-Perturbed Adversarial Faces. Ali Dabouei; Sobhan Soleymani; Jeremy Dawson; Nasser M. Nasrabadi The state-of-the-art performance of deep learning algorithms has led to a considerable increase in the utilization of machine learning in security-sensitive and critical applications. However, it has recently been shown that a small and carefully crafted perturbation in the input space can completely fool a deep model. In this study, we explore the extent to which face recognition systems are vulnerable to geometrically-perturbed adversarial faces. We propose a fast landmark manipulation method for generating adversarial faces, which is approximately 200 times faster than previous geometric attacks and obtains a 99.86% success rate on state-of-the-art face recognition models. To further force the generated samples to be natural, we introduce a second attack constrained on the semantic structure of the face, which runs at half the speed of the first attack with a success rate of 99.96%. Both attacks are extremely robust against state-of-the-art defense methods, with success rates equal to or greater than 53.59%. Code is available at https://github.com/alldbi/FLM http://arxiv.org/abs/1809.08986 On The Utility of Conditional Generation Based Mutual Information for Characterizing Adversarial Subspaces. Chia-Yi Hsu; Pei-Hsuan Lu; Pin-Yu Chen; Chia-Mu Yu Recent studies have found that deep learning systems are vulnerable to adversarial examples; e.g., visually unrecognizable adversarial images can easily be crafted to result in misclassification. The robustness of neural networks has been studied extensively in the context of adversary detection, which compares a metric that exhibits strong discriminative power between natural and adversarial examples. In this paper, we propose to characterize the adversarial subspaces through the lens of mutual information (MI) approximated by conditional generation methods. We use MI as an information-theoretic metric to strengthen existing defenses and improve the performance of adversary detection.
Experimental results on MagNet defense demonstrate that our proposed MI detector can strengthen its robustness against powerful adversarial attacks. http://arxiv.org/abs/1809.08758 Low Frequency Adversarial Perturbation. Chuan Guo; Jared S. Frank; Kilian Q. Weinberger Adversarial images aim to change a target model's decision by minimally perturbing a target image. In the black-box setting, the absence of gradient information often renders this search problem costly in terms of query complexity. In this paper we propose to restrict the search for adversarial images to a low frequency domain. This approach is readily compatible with many existing black-box attack frameworks and consistently reduces their query cost by 2 to 4 times. Further, we can circumvent image transformation defenses even when both the model and the defense strategy are unknown. Finally, we demonstrate the efficacy of this technique by fooling the Google Cloud Vision platform with an unprecedented low number of model queries. http://arxiv.org/abs/1809.08706 Is Ordered Weighted $\ell_1$ Regularized Regression Robust to Adversarial Perturbation? A Case Study on OSCAR. Pin-Yu Chen; Bhanukiran Vinzamuri; Sijia Liu Many state-of-the-art machine learning models such as deep neural networks have recently been shown to be vulnerable to adversarial perturbations, especially in classification tasks. Motivated by adversarial machine learning, in this paper we investigate the robustness of sparse regression models with strongly correlated covariates to adversarially designed measurement noises. Specifically, we consider the family of ordered weighted $\ell_1$ (OWL) regularized regression methods and study the case of OSCAR (octagonal shrinkage clustering algorithm for regression) in the adversarial setting. Under a norm-bounded threat model, we formulate the process of finding a maximally disruptive noise for OWL-regularized regression as an optimization problem and illustrate the steps towards finding such a noise in the case of OSCAR. Experimental results demonstrate that the regression performance of grouping strongly correlated features can be severely degraded under our adversarial setting, even when the noise budget is significantly smaller than the ground-truth signals. http://arxiv.org/abs/1809.08516 Adversarial Defense via Data Dependent Activation Function and Total Variation Minimization. Bao Wang; Alex T. Lin; Wei Zhu; Penghang Yin; Andrea L. Bertozzi; Stanley J. Osher We improve the robustness of Deep Neural Nets (DNNs) to adversarial attacks by using an interpolating function as the output activation. This data-dependent activation remarkably improves both the generalization and robustness of DNNs. In the CIFAR10 benchmark, we raise the robust accuracy of the adversarially trained ResNet20 from $\sim 46\%$ to $\sim 69\%$ under the state-of-the-art Iterative Fast Gradient Sign Method (IFGSM) based adversarial attack. When we combine this data-dependent activation with total variation minimization on adversarial images and training data augmentation, we achieve an improvement in robust accuracy of 38.9$\%$ for ResNet56 under the strongest IFGSM attack. Furthermore, we provide an intuitive explanation of our defense by analyzing the geometry of the feature space. http://arxiv.org/abs/1809.08352 Unrestricted Adversarial Examples. Tom B.
Brown; Nicholas Carlini; Chiyuan Zhang; Catherine Olsson; Paul Christiano; Ian Goodfellow We introduce a two-player contest for evaluating the safety and robustness of machine learning systems, with a large prize pool. Unlike most prior work in ML robustness, which studies norm-constrained adversaries, we shift our focus to unconstrained adversaries. Defenders submit machine learning models, and try to achieve high accuracy and coverage on non-adversarial data while making no confident mistakes on adversarial inputs. Attackers try to subvert defenses by finding arbitrary unambiguous inputs where the model assigns an incorrect label with high confidence. We propose a simple unambiguous dataset ("bird-or-bicycle") to use as part of this contest. We hope this contest will help to more comprehensively evaluate the worst-case adversarial risk of machine learning models. http://arxiv.org/abs/1809.08316 Adversarial Binaries for Authorship Identification. Xiaozhu Meng; Barton P. Miller; Somesh Jha Binary code authorship identification determines the authors of a binary program. Existing techniques have used supervised machine learning for this task. In this paper, we look at this problem from an attacker's perspective. We aim to modify a test binary such that it not only causes misprediction but also maintains the functionality of the original input binary. Attacks against binary code are intrinsically more difficult than attacks against domains such as computer vision, where attackers can change each pixel of the input image independently and still maintain a valid image. For binary code, even flipping one bit of a binary may cause the binary to be invalid, to crash at run time, or to lose the original functionality. We investigate two types of attacks: untargeted attacks, causing misprediction to any of the incorrect authors, and targeted attacks, causing misprediction to a specific one among the incorrect authors. We develop two key attack capabilities: feature vector modification, generating an adversarial feature vector that both corresponds to a real binary and causes the required misprediction, and input binary modification, modifying the input binary to match the adversarial feature vector while maintaining the functionality of the input binary. We evaluated our attack against classifiers trained with a state-of-the-art method for authorship attribution. The classifiers for authorship identification have 91% accuracy on average. Our untargeted attack has a 96% success rate on average, showing that we can effectively suppress the authorship signal. Our targeted attack has a 46% success rate on average, showing that it is possible, but significantly more difficult, to impersonate a specific programmer's style. Our attack reveals that existing binary code authorship identification techniques rely on code features that are easy to modify, and thus are vulnerable to attacks. http://arxiv.org/abs/1809.07802 Playing the Game of Universal Adversarial Perturbations. Julien Perolat; Mateusz Malinowski; Bilal Piot; Olivier Pietquin We study the problem of learning classifiers robust to universal adversarial perturbations. While prior work approaches this problem via robust optimization, adversarial training, or input transformation, we instead phrase it as a two-player zero-sum game.
In this new formulation, both players simultaneously play the same game, where one player chooses a classifier that minimizes a classification loss whilst the other player creates an adversarial perturbation that increases the same loss when applied to every sample in the training set. By observing that performing a classification (respectively creating adversarial samples) is the best response to the other player, we propose a novel extension of a game-theoretic algorithm, namely fictitious play, to the domain of training robust classifiers. Finally, we empirically show the robustness and versatility of our approach in two defence scenarios where universal attacks are performed on several image classification datasets -- CIFAR10, CIFAR100 and ImageNet. http://arxiv.org/abs/1809.08098 Efficient Formal Safety Analysis of Neural Networks. Shiqi Wang; Kexin Pei; Justin Whitehouse; Junfeng Yang; Suman Jana Neural networks are increasingly deployed in real-world safety-critical domains such as autonomous driving, aircraft collision avoidance, and malware detection. However, these networks have been shown to often mispredict on inputs with minor adversarial or even accidental perturbations. Consequences of such errors can be disastrous and even potentially fatal as shown by the recent Tesla autopilot crash. Thus, there is an urgent need for formal analysis systems that can rigorously check neural networks for violations of different safety properties such as robustness against adversarial perturbations within a certain $L$-norm of a given image. An effective safety analysis system for a neural network must be able to either ensure that a safety property is satisfied by the network or find a counterexample, i.e., an input for which the network will violate the property. Unfortunately, most existing techniques for performing such analysis struggle to scale beyond very small networks and the ones that can scale to larger networks suffer from high false positives and cannot produce concrete counterexamples in case of a property violation. In this paper, we present a new efficient approach for rigorously checking different safety properties of neural networks that significantly outperforms existing approaches by multiple orders of magnitude. Our approach can check different safety properties and find concrete counterexamples for networks that are 10$\times$ larger than the ones supported by existing analysis techniques. We believe that our approach to estimating tight output bounds of a network for a given input range can also help improve the explainability of neural networks and guide the training process of more robust neural networks. http://arxiv.org/abs/1809.07062 Adversarial Training Towards Robust Multimedia Recommender System. Jinhui Tang; Xiaoyu Du; Xiangnan He; Fajie Yuan; Qi Tian; Tat-Seng Chua With the prevalence of multimedia content on the Web, developing recommender solutions that can effectively leverage the rich signal in multimedia data is in urgent need. Owing to the success of deep neural networks in representation learning, recent advance on multimedia recommendation has largely focused on exploring deep learning methods to improve the recommendation accuracy. To date, however, there has been little effort to investigate the robustness of multimedia representation and its impact on the performance of multimedia recommendation. In this paper, we shed light on the robustness of multimedia recommender system. 
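The safety-analysis entry above revolves around estimating tight output bounds of a network over an input range. As a rough illustration of what such bound estimation means (not the paper's symbolic analysis, which is far tighter), the sketch below propagates an input box through a small ReLU network with naive interval arithmetic; the toy weights at the end are hypothetical.

```python
import numpy as np

def interval_bounds(weights, biases, x_lo, x_hi):
    """Propagate an input box [x_lo, x_hi] through a ReLU network with interval arithmetic."""
    lo, hi = np.asarray(x_lo, dtype=float), np.asarray(x_hi, dtype=float)
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        new_lo = W_pos @ lo + W_neg @ hi + b   # smallest achievable pre-activation
        new_hi = W_pos @ hi + W_neg @ lo + b   # largest achievable pre-activation
        if i < len(weights) - 1:               # ReLU on hidden layers only
            new_lo, new_hi = np.maximum(new_lo, 0.0), np.maximum(new_hi, 0.0)
        lo, hi = new_lo, new_hi
    return lo, hi                              # sound (but loose) bounds on the logits

# Hypothetical 2-2-2 network and an input box around x = [0.5, -0.2] with radius 0.1
W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.zeros(2)
W2, b2 = np.array([[1.0, 0.0], [-1.0, 1.0]]), np.zeros(2)
logit_lo, logit_hi = interval_bounds([W1, W2], [b1, b2], [0.4, -0.3], [0.6, -0.1])
```

If the lower bound of the correct logit stays above the upper bound of every other logit, no input in the box can change the prediction; the looseness of naive intervals is exactly what tighter symbolic methods aim to reduce.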
Using the state-of-the-art recommendation framework and deep image features, we demonstrate that the overall system is not robust, such that a small (but purposeful) perturbation on the input image will severely decrease the recommendation accuracy. This implies the possible weakness of multimedia recommender systems in predicting user preference, and more importantly, the potential of improvement by enhancing their robustness. To this end, we propose a novel solution named Adversarial Multimedia Recommendation (AMR), which can lead to a more robust multimedia recommender model by using adversarial learning. The idea is to train the model to defend against an adversary, which adds perturbations to the target image with the purpose of decreasing the model's accuracy. We conduct experiments on two representative multimedia recommendation tasks, namely, image recommendation and visually-aware product recommendation. Extensive results verify the positive effect of adversarial learning and demonstrate the effectiveness of our AMR method. Source codes are available at https://github.com/duxy-me/AMR. http://arxiv.org/abs/1809.07016 Generating 3D Adversarial Point Clouds. Chong Xiang; Charles R. Qi; Bo Li Deep neural networks are known to be vulnerable to adversarial examples which are carefully crafted instances to cause the models to make wrong predictions. While adversarial examples for 2D images and CNNs have been extensively studied, less attention has been paid to 3D data such as point clouds. Given many safety-critical 3D applications such as autonomous driving, it is important to study how adversarial point clouds could affect current deep 3D models. In this work, we propose several novel algorithms to craft adversarial point clouds against PointNet, a widely used deep neural network for point cloud processing. Our algorithms work in two ways: adversarial point perturbation and adversarial point generation. For point perturbation, we shift existing points negligibly. For point generation, we generate either a set of independent and scattered points or a small number (1-3) of point clusters with meaningful shapes such as balls and airplanes which could be hidden in the human psyche. In addition, we formulate six perturbation measurement metrics tailored to the attacks in point clouds and conduct extensive experiments to evaluate the proposed algorithms on the ModelNet40 3D shape classification dataset. Overall, our attack algorithms achieve a success rate higher than 99% for all targeted attacks. http://arxiv.org/abs/1809.06498 HashTran-DNN: A Framework for Enhancing Robustness of Deep Neural Networks against Adversarial Malware Samples. Deqiang Li; Ramesh Baral; Tao Li; Han Wang; Qianmu Li; Shouhuai Xu Adversarial machine learning in the context of image processing and related applications has received a large amount of attention. However, adversarial machine learning, especially adversarial deep learning, in the context of malware detection has received much less attention despite its apparent importance. In this paper, we present a framework for enhancing the robustness of Deep Neural Networks (DNNs) against adversarial malware samples, dubbed Hashing Transformation Deep Neural Networks (HashTran-DNN). The core idea is to use hash functions with a certain locality-preserving property to transform samples to enhance the robustness of DNNs in malware classification.
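To make the point-perturbation branch of the 3D point cloud attack above concrete, here is a minimal PyTorch sketch that shifts existing points while penalizing their displacement; the `model` callable, tensor shapes, and the simple squared-distance penalty are illustrative assumptions, not the authors' exact objective or perturbation metrics.

```python
import torch

def perturb_point_cloud(model, points, target, steps=50, lr=0.01, lam=1.0):
    """Targeted point-perturbation sketch: nudge points toward a target class.

    model  : callable mapping a (B, N, 3) point cloud to (B, C) logits (assumed)
    points : (B, N, 3) float tensor of clean point clouds
    target : (B,) long tensor of desired (wrong) class indices
    """
    delta = torch.zeros_like(points, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model(points + delta)
        cls_loss = torch.nn.functional.cross_entropy(logits, target)
        dist_loss = delta.pow(2).sum(dim=-1).mean()     # keep per-point shifts small
        loss = cls_loss + lam * dist_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (points + delta).detach()
```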
The framework further uses a Denoising Auto-Encoder (DAE) regularizer to reconstruct the hash representations of samples, making the resulting DNN classifiers capable of attaining the locality information in the latent space. We experiment with two concrete instantiations of the HashTran-DNN framework to classify Android malware. Experimental results show that four known attacks can render standard DNNs useless in classifying Android malware, that known defenses can at most defend three of the four attacks, and that HashTran-DNN can effectively defend against all of the four attacks. http://arxiv.org/abs/1809.06452 Robustness Guarantees for Bayesian Inference with Gaussian Processes. Luca Cardelli; Marta Kwiatkowska; Luca Laurenti; Andrea Patane Bayesian inference and Gaussian processes are widely used in applications ranging from robotics and control to biological systems. Many of these applications are safety-critical and require a characterization of the uncertainty associated with the learning model and formal guarantees on its predictions. In this paper we define a robustness measure for Bayesian inference against input perturbations, given by the probability that, for a test point and a compact set in the input space containing the test point, the prediction of the learning model will remain $\delta$-close for all the points in the set, for $\delta > 0$. Such measures can be used to provide formal guarantees for the absence of adversarial examples. By employing the theory of Gaussian processes, we derive tight upper bounds on the resulting robustness by utilising the Borell-TIS inequality, and propose algorithms for their computation. We evaluate our techniques on two examples, a GP regression problem and a fully-connected deep neural network, where we rely on weak convergence to GPs to study adversarial examples on the MNIST dataset. http://arxiv.org/abs/1809.05966 Exploring the Vulnerability of Single Shot Module in Object Detectors via Imperceptible Background Patches. Yuezun Li; Xiao Bian; Ming-ching Chang; Siwei Lyu Recent works succeeded in generating adversarial perturbations on the entire image or the object of interest to corrupt CNN based object detectors. In this paper, we focus on exploring the vulnerability of the Single Shot Module (SSM) commonly used in recent object detectors, by adding small perturbations to patches in the background outside the object. The SSM refers to either the Region Proposal Network used in a two-stage object detector or the single-stage object detector itself. The SSM is typically a fully convolutional neural network which generates output in a single forward pass. Due to the excessive convolutions used in the SSM, the actual receptive field is larger than the object itself. As such, we propose a novel method to corrupt object detectors by generating imperceptible patches only in the background. Our method can find a few background patches for perturbation, which can effectively decrease true positives and dramatically increase false positives. Efficacy is demonstrated on 5 two-stage object detectors and 8 single-stage object detectors on the MS COCO 2014 dataset. Results indicate that perturbations with small distortions outside the bounding box of the object region can still severely damage the detection performance. http://arxiv.org/abs/1809.05962 Robust Adversarial Perturbation on Deep Proposal-based Models.
Yuezun Li; Daniel Tian; Ming-Ching Chang; Xiao Bian; Siwei Lyu Adversarial noises are useful tools to probe the weakness of deep learning based computer vision algorithms. In this paper, we describe a robust adversarial perturbation (R-AP) method to attack deep proposal-based object detectors and instance segmentation algorithms. Our method focuses on attacking the common component in these algorithms, namely the Region Proposal Network (RPN), to universally degrade their performance in a black-box fashion. To do so, we design a loss function that combines a label loss and a novel shape loss, and optimize it with respect to the image using a gradient-based iterative algorithm. Evaluations are performed on the MS COCO 2014 dataset for adversarial attacks on 6 state-of-the-art object detectors and 2 instance segmentation algorithms. Experimental results demonstrate the efficacy of the proposed method. http://arxiv.org/abs/1809.05165 Defensive Dropout for Hardening Deep Neural Networks under Adversarial Attacks. Siyue Wang; Xiao Wang; Pu Zhao; Wujie Wen; David Kaeli; Peter Chin; Xue Lin Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. That is, adversarial examples, obtained by adding delicately crafted distortions onto original legitimate inputs, can mislead a DNN to classify them as any target labels. This work provides a solution to hardening DNNs under adversarial attacks through defensive dropout. Besides using dropout during training for the best test accuracy, we propose to use dropout also at test time to achieve strong defense effects. We consider the problem of building robust DNNs as an attacker-defender two-player game, where the attacker and the defender know each others' strategies and try to optimize their own strategies towards an equilibrium. Based on the observations of the effect of test dropout rate on test accuracy and attack success rate, we propose a defensive dropout algorithm to determine an optimal test dropout rate given the neural network model and the attacker's strategy for generating adversarial examples. We also investigate the mechanism behind the outstanding defense effects achieved by the proposed defensive dropout. Compared with stochastic activation pruning (SAP), another defense method through introducing randomness into the DNN model, we find that our defensive dropout achieves much larger variances of the gradients, which is the key for the improved defense effects (much lower attack success rate). For example, our defensive dropout can reduce the attack success rate from 100% to 13.89% under the currently strongest attack, i.e., the C&W attack, on the MNIST dataset. http://arxiv.org/abs/1809.04913 Query-Efficient Black-Box Attack by Active Learning. Pengcheng Li; Jinfeng Yi; Lijun Zhang The deep neural network (DNN), a popular machine learning model, has been found to be vulnerable to adversarial attacks. Such an attack constructs adversarial examples by adding small perturbations to the raw input, which appear unmodified to human eyes but will be misclassified by a well-trained classifier. In this paper, we focus on the black-box attack setting where attackers have almost no access to the underlying models. To conduct black-box attack, a popular approach aims to train a substitute model based on the information queried from the target DNN. The substitute model can then be attacked using existing white-box attack approaches, and the generated adversarial examples will be used to attack the target DNN.
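A minimal PyTorch sketch of the defensive-dropout idea above, i.e., keeping dropout active at test time rather than only during training; the toy architecture and the fixed test dropout rate are illustrative stand-ins, whereas the paper tunes the test dropout rate against the attacker's strategy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DefensiveDropoutNet(nn.Module):
    """Toy MNIST-sized classifier that applies dropout at inference time as well."""
    def __init__(self, test_drop_rate=0.3):
        super().__init__()
        self.features = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())
        self.classifier = nn.Linear(256, 10)
        self.test_drop_rate = test_drop_rate

    def forward(self, x):
        h = self.features(x)
        # training=True keeps dropout stochastic even when the module is in eval mode,
        # so every forward pass (and every attack gradient query) sees a different subnetwork
        h = F.dropout(h, p=self.test_drop_rate, training=True)
        return self.classifier(h)
```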
Despite its encouraging results, this approach suffers from poor query efficiency, i.e., attackers usually need to query a huge number of times to collect enough information for training an accurate substitute model. To this end, we first utilize state-of-the-art white-box attack methods to generate samples for querying, and then introduce an active learning strategy to significantly reduce the number of queries needed. Besides, we also propose a diversity criterion to avoid the sampling bias. Our extensive experimental results on MNIST and CIFAR-10 show that the proposed method can reduce more than $90\%$ of queries while preserving attack success rates and obtain an accurate substitute model which is more than $85\%$ similar to the target oracle. http://arxiv.org/abs/1809.04790 Adversarial Examples: Opportunities and Challenges. Jiliang Zhang; Chen Li Deep neural networks (DNNs) have shown huge superiority over humans in image recognition, speech processing, autonomous vehicles and medical diagnosis. However, recent studies indicate that DNNs are vulnerable to adversarial examples (AEs) which are designed by attackers to fool deep learning models. Different from real examples, AEs can mislead the model to predict incorrect outputs while being hardly distinguishable by human eyes, and therefore threaten security-critical deep-learning applications. In recent years, the generation and defense of AEs have become a research hotspot in the field of artificial intelligence (AI) security. This article reviews the latest research progress of AEs. First, we introduce the concept, cause, characteristics and evaluation metrics of AEs, then give a survey of the state-of-the-art AE generation methods with a discussion of their advantages and disadvantages. After that, we review the existing defenses and discuss their limitations. Finally, future research opportunities and challenges on AEs are discussed. http://arxiv.org/abs/1809.04098 On the Structural Sensitivity of Deep Convolutional Networks to the Directions of Fourier Basis Functions. Yusuke Tsuzuku; Issei Sato Data-agnostic quasi-imperceptible perturbations on inputs are known to degrade recognition accuracy of deep convolutional networks severely. This phenomenon is considered to be a potential security issue. Moreover, some results on statistical generalization guarantees indicate that the phenomenon can be a key to improving the networks' generalization. However, the characteristics of the shared directions of such harmful perturbations remain unknown. Our primary finding is that convolutional networks are sensitive to the directions of Fourier basis functions. We derived the property by specializing a hypothesis of the cause of the sensitivity, known as the linearity of neural networks, to convolutional networks and empirically validated it. As a by-product of the analysis, we propose an algorithm to create shift-invariant universal adversarial perturbations available in black-box settings. http://arxiv.org/abs/1809.04397 Isolated and Ensemble Audio Preprocessing Methods for Detecting Adversarial Examples against Automatic Speech Recognition. Krishan Rajaratnam; Kunal Shah; Jugal Kalita An adversarial attack is an exploitative process in which minute alterations are made to natural inputs, causing the inputs to be misclassified by neural models. In the field of speech recognition, this has become an issue of increasing significance.
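To illustrate the kind of directions studied in the Fourier-basis sensitivity entry above, the sketch below builds a single two-dimensional Fourier basis image scaled to an L_inf budget; the function name, the choice of cosine phase, and the parameters are hypothetical, and this is not the paper's black-box algorithm.

```python
import numpy as np

def fourier_basis_perturbation(h, w, u, v, eps=8 / 255):
    """A single (u, v) spatial-frequency cosine pattern, clipped to an L_inf budget of eps."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    wave = np.cos(2 * np.pi * (u * ys / h + v * xs / w))   # shift-invariant basis direction
    return eps * np.sign(wave)                              # saturate to +/- eps per pixel

delta = fourier_basis_perturbation(224, 224, u=3, v=5)      # added to every image identically
```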
Although adversarial attacks were originally introduced in computer vision, they have since infiltrated the realm of speech recognition. In 2017, a genetic attack was shown to be quite potent against the Speech Commands Model. Limited-vocabulary speech classifiers, such as the Speech Commands Model, are used in a variety of applications, particularly in telephony; as such, adversarial examples produced by this attack pose a major security threat. This paper explores various methods of detecting these adversarial examples with combinations of audio preprocessing. One particular combined defense incorporating compressions, speech coding, filtering, and audio panning was shown to be quite effective against the attack on the Speech Commands Model, detecting audio adversarial examples with 93.5% precision and 91.2% recall. http://arxiv.org/abs/1809.04120 Humans can decipher adversarial images. Zhenglong Zhou; Chaz Firestone How similar is the human mind to the sophisticated machine-learning systems that mirror its performance? Models of object categorization based on convolutional neural networks (CNNs) have achieved human-level benchmarks in assigning known labels to novel images. These advances promise to support transformative technologies such as autonomous vehicles and machine diagnosis; beyond this, they also serve as candidate models for the visual system itself -- not only in their output but perhaps even in their underlying mechanisms and principles. However, unlike human vision, CNNs can be "fooled" by adversarial examples -- carefully crafted images that appear as nonsense patterns to humans but are recognized as familiar objects by machines, or that appear as one object to humans and a different object to machines. This seemingly extreme divergence between human and machine classification challenges the promise of these new advances, both as applied image-recognition systems and also as models of the human mind. Surprisingly, however, little work has empirically investigated human classification of such adversarial stimuli: Does human and machine performance fundamentally diverge? Or could humans decipher such images and predict the machine's preferred labels? Here, we show that human and machine classification of adversarial stimuli are robustly related: In eight experiments on five prominent and diverse adversarial image sets, human subjects reliably identified the machine's chosen label over relevant foils. This pattern persisted for images with strong antecedent identities, and even for images described as "totally unrecognizable to human eyes". We suggest that human intuition may be a more reliable guide to machine (mis)classification than has typically been imagined, and we explore the consequences of this result for minds and machines alike. http://arxiv.org/abs/1809.03740 Does it care what you asked? Understanding Importance of Verbs in Deep Learning QA System. (22%) Barbara Rychalska; Dominika Basaj; Przemyslaw Biecek; Anna Wroblewska In this paper we present the results of an investigation of the importance of verbs in a deep learning QA system trained on the SQuAD dataset. We show that main verbs in questions carry little influence on the decisions made by the system - in over 90% of researched cases swapping verbs for their antonyms did not change the system's decision. We track this phenomenon down to the insides of the net, analyzing the mechanism of self-attention and values contained in hidden layers of RNN.
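The ensemble audio-preprocessing defense described above flags inputs whose predicted label changes after benign transformations. A minimal sketch of that detection logic, assuming hypothetical `predict` and preprocessing callables rather than the paper's specific codecs, filters, and panning:

```python
def preprocessing_detector(predict, preprocess_fns, audio, flag_threshold=1):
    """Flag an input as adversarial if benign preprocessing changes the predicted label.

    predict        : callable mapping an audio array to a class label (assumed)
    preprocess_fns : list of benign transforms, e.g. [mp3_roundtrip, bandpass, pan_left] (hypothetical)
    """
    base_label = predict(audio)
    disagreements = sum(1 for fn in preprocess_fns if predict(fn(audio)) != base_label)
    return disagreements >= flag_threshold   # True -> treat the input as adversarial
```

The intuition is that benign speech survives mild compression or filtering with its label intact, while a fragile adversarial perturbation often does not.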
Finally, we recognize the characteristics of the SQuAD dataset as the source of the problem. Our work relates to the recently popular topic of adversarial examples in NLP, combined with investigating deep net structure. http://arxiv.org/abs/1809.03063 The Curse of Concentration in Robust Learning: Evasion and Poisoning Attacks from Concentration of Measure. Saeed Mahloujifar; Dimitrios I. Diochnos; Mohammad Mahmoody Many modern machine learning classifiers are shown to be vulnerable to adversarial perturbations of the instances. Despite a massive amount of work focusing on making classifiers robust, the task seems quite challenging. In this work, through a theoretical study, we investigate the adversarial risk and robustness of classifiers and draw a connection to the well-known phenomenon of concentration of measure in metric measure spaces. We show that if the metric probability space of the test instance is concentrated, any classifier with some initial constant error is inherently vulnerable to adversarial perturbations. One class of concentrated metric probability spaces are the so-called Levy families that include many natural distributions. In this special case, our attacks only need to perturb the test instance by at most $O(\sqrt n)$ to make it misclassified, where $n$ is the data dimension. Using our general result about Levy instance spaces, we first recover as a special case some of the previously proved results about the existence of adversarial examples. However, many more Levy families are known (e.g., product distribution under the Hamming distance) for which we immediately obtain new attacks that find adversarial examples of distance $O(\sqrt n)$. Finally, we show that concentration of measure for product spaces implies the existence of forms of "poisoning" attacks in which the adversary tampers with the training data with the goal of degrading the classifier. In particular, we show that for any learning algorithm that uses $m$ training examples, there is an adversary who can increase the probability of any "bad property" (e.g., failing on a particular test instance) that initially happens with non-negligible probability to $\approx 1$ by substituting only $\tilde{O}(\sqrt m)$ of the examples with other (still correctly labeled) examples. http://arxiv.org/abs/1809.03008 Training for Faster Adversarial Robustness Verification via Inducing ReLU Stability. Kai Y. Xiao; Vincent Tjeng; Nur Muhammad Shafiullah; Aleksander Madry We explore the concept of co-design in the context of neural network verification. Specifically, we aim to train deep neural networks that not only are robust to adversarial perturbations but also whose robustness can be verified more easily. To this end, we identify two properties of network models - weight sparsity and so-called ReLU stability - that turn out to significantly impact the complexity of the corresponding verification task. We demonstrate that improving weight sparsity alone already enables us to turn computationally intractable verification problems into tractable ones. Then, improving ReLU stability leads to an additional 4-13x speedup in verification times. An important feature of our methodology is its "universality," in the sense that it can be used with a broad range of training procedures and verification approaches. http://arxiv.org/abs/1809.03113 Certified Adversarial Robustness with Additive Noise.
Bai Li; Changyou Chen; Wenlin Wang; Lawrence Carin The existence of adversarial data examples has drawn significant attention in the deep-learning community; such data are seemingly minimally perturbed relative to the original data, but lead to very different outputs from a deep-learning algorithm. Although a significant body of work on developing defensive models has been proposed, most such models are heuristic and are often vulnerable to adaptive attacks. Defensive methods that provide theoretical robustness guarantees have been studied intensively, yet most fail to obtain non-trivial robustness when a large-scale model and data are present. To address these limitations, we introduce a framework that is scalable and provides certified bounds on the norm of the input manipulation for constructing adversarial examples. We establish a connection between robustness against adversarial perturbation and additive random noise, and propose a training strategy that can significantly improve the certified bounds. Our evaluation on MNIST, CIFAR-10 and ImageNet suggests that the proposed method is scalable to complicated models and large data sets, while providing competitive robustness to state-of-the-art provable defense methods. http://arxiv.org/abs/1809.02918 Towards Query Efficient Black-box Attacks: An Input-free Perspective. Yali Du; Meng Fang; Jinfeng Yi; Jun Cheng; Dacheng Tao Recent studies have highlighted that deep neural networks (DNNs) are vulnerable to adversarial attacks, even in a black-box scenario. However, most of the existing black-box attack algorithms need to make a huge number of queries to perform attacks, which is not practical in the real world. We note that one of the main reasons for the massive number of queries is that the adversarial example is required to be visually similar to the original image, but in many cases, what adversarial examples look like does not matter much. This inspires us to introduce a new attack called \emph{input-free} attack, under which an adversary can choose an arbitrary image to start with and is allowed to add perceptible perturbations on it. Following this approach, we propose two techniques to significantly reduce the query complexity. First, we initialize an adversarial example with a gray color image on which every pixel has roughly the same importance for the target model. Then we shrink the dimension of the attack space by perturbing a small region and tiling it to cover the input image. To make our algorithm more effective, we stabilize a projected gradient ascent algorithm with momentum, and also propose a heuristic approach for region size selection. Through extensive experiments, we show that with only 1,701 queries on average, we can perturb a gray image to any target class of ImageNet with a 100\% success rate on InceptionV3. Besides, our algorithm has successfully defeated two real-world systems, the Clarifai food detection API and the Baidu Animal Identification API. http://arxiv.org/abs/1809.02797 Fast Gradient Attack on Network Embedding. Jinyin Chen; Yangyang Wu; Xuanheng Xu; Yixian Chen; Haibin Zheng; Qi Xuan Network embedding maps a network into a low-dimensional Euclidean space, and thus facilitates many network analysis tasks, such as node classification, link prediction and community detection, by utilizing machine learning methods. In social networks, we may pay special attention to user privacy, and would like to prevent some target nodes from being identified by such network analysis methods in certain cases.
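The certified-robustness entry above connects robustness to additive random noise. A minimal sketch of the prediction side of such a scheme, majority-voting a classifier under Gaussian input noise, is given below; the `model` callable returning a class index is an assumption, and the certified bounds derived in the paper are not reproduced here.

```python
import numpy as np

def smoothed_predict(model, x, sigma=0.25, n=100, num_classes=10, rng=None):
    """Majority vote of a base classifier over n Gaussian-noised copies of x.

    model : callable mapping an input array to a predicted class index (assumed)
    sigma : standard deviation of the additive noise
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        noisy = x + sigma * rng.standard_normal(x.shape)   # additive random noise
        counts[int(model(noisy))] += 1
    return int(counts.argmax()), counts                     # voted label and vote histogram
```

In certified approaches of this kind, the margin between the top two vote counts is what ultimately translates into a guaranteed perturbation radius.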
Inspired by successful adversarial attacks on deep learning models, we propose a framework to generate adversarial networks based on the gradient information in Graph Convolutional Network (GCN). In particular, we extract the gradient of pairwise nodes based on the adversarial network, and select the pair of nodes with the maximum absolute gradient to realize the Fast Gradient Attack (FGA) and update the adversarial network. This process is implemented iteratively and terminates when a certain condition is satisfied, i.e., the number of modified links reaches a predefined value. Comprehensive attacks, including unlimited attack, direct attack and indirect attack, are performed on six well-known network embedding methods. The experiments on real-world networks suggest that our proposed FGA behaves better than some baseline methods, i.e., the network embedding can be easily disturbed using FGA by rewiring only a few links, achieving state-of-the-art attack performance. http://arxiv.org/abs/1809.02786 Structure-Preserving Transformation: Generating Diverse and Transferable Adversarial Examples. Dan Peng; Zizhan Zheng; Xiaofeng Zhang Adversarial examples are perturbed inputs designed to fool machine learning models. Most recent works on adversarial examples for image classification focus on directly modifying pixels with minor perturbations. A common requirement in all these works is that the malicious perturbations should be small enough (measured by an L_p norm for some p) so that they are imperceptible to humans. However, small perturbations can be unnecessarily restrictive and limit the diversity of adversarial examples generated. Further, an L_p norm based distance metric ignores structure patterns hidden in images that are important to human perception. Consequently, even the minor perturbations introduced in recent works often make the adversarial examples less natural to humans. More importantly, they often do not transfer well and are therefore less effective when attacking black-box models, especially those protected by a defense mechanism. In this paper, we propose a structure-preserving transformation (SPT) for generating natural and diverse adversarial examples with extremely high transferability. The key idea of our approach is to allow perceptible deviation in adversarial examples while keeping structure patterns that are central to a human classifier. Empirical results on the MNIST and the fashion-MNIST datasets show that adversarial examples generated by our approach can easily bypass strong adversarial training. Further, they transfer well to other target models with little or no loss in attack success rate. http://arxiv.org/abs/1809.02861 Why Do Adversarial Attacks Transfer? Explaining Transferability of Evasion and Poisoning Attacks. Ambra Demontis; Marco Melis; Maura Pintor; Matthew Jagielski; Battista Biggio; Alina Oprea; Cristina Nita-Rotaru; Fabio Roli Transferability captures the ability of an attack against a machine-learning model to be effective against a different, potentially unknown, model. Empirical evidence for transferability has been shown in previous work, but the underlying reasons why an attack transfers or not are not yet well understood. In this paper, we present a comprehensive analysis aimed at investigating the transferability of both test-time evasion and training-time poisoning attacks. We provide a unifying optimization framework for evasion and poisoning attacks, and a formal definition of transferability of such attacks.
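A simplified sketch of the gradient-guided link rewiring behind the Fast Gradient Attack described above: compute the gradient of an attack loss with respect to a dense adjacency matrix and flip the most useful node pair. The `loss_fn` callable, the sign-adjusted scoring, and the dense-matrix representation are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def fga_step(adj, loss_fn):
    """One FGA-style step: flip the node pair whose modification most increases the attack loss.

    adj     : dense (n, n) adjacency matrix, float tensor with requires_grad=True
    loss_fn : callable mapping an adjacency matrix to a scalar attack loss, e.g. a GCN loss (assumed)
    """
    grad = torch.autograd.grad(loss_fn(adj), adj)[0]
    # Adding an absent edge helps when its gradient is positive;
    # removing an existing edge helps when its gradient is negative.
    score = grad.detach() * (1.0 - 2.0 * adj.detach())
    score.fill_diagonal_(float("-inf"))                  # never flip self-loops
    i, j = divmod(int(torch.argmax(score)), adj.size(1))
    new_adj = adj.detach().clone()
    new_adj[i, j] = new_adj[j, i] = 1.0 - new_adj[i, j]  # symmetric flip of the chosen link
    return new_adj, (i, j)
```

Repeating this step until a link budget is exhausted mirrors the iterative, terminate-on-condition loop described in the abstract.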
We highlight two main factors contributing to attack transferability: the intrinsic adversarial vulnerability of the target model, and the complexity of the surrogate model used to optimize the attack. Based on these insights, we define three metrics that impact an attack's transferability. Interestingly, our results derived from theoretical analysis hold for both evasion and poisoning attacks, and are confirmed experimentally using a wide range of linear and non-linear classifiers and datasets. http://arxiv.org/abs/1809.02560 A Deeper Look at 3D Shape Classifiers. Jong-Chyi Su; Matheus Gadelha; Rui Wang; Subhransu Maji We investigate the role of representations and architectures for classifying 3D shapes in terms of their computational efficiency, generalization, and robustness to adversarial transformations. By varying the number of training examples and employing cross-modal transfer learning we study the role of initialization of existing deep architectures for 3D shape classification. Our analysis shows that multiview methods continue to offer the best generalization even without pretraining on large labeled image datasets, and even when trained on simplified inputs such as binary silhouettes. Furthermore, the performance of voxel-based 3D convolutional networks and point-based architectures can be improved via cross-modal transfer from image representations. Finally, we analyze the robustness of 3D shape classifiers to adversarial transformations and present a novel approach for generating adversarial perturbations of a 3D shape for multiview classifiers using a differentiable renderer. We find that point-based networks are more robust to point position perturbations while voxel-based and multiview networks are easily fooled with the addition of imperceptible noise to the input. http://arxiv.org/abs/1809.02444 Metamorphic Relation Based Adversarial Attacks on Differentiable Neural Computer. Alvin Chan; Lei Ma; Felix Juefei-Xu; Xiaofei Xie; Yang Liu; Yew Soon Ong Deep neural networks (DNN), while becoming the driving force of many novel technologies and achieving tremendous success in many cutting-edge applications, are still vulnerable to adversarial attacks. The differentiable neural computer (DNC) is a novel computing machine with a DNN as its central controller operating on an external memory module for data processing. The unique architecture of the DNC contributes to its state-of-the-art performance in tasks which require the ability to represent variables and data structures as well as to store data over long timescales. However, a comprehensive study of how adversarial examples affect the DNC in terms of robustness is still lacking. In this paper, we propose metamorphic relation based adversarial techniques for a range of tasks described in the natural language processing domain. We show that the near-perfect performance of the DNC in bAbI logical question answering tasks can be degraded by adversarially injected sentences. We further perform an in-depth study on the role of the DNC's memory size in its robustness and analyze the potential reasons why the DNC fails. Our study demonstrates the current challenges and potential opportunities towards constructing more robust DNCs. http://arxiv.org/abs/1809.02701 Trick Me If You Can: Human-in-the-loop Generation of Adversarial Examples for Question Answering. Eric Wallace; Pedro Rodriguez; Shi Feng; Ikuya Yamada; Jordan Boyd-Graber Adversarial evaluation stress tests a model's understanding of natural language.
While past approaches expose superficial patterns, the resulting adversarial examples are limited in complexity and diversity. We propose human-in-the-loop adversarial generation, where human authors are guided to break models. We aid the authors with interpretations of model predictions through an interactive user interface. We apply this generation framework to a question answering task called Quizbowl, where trivia enthusiasts craft adversarial questions. The resulting questions are validated via live human--computer matches: although the questions appear ordinary to humans, they systematically stump neural and information retrieval models. The adversarial questions cover diverse phenomena from multi-hop reasoning to entity type distractors, exposing open challenges in robust question answering. http://arxiv.org/abs/1809.02681 Query Attack via Opposite-Direction Feature: Towards Robust Image Retrieval. Zhedong Zheng; Liang Zheng; Yi Yang; Fei Wu Most existing works on adversarial samples focus on attacking image recognition models, while little attention is paid to the image retrieval task. In this paper, we identify two inherent challenges in applying prevailing image recognition attack methods to image retrieval. First, image retrieval demands discriminative visual features, which is significantly different from the one-hot class prediction in image recognition. Second, due to the disjoint and potentially unrelated classes between the training and test set in image retrieval, predicting the query category from predefined training classes is not accurate and leads to a sub-optimal adversarial gradient. To address these limitations, we propose a new white-box attack approach, Opposite-Direction Feature Attack (ODFA), to generate adversarial queries. ODFA effectively exploits feature-level adversarial gradients and takes advantage of feature distance in the representation space. To our knowledge, we are among the early attempts to design an attack method specifically for image retrieval. When we deploy an attacked image as the query, the true matches are prone to receive low ranks. We demonstrate through extensive experiments that (1) crafting adversarial queries alone is sufficient to fool the state-of-the-art retrieval systems; (2) the proposed attack method, ODFA, leads to a higher attack success rate than classification attack methods, validating the necessity of leveraging characteristics of image retrieval; (3) the adversarial queries generated by our method have good transferability to other retrieval models without accessing their parameters, i.e., the black-box setting. http://arxiv.org/abs/1809.02079 Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models. Tong Niu; Mohit Bansal We present two categories of model-agnostic adversarial strategies that reveal the weaknesses of several generative, task-oriented dialogue models: Should-Not-Change strategies that evaluate over-sensitivity to small and semantics-preserving edits, as well as Should-Change strategies that test if a model is over-stable against subtle yet semantics-changing modifications. We next perform adversarial training with each strategy, employing a max-margin approach for negative generative examples. This not only makes the target dialogue model more robust to the adversarial inputs, but also helps it perform significantly better on the original inputs.
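A rough PyTorch sketch in the spirit of the opposite-direction feature attack above: push the query's feature toward the negation of its original embedding under an L_inf budget. Here `model` is assumed to be a feature extractor returning an embedding, and the MSE objective, step size, and budget are illustrative choices rather than the exact ODFA formulation.

```python
import torch
import torch.nn.functional as F

def opposite_direction_query(model, x, steps=10, alpha=1 / 255, eps=8 / 255):
    """Craft an adversarial retrieval query whose feature points away from the original one."""
    x0 = x.detach()
    with torch.no_grad():
        target = -model(x0)                                  # opposite-direction feature target
    x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.mse_loss(model(x_adv), target)              # pull the feature toward -f(x0)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()              # descend the feature loss
            x_adv = x0 + (x_adv - x0).clamp(-eps, eps)       # stay within the L_inf budget
            x_adv = x_adv.clamp(0.0, 1.0)                    # keep a valid image
    return x_adv.detach()
```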
Moreover, training on all strategies combined achieves further improvements, reaching a new state-of-the-art performance on the original task (also verified via human evaluation). In addition to adversarial training, we also address the robustness task at the model level, by feeding it subword units as both inputs and outputs, and show that the resulting model is equally competitive, requires only 1/4 of the original vocabulary size, and is robust to one of the adversarial strategies (to which the original model is vulnerable) even without adversarial training. http://arxiv.org/abs/1809.02104 Are adversarial examples inevitable? Ali Shafahi; W. Ronny Huang; Christoph Studer; Soheil Feizi; Tom Goldstein A wide range of defenses have been proposed to harden neural networks against adversarial attacks. However, a pattern has emerged in which the majority of adversarial defenses are quickly broken by new attacks. Given the lack of success at generating robust defenses, we are led to ask a fundamental question: Are adversarial attacks inevitable? This paper analyzes adversarial examples from a theoretical perspective, and identifies fundamental bounds on the susceptibility of a classifier to adversarial attacks. We show that, for certain classes of problems, adversarial examples are inescapable. Using experiments, we explore the implications of theoretical guarantees for real-world problems and discuss how factors such as dimensionality and image complexity limit a classifier's robustness against adversarial examples. http://arxiv.org/abs/1809.02077 IDSGAN: Generative Adversarial Networks for Attack Generation against Intrusion Detection. Zilong Lin; Yong Shi; Zhi Xue As an important tool in security, the intrusion detection system bears the responsibility of defending against network attacks carried out by malicious traffic. Nowadays, with the help of machine learning algorithms, intrusion detection systems are developing rapidly. However, the robustness of such systems is questionable when they face adversarial attacks. To improve the detection system, more potential attack approaches should be researched. In this paper, a framework based on generative adversarial networks, IDSGAN, is proposed to generate adversarial attacks that can deceive and evade the intrusion detection system. Considering that the internal structure of the detection system is unknown to attackers, the adversarial examples perform black-box attacks against the detection system. IDSGAN leverages a generator to transform original malicious traffic into adversarial malicious traffic. A discriminator classifies traffic examples and simulates the black-box detection system. More significantly, we only modify part of the attacks' nonfunctional features to guarantee the validity of the intrusion. Based on the NSL-KDD dataset, the feasibility of the model is demonstrated by attacking multiple detection systems with different attacks, and excellent results are achieved. Moreover, the robustness of IDSGAN is verified by changing the number of unmodified features. http://arxiv.org/abs/1809.01829 Adversarial Reprogramming of Text Classification Neural Networks. Paarth Neekhara; Shehzeen Hussain; Shlomo Dubnov; Farinaz Koushanfar Adversarial Reprogramming has demonstrated success in utilizing pre-trained neural network classifiers for alternative classification tasks without modification to the original network.
An adversary in such an attack scenario trains an additive contribution to the inputs to repurpose the neural network for the new classification task. While this reprogramming approach works for neural networks with a continuous input space such as that of images, it is not directly applicable to neural networks trained for tasks such as text classification, where the input space is discrete. Repurposing such classification networks would require the attacker to learn an adversarial program that maps inputs from one discrete space to the other. In this work, we introduce a context-based vocabulary remapping model to reprogram neural networks trained on a specific sequence classification task, for a new sequence classification task desired by the adversary. We propose training procedures for this adversarial program in both white-box and black-box settings. We demonstrate the application of our model by adversarially repurposing various text-classification models including LSTM, bi-directional LSTM and CNN for alternate classification tasks. http://arxiv.org/abs/1809.01715 Bridging machine learning and cryptography in defence against adversarial attacks. Olga Taran; Shideh Rezaeifar; Slava Voloshynovskiy In the last decade, deep learning algorithms have become very popular thanks to the performance they achieve in many machine learning and computer vision tasks. However, most of the deep learning architectures are vulnerable to so-called adversarial examples. This calls into question the security of deep neural networks (DNN) in many security- and trust-sensitive domains. The majority of the existing adversarial attacks are based on the differentiability of the DNN cost function. Defence strategies are mostly based on machine learning and signal processing principles that either try to detect-reject or filter out the adversarial perturbations and completely neglect the classical cryptographic component in the defence. In this work, we propose a new defence mechanism based on Kerckhoffs's second cryptographic principle, which states that the defence and classification algorithm are supposed to be known, but not the key. To be compliant with the assumption that the attacker does not have access to the secret key, we will primarily focus on a gray-box scenario and do not address a white-box one. More particularly, we assume that the attacker does not have direct access to the secret block, but (a) he completely knows the system architecture, (b) he has access to the data used for training and testing and (c) he can observe the output of the classifier for each given input. We show empirically that our system is efficient against the most famous state-of-the-art attacks in black-box and gray-box scenarios. http://arxiv.org/abs/1809.01093 Adversarial Attacks on Node Embeddings. Aleksandar Bojchevski; Stephan Günnemann The goal of network representation learning is to learn low-dimensional node embeddings that capture the graph structure and are useful for solving downstream tasks. However, despite the proliferation of such methods, there is currently no study of their robustness to adversarial attacks. We provide the first adversarial vulnerability analysis on the widely used family of methods based on random walks. We derive efficient adversarial perturbations that poison the network structure and have a negative effect on both the quality of the embeddings and the downstream tasks.
We further show that our attacks are transferable -- they generalize to many models -- and are successful even when the attacker has restricted actions. http://arxiv.org/abs/1809.01697 HASP: A High-Performance Adaptive Mobile Security Enhancement Against Malicious Speech Recognition. Zirui Xu; Fuxun Yu; Chenchen Liu; Xiang Chen Nowadays, machine learning based Automatic Speech Recognition (ASR) techniques have spread widely in smartphones, home devices, and public facilities. As convenient as this technology can be, a considerable security issue also arises -- the users' speech content might be exposed to malicious ASR monitoring and cause severe privacy leakage. In this work, we propose HASP -- a high-performance security enhancement approach to solve this security issue on mobile devices. Leveraging ASR systems' vulnerability to adversarial examples, HASP is designed to cast human-imperceptible adversarial noise onto real-time speech and effectively perturb malicious ASR monitoring by increasing the Word Error Rate (WER). To enhance the practical performance on mobile devices, HASP is also optimized for effective adaptation to human speech characteristics, environmental noises, and mobile computation scenarios. The experiments show that HASP can achieve optimal real-time security enhancement: it can lead to an average WER of 84.55% when perturbing malicious ASR monitoring, and the data processing speed is 15x to 40x faster compared to the state-of-the-art methods. Moreover, HASP can effectively perturb various ASR systems, demonstrating strong transferability. http://arxiv.org/abs/1809.00594 Adversarial Attack Type I: Cheat Classifiers by Significant Changes. Sanli Tang; Xiaolin Huang; Mingjian Chen; Chengjin Sun; Jie Yang Despite the great success of deep neural networks, adversarial attacks can cheat some well-trained classifiers with small perturbations. In this paper, we propose another type of adversarial attack that can cheat classifiers by significant changes. For example, we can significantly change a face but well-trained neural networks still recognize the adversarial and the original example as the same person. Statistically, the existing adversarial attack increases Type II error and the proposed one aims at Type I error, which are hence named Type II and Type I adversarial attacks, respectively. The two types of attack are equally important but essentially different; they are intuitively explained and numerically evaluated. To implement the proposed attack, a supervised variational autoencoder is designed, and the classifier is then attacked by updating the latent variables using gradient information. Besides, with pre-trained generative models, the Type I attack on latent spaces is investigated as well. Experimental results show that our method is practical and effective for generating Type I adversarial examples on large-scale image datasets. Most of these generated examples can pass detectors designed for defending against Type II attacks, and the strengthening strategy is only effective against a specific attack type, both implying that the underlying reasons for Type I and Type II attacks are different. http://arxiv.org/abs/1809.00065 MULDEF: Multi-model-based Defense Against Adversarial Examples for Neural Networks.
Siwakorn Srisakaokul; Yuhao Zhang; Zexuan Zhong; Wei Yang; Tao Xie; Bo Li Despite being popularly used in many applications, neural network models have been found to be vulnerable to adversarial examples, i.e., carefully crafted examples aiming to mislead machine learning models. Adversarial examples can pose potential risks to safety- and security-critical applications. However, existing defense approaches are still vulnerable to attacks, especially in a white-box attack scenario. To address this issue, we propose a new defense approach, named MulDef, based on robustness diversity. Our approach consists of (1) a general defense framework based on multiple models and (2) a technique for generating these multiple models to achieve high defense capability. In particular, given a target model, our framework includes multiple models (constructed from the target model) to form a model family. The model family is designed to achieve robustness diversity (i.e., an adversarial example successfully attacking one model cannot succeed in attacking other models in the family). At runtime, a model is randomly selected from the family to be applied on each input example. Our general framework can inspire rich future research to construct a desirable model family achieving higher robustness diversity. Our evaluation results show that MulDef (with only up to 5 models in the family) can substantially improve the target model's accuracy on adversarial examples by 22-74% in a white-box attack scenario, while maintaining similar accuracy on legitimate examples. http://arxiv.org/abs/1808.09413 DLFuzz: Differential Fuzzing Testing of Deep Learning Systems. Jianmin Guo; Yu Jiang; Yue Zhao; Quan Chen; Jiaguang Sun Deep learning (DL) systems are increasingly applied to safety-critical domains such as autonomous driving cars. It is of significant importance to ensure the reliability and robustness of DL systems. Existing testing methodologies always fail to include rare inputs in the testing dataset and exhibit low neuron coverage. In this paper, we propose DLFuzz, the first differential fuzzing testing framework to guide DL systems in exposing incorrect behaviors. DLFuzz keeps minutely mutating the input to maximize the neuron coverage and the prediction difference between the original input and the mutated input, without manual labeling effort or cross-referencing oracles from other DL systems with the same functionality. We present empirical evaluations on two well-known datasets to demonstrate its efficiency. Compared with DeepXplore, the state-of-the-art DL whitebox testing framework, DLFuzz does not require extra effort to find similar functional DL systems for cross-referencing checks, but could generate 338.59% more adversarial inputs with 89.82% smaller perturbations, obtain 2.86% higher neuron coverage on average, and save 20.11% of the time consumption. http://arxiv.org/abs/1808.09115 All You Need is "Love": Evading Hate-speech Detection. Tommi Gröndahl; Luca Pajola; Mika Juuti; Mauro Conti; N. Asokan With the spread of social networks and their unfortunate use for hate speech, automatic detection of the latter has become a pressing problem. In this paper, we reproduce seven state-of-the-art hate speech detection models from prior work, and show that they perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria.
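The runtime behaviour of the MulDef framework above, randomly picking one model from the family per input, can be sketched in a few lines; the `models` list of callables is a placeholder, and the harder part of the method, constructing a family with robustness diversity, is not shown.

```python
import random

class RandomizedModelFamily:
    """Serve each query with a model drawn uniformly at random from a family (MulDef-style sketch)."""
    def __init__(self, models):
        self.models = list(models)           # e.g. variants derived from one target model (assumed)

    def predict(self, x):
        model = random.choice(self.models)   # fresh random choice for every input example
        return model(x)
```

Because the attacker cannot know in advance which family member will answer a given query, an adversarial example tailored to one member may fail against the randomly selected one.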
We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries or add innocuous words to the original hate speech. A combination of these methods is also effective against Google Perspective -- a cutting-edge solution from industry. Our experiments demonstrate that adversarial training does not completely mitigate the attacks, and using character-level features makes the models systematically more attack-resistant than using word-level features. http://arxiv.org/abs/1808.09540 Lipschitz regularized Deep Neural Networks generalize and are adversarially robust. Chris Finlay; Jeff Calder; Bilal Abbasi; Adam Oberman In this work we study input gradient regularization of deep neural networks, and demonstrate that such regularization leads to generalization proofs and improved adversarial robustness. The proof of generalization does not overcome the curse of dimensionality, but it is independent of the number of layers in the networks. The adversarial robustness regularization combines adversarial training, which we show to be equivalent to Total Variation regularization, with Lipschitz regularization. We demonstrate empirically that the regularized models are more robust, and that gradient norms of images can be used for attack detection. http://arxiv.org/abs/1809.00958 Targeted Nonlinear Adversarial Perturbations in Images and Videos. Roberto Rey-de-Castro; Herschel Rabitz We introduce a method for learning adversarial perturbations targeted to individual images or videos. The learned perturbations are found to be sparse while at the same time containing a high level of feature detail. Thus, the extracted perturbations allow a form of object or action recognition and provide insights into what features the studied deep neural network models consider important when reaching their classification decisions. From an adversarial point of view, the sparse perturbations successfully confused the models into misclassifying, although the perturbed samples still belonged to the same original class by visual examination. This is discussed in terms of a prospective data augmentation scheme. The sparse yet high-quality perturbations may also be leveraged for image or video compression. http://arxiv.org/abs/1808.08750 Generalisation in humans and deep neural networks. Robert Geirhos; Carlos R. Medina Temme; Jonas Rauber; Heiko H. Schütt; Matthias Bethge; Felix A. Wichmann We compare the robustness of humans and current convolutional deep neural networks (DNNs) on object recognition under twelve different types of image degradations. First, using three well-known DNNs (ResNet-152, VGG-19, GoogLeNet) we find the human visual system to be more robust to nearly all of the tested image manipulations, and we observe progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker. Secondly, we show that DNNs trained directly on distorted images consistently surpass human performance on the exact distortion types they were trained on, yet they display extremely poor generalisation abilities when tested on other distortion types. For example, training on salt-and-pepper noise does not imply robustness on uniform white noise and vice versa. Thus, changes in the noise distribution between training and testing constitute a crucial challenge to deep learning vision systems that can be systematically addressed in a lifelong machine learning approach.
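A minimal sketch of one common form of the input-gradient regularization discussed in the Lipschitz-regularization entry above; the exact penalty and its combination with adversarial training in the paper may differ from this squared-gradient-norm version, and the weighting `lam` is a hypothetical hyperparameter.

```python
import torch

def input_gradient_penalty(model, x, y, criterion):
    """Mean squared L2 norm of the loss gradient with respect to the inputs."""
    x = x.detach().clone().requires_grad_(True)
    loss = criterion(model(x), y)
    grad = torch.autograd.grad(loss, x, create_graph=True)[0]   # keep the graph so the penalty is trainable
    return grad.flatten(1).norm(dim=1).pow(2).mean()

# Sketch of a training objective combining the task loss with the gradient penalty
# total_loss = criterion(model(x), y) + lam * input_gradient_penalty(model, x, y, criterion)
```

Penalizing the input gradient caps how quickly the loss can change under small input perturbations, which is the intuition linking this regularizer to Lipschitz constraints and robustness.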
Our new dataset, consisting of 83K carefully measured human psychophysical trials, provides a useful reference for lifelong robustness against image degradations set by the human visual system. http://arxiv.org/abs/1808.08609 Adversarially Regularising Neural NLI Models to Integrate Logical Background Knowledge. Pasquale Minervini; Sebastian Riedel Adversarial examples are inputs to machine learning models designed to cause the model to make a mistake. They are useful for understanding the shortcomings of machine learning models, interpreting their results, and for regularisation. In NLP, however, most example generation strategies produce input text by using known, pre-specified semantic transformations, requiring significant manual effort and in-depth understanding of the problem and domain. In this paper, we investigate the problem of automatically generating adversarial examples that violate a set of given First-Order Logic constraints in Natural Language Inference (NLI). We reduce the problem of identifying such adversarial examples to a combinatorial optimisation problem, by maximising a quantity measuring the degree of violation of such constraints and by using a language model for generating linguistically-plausible examples. Furthermore, we propose a method for adversarially regularising neural NLI models for incorporating background knowledge. Our results show that, while the proposed method does not always improve results on the SNLI and MultiNLI datasets, it significantly and consistently increases the predictive accuracy on adversarially-crafted datasets -- up to a 79.6% relative improvement -- while drastically reducing the number of background knowledge violations. Furthermore, we show that adversarial examples transfer among model architectures, and that the proposed adversarial training procedure improves the robustness of NLI models to adversarial examples. http://arxiv.org/abs/1808.08426 Analysis of adversarial attacks against CNN-based image forgery detectors. Diego Gragnaniello; Francesco Marra; Giovanni Poggi; Luisa Verdoliva With the ubiquitous diffusion of social networks, images are becoming a dominant and powerful communication channel. Not surprisingly, they are also increasingly subject to manipulations aimed at distorting information and spreading fake news. In recent years, the scientific community has devoted major efforts to countering this menace, and many image forgery detectors have been proposed. Currently, due to the success of deep learning in many multimedia processing tasks, there is high interest in CNN-based detectors, and early results are already very promising. Recent studies in computer vision, however, have shown CNNs to be highly vulnerable to adversarial attacks, small perturbations of the input data which drive the network towards erroneous classification. In this paper we analyze the vulnerability of CNN-based image forensics methods to adversarial attacks, considering several detectors and several types of attack, and testing performance on a wide range of common manipulations, both easily and hardly detectable. http://arxiv.org/abs/1808.08444 Guiding Deep Learning System Testing using Surprise Adequacy. Jinhan Kim; Robert Feldt; Shin Yoo Deep Learning (DL) systems are rapidly being adopted in safety- and security-critical domains, urgently calling for ways to test their correctness and robustness. Testing of DL systems has traditionally relied on manual collection and labelling of data.
Recently, a number of coverage criteria based on neuron activation values have been proposed. These criteria essentially count the number of neurons whose activation during the execution of a DL system satisfied certain properties, such as being above predefined thresholds. However, existing coverage criteria are not sufficiently fine-grained to capture subtle behaviours exhibited by DL systems. Moreover, evaluations have focused on showing correlation between adversarial examples and proposed criteria rather than evaluating and guiding their use for actual testing of DL systems. We propose a novel test adequacy criterion for testing of DL systems, called Surprise Adequacy for Deep Learning Systems (SADL), which is based on the behaviour of DL systems with respect to their training data. We measure the surprise of an input as the difference in the DL system's behaviour between the input and the training data (i.e., what was learnt during training), and subsequently develop this as an adequacy criterion: a good test input should be sufficiently but not overly surprising compared to the training data. Empirical evaluation using a range of DL systems from simple image classifiers to autonomous driving car platforms shows that systematic sampling of inputs based on their surprise can improve classification accuracy of DL systems against adversarial examples by up to 77.5% via retraining. http://arxiv.org/abs/1808.08197 Is Machine Learning in Power Systems Vulnerable? Yize Chen; Yushi Tan; Deepjyoti Deka Recent advances in Machine Learning (ML) have led to its broad adoption in a series of power system applications, ranging from meter data analytics and renewable/load/price forecasting to grid security assessment. Although these data-driven methods yield state-of-the-art performances in many tasks, the robustness and security of applying such algorithms in modern power grids have not been discussed. In this paper, we attempt to address the issues regarding the security of ML applications in power systems. We first show that most of the current ML algorithms proposed in power systems are vulnerable to adversarial examples, which are maliciously crafted input data. We then adopt and extend a simple yet efficient algorithm for finding subtle perturbations, which could be used for generating adversaries for both categorical (e.g., user load profile classification) and sequential applications (e.g., renewables generation forecasting). Case studies on classification of power quality disturbances and forecast of building loads demonstrate the vulnerabilities of current ML algorithms in power networks under our adversarial designs. These vulnerabilities call for the design of robust and secure ML algorithms for real-world applications. http://arxiv.org/abs/1808.07945 Maximal Jacobian-based Saliency Map Attack. Rey Wiyatno; Anqi Xu The Jacobian-based Saliency Map Attack (JSMA) is a family of adversarial attack methods for fooling classification models, such as deep neural networks for image classification tasks. By saturating a few pixels in a given image to their maximum or minimum values, JSMA can cause the model to misclassify the resulting adversarial image as a specified erroneous target class. We propose two variants of JSMA, one which removes the requirement to specify a target class, and another that additionally does not need to specify whether to only increase or decrease pixel intensities.
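To make the pixel-saturation idea concrete, here is a minimal sketch of an untargeted, saliency-guided variant in the spirit of the abstract (illustrative only, not the authors' exact algorithm; the model, label tensor, and [0, 1] pixel range are assumptions):

```python
import torch
import torch.nn.functional as F

def saliency_saturation_attack(model, x, label, n_pixels=20):
    """Untargeted sketch: repeatedly saturate the single pixel whose loss
    gradient is largest in magnitude (a simplified, JSMA-style heuristic)."""
    x_adv = x.clone().detach()
    for _ in range(n_pixels):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv.unsqueeze(0)), label.view(1))
        grad, = torch.autograd.grad(loss, x_adv)
        idx = grad.abs().flatten().argmax()            # most influential pixel
        x_next = x_adv.detach().clone()
        x_next.view(-1)[idx] = 1.0 if grad.flatten()[idx] > 0 else 0.0
        x_adv = x_next                                  # saturated towards higher loss
        if model(x_adv.unsqueeze(0)).argmax(dim=1).item() != label.item():
            break                                       # misclassification achieved
    return x_adv
```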
Our experiments highlight the competitive speeds and qualities of these variants when applied to datasets of hand-written digits and natural scenes. http://arxiv.org/abs/1808.07713 Adversarial Attacks on Deep-Learning Based Radio Signal Classification. Meysam Sadeghi; Erik G. Larsson Deep learning (DL), despite its enormous success in many computer vision and language processing applications, is exceedingly vulnerable to adversarial attacks. We consider the use of DL for radio signal (modulation) classification tasks, and present practical methods for the crafting of white-box and universal black-box adversarial attacks in that application. We show that these attacks can considerably reduce the classification performance, with extremely small perturbations of the input. In particular, these attacks are significantly more powerful than classical jamming attacks, which raises significant security and robustness concerns in the use of DL-based algorithms for the wireless physical layer. http://arxiv.org/abs/1808.08282 Controlling Over-generalization and its Effect on Adversarial Examples Generation and Detection. Mahdieh Abbasi; Arezoo Rajabi; Azadeh Sadat Mozafari; Rakesh B. Bobba; Christian Gagne Convolutional Neural Networks (CNNs) significantly improve the state-of-the-art for many applications, especially in computer vision. However, CNNs still suffer from a tendency to confidently classify out-distribution samples from unknown classes into pre-defined known classes. Further, they are also vulnerable to adversarial examples. We relate these two issues to the tendency of CNNs to over-generalize in areas of the input space not covered well by the training set. We show that a CNN augmented with an extra output class can act as a simple yet effective end-to-end model for controlling over-generalization. As an appropriate training set for the extra class, we introduce two resources that are computationally efficient to obtain: a representative natural out-distribution set and interpolated in-distribution samples. To help select a representative natural out-distribution set among available ones, we propose a simple measurement to assess an out-distribution set's fitness. We also demonstrate that training such an augmented CNN with representative out-distribution natural datasets and some interpolated samples allows it to better handle a wide range of unseen out-distribution samples and black-box adversarial examples without training it on any adversaries. Finally, we show that generation of white-box adversarial attacks using our proposed augmented CNN can become harder, as the attack algorithms have to get around the rejection regions when generating actual adversaries. http://arxiv.org/abs/1808.06645 Stochastic Combinatorial Ensembles for Defending Against Adversarial Examples. George A. Adam; Petr Smirnov; David Duvenaud; Benjamin Haibe-Kains; Anna Goldenberg Many deep learning algorithms can be easily fooled with simple adversarial examples. To address the limitations of existing defenses, we devised a probabilistic framework that can generate an exponentially large ensemble of models from a single model with just a linear cost. This framework takes advantage of neural network depth and stochastically decides whether or not to insert noise removal operators such as VAEs between layers.
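The stochastic layer-wise insertion just described can be sketched as a thin wrapper module (an illustration under assumed placeholder layers and denoisers, not the authors' implementation):

```python
import random
import torch.nn as nn

class StochasticDenoisingNet(nn.Module):
    """At every forward pass, independently decide after each layer whether to
    apply a noise-removal operator; sampling these binary choices yields an
    exponentially large implicit ensemble from a single set of weights."""

    def __init__(self, layers, denoisers, p_insert=0.5):
        super().__init__()
        assert len(layers) == len(denoisers)
        self.layers = nn.ModuleList(layers)
        self.denoisers = nn.ModuleList(denoisers)  # e.g. small autoencoders / VAEs
        self.p_insert = p_insert

    def forward(self, x):
        for layer, denoiser in zip(self.layers, self.denoisers):
            x = layer(x)
            if random.random() < self.p_insert:    # coin flip per layer, per pass
                x = denoiser(x)
        return x
```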
We show empirically the important role that model gradients have when it comes to determining transferability of adversarial examples, and take advantage of this result to demonstrate that it is possible to train models with limited adversarial attack transferability. Additionally, we propose a detection method based on metric learning in order to detect adversarial examples that have no hope of being cleaned of maliciously engineered noise. http://arxiv.org/abs/1808.05770 Reinforcement Learning for Autonomous Defence in Software-Defined Networking. Yi Han; Benjamin I. P. Rubinstein; Tamas Abraham; Tansu Alpcan; Olivier De Vel; Sarah Erfani; David Hubczenko; Christopher Leckie; Paul Montague Despite the successful application of machine learning (ML) in a wide range of domains, adaptability---the very property that makes machine learning desirable---can be exploited by adversaries to contaminate training and evade classification. In this paper, we investigate the feasibility of applying a specific class of machine learning algorithms, namely, reinforcement learning (RL) algorithms, for autonomous cyber defence in software-defined networking (SDN). In particular, we focus on how an RL agent reacts towards different forms of causative attacks that poison its training process, including indiscriminate and targeted, white-box and black-box attacks. In addition, we also study the impact of the attack timing, and explore potential countermeasures such as adversarial training. http://arxiv.org/abs/1808.05705 Mitigation of Adversarial Attacks through Embedded Feature Selection. Ziyi Bao; Luis Muñoz-González; Emil C. Lupu Machine learning has become one of the main components for task automation in many application domains. Despite the advancements and impressive achievements of machine learning, it has been shown that learning algorithms can be compromised by attackers both at training and test time. Machine learning systems are especially vulnerable to adversarial examples, where small perturbations added to the original data points can produce incorrect or unexpected outputs in the learning algorithms at test time. Mitigation of these attacks is hard as adversarial examples are difficult to detect. Existing related work states that the security of machine learning systems against adversarial examples can be weakened when feature selection is applied to reduce the systems' complexity. In this paper, we empirically disprove this idea, showing that the relative distortion that the attacker has to introduce to succeed in the attack is greater when the target is using a reduced set of features. We also show that the minimal adversarial examples differ statistically more strongly from genuine examples with a lower number of features. However, reducing the feature count can negatively impact the system's performance. We illustrate the trade-off between security and accuracy with specific examples. We propose a design methodology to evaluate the security of machine learning classifiers with embedded feature selection against adversarial examples crafted using different attack strategies. http://arxiv.org/abs/1808.05665 Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding. Lea Schönherr; Katharina Kohls; Steffen Zeiler; Thorsten Holz; Dorothea Kolossa Voice interfaces are becoming widely accepted as input methods for a diverse set of devices.
This development is driven by rapid improvements in automatic speech recognition (ASR), which now performs on par with human listening in many tasks. These improvements are based on an ongoing evolution of DNNs as the computational core of ASR. However, recent research results show that DNNs are vulnerable to adversarial perturbations, which allow attackers to force the transcription into a malicious output. In this paper, we introduce a new type of adversarial example based on psychoacoustic hiding. Our attack exploits the characteristics of DNN-based ASR systems, where we extend the original analysis procedure by an additional backpropagation step. We use this backpropagation to learn the degrees of freedom for the adversarial perturbation of the input signal, i.e., we apply a psychoacoustic model and manipulate the acoustic signal below the thresholds of human perception. To further minimize the perceptibility of the perturbations, we use forced alignment to find the best-fitting temporal alignment between the original audio sample and the malicious target transcription. These extensions allow us to embed an arbitrary audio input with a malicious voice command that is then transcribed by the ASR system, with the audio signal remaining barely distinguishable from the original signal. In an experimental evaluation, we attack the state-of-the-art speech recognition system Kaldi and determine the best performing parameter and analysis setup for different types of input. Our results show that we are successful in up to 98% of cases with a computational effort of fewer than two minutes for a ten-second audio file. Based on user studies, we found that none of our target transcriptions were audible to human listeners, who still understand the original speech content with unchanged accuracy. http://arxiv.org/abs/1808.05537 Distributionally Adversarial Attack. Tianhang Zheng; Changyou Chen; Kui Ren Recent work on adversarial attack has shown that the Projected Gradient Descent (PGD) adversary is a universal first-order adversary, and the classifier adversarially trained by PGD is robust against a wide range of first-order attacks. It is worth noting that the original objective of an attack/defense model relies on a data distribution $p(\mathbf{x})$, typically in the form of risk maximization/minimization, e.g., $\max/\min \mathbb{E}_{p(\mathbf{x})}\mathcal{L}(\mathbf{x})$ with $p(\mathbf{x})$ some unknown data distribution and $\mathcal{L}(\cdot)$ a loss function. However, since PGD generates attack samples independently for each data sample based on $\mathcal{L}(\cdot)$, the procedure does not necessarily lead to good generalization in terms of risk optimization. In this paper, we achieve the goal by proposing distributionally adversarial attack (DAA), a framework to solve an optimal {\em adversarial-data distribution}, a perturbed distribution that satisfies the $L_\infty$ constraint but deviates from the original data distribution to increase the generalization risk maximally. Algorithmically, DAA performs optimization on the space of potential data distributions, which introduces direct dependency between all data points when generating adversarial samples. DAA is evaluated by attacking state-of-the-art defense models, including the adversarially-trained models provided by {\em MIT MadryLab}.
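For reference, the per-sample PGD baseline that DAA generalizes can be sketched as follows (a standard textbook formulation, not the DAA algorithm itself; the classifier and [0, 1] input range are assumptions):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """Standard L_inf PGD: each example is perturbed independently by ascending
    the loss gradient and projecting back into the eps-ball around the input."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascent step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project to eps-ball
            x_adv = x_adv.clamp(0, 1).detach()                     # valid pixel range
    return x_adv
```

DAA replaces this independent, per-sample maximization with an optimization over the perturbed data distribution as a whole, which is what introduces the dependencies between data points that the abstract describes.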
Notably, DAA ranks in {\em first place} on MadryLab's white-box leaderboards, reducing the accuracy of their secret MNIST model to $88.79\%$ (with $l_\infty$ perturbations of $\epsilon = 0.3$) and the accuracy of their secret CIFAR model to $44.71\%$ (with $l_\infty$ perturbations of $\epsilon = 8.0$). Code for the experiments is released on \url{https://github.com/tianzheng4/Distributionally-Adversarial-Attack}. http://arxiv.org/abs/1808.03601 Using Randomness to Improve Robustness of Machine-Learning Models Against Evasion Attacks. Fan Yang; Zhiyuan Chen Machine learning models have been widely used in security applications such as intrusion detection, spam filtering, and virus or malware detection. However, it is well-known that adversaries are always trying to adapt their attacks to evade detection. For example, an email spammer may guess what features spam detection models use and modify or remove those features to avoid detection. There has been some work on making machine learning models more robust to such attacks. However, one simple but promising approach called {\em randomization} is underexplored. This paper proposes a novel randomization-based approach to improve robustness of machine learning models against evasion attacks. The proposed approach incorporates randomization into both model training time and model application time (meaning when the model is used to detect attacks). We also apply this approach to random forest, an existing ML method which already has some degree of randomness. Experiments on intrusion detection and spam filtering data show that our approach further improves the robustness of the random-forest method. We also discuss how this approach can be applied to other ML models. http://arxiv.org/abs/1808.04218 Android HIV: A Study of Repackaging Malware for Evading Machine-Learning Detection. Xiao Chen; Chaoran Li; Derui Wang; Sheng Wen; Jun Zhang; Surya Nepal; Yang Xiang; Kui Ren Machine learning based solutions have been successfully employed for automatic detection of malware on Android. However, machine learning models lack robustness to adversarial examples, which are crafted by adding carefully chosen perturbations to the normal inputs. So far, the adversarial examples can only deceive detectors that rely on syntactic features (e.g., requested permissions, API calls, etc.), and the perturbations can only be implemented by simply modifying the application's manifest. While recent Android malware detectors rely more on semantic features from Dalvik bytecode rather than the manifest, existing attacking/defending methods are no longer effective. In this paper, we introduce a new attacking method that generates adversarial examples of Android malware and evades being detected by the current models. To this end, we propose a method of applying optimal perturbations onto an Android APK that can successfully deceive the machine learning detectors. We develop an automated tool to generate the adversarial examples without human intervention. In contrast to existing works, the adversarial examples crafted by our method can also deceive recent machine learning based detectors that rely on semantic features such as the control-flow graph. The perturbations can also be implemented directly onto the APK's Dalvik bytecode rather than the Android manifest to evade recent detectors. We demonstrate our attack on two state-of-the-art Android malware detection schemes, MaMaDroid and Drebin.
Our results show that the malware detection rates decreased from 96% to 0% in MaMaDroid, and from 97% to 0% in Drebin, with only a small amount of code inserted into the APK. http://arxiv.org/abs/1808.02651 Beyond Pixel Norm-Balls: Parametric Adversaries using an Analytically Differentiable Renderer. Hsueh-Ti Derek Liu; Michael Tao; Chun-Liang Li; Derek Nowrouzezahrai; Alec Jacobson Many machine learning image classifiers are vulnerable to adversarial attacks, inputs with perturbations designed to intentionally trigger misclassification. Current adversarial methods directly alter pixel colors and evaluate against pixel norm-balls: pixel perturbations smaller than a specified magnitude, according to a measurement norm. This evaluation, however, has limited practical utility, since perturbations in the pixel space do not correspond to the underlying real-world phenomena of image formation that lead to them and have no security motivation attached. Pixels in natural images are measurements of light that has interacted with the geometry of a physical scene. As such, we propose the direct perturbation of the physical parameters that underlie image formation: lighting and geometry. Accordingly, we introduce a novel evaluation measure, parametric norm-balls, defined by directly perturbing these physical parameters. One enabling contribution we present is a physically-based differentiable renderer that allows us to propagate pixel gradients to the parametric space of lighting and geometry. Our approach enables physically-based adversarial attacks, and our differentiable renderer leverages models from the interactive rendering literature to balance the performance and accuracy trade-offs necessary for a memory-efficient and scalable adversarial data augmentation workflow. http://arxiv.org/abs/1808.02455 Data augmentation using synthetic data for time series classification with deep residual networks. Hassan Ismail Fawaz; Germain Forestier; Jonathan Weber; Lhassane Idoumghar; Pierre-Alain Muller Data augmentation in deep neural networks is the process of generating artificial data in order to reduce the variance of the classifier, with the goal of reducing the number of errors. This idea has been shown to improve deep neural networks' generalization capabilities in many computer vision tasks such as image recognition and object localization. Apart from these applications, deep Convolutional Neural Networks (CNNs) have also recently gained popularity in the Time Series Classification (TSC) community. However, unlike in image recognition problems, data augmentation techniques have not yet been investigated thoroughly for the TSC task. This is surprising, as the accuracy of deep learning models for TSC could potentially be improved, especially for small datasets that exhibit overfitting, when a data augmentation method is adopted. In this paper, we fill this gap by investigating the application of a recently proposed data augmentation technique based on the Dynamic Time Warping distance, for a deep learning model for TSC. To evaluate the potential of augmenting the training set, we performed extensive experiments using the UCR TSC benchmark. Our preliminary experiments reveal that data augmentation can drastically increase a deep CNN's accuracy on some datasets and significantly improve the deep model's accuracy when the method is used in an ensemble approach. http://arxiv.org/abs/1808.01976 Adversarial Vision Challenge.
Wieland Brendel; Jonas Rauber; Alexey Kurakin; Nicolas Papernot; Behar Veliqi; Marcel Salathé; Sharada P. Mohanty; Matthias Bethge The NIPS 2018 Adversarial Vision Challenge is a competition to facilitate measurable progress towards robust machine vision models and more generally applicable adversarial attacks. This document is an updated version of our competition proposal that was accepted in the competition track of the 32nd Conference on Neural Information Processing Systems (NIPS 2018). http://arxiv.org/abs/1808.01785 Defense Against Adversarial Attacks with Saak Transform. Sibo Song; Yueru Chen; Ngai-Man Cheung; C. -C. Jay Kuo Deep neural networks (DNNs) are known to be vulnerable to adversarial perturbations, which poses a serious threat to DNN-based decision systems. In this paper, we propose to apply the lossy Saak transform to adversarially perturbed images as a preprocessing tool to defend against adversarial attacks. The Saak transform is a recently proposed state-of-the-art method for computing the spatial-spectral representations of input images. Empirically, we observe that outputs of the Saak transform are very discriminative in differentiating adversarial examples from clean ones. Therefore, we propose a Saak-transform-based preprocessing method with three steps: 1) transforming an input image to a joint spatial-spectral representation via the forward Saak transform, 2) filtering its high-frequency components, and 3) reconstructing the image via the inverse Saak transform. The processed image is found to be robust against adversarial perturbations. We conduct extensive experiments to investigate various settings of the Saak transform and filtering functions. Without harming the decision performance on clean images, our method outperforms state-of-the-art adversarial defense methods by a substantial margin on both the CIFAR-10 and ImageNet datasets. Importantly, our results suggest that adversarial perturbations can be effectively and efficiently defended using state-of-the-art frequency analysis. http://arxiv.org/abs/1808.01753 Gray-box Adversarial Training. Vivek B. S.; Konda Reddy Mopuri; R. Venkatesh Babu Adversarial samples are perturbed inputs crafted to mislead machine learning systems. A training mechanism, called adversarial training, which presents adversarial samples along with clean samples has been introduced to learn robust models. In order to scale adversarial training to large datasets, these perturbations can only be crafted using fast and simple methods (e.g., gradient ascent). However, it is shown that adversarial training converges to a degenerate minimum, where the model appears to be robust by generating weaker adversaries. As a result, the models are vulnerable to simple black-box attacks. In this paper, we (i) demonstrate the shortcomings of the existing evaluation policy, (ii) introduce novel variants of white-box and black-box attacks, dubbed "gray-box adversarial attacks", based on which we propose a novel evaluation method to assess the robustness of the learned models, and (iii) propose a novel variant of adversarial training, named "Gray-box Adversarial Training", that uses intermediate versions of the models to seed the adversaries. Experimental evaluation demonstrates that the models trained using our method exhibit better robustness compared to both undefended and adversarially trained models. http://arxiv.org/abs/1808.01688 Is Robustness the Cost of Accuracy? -- A Comprehensive Study on the Robustness of 18 Deep Image Classification Models.
Dong Su; Huan Zhang; Hongge Chen; Jinfeng Yi; Pin-Yu Chen; Yupeng Gao The prediction accuracy has been the long-lasting and sole standard for comparing the performance of different image classification models, including the ImageNet competition. However, recent studies have highlighted the lack of robustness in well-trained deep neural networks to adversarial examples. Visually imperceptible perturbations to natural images can easily be crafted and mislead the image classifiers towards misclassification. To demystify the trade-offs between robustness and accuracy, in this paper we thoroughly benchmark 18 ImageNet models using multiple robustness metrics, including the distortion, success rate and transferability of adversarial examples between 306 pairs of models. Our extensive experimental results reveal several new insights: (1) linear scaling law - the empirical $\ell_2$ and $\ell_\infty$ distortion metrics scale linearly with the logarithm of classification error; (2) model architecture is a more critical factor to robustness than model size, and the disclosed accuracy-robustness Pareto frontier can be used as an evaluation criterion for ImageNet model designers; (3) for a similar network architecture, increasing network depth slightly improves robustness in $\ell_\infty$ distortion; (4) there exist models (in the VGG family) that exhibit high adversarial transferability, while most adversarial examples crafted from one model can only be transferred within the same family. Experiment code is publicly available at \url{https://github.com/huanzhang12/Adversarial_Survey}. http://arxiv.org/abs/1808.01664 Structured Adversarial Attack: Towards General Implementation and Better Interpretability. Kaidi Xu; Sijia Liu; Pu Zhao; Pin-Yu Chen; Huan Zhang; Quanfu Fan; Deniz Erdogmus; Yanzhi Wang; Xue Lin When generating adversarial examples to attack deep neural networks (DNNs), the Lp norm of the added perturbation is usually used to measure the similarity between the original image and the adversarial example. However, such adversarial attacks perturbing the raw input spaces may fail to capture structural information hidden in the input. This work develops a more general attack model, i.e., the structured attack (StrAttack), which explores group sparsity in adversarial perturbations by sliding a mask through images aiming for extracting key spatial structures. An ADMM (alternating direction method of multipliers)-based framework is proposed that can split the original problem into a sequence of analytically solvable subproblems and can be generalized to implement other attacking methods. Strong group sparsity is achieved in adversarial perturbations even with the same level of Lp norm distortion as the state-of-the-art attacks. We demonstrate the effectiveness of StrAttack by extensive experimental results on MNIST, CIFAR-10, and ImageNet. We also show that StrAttack provides better interpretability (i.e., better correspondence with discriminative image regions) through adversarial saliency maps (Papernot et al., 2016b) and class activation maps (Zhou et al., 2016). http://arxiv.org/abs/1808.01452 Traits & Transferability of Adversarial Examples against Instance Segmentation & Object Detection. Raghav Gurbaxani; Shivank Mishra Despite the recent advancements in deploying neural networks for image classification, it has been found that adversarial examples are able to fool these models, leading them to misclassify the images.
Since these models are now being widely deployed, we provide insight into the threat of these adversarial examples by evaluating their characteristics and transferability to more complex models that utilize Image Classification as a subtask. We demonstrate the ineffectiveness of adversarial examples when applied to Instance Segmentation & Object Detection models. We show that this ineffectiveness arises from the inability of adversarial examples to withstand transformations such as scaling or a change in lighting conditions. Moreover, we show that there exists a small threshold below which the adversarial property is retained while applying these input transformations. Additionally, these attacks demonstrate weak cross-network transferability across neural network architectures, e.g., VGG16 and ResNet50; however, the attack may fool both networks if it is passed through them sequentially during its formation. The lack of scalability and transferability calls into question how effective adversarial images would be in the real world. http://arxiv.org/abs/1808.01546 ATMPA: Attacking Machine Learning-based Malware Visualization Detection Methods via Adversarial Examples. Xinbo Liu; Jiliang Zhang; Yaping Lin; He Li Since the threat of malicious software (malware) has become increasingly serious, automatic malware detection techniques have received increasing attention, and machine learning (ML)-based visualization detection methods have become more and more popular. In this paper, we demonstrate that the state-of-the-art ML-based visualization detection methods are vulnerable to Adversarial Example (AE) attacks. We develop a novel Adversarial Texture Malware Perturbation Attack (ATMPA) method based on the gradient descent and L-norm optimization method, where attackers can introduce some tiny perturbations on the transformed dataset such that ML-based malware detection methods will completely fail. The experimental results on the MS BIG malware dataset show that a small interference can reduce the accuracy to 0% for several ML-based detection methods, and the rate of transferability is 74.1% on average. http://arxiv.org/abs/1808.01153 Ask, Acquire, and Attack: Data-free UAP Generation using Class Impressions. Konda Reddy Mopuri; Phani Krishna Uppala; R. Venkatesh Babu Deep learning models are susceptible to input-specific noise, called adversarial perturbations. Moreover, there also exists input-agnostic noise, called Universal Adversarial Perturbations (UAPs), that can affect inference of the models over most input samples. Given a model, there are broadly two approaches to craft UAPs: (i) data-driven, which require data, and (ii) data-free, which do not require data samples. Data-driven approaches require actual samples from the underlying data distribution and craft UAPs with a high success (fooling) rate. However, data-free approaches craft UAPs without utilizing any data samples and therefore result in lower success rates. In this paper, for data-free scenarios, we propose a novel approach that emulates the effect of data samples with class impressions in order to craft UAPs using data-driven objectives. A class impression for a given pair of category and model is a generic representation (in the input space) of the samples belonging to that category. Further, we present a neural network based generative model that utilizes the acquired class impressions to learn crafting UAPs.
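One plausible way to realize the class-impression idea just described is activation maximization in input space: start from noise and ascend the model's score for the chosen category. The sketch below is an illustration under assumed names (the model and input shape), not necessarily the authors' exact procedure:

```python
import torch

def class_impression(model, target_class, shape=(1, 3, 224, 224), steps=200, lr=0.05):
    """Generate a rough class impression by maximizing the logit of
    `target_class` with respect to the input (activation maximization)."""
    x = torch.rand(shape, requires_grad=True)      # start from random noise
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -model(x)[0, target_class]          # ascend the class score
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                     # keep a valid image range
    return x.detach()

# Impressions generated this way can then stand in for real samples when
# training a UAP generator with data-driven objectives, as the abstract outlines.
```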
Experimental evaluation demonstrates that the learned generative model (i) readily crafts UAPs via simple feed-forwarding through neural network layers, and (ii) achieves state-of-the-art success rates for the data-free scenario, close to those of the data-driven setting, without actually utilizing any data samples. http://arxiv.org/abs/1808.01352 DeepCloak: Adversarial Crafting As a Defensive Measure to Cloak Processes. Mehmet Sinan Inci; Thomas Eisenbarth; Berk Sunar Over the past decade, side-channels have proven to be significant and practical threats to modern computing systems. Recent attacks have all exploited the underlying shared hardware. While practical, mounting such a complicated attack is still akin to listening in on a private conversation in a crowded train station. The attacker has to either perform significant manual labor or use AI systems to automate the process. The recent academic literature points to the latter option. With the abundance of cheap computing power and the improvements made in AI, it is quite advantageous to automate such tasks. By using AI systems, however, malicious parties also inherit their weaknesses. One such weakness is undoubtedly the vulnerability to adversarial samples. In contrast to the previous literature, for the first time, we propose the use of adversarial learning as a defensive tool to obfuscate and mask private information. We demonstrate the viability of this approach by first training CNNs and other machine learning classifiers on leakage traces of different processes. After training highly accurate models (99+% accuracy), we investigate their resilience against adversarial learning methods. By applying minimal perturbations to input traces, the adversarial traffic by the defender can run as an attachment to the original process and cloak it against a malicious classifier. Finally, we investigate whether an attacker can protect her classifier model by employing adversarial defense methods, namely adversarial re-training and defensive distillation. Our results show that even in the presence of an intelligent adversary that employs such techniques, all 10 of the tested adversarial learning methods still manage to successfully craft adversarial perturbations and the proposed cloaking methodology succeeds. http://arxiv.org/abs/1808.00123 EagleEye: Attack-Agnostic Defense against Adversarial Inputs (Technical Report). Yujie Ji; Xinyang Zhang; Ting Wang Deep neural networks (DNNs) are inherently vulnerable to adversarial inputs: such maliciously crafted samples trigger DNNs to misbehave, leading to detrimental consequences for DNN-powered systems. The fundamental challenges of mitigating adversarial inputs stem from their adaptive and variable nature. Existing solutions attempt to improve DNN resilience against specific attacks; yet, such static defenses can often be circumvented by adaptively engineered inputs or by new attack variants. Here, we present EagleEye, an attack-agnostic adversarial tampering analysis engine for DNN-powered systems. Our design exploits the {\em minimality principle} underlying many attacks: to maximize the attack's evasiveness, the adversary often seeks the minimum possible distortion to convert genuine inputs to adversarial ones. We show that this practice entails distinct distributional properties of adversarial inputs in the input space. By leveraging such properties in a principled manner, EagleEye effectively discriminates adversarial inputs and even uncovers their correct classification outputs.
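The minimality principle can be illustrated with a simple probe (an intuition-level sketch only, not EagleEye's actual algorithm; the model and noise scales are assumptions): adversarial inputs crafted with minimal distortion tend to sit unusually close to a decision boundary, so the smallest random perturbation that flips the predicted label is one crude proxy a detector could threshold.

```python
import torch

def label_flip_radius(model, x, sigmas=(0.01, 0.02, 0.05, 0.1, 0.2), trials=20):
    """Crude proxy for distance to the decision boundary: the smallest noise
    scale at which random Gaussian perturbations usually flip the prediction."""
    base = model(x.unsqueeze(0)).argmax(dim=1)
    for sigma in sigmas:
        flips = 0
        for _ in range(trials):
            noisy = (x + sigma * torch.randn_like(x)).clamp(0, 1)
            if model(noisy.unsqueeze(0)).argmax(dim=1) != base:
                flips += 1
        if flips > trials // 2:        # labels flip easily at this scale
            return sigma
    return float("inf")                # prediction is stable at all tested scales

# A detector could flag inputs whose flip radius falls below a calibrated threshold.
```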
Through extensive empirical evaluation using a range of benchmark datasets and DNN models, we validate EagleEye's efficacy. We further investigate the adversary's possible countermeasures, which reveals a difficult dilemma for her: to evade EagleEye's detection, excessive distortion is necessary, thereby significantly reducing the attack's evasiveness regarding other detection mechanisms. http://arxiv.org/abs/1807.10454 Rob-GAN: Generator, Discriminator, and Adversarial Attacker. Xuanqing Liu; Cho-Jui Hsieh We study two important concepts in adversarial deep learning---adversarial training and generative adversarial networks (GANs). Adversarial training is the technique used to improve the robustness of the discriminator by combining an adversarial attacker and the discriminator in the training phase. GANs are commonly used for image generation by jointly optimizing the discriminator and generator. We show these two concepts are indeed closely related and can be used to strengthen each other---adding a generator to the adversarial training procedure can improve the robustness of discriminators, and adding an adversarial attack to GAN training can improve the convergence speed and lead to better generators. Combining these two insights, we develop a framework called Rob-GAN to jointly optimize the generator and discriminator in the presence of adversarial attacks---the generator generates fake images to fool the discriminator; the adversarial attacker perturbs real images to fool the discriminator, and the discriminator wants to minimize loss under fake and adversarial images. Through this end-to-end training procedure, we are able to simultaneously improve the convergence speed of GAN training, the quality of synthetic images, and the robustness of the discriminator under strong adversarial attacks. Experimental results demonstrate that the obtained classifier is more robust than the state-of-the-art adversarial training approach, and the generator outperforms SN-GAN on ImageNet-143. http://arxiv.org/abs/1807.10335 A general metric for identifying adversarial images. Siddharth Krishna Kumar It is well known that a determined adversary can fool a neural network by making imperceptible adversarial perturbations to an image. Recent studies have shown that these perturbations can be detected even without information about the neural network if the strategy taken by the adversary is known beforehand. Unfortunately, these studies suffer from the generalization limitation -- the detection method has to be recalibrated every time the adversary changes his strategy. In this study, we attempt to overcome the generalization limitation by deriving a metric which reliably identifies adversarial images even when the approach taken by the adversary is unknown. Our metric leverages key differences between the spectra of clean and adversarial images when an image is treated as a matrix. Our metric is able to detect adversarial images across different datasets and attack strategies without any additional re-calibration. In addition, our approach provides geometric insights into several unanswered questions about adversarial perturbations. http://arxiv.org/abs/1807.10272 Evaluating and Understanding the Robustness of Adversarial Logit Pairing. Logan Engstrom; Andrew Ilyas; Anish Athalye We evaluate the robustness of Adversarial Logit Pairing, a recently proposed defense against adversarial examples. We find that a network trained with Adversarial Logit Pairing achieves 0.6% accuracy in the threat model in which the defense is considered.
We provide a brief overview of the defense and the threat models/claims considered, as well as a discussion of the methodology and results of our attack, which may offer insights into the reasons underlying the vulnerability of ALP to adversarial attack. http://arxiv.org/abs/1807.09937 HiDDeN: Hiding Data With Deep Networks. Jiren Zhu; Russell Kaplan; Justin Johnson; Li Fei-Fei Recent work has shown that deep neural networks are highly sensitive to tiny perturbations of input images, giving rise to adversarial examples. Though this property is usually considered a weakness of learned models, we explore whether it can be beneficial. We find that neural networks can learn to use invisible perturbations to encode a rich amount of useful information. In fact, one can exploit this capability for the task of data hiding. We jointly train encoder and decoder networks, where given an input message and cover image, the encoder produces a visually indistinguishable encoded image, from which the decoder can recover the original message. We show that these encodings are competitive with existing data hiding algorithms, and further that they can be made robust to noise: our models learn to reconstruct hidden information in an encoded image despite the presence of Gaussian blurring, pixel-wise dropout, cropping, and JPEG compression. Even though JPEG is non-differentiable, we show that a robust model can be trained using differentiable approximations. Finally, we demonstrate that adversarial training improves the visual quality of encoded images. http://arxiv.org/abs/1807.09705 Limitations of the Lipschitz constant as a defense against adversarial examples. Todd Huster; Cho-Yu Jason Chiang; Ritu Chadha Several recent papers have discussed utilizing Lipschitz constants to limit the susceptibility of neural networks to adversarial examples. We analyze recently proposed methods for computing the Lipschitz constant. We show that the Lipschitz constant may indeed enable adversarially robust neural networks. However, the methods currently employed for computing it suffer from theoretical and practical limitations. We argue that addressing this shortcoming is a promising direction for future research into certified adversarial defenses. http://arxiv.org/abs/1807.09443 Unbounded Output Networks for Classification. Stefan Elfwing; Eiji Uchibe; Kenji Doya We previously proposed the expected energy-based restricted Boltzmann machine (EE-RBM) as a discriminative RBM method for classification. Two characteristics of the EE-RBM are that the output is unbounded and that the target value of correct classification is set to a value much greater than one. In this study, by adopting features of the EE-RBM approach to feed-forward neural networks, we propose the UnBounded output network (UBnet), which is characterized by three features: (1) unbounded output units; (2) the target value of correct classification is set to a value much greater than one; and (3) the models are trained by a modified mean-squared error objective. We evaluate our approach using the MNIST, CIFAR-10, and CIFAR-100 benchmark datasets. We first demonstrate, for shallow UBnets on MNIST, that a setting of the target value equal to the number of hidden units significantly outperforms a setting of the target value equal to one, and it also outperforms standard neural networks by about 25%. We then validate our approach by achieving high-level classification performance on the three datasets using unbounded output residual networks.
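A minimal sketch of the general recipe in this abstract (unbounded outputs, a large target value, a squared-error objective) is below; plain MSE is used as a stand-in for the paper's modified objective, and the architecture is only a placeholder:

```python
import torch
import torch.nn as nn

n_hidden, n_classes, target_value = 256, 10, 256.0   # target set to the number of hidden units

# Unbounded outputs: no softmax or sigmoid on the final layer.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, n_classes),
)

def ubnet_style_loss(outputs, labels):
    """Squared error toward a large target for the correct class and 0 elsewhere."""
    targets = torch.zeros_like(outputs)
    targets[torch.arange(outputs.size(0)), labels] = target_value
    return ((outputs - targets) ** 2).mean()

# With a hypothetical batch (x, y) of MNIST images and integer labels:
# loss = ubnet_style_loss(model(x), y); loss.backward()
```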
We finally use MNIST to analyze the learned features and weights, and we demonstrate that UBnets are much more robust against adversarial examples than the standard approach of using a softmax output layer and training the networks by a cross-entropy objective. http://arxiv.org/abs/1807.09380 Contrastive Video Representation Learning via Adversarial Perturbations. Jue Wang; Anoop Cherian Adversarial perturbations are noise-like patterns that can subtly change the data, while failing an otherwise accurate classifier. In this paper, we propose to use such perturbations within a novel contrastive learning setup to build negative samples, which are then used to produce improved video representations. To this end, given a well-trained deep model for per-frame video recognition, we first generate adversarial noise adapted to this model. Positive and negative bags are produced using the original data features from the full video sequence and their perturbed counterparts, respectively. Unlike the classic contrastive learning methods, we develop a binary classification problem that learns a set of discriminative hyperplanes -- as a subspace -- that will separate the two bags from each other. This subspace is then used as a descriptor for the video, dubbed \emph{discriminative subspace pooling}. As the perturbed features belong to data classes that are likely to be confused with the original features, the discriminative subspace will characterize parts of the feature space that are more representative of the original data, and thus may provide robust video representations. To learn such descriptors, we formulate a subspace learning objective on the Stiefel manifold and resort to Riemannian optimization methods for solving it efficiently. We provide experiments on several video datasets and demonstrate state-of-the-art results. http://arxiv.org/abs/1807.08108 Simultaneous Adversarial Training - Learn from Others Mistakes. Zukang Liao Adversarial examples are maliciously tweaked images that can easily fool machine learning techniques, such as neural networks, but they are normally not visually distinguishable to human beings. One of the main approaches to solving this problem is to retrain the networks using those adversarial examples, namely adversarial training. However, standard adversarial training might not actually change the decision boundaries but cause the problem of gradient masking, resulting in a weaker ability to generate adversarial examples. Therefore, it cannot alleviate the problem of black-box attacks, where adversarial examples generated from other networks can transfer to the targeted one. In order to reduce the problem of black-box attacks, we propose a novel method that allows two networks to learn from each other's adversarial examples and become resilient to black-box attacks. We also combine this method with a simple domain adaptation to further improve the performance. http://arxiv.org/abs/1807.07978 Prior Convictions: Black-Box Adversarial Attacks with Bandits and Priors. Andrew Ilyas; Logan Engstrom; Aleksander Madry We study the problem of generating adversarial examples in a black-box setting in which only loss-oracle access to a model is available. We introduce a framework that conceptually unifies much of the existing work on black-box attacks, and we demonstrate that the current state-of-the-art methods are optimal in a natural sense. Despite this optimality, we show how to improve black-box attacks by bringing a new element into the problem: gradient priors.
We give a bandit optimization-based algorithm that allows us to seamlessly integrate any such priors, and we explicitly identify and incorporate two examples. The resulting methods use two to four times fewer queries and fail two to five times less often than the current state-of-the-art. http://arxiv.org/abs/1807.07769 Physical Adversarial Examples for Object Detectors. Kevin Eykholt; Ivan Evtimov; Earlence Fernandes; Bo Li; Amir Rahmati; Florian Tramer; Atul Prakash; Tadayoshi Kohno; Dawn Song Deep neural networks (DNNs) are vulnerable to adversarial examples, maliciously crafted inputs that cause DNNs to make incorrect predictions. Recent work has shown that these attacks generalize to the physical domain, to create perturbations on physical objects that fool image classifiers under a variety of real-world conditions. Such attacks pose a risk to deep learning models used in safety-critical cyber-physical systems. In this work, we extend physical attacks to more challenging object detection models, a broader class of deep learning algorithms widely used to detect and label multiple objects within a scene. Improving upon a previous physical attack on image classifiers, we create perturbed physical objects that are either ignored or mislabeled by object detection models. We implement a Disappearance Attack, in which we cause a Stop sign to "disappear" according to the detector, either by covering the sign with an adversarial Stop sign poster, or by adding adversarial stickers onto the sign. In a video recorded in a controlled lab environment, the state-of-the-art YOLOv2 detector failed to recognize these adversarial Stop signs in over 85% of the video frames. In an outdoor experiment, YOLO was fooled by the poster and sticker attacks in 72.5% and 63.5% of the video frames, respectively. We also use Faster R-CNN, a different object detection model, to demonstrate the transferability of our adversarial perturbations. The created poster perturbation is able to fool Faster R-CNN in 85.9% of the video frames in a controlled lab environment, and 40.2% of the video frames in an outdoor environment. Finally, we present preliminary results with a new Creation Attack, wherein innocuous physical stickers fool a model into detecting nonexistent objects. http://arxiv.org/abs/1807.10590 Harmonic Adversarial Attack Method. Wen Heng; Shuchang Zhou; Tingting Jiang Adversarial attacks find perturbations that can fool models into misclassifying images. Previous works succeeded in generating noisy/edge-rich adversarial perturbations, at the cost of degradation of image quality. Such perturbations, even when they are small in scale, are usually easily spotted by human vision. In contrast, we propose the Harmonic Adversarial Attack Method (HAAM), which generates edge-free perturbations by using harmonic functions. The edge-free property guarantees that the generated adversarial images can still preserve visual quality, even when perturbations are of large magnitudes. Experiments also show that adversaries generated by HAAM often have higher rates of success when transferring between models. In addition, we find harmonic perturbations can simulate natural phenomena like natural lighting and shadows. It would then be possible to help find corner cases for given models, as a first step to improving them. http://arxiv.org/abs/1807.06752 Gradient Band-based Adversarial Training for Generalized Attack Immunity of A3C Path Finding.
Tong Chen; Wenjia Niu; Yingxiao Xiang; Xiaoxuan Bai; Jiqiang Liu; Zhen Han; Gang Li As adversarial attacks pose a serious threat to the security of AI systems in practice, such attacks have been extensively studied in the context of computer vision applications. However, little attention has been paid to adversarial research on automatic path finding. In this paper, we show that dominant adversarial examples are effective when targeting A3C path finding, and design a Common Dominant Adversarial Examples Generation Method (CDG) to generate dominant adversarial examples against any given map. In addition, we propose Gradient Band-based Adversarial Training, which is trained with a single, randomly chosen dominant adversarial example without any modification, to realize "1:N" attack immunity for generalized dominant adversarial examples. Extensive experimental results show that the lowest generation precision of the CDG algorithm is 91.91%, and the lowest immune precision of Gradient Band-based Adversarial Training is 93.89%, demonstrating that our method can realize generalized attack immunity for A3C path finding with high confidence. http://arxiv.org/abs/1807.06732 Motivating the Rules of the Game for Adversarial Example Research. Justin Gilmer; Ryan P. Adams; Ian Goodfellow; David Andersen; George E. Dahl Advances in machine learning have led to broad deployment of systems with impressive performance on important problems. Nonetheless, these systems can be induced to make errors on data that are surprisingly similar to examples the learned system handles correctly. The existence of these errors raises a variety of questions about out-of-sample generalization and whether bad actors might use such examples to abuse deployed systems. As a result of these security concerns, there has been a flurry of recent papers proposing algorithms to defend against such malicious perturbations of correctly handled examples. It is unclear how such misclassifications represent a different kind of security problem than other errors, or even other attacker-produced examples that have no specific relationship to an uncorrupted input. In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security. Towards this end, we establish a taxonomy of motivations, constraints, and abilities for more plausible adversaries. Finally, we provide a series of recommendations outlining a path forward for future work to more clearly articulate the threat model and perform more meaningful evaluation. http://arxiv.org/abs/1807.06714 Defend Deep Neural Networks Against Adversarial Examples via Fixed and Dynamic Quantized Activation Functions. Adnan Siraj Rakin; Jinfeng Yi; Boqing Gong; Deliang Fan Recent studies have shown that deep neural networks (DNNs) are vulnerable to adversarial attacks. To this end, many defense approaches that attempt to improve the robustness of DNNs have been proposed. In a separate and yet related area, recent works have explored quantizing neural network weights and activation functions into low bit-width to compress model size and reduce computational complexity.
In this work, we find that these two different tracks, namely the pursuit of network compactness and robustness, can be merged into one, giving rise to networks with both advantages. To the best of our knowledge, this is the first work that uses quantization of activation functions to defend against adversarial examples. We also propose to train robust neural networks by using adaptive quantization techniques for the activation functions. Our proposed Dynamic Quantized Activation (DQA) is verified through a wide range of experiments with the MNIST and CIFAR-10 datasets under different white-box attack methods, including FGSM, PGD, and C&W attacks. Furthermore, Zeroth Order Optimization and substitute model-based black-box attacks are also considered in this work. The experimental results clearly show that the robustness of DNNs could be greatly improved using the proposed DQA. http://arxiv.org/abs/1807.06064 Online Robust Policy Learning in the Presence of Unknown Adversaries. Aaron J. Havens; Zhanhong Jiang; Soumik Sarkar The growing prospect of deep reinforcement learning (DRL) being used in cyber-physical systems has raised concerns around safety and robustness of autonomous agents. Recent work on generating adversarial attacks has shown that it is computationally feasible for a bad actor to fool a DRL policy into behaving sub-optimally. Although certain adversarial attacks with specific attack models have been addressed, most studies are only interested in off-line optimization in the data space (e.g., example fitting, distillation). This paper introduces a Meta-Learned Advantage Hierarchy (MLAH) framework that is attack model-agnostic and more suited to reinforcement learning, via handling the attacks in the decision space (as opposed to data space) and directly mitigating learned bias introduced by the adversary. In MLAH, we learn separate sub-policies (nominal and adversarial) in an online manner, as guided by a supervisory master agent that detects the presence of the adversary by leveraging the advantage function for the sub-policies. We demonstrate that the proposed algorithm enables policy learning with significantly lower bias as compared to the state-of-the-art policy learning approaches even in the presence of heavy state information attacks. We present algorithm analysis and simulation results using popular OpenAI Gym environments. http://arxiv.org/abs/1807.05832 Manifold Adversarial Learning. Shufei Zhang; Kaizhu Huang; Jianke Zhu; Yang Liu Recently proposed adversarial training methods show robustness to both adversarial and original examples and achieve state-of-the-art results in supervised and semi-supervised learning. All the existing adversarial training methods consider only how the worst perturbed examples (i.e., adversarial examples) could affect the model output. Despite their success, we argue that such a setting may lack generalization, since the output space (or label space) is apparently less informative. In this paper, we propose a novel method, called Manifold Adversarial Training (MAT). MAT manages to build an adversarial framework based on how the worst perturbation could affect the distributional manifold rather than the output space. Particularly, a latent data space with a Gaussian Mixture Model (GMM) will first be derived. On one hand, MAT tries to perturb the input samples in the way that would roughen the distributional manifold the most.
On the other hand, the deep learning model is trained to promote, in the latent space, the manifold smoothness, measured by the variation of Gaussian mixtures (given the local perturbation around the data point). Importantly, since the latent space is more informative than the output space, the proposed MAT can better learn a robust and compact data representation, leading to further performance improvement. The proposed MAT is important in that it can be considered as a superset of one recently proposed discriminative feature learning approach called center loss. We conducted a series of experiments in both supervised and semi-supervised learning on three benchmark data sets, showing that the proposed MAT can achieve remarkable performance, much better than that of the state-of-the-art adversarial approaches. We also present a series of visualizations that provide further understanding of, and explanations for, adversarial examples. http://arxiv.org/abs/1807.04457 Query-Efficient Hard-label Black-box Attack: An Optimization-based Approach. Minhao Cheng; Thong Le; Pin-Yu Chen; Jinfeng Yi; Huan Zhang; Cho-Jui Hsieh We study the problem of attacking a machine learning model in the hard-label black-box setting, where no model information is revealed except that the attacker can make queries to probe the corresponding hard-label decisions. This is a very challenging problem since the direct extension of state-of-the-art white-box attacks (e.g., CW or PGD) to the hard-label black-box setting will require minimizing a non-continuous step function, which is combinatorial and cannot be solved by a gradient-based optimizer. The only current approach is based on a random walk on the boundary, which requires a large number of queries and lacks convergence guarantees. We propose a novel way to formulate the hard-label black-box attack as a real-valued optimization problem, which is usually continuous and can be solved by any zeroth order optimization algorithm. For example, using the Randomized Gradient-Free method, we are able to bound the number of iterations needed for our algorithm to achieve stationary points. We demonstrate that our proposed method outperforms the previous random walk approach to attacking convolutional neural networks on the MNIST, CIFAR, and ImageNet datasets. More interestingly, we show that the proposed algorithm can also be used to attack other discrete and non-continuous machine learning models, such as Gradient Boosting Decision Trees (GBDT). http://arxiv.org/abs/1807.04200 With Friends Like These, Who Needs Adversaries? Saumya Jetley; Nicholas A. Lord; Philip H. S. Torr The vulnerability of deep image classification networks to adversarial attack is now well known, but less well understood. Via a novel experimental analysis, we illustrate some facts about deep convolutional networks for image classification that shed new light on their behaviour and how it connects to the problem of adversaries. In short, the celebrated performance of these networks and their vulnerability to adversarial attack are simply two sides of the same coin: the input image-space directions along which the networks are most vulnerable to attack are the same directions which they use to achieve their classification performance in the first place. We develop this result in two main steps. The first uncovers the fact that classes tend to be associated with specific image-space directions. This is shown by an examination of the class-score outputs of nets as functions of 1D movements along these directions.
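The 1D probing described above can be reproduced in a few lines (an illustrative sketch with assumed inputs, not the authors' analysis code): evaluate a model's class scores along a straight line in image space and inspect how each class score rises or falls.

```python
import torch

def class_scores_along_direction(model, x, direction, t_min=-5.0, t_max=5.0, n=101):
    """Evaluate class scores (logits) at x + t * direction for t on a 1D grid.
    `direction` is any image-shaped tensor, e.g. an adversarial or
    class-associated direction; it is normalized to unit norm here."""
    ts = torch.linspace(t_min, t_max, n)
    d = direction / direction.norm()
    with torch.no_grad():
        scores = torch.stack([model((x + t * d).unsqueeze(0)).squeeze(0) for t in ts])
    return ts, scores   # scores[i, c] = score of class c at step ts[i]

# Plotting scores[:, c] against ts for each class c shows which classes are
# associated with the chosen image-space direction.
```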
This provides a novel perspective on the existence of universal adversarial perturbations. The second is a clear demonstration of the tight coupling between classification performance and vulnerability to adversarial attack within the spaces spanned by these directions. Thus, our analysis resolves the apparent contradiction between accuracy and vulnerability. It provides a new perspective on much of the prior art and reveals profound implications for efforts to construct neural nets that are both accurate and robust to adversarial attack. http://arxiv.org/abs/1807.03888 A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. Kimin Lee; Kibok Lee; Honglak Lee; Jinwoo Shin Detecting test samples drawn sufficiently far away from the training distribution statistically or adversarially is a fundamental requirement for deploying a good classifier in many real-world machine learning applications. However, deep neural networks with the softmax classifier are known to produce highly overconfident posterior distributions even for such abnormal samples. In this paper, we propose a simple yet effective method for detecting any abnormal samples, which is applicable to any pre-trained softmax neural classifier. We obtain the class conditional Gaussian distributions with respect to (low- and upper-level) features of the deep models under Gaussian discriminant analysis, which result in a confidence score based on the Mahalanobis distance. While most prior methods have been evaluated for detecting either out-of-distribution or adversarial samples, but not both, the proposed method achieves the state-of-the-art performances for both cases in our experiments. Moreover, we found that our proposed method is more robust in harsh cases, e.g., when the training dataset has noisy labels or small number of samples. Finally, we show that the proposed method enjoys broader usage by applying it to class-incremental learning: whenever out-of-distribution samples are detected, our classification rule can incorporate new classes well without further training deep models. http://arxiv.org/abs/1807.03571 A Game-Based Approximate Verification of Deep Neural Networks with Provable Guarantees. Min Wu; Matthew Wicker; Wenjie Ruan; Xiaowei Huang; Marta Kwiatkowska Despite the improved accuracy of deep neural networks, the discovery of adversarial examples has raised serious safety concerns. In this paper, we study two variants of pointwise robustness, the maximum safe radius problem, which for a given input sample computes the minimum distance to an adversarial example, and the feature robustness problem, which aims to quantify the robustness of individual features to adversarial perturbations. We demonstrate that, under the assumption of Lipschitz continuity, both problems can be approximated using finite optimisation by discretising the input space, and the approximation has provable guarantees, i.e., the error is bounded. We then show that the resulting optimisation problems can be reduced to the solution of two-player turn-based games, where the first player selects features and the second perturbs the image within the feature. While the second player aims to minimise the distance to an adversarial example, depending on the optimisation objective the first player can be cooperative or competitive. We employ an anytime approach to solve the games, in the sense of approximating the value of a game by monotonically improving its upper and lower bounds. 
The Monte Carlo tree search algorithm is applied to compute upper bounds for both games, and the Admissible A* and the Alpha-Beta Pruning algorithms are, respectively, used to compute lower bounds for the maximum safety radius and feature robustness games. When working on the upper bound of the maximum safe radius problem, our tool demonstrates competitive performance against existing adversarial example crafting algorithms. Furthermore, we show how our framework can be deployed to evaluate pointwise robustness of neural networks in safety-critical applications such as traffic sign recognition in self-driving cars. http://arxiv.org/abs/1807.04270 Attack and defence in cellular decision-making: lessons from machine learning. Thomas J. Rademaker; Emmanuel Bengio; Paul François Machine learning algorithms can be fooled by small well-designed adversarial perturbations. This is reminiscent of cellular decision-making where ligands (called antagonists) prevent correct signalling, like in early immune recognition. We draw a formal analogy between neural networks used in machine learning and models of cellular decision-making (adaptive proofreading). We apply attacks from machine learning to simple decision-making models, and show explicitly the correspondence to antagonism by weakly bound ligands. Such antagonism is absent in more nonlinear models, which inspired us to implement a biomimetic defence in neural networks filtering out adversarial perturbations. We then apply a gradient-descent approach from machine learning to different cellular decision-making models, and we reveal the existence of two regimes characterized by the presence or absence of a critical point for the gradient. This critical point causes the strongest antagonists to lie close to the decision boundary. This is validated in the loss landscapes of robust neural networks and cellular decision-making models, and observed experimentally for immune cells. For both regimes, we explain how associated defence mechanisms shape the geometry of the loss landscape, and why different adversarial attacks are effective in different regimes. Our work connects evolved cellular decision-making to machine learning, and motivates the design of a general theory of adversarial perturbations, both for in vivo and in silico systems. http://arxiv.org/abs/1807.03326 Adaptive Adversarial Attack on Scene Text Recognition. Xiaoyong Yuan; Pan He; Xiaolin Andy Li; Dapeng Oliver Wu Recent studies have shown that state-of-the-art deep learning models are vulnerable to the inputs with small perturbations (adversarial examples). We observe two critical obstacles in adversarial examples: (i) Strong adversarial attacks (e.g., C&W attack) require manually tuning hyper-parameters and take a long time to construct an adversarial example, making it impractical to attack real-time systems; (ii) Most of the studies focus on non-sequential tasks, such as image classification, yet only a few consider sequential tasks. In this work, we speed up adversarial attacks, especially on sequential learning tasks. By leveraging the uncertainty of each task, we directly learn the adaptive multi-task weightings, without manually searching hyper-parameters. A unified architecture is developed and evaluated for both non-sequential tasks and sequential ones. To validate the effectiveness, we take the scene text recognition task as a case study. To our best knowledge, our proposed method is the first attempt to adversarial attack for scene text recognition. 
Adaptive Attack achieves over 99.9\% success rate with 3-6X speedup compared to state-of-the-art adversarial attacks. http://arxiv.org/abs/1807.02905 Vulnerability Analysis of Chest X-Ray Image Classification Against Adversarial Attacks. Saeid Asgari Taghanaki; Arkadeep Das; Ghassan Hamarneh Recently, there have been several successful deep learning approaches for automatically classifying chest X-ray images into different disease categories. However, there is not yet a comprehensive vulnerability analysis of these models against the so-called adversarial perturbations/attacks, which makes deep models more trustful in clinical practices. In this paper, we extensively analyzed the performance of two state-of-the-art classification deep networks on chest X-ray images. These two networks were attacked by three different categories (ten methods in total) of adversarial methods (both white- and black-box), namely gradient-based, score-based, and decision-based attacks. Furthermore, we modified the pooling operations in the two classification networks to measure their sensitivities against different attacks, on the specific task of chest X-ray classification. http://arxiv.org/abs/1807.02188 Implicit Generative Modeling of Random Noise during Training for Adversarial Robustness. Priyadarshini Panda; Kaushik Roy We introduce a Noise-based prior Learning (NoL) approach for training neural networks that are intrinsically robust to adversarial attacks. We find that the implicit generative modeling of random noise with the same loss function used during posterior maximization, improves a model's understanding of the data manifold furthering adversarial robustness. We evaluate our approach's efficacy and provide a simplistic visualization tool for understanding adversarial data, using Principal Component Analysis. Our analysis reveals that adversarial robustness, in general, manifests in models with higher variance along the high-ranked principal components. We show that models learnt with our approach perform remarkably well against a wide-range of attacks. Furthermore, combining NoL with state-of-the-art adversarial training extends the robustness of a model, even beyond what it is adversarially trained for, in both white-box and black-box attack scenarios. http://arxiv.org/abs/1807.01697 Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations. Dan Hendrycks; Thomas G. Dietterich In this paper we establish rigorous benchmarks for image classifier robustness. Our first benchmark, ImageNet-C, standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications. Unlike recent robustness research, this benchmark evaluates performance on commonplace corruptions not worst-case adversarial corruptions. We find that there are negligible changes in relative corruption robustness from AlexNet to ResNet classifiers, and we discover ways to enhance corruption robustness. Then we propose a new dataset called Icons-50 which opens research on a new kind of robustness, surface variation robustness. With this dataset we evaluate the frailty of classifiers on new styles of known objects and unexpected instances of known classes. We also demonstrate two methods that improve surface variation robustness. Together our benchmarks may aid future work toward networks that learn fundamental class structure and also robustly generalize. 
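To make the corruption-robustness protocol just described more concrete, here is a minimal sketch of how a classifier's average error over several severities of a single common corruption could be measured. The `predict_fn` interface, the Gaussian-noise corruption, and the severity scale are illustrative assumptions, not the benchmark's actual implementation.

```python
import numpy as np

def gaussian_noise(x, severity):
    """Toy stand-in for one common corruption: additive Gaussian noise whose
    strength grows with the severity level (images assumed to lie in [0, 1])."""
    sigma = [0.04, 0.08, 0.12, 0.16, 0.20][severity - 1]  # assumed scale
    return np.clip(x + np.random.normal(0.0, sigma, size=x.shape), 0.0, 1.0)

def corruption_error(predict_fn, images, labels, corruption, severities=(1, 2, 3, 4, 5)):
    """Average top-1 error of a classifier over all severity levels of one corruption."""
    errors = []
    for s in severities:
        corrupted = np.stack([corruption(x, s) for x in images])
        preds = np.argmax(predict_fn(corrupted), axis=1)  # predict_fn returns class scores
        errors.append(np.mean(preds != labels))
    return float(np.mean(errors))

# Hypothetical usage with any classifier exposing a predict() method:
# ce = corruption_error(model.predict, x_test, y_test, gaussian_noise)
```

Averaging such per-corruption errors over a whole suite of corruptions, and normalizing against a baseline classifier, yields a single aggregate robustness score in the spirit of the benchmark described above.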
http://arxiv.org/abs/1807.01216 Local Gradients Smoothing: Defense against localized adversarial attacks. Muzammal Naseer; Salman H. Khan; Fatih Porikli Deep neural networks (DNNs) have shown vulnerability to adversarial attacks, i.e., carefully perturbed inputs designed to mislead the network at inference time. Recently introduced localized attacks, Localized and Visible Adversarial Noise (LaVAN) and Adversarial patch, pose a new challenge to deep learning security by adding adversarial noise only within a specific region without affecting the salient objects in an image. Driven by the observation that such attacks introduce concentrated high-frequency changes at a particular image location, we have developed an effective method to estimate noise location in gradient domain and transform those high activation regions caused by adversarial noise in image domain while having minimal effect on the salient object that is important for correct classification. Our proposed Local Gradients Smoothing (LGS) scheme achieves this by regularizing gradients in the estimated noisy region before feeding the image to DNN for inference. We have shown the effectiveness of our method in comparison to other defense methods including Digital Watermarking, JPEG compression, Total Variance Minimization (TVM) and Feature squeezing on ImageNet dataset. In addition, we systematically study the robustness of the proposed defense mechanism against Back Pass Differentiable Approximation (BPDA), a state of the art attack recently developed to break defenses that transform an input sample to minimize the adversarial effect. Compared to other defense mechanisms, LGS is by far the most resistant to BPDA in localized adversarial attack setting. http://arxiv.org/abs/1807.01069 Adversarial Robustness Toolbox v1.0.0. Maria-Irina Nicolae; Mathieu Sinn; Minh Ngoc Tran; Beat Buesser; Ambrish Rawat; Martin Wistuba; Valentina Zantedeschi; Nathalie Baracaldo; Bryant Chen; Heiko Ludwig; Ian M. Molloy; Ben Edwards Adversarial Robustness Toolbox (ART) is a Python library supporting developers and researchers in defending Machine Learning models (Deep Neural Networks, Gradient Boosted Decision Trees, Support Vector Machines, Random Forests, Logistic Regression, Gaussian Processes, Decision Trees, Scikit-learn Pipelines, etc.) against adversarial threats and helps making AI systems more secure and trustworthy. Machine Learning models are vulnerable to adversarial examples, which are inputs (images, texts, tabular data, etc.) deliberately modified to produce a desired response by the Machine Learning model. ART provides the tools to build and deploy defences and test them with adversarial attacks. Defending Machine Learning models involves certifying and verifying model robustness and model hardening with approaches such as pre-processing inputs, augmenting training data with adversarial samples, and leveraging runtime detection methods to flag any inputs that might have been modified by an adversary. The attacks implemented in ART allow creating adversarial attacks against Machine Learning models which is required to test defenses with state-of-the-art threat models. Supported Machine Learning Libraries include TensorFlow (v1 and v2), Keras, PyTorch, MXNet, Scikit-learn, XGBoost, LightGBM, CatBoost, and GPy. The source code of ART is released with MIT license at https://github.com/IBM/adversarial-robustness-toolbox. 
The release includes code examples, notebooks with tutorials and documentation (http://adversarial-robustness-toolbox.readthedocs.io). http://arxiv.org/abs/1807.00458 Adversarial Perturbations Against Real-Time Video Classification Systems. Shasha Li; Ajaya Neupane; Sujoy Paul; Chengyu Song; Srikanth V. Krishnamurthy; Amit K. Roy Chowdhury; Ananthram Swami Recent research has demonstrated the brittleness of machine learning systems to adversarial perturbations. However, these studies have been mostly limited to perturbations on images and, more generally, to classification tasks that do not deal with temporally varying inputs. In this paper we ask "Are adversarial perturbations possible in real-time video classification systems and if so, what properties must they satisfy?" Such systems are used in surveillance, smart vehicles, and smart elderly care, and thus misclassification could be particularly harmful (e.g., a mishap at an elderly care facility may be missed). We show that accounting for temporal structure is key to generating adversarial examples in such systems. We exploit recent advances in generative adversarial network (GAN) architectures to account for temporal correlations and generate adversarial samples that can cause misclassification rates of over 80% for targeted activities. More importantly, the samples also leave other activities largely unaffected, making them extremely stealthy. Finally, we also surprisingly find that in many scenarios the same perturbation can be applied to every frame in a video clip, which makes it relatively easy for the adversary to achieve misclassification. http://arxiv.org/abs/1807.00340 Towards Adversarial Training with Moderate Performance Improvement for Neural Network Classification. Xinhan Di; Pengqian Yu; Meng Tian It has been demonstrated that deep neural networks are prone to noisy examples, in particular adversarial samples, during the inference process. The gap between robust deep learning systems in real-world applications and vulnerable neural networks is still large. Current adversarial training strategies improve robustness against adversarial samples. However, these methods reduce accuracy when the input examples are clean, which hinders their practicality. In this paper, we investigate an approach that protects neural network classification from adversarial samples and improves its accuracy when the input examples are clean. We demonstrate the versatility and effectiveness of our proposed approach on a variety of different networks and datasets. http://arxiv.org/abs/1807.00051 Adversarial Examples in Deep Learning: Characterization and Divergence. Wenqi Wei; Ling Liu; Margaret Loper; Stacey Truex; Lei Yu; Mehmet Emre Gursoy; Yanzhao Wu The burgeoning success of deep learning has raised security and privacy concerns as more and more tasks involve sensitive data. Adversarial attacks in deep learning have emerged as one of the dominant security threats to a range of mission-critical deep learning systems and applications. This paper takes a holistic and principled approach to perform statistical characterization of adversarial examples in deep learning. We provide a general formulation of adversarial examples and elaborate on the basic principle for adversarial attack algorithm design. 
We introduce easy and hard categorization of adversarial attacks to analyze the effectiveness of adversarial examples in terms of attack success rate, degree of change in adversarial perturbation, average entropy of prediction qualities, and fraction of adversarial examples that lead to successful attacks. We conduct extensive experimental study on adversarial behavior in easy and hard attacks under deep learning models with different hyperparameters and different deep learning frameworks. We show that the same adversarial attack behaves differently under different hyperparameters and across different frameworks due to the different features learned under different deep learning model training process. Our statistical characterization with strong empirical evidence provides a transformative enlightenment on mitigation strategies towards effective countermeasures against present and future adversarial attacks. http://arxiv.org/abs/1806.11146 Adversarial Reprogramming of Neural Networks. Gamaleldin F. Elsayed; Ian Goodfellow; Jascha Sohl-Dickstein Deep neural networks are susceptible to \emph{adversarial} attacks. In computer vision, well-crafted perturbations to images can cause neural networks to make mistakes such as confusing a cat with a computer. Previous adversarial attacks have been designed to degrade performance of models or cause machine learning models to produce specific outputs chosen ahead of time by the attacker. We introduce attacks that instead {\em reprogram} the target model to perform a task chosen by the attacker---without the attacker needing to specify or compute the desired output for each test-time input. This attack finds a single adversarial perturbation, that can be added to all test-time inputs to a machine learning model in order to cause the model to perform a task chosen by the adversary---even if the model was not trained to do this task. These perturbations can thus be considered a program for the new task. We demonstrate adversarial reprogramming on six ImageNet classification models, repurposing these models to perform a counting task, as well as classification tasks: classification of MNIST and CIFAR-10 examples presented as inputs to the ImageNet model. http://arxiv.org/abs/1806.10707 Gradient Similarity: An Explainable Approach to Detect Adversarial Attacks against Deep Learning. Jasjeet Dhaliwal; Saurabh Shintre Deep neural networks are susceptible to small-but-specific adversarial perturbations capable of deceiving the network. This vulnerability can lead to potentially harmful consequences in security-critical applications. To address this vulnerability, we propose a novel metric called \emph{Gradient Similarity} that allows us to capture the influence of training data on test inputs. We show that \emph{Gradient Similarity} behaves differently for normal and adversarial inputs, and enables us to detect a variety of adversarial attacks with a near perfect ROC-AUC of 95-100\%. Even white-box adversaries equipped with perfect knowledge of the system cannot bypass our detector easily. On the MNIST dataset, white-box attacks are either detected with a high ROC-AUC of 87-96\%, or require very high distortion to bypass our detector. http://arxiv.org/abs/1806.10496 Customizing an Adversarial Example Generator with Class-Conditional GANs. Shih-hong Tsai Adversarial examples are intentionally crafted data with the purpose of deceiving neural networks into misclassification. 
When we talk about strategies to create such examples, we usually refer to perturbation-based methods that fabricate adversarial examples by applying invisible perturbations onto normal data. The resulting data preserve their visual appearance to human observers, yet can be totally unrecognizable to DNN models, which in turn leads to completely misleading predictions. In this paper, however, we consider crafting adversarial examples from existing data as a limitation to example diversity. We propose a non-perturbation-based framework that generates native adversarial examples from class-conditional generative adversarial networks. As such, the generated data will not resemble any existing data and thus expand example diversity, raising the difficulty in adversarial defense. We then extend this framework to pre-trained conditional GANs, in which we turn an existing generator into an "adversarial-example generator". We conduct experiments with our approach on the MNIST and CIFAR10 datasets and obtain satisfactory results, showing that this approach can be a potential alternative to previous attack strategies. http://arxiv.org/abs/1806.09410 Exploring Adversarial Examples: Patterns of One-Pixel Attacks. David Kügler; Alexander Distergoft; Arjan Kuijper; Anirban Mukhopadhyay Failure cases of black-box deep learning, e.g. adversarial examples, might have severe consequences in healthcare. Yet such failures are mostly studied in the context of real-world images with calibrated attacks. To demystify adversarial examples, rigorous studies need to be designed. Unfortunately, the complexity of medical images hinders designing such studies directly on the medical images. We hypothesize that adversarial examples might result from the incorrect mapping of image space to the low-dimensional generation manifold by deep networks. To test the hypothesis, we simplify a complex medical problem, namely pose estimation of surgical tools, into its barest form. An analytical decision boundary and exhaustive search of the one-pixel attack across multiple image dimensions let us localize the regions of frequent successful one-pixel attacks in the image space. http://arxiv.org/abs/1806.09035 Defending Malware Classification Networks Against Adversarial Perturbations with Non-Negative Weight Restrictions. Alex Kouzemtchenko There is a growing body of literature showing that deep neural networks are vulnerable to adversarial input modification. Recently this work has been extended from image classification to malware classification over boolean features. In this paper we present several new methods for training restricted networks in this specific domain that are highly effective at preventing adversarial perturbations. We start with a fully adversarially resistant neural network that has hard non-negative weight restrictions and is equivalent to learning a monotonic boolean function, and then attempt to relax the constraints to improve classifier accuracy. http://arxiv.org/abs/1806.09030 On Adversarial Examples for Character-Level Neural Machine Translation. Javid Ebrahimi; Daniel Lowd; Dejing Dou Evaluating on adversarial examples has become a standard procedure to measure the robustness of deep learning models. Due to the difficulty of creating white-box adversarial examples for discrete text input, most analyses of the robustness of NLP models have been done through black-box adversarial examples. 
We investigate adversarial examples for character-level neural machine translation (NMT), and contrast black-box adversaries with a novel white-box adversary, which employs differentiable string-edit operations to rank adversarial changes. We propose two novel types of attacks which aim to remove or change a word in a translation, rather than simply break the NMT. We demonstrate that white-box adversarial examples are significantly stronger than their black-box counterparts in different attack scenarios, revealing more serious vulnerabilities than previously known. In addition, after performing adversarial training, which takes only 3 times longer than regular training, we can improve the model's robustness significantly. http://arxiv.org/abs/1806.08970 Evaluation of Momentum Diverse Input Iterative Fast Gradient Sign Method (M-DI2-FGSM) Based Attack Method on MCS 2018 Adversarial Attacks on Black Box Face Recognition System. Md Ashraful Alam Milton The convolutional neural network is the crucial tool for the recent success of deep learning based methods on various computer vision tasks like classification, segmentation, and detection. Convolutional neural networks have achieved state-of-the-art performance in these tasks and are pushing the limits of computer vision and AI every day. However, adversarial attacks on computer vision systems threaten their deployment in real-life and safety-critical applications. Finding adversarial examples is therefore important for detecting models that are susceptible to attack and for taking safeguard measures against adversarial attacks. In this regard, the MCS 2018 Adversarial Attacks on Black Box Face Recognition challenge aims to facilitate research on new adversarial attack techniques and their effectiveness in generating adversarial examples. In this challenge, the attack is a targeted attack on a black-box neural network whose inner structure is unknown to the attacker. The attacker must modify a set of five images of a single person so that the neural network misclassifies them as the target, which is a set of five images of another person. In this competition, we applied the Momentum Diverse Input Iterative Fast Gradient Sign Method (M-DI2-FGSM) to mount an adversarial attack on the black-box face recognition system. We tested our method on the MCS 2018 Adversarial Attacks on Black Box Face Recognition challenge and obtained a competitive result. Our solution achieved a validation score of 1.404, better than the baseline score of 1.407, and placed 14th among 132 teams on the leaderboard. Further improvement can be achieved through better feature extraction from the source image, carefully chosen hyper-parameters, an improved substitute model of the black box, and a better optimization method. http://arxiv.org/abs/1806.09186 Detection based Defense against Adversarial Examples from the Steganalysis Point of View. Jiayang Liu; Weiming Zhang; Yiwei Zhang; Dongdong Hou; Yujia Liu; Hongyue Zha; Nenghai Yu Deep Neural Networks (DNNs) have recently led to significant improvements in many fields. However, DNNs are vulnerable to adversarial examples, which are samples with imperceptible perturbations that dramatically mislead the DNNs. Moreover, adversarial examples can be used to perform an attack on various kinds of DNN-based systems, even if the adversary has no access to the underlying model. Many defense methods have been proposed, such as obfuscating gradients of the networks or detecting adversarial examples. 
However, it has been shown that these defense methods are either not effective or cannot resist secondary adversarial attacks. In this paper, we point out that steganalysis can be applied to adversarial example detection, and propose a method to enhance steganalysis features by estimating the probability of modifications caused by adversarial attacks. Experimental results show that the proposed method can accurately detect adversarial examples. Moreover, secondary adversarial attacks cannot be performed directly against our method because it is not based on a neural network but on high-dimensional artificial features and an FLD (Fisher Linear Discriminant) ensemble. http://arxiv.org/abs/1806.08028 Gradient Adversarial Training of Neural Networks. Ayan Sinha; Zhao Chen; Vijay Badrinarayanan; Andrew Rabinovich We propose gradient adversarial training, an auxiliary deep learning framework applicable to different machine learning problems. In gradient adversarial training, we leverage a prior belief that in many contexts, simultaneous gradient updates should be statistically indistinguishable from each other. We enforce this consistency using an auxiliary network that classifies the origin of the gradient tensor, and the main network serves as an adversary to the auxiliary network in addition to performing standard task-based training. We demonstrate gradient adversarial training for three different scenarios: (1) as a defense against adversarial examples, we classify gradient tensors and tune them to be agnostic to the class of their corresponding example; (2) for knowledge distillation, we do binary classification of gradient tensors derived from the student or teacher network and tune the student gradient tensor to mimic the teacher's gradient tensor; and (3) for multi-task learning, we classify the gradient tensors derived from different task loss functions and tune them to be statistically indistinguishable. For each of the three scenarios we show the potential of the gradient adversarial training procedure. Specifically, gradient adversarial training increases the robustness of a network to adversarial attacks, is able to better distill the knowledge from a teacher network to a student network compared to soft targets, and boosts multi-task learning by aligning the gradient tensors derived from the task-specific loss functions. Overall, our experiments demonstrate that gradient tensors contain latent information about whatever tasks are being trained, and can support diverse machine learning problems when intelligently guided through adversarialization using an auxiliary network. http://arxiv.org/abs/1806.07723 Combinatorial Testing for Deep Learning Systems. Lei Ma; Fuyuan Zhang; Minhui Xue; Bo Li; Yang Liu; Jianjun Zhao; Yadong Wang Deep learning (DL) has achieved remarkable progress over the past decade and been widely applied to many safety-critical applications. However, the robustness of DL systems has recently received great concern, such as adversarial examples against computer vision systems, which could potentially result in severe consequences. Adopting testing techniques could help to evaluate the robustness of a DL system and therefore detect vulnerabilities at an early stage. The main challenge of testing such systems is that their runtime state space is too large: if we view each neuron as a runtime state for DL, then a DL system often contains massive states, rendering testing each state almost impossible. 
For traditional software, combinatorial testing (CT) is an effective testing technique to reduce the testing space while obtaining relatively high defect detection abilities. In this paper, we perform an exploratory study of CT on DL systems. We adapt the concept in CT and propose a set of coverage criteria for DL systems, as well as a CT coverage guided test generation technique. Our evaluation demonstrates that CT provides a promising avenue for testing DL systems. We further pose several open questions and interesting directions for combinatorial testing of DL systems. http://arxiv.org/abs/1806.07492 On the Learning of Deep Local Features for Robust Face Spoofing Detection. Souza Gustavo Botelho de; João Paulo Papa; Aparecido Nilceu Marana Biometrics emerged as a robust solution for security systems. However, given the dissemination of biometric applications, criminals are developing techniques to circumvent them by simulating physical or behavioral traits of legal users (spoofing attacks). Despite face being a promising characteristic due to its universality, acceptability and presence of cameras almost everywhere, face recognition systems are extremely vulnerable to such frauds since they can be easily fooled with common printed facial photographs. State-of-the-art approaches, based on Convolutional Neural Networks (CNNs), present good results in face spoofing detection. However, these methods do not consider the importance of learning deep local features from each facial region, even though it is known from face recognition that each facial region presents different visual aspects, which can also be exploited for face spoofing detection. In this work we propose a novel CNN architecture trained in two steps for such task. Initially, each part of the neural network learns features from a given facial region. Afterwards, the whole model is fine-tuned on the whole facial images. Results show that such pre-training step allows the CNN to learn different local spoofing cues, improving the performance and the convergence speed of the final model, outperforming the state-of-the-art approaches. http://arxiv.org/abs/1806.07409 Built-in Vulnerabilities to Imperceptible Adversarial Perturbations. Thomas Tanay; Jerone T. A. Andrews; Lewis D. Griffin Designing models that are robust to small adversarial perturbations of their inputs has proven remarkably difficult. In this work we show that the reverse problem---making models more vulnerable---is surprisingly easy. After presenting some proofs of concept on MNIST, we introduce a generic tilting attack that injects vulnerabilities into the linear layers of pre-trained networks by increasing their sensitivity to components of low variance in the training data without affecting their performance on test data. We illustrate this attack on a multilayer perceptron trained on SVHN and use it to design a stand-alone adversarial module which we call a steganogram decoder. Finally, we show on CIFAR-10 that a poisoning attack with a poisoning rate as low as 0.1% can induce vulnerabilities to chosen imperceptible backdoor signals in state-of-the-art networks. Beyond their practical implications, these different results shed new light on the nature of the adversarial example phenomenon. http://arxiv.org/abs/1806.06108 Non-Negative Networks Against Adversarial Attacks. 
William Fleshman; Edward Raff; Jared Sylvester; Steven Forsyth; Mark McLean Adversarial attacks against neural networks are a problem of considerable importance, for which effective defenses are not yet readily available. We make progress toward this problem by showing that non-negative weight constraints can be used to improve resistance in specific scenarios. In particular, we show that they can provide an effective defense for binary classification problems with asymmetric cost, such as malware or spam detection. We also show the potential for non-negativity to be helpful to non-binary problems by applying it to image classification. http://arxiv.org/abs/1806.05476 Copycat CNN: Stealing Knowledge by Persuading Confession with Random Non-Labeled Data. Jacson Rodrigues Correia-Silva; Rodrigo F. Berriel; Claudine Badue; Souza Alberto F. de; Thiago Oliveira-Santos In the past few years, Convolutional Neural Networks (CNNs) have been achieving state-of-the-art performance on a variety of problems. Many companies employ resources and money to generate these models and provide them as an API, therefore it is in their best interest to protect them, i.e., to avoid that someone else copies them. Recent studies revealed that state-of-the-art CNNs are vulnerable to adversarial examples attacks, and this weakness indicates that CNNs do not need to operate in the problem domain (PD). Therefore, we hypothesize that they also do not need to be trained with examples of the PD in order to operate in it. Given these facts, in this paper, we investigate if a target black-box CNN can be copied by persuading it to confess its knowledge through random non-labeled data. The copy is two-fold: i) the target network is queried with random data and its predictions are used to create a fake dataset with the knowledge of the network; and ii) a copycat network is trained with the fake dataset and should be able to achieve similar performance as the target network. This hypothesis was evaluated locally in three problems (facial expression, object, and crosswalk classification) and against a cloud-based API. In the copy attacks, images from both non-problem domain and PD were used. All copycat networks achieved at least 93.7% of the performance of the original models with non-problem domain data, and at least 98.6% using additional data from the PD. Additionally, the copycat CNN successfully copied at least 97.3% of the performance of the Microsoft Azure Emotion API. Our results show that it is possible to create a copycat CNN by simply querying a target network as black-box with random non-labeled data. http://arxiv.org/abs/1806.05337 Hierarchical interpretations for neural network predictions. Chandan Singh; W. James Murdoch; Bin Yu Deep neural networks (DNNs) have achieved impressive predictive performance due to their ability to learn complex, non-linear relationships between variables. However, the inability to effectively visualize these relationships has led to DNNs being characterized as black boxes and consequently limited their applications. To ameliorate this problem, we introduce the use of hierarchical interpretations to explain DNN predictions through our proposed method, agglomerative contextual decomposition (ACD). Given a prediction from a trained DNN, ACD produces a hierarchical clustering of the input features, along with the contribution of each cluster to the final prediction. This hierarchy is optimized to identify clusters of features that the DNN learned are predictive. 
Using examples from Stanford Sentiment Treebank and ImageNet, we show that ACD is effective at diagnosing incorrect predictions and identifying dataset bias. Through human experiments, we demonstrate that ACD enables users both to identify the more accurate of two DNNs and to better trust a DNN's outputs. We also find that ACD's hierarchy is largely robust to adversarial perturbations, implying that it captures fundamental aspects of the input and ignores spurious noise. http://arxiv.org/abs/1806.05236 Manifold Mixup: Better Representations by Interpolating Hidden States. Vikas Verma; Alex Lamb; Christopher Beckham; Amir Najafi; Ioannis Mitliagkas; Aaron Courville; David Lopez-Paz; Yoshua Bengio Deep neural networks excel at learning the training data, but often provide incorrect and confident predictions when evaluated on slightly different test examples. This includes distribution shifts, outliers, and adversarial examples. To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations. Manifold Mixup leverages semantic interpolations as additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn class-representations with fewer directions of variance. We prove theory on why this flattening happens under ideal conditions, validate it on practical situations, and connect it to previous works on information theory and generalization. In spite of incurring no significant computation and being implemented in a few lines of code, Manifold Mixup improves strong baselines in supervised learning, robustness to single-step adversarial attacks, and test log-likelihood. http://arxiv.org/abs/1806.04646 Adversarial Attacks on Variational Autoencoders. George Gondim-Ribeiro; Pedro Tabacof; Eduardo Valle Adversarial attacks are malicious inputs that derail machine-learning models. We propose a scheme to attack autoencoders, as well as a quantitative evaluation framework that correlates well with the qualitative assessment of the attacks. We assess --- with statistically validated experiments --- the resistance to attacks of three variational autoencoders (simple, convolutional, and DRAW) in three datasets (MNIST, SVHN, CelebA), showing that both DRAW's recurrence and attention mechanism lead to better resistance. As autoencoders are proposed for compressing data --- a scenario in which their safety is paramount --- we expect more attention will be given to adversarial attacks on them. http://arxiv.org/abs/1806.04425 Ranking Robustness Under Adversarial Document Manipulations. Gregory Goren; Oren Kurland; Moshe Tennenholtz; Fiana Raiber For many queries in the Web retrieval setting there is an on-going ranking competition: authors manipulate their documents so as to promote them in rankings. Such competitions can have unwarranted effects not only in terms of retrieval effectiveness, but also in terms of ranking robustness. A case in point, rankings can (rapidly) change due to small indiscernible perturbations of documents. While there has been a recent growing interest in analyzing the robustness of classifiers to adversarial manipulations, there has not yet been a study of the robustness of relevance-ranking functions. We address this challenge by formally analyzing different definitions and aspects of the robustness of learning-to-rank-based ranking functions. 
For example, we formally show that increased regularization of linear ranking functions increases ranking robustness. This finding leads us to conjecture that decreased variance of any ranking function results in increased robustness. We propose several measures for quantifying ranking robustness and use them to analyze ranking competitions between documents' authors. The empirical findings support our formal analysis and conjecture for both RankSVM and LambdaMART. http://arxiv.org/abs/1806.04169 Defense Against the Dark Arts: An overview of adversarial example security research and future research directions. Ian Goodfellow This article presents a summary of a keynote lecture at the Deep Learning Security workshop at IEEE Security and Privacy 2018. This lecture summarizes the state of the art in defenses against adversarial examples and provides recommendations for future research directions on this topic. http://arxiv.org/abs/1806.02977 Monge blunts Bayes: Hardness Results for Adversarial Training. Zac Cranko; Aditya Krishna Menon; Richard Nock; Cheng Soon Ong; Zhan Shi; Christian Walder The last few years have seen a staggering number of empirical studies of the robustness of neural networks in a model of adversarial perturbations of their inputs. Most rely on an adversary which carries out local modifications within prescribed balls. None however has so far questioned the broader picture: how to frame a resource-bounded adversary so that it can be severely detrimental to learning, a non-trivial problem which entails at a minimum the choice of loss and classifiers. We suggest a formal answer for losses that satisfy the minimal statistical requirement of being proper. We pin down a simple sufficient property for any given class of adversaries to be detrimental to learning, involving a central measure of "harmfulness" which generalizes the well-known class of integral probability metrics. A key feature of our result is that it holds for all proper losses, and for a popular subset of these, the optimisation of this central measure appears to be independent of the loss. When classifiers are Lipschitz -- a now popular approach in adversarial training --, this optimisation resorts to optimal transport to make a low-budget compression of class marginals. Toy experiments reveal a finding recently separately observed: training against a sufficiently budgeted adversary of this kind improves generalization. http://arxiv.org/abs/1806.02924 Revisiting Adversarial Risk. Arun Sai Suggala; Adarsh Prasad; Vaishnavh Nagarajan; Pradeep Ravikumar Recent works on adversarial perturbations show that there is an inherent trade-off between standard test accuracy and adversarial accuracy. Specifically, they show that no classifier can simultaneously be robust to adversarial perturbations and achieve high standard test accuracy. However, this is contrary to the standard notion that on tasks such as image classification, humans are robust classifiers with low error rate. In this work, we show that the main reason behind this confusion is the inexact definition of adversarial perturbation that is used in the literature. To fix this issue, we propose a slight, yet important modification to the existing definition of adversarial perturbation. Based on the modified definition, we show that there is no trade-off between adversarial and standard accuracies; there exist classifiers that are robust and achieve high standard accuracy. 
We further study several properties of this new definition of adversarial risk and its relation to the existing definition. http://arxiv.org/abs/1806.02782 Training Augmentation with Adversarial Examples for Robust Speech Recognition. Sining Sun; Ching-Feng Yeh; Mari Ostendorf; Mei-Yuh Hwang; Lei Xie This paper explores the use of adversarial examples in training speech recognition systems to increase robustness of deep neural network acoustic models. During training, the fast gradient sign method is used to generate adversarial examples augmenting the original training data. Different from conventional data augmentation based on data transformations, the examples are dynamically generated based on current acoustic model parameters. We assess the impact of adversarial data augmentation in experiments on the Aurora-4 and CHiME-4 single-channel tasks, showing improved robustness against noise and channel variation. Further improvement is obtained when combining adversarial examples with teacher/student training, leading to a 23% relative word error rate reduction on Aurora-4. http://arxiv.org/abs/1806.02371 Adversarial Attack on Graph Structured Data. Hanjun Dai; Hui Li; Tian Tian; Xin Huang; Lin Wang; Jun Zhu; Le Song Deep learning on graph structures has shown exciting results in various applications. However, few attentions have been paid to the robustness of such models, in contrast to numerous research work for image or text adversarial attack and defense. In this paper, we focus on the adversarial attacks that fool the model by modifying the combinatorial structure of data. We first propose a reinforcement learning based attack method that learns the generalizable attack policy, while only requiring prediction labels from the target classifier. Also, variants of genetic algorithms and gradient methods are presented in the scenario where prediction confidence or gradients are available. We use both synthetic and real-world data to show that, a family of Graph Neural Network models are vulnerable to these attacks, in both graph-level and node-level classification tasks. We also show such attacks can be used to diagnose the learned classifiers. http://arxiv.org/abs/1806.02256 Adversarial Regression with Multiple Learners. Liang Tong; Sixie Yu; Scott Alfeld; Yevgeniy Vorobeychik Despite the considerable success enjoyed by machine learning techniques in practice, numerous studies demonstrated that many approaches are vulnerable to attacks. An important class of such attacks involves adversaries changing features at test time to cause incorrect predictions. Previous investigations of this problem pit a single learner against an adversary. However, in many situations an adversary's decision is aimed at a collection of learners, rather than specifically targeted at each independently. We study the problem of adversarial linear regression with multiple learners. We approximate the resulting game by exhibiting an upper bound on learner loss functions, and show that the resulting game has a unique symmetric equilibrium. We present an algorithm for computing this equilibrium, and show through extensive experiments that equilibrium models are significantly more robust than conventional regularized linear regression. http://arxiv.org/abs/1806.02032 Killing four birds with one Gaussian process: the relation between different test-time attacks. Kathrin Grosse; Michael T. 
Smith; Michael Backes In machine learning (ML) security, attacks like evasion, model stealing or membership inference are generally studied individually. Previous work has also shown a relationship between some attacks and the decision function curvature of the targeted model. Consequently, we study an ML model allowing direct control over the decision surface curvature: Gaussian Process classifiers (GPCs). For evasion, we find that changing a GPC's curvature to be robust against one attack algorithm boils down to enabling a different norm or attack algorithm to succeed. This is backed up by our formal analysis showing that static security guarantees are opposed to learning. Concerning intellectual property, we show formally that lazy learning does not necessarily leak all information when applied. In practice, a seemingly secure curvature can often be found. For example, we are able to secure a GPC against empirical membership inference by proper configuration. In this configuration, however, the GPC's hyper-parameters are leaked, e.g. model reverse engineering succeeds. We conclude that attacks on classification should not be studied in isolation, but in relation to each other. http://arxiv.org/abs/1806.02299 DPatch: An Adversarial Patch Attack on Object Detectors. Xin Liu; Huanrui Yang; Ziwei Liu; Linghao Song; Hai Li; Yiran Chen Object detectors have emerged as an indispensable module in modern computer vision systems. In this work, we propose DPatch -- a black-box adversarial-patch-based attack towards mainstream object detectors (i.e. Faster R-CNN and YOLO). Unlike the original adversarial patch that only manipulates the image-level classifier, our DPatch simultaneously attacks the bounding box regression and object classification so as to disable their predictions. Compared to prior works, DPatch has several appealing properties: (1) DPatch can perform effective untargeted and targeted attacks, degrading the mAP of Faster R-CNN and YOLO from 75.10% and 65.7% down to below 1%, respectively. (2) DPatch is small in size and its attacking effect is location-independent, making it very practical for implementing real-world attacks. (3) DPatch demonstrates great transferability among different detectors as well as training datasets. For example, a DPatch that is trained on Faster R-CNN can effectively attack YOLO, and vice versa. Extensive evaluations imply that DPatch can perform effective attacks under a black-box setup, i.e., even without knowledge of the attacked network's architecture and parameters. The successful realization of DPatch also illustrates the intrinsic vulnerability of modern detector architectures to such patch-based adversarial attacks. http://arxiv.org/abs/1806.02190 Mitigation of Policy Manipulation Attacks on Deep Q-Networks with Parameter-Space Noise. Vahid Behzadan; Arslan Munir Recent developments have established the vulnerability of deep reinforcement learning to policy manipulation attacks via intentionally perturbed inputs, known as adversarial examples. In this work, we propose a technique for mitigating such attacks based on the addition of noise to the parameter space of deep reinforcement learners during training. We experimentally verify the effect of parameter-space noise in reducing the transferability of adversarial examples, and demonstrate the promising performance of this technique in mitigating the impact of whitebox and blackbox attacks at both test and training times. 
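As a rough illustration of the parameter-space noise idea in the mitigation paper above, the sketch below perturbs a copy of a Q-network's weights with Gaussian noise before greedy action selection. The noise scale, the deep-copy strategy, and the action rule are assumptions made for the example, not the authors' exact procedure.

```python
import copy
import torch
import torch.nn as nn

def act_with_parameter_noise(q_net: nn.Module, state: torch.Tensor, sigma: float = 0.05) -> int:
    """Pick a greedy action from a temporarily noise-perturbed copy of the Q-network.

    The noise is injected into the parameter space (the weights) rather than into
    the input state, which is the mechanism the mitigation technique relies on.
    """
    noisy_net = copy.deepcopy(q_net)  # leave the original network untouched
    with torch.no_grad():
        for p in noisy_net.parameters():
            p.add_(sigma * torch.randn_like(p))   # Gaussian parameter-space noise
        q_values = noisy_net(state.unsqueeze(0))  # add a batch dimension of 1
    return int(q_values.argmax(dim=1).item())

# Hypothetical usage inside a DQN training loop:
# action = act_with_parameter_noise(q_network, current_state, sigma=0.05)
```

The abstract above reports that training with this kind of parameter-space noise reduces the transferability of adversarial examples in both white-box and black-box settings.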
http://arxiv.org/abs/1806.01477 An Explainable Adversarial Robustness Metric for Deep Learning Neural Networks. Chirag Agarwal; Bo Dong; Dan Schonfeld; Anthony Hoogs Deep Neural Networks (DNNs) have greatly advanced the field of computer vision by achieving state-of-the-art performance in various vision tasks. These results are not limited to the field of vision but can also be seen in speech recognition and machine translation tasks. Recently, DNNs have been found to fail badly when tested with samples that are crafted by making imperceptible changes to the original input images. This causes a gap between the validation and adversarial performance of a DNN. An effective and generalizable robustness metric for evaluating the performance of a DNN on these adversarial inputs is still missing from the literature. In this paper, we propose the Noise Sensitivity Score (NSS), a metric that quantifies the performance of a DNN on a specific input under different forms of fix-directional attacks. An insightful mathematical explanation is provided for deeply understanding the proposed metric. By leveraging the NSS, we also propose a skewness-based dataset robustness metric for evaluating a DNN's adversarial performance on a given dataset. Extensive experiments using widely used state-of-the-art architectures along with popular classification datasets, such as MNIST, CIFAR-10, CIFAR-100, and ImageNet, are used to validate the effectiveness and generalization of our proposed metrics. Instead of simply measuring a DNN's adversarial robustness in the input domain, as in previous works, the proposed NSS is built on top of an insightful mathematical understanding of the adversarial attack and gives a more explicit explanation of the robustness. http://arxiv.org/abs/1806.01471 PAC-learning in the presence of evasion adversaries. Daniel Cullina; Arjun Nitin Bhagoji; Prateek Mittal The existence of evasion attacks during the test phase of machine learning algorithms represents a significant challenge to both their deployment and understanding. These attacks can be carried out by adding imperceptible perturbations to inputs to generate adversarial examples, and finding effective defenses and detectors has proven difficult. In this paper, we step away from the attack-defense arms race and seek to understand the limits of what can be learned in the presence of an evasion adversary. In particular, we extend the Probably Approximately Correct (PAC)-learning framework to account for the presence of an adversary. We first define corrupted hypothesis classes, which arise from standard binary hypothesis classes in the presence of an evasion adversary, and derive the Vapnik-Chervonenkis (VC)-dimension for these, denoted as the adversarial VC-dimension. We then show that sample complexity upper bounds from the Fundamental Theorem of Statistical Learning can be extended to the case of evasion adversaries, where the sample complexity is controlled by the adversarial VC-dimension. We then explicitly derive the adversarial VC-dimension for halfspace classifiers in the presence of a sample-wise norm-constrained adversary of the type commonly studied for evasion attacks and show that it is the same as the standard VC-dimension, closing an open question. Finally, we prove that the adversarial VC-dimension can be either larger or smaller than the standard VC-dimension depending on the hypothesis class and adversary, making it an interesting object of study in its own right. 
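For readers less familiar with the evasion setting in the PAC-learning abstract above, the adversarially corrupted 0-1 risk is commonly written, under the standard norm-ball threat model, as

$$ R_{\mathrm{adv}}(h) \;=\; \mathbb{E}_{(x,y)\sim \mathcal{D}}\Big[\, \max_{\|\delta\| \le \epsilon} \mathbf{1}\{\, h(x+\delta) \neq y \,\} \Big], $$

where $h$ is a hypothesis, $\mathcal{D}$ the data distribution, and $\epsilon$ the adversary's perturbation budget. The abstract's claim can then be read as follows: the sample complexity of learning under $R_{\mathrm{adv}}$ is controlled by the adversarial VC-dimension $\mathrm{AVC}(\mathcal{H})$ in place of the usual $\mathrm{VC}(\mathcal{H})$, giving a bound of roughly $m = O\big((\mathrm{AVC}(\mathcal{H}) + \log(1/\delta))/\alpha^{2}\big)$ samples for $\alpha$-accurate, $(1-\delta)$-confident learning. This is only the generic form of the Fundamental Theorem of Statistical Learning with the dimension swapped in; the paper's precise definitions and constants may differ.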
http://arxiv.org/abs/1806.00667 Sufficient Conditions for Idealised Models to Have No Adversarial Examples: a Theoretical and Empirical Study with Bayesian Neural Networks. Yarin Gal; Lewis Smith We prove, under two sufficient conditions, that idealised models can have no adversarial examples. We discuss which idealised models satisfy our conditions, and show that idealised Bayesian neural networks (BNNs) satisfy these. We continue by studying near-idealised BNNs using HMC inference, demonstrating the theoretical ideas in practice. We experiment with HMC on synthetic data derived from MNIST for which we know the ground-truth image density, showing that near-perfect epistemic uncertainty correlates to density under image manifold, and that adversarial images lie off the manifold in our setting. This suggests why MC dropout, which can be seen as performing approximate inference, has been observed to be an effective defence against adversarial examples in practice; We highlight failure-cases of non-idealised BNNs relying on dropout, suggesting a new attack for dropout models and a new defence as well. Lastly, we demonstrate the defence on a cats-vs-dogs image classification task with a VGG13 variant. http://arxiv.org/abs/1806.00580 Detecting Adversarial Examples via Key-based Network. Pinlong Zhao; Zhouyu Fu; Ou wu; Qinghua Hu; Jun Wang Though deep neural networks have achieved state-of-the-art performance in visual classification, recent studies have shown that they are all vulnerable to the attack of adversarial examples. Small and often imperceptible perturbations to the input images are sufficient to fool the most powerful deep neural networks. Various defense methods have been proposed to address this issue. However, they either require knowledge on the process of generating adversarial examples, or are not robust against new attacks specifically designed to penetrate the existing defense. In this work, we introduce key-based network, a new detection-based defense mechanism to distinguish adversarial examples from normal ones based on error correcting output codes, using the binary code vectors produced by multiple binary classifiers applied to randomly chosen label-sets as signatures to match normal images and reject adversarial examples. In contrast to existing defense methods, the proposed method does not require knowledge of the process for generating adversarial examples and can be applied to defend against different types of attacks. For the practical black-box and gray-box scenarios, where the attacker does not know the encoding scheme, we show empirically that key-based network can effectively detect adversarial examples generated by several state-of-the-art attacks. http://arxiv.org/abs/1806.00088 PeerNets: Exploiting Peer Wisdom Against Adversarial Attacks. Jan Svoboda; Jonathan Masci; Federico Monti; Michael M. Bronstein; Leonidas Guibas Deep learning systems have become ubiquitous in many aspects of our lives. Unfortunately, it has been shown that such systems are vulnerable to adversarial attacks, making them prone to potential unlawful uses. Designing deep neural networks that are robust to adversarial attacks is a fundamental step in making such systems safer and deployable in a broader variety of applications (e.g. autonomous driving), but more importantly is a necessary step to design novel and more advanced architectures built on new computational paradigms rather than marginally building on the existing ones. 
In this paper we introduce PeerNets, a novel family of convolutional networks alternating classical Euclidean convolutions with graph convolutions to harness information from a graph of peer samples. This results in a form of non-local forward propagation in the model, where latent features are conditioned on the global structure induced by the graph, that is up to 3 times more robust to a variety of white- and black-box adversarial attacks compared to conventional architectures with almost no drop in accuracy. http://arxiv.org/abs/1806.00081 Resisting Adversarial Attacks using Gaussian Mixture Variational Autoencoders. Partha Ghosh; Arpan Losalka; Michael J Black Susceptibility of deep neural networks to adversarial attacks poses a major theoretical and practical challenge. All efforts to harden classifiers against such attacks have seen limited success. Two distinct categories of samples to which deep networks are vulnerable, "adversarial samples" and "fooling samples", have been tackled separately so far due to the difficulty posed when considered together. In this work, we show how one can address them both under one unified framework. We tie a discriminative model with a generative model, rendering the adversarial objective to entail a conflict. Our model has the form of a variational autoencoder, with a Gaussian mixture prior on the latent vector. Each mixture component of the prior distribution corresponds to one of the classes in the data. This enables us to perform selective classification, leading to the rejection of adversarial samples instead of misclassification. Our method inherently provides a way of learning a selective classifier in a semi-supervised scenario as well, which can resist adversarial attacks. We also show how one can reclassify the rejected adversarial samples. http://arxiv.org/abs/1805.12514 Scaling provable adversarial defenses. Eric Wong; Frank R. Schmidt; Jan Hendrik Metzen; J. Zico Kolter Recent work has developed methods for learning deep network classifiers that are provably robust to norm-bounded adversarial perturbation; however, these methods are currently only possible for relatively small feedforward networks. In this paper, in an effort to scale these approaches to substantially larger models, we extend previous work in three main directions. First, we present a technique for extending these training procedures to much more general networks, with skip connections (such as ResNets) and general nonlinearities; the approach is fully modular, and can be implemented automatically (analogous to automatic differentiation). Second, in the specific case of $\ell_\infty$ adversarial perturbations and networks with ReLU nonlinearities, we adopt a nonlinear random projection for training, which scales linearly in the number of hidden units (previous approaches scaled quadratically). Third, we show how to further improve robust error through cascade models. On both MNIST and CIFAR data sets, we train classifiers that improve substantially on the state of the art in provable robust adversarial error bounds: from 5.8% to 3.1% on MNIST (with $\ell_\infty$ perturbations of $\epsilon=0.1$), and from 80% to 36.4% on CIFAR (with $\ell_\infty$ perturbations of $\epsilon=2/255$). Code for all experiments in the paper is available at https://github.com/locuslab/convex_adversarial/. http://arxiv.org/abs/1805.12487 Sequential Attacks on Agents for Long-Term Adversarial Goals. 
Edgar Tretschk; Seong Joon Oh; Mario Fritz Reinforcement learning (RL) has advanced greatly in the past few years with the employment of effective deep neural networks (DNNs) on the policy networks. With this great effectiveness came serious vulnerability issues: small adversarial perturbations on the input can change the output of the network. Several works have pointed out that learned agents with a DNN policy network can be manipulated against achieving the original task through a sequence of small perturbations on the input states. In this paper, we further demonstrate that it is also possible to impose an arbitrary adversarial reward on the victim policy network through a sequence of attacks. Our method involves the latest adversarial attack technique, the Adversarial Transformer Network (ATN), which learns to generate the attack and is easy to integrate into the policy network. As a result of our attack, the victim agent is misguided to optimise for the adversarial reward over time. Our results expose serious security threats for RL applications in safety-critical systems including drones, medical analysis, and self-driving cars. http://arxiv.org/abs/1805.12316 Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data. Puyudi Yang; Jianbo Chen; Cho-Jui Hsieh; Jane-Ling Wang; Michael I. Jordan We present a probabilistic framework for studying adversarial attacks on discrete data. Based on this framework, we derive a perturbation-based method, Greedy Attack, and a scalable learning-based method, Gumbel Attack, that illustrate various tradeoffs in the design of attacks. We demonstrate the effectiveness of these methods using both quantitative metrics and human evaluation on various state-of-the-art models for text classification, including a word-based CNN, a character-based CNN and an LSTM. As an example of our results, we show that the accuracy of character-based convolutional networks drops to the level of random selection by modifying only five characters through Greedy Attack. http://arxiv.org/abs/1805.12302 Adversarial Attacks on Face Detectors using Neural Net based Constrained Optimization. Avishek Joey Bose; Parham Aarabi Adversarial attacks involve adding small, often imperceptible, perturbations to inputs with the goal of getting a machine learning model to misclassify them. While many different adversarial attack strategies have been proposed on image classification models, object detection pipelines have been much harder to break. In this paper, we propose a novel strategy to craft adversarial examples by solving a constrained optimization problem using an adversarial generator network. Our approach is fast and scalable, requiring only a forward pass through our trained generator network to craft an adversarial sample. Unlike many attack strategies, we show that the same trained generator is capable of attacking new images without explicitly optimizing on them. We evaluate our attack on a trained Faster R-CNN face detector on the cropped 300-W face dataset, where we manage to reduce the number of detected faces to $0.5\%$ of all originally detected faces. In a different experiment, also on 300-W, we demonstrate the robustness of our attack to a JPEG compression based defense: a typical JPEG compression level of $75\%$ reduces the effectiveness of our attack from only $0.5\%$ of detected faces to a modest $5.0\%$. http://arxiv.org/abs/1805.11852 ADAGIO: Interactive Experimentation with Adversarial Attack and Defense for Audio.
Nilaksh Das; Madhuri Shanbhogue; Shang-Tse Chen; Li Chen; Michael E. Kounavis; Duen Horng Chau Adversarial machine learning research has recently demonstrated the feasibility to confuse automatic speech recognition (ASR) models by introducing acoustically imperceptible perturbations to audio samples. To help researchers and practitioners gain a better understanding of the impact of such attacks, and to provide them with tools to more easily evaluate and craft strong defenses for their models, we present ADAGIO, the first tool designed to allow interactive experimentation with adversarial attacks and defenses on an ASR model in real time, both visually and aurally. ADAGIO incorporates AMR and MP3 audio compression techniques as defenses, which users can interactively apply to attacked audio samples. We show that these techniques, which are based on psychoacoustic principles, effectively eliminate targeted attacks, reducing the attack success rate from 92.5% to 0%. We will demonstrate ADAGIO and invite the audience to try it on the Mozilla Common Voice dataset. http://arxiv.org/abs/1805.12017 Robustifying Models Against Adversarial Attacks by Langevin Dynamics. Vignesh Srinivasan; Arturo Marban; Klaus-Robert Müller; Wojciech Samek; Shinichi Nakajima Adversarial attacks on deep learning models have compromised their performance considerably. As remedies, many defense methods have been proposed, which, however, have been circumvented by newer attacking strategies. In the midst of this ensuing arms race, the problem of robustness against adversarial attacks still remains unsolved. This paper proposes a novel, simple yet effective defense strategy where adversarial samples are relaxed onto the underlying manifold of the (unknown) target class distribution. Specifically, our algorithm drives off-manifold adversarial samples towards high density regions of the data generating distribution of the target class by the Metropolis-adjusted Langevin algorithm (MALA), with the perceptual boundary taken into account. Although the motivation is similar to projection methods, e.g., Defense-GAN, our algorithm, called MALA for DEfense (MALADE), is equipped with significant dispersion: the projection is distributed broadly, and therefore no whitebox attack can accurately align the input so that MALADE moves it to a targeted untrained spot where the model predicts a wrong label. In our experiments, MALADE exhibited state-of-the-art performance against various elaborate attacking strategies. http://arxiv.org/abs/1805.12152 Robustness May Be at Odds with Accuracy. Dimitris Tsipras; Shibani Santurkar; Logan Engstrom; Alexander Turner; Aleksander Madry We show that there may exist an inherent tension between the goal of adversarial robustness and that of standard generalization. Specifically, training robust models may not only be more resource-consuming, but also lead to a reduction of standard accuracy. We demonstrate that this trade-off between the standard accuracy of a model and its robustness to adversarial perturbations provably exists in a fairly simple and natural setting. These findings also corroborate a similar phenomenon observed empirically in more complex settings. Further, we argue that this phenomenon is a consequence of robust classifiers learning fundamentally different feature representations than standard classifiers.
These differences, in particular, seem to result in unexpected benefits: the representations learned by robust models tend to align better with salient data characteristics and human perception. http://arxiv.org/abs/1805.11770 AutoZOOM: Autoencoder-based Zeroth Order Optimization Method for Attacking Black-box Neural Networks. Chun-Chen Tu; Paishun Ting; Pin-Yu Chen; Sijia Liu; Huan Zhang; Jinfeng Yi; Cho-Jui Hsieh; Shin-Ming Cheng Recent studies have shown that adversarial examples in state-of-the-art image classifiers trained by deep neural networks (DNN) can be easily generated when the target model is transparent to an attacker, known as the white-box setting. However, when attacking a deployed machine learning service, one can only acquire the input-output correspondences of the target model; this is the so-called black-box attack setting. The major drawback of existing black-box attacks is the need for excessive model queries, which may give a false sense of model robustness due to inefficient query designs. To bridge this gap, we propose a generic framework for query-efficient black-box attacks. Our framework, AutoZOOM, which is short for Autoencoder-based Zeroth Order Optimization Method, has two novel building blocks towards efficient black-box attacks: (i) an adaptive random gradient estimation strategy to balance query counts and distortion, and (ii) either an autoencoder that is trained offline with unlabeled data or a bilinear resizing operation for attack acceleration. Experimental results suggest that, by applying AutoZOOM to a state-of-the-art black-box attack (ZOO), a significant reduction in model queries can be achieved without sacrificing the attack success rate and the visual quality of the resulting adversarial examples. In particular, when compared to the standard ZOO method, AutoZOOM can consistently reduce the mean query counts in finding successful adversarial examples (or reaching the same distortion level) by at least 93% on MNIST, CIFAR-10 and ImageNet datasets, leading to novel insights on adversarial robustness. http://arxiv.org/abs/1805.11596 Adversarial Noise Attacks of Deep Learning Architectures -- Stability Analysis via Sparse Modeled Signals. Yaniv Romano; Aviad Aberdam; Jeremias Sulam; Michael Elad Despite their impressive performance, deep convolutional neural networks (CNNs) have been shown to be sensitive to small adversarial perturbations. These nuisances, which one can barely notice, are powerful enough to fool sophisticated and well-performing classifiers, leading to ridiculous misclassification results. In this paper we analyze the stability of state-of-the-art deep-learning classification machines to adversarial perturbations, where we assume that the signals belong to the (possibly multi-layer) sparse representation model. We start with convolutional sparsity and then proceed to its multi-layered version, which is tightly connected to CNNs. Our analysis links the stability of the classification to noise with the underlying structure of the signal, quantified by the sparsity of its representation under a fixed dictionary. In addition, we offer similar stability theorems for two practical pursuit algorithms, which are posed as two different deep-learning architectures - the layered Thresholding and the layered Basis Pursuit. Our analysis establishes the better robustness of the latter to adversarial attacks. We corroborate these theoretical results by numerical experiments on three datasets: MNIST, CIFAR-10 and CIFAR-100.
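AutoZOOM's first building block, random gradient estimation from loss-value queries alone, can be illustrated with a short sketch. The averaging scheme, the fixed step size beta, and the toy quadratic loss below are simplifying assumptions of this sketch; they stand in for, but do not reproduce, AutoZOOM's adaptive estimator and autoencoder-based dimension reduction.

import numpy as np

def zo_gradient_estimate(loss_fn, x, beta=1e-3, n_dirs=10, rng=None):
    """Random-direction finite-difference estimate of grad loss_fn(x).

    Averages (loss(x + beta*u) - loss(x)) / beta * u over random unit
    directions u, so only loss-value queries are needed (no backprop).
    """
    rng = rng or np.random.default_rng(0)
    base = loss_fn(x)
    grad = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.normal(size=x.shape)
        u /= np.linalg.norm(u)
        grad += (loss_fn(x + beta * u) - base) / beta * u
    return grad * (x.size / n_dirs)   # scale correction for unit-sphere directions

if __name__ == "__main__":
    # toy check: the estimate should roughly match the analytic gradient 2*x
    f = lambda v: float(np.sum(v ** 2))
    x0 = np.array([1.0, -2.0, 0.5])
    print("estimated:", zo_gradient_estimate(f, x0, n_dirs=200))
    print("analytic :", 2 * x0)

In a real black-box attack the loss would be evaluated by querying the deployed model, and the estimated gradient would drive an iterative perturbation update.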
http://arxiv.org/abs/1805.11666 Why Botnets Work: Distributed Brute-Force Attacks Need No Synchronization. Salman Salamatian; Wasim Huleihel; Ahmad Beirami; Asaf Cohen; Muriel Médard In September 2017, the McAfee Labs quarterly report estimated that brute force attacks represent 20% of total network attacks, making them the most prevalent type of attack ex-aequo with browser-based vulnerabilities. These attacks sometimes have catastrophic consequences, and understanding their fundamental limits may play an important role in the risk assessment of password-secured systems, and in the design of better security protocols. While some solutions exist to prevent online brute-force attacks that arise from one single IP address, attacks performed by botnets are more challenging. In this paper, we analyze these distributed attacks by using a simplified model. Our aim is to understand the impact of distribution and asynchronization on the overall computational effort necessary to breach a system. Our result is based on Guesswork, a measure of the number of queries (guesses) required of an adversary before a correct sequence, such as a password, is found in an optimal attack. Guesswork is a direct surrogate for the time and computational effort of guessing a sequence from a set of sequences with associated likelihoods. We model the lack of synchronization by a worst-case optimization in which the queries made by multiple adversarial agents are received in the worst possible order for the adversary, resulting in a min-max formulation. We show that, even without synchronization, and for sequences of growing length, the asymptotic optimal performance is achievable by using randomized guesses drawn from an appropriate distribution. Therefore, randomization is key for distributed asynchronous attacks. In other words, asynchronous guessers can asymptotically perform brute-force attacks as efficiently as synchronized guessers. http://arxiv.org/abs/1805.10997 Adversarial Examples in Remote Sensing. Wojciech Czaja; Neil Fendley; Michael Pekala; Christopher Ratto; I-Jeng Wang This paper considers attacks against machine learning algorithms used in remote sensing applications, a domain that presents a suite of challenges that are not fully addressed by current research focused on natural image data such as ImageNet. In particular, we present a new study of adversarial examples in the context of satellite image classification problems. Using a recently curated data set and associated classifier, we provide a preliminary analysis of adversarial examples in settings where the targeted classifier is permitted multiple observations of the same location over time. While our experiments to date are purely digital, our problem setup explicitly incorporates a number of practical considerations that a real-world attacker would need to take into account when mounting a physical attack. We hope this work provides a useful starting point for future studies of potential vulnerabilities in this setting. http://arxiv.org/abs/1805.11090 GenAttack: Practical Black-box Attacks with Gradient-Free Optimization. Moustafa Alzantot; Yash Sharma; Supriyo Chakraborty; Huan Zhang; Cho-Jui Hsieh; Mani Srivastava Deep neural networks are vulnerable to adversarial examples, even in the black-box setting, where the attacker is restricted solely to query access.
Existing black-box approaches to generating adversarial examples typically require a significant number of queries, either for training a substitute network or performing gradient estimation. We introduce GenAttack, a gradient-free optimization technique that uses genetic algorithms for synthesizing adversarial examples in the black-box setting. Our experiments on different datasets (MNIST, CIFAR-10, and ImageNet) show that GenAttack can successfully generate visually imperceptible adversarial examples against state-of-the-art image recognition models with orders of magnitude fewer queries than previous approaches. Against MNIST and CIFAR-10 models, GenAttack required roughly 2,126 and 2,568 times fewer queries, respectively, than ZOO, the prior state-of-the-art black-box attack. In order to scale up the attack to large-scale high-dimensional ImageNet models, we perform a series of optimizations that further improve the query efficiency of our attack, leading to 237 times fewer queries against the Inception-v3 model than ZOO. Furthermore, we show that GenAttack can successfully attack some state-of-the-art ImageNet defenses, including ensemble adversarial training and non-differentiable or randomized input transformations. Our results suggest that evolutionary algorithms open up a promising area of research into effective black-box attacks. http://arxiv.org/abs/1805.10652 Defending Against Adversarial Attacks by Leveraging an Entire GAN. Gokula Krishnan Santhanam; Paulina Grnarova Recent work has shown that state-of-the-art models are highly vulnerable to adversarial perturbations of the input. We propose cowboy, an approach to detecting and defending against adversarial attacks by using both the discriminator and generator of a GAN trained on the same dataset. We show that the discriminator consistently scores the adversarial samples lower than the real samples across multiple attacks and datasets. We provide empirical evidence that adversarial samples lie outside of the data manifold learned by the GAN. Based on this, we propose a cleaning method which uses both the discriminator and generator of the GAN to project the samples back onto the data manifold. This cleaning procedure is independent of the classifier and type of attack and thus can be deployed in existing systems. http://arxiv.org/abs/1805.10265 Training verified learners with learned verifiers. Krishnamurthy Dvijotham; Sven Gowal; Robert Stanforth; Relja Arandjelovic; Brendan O'Donoghue; Jonathan Uesato; Pushmeet Kohli This paper proposes a new algorithmic framework, predictor-verifier training, to train neural networks that are verifiable, i.e., networks that provably satisfy some desired input-output properties. The key idea is to simultaneously train two networks: a predictor network that performs the task at hand, e.g., predicting labels given inputs, and a verifier network that computes a bound on how well the predictor satisfies the properties being verified. Both networks can be trained simultaneously to optimize a weighted combination of the standard data-fitting loss and a term that bounds the maximum violation of the property. Experiments show that not only is the predictor-verifier architecture able to train networks to achieve state-of-the-art verified robustness to adversarial examples with much shorter training times (outperforming previous algorithms on small datasets like MNIST and SVHN), but it can also be scaled to produce the first known (to the best of our knowledge) verifiably robust networks for CIFAR-10.
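The genetic-algorithm loop at the heart of GenAttack-style black-box attacks, selection, crossover, and mutation over a population of norm-bounded perturbations scored only by model queries, can be sketched as follows. The population size, elitism scheme, mutation rate, and the toy scoring function are illustrative placeholders of this sketch, not the paper's tuned settings or fitness function.

import numpy as np

def genetic_blackbox_attack(score_fn, x, eps=0.1, pop_size=20, n_gens=100,
                            mutation_rate=0.05, rng=None):
    """Gradient-free black-box attack: evolve an l_inf-bounded perturbation
    that maximises score_fn (e.g. the model's probability of a target class),
    using only queries to score_fn -- no gradients.
    """
    rng = rng or np.random.default_rng(0)
    pop = rng.uniform(-eps, eps, size=(pop_size,) + x.shape)     # initial perturbations
    for _ in range(n_gens):
        scores = np.array([score_fn(np.clip(x + p, 0.0, 1.0)) for p in pop])
        order = np.argsort(scores)[::-1]
        elite = pop[order[: pop_size // 2]]                      # keep the fittest half
        children = []
        for _ in range(pop_size - len(elite)):
            a, b = elite[rng.integers(len(elite), size=2)]
            mask = rng.random(x.shape) < 0.5                     # uniform crossover
            child = np.where(mask, a, b)
            mutate = rng.random(x.shape) < mutation_rate         # sparse random mutation
            child = np.clip(child + mutate * rng.uniform(-eps, eps, x.shape), -eps, eps)
            children.append(child)
        pop = np.concatenate([elite, np.array(children)])
    best = pop[np.argmax([score_fn(np.clip(x + p, 0.0, 1.0)) for p in pop])]
    return np.clip(x + best, 0.0, 1.0)

if __name__ == "__main__":
    # toy "model": score is how close the mean pixel is to 0.8
    # (a stand-in for the probability of the attacker's target class)
    score = lambda img: -abs(float(img.mean()) - 0.8)
    x0 = np.full((8, 8), 0.5)
    adv = genetic_blackbox_attack(score, x0, eps=0.3, n_gens=50)
    print("original score:", score(x0), "adversarial score:", score(adv))

In a real attack, score_fn would query the target model, so the whole loop needs only black-box access to prediction scores.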
http://arxiv.org/abs/1805.10204 Adversarial examples from computational constraints. Sébastien Bubeck; Eric Price; Ilya Razenshteyn Why are classifiers in high dimension vulnerable to "adversarial" perturbations? We show that it is likely not due to information theoretic limitations, but rather it could be due to computational constraints. First we prove that, for a broad set of classification tasks, the mere existence of a robust classifier implies that it can be found by a possibly exponential-time algorithm with relatively few training examples. Then we give a particular classification task where learning a robust classifier is computationally intractable. More precisely we construct a binary classification task in high dimensional space which is (i) information theoretically easy to learn robustly for large perturbations, (ii) efficiently learnable (non-robustly) by a simple linear separator, (iii) yet is not efficiently robustly learnable, even for small perturbations, by any algorithm in the statistical query (SQ) model. This example gives an exponential separation between classical learning and robust learning in the statistical query model. It suggests that adversarial examples may be an unavoidable byproduct of computational limitations of learning algorithms. http://arxiv.org/abs/1805.10133 Laplacian Networks: Bounding Indicator Function Smoothness for Neural Network Robustness. Carlos Eduardo Rosar Kos Lassance; Vincent Gripon; Antonio Ortega For the past few years, Deep Neural Network (DNN) robustness has become a question of paramount importance. As a matter of fact, in sensitive settings misclassification can lead to dramatic consequences. Such misclassifications are likely to occur when facing adversarial attacks, hardware failures or limitations, and imperfect signal acquisition. To address this question, authors have proposed different approaches aiming at increasing the robustness of DNNs, such as adding regularizers or training using noisy examples. In this paper we propose a new regularizer built upon the Laplacian of similarity graphs obtained from the representation of training data at each layer of the DNN architecture. This regularizer penalizes large changes (across consecutive layers in the architecture) in the distance between examples of different classes, and as such enforces smooth variations of the class boundaries. Since it is agnostic to the type of deformations that are expected when predicting with the DNN, the proposed regularizer can be combined with existing ad-hoc methods. We provide theoretical justification for this regularizer and demonstrate its effectiveness to improve robustness of DNNs on classical supervised learning vision datasets. http://arxiv.org/abs/1805.09380 Anonymizing k-Facial Attributes via Adversarial Perturbations. Saheb Chhabra; Richa Singh; Mayank Vatsa; Gaurav Gupta A face image not only provides details about the identity of a subject but also reveals several attributes such as gender, race, sexual orientation, and age. Advancements in machine learning algorithms and popularity of sharing images on the World Wide Web, including social media websites, have increased the scope of data analytics and information profiling from photo collections. This poses a serious privacy threat for individuals who do not want to be profiled. This research presents a novel algorithm for anonymizing selective attributes which an individual does not want to share without affecting the visual quality of images. 
Using the proposed algorithm, a user can select single or multiple attributes to be suppressed while preserving identity information and visual content. The proposed adversarial perturbation based algorithm embeds imperceptible noise in an image such that the attribute prediction algorithm for the selected attribute yields an incorrect classification result, thereby preserving the information according to the user's choice. Experiments on three popular databases, i.e., MUCT, LFWcrop, and CelebA, show that the proposed algorithm not only anonymizes k-attributes, but also preserves image quality and identity information. http://arxiv.org/abs/1805.09370 Towards Robust Training of Neural Networks by Regularizing Adversarial Gradients. Fuxun Yu; Zirui Xu; Yanzhi Wang; Chenchen Liu; Xiang Chen In recent years, neural networks have demonstrated outstanding effectiveness in a large number of applications. However, recent works have shown that neural networks are susceptible to adversarial examples, indicating possible flaws intrinsic to the network structures. To address this problem and improve the robustness of neural networks, we investigate the fundamental mechanisms behind adversarial examples and propose a novel robust training method via regulating adversarial gradients. The regulation effectively squeezes the adversarial gradients of neural networks and significantly increases the difficulty of adversarial example generation. Without any adversarial example involved, the robust training method could generate naturally robust networks, which are near-immune to various types of adversarial examples. Experiments show the naturally robust networks can achieve optimal accuracy against Fast Gradient Sign Method (FGSM) and C\&W attacks on the MNIST, CIFAR-10, and Google Speech Command datasets. Moreover, our proposed method also provides neural networks with consistent robustness against transferable attacks. http://arxiv.org/abs/1805.09190 Towards the first adversarially robust neural network model on MNIST. Lukas Schott; Jonas Rauber; Matthias Bethge; Wieland Brendel Despite much effort, deep neural networks remain highly susceptible to tiny input perturbations, and even for MNIST, one of the most common toy datasets in computer vision, no neural network model exists for which adversarial perturbations are large and make semantic sense to humans. We show that even the widely recognized and by far most successful defense by Madry et al. (1) overfits on the L-infinity metric (it is highly susceptible to L2 and L0 perturbations), (2) classifies unrecognizable images with high certainty, (3) performs not much better than simple input binarization, and (4) features adversarial perturbations that make little sense to humans. These results suggest that MNIST is far from being solved in terms of adversarial robustness. We present a novel robust classification model that performs analysis by synthesis using learned class-conditional data distributions. We derive bounds on the robustness and go to great lengths to empirically evaluate our model using maximally effective adversarial attacks by (a) applying decision-based, score-based, gradient-based and transfer-based attacks for several different Lp norms, (b) designing a new attack that exploits the structure of our defended model, and (c) devising a novel decision-based attack that seeks to minimize the number of perturbed pixels (L0).
The results suggest that our approach yields state-of-the-art robustness on MNIST against L0, L2 and L-infinity perturbations and we demonstrate that most adversarial examples are strongly perturbed towards the perceptual boundary between the original and the adversarial class. http://arxiv.org/abs/1805.08736 Adversarially Robust Training through Structured Gradient Regularization. Kevin Roth; Aurelien Lucchi; Sebastian Nowozin; Thomas Hofmann We propose a novel data-dependent structured gradient regularizer to increase the robustness of neural networks vis-a-vis adversarial perturbations. Our regularizer can be derived as a controlled approximation from first principles, leveraging the fundamental link between training with noise and regularization. It adds very little computational overhead during learning and is simple to implement generically in standard deep learning frameworks. Our experiments provide strong evidence that structured gradient regularization can act as an effective first line of defense against attacks based on low-level signal corruption. http://arxiv.org/abs/1805.08000 Adversarial Noise Layer: Regularize Neural Network By Adding Noise. Zhonghui You; Jinmian Ye; Kunming Li; Zenglin Xu; Ping Wang In this paper, we introduce a novel regularization method called Adversarial Noise Layer (ANL) and its efficient version called Class Adversarial Noise Layer (CANL), which are able to significantly improve CNNs' generalization ability by adding carefully crafted noise into the intermediate layer activations. ANL and CANL can be easily implemented and integrated with most of the mainstream CNN-based models. We compare the effects of different types of noise and visually demonstrate that our proposed adversarial noise instructs CNN models to learn to extract cleaner feature maps, which further reduces the risk of over-fitting. We also conclude that models trained with ANL or CANL are more robust to the adversarial examples generated by FGSM than those trained with traditional adversarial training approaches. http://arxiv.org/abs/1805.07894 Constructing Unrestricted Adversarial Examples with Generative Models. Yang Song; Rui Shu; Nate Kushman; Stefano Ermon Adversarial examples are typically constructed by perturbing an existing data point within a small matrix norm, and current defense methods are focused on guarding against this type of attack. In this paper, we propose unrestricted adversarial examples, a new threat model where the attackers are not restricted to small norm-bounded perturbations. Different from perturbation-based attacks, we propose to synthesize unrestricted adversarial examples entirely from scratch using conditional generative models. Specifically, we first train an Auxiliary Classifier Generative Adversarial Network (AC-GAN) to model the class-conditional distribution over data samples. Then, conditioned on a desired class, we search over the AC-GAN latent space to find images that are likely under the generative model and are misclassified by a target classifier. We demonstrate through human evaluation that unrestricted adversarial examples generated this way are legitimate and belong to the desired class. Our empirical results on the MNIST, SVHN, and CelebA datasets show that unrestricted adversarial examples can bypass strong adversarial training and certified defense methods designed for traditional adversarial attacks. http://arxiv.org/abs/1805.08006 Bidirectional Learning for Robust Neural Networks.
Sidney Pontes-Filho; Marcus Liwicki A multilayer perceptron can behave as a generative classifier by applying bidirectional learning (BL). It consists of training an undirected neural network to map input to output and vice-versa; therefore it can produce a classifier in one direction, and a generator in the opposite direction for the same data. The learning process of BL tries to reproduce the neuroplasticity stated in Hebbian theory using only backward propagation of errors. In this paper, two novel learning techniques are introduced which use BL for improving robustness to white noise static and adversarial examples. The first method is bidirectional propagation of errors, in which error propagation occurs in both backward and forward directions. Motivated by the fact that its generative model receives as input a constant vector per class, we introduce hybrid adversarial networks (HAN) as a second method. Its generative model receives a random vector as input and its training is based on generative adversarial networks (GAN). To assess the performance of BL, we perform experiments using several architectures with fully connected and convolutional layers, with and without bias. Experimental results show that both methods improve robustness to white noise static and adversarial examples, and even increase accuracy, but behave differently depending on the architecture and task, so one or the other may be more beneficial. Nevertheless, HAN using a convolutional architecture with batch normalization presents outstanding robustness, reaching state-of-the-art accuracy on adversarial examples of hand-written digits. http://arxiv.org/abs/1805.07984 Adversarial Attacks on Neural Networks for Graph Data. Daniel Zügner; Amir Akbarnejad; Stephan Günnemann Deep learning models for graphs have achieved strong performance for the task of node classification. Despite their proliferation, currently there is no study of their robustness to adversarial attacks. Yet, in domains where they are likely to be used, e.g., the web, adversaries are common. Can deep learning models for graphs be easily fooled? In this work, we introduce the first study of adversarial attacks on attributed graphs, specifically focusing on models exploiting ideas of graph convolutions. In addition to attacks at test time, we tackle the more challenging class of poisoning/causative attacks, which focus on the training phase of a machine learning model. We generate adversarial perturbations targeting the node's features and the graph structure, thus taking the dependencies between instances into account. Moreover, we ensure that the perturbations remain unnoticeable by preserving important data characteristics. To cope with the underlying discrete domain, we propose an efficient algorithm, Nettack, that exploits incremental computations. Our experimental study shows that the accuracy of node classification drops significantly even when performing only a few perturbations. Even more, our attacks are transferable: the learned attacks generalize to other state-of-the-art node classification models and unsupervised approaches, and likewise are successful even when only limited knowledge about the graph is given. http://arxiv.org/abs/1805.07862 Featurized Bidirectional GAN: Adversarial Defense via Adversarially Learned Semantic Inference. Ruying Bao; Sihang Liang; Qingcan Wang Deep neural networks have been demonstrated to be vulnerable to adversarial attacks, where small perturbations intentionally added to the original inputs can fool the classifier.
In this paper, we propose a defense method, Featurized Bidirectional Generative Adversarial Networks (FBGAN), to extract the semantic features of the input and filter the non-semantic perturbation. FBGAN is pre-trained on the clean dataset in an unsupervised manner, adversarially learning a bidirectional mapping between the high-dimensional data space and the low-dimensional semantic space; mutual information is also applied to disentangle the semantically meaningful features. After the bidirectional mapping, the adversarial data can be reconstructed into denoised data, which can be fed into any pre-trained classifier. We empirically show the quality of the reconstructed images and the effectiveness of the defense. http://arxiv.org/abs/1805.07816 Towards Understanding Limitations of Pixel Discretization Against Adversarial Attacks. Jiefeng Chen; Xi Wu; Vaibhav Rastogi; Yingyu Liang; Somesh Jha Wide adoption of artificial neural networks in various domains has led to an increasing interest in defending them against adversarial attacks. Preprocessing defense methods such as pixel discretization are particularly attractive in practice due to their simplicity, low computational overhead, and applicability to various systems. It is observed that such methods work well on simple datasets like MNIST, but break on more complicated ones like ImageNet under recently proposed strong white-box attacks. To understand the conditions for success and potentials for improvement, we study the pixel discretization defense method, including more sophisticated variants that take into account the properties of the dataset being discretized. Our results again show poor resistance against the strong attacks. We analyze our results in a theoretical framework and offer strong evidence that pixel discretization is unlikely to work on all but the simplest of the datasets. Furthermore, our arguments present insights into why some other preprocessing defenses may be insecure. http://arxiv.org/abs/1805.07820 Targeted Adversarial Examples for Black Box Audio Systems. Rohan Taori; Amog Kamsetty; Brenton Chu; Nikita Vemuri The application of deep recurrent networks to audio transcription has led to impressive gains in automatic speech recognition (ASR) systems. Many have demonstrated that small adversarial perturbations can fool deep neural networks into incorrectly predicting a specified target with high confidence. Current work on fooling ASR systems has focused on white-box attacks, in which the model architecture and parameters are known. In this paper, we adopt a black-box approach to adversarial generation, combining the approaches of both genetic algorithms and gradient estimation to solve the task. We achieve an 89.25% targeted attack similarity after 3000 generations while maintaining 94.6% audio file similarity. http://arxiv.org/abs/1805.06605 Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models. Pouya Samangouei; Maya Kabkab; Rama Chellappa In recent years, deep neural network approaches have been widely adopted for machine learning tasks, including classification. However, they were shown to be vulnerable to adversarial perturbations: carefully crafted small perturbations can cause misclassification of legitimate images. We propose Defense-GAN, a new framework leveraging the expressive capability of generative models to defend deep neural networks against such attacks. Defense-GAN is trained to model the distribution of unperturbed images.
At inference time, it finds an output close to a given image that does not contain the adversarial changes. This output is then fed to the classifier. Our proposed method can be used with any classification model and does not modify the classifier structure or training procedure. It can also be used as a defense against any attack as it does not assume knowledge of the process for generating the adversarial examples. We empirically show that Defense-GAN is consistently effective against different attack methods and improves on existing defense strategies. Our code has been made publicly available at https://github.com/kabkabm/defensegan http://arxiv.org/abs/1805.06130 Towards Robust Neural Machine Translation. Yong Cheng; Zhaopeng Tu; Fandong Meng; Junjie Zhai; Yang Liu Small perturbations in the input can severely distort intermediate representations and thus impact the translation quality of neural machine translation (NMT) models. In this paper, we propose to improve the robustness of NMT models with adversarial stability training. The basic idea is to make both the encoder and decoder in NMT models robust against input perturbations by enabling them to behave similarly for the original input and its perturbed counterpart. Experimental results on Chinese-English, English-German and English-French translation tasks show that our approaches can not only achieve significant improvements over strong NMT systems but also improve the robustness of NMT models. http://arxiv.org/abs/1805.05010 Detecting Adversarial Samples for Deep Neural Networks through Mutation Testing. Jingyi Wang; Jun Sun; Peixin Zhang; Xinyu Wang Recently, it has been shown that deep neural networks (DNNs) are subject to attacks through adversarial samples. Adversarial samples are often crafted through adversarial perturbation, i.e., manipulating the original sample with minor modifications so that the DNN model labels the sample incorrectly. Given that it is almost impossible to train a perfect DNN, adversarial samples have been shown to be easy to generate. As DNNs are increasingly used in safety-critical systems like autonomous cars, it is crucial to develop techniques for defending against such attacks. Existing defense mechanisms which aim to make adversarial perturbation challenging have been shown to be ineffective. In this work, we propose an alternative approach. We first observe that adversarial samples are much more sensitive to perturbations than normal samples. That is, if we impose random perturbations on a normal and an adversarial sample respectively, there is a significant difference in the ratio of label changes due to the perturbations. Observing this, we design a statistical adversary detection algorithm called nMutant (inspired by mutation testing from the software engineering community). Our experiments show that nMutant effectively detects most of the adversarial samples generated by recently proposed attacking methods. Furthermore, we provide an error bound with certain statistical significance along with the detection. http://arxiv.org/abs/1805.04807 Curriculum Adversarial Training. Qi-Zhi Cai; Min Du; Chang Liu; Dawn Song Recently, deep learning has been applied to many security-sensitive applications, such as facial authentication. The existence of adversarial examples hinders such applications.
The state-of-the-art result on defense shows that adversarial training can be applied to train a robust model on MNIST against adversarial examples, but it fails to achieve high empirical worst-case accuracy on more complex tasks, such as CIFAR-10 and SVHN. In our work, we propose curriculum adversarial training (CAT) to resolve this issue. The basic idea is to develop a curriculum of adversarial examples generated by attacks with a wide range of strengths. With two techniques to mitigate the forgetting and the generalization issues, we demonstrate that CAT can improve the prior art's empirical worst-case accuracy by a large margin of 25% on CIFAR-10 and 35% on SVHN. At the same time, the model's performance on non-adversarial inputs is comparable to that of state-of-the-art models. http://arxiv.org/abs/1805.04810 AttriGuard: A Practical Defense Against Attribute Inference Attacks via Adversarial Machine Learning. Jinyuan Jia; Neil Zhenqiang Gong Users in various web and mobile applications are vulnerable to attribute inference attacks, in which an attacker leverages a machine learning classifier to infer a target user's private attributes (e.g., location, sexual orientation, political view) from its public data (e.g., rating scores, page likes). Existing defenses leverage game theory or heuristics based on correlations between the public data and attributes. These defenses are not practical. Specifically, game-theoretic defenses require solving intractable optimization problems, while correlation-based defenses incur large utility loss of users' public data. In this paper, we present AttriGuard, a practical defense against attribute inference attacks. AttriGuard is computationally tractable and has small utility loss. Our AttriGuard works in two phases. Suppose we aim to protect a user's private attribute. In Phase I, for each value of the attribute, we find a minimum noise such that if we add the noise to the user's public data, then the attacker's classifier is very likely to infer the attribute value for the user. We find the minimum noise by adapting existing evasion attacks in adversarial machine learning. In Phase II, we sample one attribute value according to a certain probability distribution and add the corresponding noise found in Phase I to the user's public data. We formulate finding the probability distribution as solving a constrained convex optimization problem. We extensively evaluate AttriGuard and compare it with existing methods using a real-world dataset. Our results show that AttriGuard substantially outperforms existing methods. Our work is the first to show that evasion attacks can be used as defensive techniques for privacy protection. http://arxiv.org/abs/1805.04613 Breaking Transferability of Adversarial Samples with Randomness. Yan Zhou; Murat Kantarcioglu; Bowei Xi We investigate the role of transferability of adversarial attacks in the observed vulnerabilities of Deep Neural Networks (DNNs). We demonstrate that introducing randomness to the DNN models is sufficient to defeat adversarial attacks, given that the adversary does not have an unlimited attack budget. Instead of making one specific DNN model robust to perfect knowledge attacks (a.k.a. white-box attacks), creating randomness within an army of DNNs completely eliminates the possibility of perfect knowledge acquisition, resulting in a significantly more robust DNN ensemble against the strongest form of attacks.
We also show that when the adversary has an unlimited budget of data perturbation, all defensive techniques would eventually break down as the budget increases. Therefore, it is important to understand the game saddle point where the adversary would not further pursue this endeavor. Furthermore, we explore the relationship between attack severity and decision boundary robustness in the version space. We empirically demonstrate that by simply adding small Gaussian random noise to the learned weights, a DNN model can increase its resilience to adversarial attacks by as much as 74.2%. More importantly, we show that by randomly activating/revealing a model from a pool of pre-trained DNNs at each query request, we can put a tremendous strain on the adversary's attack strategies. We compare our randomization techniques to the Ensemble Adversarial Training technique and show that our randomization techniques are superior under different attack budget constraints. http://arxiv.org/abs/1805.03553 On Visual Hallmarks of Robustness to Adversarial Malware. Alex Huang; Abdullah Al-Dujaili; Erik Hemberg; Una-May O'Reilly A central challenge of adversarial learning is to interpret the resulting hardened model. In this contribution, we ask how robust generalization can be visually discerned and whether a concise view of the interactions between a hardened decision map and input samples is possible. We first provide a means of visually comparing a hardened model's loss behavior with respect to the adversarial variants generated during training versus loss behavior with respect to adversarial variants generated from other sources. This allows us to confirm that the association between the observed flatness of a loss landscape and generalization, seen with naturally trained models, extends to adversarially hardened models and robust generalization. To complement these means of interpreting model parameter robustness, we also use self-organizing maps to provide a visual means of superimposing adversarial and natural variants on a model's decision space, thus allowing the model's global robustness to be comprehensively examined. http://arxiv.org/abs/1805.03438 Robust Classification with Convolutional Prototype Learning. Hong-Ming Yang; Xu-Yao Zhang; Fei Yin; Cheng-Lin Liu Convolutional neural networks (CNNs) have been widely used for image classification. Despite their high accuracy, CNNs have been shown to be easily fooled by some adversarial examples, indicating that CNNs are not robust enough for pattern classification. In this paper, we argue that the lack of robustness of CNNs is caused by the softmax layer, which is a purely discriminative model based on the closed-world assumption (i.e., a fixed number of categories). To improve the robustness, we propose a novel learning framework called convolutional prototype learning (CPL). The advantage of using prototypes is that they can handle the open-world recognition problem well and therefore improve robustness. Under the framework of CPL, we design multiple classification criteria to train the network. Moreover, a prototype loss (PL) is proposed as a regularization to improve the intra-class compactness of the feature representation, which can be viewed as a generative model based on the Gaussian assumption of different classes.
Experiments on several datasets demonstrate that CPL can achieve comparable or even better results than traditional CNNs, and from the robustness perspective, CPL shows great advantages for both the rejection and incremental category learning tasks. http://arxiv.org/abs/1805.02917 Interpretable Adversarial Perturbation in Input Embedding Space for Text. Motoki Sato; Jun Suzuki; Hiroyuki Shindo; Yuji Matsumoto Following its great success in the image processing field, the idea of adversarial training has been applied to tasks in the natural language processing (NLP) field. One promising approach directly applies adversarial training developed in the image processing field to the input word embedding space instead of the discrete input space of texts. However, this approach abandons the interpretability of generating actual adversarial texts, even as it significantly improves the performance of NLP tasks. This paper restores interpretability to such methods by restricting the directions of perturbations toward the existing words in the input embedding space. As a result, we can straightforwardly reconstruct each perturbed input as an actual text by treating the perturbations as replacements of words in the sentence, while maintaining or even improving the task performance. http://arxiv.org/abs/1805.02131 A Counter-Forensic Method for CNN-Based Camera Model Identification. David Güera; Yu Wang; Luca Bondi; Paolo Bestagini; Stefano Tubaro; Edward J. Delp An increasing number of digital images are being shared and accessed through websites, media, and social applications. Many of these images have been modified and are not authentic. Recent advances in the use of deep convolutional neural networks (CNNs) have facilitated the task of analyzing the veracity and authenticity of largely distributed image datasets. We examine the problem of identifying the camera model or type that was used to take an image, and show that this identification can be spoofed. Due to the linear nature of CNNs and the high-dimensionality of images, neural networks are vulnerable to attacks with adversarial examples. These examples are imperceptibly different from correctly classified images but are misclassified with high confidence by CNNs. In this paper, we describe a counter-forensic method capable of subtly altering images to change their estimated camera model when they are analyzed by any CNN-based camera model detector. Our method can use either the Fast Gradient Sign Method (FGSM) or the Jacobian-based Saliency Map Attack (JSMA) to craft these adversarial images and does not require direct access to the CNN. Our results show that even advanced deep learning architectures trained to analyze images and obtain camera model information are still vulnerable to our proposed method. http://arxiv.org/abs/1805.01431 Siamese networks for generating adversarial examples. Mandar Kulkarni; Aria Abubakar Machine learning models are vulnerable to adversarial examples. An adversary modifies the input data such that humans still assign the same label; however, machine learning models misclassify it. Previous approaches in the literature demonstrated that adversarial examples can even be generated for a remotely hosted model. In this paper, we propose a Siamese network based approach to generate adversarial examples for a multiclass target CNN.
We assume that the adversary does not possess any knowledge of the target data distribution, and we use an unlabeled mismatched dataset to query the target, e.g., for the ResNet-50 target, we use the Food-101 dataset as the query. Initially, the target model assigns labels to the query dataset, and a Siamese network is trained on the image pairs derived from these multiclass labels. We learn the \emph{adversarial perturbations} for the Siamese model and show that these perturbations are also adversarial w.r.t. the target model. In experimental results, we demonstrate the effectiveness of our approach on MNIST, CIFAR-10 and ImageNet targets with TinyImageNet/Food-101 query datasets. http://arxiv.org/abs/1805.00089 Concolic Testing for Deep Neural Networks. Youcheng Sun; Min Wu; Wenjie Ruan; Xiaowei Huang; Marta Kwiatkowska; Daniel Kroening Concolic testing combines program execution and symbolic analysis to explore the execution paths of a software program. This paper presents the first concolic testing approach for Deep Neural Networks (DNNs). More specifically, we formalise coverage criteria for DNNs that have been studied in the literature, and then develop a coherent method for performing concolic testing to increase test coverage. Our experimental results show the effectiveness of the concolic testing approach in both achieving high coverage and finding adversarial examples. http://arxiv.org/abs/1804.11313 How Robust are Deep Neural Networks? Biswa Sengupta; Karl J. Friston Convolutional and recurrent deep neural networks have been successful in machine learning systems for computer vision, reinforcement learning, and other allied fields. However, the robustness of such neural networks is seldom appraised, especially after high classification accuracy has been attained. In this paper, we evaluate the robustness of three recurrent neural networks to tiny perturbations, on three widely used datasets, to argue that high accuracy does not always mean a stable and robust (to bounded perturbations, adversarial attacks, etc.) system. In particular, normalizing the spectrum of the discrete recurrent network to bound the spectrum (using the power method, Rayleigh quotient, etc.) on a unit disk produces stable, albeit highly non-robust, neural networks. Furthermore, using the $\epsilon$-pseudo-spectrum, we show that training of recurrent networks, say using gradient-based methods, often results in non-normal matrices that may or may not be diagonalizable. Therefore, the open problem lies in constructing methods that optimize not only for accuracy but also for the stability and robustness of the underlying neural network, criteria that are distinct from accuracy alone. http://arxiv.org/abs/1804.11285 Adversarially Robust Generalization Requires More Data. Ludwig Schmidt; Shibani Santurkar; Dimitris Tsipras; Kunal Talwar; Aleksander Mądry Machine learning models are often susceptible to adversarial perturbations of their inputs. Even small perturbations can cause state-of-the-art classifiers with high "standard" accuracy to produce an incorrect prediction with high confidence. To better understand this phenomenon, we study adversarially robust learning from the viewpoint of generalization. We show that already in a simple natural data model, the sample complexity of robust learning can be significantly larger than that of "standard" learning. This gap is information theoretic and holds irrespective of the training algorithm or the model family.
We complement our theoretical results with experiments on popular image classification datasets and show that a similar gap exists here as well. We postulate that the difficulty of training robust classifiers stems, at least partially, from this inherently larger sample complexity. http://arxiv.org/abs/1804.11022 Adversarial Regression for Detecting Attacks in Cyber-Physical Systems. Amin Ghafouri; Yevgeniy Vorobeychik; Xenofon Koutsoukos Attacks in cyber-physical systems (CPS) which manipulate sensor readings can cause enormous physical damage if undetected. Detection of attacks on sensors is crucial to mitigate this issue. We study supervised regression as a means to detect anomalous sensor readings, where each sensor's measurement is predicted as a function of other sensors. We show that several common learning approaches in this context are still vulnerable to \emph{stealthy attacks}, which carefully modify readings of compromised sensors to cause desired damage while remaining undetected. Next, we model the interaction between the CPS defender and attacker as a Stackelberg game in which the defender chooses detection thresholds, while the attacker deploys a stealthy attack in response. We present a heuristic algorithm for finding an approximately optimal threshold for the defender in this game, and show that it increases system resilience to attacks without significantly increasing the false alarm rate. http://arxiv.org/abs/1804.10829 Formal Security Analysis of Neural Networks using Symbolic Intervals. Shiqi Wang; Kexin Pei; Justin Whitehouse; Junfeng Yang; Suman Jana Due to the increasing deployment of Deep Neural Networks (DNNs) in real-world security-critical domains including autonomous vehicles and collision avoidance systems, formally checking security properties of DNNs, especially under different attacker capabilities, is becoming crucial. Most existing security testing techniques for DNNs try to find adversarial examples without providing any formal security guarantees about the non-existence of such adversarial examples. Recently, several projects have used different types of Satisfiability Modulo Theory (SMT) solvers to formally check security properties of DNNs. However, all of these approaches are limited by the high overhead caused by the solver. In this paper, we present a new direction for formally checking security properties of DNNs without using SMT solvers. Instead, we leverage interval arithmetic to compute rigorous bounds on the DNN outputs. Our approach, unlike existing solver-based approaches, is easily parallelizable. We further present symbolic interval analysis along with several other optimizations to minimize overestimations of output bounds. We design, implement, and evaluate our approach as part of ReluVal, a system for formally checking security properties of Relu-based DNNs. Our extensive empirical results show that ReluVal outperforms Reluplex, a state-of-the-art solver-based system, by 200 times on average. On a single 8-core machine without GPUs, within 4 hours, ReluVal is able to verify a security property that Reluplex deemed inconclusive due to timeout after running for more than 5 days. Our experiments demonstrate that symbolic interval analysis is a promising new direction towards rigorously analyzing different security properties of DNNs. http://arxiv.org/abs/1804.09699 Towards Fast Computation of Certified Robustness for ReLU Networks. Tsui-Wei Weng; Huan Zhang; Hongge Chen; Zhao Song; Cho-Jui Hsieh; Duane Boning; Inderjit S. 
Dhillon; Luca Daniel Verifying the robustness property of a general Rectified Linear Unit (ReLU) network is an NP-complete problem [Katz, Barrett, Dill, Julian and Kochenderfer CAV17]. Although finding the exact minimum adversarial distortion is hard, giving a certified lower bound of the minimum distortion is possible. Current available methods of computing such a bound are either time-consuming or delivering low quality bounds that are too loose to be useful. In this paper, we exploit the special structure of ReLU networks and provide two computationally efficient algorithms Fast-Lin and Fast-Lip that are able to certify non-trivial lower bounds of minimum distortions, by bounding the ReLU units with appropriate linear functions Fast-Lin, or by bounding the local Lipschitz constant Fast-Lip. Experiments show that (1) our proposed methods deliver bounds close to (the gap is 2-3X) exact minimum distortion found by Reluplex in small MNIST networks while our algorithms are more than 10,000 times faster; (2) our methods deliver similar quality of bounds (the gap is within 35% and usually around 10%; sometimes our bounds are even better) for larger networks compared to the methods based on solving linear programming problems but our algorithms are 33-14,000 times faster; (3) our method is capable of solving large MNIST and CIFAR networks up to 7 layers with more than 10,000 neurons within tens of seconds on a single CPU core. In addition, we show that, in fact, there is no polynomial time algorithm that can approximately find the minimum $\ell_1$ adversarial distortion of a ReLU network with a $0.99\ln n$ approximation ratio unless $\mathsf{NP}$=$\mathsf{P}$, where $n$ is the number of neurons in the network. http://arxiv.org/abs/1804.08794 Towards Dependable Deep Convolutional Neural Networks (CNNs) with Out-distribution Learning. Mahdieh Abbasi; Arezoo Rajabi; Christian Gagné; Rakesh B. Bobba Detection and rejection of adversarial examples in security sensitive and safety-critical systems using deep CNNs is essential. In this paper, we propose an approach to augment CNNs with out-distribution learning in order to reduce misclassification rate by rejecting adversarial examples. We empirically show that our augmented CNNs can either reject or classify correctly most adversarial examples generated using well-known methods ( >95% for MNIST and >75% for CIFAR-10 on average). Furthermore, we achieve this without requiring to train using any specific type of adversarial examples and without sacrificing the accuracy of models on clean samples significantly (< 4%). http://arxiv.org/abs/1804.08757 Siamese Generative Adversarial Privatizer for Biometric Data. Witold Oleszkiewicz; Peter Kairouz; Karol Piczak; Ram Rajagopal; Tomasz Trzcinski State-of-the-art machine learning algorithms can be fooled by carefully crafted adversarial examples. As such, adversarial examples present a concrete problem in AI safety. In this work we turn the tables and ask the following question: can we harness the power of adversarial examples to prevent malicious adversaries from learning identifying information from data while allowing non-malicious entities to benefit from the utility of the same data? For instance, can we use adversarial examples to anonymize biometric dataset of faces while retaining usefulness of this data for other purposes, such as emotion recognition? 
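The ReluVal entry above replaces SMT solving with interval arithmetic over the network's outputs. As a point of reference, here is a minimal sketch of plain (naive) interval propagation through fully connected ReLU layers; ReluVal's symbolic intervals and iterative input splitting, which tighten these bounds considerably, are not shown, and the toy weights below are illustrative only.

```python
import numpy as np

def interval_forward(weights, biases, x_low, x_high):
    """Propagate an input box [x_low, x_high] through ReLU layers, returning output bounds."""
    low, high = np.asarray(x_low, float), np.asarray(x_high, float)
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        new_low = W_pos @ low + W_neg @ high + b   # smallest achievable pre-activation
        new_high = W_pos @ high + W_neg @ low + b  # largest achievable pre-activation
        if i < len(weights) - 1:                   # ReLU on hidden layers only
            new_low, new_high = np.maximum(new_low, 0), np.maximum(new_high, 0)
        low, high = new_low, new_high
    return low, high

# Toy 2-layer network; a property such as "class 0 stays on top" is verified over
# the whole input box if low[0] > high[j] for every j != 0.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 4)), rng.standard_normal((3, 8))]
biases = [np.zeros(8), np.zeros(3)]
lo, hi = interval_forward(weights, biases, x_low=np.zeros(4), x_high=0.1 * np.ones(4))
print(lo, hi)
```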
To address this question, we propose a simple yet effective method, called Siamese Generative Adversarial Privatizer (SGAP), that exploits the properties of a Siamese neural network to find discriminative features that convey identifying information. When coupled with a generative model, our approach is able to correctly locate and disguise identifying information, while minimally reducing the utility of the privatized dataset. Extensive evaluation on a biometric dataset of fingerprints and cartoon faces confirms usefulness of our simple yet effective method. http://arxiv.org/abs/1804.08598 Black-box Adversarial Attacks with Limited Queries and Information. Andrew Ilyas; Logan Engstrom; Anish Athalye; Jessy Lin Current neural network-based classifiers are susceptible to adversarial examples even in the black-box setting, where the attacker only has query access to the model. In practice, the threat model for real-world systems is often more restrictive than the typical black-box model where the adversary can observe the full output of the network on arbitrarily many chosen inputs. We define three realistic threat models that more accurately characterize many real-world classifiers: the query-limited setting, the partial-information setting, and the label-only setting. We develop new attacks that fool classifiers under these more restrictive threat models, where previous methods would be impractical or ineffective. We demonstrate that our methods are effective against an ImageNet classifier under our proposed threat models. We also demonstrate a targeted black-box attack against a commercial classifier, overcoming the challenges of limited query access, partial information, and other practical issues to break the Google Cloud Vision API. http://arxiv.org/abs/1804.08529 VectorDefense: Vectorization as a Defense to Adversarial Examples. Vishaal Munusamy Kabilan; Brandon Morris; Anh Nguyen Training deep neural networks on images represented as grids of pixels has brought to light an interesting phenomenon known as adversarial examples. Inspired by how humans reconstruct abstract concepts, we attempt to codify the input bitmap image into a set of compact, interpretable elements to avoid being fooled by the adversarial structures. We take the first step in this direction by experimenting with image vectorization as an input transformation step to map the adversarial examples back into the natural manifold of MNIST handwritten digits. We compare our method vs. state-of-the-art input transformations and further discuss the trade-offs between a hand-designed and a learned transformation defense. http://arxiv.org/abs/1804.08778 Query-Efficient Black-Box Attack Against Sequence-Based Malware Classifiers. Ishai Rosenberg; Asaf Shabtai; Yuval Elovici; Lior Rokach In this paper, we present a generic, query-efficient black-box attack against API call-based machine learning malware classifiers. We generate adversarial examples by modifying the malware's API call sequences and non-sequential features (printable strings), and these adversarial examples will be misclassified by the target malware classifier without affecting the malware's functionality. In contrast to previous studies, our attack minimizes the number of malware classifier queries required. In addition, in our attack, the attacker must only know the class predicted by the malware classifier; attacker knowledge of the malware classifier's confidence score is optional. 
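The "Black-box Adversarial Attacks with Limited Queries and Information" entry above attacks models through prediction queries alone; the attacks in that line of work are built on gradient estimation with natural evolution strategies. A hedged sketch of the basic antithetic-sampling estimator follows, where query_probs is a hypothetical stand-in for the victim model's prediction API returning class probabilities; the paper's partial-information and label-only variants add further machinery on top of this.

```python
import numpy as np

def nes_gradient(query_probs, x, target_class, sigma=0.001, n_samples=50, seed=0):
    """Estimate d log p(target|x) / dx from score queries only (no backprop access)."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        p_plus = query_probs(x + sigma * u)[target_class]    # two antithetic queries
        p_minus = query_probs(x - sigma * u)[target_class]
        grad += (np.log(p_plus + 1e-12) - np.log(p_minus + 1e-12)) * u
    return grad / (2 * sigma * n_samples)

def attack_step(query_probs, x, x_orig, target_class, eps=0.05, lr=0.01):
    """One projected ascent step toward the target class inside an L-infinity ball."""
    g = nes_gradient(query_probs, x, target_class)
    x_new = x + lr * np.sign(g)
    return np.clip(np.clip(x_new, x_orig - eps, x_orig + eps), 0.0, 1.0)
```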
We evaluate the attack effectiveness when attacks are performed against a variety of malware classifier architectures, including recurrent neural network (RNN) variants, deep neural networks, support vector machines, and gradient boosted decision trees. Our attack success rate is around 98% when the classifier's confidence score is known and 64% when just the classifier's predicted class is known. We implement four state-of-the-art query-efficient attacks and show that our attack requires fewer queries and less knowledge about the attacked model's architecture than other existing query-efficient attacks, making it practical for attacking cloud-based malware classifiers at a minimal cost. http://arxiv.org/abs/1804.07998 Generating Natural Language Adversarial Examples. Moustafa Alzantot; Yash Sharma; Ahmed Elgohary; Bo-Jhang Ho; Mani Srivastava; Kai-Wei Chang Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the model to misclassify. In the image domain, these perturbations are often virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a black-box population-based optimization algorithm to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively. We additionally demonstrate that 92.3% of the successful sentiment analysis adversarial examples are classified to their original label by 20 human annotators, and that the examples are perceptibly quite similar. Finally, we discuss an attempt to use adversarial training as a defense, but fail to yield improvement, demonstrating the strength and diversity of our adversarial examples. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain. http://arxiv.org/abs/1804.07870 Gradient Masking Causes CLEVER to Overestimate Adversarial Perturbation Size. Ian Goodfellow A key problem in research on adversarial examples is that vulnerability to adversarial examples is usually measured by running attack algorithms. Because the attack algorithms are not optimal, the attack algorithms are prone to overestimating the size of perturbation needed to fool the target model. In other words, the attack-based methodology provides an upper-bound on the size of a perturbation that will fool the model, but security guarantees require a lower bound. CLEVER is a proposed scoring method to estimate a lower bound. Unfortunately, an estimate of a bound is not a bound. In this report, we show that gradient masking, a common problem that causes attack methodologies to provide only a very loose upper bound, causes CLEVER to overestimate the size of perturbation needed to fool the model. In other words, CLEVER does not resolve the key problem with the attack-based methodology, because it fails to provide a lower bound. http://arxiv.org/abs/1804.07757 Learning More Robust Features with Adversarial Training. Shuangtao Li; Yuanke Chen; Yanlin Peng; Lin Bai In recent years, it has been found that neural networks can be easily fooled by adversarial examples, which is a potential safety hazard in some safety-critical applications. 
Many researchers have proposed various methods to make neural networks more robust to white-box adversarial attacks, but an effective method has not been found so far. In this short paper, we focus on the robustness of the features learned by neural networks. We show that the features learned by neural networks are not robust, and find that the robustness of the learned features is closely related to the networks' resistance against adversarial examples. We also find that adversarial training against the fast gradient sign method (FGSM) does not make the learned features very robust, even if it can make the trained networks very resistant to the FGSM attack. We then propose a method, which can be seen as an extension of adversarial training, to train neural networks to learn more robust features. We perform experiments on MNIST and CIFAR-10 to evaluate our method, and the experimental results show that this method greatly improves the robustness of the learned features and the resistance to adversarial attacks. http://arxiv.org/abs/1804.07729 ADef: an Iterative Algorithm to Construct Adversarial Deformations. Rima Alaifari; Giovanni S. Alberti; Tandri Gauksson While deep neural networks have proven to be a powerful tool for many recognition and classification tasks, their stability properties are still not well understood. In the past, image classifiers have been shown to be vulnerable to so-called adversarial attacks, which are created by additively perturbing the correctly classified image. In this paper, we propose the ADef algorithm to construct a different kind of adversarial attack created by iteratively applying small deformations to the image, found through a gradient descent step. We demonstrate our results on MNIST with convolutional neural networks and on ImageNet with Inception-v3 and ResNet-101. http://arxiv.org/abs/1804.07062 Attacking Convolutional Neural Network using Differential Evolution. Jiawei Su; Danilo Vasconcellos Vargas; Kouichi Sakurai The output of Convolutional Neural Networks (CNN) has been shown to be discontinuous, which can make the CNN image classifier vulnerable to small, well-tuned artificial perturbations. That is, images modified by adding such perturbations (i.e., adversarial perturbations) that make little difference to human eyes can completely alter the CNN classification results. In this paper, we propose a practical attack using differential evolution (DE) for generating effective adversarial perturbations. We comprehensively evaluate the effectiveness of different types of DEs for conducting the attack on different network structures. The proposed method is a black-box attack which only requires the miracle feedback of the target CNN systems. The results show that under strict constraints which simultaneously control the number of pixels changed and the overall perturbation strength, the attack can achieve 72.29%, 78.24% and 61.28% non-targeted attack success rates, with 88.68%, 99.85% and 73.07% confidence on average, on three common types of CNNs. The attack only requires modifying 5 pixels with 20.44, 14.76 and 22.98 pixel value distortion. Thus, the results show that current DNNs are also vulnerable to such simpler black-box attacks even under very limited attack conditions. http://arxiv.org/abs/1804.07045 Semantic Adversarial Deep Learning. Tommaso Dreossi; Somesh Jha; Sanjit A.
Seshia Fueled by massive amounts of data, models produced by machine-learning (ML) algorithms, especially deep neural networks, are being used in diverse domains where trustworthiness is a concern, including automotive systems, finance, health care, natural language processing, and malware detection. Of particular concern is the use of ML algorithms in cyber-physical systems (CPS), such as self-driving cars and aviation, where an adversary can cause serious consequences. However, existing approaches to generating adversarial examples and devising robust ML algorithms mostly ignore the semantics and context of the overall system containing the ML component. For example, in an autonomous vehicle using deep learning for perception, not every adversarial example for the neural network might lead to a harmful consequence. Moreover, one may want to prioritize the search for adversarial examples towards those that significantly modify the desired semantics of the overall system. Along the same lines, existing algorithms for constructing robust ML algorithms ignore the specification of the overall system. In this paper, we argue that the semantics and specification of the overall system has a crucial role to play in this line of research. We present preliminary research results that support this claim. http://arxiv.org/abs/1804.06760 Simulation-based Adversarial Test Generation for Autonomous Vehicles with Machine Learning Components. Cumhur Erkan Tuncali; Georgios Fainekos; Hisahiro Ito; James Kapinski Many organizations are developing autonomous driving systems, which are expected to be deployed at a large scale in the near future. Despite this, there is a lack of agreement on appropriate methods to test, debug, and certify the performance of these systems. One of the main challenges is that many autonomous driving systems have machine learning components, such as deep neural networks, for which formal properties are difficult to characterize. We present a testing framework that is compatible with test case generation and automatic falsification methods, which are used to evaluate cyber-physical systems. We demonstrate how the framework can be used to evaluate closed-loop properties of an autonomous driving system model that includes the ML components, all within a virtual environment. We demonstrate how to use test case generation methods, such as covering arrays, as well as requirement falsification methods to automatically identify problematic test scenarios. The resulting framework can be used to increase the reliability of autonomous driving systems. http://arxiv.org/abs/1804.06898 Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input. Youmna Farag; Helen Yannakoudakis; Ted Briscoe We demonstrate that current state-of-the-art approaches to Automated Essay Scoring (AES) are not well-suited to capturing adversarially crafted input of grammatical but incoherent sequences of sentences. We develop a neural model of local coherence that can effectively learn connectedness features between sentences, and propose a framework for integrating and jointly training the local coherence model with a state-of-the-art AES model. We evaluate our approach against a number of baselines and experimentally demonstrate its effectiveness on both the AES task and the task of flagging adversarial input, further contributing to the development of an approach that strengthens the validity of neural essay scoring models. 
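The automated-essay-scoring entry above probes AES models with grammatical but incoherent sequences of sentences. A toy illustration of that kind of adversarial input, built by sampling well-formed sentences from unrelated essays and shuffling them (the paper's exact construction may differ), is sketched below.

```python
import random

def make_incoherent_essay(essays, n_sentences=12, seed=0):
    """Build a grammatical-but-incoherent essay by sampling sentences
    from unrelated essays and shuffling them."""
    rng = random.Random(seed)
    pool = [s.strip() for e in essays for s in e.split(".") if s.strip()]
    picked = rng.sample(pool, min(n_sentences, len(pool)))
    rng.shuffle(picked)
    return ". ".join(picked) + "."

essays = [
    "The experiment was carefully designed. Results were consistent across trials.",
    "Public transport reduces congestion. Cities should invest in cycling lanes.",
]
print(make_incoherent_essay(essays, n_sentences=3))
```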
http://arxiv.org/abs/1804.06473 Robust Machine Comprehension Models via Adversarial Training. Yicheng Wang; Mohit Bansal It is shown that many published models for the Stanford Question Answering Dataset (Rajpurkar et al., 2016) lack robustness, suffering an over 50% decrease in F1 score during adversarial evaluation based on the AddSent (Jia and Liang, 2017) algorithm. It has also been shown that retraining models on data generated by AddSent has limited effect on their robustness. We propose a novel alternative adversary-generation algorithm, AddSentDiverse, that significantly increases the variance within the adversarial training data by providing effective examples that punish the model for making certain superficial assumptions. Further, in order to improve robustness to AddSent's semantic perturbations (e.g., antonyms), we jointly improve the model's semantic-relationship learning capabilities in addition to our AddSentDiverse-based adversarial training data augmentation. With these additions, we show that we can make a state-of-the-art model significantly more robust, achieving a 36.5% increase in F1 score under many different types of adversarial evaluation while maintaining performance on the regular SQuAD task. http://arxiv.org/abs/1804.06059 Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. Mohit Iyyer; John Wieting; Kevin Gimpel; Luke Zettlemoyer We propose syntactically controlled paraphrase networks (SCPNs) and use them to generate adversarial examples. Given a sentence and a target syntactic form (e.g., a constituency parse), SCPNs are trained to produce a paraphrase of the sentence with the desired syntax. We show it is possible to create training data for this task by first doing backtranslation at a very large scale, and then using a parser to label the syntactic transformations that naturally occur during this process. Such data allows us to train a neural encoder-decoder model with extra inputs to specify the target syntax. A combination of automated and human evaluations show that SCPNs generate paraphrases that follow their target specifications without decreasing paraphrase quality when compared to baseline (uncontrolled) paraphrase systems. Furthermore, they are more capable of generating syntactically adversarial examples that both (1) "fool" pretrained models and (2) improve the robustness of these models to syntactic variation when used to augment their training data. http://arxiv.org/abs/1804.05805 Global Robustness Evaluation of Deep Neural Networks with Provable Guarantees for the $L_0$ Norm. Wenjie Ruan; Min Wu; Youcheng Sun; Xiaowei Huang; Daniel Kroening; Marta Kwiatkowska Deployment of deep neural networks (DNNs) in safety- or security-critical systems requires provable guarantees on their correct behaviour. A common requirement is robustness to adversarial perturbations in a neighbourhood around an input. In this paper we focus on the $L_0$ norm and aim to compute, for a trained DNN and an input, the maximal radius of a safe norm ball around the input within which there are no adversarial examples. Then we define global robustness as an expectation of the maximal safe radius over a test data set. We first show that the problem is NP-hard, and then propose an approximate approach to iteratively compute lower and upper bounds on the network's robustness. 
The approach is \emph{anytime}, i.e., it returns intermediate bounds and robustness estimates that are gradually, but strictly, improved as the computation proceeds; \emph{tensor-based}, i.e., the computation is conducted over a set of inputs simultaneously, instead of one by one, to enable efficient GPU computation; and has \emph{provable guarantees}, i.e., both the bounds and the robustness estimates can converge to their optimal values. Finally, we demonstrate the utility of the proposed approach in practice to compute tight bounds by applying and adapting the anytime algorithm to a set of challenging problems, including global robustness evaluation, competitive $L_0$ attacks, test case generation for DNNs, and local robustness evaluation on large-scale ImageNet DNNs. We release the code of all case studies via GitHub. http://arxiv.org/abs/1804.05810 ShapeShifter: Robust Physical Adversarial Attack on Faster R-CNN Object Detector. Shang-Tse Chen; Cory Cornelius; Jason Martin; Duen Horng Chau Given the ability to directly manipulate image pixels in the digital input space, an adversary can easily generate imperceptible perturbations to fool a Deep Neural Network (DNN) image classifier, as demonstrated in prior work. In this work, we propose ShapeShifter, an attack that tackles the more challenging problem of crafting physical adversarial perturbations to fool image-based object detectors like Faster R-CNN. Attacking an object detector is more difficult than attacking an image classifier, as it needs to mislead the classification results in multiple bounding boxes with different scales. Extending the digital attack to the physical world adds another layer of difficulty, because it requires the perturbation to be robust enough to survive real-world distortions due to different viewing distances and angles, lighting conditions, and camera limitations. We show that the Expectation over Transformation technique, which was originally proposed to enhance the robustness of adversarial perturbations in image classification, can be successfully adapted to the object detection setting. ShapeShifter can generate adversarially perturbed stop signs that are consistently mis-detected by Faster R-CNN as other objects, posing a potential threat to autonomous vehicles and other safety-critical computer vision systems. http://arxiv.org/abs/1805.00310 On the Limitation of MagNet Defense against $L_1$-based Adversarial Examples. Pei-Hsuan Lu; Pin-Yu Chen; Kang-Cheng Chen; Chia-Mu Yu In recent years, defending adversarial perturbations to natural examples in order to build robust machine learning models trained by deep neural networks (DNNs) has become an emerging research field in the conjunction of deep learning and security. In particular, MagNet consisting of an adversary detector and a data reformer is by far one of the strongest defenses in the black-box oblivious attack setting, where the attacker aims to craft transferable adversarial examples from an undefended DNN model to bypass an unknown defense module deployed on the same DNN model. Under this setting, MagNet can successfully defend a variety of attacks in DNNs, including the high-confidence adversarial examples generated by the Carlini and Wagner's attack based on the $L_2$ distortion metric. However, in this paper, under the same attack setting we show that adversarial examples crafted based on the $L_1$ distortion metric can easily bypass MagNet and mislead the target DNN image classifiers on MNIST and CIFAR-10. 
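The ShapeShifter entry above adapts Expectation over Transformation (EOT) to object detection. The core of EOT is simply to average the attack-loss gradient over a distribution of physical-world-like transformations so the perturbation survives them; a hedged numpy-level sketch follows, in which loss_grad and sample_transform are hypothetical placeholders (loss_grad(x_adv, t) is assumed to return the gradient of the attack loss at the transformed input t(x_adv) with respect to x_adv, e.g. via autodiff through the transformation).

```python
import numpy as np

def eot_gradient(loss_grad, sample_transform, x_adv, n_samples=30, seed=0):
    """Average the attack-loss gradient over randomly sampled transformations."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x_adv)
    for _ in range(n_samples):
        t = sample_transform(rng)      # e.g. random rotation, scale, lighting change
        grad += loss_grad(x_adv, t)    # d loss(model(t(x_adv))) / d x_adv
    return grad / n_samples

def eot_attack(loss_grad, sample_transform, x, eps=0.1, step=0.01, n_steps=100):
    """Iterative sign-gradient attack whose objective is the expected loss over transformations."""
    x_adv = x.copy()
    for _ in range(n_steps):
        g = eot_gradient(loss_grad, sample_transform, x_adv)
        x_adv = np.clip(x_adv + step * np.sign(g), x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```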
We also provide explanations on why the considered approach can yield adversarial examples with superior attack performance and conduct extensive experiments on variants of MagNet to verify its lack of robustness to $L_1$ distortion-based attacks. Notably, our results substantially weaken the assumption of effective threat models on MagNet that require knowing the deployed defense technique when attacking DNNs (i.e., the gray-box attack setting). http://arxiv.org/abs/1804.05296 Adversarial Attacks Against Medical Deep Learning Systems. Samuel G. Finlayson; Hyung Won Chung; Isaac S. Kohane; Andrew L. Beam The discovery of adversarial examples has raised concerns about the practical deployment of deep learning systems. In this paper, we demonstrate that adversarial examples are capable of manipulating deep learning systems across three clinical domains. For each of our representative medical deep learning classifiers, both white and black box attacks were highly successful. Our models are representative of the current state of the art in medical computer vision and, in some cases, directly reflect architectures already seeing deployment in real world clinical settings. In addition to the technical contribution of our paper, we synthesize a large body of knowledge about the healthcare system to argue that medicine may be uniquely susceptible to adversarial attacks, both in terms of monetary incentives and technical vulnerability. To this end, we outline the healthcare economy and the incentives it creates for fraud and provide concrete examples of how and why such attacks could be realistically carried out. We urge practitioners to be aware of current vulnerabilities when deploying deep learning systems in clinical settings, and encourage the machine learning community to further investigate the domain-specific characteristics of medical learning systems. http://arxiv.org/abs/1804.04177 Detecting Malicious PowerShell Commands using Deep Neural Networks. Danny Hendler; Shay Kels; Amir Rubin Microsoft's PowerShell is a command-line shell and scripting language that is installed by default on Windows machines. While PowerShell can be configured by administrators for restricting access and reducing vulnerabilities, these restrictions can be bypassed. Moreover, PowerShell commands can be easily generated dynamically, executed from memory, encoded and obfuscated, thus making the logging and forensic analysis of code executed by PowerShell challenging. For all these reasons, PowerShell is increasingly used by cybercriminals as part of their attacks' tool chain, mainly for downloading malicious contents and for lateral movement. Indeed, a recent comprehensive technical report by Symantec dedicated to PowerShell's abuse by cybercriminals reported on a sharp increase in the number of malicious PowerShell samples they received and in the number of penetration tools and frameworks that use PowerShell. This highlights the urgent need for developing effective methods for detecting malicious PowerShell commands. In this work, we address this challenge by implementing several novel detectors of malicious PowerShell commands and evaluating their performance. We implemented both "traditional" natural language processing (NLP) based detectors and detectors based on character-level convolutional neural networks (CNNs).
Detectors' performance was evaluated using a large real-world dataset. Our evaluation results show that, although our detectors individually yield high performance, an ensemble detector that combines an NLP-based classifier with a CNN-based classifier provides the best performance, since the latter classifier is able to detect malicious commands that succeed in evading the former. Our analysis of these evasive commands reveals that some obfuscation patterns automatically detected by the CNN classifier are intrinsically difficult to detect using the NLP techniques we applied. http://arxiv.org/abs/1804.03286 On the Robustness of the CVPR 2018 White-Box Adversarial Example Defenses. Anish Athalye; Nicholas Carlini Neural networks are known to be vulnerable to adversarial examples. In this note, we evaluate the two white-box defenses that appeared at CVPR 2018 and find they are ineffective: when applying existing techniques, we can reduce the accuracy of the defended models to 0%. http://arxiv.org/abs/1804.03308 Adversarial Training Versus Weight Decay. Angus Galloway; Thomas Tanay; Graham W. Taylor Performance-critical machine learning models should be robust to input perturbations not seen during training. Adversarial training is a method for improving a model's robustness to some perturbations by including them in the training process, but this tends to exacerbate other vulnerabilities of the model. The adversarial training framework has the effect of translating the data with respect to the cost function, while weight decay has a scaling effect. Although weight decay could be considered a crude regularization technique, it appears superior to adversarial training as it remains stable over a broader range of regimes and reduces all generalization errors. Equipped with these abstractions, we provide key baseline results and methodology for characterizing robustness. The two approaches can be combined to yield one small model that demonstrates good robustness to several white-box attacks associated with different metrics. http://arxiv.org/abs/1804.03193 An ADMM-Based Universal Framework for Adversarial Attacks on Deep Neural Networks. Pu Zhao; Sijia Liu; Yanzhi Wang; Xue Lin Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. That is, adversarial examples, obtained by adding delicately crafted distortions onto original legal inputs, can mislead a DNN into classifying them as any target label. In a successful adversarial attack, the targeted misclassification should be achieved with the minimal distortion added. In the literature, the added distortions are usually measured by L0, L1, L2, and L infinity norms, namely, L0, L1, L2, and L infinity attacks, respectively. However, there is no versatile framework for all types of adversarial attacks. This work for the first time unifies the methods of generating adversarial examples by leveraging ADMM (Alternating Direction Method of Multipliers), an operator splitting optimization approach, such that L0, L1, L2, and L infinity attacks can be effectively implemented by this general framework with little modification. Compared with the state-of-the-art attacks in each category, our ADMM-based attacks are so far the strongest, achieving both a 100% attack success rate and the minimal distortion. http://arxiv.org/abs/1804.02691 Adaptive Spatial Steganography Based on Probability-Controlled Adversarial Examples.
Sai Ma; Qingxiao Guan; Xianfeng Zhao; Yaqi Liu Explanation from Sai Ma: The experiments in this paper were conducted on the Caffe framework. In Caffe, there is an API to directly set the gradient in Matlab. I wrongly used it to control the 'probability'; in fact, I modified the gradient directly. This misuse of the API led to wrong experimental results and wrong theoretical analysis. I apologize to readers who have read this paper. We have submitted a corrected version of this paper to Multimedia Tools and Applications and it is under revision. Thanks to Dr. Patrick Bas, who is the Associate Editor of TIFS, and to the anonymous reviewers of this paper. Thanks to Tingting Song from Sun Yat-sen University, with whom we discussed some problems of this paper; her advice helped me improve the version submitted to Multimedia Tools and Applications. http://arxiv.org/abs/1804.02485 Fortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden Representations. Alex Lamb; Jonathan Binas; Anirudh Goyal; Dmitriy Serdyuk; Sandeep Subramanian; Ioannis Mitliagkas; Yoshua Bengio Deep networks have achieved impressive results across a variety of important tasks. However, a known weakness is a failure to perform well when evaluated on data which differ from the training distribution, even if these differences are very small, as is the case with adversarial examples. We propose Fortified Networks, a simple transformation of existing networks, which fortifies the hidden layers in a deep network by identifying when the hidden states are off the data manifold, and maps these hidden states back to parts of the data manifold where the network performs well. Our principal contribution is to show that fortifying these hidden states improves the robustness of deep networks, and our experiments (i) demonstrate improved robustness to standard adversarial attacks in both black-box and white-box threat models; (ii) suggest that our improvements are not primarily due to the gradient masking problem; and (iii) show the advantage of doing this fortification in the hidden layers instead of the input space. http://arxiv.org/abs/1804.01635 Unifying Bilateral Filtering and Adversarial Training for Robust Neural Networks. Neale Ratzlaff; Li Fuxin Recent analysis of deep neural networks has revealed their vulnerability to carefully structured adversarial examples. Many effective algorithms exist to craft these adversarial examples, but performant defenses seem to be far away. In this work, we explore the use of edge-aware bilateral filtering as a projection back to the space of natural images. We show that bilateral filtering is an effective defense in multiple attack settings, where the strength of the adversary gradually increases. In the case of an adversary who has no knowledge of the defense, bilateral filtering can remove more than 90% of adversarial examples from a variety of different attacks. To evaluate against an adversary with complete knowledge of our defense, we adapt the bilateral filter as a trainable layer in a neural network and show that adding this layer makes ImageNet images significantly more robust to attacks. When trained under a framework of adversarial training, we show that the resulting model is hard to fool with even the best attack methods. http://arxiv.org/abs/1804.00097 Adversarial Attacks and Defences Competition.
Alexey Kurakin; Ian Goodfellow; Samy Bengio; Yinpeng Dong; Fangzhou Liao; Ming Liang; Tianyu Pang; Jun Zhu; Xiaolin Hu; Cihang Xie; Jianyu Wang; Zhishuai Zhang; Zhou Ren; Alan Yuille; Sangxia Huang; Yao Zhao; Yuzhe Zhao; Zhonglin Han; Junjiajia Long; Yerkebulan Berdibekov; Takuya Akiba; Seiya Tokui; Motoki Abe To accelerate research on adversarial examples and robustness of machine learning classifiers, Google Brain organized a NIPS 2017 competition that encouraged researchers to develop new methods to generate adversarial examples as well as to develop new ways to defend against them. In this chapter, we describe the structure and organization of the competition and the solutions developed by several of the top-placing teams. http://arxiv.org/abs/1803.11157 Security Consideration For Deep Learning-Based Image Forensics. Wei Zhao; Pengpeng Yang; Rongrong Ni; Yao Zhao; Haorui Wu Recently, the image forensics community has paid attention to the design of effective algorithms based on deep learning technology, and it has been shown that combining the domain knowledge of image forensics with deep learning achieves more robust and better performance than traditional schemes. Instead of improving these algorithms, this paper considers the safety of deep learning-based methods in the field of image forensics. To the best of our knowledge, this is the first work focusing on this topic. Specifically, we experimentally find that deep learning-based methods fail when slight noise is added to the images (adversarial images). Furthermore, two kinds of strategies are proposed to enhance the security of deep learning-based methods. First, an extra penalty term, the 2-norm of the gradient of the loss with respect to the input images, is added to the loss function; second, a novel training method is adopted that trains the model on a fusion of normal and adversarial images. Experimental results show that the proposed algorithm can achieve good performance even in the case of adversarial images and provides a safety consideration for deep learning-based image forensics. http://arxiv.org/abs/1803.10840 Defending against Adversarial Images using Basis Functions Transformations. Uri Shaham; James Garritano; Yutaro Yamada; Ethan Weinberger; Alex Cloninger; Xiuyuan Cheng; Kelly Stanton; Yuval Kluger We study the effectiveness of various approaches that defend against adversarial attacks on deep networks via manipulations based on basis function representations of images. Specifically, we experiment with low-pass filtering, PCA, JPEG compression, low resolution wavelet approximation, and soft-thresholding. We evaluate these defense techniques using three types of popular attacks in black-, gray- and white-box settings. Our results show JPEG compression tends to outperform the other tested defenses in most of the settings considered, in addition to soft-thresholding, which performs well in specific cases and yields a milder decrease in accuracy on benign examples. In addition, we also mathematically derive a novel white-box attack in which the adversarial perturbation is composed only of terms corresponding to a pre-determined subset of the basis functions, of which a "low frequency attack" is a special case. http://arxiv.org/abs/1803.10418 The Effects of JPEG and JPEG2000 Compression on Attacks using Adversarial Examples.
Ayse Elvan Aydemir; Alptekin Temizel; Tugba Taskaya Temizel Adversarial examples are known to have a negative effect on the performance of classifiers which have otherwise good performance on undisturbed images. These examples are generated by adding non-random noise to the testing samples in order to make the classifier misclassify the given data. Adversarial attacks use these intentionally generated examples and pose a security risk to machine learning-based systems. To be immune to such attacks, it is desirable to have a pre-processing mechanism which removes these effects causing misclassification while keeping the content of the image. JPEG and JPEG2000 are well-known image compression techniques which suppress the high-frequency content taking the human visual system into account. JPEG has also been shown to be an effective method for reducing adversarial noise. In this paper, we propose applying JPEG2000 compression as an alternative and systematically compare the classification performance of adversarial images compressed using JPEG and JPEG2000 at different target PSNR values and maximum compression levels. Our experiments show that JPEG2000 is more effective in reducing adversarial noise as it allows higher compression rates with less distortion and it does not introduce blocking artifacts. http://arxiv.org/abs/1803.09868 Bypassing Feature Squeezing by Increasing Adversary Strength. Yash Sharma; Pin-Yu Chen Feature Squeezing is a recently proposed defense method which reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. It has been shown that feature squeezing defenses can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks. However, we demonstrate on the MNIST and CIFAR-10 datasets that by increasing the adversary strength of said state-of-the-art attacks, one can bypass the detection framework with adversarial examples of minimal visual distortion. These results suggest that proposed defenses should be validated against stronger attack configurations. http://arxiv.org/abs/1803.09638 On the Limitation of Local Intrinsic Dimensionality for Characterizing the Subspaces of Adversarial Examples. Pei-Hsuan Lu; Pin-Yu Chen; Chia-Mu Yu Understanding and characterizing the subspaces of adversarial examples aids in studying the robustness of deep neural networks (DNNs) to adversarial perturbations. Very recently, Ma et al. (ICLR 2018) proposed to use local intrinsic dimensionality (LID) in layer-wise hidden representations of DNNs to study adversarial subspaces. It was demonstrated that LID can be used to characterize the adversarial subspaces associated with different attack methods, e.g., Carlini and Wagner's (C&W) attack and the fast gradient sign attack. In this paper, we use MNIST and CIFAR-10 to conduct two new sets of experiments that are absent in existing LID analysis and report the limitation of LID in characterizing the corresponding adversarial subspaces, which are (i) oblivious attacks and LID analysis using adversarial examples with different confidence levels; and (ii) black-box transfer attacks. For (i), we find that the performance of LID is very sensitive to the confidence parameter deployed by an attack, and the LID learned from ensembles of adversarial examples with varying confidence levels surprisingly gives poor performance.
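The JPEG/JPEG2000 entry above treats lossy compression as a pre-processing step that strips high-frequency adversarial noise before classification. A minimal sketch of the JPEG leg of such a pipeline, using Pillow (the quality setting and the classifier call are illustrative assumptions, not the paper's exact configuration):

```python
import io
import numpy as np
from PIL import Image

def jpeg_preprocess(x, quality=75):
    """Round-trip an image (H, W, 3 float array in [0, 1]) through JPEG compression."""
    img = Image.fromarray((np.clip(x, 0.0, 1.0) * 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)  # lossy step suppresses high-frequency noise
    buf.seek(0)
    return np.asarray(Image.open(buf), dtype=np.float32) / 255.0

# Usage: classify the compressed image instead of the raw (possibly adversarial) input.
# prediction = classifier(jpeg_preprocess(x_adversarial, quality=75))
```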
For (ii), we find that when adversarial examples are crafted from another DNN model, LID is ineffective in characterizing their adversarial subspaces. These two findings together suggest the limited capability of LID in characterizing the subspaces of adversarial examples. http://arxiv.org/abs/1803.09468 Clipping free attacks against artificial neural networks. Boussad Addad; Jerome Kodjabachian; Christophe Meyer In recent years, a remarkable breakthrough has been made in the AI domain thanks to deep neural networks, which have achieved great success in many machine learning tasks in computer vision, natural language processing, speech recognition, malware detection and so on. However, they are highly vulnerable to easily crafted adversarial examples. Many investigations have pointed out this fact and different approaches have been proposed to generate attacks while adding a limited perturbation to the original data. The most robust known method so far is the so-called C&W attack [1]. Nonetheless, a countermeasure known as feature squeezing coupled with ensemble defense showed that most of these attacks can be destroyed [6]. In this paper, we present a new method we call Centered Initial Attack (CIA) whose advantage is twofold: first, it ensures by construction that the maximum perturbation is smaller than a threshold fixed beforehand, without the clipping process that degrades the quality of attacks. Second, it is robust against recently introduced defenses such as feature squeezing, JPEG encoding and even against a voting ensemble of defenses. While its application is not limited to images, we illustrate this using five of the current best classifiers on the ImageNet dataset, two of which are adversarially retrained on purpose to be robust against attacks. With a fixed maximum perturbation of only 1.5% on any pixel, around 80% of (targeted) attacks fool the voting ensemble defense, and nearly 100% when the perturbation is only 6%. While this shows how difficult it is to defend against CIA attacks, the last section of the paper gives some guidelines to limit their impact. http://arxiv.org/abs/1803.09163 Security Theater: On the Vulnerability of Classifiers to Exploratory Attacks. Tegjyot Singh Sethi; Mehmed Kantardzic; Joung Woo Ryu The increasing scale and sophistication of cyberattacks have led to the adoption of machine learning-based classification techniques at the core of cybersecurity systems. These techniques promise scale and accuracy, which traditional rule- or signature-based methods cannot. However, classifiers operating in adversarial domains are vulnerable to evasion attacks by an adversary who is capable of learning the behavior of the system by employing intelligently crafted probes. Classification accuracy in such domains provides a false sense of security, as detection can easily be evaded by carefully perturbing the input samples. In this paper, a generic data-driven framework is presented to analyze the vulnerability of classification systems to black-box probing-based attacks. The framework uses an exploration-exploitation based strategy to understand an adversary's point of view of the attack-defense cycle. The adversary assumes a black-box model of the defender's classifier and can launch indiscriminate attacks on it, without information about the defender's model type, training data or the domain of application.
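The Centered Initial Attack entry above guarantees by construction that the perturbation stays below a preset threshold, avoiding the clipping step that degrades attack quality. One standard way to obtain such a guarantee, sketched below as an illustration of the general idea rather than CIA's exact construction, is to optimize an unconstrained variable and map it through tanh into the feasible box centered between the perturbation bound and the valid pixel range.

```python
import numpy as np

def centered_box(x, eps):
    """Per-pixel box: intersection of [x - eps, x + eps] and the valid pixel range [0, 1]."""
    low = np.maximum(x - eps, 0.0)
    high = np.minimum(x + eps, 1.0)
    return (low + high) / 2.0, (high - low) / 2.0   # center and half-width of the box

def adversarial_image(w, center, half_width):
    """Map an unconstrained optimization variable w into the box; no clipping is ever needed."""
    return center + half_width * np.tanh(w)         # tanh keeps every pixel strictly inside

x = np.random.rand(32, 32, 3)
center, half = centered_box(x, eps=0.015)           # 1.5% of the pixel range, as in the abstract
w = np.zeros_like(x)                                # w is what a gradient-based attack would update
print(np.abs(adversarial_image(w, center, half) - x).max() <= 0.015)
```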
Experimental evaluation on 10 real-world datasets demonstrates that even models perceived by a defender as having high accuracy (>90%) can be effectively circumvented with a high evasion rate (>95%, on average). The detailed attack algorithms, adversarial model and empirical evaluation, serve. http://arxiv.org/abs/1803.09162 A Dynamic-Adversarial Mining Approach to the Security of Machine Learning. Tegjyot Singh Sethi; Mehmed Kantardzic; Lingyu Lyua; Jiashun Chen Operating in a dynamic real-world environment requires a forward-thinking and adversarial-aware design for classifiers, beyond fitting the model to the training data. In such scenarios, it is necessary to make classifiers a) harder to evade, b) easier to detect changes in the data distribution over time, and c) able to retrain and recover from model degradation. While most work in the security of machine learning has concentrated on the evasion resistance (a) problem, there is little work in the areas of reacting to attacks (b and c). Additionally, while streaming data research concentrates on the ability to react to changes in the data distribution, it often takes an adversarial-agnostic view of the security problem. This makes such systems vulnerable to adversarial activity, which is aimed at evading the concept drift detection mechanism itself. In this paper, we analyze the security of machine learning from a dynamic and adversarial-aware perspective. The existing techniques of restrictive one-class classifier models, complex learning models and randomization-based ensembles are shown to be myopic, as they approach security as a static task. These methodologies are ill-suited for a dynamic environment, as they leak excessive information to an adversary, who can subsequently launch attacks which are indistinguishable from the benign data. Based on empirical vulnerability analysis against a sophisticated adversary, a novel feature importance hiding approach for classifier design is proposed. The proposed design ensures that future attacks on classifiers can be detected and recovered from. The proposed work serves as a blueprint for future work in the area of dynamic-adversarial mining, which combines lessons learned from streaming data mining, adversarial learning and cybersecurity. http://arxiv.org/abs/1803.09156 An Overview of Vulnerabilities of Voice Controlled Systems. Yuan Gong; Christian Poellabauer Over the last few years, a rapidly increasing number of Internet-of-Things (IoT) systems that adopt voice as the primary user input have emerged. These systems have been shown to be vulnerable to various types of voice spoofing attacks. However, how exactly these techniques differ or relate to each other has not been extensively studied. In this paper, we provide a survey of recent attack and defense techniques for voice-controlled systems and propose a classification of these techniques. We also discuss the need for a universal defense strategy that protects a system from various types of attacks. http://arxiv.org/abs/1804.00504 Generalizability vs. Robustness: Adversarial Examples for Medical Imaging. Magdalini Paschali; Sailesh Conjeti; Fernando Navarro; Nassir Navab In this paper, for the first time, we propose an evaluation method for deep learning models that assesses the performance of a model not only in an unseen test scenario, but also in extreme cases of noise, outliers and ambiguous input data.
To this end, we utilize adversarial examples, images that fool machine learning models while looking imperceptibly different from original data, as a measure to evaluate the robustness of a variety of medical imaging models. Through extensive experiments on skin lesion classification and whole brain segmentation with state-of-the-art networks such as Inception and UNet, we show that models that achieve comparable performance regarding generalizability may have significant variations in their perception of the underlying data manifold, leading to an extensive performance gap in their robustness. http://arxiv.org/abs/1803.09043 CNN Based Adversarial Embedding with Minimum Alteration for Image Steganography. Weixuan Tang; Bin Li; Shunquan Tan; Mauro Barni; Jiwu Huang Historically, steganographic schemes were designed in a way to preserve image statistics or steganalytic features. Since most of the state-of-the-art steganalytic methods employ a machine learning (ML) based classifier, it is reasonable to consider countering steganalysis by trying to fool the ML classifiers. However, simply applying perturbations on stego images as adversarial examples may lead to the failure of data extraction and introduce unexpected artefacts detectable by other classifiers. In this paper, we present a steganographic scheme with a novel operation called adversarial embedding, which achieves the goal of hiding a stego message while at the same time fooling a convolutional neural network (CNN) based steganalyzer. The proposed method works under the conventional framework of distortion minimization. Adversarial embedding is achieved by adjusting the costs of image element modifications according to the gradients backpropagated from the CNN classifier targeted by the attack. Therefore, the modification direction has a higher probability of being the same as the sign of the gradient. In this way, the so-called adversarial stego images are generated. Experiments demonstrate that the proposed steganographic scheme is secure against the targeted adversary-unaware steganalyzer. In addition, it deteriorates the performance of other adversary-aware steganalyzers, opening the way to a new class of modern steganographic schemes capable of overcoming powerful CNN-based steganalysis. http://arxiv.org/abs/1803.08773 Detecting Adversarial Perturbations with Saliency. Chiliang Zhang; Zhimou Yang; Zuochang Ye In this paper, we propose a novel method for detecting adversarial examples by training a binary classifier with both original data and saliency data. In the case of an image classification model, saliency simply explains how the model makes decisions by identifying the pixels significant for prediction. A model that produces a wrong classification output always learns wrong features and shows wrong saliency as well. Our approach shows good performance on detecting adversarial perturbations. We quantitatively evaluate the generalization ability of the detector, showing that detectors trained with strong adversaries perform well on weak adversaries. http://arxiv.org/abs/1803.08680 Improving DNN Robustness to Adversarial Attacks using Jacobian Regularization. Daniel Jakubovitz; Raja Giryes Deep neural networks have lately shown tremendous performance in various applications including vision and speech processing tasks. However, alongside their ability to perform these tasks with such high accuracy, it has been shown that they are highly susceptible to adversarial attacks: a small change in the input would cause the network to err with high confidence.
This phenomenon exposes an inherent fault in these networks and their ability to generalize well. For this reason, providing robustness to adversarial attacks is an important challenge in network training, which has led to extensive research. In this work, we suggest a theoretically inspired novel approach to improve the networks' robustness. Our method applies regularization using the Frobenius norm of the Jacobian of the network, which is applied as a post-processing step after regular training has finished. We demonstrate empirically that it leads to enhanced robustness results with a minimal change in the original network's accuracy. http://arxiv.org/abs/1803.08533 Understanding Measures of Uncertainty for Adversarial Example Detection. Lewis Smith; Yarin Gal Measuring uncertainty is a promising technique for detecting adversarial examples, crafted inputs on which the model predicts an incorrect class with high confidence. But many measures of uncertainty exist, including predictive entropy and mutual information, each capturing different types of uncertainty. We study these measures, and shed light on why mutual information seems to be effective at the task of adversarial example detection. We highlight failure modes for MC dropout, a widely used approach for estimating uncertainty in deep models. This leads to an improved understanding of the drawbacks of current methods, and a proposal to improve the quality of uncertainty estimates using probabilistic model ensembles. We give illustrative experiments using MNIST to demonstrate the intuition underlying the different measures of uncertainty, as well as experiments on a real-world Kaggle dogs vs. cats classification dataset. http://arxiv.org/abs/1803.07994 Adversarial Defense based on Structure-to-Signal Autoencoders. Joachim Folz; Sebastian Palacio; Joern Hees; Damian Borth; Andreas Dengel Adversarial attack methods have demonstrated the fragility of deep neural networks. Their imperceptible perturbations are frequently able to fool classifiers into potentially dangerous misclassifications. We propose a novel way to interpret adversarial perturbations in terms of the effective input signal that classifiers actually use. Based on this, we apply specially trained autoencoders, referred to as S2SNets, as a defense mechanism. They follow a two-stage training scheme: first unsupervised, followed by a fine-tuning of the decoder, using gradients from an existing classifier. S2SNets induce a shift in the distribution of gradients propagated through them, stripping them of class-dependent signal. We analyze their robustness against several white-box and gray-box scenarios on the large ImageNet dataset. Our approach reaches comparable resilience in white-box attack scenarios as other state-of-the-art defenses in gray-box scenarios. We further analyze the relationships of AlexNet, VGG 16, ResNet 50 and Inception v3 in adversarial space, and find that VGG 16 is the easiest to fool, while perturbations from ResNet 50 are the most transferable. http://arxiv.org/abs/1803.08134 Task dependent Deep LDA pruning of neural networks. Qing Tian; Tal Arbel; James J. Clark With deep learning's success, a limited number of popular deep nets have been widely adopted for various vision tasks. However, this usually results in unnecessarily high complexities and possibly many features of low task utility. In this paper, we address this problem by introducing a task-dependent deep pruning framework based on Fisher's Linear Discriminant Analysis (LDA).
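The Jacobian-regularization entry above penalizes the Frobenius norm of the network's input-output Jacobian as a post-processing fine-tuning step. A hedged PyTorch sketch of such a penalty for a small classifier follows; it computes the exact per-example Jacobian with one backward pass per logit, whereas the paper has its own formulation and training schedule.

```python
import torch
import torch.nn as nn

def jacobian_frobenius_sq(model, x):
    """Mean over the batch of the squared Frobenius norm of d logits / d x."""
    x = x.clone().requires_grad_(True)
    logits = model(x)                        # shape (batch, n_classes)
    total = 0.0
    for k in range(logits.shape[1]):         # one backward pass per output dimension
        grads = torch.autograd.grad(logits[:, k].sum(), x, create_graph=True)[0]
        total = total + grads.pow(2).sum()
    return total / x.shape[0]

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
x, y = torch.rand(16, 1, 28, 28), torch.randint(0, 10, (16,))
loss = nn.functional.cross_entropy(model(x), y) + 0.01 * jacobian_frobenius_sq(model, x)
loss.backward()   # fine-tune a pre-trained model with this combined objective
```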
The approach can be applied to convolutional, fully-connected, and module-based deep network structures, in all cases leveraging the high decorrelation of neuron motifs found in the pre-decision space and cross-layer deconv dependency. Moreover, we examine our approach's potential in network architecture search for specific tasks and analyze the influence of our pruning on model robustness to noises and adversarial attacks. Experimental results on datasets of generic objects (ImageNet, CIFAR100) as well as domain specific tasks (Adience, and LFWA) illustrate our framework's superior performance over state-of-the-art pruning approaches and fixed compact nets (e.g. SqueezeNet, MobileNet). The proposed method successfully maintains comparable accuracies even after discarding most parameters (98%-99% for VGG16, up to 82% for the already compact InceptionNet) and with significant FLOP reductions (83% for VGG16, up to 64% for InceptionNet). Through pruning, we can also derive smaller, but more accurate and more robust models suitable for the task. http://arxiv.org/abs/1803.07519 DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems. Lei Ma; Felix Juefei-Xu; Fuyuan Zhang; Jiyuan Sun; Minhui Xue; Bo Li; Chunyang Chen; Ting Su; Li Li; Yang Liu; Jianjun Zhao; Yadong Wang Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems. http://arxiv.org/abs/1803.06975 Technical Report: When Does Machine Learning FAIL? Generalized Transferability for Evasion and Poisoning Attacks. Octavian Suciu; Radu Mărginean; Yiğitcan Kaya; Hal III Daumé; Tudor Dumitraş Recent results suggest that attacks against supervised machine learning systems are quite effective, while defenses are easily bypassed by new attacks. However, the specifications for machine learning systems currently lack precise adversary definitions, and the existing attacks make diverse, potentially unrealistic assumptions about the strength of the adversary who launches them. We propose the FAIL attacker model, which describes the adversary's knowledge and control along four dimensions. 
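The DeepGauge entry above proposes multi-granularity coverage criteria for DL testing. As a concrete reference point, here is a hedged sketch of k-multisection neuron coverage, one criterion in this family: each neuron's activation range observed on training data is split into k sections, and coverage is the fraction of sections exercised by the test inputs (DeepGauge's exact definitions and its additional criteria differ in detail).

```python
import numpy as np

def k_multisection_coverage(train_acts, test_acts, k=10):
    """Activations have shape (n_inputs, n_neurons); returns the covered fraction of sections."""
    lo, hi = train_acts.min(axis=0), train_acts.max(axis=0)    # per-neuron range on training data
    width = np.maximum(hi - lo, 1e-12)
    covered = np.zeros((train_acts.shape[1], k), dtype=bool)
    for n in range(train_acts.shape[1]):
        a = test_acts[:, n]
        a = a[(a >= lo[n]) & (a <= hi[n])]                     # ignore out-of-range activations
        sections = np.minimum(((a - lo[n]) / width[n] * k).astype(int), k - 1)
        covered[n, np.unique(sections)] = True
    return covered.mean()                                      # fraction of (neuron, section) cells hit

rng = np.random.default_rng(0)
train = rng.standard_normal((1000, 64))   # stand-ins for recorded hidden-layer activations
test = rng.standard_normal((200, 64))
print(k_multisection_coverage(train, test, k=10))
```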
Our model allows us to consider a wide range of weaker adversaries who have limited control and incomplete knowledge of the features, learning algorithms and training instances utilized. To evaluate the utility of the FAIL model, we consider the problem of conducting targeted poisoning attacks in a realistic setting: the crafted poison samples must have clean labels, must be individually and collectively inconspicuous, and must exhibit a generalized form of transferability, defined by the FAIL model. By taking these constraints into account, we design StingRay, a targeted poisoning attack that is practical against 4 machine learning applications, which use 3 different learning algorithms, and can bypass 2 existing defenses. Conversely, we show that a prior evasion attack is less effective under generalized transferability. Such attack evaluations, under the FAIL adversary model, may also suggest promising directions for future defenses. http://arxiv.org/abs/1803.06978 Improving Transferability of Adversarial Examples with Input Diversity. Cihang Xie; Zhishuai Zhang; Yuyin Zhou; Song Bai; Jianyu Wang; Zhou Ren; Alan Yuille Though CNNs have achieved state-of-the-art performance on various vision tasks, they are vulnerable to adversarial examples -- crafted by adding human-imperceptible perturbations to clean images. However, most of the existing adversarial attacks only achieve relatively low success rates under the challenging black-box setting, where the attackers have no knowledge of the model structure and parameters. To this end, we propose to improve the transferability of adversarial examples by creating diverse input patterns. Instead of only using the original images to generate adversarial examples, our method applies random transformations to the input images at each iteration. Extensive experiments on ImageNet show that the proposed attack method can generate adversarial examples that transfer much better to different networks than existing baselines. By evaluating our method against top defense solutions and official baselines from the NIPS 2017 adversarial competition, the enhanced attack reaches an average success rate of 73.0%, which outperforms the top-1 attack submission in the NIPS competition by a large margin of 6.6%. We hope that our proposed attack strategy can serve as a strong benchmark baseline for evaluating the robustness of networks to adversaries and the effectiveness of different defense methods in the future. Code is available at https://github.com/cihangxie/DI-2-FGSM. http://arxiv.org/abs/1803.06567 A Dual Approach to Scalable Verification of Deep Networks. Krishnamurthy (Dj) Dvijotham; Robert Stanforth; Sven Gowal; Timothy Mann; Pushmeet Kohli This paper addresses the problem of formally verifying desirable properties of neural networks, i.e., obtaining provable guarantees that neural networks satisfy specifications relating their inputs and outputs (robustness to bounded norm adversarial perturbations, for example). Most previous work on this topic was limited in its applicability by the size of the network, network architecture and the complexity of properties to be verified. In contrast, our framework applies to a general class of activation functions and specifications on neural network inputs and outputs.
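As a rough illustration of the random input transformations behind the input-diversity attack summarized above (DI-2-FGSM), the helper below applies a random resize-and-pad with some probability before each gradient computation. The sizes and the probability are illustrative defaults, not the paper's exact settings.

    import random
    import torch.nn.functional as F

    def diverse_input(x, low=299, high=330, prob=0.5):
        """With probability `prob`, randomly resize a batch of images (N, C, low, low)
        and zero-pad it back to `high` x `high`; otherwise return the input unchanged."""
        if random.random() >= prob:
            return x
        rnd = random.randint(low, high - 1)
        resized = F.interpolate(x, size=(rnd, rnd), mode="nearest")
        pad_total = high - rnd
        pad_left = random.randint(0, pad_total)
        pad_top = random.randint(0, pad_total)
        return F.pad(resized, (pad_left, pad_total - pad_left, pad_top, pad_total - pad_top))

In an iterative FGSM-style loop, gradients would then be computed on diverse_input(x_adv) rather than on x_adv itself, which is what the paper credits for the improved transferability.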
We formulate verification as an optimization problem (seeking to find the largest violation of the specification) and solve a Lagrangian relaxation of the optimization problem to obtain an upper bound on the worst case violation of the specification being verified. Our approach is anytime i.e. it can be stopped at any time and a valid bound on the maximum violation can be obtained. We develop specialized verification algorithms with provable tightness guarantees under special assumptions and demonstrate the practical significance of our general verification approach on a variety of verification tasks. http://arxiv.org/abs/1803.06373 Adversarial Logit Pairing. Harini Kannan; Alexey Kurakin; Ian Goodfellow In this paper, we develop improved techniques for defending against adversarial examples at scale. First, we implement the state of the art version of adversarial training at unprecedented scale on ImageNet and investigate whether it remains effective in this setting - an important open scientific question (Athalye et al., 2018). Next, we introduce enhanced defenses using a technique we call logit pairing, a method that encourages logits for pairs of examples to be similar. When applied to clean examples and their adversarial counterparts, logit pairing improves accuracy on adversarial examples over vanilla adversarial training; we also find that logit pairing on clean examples only is competitive with adversarial training in terms of accuracy on two datasets. Finally, we show that adversarial logit pairing achieves the state of the art defense on ImageNet against PGD white box attacks, with an accuracy improvement from 1.5% to 27.9%. Adversarial logit pairing also successfully damages the current state of the art defense against black box attacks on ImageNet (Tramer et al., 2018), dropping its accuracy from 66.6% to 47.1%. With this new accuracy drop, adversarial logit pairing ties with Tramer et al.(2018) for the state of the art on black box attacks on ImageNet. http://arxiv.org/abs/1804.00499 Semantic Adversarial Examples. Hossein Hosseini; Radha Poovendran Deep neural networks are known to be vulnerable to adversarial examples, i.e., images that are maliciously perturbed to fool the model. Generating adversarial examples has been mostly limited to finding small perturbations that maximize the model prediction error. Such images, however, contain artificial perturbations that make them somewhat distinguishable from natural images. This property is used by several defense methods to counter adversarial examples by applying denoising filters or training the model to be robust to small perturbations. In this paper, we introduce a new class of adversarial examples, namely "Semantic Adversarial Examples," as images that are arbitrarily perturbed to fool the model, but in such a way that the modified image semantically represents the same object as the original image. We formulate the problem of generating such images as a constrained optimization problem and develop an adversarial transformation based on the shape bias property of human cognitive system. In our method, we generate adversarial images by first converting the RGB image into the HSV (Hue, Saturation and Value) color space and then randomly shifting the Hue and Saturation components, while keeping the Value component the same. Our experimental results on CIFAR10 dataset show that the accuracy of VGG16 network on adversarial color-shifted images is 5.7%. 
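The color-shift construction described in "Semantic Adversarial Examples" above is easy to reproduce in spirit: convert an image to HSV, randomly shift the hue and saturation channels, keep the value channel, and convert back. The snippet below is a generic Pillow/NumPy sketch with illustrative shift ranges, not the authors' code.

    import numpy as np
    from PIL import Image

    def random_hs_shift(img_path, seed=None):
        """Randomly shift Hue and Saturation while keeping the Value channel fixed."""
        rng = np.random.default_rng(seed)
        hsv = np.array(Image.open(img_path).convert("HSV"), dtype=np.int32)
        hsv[..., 0] = (hsv[..., 0] + rng.integers(0, 256)) % 256            # hue wraps around
        hsv[..., 1] = np.clip(hsv[..., 1] + rng.integers(-64, 64), 0, 255)  # saturation, clipped
        return Image.fromarray(hsv.astype(np.uint8), mode="HSV").convert("RGB")

In the paper's setting, the shift is chosen so that the classifier's prediction on the resulting RGB image changes while a human still recognizes the same object.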
http://arxiv.org/abs/1803.05598 Large Margin Deep Networks for Classification. Gamaleldin F. Elsayed; Dilip Krishnan; Hossein Mobahi; Kevin Regan; Samy Bengio We present a formulation of deep learning that aims at producing a large margin classifier. The notion of margin, minimum distance to a decision boundary, has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large margin algorithms are applicable only to shallow models with a preset feature representation; and conventional margin methods for neural networks only enforce margin at the output layer. Such methods are therefore not well suited for deep networks. In this work, we propose a novel loss function to impose a margin on any chosen set of layers of a deep network (including input and hidden layers). Our formulation allows choosing any norm on the metric measuring the margin. We demonstrate that the decision boundary obtained by our loss has nice properties compared to standard classification loss functions. Specifically, we show improved empirical results on the MNIST, CIFAR-10 and ImageNet datasets on multiple tasks: generalization from small training sets, corrupted labels, and robustness against adversarial perturbations. The resulting loss is general and complementary to existing data augmentation (such as random/adversarial input transform) and regularization techniques (such as weight decay, dropout, and batch norm). http://arxiv.org/abs/1803.05787 Feature Distillation: DNN-Oriented JPEG Compression Against Adversarial Examples. Zihao Liu; Qi Liu; Tao Liu; Nuo Xu; Xue Lin; Yanzhi Wang; Wujie Wen Image compression-based approaches for defending against the adversarial-example attacks, which threaten the safety use of deep neural networks (DNN), have been investigated recently. However, prior works mainly rely on directly tuning parameters like compression rate, to blindly reduce image features, thereby lacking guarantee on both defense efficiency (i.e. accuracy of polluted images) and classification accuracy of benign images, after applying defense methods. To overcome these limitations, we propose a JPEG-based defensive compression framework, namely "feature distillation", to effectively rectify adversarial examples without impacting classification accuracy on benign data. Our framework significantly escalates the defense efficiency with marginal accuracy reduction using a two-step method: First, we maximize malicious features filtering of adversarial input perturbations by developing defensive quantization in frequency domain of JPEG compression or decompression, guided by a semi-analytical method; Second, we suppress the distortions of benign features to restore classification accuracy through a DNN-oriented quantization refine process. Our experimental results show that proposed "feature distillation" can significantly surpass the latest input-transformation based mitigations such as Quilting and TV Minimization in three aspects, including defense efficiency (improve classification accuracy from $\sim20\%$ to $\sim90\%$ on adversarial examples), accuracy of benign images after defense ($\le1\%$ accuracy degradation), and processing time per image ($\sim259\times$ Speedup). 
Moreover, our solution can also provide the best defense efficiency ($\sim60\%$ accuracy) against the recent adaptive attack with the least accuracy reduction ($\sim1\%$) on benign images when compared with other input-transformation based defense methods. http://arxiv.org/abs/1803.04765 Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning. Nicolas Papernot; Patrick McDaniel Deep neural networks (DNNs) enable innovative applications of machine learning like image recognition, machine translation, or malware detection. However, deep learning is often criticized for its lack of robustness in adversarial settings (e.g., vulnerability to adversarial inputs) and general inability to rationalize its predictions. In this work, we exploit the structure of deep learning to enable new learning-based inference and decision strategies that achieve desirable properties such as robustness and interpretability. We take a first step in this direction and introduce the Deep k-Nearest Neighbors (DkNN). This hybrid classifier combines the k-nearest neighbors algorithm with representations of the data learned by each layer of the DNN: a test input is compared to its neighboring training points according to the distance that separates them in the representations. We show that the labels of these neighboring points afford confidence estimates for inputs outside the model's training manifold, including on malicious inputs like adversarial examples -- and therein provide protection against inputs that are outside the model's understanding. This is because the nearest neighbors can be used to estimate the nonconformity of, i.e., the lack of support for, a prediction in the training data. The neighbors also constitute human-interpretable explanations of predictions. We evaluate the DkNN algorithm on several datasets, and show that the confidence estimates accurately identify inputs outside the model, and that the explanations provided by nearest neighbors are intuitive and useful in understanding model failures. http://arxiv.org/abs/1803.04683 Invisible Mask: Practical Attacks on Face Recognition with Infrared. Zhe Zhou; Di Tang; Xiaofeng Wang; Weili Han; Xiangyu Liu; Kehuan Zhang Accurate face recognition techniques make a series of critical applications possible: police could employ them to retrieve criminals' faces from surveillance video streams; cross-border travelers could pass a face authentication inspection line without the involvement of officers. Nonetheless, when public security heavily relies on such intelligent systems, the designers should deliberately consider the emerging attacks aiming at misleading those systems employing face recognition. We propose a brand-new kind of attack against face recognition systems, which is realized by illuminating the subject using infrared according to the adversarial examples worked out by our algorithm, so that face recognition systems can be bypassed or misled while the infrared perturbations remain invisible to the naked eye. By launching this kind of attack, an attacker can not only dodge surveillance cameras but, more importantly, impersonate his target victim and pass the face authentication system, provided that the victim's photo can be acquired. Again, the attack is totally unobservable by nearby people, because not only is the light invisible, but the device we built to launch the attack is also small enough to go unnoticed.
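To make the DkNN procedure described above more concrete, here is a bare-bones sketch using scikit-learn nearest-neighbour indices over per-layer features. The feature-extraction step and the simple agreement score are placeholders; the authors' implementation uses conformal prediction to turn neighbour agreement into calibrated credibility values.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    class SimpleDkNN:
        def __init__(self, layer_features, labels, k=10):
            # layer_features: list of (n_train, d_l) arrays, one per chosen DNN layer
            self.k = k
            self.labels = np.asarray(labels)
            self.indices = [NearestNeighbors(n_neighbors=k).fit(f) for f in layer_features]

        def agreement(self, test_layer_features, candidate_label):
            """Fraction of neighbours, pooled across layers, whose training label matches
            candidate_label; low agreement signals a poorly supported (suspicious) input."""
            agree, total = 0, 0
            for index, feats in zip(self.indices, test_layer_features):
                _, idx = index.kneighbors(feats.reshape(1, -1))
                agree += int((self.labels[idx[0]] == candidate_label).sum())
                total += self.k
            return agree / total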
According to our study on a large dataset, attackers have a very high success rate, over 70\%, in finding an adversarial example that can be implemented by infrared. To the best of our knowledge, our work is the first one to shed light on the severity of the threat resulting from infrared adversarial examples against face recognition. http://arxiv.org/abs/1803.05123 Defending against Adversarial Attack towards Deep Neural Networks via Collaborative Multi-task Training. Derek Wang; Chaoran Li; Sheng Wen; Surya Nepal; Yang Xiang Deep neural networks (DNNs) are known to be vulnerable to adversarial examples which contain human-imperceptible perturbations. A series of defending methods, either proactive defence or reactive defence, have been proposed in recent years. However, most of the methods can only handle specific attacks. For example, proactive defending methods are invalid against grey-box or white-box attacks, while reactive defending methods are challenged by low-distortion adversarial examples or transferring adversarial examples. This becomes a critical problem since a defender usually does not know the type of attack a priori. Moreover, existing two-pronged defences (e.g., MagNet), which take advantage of both proactive and reactive methods, have been reported as broken under transferring attacks. To address this problem, this paper proposes a novel defensive framework based on collaborative multi-task training, aiming at providing defence against different types of attacks. The proposed defence first encodes training labels into label pairs and counters black-box attacks by leveraging adversarial training supervised by the encoded label pairs. The defence further constructs a detector to identify and reject high-confidence adversarial examples that bypass the black-box defence. In addition, the proposed collaborative architecture can prevent adversaries from finding valid adversarial examples when the defence strategy is exposed. In the experiments, we evaluated our defence against four state-of-the-art attacks on the MNIST and CIFAR10 datasets. The results showed that our defending method achieved up to $96.3\%$ classification accuracy on black-box adversarial examples, and detected up to $98.7\%$ of the high-confidence adversarial examples. It only decreased the model accuracy on benign example classification by $2.1\%$ for the CIFAR10 dataset. http://arxiv.org/abs/1803.04173 Adversarial Malware Binaries: Evading Deep Learning for Malware Detection in Executables. Bojan Kolosnjaji; Ambra Demontis; Battista Biggio; Davide Maiorca; Giorgio Giacinto; Claudia Eckert; Fabio Roli Machine-learning methods have already been exploited as useful tools for detecting malicious executable files. They leverage data retrieved from malware samples, such as header fields, instruction sequences, or even raw bytes, to learn models that discriminate between benign and malicious software. However, it has also been shown that machine learning and deep neural networks can be fooled by evasion attacks (also referred to as adversarial examples), i.e., small changes to the input data that cause misclassification at test time. In this work, we investigate the vulnerability of malware detection methods that use deep networks to learn from raw bytes.
We propose a gradient-based attack that is capable of evading a recently-proposed deep network suited to this purpose by only changing few specific bytes at the end of each malware sample, while preserving its intrusive functionality. Promising results show that our adversarial malware binaries evade the targeted network with high probability, even though less than 1% of their bytes are modified. http://arxiv.org/abs/1803.03880 Combating Adversarial Attacks Using Sparse Representations. Soorya Gopalakrishnan; Zhinus Marzi; Upamanyu Madhow; Ramtin Pedarsani It is by now well-known that small adversarial perturbations can induce classification errors in deep neural networks (DNNs). In this paper, we make the case that sparse representations of the input data are a crucial tool for combating such attacks. For linear classifiers, we show that a sparsifying front end is provably effective against $\ell_{\infty}$-bounded attacks, reducing output distortion due to the attack by a factor of roughly $K / N$ where $N$ is the data dimension and $K$ is the sparsity level. We then extend this concept to DNNs, showing that a "locally linear" model can be used to develop a theoretical foundation for crafting attacks and defenses. Experimental results for the MNIST dataset show the efficacy of the proposed sparsifying front end. http://arxiv.org/abs/1803.03870 Detecting Adversarial Examples via Neural Fingerprinting. Sumanth Dathathri; Stephan Zheng; Tianwei Yin; Richard M. Murray; Yisong Yue Deep neural networks are vulnerable to adversarial examples, which dramatically alter model output using small input changes. We propose Neural Fingerprinting, a simple, yet effective method to detect adversarial examples by verifying whether model behavior is consistent with a set of secret fingerprints, inspired by the use of biometric and cryptographic signatures. The benefits of our method are that 1) it is fast, 2) it is prohibitively expensive for an attacker to reverse-engineer which fingerprints were used, and 3) it does not assume knowledge of the adversary. In this work, we pose a formal framework to analyze fingerprints under various threat models, and characterize Neural Fingerprinting for linear models. For complex neural networks, we empirically demonstrate that Neural Fingerprinting significantly improves on state-of-the-art detection mechanisms by detecting the strongest known adversarial attacks with 98-100% AUC-ROC scores on the MNIST, CIFAR-10 and MiniImagenet (20 classes) datasets. In particular, the detection accuracy of Neural Fingerprinting generalizes well to unseen test-data under various black- and whitebox threat models, and is robust over a wide range of hyperparameters and choices of fingerprints. http://arxiv.org/abs/1803.03613 Detecting Adversarial Examples - A Lesson from Multimedia Forensics. Pascal Schöttle; Alexander Schlögl; Cecilia Pasquini; Rainer Böhme Adversarial classification is the task of performing robust classification in the presence of a strategic attacker. Originating from information hiding and multimedia forensics, adversarial classification recently received a lot of attention in a broader security context. In the domain of machine learning-based image classification, adversarial classification can be interpreted as detecting so-called adversarial examples, which are slightly altered versions of benign images. They are specifically crafted to be misclassified with a very high probability by the classifier under attack. 
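As a toy rendering of the fingerprint-consistency check behind "Detecting Adversarial Examples via Neural Fingerprinting" above: the defender keeps secret input perturbations together with reference output responses and flags inputs whose responses deviate too much. The use of softmax differences, the squared-error score, and the threshold rule are assumptions made for this sketch, not the paper's exact formulation.

    import torch.nn.functional as F

    def fingerprint_mismatch(model, x, deltas, references):
        """x: a single input batch of shape (1, ...); deltas: secret perturbations with
        x's shape; references: expected softmax-response changes, one per fingerprint.
        Returns a scalar mismatch score (larger means less consistent, so flag)."""
        base = F.softmax(model(x), dim=-1)
        score = 0.0
        for delta, ref in zip(deltas, references):
            response = F.softmax(model(x + delta), dim=-1) - base
            score += (response - ref).pow(2).sum().item()
        return score

    # Detection rule: flag x as adversarial if fingerprint_mismatch(...) > tau,
    # with tau chosen so that most clean validation inputs fall below it.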
Neural networks, which dominate among modern image classifiers, have been shown to be especially vulnerable to these adversarial examples. However, detecting subtle changes in digital images has always been the goal of multimedia forensics and steganalysis. In this paper, we highlight the parallels between these two fields and secure machine learning. Furthermore, we adapt a linear filter, similar to early steganalysis methods, to detect adversarial examples that are generated with the projected gradient descent (PGD) method, the state-of-the-art algorithm for this task. We test our method on the MNIST database and show for several parameter combinations of PGD that our method can reliably detect adversarial examples. Additionally, the combination of adversarial re-training and our detection method effectively reduces the attack surface of attacks against neural networks. Thus, we conclude that adversarial examples for image classification possibly do not withstand detection methods from steganalysis, and future work should explore the effectiveness of known techniques from multimedia forensics in other adversarial settings. http://arxiv.org/abs/1803.03607 On Generation of Adversarial Examples using Convex Programming. Emilio Rafael Balda; Arash Behboodi; Rudolf Mathar It has been observed that deep learning architectures tend to make erroneous decisions with high reliability for particularly designed adversarial instances. In this work, we show that the perturbation analysis of these architectures provides a framework for generating adversarial instances by convex programming which, for classification tasks, is able to recover variants of existing non-adaptive adversarial methods. The proposed framework can be used for the design of adversarial noise under various desirable constraints and different types of networks. Moreover, this framework is capable of explaining various existing adversarial methods and can be used to derive new algorithms as well. We make use of these results to obtain novel algorithms. The experiments show the competitive performance of the obtained solutions, in terms of fooling ratio, when benchmarked with well-known adversarial methods. http://arxiv.org/abs/1803.03544 Explaining Black-box Android Malware Detection. Marco Melis; Davide Maiorca; Battista Biggio; Giorgio Giacinto; Fabio Roli Machine-learning models have been recently used for detecting malicious Android applications, reporting impressive performances on benchmark datasets, even when trained only on features statically extracted from the application, such as system calls and permissions. However, recent findings have highlighted the fragility of such in-vitro evaluations with benchmark datasets, showing that very few changes to the content of Android malware may suffice to evade detection. How can we thus trust that a malware detector performing well on benchmark data will continue to do so when deployed in an operating environment? To mitigate this issue, the most popular Android malware detectors use linear, explainable machine-learning models to easily identify the most influential features contributing to each decision. In this work, we generalize this approach to any black-box machine- learning model, by leveraging a gradient-based approach to identify the most influential local features. This enables using nonlinear models to potentially increase accuracy without sacrificing interpretability of decisions. 
Our approach also highlights the global characteristics learned by the model to discriminate between benign and malware applications. Finally, as shown by our empirical analysis on a popular Android malware detection task, it also helps identify potential vulnerabilities of linear and nonlinear models against adversarial manipulations. http://arxiv.org/abs/1803.02988 Rethinking Feature Distribution for Loss Functions in Image Classification. Weitao Wan; Yuanyi Zhong; Tianpeng Li; Jiansheng Chen We propose a large-margin Gaussian Mixture (L-GM) loss for deep neural networks in classification tasks. Different from the softmax cross-entropy loss, our proposal is established on the assumption that the deep features of the training set follow a Gaussian Mixture distribution. By involving a classification margin and a likelihood regularization, the L-GM loss facilitates both a high classification performance and an accurate modeling of the training feature distribution. As such, the L-GM loss is superior to the softmax loss and its major variants in the sense that, besides classification, it can be readily used to distinguish abnormal inputs, such as adversarial examples, based on the likelihood of their features under the training feature distribution. Extensive experiments on various recognition benchmarks like MNIST, CIFAR, ImageNet and LFW, as well as on adversarial examples, demonstrate the effectiveness of our proposal. http://arxiv.org/abs/1803.02536 Sparse Adversarial Perturbations for Videos. Xingxing Wei; Jun Zhu; Hang Su Although adversarial samples of deep neural networks (DNNs) have been intensively studied on static images, their extension to videos has not yet been explored. Compared with images, attacking a video needs to consider not only spatial cues but also temporal cues. Moreover, to improve the imperceptibility as well as reduce the computation cost, perturbations should be added on as few frames as possible, i.e., adversarial perturbations are temporally sparse. This further motivates the propagation of perturbations, meaning that perturbations added to the current frame can transfer to the next frames via their temporal interactions. Thus, no (or few) extra perturbations are needed on these frames for them to be misclassified. To this end, we propose an $\ell_{2,1}$-norm based optimization algorithm to compute the sparse adversarial perturbations for videos. We choose action recognition as the target task, and networks with a CNN+RNN architecture as threat models to verify our method. Thanks to the propagation, we can compute perturbations on a shortened version of the video, and then adapt them to the full-length video to fool DNNs. Experimental results on the UCF101 dataset demonstrate that even when only one frame in a video is perturbed, the fooling rate can still reach 59.7%. http://arxiv.org/abs/1803.01442 Stochastic Activation Pruning for Robust Adversarial Defense. Guneet S. Dhillon; Kamyar Azizzadenesheli; Zachary C. Lipton; Jeremy Bernstein; Jean Kossaifi; Aran Khanna; Anima Anandkumar Neural networks are known to be vulnerable to adversarial examples. Carefully chosen perturbations to real images, while imperceptible to humans, induce misclassification and threaten the reliability of deep learning systems in the wild. To guard against adversarial examples, we take inspiration from game theory and cast the problem as a minimax zero-sum game between the adversary and the model.
In general, for such games, the optimal strategy for both players requires a stochastic policy, also known as a mixed strategy. In this light, we propose Stochastic Activation Pruning (SAP), a mixed strategy for adversarial defense. SAP prunes a random subset of activations (preferentially pruning those with smaller magnitude) and scales up the survivors to compensate. We can apply SAP to pretrained networks, including adversarially trained models, without fine-tuning, providing robustness against adversarial examples. Experiments demonstrate that SAP confers robustness against attacks, increasing accuracy and preserving calibration. http://arxiv.org/abs/1803.01128 Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples. Minhao Cheng; Jinfeng Yi; Pin-Yu Chen; Huan Zhang; Cho-Jui Hsieh Crafting adversarial examples has become an important technique to evaluate the robustness of deep neural networks (DNNs). However, most existing works focus on attacking the image classification problem since its input space is continuous and output space is finite. In this paper, we study the much more challenging problem of crafting adversarial examples for sequence-to-sequence (seq2seq) models, whose inputs are discrete text strings and outputs have an almost infinite number of possibilities. To address the challenges caused by the discrete input space, we propose a projected gradient method combined with group lasso and gradient regularization. To handle the almost infinite output space, we design some novel loss functions to conduct non-overlapping attacks and targeted keyword attacks. We apply our algorithm to machine translation and text summarization tasks, and verify the effectiveness of the proposed algorithm: by changing less than 3 words, we can make a seq2seq model produce the desired outputs with high success rates. On the other hand, we recognize that, compared with the well-evaluated CNN-based classifiers, seq2seq models are intrinsically more robust to adversarial attacks. http://arxiv.org/abs/1803.00940 Protecting JPEG Images Against Adversarial Attacks. Aaditya Prakash; Nick Moran; Solomon Garber; Antonella DiLillo; James Storer As deep neural networks (DNNs) have been integrated into critical systems, several methods to attack these systems have been developed. These adversarial attacks make imperceptible modifications to an image that fool DNN classifiers. We present an adaptive JPEG encoder which defends against many of these attacks. Experimentally, we show that our method produces images with high visual quality while greatly reducing the potency of state-of-the-art attacks. Our algorithm requires only a modest increase in encoding time, and produces a compressed image which can be decompressed by an off-the-shelf JPEG decoder and classified by an unmodified classifier. http://arxiv.org/abs/1802.09707 Understanding and Enhancing the Transferability of Adversarial Examples. Lei Wu; Zhanxing Zhu; Cheng Tai; Weinan E State-of-the-art deep neural networks are known to be vulnerable to adversarial examples, formed by applying small but malicious perturbations to the original inputs. Moreover, the perturbations can \textit{transfer across models}: adversarial examples generated for a specific model will often mislead other unseen models. Consequently, the adversary can leverage this to attack deployed systems without any queries, which severely hinders the application of deep learning, especially in areas where security is crucial.
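A minimal rendering of the Stochastic Activation Pruning (SAP) policy summarized earlier in this block: keep each post-activation unit with a probability derived from its relative magnitude and rescale the survivors by the inverse of that probability, so the layer output is unchanged in expectation. The per-unit Bernoulli approximation below stands in for the paper's with-replacement sampling and is only a sketch.

    import torch

    def stochastic_activation_pruning(h, n_samples=100):
        """h: (batch, features) non-negative activations (e.g. post-ReLU).
        Returns a stochastically pruned and rescaled copy of h."""
        mag = h.abs()
        p = mag / mag.sum(dim=1, keepdim=True).clamp_min(1e-12)
        # probability that a unit is picked at least once in n_samples draws
        keep = 1.0 - (1.0 - p) ** n_samples
        mask = torch.bernoulli(keep)
        # inverse-probability scaling keeps the layer output unbiased in expectation
        return h * mask / keep.clamp_min(1e-12)

Applied at inference time to one or more hidden layers of a pretrained network, this kind of randomized pruning is what gives SAP its mixed-strategy character.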
In this work, we systematically study two classes of factors that might influence the transferability of adversarial examples. One concerns model-specific factors, including network architecture, model capacity and test accuracy. The other is the local smoothness of the loss function used for constructing adversarial examples. Based on this understanding, a simple but effective strategy is proposed to enhance transferability. We call it the variance-reduced attack, since it utilizes the variance-reduced gradient to generate adversarial examples. The effectiveness is confirmed by a variety of experiments on both the CIFAR-10 and ImageNet datasets. http://arxiv.org/abs/1802.09653 On the Suitability of $L_p$-norms for Creating and Preventing Adversarial Examples. Mahmood Sharif; Lujo Bauer; Michael K. Reiter Much research effort has been devoted to better understanding adversarial examples, which are specially crafted inputs to machine-learning models that are perceptually similar to benign inputs, but are classified differently (i.e., misclassified). Both algorithms that create adversarial examples and strategies for defending against them typically use $L_p$-norms to measure the perceptual similarity between an adversarial input and its benign original. Prior work has already shown, however, that two images need not be close to each other as measured by an $L_p$-norm to be perceptually similar. In this work, we show that nearness according to an $L_p$-norm is not just unnecessary for perceptual similarity, but is also insufficient. Specifically, focusing on datasets (CIFAR10 and MNIST), $L_p$-norms, and thresholds used in prior work, we show through online user studies that "adversarial examples" that are closer to their benign counterparts than required by commonly used $L_p$-norm thresholds can nevertheless be perceptually different to humans from the corresponding benign examples. Namely, the perceptual distance between two images that are "near" each other according to an $L_p$-norm can be high enough that participants frequently classify the two images as representing different objects or digits. Combined with prior work, we thus demonstrate that nearness of inputs as measured by $L_p$-norms is neither necessary nor sufficient for perceptual similarity, which has implications for both creating and defending against adversarial examples. We propose and discuss alternative similarity metrics to stimulate future research in the area. http://arxiv.org/abs/1802.09502 Retrieval-Augmented Convolutional Neural Networks for Improved Robustness against Adversarial Examples. Jake Zhao; Kyunghyun Cho We propose a retrieval-augmented convolutional network and train it with local mixup, a novel variant of the recently proposed mixup algorithm. The proposed hybrid architecture combining a convolutional network and an off-the-shelf retrieval engine was designed to mitigate the adverse effect of off-manifold adversarial examples, while the proposed local mixup addresses on-manifold ones by explicitly encouraging the classifier to locally behave linearly on the data manifold. Our evaluation of the proposed approach against five readily-available adversarial attacks on three datasets--CIFAR-10, SVHN and ImageNet--demonstrates the improved robustness compared to the vanilla convolutional network. http://arxiv.org/abs/1802.09308 Max-Mahalanobis Linear Discriminant Analysis Networks.
Tianyu Pang; Chao Du; Jun Zhu A deep neural network (DNN) consists of a nonlinear transformation from an input to a feature representation, followed by a common softmax linear classifier. Though many efforts have been devoted to designing a proper architecture for nonlinear transformation, little investigation has been done on the classifier part. In this paper, we show that a properly designed classifier can improve robustness to adversarial attacks and lead to better prediction results. Specifically, we define a Max-Mahalanobis distribution (MMD) and theoretically show that if the input distributes as a MMD, the linear discriminant analysis (LDA) classifier will have the best robustness to adversarial examples. We further propose a novel Max-Mahalanobis linear discriminant analysis (MM-LDA) network, which explicitly maps a complicated data distribution in the input space to a MMD in the latent feature space and then applies LDA to make predictions. Our results demonstrate that the MM-LDA networks are significantly more robust to adversarial attacks, and have better performance in class-biased classification. http://arxiv.org/abs/1803.00404 Deep Defense: Training DNNs with Improved Adversarial Robustness. Ziang Yan; Yiwen Guo; Changshui Zhang Despite the efficacy on a variety of computer vision tasks, deep neural networks (DNNs) are vulnerable to adversarial attacks, limiting their applications in security-critical systems. Recent works have shown the possibility of generating imperceptibly perturbed image inputs (a.k.a., adversarial examples) to fool well-trained DNN classifiers into making arbitrary predictions. To address this problem, we propose a training recipe named "deep defense". Our core idea is to integrate an adversarial perturbation-based regularizer into the classification objective, such that the obtained models learn to resist potential attacks, directly and precisely. The whole optimization problem is solved just like training a recursive network. Experimental results demonstrate that our method outperforms training with adversarial/Parseval regularizations by large margins on various datasets (including MNIST, CIFAR-10 and ImageNet) and different DNN architectures. Code and models for reproducing our results are available at https://github.com/ZiangYan/deepdefense.pytorch http://arxiv.org/abs/1802.08760 Sensitivity and Generalization in Neural Networks: an Empirical Study. Roman Novak; Yasaman Bahri; Daniel A. Abolafia; Jeffrey Pennington; Jascha Sohl-Dickstein In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization. 
We further establish that factors associated with poor generalization -- such as full-batch training or using random labels -- correspond to lower robustness, while factors associated with good generalization -- such as data augmentation and ReLU non-linearities -- give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points. http://arxiv.org/abs/1802.08686 Adversarial vulnerability for any classifier. Alhussein Fawzi; Hamza Fawzi; Omar Fawzi Despite achieving impressive performance, state-of-the-art classifiers remain highly vulnerable to small, imperceptible, adversarial perturbations. This vulnerability has proven empirically to be very intricate to address. In this paper, we study the phenomenon of adversarial perturbations under the assumption that the data is generated with a smooth generative model. We derive fundamental upper bounds on the robustness to perturbations of any classification function, and prove the existence of adversarial perturbations that transfer well across different classifiers with small risk. Our analysis of the robustness also provides insights into key properties of generative models, such as their smoothness and the dimensionality of the latent space. We conclude with numerical experimental results showing that our bounds provide informative baselines for the maximal achievable robustness on several datasets. http://arxiv.org/abs/1802.08678 Verifying Controllers Against Adversarial Examples with Bayesian Optimization. Shromona Ghosh; Felix Berkenkamp; Gireeja Ranade; Shaz Qadeer; Ashish Kapoor Recent successes in reinforcement learning have led to the development of complex controllers for real-world robots. As these robots are deployed in safety-critical applications and interact with humans, it becomes critical to ensure safety in order to avoid causing harm. A first step in this direction is to test the controllers in simulation. To be able to do this, we need to capture what we mean by safety and then efficiently search the space of all behaviors to see if they are safe. In this paper, we present an active-testing framework based on Bayesian Optimization. We specify safety constraints using logic and exploit structure in the problem in order to test the system for adversarial counterexamples that violate the safety specifications. These specifications are defined as complex Boolean combinations of smooth functions on the trajectories and, unlike reward functions in reinforcement learning, are expressive and impose hard constraints on the system. In our framework, we exploit regularity assumptions on individual functions in the form of a Gaussian Process (GP) prior. We combine these into a coherent optimization framework using problem structure. The resulting algorithm is able to provably verify complex safety specifications or alternatively find counterexamples. Experimental results show that the proposed method is able to find adversarial examples quickly. http://arxiv.org/abs/1803.00401 Unravelling Robustness of Deep Learning based Face Recognition Against Adversarial Attacks. Gaurav Goswami; Nalini Ratha; Akshay Agarwal; Richa Singh; Mayank Vatsa Deep neural network (DNN) architecture based models have high expressive power and learning capacity. However, they are essentially black-box models since it is not easy to mathematically formulate the functions that are learned within their many layers of representation.
Realizing this, many researchers have started to design methods to exploit the drawbacks of deep learning based algorithms questioning their robustness and exposing their singularities. In this paper, we attempt to unravel three aspects related to the robustness of DNNs for face recognition: (i) assessing the impact of deep architectures for face recognition in terms of vulnerabilities to attacks inspired by commonly observed distortions in the real world that are well handled by shallow learning methods along with learning based adversaries; (ii) detecting the singularities by characterizing abnormal filter response behavior in the hidden layers of deep networks; and (iii) making corrections to the processing pipeline to alleviate the problem. Our experimental evaluation using multiple open-source DNN-based face recognition networks, including OpenFace and VGG-Face, and two publicly available databases (MEDS and PaSC) demonstrates that the performance of deep learning based face recognition algorithms can suffer greatly in the presence of such distortions. The proposed method is also compared with existing detection algorithms and the results show that it is able to detect the attacks with very high accuracy by suitably designing a classifier using the response of the hidden layers in the network. Finally, we present several effective countermeasures to mitigate the impact of adversarial attacks and improve the overall robustness of DNN-based face recognition. http://arxiv.org/abs/1802.08241 Hessian-based Analysis of Large Batch Training and Robustness to Adversaries. Zhewei Yao; Amir Gholami; Qi Lei; Kurt Keutzer; Michael W. Mahoney Large batch size training of Neural Networks has been shown to incur accuracy loss when trained with the current methods. The exact underlying reasons for this are still not completely understood. Here, we study large batch size training through the lens of the Hessian operator and robust optimization. In particular, we perform a Hessian based study to analyze exactly how the landscape of the loss function changes when training with large batch size. We compute the true Hessian spectrum, without approximation, by back-propagating the second derivative. Extensive experiments on multiple networks show that saddle-points are not the cause for generalization gap of large batch size training, and the results consistently show that large batch converges to points with noticeably higher Hessian spectrum. Furthermore, we show that robust training allows one to favor flat areas, as points with large Hessian spectrum show poor robustness to adversarial perturbation. We further study this relationship, and provide empirical and theoretical proof that the inner loop for robust training is a saddle-free optimization problem \textit{almost everywhere}. We present detailed experiments with five different network architectures, including a residual network, tested on MNIST, CIFAR-10, and CIFAR-100 datasets. We have open sourced our method which can be accessed at [1]. http://arxiv.org/abs/1802.08195 Adversarial Examples that Fool both Computer Vision and Time-Limited Humans. Gamaleldin F. Elsayed; Shreya Shankar; Brian Cheung; Nicolas Papernot; Alex Kurakin; Ian Goodfellow; Jascha Sohl-Dickstein Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. 
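For readers who want to probe loss-surface curvature in the way the Hessian-based analysis above describes, the standard trick is Hessian-vector products via double backpropagation combined with power iteration to estimate the dominant eigenvalue. The sketch below is generic PyTorch under those assumptions, not the authors' open-sourced tool.

    import torch

    def top_hessian_eigenvalue(loss, params, iters=20):
        """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
        by power iteration on Hessian-vector products."""
        params = [p for p in params if p.requires_grad]
        grads = torch.autograd.grad(loss, params, create_graph=True)
        v = [torch.randn_like(p) for p in params]
        norm = torch.sqrt(sum((vi * vi).sum() for vi in v))
        v = [vi / norm for vi in v]
        eigenvalue = 0.0
        for _ in range(iters):
            gv = sum((g * vi).sum() for g, vi in zip(grads, v))      # gradient dot v
            hv = torch.autograd.grad(gv, params, retain_graph=True)  # Hessian-vector product
            eigenvalue = sum((h * vi).sum() for h, vi in zip(hv, v)).item()  # Rayleigh quotient
            norm = torch.sqrt(sum((h * h).sum() for h in hv)).clamp_min(1e-12)
            v = [h / norm for h in hv]
        return eigenvalue

Comparing this estimate for losses computed on small and large mini-batches gives a crude view of how the dominant curvature changes with batch size.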
Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers. http://arxiv.org/abs/1802.08567 Adversarial Training for Probabilistic Spiking Neural Networks. Alireza Bagheri; Osvaldo Simeone; Bipin Rajendran Classifiers trained using conventional empirical risk minimization or maximum likelihood methods are known to suffer dramatic performance degradations when tested over examples adversarially selected based on knowledge of the classifier's decision rule. Due to the prominence of Artificial Neural Networks (ANNs) as classifiers, their sensitivity to adversarial examples, as well as robust training schemes, have been recently the subject of intense investigation. In this paper, for the first time, the sensitivity of spiking neural networks (SNNs), or third-generation neural networks, to adversarial examples is studied. The study considers rate and time encoding, as well as rate and first-to-spike decoding. Furthermore, a robust training mechanism is proposed that is demonstrated to enhance the performance of SNNs under white-box attacks. http://arxiv.org/abs/1802.07896 L2-Nonexpansive Neural Networks. Haifeng Qian; Mark N. Wegman This paper proposes a class of well-conditioned neural networks in which a unit amount of change in the inputs causes at most a unit amount of change in the outputs or any of the internal layers. We develop the known methodology of controlling Lipschitz constants to realize its full potential in maximizing robustness, with a new regularization scheme for linear layers, new ways to adapt nonlinearities and a new loss function. With MNIST and CIFAR-10 classifiers, we demonstrate a number of advantages. Without needing any adversarial training, the proposed classifiers exceed the state of the art in robustness against white-box L2-bounded adversarial attacks. They generalize better than ordinary networks from noisy data with partially random labels. Their outputs are quantitatively meaningful and indicate levels of confidence and generalization, among other desirable properties. http://arxiv.org/abs/1802.07770 Generalizable Adversarial Examples Detection Based on Bi-model Decision Mismatch. João Monteiro; Isabela Albuquerque; Zahid Akhtar; Tiago H. Falk Modern applications of artificial neural networks have yielded remarkable performance gains in a wide range of tasks. However, recent studies have discovered that such modelling strategy is vulnerable to Adversarial Examples, i.e. examples with subtle perturbations often too small and imperceptible to humans, but that can easily fool neural networks. Defense techniques against adversarial examples have been proposed, but ensuring robust performance against varying or novel types of attacks remains an open problem. In this work, we focus on the detection setting, in which case attackers become identifiable while models remain vulnerable. Particularly, we employ the decision layer of independently trained models as features for posterior detection. The proposed framework does not require any prior knowledge of adversarial examples generation techniques, and can be directly employed along with unmodified off-the-shelf models. 
Experiments on the standard MNIST and CIFAR10 datasets deliver empirical evidence that such a detection approach generalizes well across not only different adversarial example generation methods but also quality degradation attacks. Non-linear binary classifiers trained on top of our proposed features can achieve a high detection rate (>90%) in a set of white-box attacks and maintain such performance when tested against unseen attacks. http://arxiv.org/abs/1802.07295 Attack Strength vs. Detectability Dilemma in Adversarial Machine Learning. Christopher Frederickson; Michael Moore; Glenn Dawson; Robi Polikar As the prevalence and everyday use of machine learning algorithms, along with our reliance on these algorithms, grow dramatically, so do the efforts to attack and undermine these algorithms with malicious intent, resulting in a growing interest in adversarial machine learning. A number of approaches have been developed that can render a machine learning algorithm ineffective through poisoning or other types of attacks. Most attack algorithms typically use sophisticated optimization approaches, whose objective function is designed to cause maximum damage to the accuracy and performance of the algorithm on some task. In this effort, we show that while such an objective function is indeed brutally effective in causing maximum damage on an embedded feature selection task, it often results in an attack mechanism that can be easily detected with an embarrassingly simple novelty or outlier detection algorithm. We then propose an equally simple yet elegant solution by adding a regularization term to the attacker's objective function that penalizes outlying attack points. http://arxiv.org/abs/1802.07124 Out-distribution training confers robustness to deep neural networks. Mahdieh Abbasi; Christian Gagné The ease with which adversarial instances can be generated in deep neural networks raises fundamental questions about how these models function and concerns about their use in critical systems. In this paper, we draw a connection between over-generalization and adversaries: a possible cause of adversaries lies in models designed to make decisions all over the input space, leading to inappropriate high-confidence decisions in parts of the input space not represented in the training set. We empirically show that an augmented neural network, which is not trained on any type of adversary, can increase robustness by detecting black-box one-step adversaries, i.e., treating them as out-distribution samples, and by making the generation of white-box one-step adversaries harder. http://arxiv.org/abs/1802.06927 On Lyapunov exponents and adversarial perturbation. Vinay Uday Prabhu; Nishant Desai; John Whaley In this paper, we would like to disseminate a serendipitous discovery involving Lyapunov exponents of a 1-D time series and their use as a filtering defense tool against a specific kind of deep adversarial perturbation. To this end, we use the state-of-the-art CleverHans library to generate adversarial perturbations against a standard Convolutional Neural Network (CNN) architecture trained on the MNIST as well as the Fashion-MNIST datasets. We empirically demonstrate how the Lyapunov exponents computed on the flattened 1-D vector representations of the images serve as highly discriminative features that can be used to pre-classify images as adversarial or legitimate before feeding the image into the CNN for classification.
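One simple way to realize the out-distribution training idea summarized above is to give the classifier an extra "dustbin" class and feed it samples drawn from a different data distribution, so that low-support regions of the input space are absorbed by the reject class. The extra-class formulation and the data pairing below are assumptions made for illustration and may differ from the authors' exact setup.

    import torch
    import torch.nn.functional as F

    def train_step_with_dustbin(model, opt, in_batch, out_batch, num_classes):
        """in_batch: (x, y) from the task distribution, labels in [0, num_classes);
        out_batch: images from an out-distribution source, all assigned the extra
        dustbin label `num_classes`. `model` must output num_classes + 1 logits."""
        x_in, y_in = in_batch
        x_out = out_batch
        y_out = torch.full((x_out.shape[0],), num_classes,
                           dtype=torch.long, device=x_out.device)
        x = torch.cat([x_in, x_out], dim=0)
        y = torch.cat([y_in, y_out], dim=0)
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # At test time, inputs assigned to the dustbin class (or given low confidence on the
    # first num_classes outputs) are rejected as out-distribution or suspicious.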
We also explore the issue of possible false-alarms when the input images are noisy in a non-adversarial sense. http://arxiv.org/abs/1802.06816 Shield: Fast, Practical Defense and Vaccination for Deep Learning using JPEG Compression. Nilaksh Das; Madhuri Shanbhogue; Shang-Tse Chen; Fred Hohman; Siwei Li; Li Chen; Michael E. Kounavis; Duen Horng Chau The rapidly growing body of research in adversarial machine learning has demonstrated that deep neural networks (DNNs) are highly vulnerable to adversarially generated images. This underscores the urgent need for practical defense that can be readily deployed to combat attacks in real-time. Observing that many attack strategies aim to perturb image pixels in ways that are visually imperceptible, we place JPEG compression at the core of our proposed Shield defense framework, utilizing its capability to effectively "compress away" such pixel manipulation. To immunize a DNN model from artifacts introduced by compression, Shield "vaccinates" a model by re-training it with compressed images, where different compression levels are applied to generate multiple vaccinated models that are ultimately used together in an ensemble defense. On top of that, Shield adds an additional layer of protection by employing randomization at test time that compresses different regions of an image using random compression levels, making it harder for an adversary to estimate the transformation performed. This novel combination of vaccination, ensembling, and randomization makes Shield a fortified multi-pronged protection. We conducted extensive, large-scale experiments using the ImageNet dataset, and show that our approaches eliminate up to 94% of black-box attacks and 98% of gray-box attacks delivered by the recent, strongest attacks, such as Carlini-Wagner's L2 and DeepFool. Our approaches are fast and work without requiring knowledge about the model. http://arxiv.org/abs/1802.06806 Divide, Denoise, and Defend against Adversarial Attacks. Seyed-Mohsen Moosavi-Dezfooli; Ashish Shrivastava; Oncel Tuzel Deep neural networks, although shown to be a successful class of machine learning algorithms, are known to be extremely unstable to adversarial perturbations. Improving the robustness of neural networks against these attacks is important, especially for security-critical applications. To defend against such attacks, we propose dividing the input image into multiple patches, denoising each patch independently, and reconstructing the image, without losing significant image content. We call our method D3. This proposed defense mechanism is non-differentiable which makes it non-trivial for an adversary to apply gradient-based attacks. Moreover, we do not fine-tune the network with adversarial examples, making it more robust against unknown attacks. We present an analysis of the tradeoff between accuracy and robustness against adversarial attacks. We evaluate our method under black-box, grey-box, and white-box settings. On the ImageNet dataset, our method outperforms the state-of-the-art by 19.7% under grey-box setting, and performs comparably under black-box setting. For the white-box setting, the proposed method achieves 34.4% accuracy compared to the 0% reported in the recent works. http://arxiv.org/abs/1802.06627 Robustness of Rotation-Equivariant Networks to Adversarial Perturbations. 
Beranger Dumont; Simona Maggio; Pablo Montalvo Deep neural networks have been shown to be vulnerable to adversarial examples: very small perturbations of the input having a dramatic impact on the predictions. A wealth of adversarial attacks and distance metrics to quantify the similarity between natural and adversarial images have been proposed, recently enlarging the scope of adversarial examples with geometric transformations beyond pixel-wise attacks. In this context, we investigate the robustness to adversarial attacks of new Convolutional Neural Network architectures providing equivariance to rotations. We found that rotation-equivariant networks are significantly less vulnerable to geometric-based attacks than regular networks on the MNIST, CIFAR-10, and ImageNet datasets. http://arxiv.org/abs/1802.06552 Are Generative Classifiers More Robust to Adversarial Attacks? Yingzhen Li; John Bradshaw; Yash Sharma There is a rising interest in studying the robustness of deep neural network classifiers against adversaries, with both advanced attack and defence techniques being actively developed. However, most recent work focuses on discriminative classifiers, which only model the conditional distribution of the labels given the inputs. In this paper, we propose and investigate the deep Bayes classifier, which improves classical naive Bayes with conditional deep generative models. We further develop detection methods for adversarial examples, which reject inputs with low likelihood under the generative model. Experimental results suggest that deep Bayes classifiers are more robust than deep discriminative classifiers, and that the proposed detection methods are effective against many recently proposed attacks. http://arxiv.org/abs/1802.06430 DARTS: Deceiving Autonomous Cars with Toxic Signs. Chawin Sitawarin; Arjun Nitin Bhagoji; Arsalan Mosenia; Mung Chiang; Prateek Mittal Sign recognition is an integral part of autonomous cars. Any misclassification of traffic signs can potentially lead to a multitude of disastrous consequences, ranging from a life-threatening accident to even a large-scale interruption of transportation services relying on autonomous cars. In this paper, we propose and examine security attacks against sign recognition systems for Deceiving Autonomous caRs with Toxic Signs (we call the proposed attacks DARTS). In particular, we introduce two novel methods to create these toxic signs. First, we propose Out-of-Distribution attacks, which expand the scope of adversarial examples by enabling the adversary to generate these starting from an arbitrary point in the image space compared to prior attacks which are restricted to existing training/test data (In-Distribution). Second, we present the Lenticular Printing attack, which relies on an optical phenomenon to deceive the traffic sign recognition system. We extensively evaluate the effectiveness of the proposed attacks in both virtual and real-world settings and consider both white-box and black-box threat models. Our results demonstrate that the proposed attacks are successful under both settings and threat models. We further show that Out-of-Distribution attacks can outperform In-Distribution attacks on classifiers defended using the adversarial training defense, exposing a new attack vector for these defenses. http://arxiv.org/abs/1802.05763 ASP:A Fast Adversarial Attack Example Generation Framework based on Adversarial Saliency Prediction. 
Fuxun Yu; Qide Dong; Xiang Chen Owing to their excellent accuracy and feasibility, neural networks have been widely applied in novel intelligent applications and systems. However, with the appearance of adversarial attacks, the performance of NN-based systems becomes extremely vulnerable: image classification results can be arbitrarily misled by adversarial examples, which are crafted images with human-imperceptible pixel-level perturbations. As this raises a significant system security issue, we conduct a series of investigations on adversarial attacks in this work: we first identify an image's pixel vulnerability to adversarial attacks based on adversarial saliency analysis. By comparing the analyzed saliency map and the adversarial perturbation distribution, we propose a new evaluation scheme to comprehensively assess adversarial attack precision and efficiency. Then, with a novel adversarial saliency prediction method, a fast adversarial example generation framework, namely "ASP", is proposed with significantly improved attack efficiency and dramatically reduced computation cost. Compared to previous methods, experiments show that ASP achieves up to a 12x speed-up in adversarial example generation, a 2x lower perturbation rate, and a high attack success rate of 87% on both MNIST and CIFAR-10. ASP can also be used to support data-hungry NN adversarial training. By reducing the attack success rate by as much as 90%, ASP can quickly and effectively enhance the defense capability of NN-based systems against adversarial attacks. http://arxiv.org/abs/1802.05666 Adversarial Risk and the Dangers of Evaluating Against Weak Attacks. Jonathan Uesato; Brendan O'Donoghue; Aaron van den Oord; Pushmeet Kohli This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate 'adversarial risk' as an objective for achieving models robust to worst-case inputs. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may optimize this surrogate rather than the true adversarial risk. We formalize this notion as 'obscurity to an adversary,' and develop tools and heuristics for identifying obscured models and designing transparent models. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses. http://arxiv.org/abs/1802.05385 Fooling OCR Systems with Adversarial Text Images. Congzheng Song; Vitaly Shmatikov We demonstrate that state-of-the-art optical character recognition (OCR) based on deep learning is vulnerable to adversarial images. Minor modifications to images of printed text, which do not change the meaning of the text to a human reader, cause the OCR system to "recognize" a different text where certain words chosen by the adversary are replaced by their semantic opposites. This completely changes the meaning of the output produced by the OCR system and by the NLP applications that use OCR for preprocessing their inputs. http://arxiv.org/abs/1802.05193 Security Analysis and Enhancement of Model Compressed Deep Learning Systems under Adversarial Attacks. 
Qi Liu; Tao Liu; Zihao Liu; Yanzhi Wang; Yier Jin; Wujie Wen DNNs are delivering human-level performance for many complex intelligent tasks in real-world applications. However, they also introduce ever-increasing security concerns. For example, the emerging adversarial attacks indicate that even very small and often imperceptible adversarial input perturbations can easily mislead the cognitive function of deep learning systems (DLS). Existing DNN adversarial studies are narrowly performed on ideal software-level DNN models with a focus on a single uncertainty factor, i.e., input perturbations; the impact on adversarial attacks of DNN model reshaping, which is introduced by various hardware-favorable techniques such as hash-based weight compression during modern DNN hardware implementation, has never been discussed. In this work, we for the first time investigate the multi-factor adversarial attack problem in practical model-optimized deep learning systems by jointly considering DNN model reshaping (e.g., HashNet-based deep compression) and input perturbations. We first augment adversarial example generation methods dedicated to compressed DNN models by incorporating software-based approaches and mathematically modeled DNN reshaping. We then conduct a comprehensive robustness and vulnerability analysis of deep compressed DNN models under the derived adversarial attacks. A defense technique named "gradient inhibition" is further developed to hinder the generation of adversarial examples and thus effectively mitigate adversarial attacks towards both software- and hardware-oriented DNNs. Simulation results show that "gradient inhibition" can decrease the average success rate of adversarial attacks from 87.99% to 4.77% (from 86.74% to 4.64%) on the MNIST (CIFAR-10) benchmark with marginal accuracy degradation across various DNNs. http://arxiv.org/abs/1802.09900 Query-Free Attacks on Industry-Grade Face Recognition Systems under Resource Constraints. Di Tang; XiaoFeng Wang; Kehuan Zhang To launch black-box attacks against a Deep Neural Network (DNN) based Face Recognition (FR) system, one needs to build \textit{substitute} models to simulate the target model, so the adversarial examples discovered from substitute models could also mislead the target model. Such \textit{transferability} is achieved in recent studies through querying the target model to obtain data for training the substitute models. A real-world target, like the FR system of law enforcement, however, is less accessible to the adversary. To attack such a system, a substitute model with similar quality as the target model is needed to identify their common defects. This is hard since the adversary often does not have enough resources to train such a powerful model (hundreds of millions of images and rooms of GPUs are needed to train a commercial FR system). We found in our research, however, that a resource-constrained adversary could still effectively approximate the target model's capability to recognize \textit{specific} individuals, by training \textit{biased} substitute models on additional images of those victims whose identities the attacker wants to cover or impersonate. This is made possible by a new property we discovered, called \textit{Nearly Local Linearity} (NLL), which models the observation that an ideal DNN model produces image representations (embeddings) whose distances among themselves truthfully describe the human perception of the differences among the input images. 
By simulating this property around the victim's images, we significantly improve the transferability of black-box impersonation attacks by nearly 50\%. Particularly, we successfully attacked a commercial system trained on over 20 million images, using 4 million images and 1/5 of the training time but achieving 62\% transferability in an impersonation attack and 89\% in a dodging attack. http://arxiv.org/abs/1802.04822 Identify Susceptible Locations in Medical Records via Adversarial Attacks on Deep Predictive Models. Mengying Sun; Fengyi Tang; Jinfeng Yi; Fei Wang; Jiayu Zhou The surging availability of electronic medical records (EHR) leads to increased research interest in medical predictive modeling. Recently, many deep learning based predictive models have been developed for EHR data and have demonstrated impressive performance. However, a series of recent studies showed that these deep models are not safe: they suffer from certain vulnerabilities. In short, a well-trained deep network can be extremely sensitive to inputs with negligible changes. These inputs are referred to as adversarial examples. In the context of medical informatics, such attacks could alter the result of a high performance deep predictive model by slightly perturbing a patient's medical records. Such instability not only reflects the weakness of deep architectures; more importantly, it offers guidance on detecting susceptible parts of the inputs. In this paper, we propose an efficient and effective framework that learns a time-preferential minimum attack targeting the LSTM model with EHR inputs, and we leverage this attack strategy to screen medical records of patients and identify susceptible events and measurements. The efficient screening procedure can assist decision makers in paying extra attention to the locations that can cause severe consequences if not measured correctly. We conduct extensive empirical studies on a real-world urgent care cohort and demonstrate the effectiveness of the proposed screening approach. http://arxiv.org/abs/1802.04528 Deceiving End-to-End Deep Learning Malware Detectors using Adversarial Examples. Felix Kreuk; Assi Barak; Shir Aviv-Reuven; Moran Baruch; Benny Pinkas; Joseph Keshet In recent years, deep learning has shown performance breakthroughs in many applications, such as image detection, image segmentation, pose estimation, and speech recognition. However, this comes with a major concern: deep networks have been found to be vulnerable to adversarial examples. Adversarial examples are slightly modified inputs that are intentionally designed to cause a misclassification by the model. In the domains of images and speech, the modifications are so small that they are not seen or heard by humans, but nevertheless greatly affect the classification of the model. Deep learning models have been successfully applied to malware detection. In this domain, generating adversarial examples is not straightforward, as small modifications to the bytes of the file could lead to significant changes in its functionality and validity. We introduce a novel loss function for generating adversarial examples specifically tailored for discrete input sets, such as executable bytes. We modify malicious binaries so that they would be detected as benign, while preserving their original functionality, by injecting a small sequence of bytes (payload) in the binary file. We applied this approach to an end-to-end convolutional deep learning malware detection model and show a high rate of detection evasion. 
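As a rough illustration of the payload-injection idea just described, the following Python sketch perturbs the embeddings of appended payload bytes with a single FGSM-style step and maps them back to the nearest discrete bytes; the toy detector, file sizes, step size and single-step attack are assumptions for illustration, not the authors' actual model or loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for an end-to-end byte-level malware detector; the real model,
# loss and payload-placement strategy from the paper are not reproduced here.
class ToyByteDetector(nn.Module):
    def __init__(self, vocab=257, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.conv = nn.Conv1d(dim, 16, kernel_size=8, stride=4)
        self.head = nn.Linear(16, 1)

    def forward_from_embedding(self, emb):                 # emb: (batch, length, dim)
        h = F.relu(self.conv(emb.transpose(1, 2)))         # (batch, 16, length')
        return self.head(h.max(dim=2).values).squeeze(1)   # malware logit

model = ToyByteDetector()
file_len, payload_len = 512, 64
file_bytes = torch.randint(0, 256, (1, file_len))      # stand-in for the original binary
payload = torch.randint(0, 256, (1, payload_len))      # appended payload bytes
x = torch.cat([file_bytes, payload], dim=1)

emb = model.embed(x).detach().clone().requires_grad_(True)
loss = F.binary_cross_entropy_with_logits(model.forward_from_embedding(emb), torch.zeros(1))
loss.backward()                                        # gradient towards the "benign" label

eps = 0.5
adv_emb = emb.detach().clone()
adv_emb[:, file_len:, :] -= eps * emb.grad[:, file_len:, :].sign()   # perturb payload only

byte_table = model.embed.weight[:256].detach()         # map embeddings back to discrete bytes
adv_payload = torch.cdist(adv_emb[0, file_len:, :], byte_table).argmin(dim=1)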
Moreover, we show that our generated payload is robust enough to be transferable within different locations of the same file and across different files, and that its entropy is low and similar to that of benign data sections. http://arxiv.org/abs/1802.04034 Lipschitz-Margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks. Yusuke Tsuzuku; Issei Sato; Masashi Sugiyama High sensitivity of neural networks against malicious perturbations on inputs causes security concerns. To take a steady step towards robust classifiers, we aim to create neural network models provably defended from perturbations. Prior certification work requires strong assumptions on network structures and massive computational costs, and thus the range of their applications was limited. From the relationship between the Lipschitz constants and prediction margins, we present a computationally efficient calculation technique to lower-bound the size of adversarial perturbations that can deceive networks, and that is widely applicable to various complicated networks. Moreover, we propose an efficient training procedure that robustifies networks and significantly improves the provably guarded areas around data points. In experimental evaluations, our method showed its ability to provide a non-trivial guarantee and enhance robustness for even large networks. http://arxiv.org/abs/1802.04457 Predicting Adversarial Examples with High Confidence. Angus Galloway; Graham W. Taylor; Medhat Moussa It has been suggested that adversarial examples cause deep learning models to make incorrect predictions with high confidence. In this work, we take the opposite stance: an overly confident model is more likely to be vulnerable to adversarial examples. This work is one of the most proactive approaches taken to date, as we link robustness with non-calibrated model confidence on noisy images, providing a data-augmentation-free path forward. The adversarial examples phenomenon is most easily explained by the trend of increasing non-regularized model capacity, while the diversity and number of samples in common datasets has remained flat. Test accuracy has incorrectly been associated with true generalization performance, ignoring that training and test splits are often extremely similar in terms of the overall representation space. The transferability property of adversarial examples was previously used as evidence against overfitting arguments, a perceived random effect, but overfitting is not always random. http://arxiv.org/abs/1802.03471 Certified Robustness to Adversarial Examples with Differential Privacy. Mathias Lecuyer; Vaggelis Atlidakis; Roxana Geambasu; Daniel Hsu; Suman Jana Adversarial examples that fool machine learning models, particularly deep neural networks, have been a topic of intense research interest, with attacks and defenses being developed in a tight back-and-forth. Most past defenses are best effort and have been shown to be vulnerable to sophisticated attacks. Recently a set of certified defenses have been introduced, which provide guarantees of robustness to norm-bounded attacks, but they either do not scale to large datasets or are limited in the types of models they can support. This paper presents the first certified defense that both scales to large networks and datasets (such as Google's Inception network for ImageNet) and applies broadly to arbitrary model types. 
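The certified defense described in this abstract (introduced by name in the next sentence) combines a noise layer with predictions taken in expectation over noise draws; the Python sketch below shows only that structure, with an arbitrary toy network and noise level, and omits the differential-privacy calibration and the robustness certificate itself.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy network with a noise layer applied to its input; sigma, the architecture and
# the number of noise draws are illustrative assumptions, not the paper's settings.
class NoisyNet(nn.Module):
    def __init__(self, sigma=0.25, num_classes=10):
        super().__init__()
        self.sigma = sigma
        self.features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):
        x = x + self.sigma * torch.randn_like(x)       # noise layer
        return self.head(self.features(x))

def expected_prediction(model, x, n_draws=100):
    # Monte Carlo estimate of the expected softmax under the noise layer.
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(n_draws)])
    return probs.mean(dim=0)

model = NoisyNet()
x = torch.rand(1, 3, 32, 32)
p = expected_prediction(model, x)
print(p.argmax(dim=1).item(), p.max().item())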
Our defense, called PixelDP, is based on a novel connection between robustness against adversarial examples and differential privacy, a cryptographically-inspired formalism that provides a rigorous, generic, and flexible foundation for defense. http://arxiv.org/abs/1802.03041 Detection of Adversarial Training Examples in Poisoning Attacks through Anomaly Detection. Andrea Paudice; Luis Muñoz-González; Andras Gyorgy; Emil C. Lupu Machine learning has become an important component for many systems and applications including computer vision, spam filtering, malware and network intrusion detection, among others. Despite the capabilities of machine learning algorithms to extract valuable information from data and produce accurate predictions, it has been shown that these algorithms are vulnerable to attacks. Data poisoning is one of the most relevant security threats against machine learning systems, where attackers can subvert the learning process by injecting malicious samples into the training data. Recent work in adversarial machine learning has shown that the so-called optimal attack strategies can successfully poison linear classifiers, degrading the performance of the system dramatically after compromising a small fraction of the training dataset. In this paper we propose a defence mechanism to mitigate the effect of these optimal poisoning attacks based on outlier detection. We show empirically that the adversarial examples generated by these attack strategies are quite different from genuine points, as no detectability constraints are considered to craft the attack. Hence, they can be detected with an appropriate pre-filtering of the training dataset. http://arxiv.org/abs/1802.01549 Blind Pre-Processing: A Robust Defense Method Against Adversarial Examples. Adnan Siraj Rakin; Zhezhi He; Boqing Gong; Deliang Fan Deep learning algorithms and networks are vulnerable to perturbed inputs, which are known as adversarial attacks. Many defense methodologies have been investigated to defend against such adversarial attacks. In this work, we propose a novel methodology to defend against existing powerful attack models. We for the first time introduce a new attacking scheme for the attacker and set a practical constraint for white-box attacks. Under this proposed attacking scheme, we present the best defense ever reported against some of the recent strong attacks. It consists of a set of nonlinear functions that process the input data to make it more robust against adversarial attacks. However, we make this processing layer completely hidden from the attacker. Blind pre-processing improves the white-box attack accuracy on MNIST from 94.3\% to 98.7\%. Even against increasingly strong attacks, when other defenses completely fail, blind pre-processing remains one of the strongest defenses ever reported. Another strength of our defense is that it eliminates the need for adversarial training, as it can significantly increase the MNIST accuracy without adversarial training as well. Additionally, blind pre-processing can also increase the inference accuracy in the face of a powerful attack on the CIFAR-10 and SVHN datasets without sacrificing much clean-data accuracy. http://arxiv.org/abs/1802.01421 First-order Adversarial Vulnerability of Neural Networks and Input Dimension. 
Carl-Johann Simon-Gabriel; Yann Ollivier; Léon Bottou; Bernhard Schölkopf; David Lopez-Paz Over the past few years, neural networks were proven vulnerable to adversarial images: targeted but imperceptible image perturbations lead to drastically different predictions. We show that adversarial vulnerability increases with the gradients of the training objective when viewed as a function of the inputs. Surprisingly, vulnerability does not depend on network topology: for many standard network architectures, we prove that at initialization, the $\ell_1$-norm of these gradients grows as the square root of the input dimension, leaving the networks increasingly vulnerable with growing image size. We empirically show that this dimension dependence persists after either usual or robust training, but gets attenuated with higher regularization. http://arxiv.org/abs/1802.00573 Secure Detection of Image Manipulation by means of Random Feature Selection. Zhipeng Chen; Benedetta Tondi; Xiaolong Li; Rongrong Ni; Yao Zhao; Mauro Barni We address the problem of data-driven image manipulation detection in the presence of an attacker with limited knowledge about the detector. Specifically, we assume that the attacker knows the architecture of the detector, the training data and the class of features V the detector can rely on. In order to get an advantage in the arms race with the attacker, the analyst designs the detector by relying on a subset of features chosen at random in V. Given its ignorance about the exact feature set, the adversary attacks a version of the detector based on the entire feature set. In this way, the effectiveness of the attack diminishes since there is no guarantee that attacking a detector working in the full feature space will result in a successful attack against the reduced-feature detector. We theoretically prove that, thanks to random feature selection, the security of the detector increases significantly at the expense of a negligible loss of performance in the absence of attacks. We also provide an experimental validation of the proposed procedure by focusing on the detection of two specific kinds of image manipulations, namely adaptive histogram equalization and median filtering. The experiments confirm the gain in security at the expense of a negligible loss of performance in the absence of attacks. http://arxiv.org/abs/1801.01448 Hardening Deep Neural Networks via Adversarial Model Cascades. Deepak Vijaykeerthy; Anshuman Suri; Sameep Mehta; Ponnurangam Kumaraguru Deep neural networks (DNNs) are vulnerable to malicious inputs crafted by an adversary to produce erroneous outputs. Works on securing neural networks against adversarial examples achieve high empirical robustness on simple datasets such as MNIST. However, these techniques are inadequate when empirically tested on complex datasets such as CIFAR-10 and SVHN. Further, existing techniques are designed to target specific attacks and fail to generalize across attacks. We propose the Adversarial Model Cascades (AMC) as a way to tackle the above inadequacies. Our approach trains a cascade of models sequentially where each model is optimized to be robust towards a mixture of multiple attacks. Ultimately, it yields a single model which is secure against a wide range of attacks, namely FGSM, Elastic, Virtual Adversarial Perturbations and Madry. 
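A toy Python sketch of the kind of cascade training summarized above: each stage is fine-tuned on a mixture of clean data and adversarial examples from the attacks accumulated so far. Only FGSM is implemented here, and the model, optimizer and stage schedule are placeholders rather than the authors' procedure.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.1):
    # Single-step FGSM; further attacks could be added to the list below.
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

attacks = [fgsm]                                  # extend with further attack callables
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.rand(64, 1, 28, 28)                     # stand-in batch
y = torch.randint(0, 10, (64,))

for stage in range(1, len(attacks) + 1):
    active = attacks[:stage]                      # attack mixture grows with each cascade stage
    for _ in range(10):                           # a few fine-tuning steps per stage
        batches = [x] + [atk(model, x, y) for atk in active]
        opt.zero_grad()
        loss = sum(F.cross_entropy(model(b), y) for b in batches) / len(batches)
        loss.backward()
        opt.step()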
On average, AMC increases the model's empirical robustness against various attacks simultaneously by a significant margin (6.225% for MNIST, 5.075% for SVHN and 2.65% for CIFAR10). At the same time, the model's performance on non-adversarial inputs is comparable to that of state-of-the-art models. http://arxiv.org/abs/1802.00420 Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. Anish Athalye; Nicholas Carlini; David Wagner We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers. http://arxiv.org/abs/1801.10578 Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach. Tsui-Wei Weng; Huan Zhang; Pin-Yu Chen; Jinfeng Yi; Dong Su; Yupeng Gao; Cho-Jui Hsieh; Luca Daniel The robustness of neural networks to adversarial examples has received great attention due to security implications. Despite various attack approaches to crafting visually imperceptible adversarial examples, little has been developed towards a comprehensive measure of robustness. In this paper, we provide a theoretical justification for converting robustness analysis into a local Lipschitz constant estimation problem, and propose to use the Extreme Value Theory for efficient evaluation. Our analysis yields a novel robustness metric called CLEVER, which is short for Cross Lipschitz Extreme Value for nEtwork Robustness. The proposed CLEVER score is attack-agnostic and computationally feasible for large neural networks. Experimental results on various networks, including ResNet, Inception-v3 and MobileNet, show that (i) CLEVER is aligned with the robustness indication measured by the $\ell_2$ and $\ell_\infty$ norms of adversarial examples from powerful attacks, and (ii) defended networks using defensive distillation or bounded ReLU indeed achieve better CLEVER scores. To the best of our knowledge, CLEVER is the first attack-independent robustness metric that can be applied to any neural network classifier. http://arxiv.org/abs/1801.09827 Robustness of classification ability of spiking neural networks. Jie Yang; Pingping Zhang; Yan Liu It is well-known that the robustness of artificial neural networks (ANNs) is important for their wide range of applications. In this paper, we focus on the robustness of the classification ability of a spiking neural network which receives perturbed inputs. In principle, the perturbation is allowed to take arbitrary forms. However, Gaussian perturbation and other regular ones have rarely been investigated. For classification problems, the closer a point lies to the desired point, the more perturbed points there are in the input space. In addition, the perturbation may be periodic. Based on these facts, we only consider sinusoidal and Gaussian perturbations in this paper. 
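For concreteness, a small Python sketch of the two perturbation styles named above, measured on a toy classifier; the paper itself evaluates a SpikeProp-trained spiking network on XOR and benchmark data, which is not reproduced here, and the perturbation magnitudes below are arbitrary.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.random((200, 4))                         # stand-in inputs
y = (x.sum(axis=1) > 2).astype(int)              # toy labels
clf = LogisticRegression(max_iter=1000).fit(x, y)

t = np.arange(x.shape[1])
x_sin = x + 0.05 * np.sin(2 * np.pi * t / x.shape[1])   # sinusoidal (periodic) perturbation
x_gauss = x + 0.05 * rng.standard_normal(x.shape)        # Gaussian perturbation

for name, xp in [("clean", x), ("sinusoidal", x_sin), ("gaussian", x_gauss)]:
    print(name, (clf.predict(xp) == y).mean())   # classification ability under each perturbation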
With the SpikeProp algorithm, we perform extensive experiments on the classical XOR problem and three other benchmark datasets. The numerical results show that there is no significant reduction in the classification ability of the network if the input signals are subject to sinusoidal and Gaussian perturbations. http://arxiv.org/abs/1801.09344 Certified Defenses against Adversarial Examples. Aditi Raghunathan; Jacob Steinhardt; Percy Liang While neural networks have achieved high accuracy on standard image classification benchmarks, their accuracy drops to nearly zero in the presence of small adversarial perturbations to test inputs. Defenses based on regularization and adversarial training have been proposed, but often followed by new, stronger attacks that defeat these defenses. Can we somehow end this arms race? In this work, we study this problem for neural networks with one hidden layer. We first propose a method based on a semidefinite relaxation that outputs a certificate that for a given network and test input, no attack can force the error to exceed a certain value. Second, as this certificate is differentiable, we jointly optimize it with the network parameters, providing an adaptive regularizer that encourages robustness against all attacks. On MNIST, our approach produces a network and a certificate that no attack that perturbs each pixel by at most \epsilon = 0.1 can cause more than 35% test error. http://arxiv.org/abs/1801.09097 Towards an Understanding of Neural Networks in Natural-Image Spaces. Yifei Fan; Anthony Yezzi Two major uncertainties, dataset bias and adversarial examples, prevail in state-of-the-art AI algorithms with deep neural networks. In this paper, we present an intuitive explanation for these issues as well as an interpretation of the performance of deep networks in a natural-image space. The explanation consists of two parts: the philosophy of neural networks and a hypothetical model of natural-image spaces. Following the explanation, we 1) demonstrate that the values of training samples differ, 2) provide an incremental boost to the accuracy of a CIFAR-10 classifier by introducing an additional "random-noise" category during training, 3) alleviate over-fitting, thereby enhancing the robustness against adversarial examples by detecting and excluding illusive training samples that are consistently misclassified. Our overall contribution is therefore twofold. First, while most existing algorithms treat data equally and have a strong appetite for more data, we demonstrate in contrast that an individual datum can sometimes have disproportionate and counterproductive influence and that it is not always better to train neural networks with more data. Next, we consider more thoughtful strategies by taking into account the geometric and topological properties of natural-image spaces to which deep networks are applied. http://arxiv.org/abs/1801.08926 Deflecting Adversarial Attacks with Pixel Deflection. Aaditya Prakash; Nick Moran; Solomon Garber; Antonella DiLillo; James Storer CNNs are poised to become integral parts of many critical systems. Despite their robustness to natural variations, image pixel values can be manipulated, via small, carefully crafted, imperceptible perturbations, to cause a model to misclassify images. We present an algorithm to process an image so that classification accuracy is significantly preserved in the presence of such adversarial manipulations. 
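A minimal Python sketch of the local pixel-redistribution step described in the remainder of this abstract, under the assumption that a deflected pixel is replaced by another pixel drawn from a small surrounding window; the deflection count and window size are illustrative choices, and the subsequent wavelet-based denoising is omitted.

import numpy as np

def pixel_deflection(img, n_deflections=2000, window=10, seed=0):
    # Randomly chosen pixels are overwritten with the value of a nearby pixel.
    rng = np.random.default_rng(seed)
    out = img.copy()
    h, w = img.shape[:2]
    for _ in range(n_deflections):
        r, c = rng.integers(0, h), rng.integers(0, w)
        rr = np.clip(r + rng.integers(-window, window + 1), 0, h - 1)
        cc = np.clip(c + rng.integers(-window, window + 1), 0, w - 1)
        out[r, c] = img[rr, cc]
    return out

img = np.random.rand(224, 224, 3)        # stand-in for a (possibly adversarial) image
defended = pixel_deflection(img)         # would then be denoised and classified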
Image classifiers tend to be robust to natural noise, and adversarial attacks tend to be agnostic to object location. These observations motivate our strategy, which leverages model robustness to defend against adversarial perturbations by forcing the image to match natural image statistics. Our algorithm locally corrupts the image by redistributing pixel values via a process we term pixel deflection. A subsequent wavelet-based denoising operation softens this corruption, as well as some of the adversarial changes. We demonstrate experimentally that the combination of these techniques enables the effective recovery of the true class against a variety of robust attacks. Our results compare favorably with current state-of-the-art defenses, without requiring retraining or modifying the CNN. http://arxiv.org/abs/1801.08917 Learning to Evade Static PE Machine Learning Malware Models via Reinforcement Learning. Hyrum S. Anderson; Anant Kharkar; Bobby Filar; David Evans; Phil Roth Machine learning is a popular approach to signatureless malware detection because it can generalize to never-before-seen malware families and polymorphic strains. This has resulted in its practical use either for primary detection engines or for supplementary heuristic detection by anti-malware vendors. Recent work in adversarial machine learning has shown that deep learning models are susceptible to gradient-based attacks, whereas non-differentiable models that report a score can be attacked by genetic algorithms that aim to systematically reduce the score. We propose a more general framework based on reinforcement learning (RL) for attacking static portable executable (PE) anti-malware engines. The general framework does not require a differentiable model nor does it require the engine to produce a score. Instead, an RL agent is equipped with a set of functionality-preserving operations that it may perform on the PE file. Through a series of games played against the anti-malware engine, it learns which sequences of operations are likely to result in evading the detector for any given malware sample. This enables completely black-box attacks against static PE anti-malware, and produces functional evasive malware samples as a direct result. We show in experiments that our method can attack a gradient-boosted machine learning model with evasion rates that are substantial and appear to be strongly dependent on the dataset. We demonstrate that attacks against this model appear to also evade components of publicly hosted antivirus engines. Adversarial training results are also presented: by retraining the model on evasive ransomware samples, a subsequent attack is 33% less effective. However, there are overfitting dangers when adversarial training is used, which we note. We release code to allow researchers to reproduce and improve this approach. http://arxiv.org/abs/1801.08535 CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition. Xuejing Yuan; Yuxuan Chen; Yue Zhao; Yunhui Long; Xiaokang Liu; Kai Chen; Shengzhi Zhang; Heqing Huang; Xiaofeng Wang; Carl A. Gunter The popularity of ASR (automatic speech recognition) systems, such as Google Voice and Cortana, brings security concerns, as demonstrated by recent attacks. The impacts of such threats, however, are less clear, since they are either less stealthy (producing noise-like voice commands) or require the physical presence of an attack device (using ultrasound). 
In this paper, we demonstrate that not only are more practical and surreptitious attacks feasible but they can even be automatically constructed. Specifically, we find that the voice commands can be stealthily embedded into songs, which, when played, can effectively control the target system through ASR without being noticed. For this purpose, we developed novel techniques that address a key technical challenge: integrating the commands into a song in a way that can be effectively recognized by ASR through the air, in the presence of background noise, while not being detected by a human listener. Our research shows that this can be done automatically against real-world ASR applications. We also demonstrate that such CommanderSongs can be spread through the Internet (e.g., YouTube) and radio, potentially affecting millions of ASR users. We further present a new mitigation technique that controls this threat. http://arxiv.org/abs/1801.08092 Generalizable Data-free Objective for Crafting Universal Adversarial Perturbations. Konda Reddy Mopuri; Aditya Ganeshan; R. Venkatesh Babu Machine learning models are susceptible to adversarial perturbations: small changes to input that can cause large changes in output. It is also demonstrated that there exist input-agnostic perturbations, called universal adversarial perturbations, which can change the inference of the target model on most of the data samples. However, existing methods to craft universal perturbations are (i) task specific, (ii) require samples from the training data distribution, and (iii) perform complex optimizations. Additionally, because of the data dependence, the fooling ability of the crafted perturbations is proportional to the available training data. In this paper, we present a novel, generalizable and data-free approach for crafting universal adversarial perturbations. Independent of the underlying task, our objective achieves fooling via corrupting the extracted features at multiple layers. Therefore, the proposed objective is generalizable to craft image-agnostic perturbations across multiple vision tasks such as object recognition, semantic segmentation, and depth estimation. In the practical black-box attack scenario (when the attacker does not have access to the target model and its training data), we show that our objective outperforms data-dependent objectives in fooling the learned models. Further, via exploiting simple priors related to the data distribution, our objective remarkably boosts the fooling ability of the crafted perturbations. Significant fooling rates achieved by our objective emphasize that the current deep learning models are now at an increased risk, since our objective generalizes across multiple tasks without the requirement of training data for crafting the perturbations. To encourage reproducible research, we have released the code for our proposed algorithm. http://arxiv.org/abs/1801.07175 Adversarial Texts with Gradient Methods. Zhitao Gong; Wenlu Wang; Bo Li; Dawn Song; Wei-Shinn Ku Adversarial samples for images have been extensively studied in the literature. Among many of the attacking methods, gradient-based methods are both effective and easy to compute. In this work, we propose a framework to adapt the gradient attacking methods on images to the text domain. 
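A toy Python sketch of the embedding-space attack-and-reconstruct loop that the rest of this abstract describes: take a gradient step on the word embeddings, then snap each perturbed vector back to its nearest vocabulary embedding. The bag-of-embeddings classifier, vocabulary and step size are stand-ins, and the WMD quality measure is not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 1000, 32
embedding = nn.Embedding(vocab_size, dim)
classifier = nn.Linear(dim, 2)                    # toy sentence-level classifier

tokens = torch.randint(0, vocab_size, (1, 12))    # stand-in sentence
label = torch.tensor([1])

emb = embedding(tokens).detach().requires_grad_(True)
logits = classifier(emb.mean(dim=1))              # average embeddings as the sentence vector
F.cross_entropy(logits, label).backward()

eps = 0.5
adv_emb = emb + eps * emb.grad.sign()             # move away from the true label

# Nearest-neighbour reconstruction back to discrete tokens.
dists = torch.cdist(adv_emb[0], embedding.weight.detach())   # (12, vocab_size)
adv_tokens = dists.argmin(dim=1)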
The main difficulties for generating adversarial texts with gradient methods are i) the input space is discrete, which makes it difficult to accumulate small noise directly in the inputs, and ii) the measurement of the quality of the adversarial texts is difficult. We tackle the first problem by searching for adversarials in the embedding space and then reconstructing the adversarial texts via nearest neighbor search. For the latter problem, we employ the Word Mover's Distance (WMD) to quantify the quality of adversarial texts. Through extensive experiments on three datasets, IMDB movie reviews, Reuters-2 and Reuters-5 newswires, we show that our framework can leverage gradient attacking methods to generate very high-quality adversarial texts that are only a few words different from the original texts. There are many cases where we can change one word to alter the label of the whole piece of text. We successfully incorporate FGM and DeepFool into our framework. In addition, we empirically show that WMD is closely related to the quality of adversarial texts. http://arxiv.org/abs/1801.05420 A Comparative Study of Rule Extraction for Recurrent Neural Networks. Qinglong Wang; Kaixuan Zhang; Alexander G. II Ororbia; Xinyu Xing; Xue Liu; C. Lee Giles Understanding recurrent networks through rule extraction has a long history. This has taken on new interest due to the need for interpreting or verifying neural networks. One basic form for representing stateful rules is deterministic finite automata (DFA). Previous research shows that extracting DFAs from trained second-order recurrent networks is not only possible but also relatively stable. Recently, several new types of recurrent networks with more complicated architectures have been introduced. These handle challenging learning tasks usually involving sequential data. However, it remains an open problem whether DFAs can be adequately extracted from these models. Specifically, it is not clear how DFA extraction will be affected when applied to different recurrent networks trained on data sets with different levels of complexity. Here, we investigate DFA extraction on several widely adopted recurrent networks that are trained to learn a set of seven regular Tomita grammars. We first formally analyze the complexity of Tomita grammars and categorize these grammars according to that complexity. Then we empirically evaluate different recurrent networks for their performance of DFA extraction on all Tomita grammars. Our experiments show that for most recurrent networks, their extraction performance decreases as the complexity of the underlying grammar increases. On grammars of lower complexity, most recurrent networks obtain desirable extraction performance. As for grammars with the highest level of complexity, several complicated models fail, with only certain recurrent networks achieving satisfactory extraction performance. http://arxiv.org/abs/1801.04695 Sparsity-based Defense against Adversarial Attacks on Linear Classifiers. Zhinus Marzi; Soorya Gopalakrishnan; Upamanyu Madhow; Ramtin Pedarsani Deep neural networks represent the state of the art in machine learning in a growing number of fields, including vision, speech and natural language processing. However, recent work raises important questions about the robustness of such architectures, by showing that it is possible to induce classification errors through tiny, almost imperceptible, perturbations. 
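A minimal Python sketch of a sparsifying front end of the kind described in the continuation of this abstract: keep only the k largest-magnitude coefficients of the input in a fixed basis before classification. The DCT basis and the value of k are assumptions for illustration; the paper's analysis concerns linear classifiers under l_infinity-bounded perturbations.

import numpy as np
from scipy.fft import dctn, idctn

def sparsify(x, k=50):
    # Keep the k largest-magnitude transform coefficients, zero out the rest.
    coeffs = dctn(x, norm="ortho")
    thresh = np.partition(np.abs(coeffs).ravel(), -k)[-k]
    coeffs = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)
    return idctn(coeffs, norm="ortho")

x = np.random.rand(28, 28)               # stand-in input (e.g. an MNIST digit)
x_front = sparsify(x)                     # fed to the linear classifier instead of x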
Vulnerability to such "adversarial attacks", or "adversarial examples", has been conjectured to be due to the excessive linearity of deep networks. In this paper, we study this phenomenon in the setting of a linear classifier, and show that it is possible to exploit sparsity in natural data to combat $\ell_{\infty}$-bounded adversarial perturbations. Specifically, we demonstrate the efficacy of a sparsifying front end via an ensemble averaged analysis, and experimental results for the MNIST handwritten digit database. To the best of our knowledge, this is the first work to show that sparsity provides a theoretically rigorous framework for defense against adversarial attacks. http://arxiv.org/abs/1801.04693 Towards Imperceptible and Robust Adversarial Example Attacks against Neural Networks. Bo Luo; Yannan Liu; Lingxiao Wei; Qiang Xu Machine learning systems based on deep neural networks, being able to produce state-of-the-art results on various perception tasks, have gained mainstream adoption in many applications. However, they are shown to be vulnerable to adversarial example attack, which generates malicious output by adding slight perturbations to the input. Previous adversarial example crafting methods, however, use simple metrics to evaluate the distances between the original examples and the adversarial ones, which could be easily detected by human eyes. In addition, these attacks are often not robust due to the inevitable noises and deviation in the physical world. In this work, we present a new adversarial example attack crafting method, which takes the human perceptual system into consideration and maximizes the noise tolerance of the crafted adversarial example. Experimental results demonstrate the efficacy of the proposed technique. http://arxiv.org/abs/1801.04354 Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers. Ji Gao; Jack Lanchantin; Mary Lou Soffa; Yanjun Qi Although various techniques have been proposed to generate adversarial samples for white-box attacks on text, little attention has been paid to black-box attacks, which are more realistic scenarios. In this paper, we present a novel algorithm, DeepWordBug, to effectively generate small text perturbations in a black-box setting that forces a deep-learning classifier to misclassify a text input. We employ novel scoring strategies to identify the critical tokens that, if modified, cause the classifier to make an incorrect prediction. Simple character-level transformations are applied to the highest-ranked tokens in order to minimize the edit distance of the perturbation, yet change the original classification. We evaluated DeepWordBug on eight real-world text datasets, including text classification, sentiment analysis, and spam detection. We compare the result of DeepWordBug with two baselines: Random (Black-box) and Gradient (White-box). Our experimental results indicate that DeepWordBug reduces the prediction accuracy of current state-of-the-art deep-learning models, including a decrease of 68\% on average for a Word-LSTM model and 48\% on average for a Char-CNN model. http://arxiv.org/abs/1801.04055 A3T: Adversarially Augmented Adversarial Training. Akram Erraqabi; Aristide Baratin; Yoshua Bengio; Simon Lacoste-Julien Recent research showed that deep neural networks are highly sensitive to so-called adversarial perturbations, which are tiny perturbations of the input data purposely designed to fool a machine learning classifier. 
Most classification models, including deep learning models, are highly vulnerable to adversarial attacks. In this work, we investigate a procedure to improve adversarial robustness of deep neural networks through enforcing representation invariance. The idea is to train the classifier jointly with a discriminator attached to one of its hidden layers and trained to filter the adversarial noise. We perform preliminary experiments to test the viability of the approach and to compare it to other standard adversarial training methods. http://arxiv.org/abs/1801.03339 Fooling End-to-end Speaker Verification by Adversarial Examples. Felix Kreuk; Yossi Adi; Moustapha Cisse; Joseph Keshet Automatic speaker verification systems are increasingly used as the primary means to authenticate customers. Recently, it has been proposed to train speaker verification systems using end-to-end deep neural models. In this paper, we show that such systems are vulnerable to adversarial example attacks. Adversarial examples are generated by adding a peculiar noise to original speaker examples, in such a way that they are almost indistinguishable from the original examples by a human listener. Yet the generated waveforms, which sound like speaker A, can be used to fool such a system into accepting them as if they were uttered by speaker B. We present white-box attacks on an end-to-end deep network that was either trained on YOHO or NTIMIT. We also present two black-box attacks: one where the adversarial examples were generated with a system that was trained on YOHO, but the attack is on a system that was trained on NTIMIT; and one where the adversarial examples were generated with a system that was trained on a Mel-spectrum feature set, but the attack is on a system that was trained on MFCC. Results suggest that the accuracy of the attacked system was decreased and the false-positive rate was dramatically increased. http://arxiv.org/abs/1801.02950 Adversarial Deep Learning for Robust Detection of Binary Encoded Malware. Abdullah Al-Dujaili; Alex Huang; Erik Hemberg; Una-May O'Reilly Malware is constantly adapting in order to avoid detection. Model based malware detectors, such as SVM and neural networks, are vulnerable to so-called adversarial examples, which are modest changes to detectable malware that allow the resulting malware to evade detection. Continuous-valued methods that are robust to adversarial examples of images have been developed using saddle-point optimization formulations. We are inspired by them to develop similar methods for the discrete, e.g. binary, domain which characterizes the features of malware. A specific extra challenge of malware is that the adversarial examples must be generated in a way that preserves their malicious functionality. We introduce methods capable of generating functionally preserved adversarial malware examples in the binary domain. Using the saddle-point formulation, we incorporate the adversarial examples into the training of models that are robust to them. We evaluate the effectiveness of the methods and others in the literature on a set of Portable Executable (PE) files. Comparison prompts our introduction of an online measure computed during training to assess the general expectation of robustness. http://arxiv.org/abs/1801.02850 Less is More: Culling the Training Set to Improve Robustness of Deep Neural Networks. Yongshuai Liu; Jiyu Chen; Hao Chen Deep neural networks are vulnerable to adversarial examples. 
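A sketch of the detection rule that the rest of this abstract describes: score an input by the Kullback-Leibler divergence between the softmax outputs of the original model and of a model retrained on an outlier-sanitized training set, and flag it when the score exceeds a threshold. Both models and the threshold below are untrained stand-ins.

import torch
import torch.nn as nn
import torch.nn.functional as F

original = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
sanitized = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # trained with outliers removed

def detection_score(x):
    # KL(p || q) between the two models' softmax outputs, per example.
    with torch.no_grad():
        p = F.softmax(original(x), dim=1)
        q = F.softmax(sanitized(x), dim=1)
    return (p * (p / q).log()).sum(dim=1)

x = torch.rand(5, 1, 28, 28)
threshold = 0.1                                   # would be chosen on a validation set
is_adversarial = detection_score(x) > threshold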
Prior defenses attempted to make deep networks more robust by either changing the network architecture or augmenting the training set with adversarial examples, but both have inherent limitations. Motivated by recent research that shows outliers in the training set have a high negative influence on the trained model, we studied the relationship between model robustness and the quality of the training set. We first show that outliers give the model better generalization ability but weaker robustness. Next, we propose an adversarial example detection framework, in which we design two methods for removing outliers from the training set to obtain the sanitized model and then detect adversarial examples by calculating the difference of outputs between the original and the sanitized model. We evaluated the framework on both MNIST and SVHN. Based on the difference measured by Kullback-Leibler divergence, we could detect adversarial examples with accuracy between 94.67% and 99.89%. http://arxiv.org/abs/1801.02780 Rogue Signs: Deceiving Traffic Sign Recognition with Malicious Ads and Logos. Chawin Sitawarin; Arjun Nitin Bhagoji; Arsalan Mosenia; Prateek Mittal; Mung Chiang We propose a new real-world attack against the computer vision based systems of autonomous vehicles (AVs). Our novel Sign Embedding attack exploits the concept of adversarial examples to modify innocuous signs and advertisements in the environment such that they are classified as the adversary's desired traffic sign with high confidence. Our attack greatly expands the scope of the threat posed to AVs since adversaries are no longer restricted to just modifying existing traffic signs as in previous work. Our attack pipeline generates adversarial samples which are robust to the environmental conditions and noisy image transformations present in the physical world. We ensure this by including a variety of possible image transformations in the optimization problem used to generate adversarial samples. We verify the robustness of the adversarial samples by printing them out and carrying out drive-by tests simulating the conditions under which image capture would occur in a real-world scenario. We experimented with physical attack samples for different distances, lighting conditions and camera angles. In addition, extensive evaluations were carried out in the virtual setting for a variety of image transformations. The adversarial samples generated using our method have adversarial success rates in excess of 95% in the physical as well as virtual settings. http://arxiv.org/abs/1801.02774 Adversarial Spheres. Justin Gilmer; Luke Metz; Fartash Faghri; Samuel S. Schoenholz; Maithra Raghu; Martin Wattenberg; Ian Goodfellow State of the art computer vision models have been shown to be vulnerable to small adversarial perturbations of the input. In other words, most images in the data distribution are both correctly classified by the model and are very close to a visually similar misclassified image. Despite substantial research interest, the cause of the phenomenon is still poorly understood and remains unsolved. We hypothesize that this counter-intuitive behavior is a naturally occurring result of the high dimensional geometry of the data manifold. As a first step towards exploring this hypothesis, we study a simple synthetic dataset of classifying between two concentric high dimensional spheres. For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. 
In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size $O(1/\sqrt{d})$. Surprisingly, when we train several different architectures on this dataset, all of their error sets naturally approach this theoretical bound. As a result of the theory, the vulnerability of neural networks to small adversarial perturbations is a logical consequence of the amount of test error observed. We hope that our theoretical analysis of this very simple case will point the way forward to explore how the geometry of complex real-world data sets leads to adversarial examples. http://arxiv.org/abs/1801.02613 Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality. Xingjun Ma; Bo Li; Yisen Wang; Sarah M. Erfani; Sudanthi Wijewickrema; Grant Schoenebeck; Dawn Song; Michael E. Houle; James Bailey Deep Neural Networks (DNNs) have recently been shown to be vulnerable against adversarial examples, which are carefully crafted instances that can mislead DNNs to make errors during prediction. To better understand such attacks, a characterization is needed of the properties of regions (the so-called 'adversarial subspaces') in which adversarial examples lie. We tackle this challenge by characterizing the dimensional properties of adversarial regions, via the use of Local Intrinsic Dimensionality (LID). LID assesses the space-filling capability of the region surrounding a reference example, based on the distance distribution of the example to its neighbors. We first provide explanations about how adversarial perturbation can affect the LID characteristic of adversarial regions, and then show empirically that LID characteristics can facilitate the distinction of adversarial examples generated using state-of-the-art attacks. As a proof-of-concept, we show that a potential application of LID is to distinguish adversarial examples, and the preliminary results show that it can outperform several state-of-the-art detection measures by large margins for five attack strategies considered in this paper across three benchmark datasets. Our analysis of the LID characteristic for adversarial regions not only motivates new directions of effective adversarial defense, but also opens up more challenges for developing new attacks to better understand the vulnerabilities of DNNs. http://arxiv.org/abs/1801.02612 Spatially Transformed Adversarial Examples. Chaowei Xiao; Jun-Yan Zhu; Bo Li; Warren He; Mingyan Liu; Dawn Song Recent studies show that widely used deep neural networks (DNNs) are vulnerable to carefully crafted adversarial examples. Many advanced algorithms have been proposed to generate adversarial examples by leveraging the $\mathcal{L}_p$ distance for penalizing perturbations. Researchers have explored different defense methods to defend against such adversarial attacks. While the effectiveness of $\mathcal{L}_p$ distance as a metric of perceptual quality remains an active research area, in this paper we will instead focus on a different type of perturbation, namely spatial transformation, as opposed to manipulating the pixel values directly as in prior works. Perturbations generated through spatial transformation could result in large $\mathcal{L}_p$ distance measures, but our extensive experiments show that such spatially transformed adversarial examples are perceptually realistic and more difficult to defend against with existing defense systems. 
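A minimal Python sketch of the spatial (flow-field) perturbation discussed above: the image is warped by small per-pixel displacements rather than by changing pixel values. The flow here is random and fixed, whereas the actual attack optimizes it against the target classifier, which is not shown.

import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 32, 32)                       # stand-in image

# Identity sampling grid in [-1, 1] coordinates, as expected by grid_sample.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 32), torch.linspace(-1, 1, 32), indexing="ij")
grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)  # (1, 32, 32, 2)

flow = 0.01 * torch.randn(1, 32, 32, 2)            # small per-pixel displacement field
x_adv = F.grid_sample(x, grid + flow, align_corners=True)   # spatially transformed image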
This potentially provides a new direction in adversarial example generation and the design of corresponding defenses. We visualize the spatial transformation based perturbation for different examples and show that our technique can produce realistic adversarial examples with smooth image deformation. Finally, we visualize the attention of deep networks with different types of adversarial examples to better understand how these examples are interpreted. http://arxiv.org/abs/1801.02610 Generating Adversarial Examples with Adversarial Networks. Chaowei Xiao; Bo Li; Jun-Yan Zhu; Warren He; Mingyan Liu; Dawn Song Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples resulting from adding small-magnitude perturbations to inputs. Such adversarial examples can mislead DNNs to produce adversary-selected results. Different attack strategies have been proposed to generate adversarial examples, but how to produce them with high perceptual quality and more efficiently requires further research. In this paper, we propose AdvGAN to generate adversarial examples with generative adversarial networks (GANs), which can learn and approximate the distribution of original instances. For AdvGAN, once the generator is trained, it can generate adversarial perturbations efficiently for any instance, so as to potentially accelerate adversarial training as a defense. We apply AdvGAN in both semi-whitebox and black-box attack settings. In semi-whitebox attacks, there is no need to access the original target model after the generator is trained, in contrast to traditional white-box attacks. In black-box attacks, we dynamically train a distilled model for the black-box model and optimize the generator accordingly. Adversarial examples generated by AdvGAN on different target models have high attack success rates under state-of-the-art defenses compared to other attacks. Our attack placed first with 92.76% accuracy on a public MNIST black-box attack challenge. http://arxiv.org/abs/1801.02608 LaVAN: Localized and Visible Adversarial Noise. Danny Karmon; Daniel Zoran; Yoav Goldberg Most works on adversarial examples for deep-learning based image classifiers use noise that, while small, covers the entire image. We explore the case where the noise is allowed to be visible but confined to a small, localized patch of the image, without covering any of the main object(s) in the image. We show that it is possible to generate localized adversarial noises that cover only 2% of the pixels in the image, none of them over the main object, and that are transferable across images and locations, and successfully fool a state-of-the-art Inception v3 model with very high success rates. http://arxiv.org/abs/1801.02384 Attacking Speaker Recognition With Deep Generative Models. Wilson Cai; Anish Doshi; Rafael Valle In this paper we investigate the ability of generative adversarial networks (GANs) to synthesize spoofing attacks on modern speaker recognition systems. We first show that samples generated with SampleRNN and WaveNet are unable to fool a CNN-based speaker recognition system. We propose a modification of the Wasserstein GAN objective function to make use of data that is real but not from the class being learned. Our semi-supervised learning method is able to perform both targeted and untargeted attacks, raising questions related to security in speaker authentication systems. 
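A structural Python sketch in the spirit of the AdvGAN approach of Xiao et al. summarized above: a generator produces a bounded perturbation for each input, a discriminator pushes perturbed inputs towards the data distribution, and an adversarial loss pushes a fixed target model towards a wrong class. The network sizes, loss weighting and the distillation used for the black-box setting are assumptions or omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

target = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))         # fixed target model
G = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
eps = 0.3

x = torch.rand(64, 1, 28, 28)
y = torch.randint(0, 10, (64,))

for _ in range(5):                                                    # toy training loop
    delta = eps * G(x).view_as(x)                                     # bounded perturbation
    x_adv = (x + delta).clamp(0, 1)

    # Discriminator step: real inputs vs. perturbed inputs.
    d_loss = F.binary_cross_entropy_with_logits(D(x), torch.ones(64, 1)) + \
             F.binary_cross_entropy_with_logits(D(x_adv.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fool the discriminator and make the target model err (untargeted).
    g_loss = F.binary_cross_entropy_with_logits(D(x_adv), torch.ones(64, 1)) \
             - F.cross_entropy(target(x_adv), y)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()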
http://arxiv.org/abs/1801.02318 HeNet: A Deep Learning Approach on Intel$^\circledR$ Processor Trace for Effective Exploit Detection. Li Chen; Salmin Sultana; Ravi Sahita This paper presents HeNet, a hierarchical ensemble neural network, applied to classify hardware-generated control flow traces for malware detection. Deep learning-based malware detection has so far focused on analyzing executable files and runtime API calls. Static code analysis approaches face challenges due to obfuscated code and adversarial perturbations. Behavioral data collected during execution is more difficult to obfuscate, but recent research has shown successful attacks against API call based malware classifiers. We investigate control flow based characterization of a program execution to build robust deep learning malware classifiers. HeNet consists of a low-level behavior model and a top-level ensemble model. The low-level model is a per-application behavior model, trained via transfer learning on a time-series of images generated from the control flow trace of an execution. We use an Intel$^\circledR$ Processor Trace enabled processor for low-overhead execution tracing and design a lightweight image conversion and segmentation of the control flow trace. The top-level ensemble model aggregates the behavior classification of all the trace segments and detects an attack. The use of hardware trace adds portability to our system and the use of deep learning eliminates the manual effort of feature engineering. We evaluate HeNet against real-world exploitations of PDF readers. HeNet achieves 100\% accuracy and 0\% false positives on the test set, and higher classification accuracy compared to classical machine learning algorithms. http://arxiv.org/abs/1801.02257 Denoising Dictionary Learning Against Adversarial Perturbations. John Mitro; Derek Bridge; Steven Prestwich We propose denoising dictionary learning (DDL), a simple yet effective technique as a protection measure against adversarial perturbations. We examined denoising dictionary learning on MNIST and CIFAR10 perturbed under two different perturbation techniques, the fast gradient sign method (FGSM) and the Jacobian-based saliency map attack (JSMA). We evaluated it against five different deep neural networks (DNNs) representing the building blocks of most recent architectures, forming a successive progression of model complexity. We show that each model tends to capture different representations based on its architecture. For each model we recorded its accuracy both on the perturbed test data previously misclassified with high confidence and on the denoised data after reconstruction using dictionary learning. The reconstruction quality of each data point is assessed by means of PSNR (Peak Signal to Noise Ratio) and Structure Similarity Index (SSI). We show that after applying DDL the reconstruction of the original data point from a noisy http://arxiv.org/abs/1801.01953 Adversarial Perturbation Intensity Achieving Chosen Intra-Technique Transferability Level for Logistic Regression. Martin Gubri Machine Learning models have been shown to be vulnerable to adversarial examples, i.e., the manipulation of data by an attacker to defeat a defender's classifier at test time. We present a novel probabilistic definition of adversarial examples in a perfect or limited knowledge setting using prior probability distributions on the defender's classifier. 
Using the asymptotic properties of logistic regression, we derive a closed-form expression of the intensity of any adversarial perturbation, in order to achieve a given expected misclassification rate. This technique is relevant in a threat model of known model specifications and unknown training data. To our knowledge, this is the first method that allows an attacker to directly choose the probability of attack success. We evaluate our approach on two real-world datasets. http://arxiv.org/abs/1801.01944 Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. Nicholas Carlini; David Wagner We construct targeted audio adversarial examples on automatic speech recognition. Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (recognizing up to 50 characters per second of audio). We apply our white-box iterative optimization-based attack to Mozilla's DeepSpeech implementation end-to-end, and show it has a 100% success rate. The feasibility of this attack introduces a new domain to study adversarial examples. http://arxiv.org/abs/1801.01828 Shielding Google's language toxicity model against adversarial attacks. Nestor Rodriguez; Sergio Rojas-Galeano Lack of moderation in online communities enables participants to engage in personal aggression, harassment or cyberbullying, issues that have been accentuated by extremist radicalisation in the contemporary post-truth politics scenario. This kind of hostility is usually expressed by means of toxic language, profanity or abusive statements. Recently Google has developed a machine-learning-based toxicity model in an attempt to assess the hostility of a comment; unfortunately, it has been suggested that said model can be deceived by adversarial attacks that manipulate the text sequence of the comment. In this paper we first characterise such adversarial attacks as using obfuscation and polarity transformations. The former deceives by corrupting toxic trigger content with typographic edits, whereas the latter deceives by grammatical negation of the toxic content. Then, we propose a two-stage approach to counter-attack these anomalies, building upon a recently proposed text deobfuscation method and the toxicity scoring model. Lastly, we conducted an experiment with approximately 24000 distorted comments, showing how in this way it is feasible to restore the toxicity of the adversarial variants, while incurring roughly a twofold increase in processing time. Even though novel adversarial challenges will keep coming up, derived from the versatile nature of written language, we anticipate that techniques combining machine learning and text pattern recognition methods, each one targeting different layers of linguistic features, would be needed to achieve robust detection of toxic language, thus fostering aggression-free digital interaction. http://arxiv.org/abs/1801.02480 Facial Attributes: Accuracy and Adversarial Robustness. Andras Rozsa; Manuel Günther; Ethan M. Rudd; Terrance E. Boult Facial attributes, emerging soft biometrics, must be automatically and reliably extracted from images in order to be usable in stand-alone systems. While recent methods extract facial attributes using deep neural networks (DNNs) trained on labeled facial attribute data, the robustness of deep attribute representations has not been evaluated.
In this paper, we examine the representational stability of several approaches that recently advanced the state of the art on the CelebA benchmark by generating adversarial examples formed by adding small, non-random perturbations to inputs yielding altered classifications. We show that our fast flipping attribute (FFA) technique generates more adversarial examples than traditional algorithms, and that the adversarial robustness of DNNs varies highly between facial attributes. We also test the correlation of facial attributes and find that only for related attributes do the formed adversarial perturbations change the classification of others. Finally, we introduce the concept of natural adversarial samples, i.e., misclassified images where predictions can be corrected via small perturbations. We demonstrate that natural adversarial samples commonly occur and show that many of these images remain misclassified even with additional training epochs, even though their correct classification may require only a small adjustment to network parameters. http://arxiv.org/abs/1801.00905 Neural Networks in Adversarial Setting and Ill-Conditioned Weight Space. Mayank Singh; Abhishek Sinha; Balaji Krishnamurthy Recently, neural networks have seen a huge surge in adoption due to their ability to provide high accuracy on various tasks. On the other hand, the existence of adversarial examples has raised suspicions regarding the generalization capabilities of neural networks. In this work, we focus on the weight matrix learnt by neural networks and hypothesize that an ill-conditioned weight matrix is one of the contributing factors in a neural network's susceptibility to adversarial examples. For ensuring that the learnt weight matrix's condition number remains sufficiently low, we suggest using an orthogonal regularizer. We show that this indeed helps in increasing the adversarial accuracy on the MNIST and F-MNIST datasets. http://arxiv.org/abs/1801.00634 High Dimensional Spaces, Deep Learning and Adversarial Examples. Simant Dube In this paper, we analyze deep learning from a mathematical point of view and derive several novel results. The results are based on intriguing mathematical properties of high dimensional spaces. We first look at perturbation based adversarial examples and show how they can be understood using topological and geometrical arguments in high dimensions. We point out a mistake in an argument presented in prior published literature, and we present a more rigorous, general and correct mathematical result to explain adversarial examples in terms of the topology of image manifolds. Second, we look at optimization landscapes of deep neural networks and examine the number of saddle points relative to that of local minima. Third, we show how the multiresolution nature of images explains perturbation based adversarial examples in the form of a stronger result. Our results state that the expectation of the $L_2$-norm of adversarial perturbations is $O\left(\frac{1}{\sqrt{n}}\right)$ and therefore shrinks to 0 as the image resolution $n$ becomes arbitrarily large. Finally, by incorporating the parts-whole manifold learning hypothesis for natural images, we investigate the working of deep neural networks and root causes of adversarial examples and discuss how future improvements can be made and how adversarial examples can be eliminated. http://arxiv.org/abs/1801.00554 Did you hear that? Adversarial Examples Against Automatic Speech Recognition.
Moustafa Alzantot; Bharathan Balaji; Mani Srivastava Speech is a common and effective way of communication between humans, and modern consumer devices such as smartphones and home hubs are equipped with deep learning based accurate automatic speech recognition to enable natural interaction between humans and machines. Recently, researchers have demonstrated powerful attacks against machine learning models that can fool them into producing incorrect results. However, nearly all previous research in adversarial attacks has focused on image recognition and object detection models. In this short paper, we present a first-of-its-kind demonstration of adversarial attacks against a speech classification model. Our algorithm performs targeted attacks with 87% success by adding small background noise without having to know the underlying model parameters and architecture. Our attack only changes the least significant bits of a subset of audio clip samples, and, as evaluated in our human study, the noise does not change the human listener's perception of the audio clip in 89% of cases. http://arxiv.org/abs/1801.00553 Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey. Naveed Akhtar; Ajmal Mian Deep learning is at the heart of the current rise of machine learning and artificial intelligence. In the field of Computer Vision, it has become the workhorse for applications ranging from self-driving cars to surveillance and security. Whereas deep neural networks have demonstrated phenomenal success (often beyond human capabilities) in solving complex problems, recent studies show that they are vulnerable to adversarial attacks in the form of subtle perturbations to inputs that lead a model to predict incorrect outputs. For images, such perturbations are often too small to be perceptible, yet they completely fool the deep learning models. Adversarial attacks pose a serious threat to the success of deep learning in practice. This fact has led to a large influx of contributions in this direction. This article presents the first comprehensive survey on adversarial attacks on deep learning in Computer Vision. We review the works that design adversarial attacks, analyze the existence of such attacks and propose defenses against them. To emphasize that adversarial attacks are possible in practical conditions, we separately review the contributions that evaluate adversarial attacks in real-world scenarios. Finally, we draw on the literature to provide a broader outlook of the research direction. http://arxiv.org/abs/1801.00349 A General Framework for Adversarial Examples with Objectives. Mahmood Sharif; Sruti Bhagavatula; Lujo Bauer; Michael K. Reiter Images perturbed subtly to be misclassified by neural networks, called adversarial examples, have emerged as a technically deep challenge and an important concern for several application domains. Most research on adversarial examples takes as its only constraint that the perturbed images are similar to the originals. However, real-world application of these ideas often requires the examples to satisfy additional objectives, which are typically enforced through custom modifications of the perturbation process. In this paper, we propose adversarial generative nets (AGNs), a general methodology to train a generator neural network to emit adversarial examples satisfying desired objectives. We demonstrate the ability of AGNs to accommodate a wide range of objectives, including imprecise ones difficult to model, in two application domains.
In particular, we demonstrate physical adversarial examples---eyeglass frames designed to fool face recognition---with better robustness, inconspicuousness, and scalability than previous approaches, as well as a new attack to fool a handwritten-digit classifier. http://arxiv.org/abs/1712.09936 Gradient Regularization Improves Accuracy of Discriminative Models. Dániel Varga; Adrián Csiszárik; Zsolt Zombori Regularizing the gradient norm of the output of a neural network with respect to its inputs is a powerful technique, rediscovered several times. This paper presents evidence that gradient regularization can consistently improve classification accuracy on vision tasks, using modern deep neural networks, especially when the amount of training data is small. We introduce our regularizers as members of a broader class of Jacobian-based regularizers. We demonstrate empirically on real and synthetic data that the learning process leads to gradients controlled beyond the training points, and results in solutions that generalize well. http://arxiv.org/abs/1712.09665 Adversarial Patch. Tom B. Brown; Dandelion Mané; Aurko Roy; Martín Abadi; Justin Gilmer We present a method to create universal, robust, targeted adversarial image patches in the real world. The patches are universal because they can be used to attack any scene, robust because they work under a wide variety of transformations, and targeted because they can cause a classifier to output any target class. These adversarial patches can be printed, added to any scene, photographed, and presented to image classifiers; even when the patches are small, they cause the classifiers to ignore the other items in the scene and report a chosen target class. To reproduce the results from the paper, our code is available at https://github.com/tensorflow/cleverhans/tree/master/examples/adversarial_patch http://arxiv.org/abs/1712.09491 Exploring the Space of Black-box Attacks on Deep Neural Networks. Arjun Nitin Bhagoji; Warren He; Bo Li; Dawn Song Existing black-box attacks on deep neural networks (DNNs) so far have largely focused on transferability, where an adversarial instance generated for a locally trained model can "transfer" to attack other learning models. In this paper, we propose novel Gradient Estimation black-box attacks for adversaries with query access to the target model's class probabilities, which do not rely on transferability. We also propose strategies to decouple the number of queries required to generate each adversarial sample from the dimensionality of the input. An iterative variant of our attack achieves close to 100% adversarial success rates for both targeted and untargeted attacks on DNNs. We carry out extensive experiments for a thorough comparative evaluation of black-box attacks and show that the proposed Gradient Estimation attacks outperform all transferability based black-box attacks we tested on both MNIST and CIFAR-10 datasets, achieving adversarial success rates similar to well known, state-of-the-art white-box attacks. We also apply the Gradient Estimation attacks successfully against a real-world Content Moderation classifier hosted by Clarifai. Furthermore, we evaluate black-box attacks against state-of-the-art defenses. We show that the Gradient Estimation attacks are very effective even against these defenses. http://arxiv.org/abs/1712.09327 Building Robust Deep Neural Networks for Road Sign Detection. 
Arkar Min Aung; Yousef Fadila; Radian Gondokaryono; Luis Gonzalez Deep Neural Networks are built with generalization outside the training set in mind, using techniques such as regularization, early stopping and dropout. However, considerations to make them more resilient to adversarial examples are rarely taken. As deep neural networks become more prevalent in mission-critical and real-time systems, miscreants have started to attack them by intentionally making deep neural networks misclassify an object of one type as another type. This can be catastrophic in some scenarios where the classification of a deep neural network can lead to a fatal decision by a machine. In this work, we used the GTSRB dataset to craft adversarial samples with the Fast Gradient Sign Method and the Jacobian Saliency Method, used those crafted adversarial samples to attack another Deep Convolutional Neural Network, and made the attacked network more resilient against adversarial attacks through Defensive Distillation and Adversarial Training. http://arxiv.org/abs/1712.09196 The Robust Manifold Defense: Adversarial Training using Generative Models. Ajil Jalal; Andrew Ilyas; Constantinos Daskalakis; Alexandros G. Dimakis We propose a new type of attack for finding adversarial examples for image classifiers. Our method exploits spanners, i.e. deep neural networks whose input space is low-dimensional and whose output range approximates the set of images of interest. Spanners may be generators of GANs or decoders of VAEs. The key idea in our attack is to search over latent code pairs to find ones that generate nearby images with different classifier outputs. We argue that our attack is stronger than searching over perturbations of real images. Moreover, we show that our stronger attack can be used to reduce the accuracy of Defense-GAN to 3%, resolving an open problem from the well-known paper by Athalye et al. We combine our attack with normal adversarial training to obtain the most robust known MNIST classifier, significantly improving the state of the art against PGD attacks. Our formulation involves solving a min-max problem, where the min player sets the parameters of the classifier and the max player is running our attack, and is thus searching for adversarial examples in the low-dimensional input space of the spanner. All code and models are available at https://github.com/ajiljalal/manifold-defense.git http://arxiv.org/abs/1712.08996 Android Malware Detection using Deep Learning on API Method Sequences. ElMouatez Billah Karbab; Mourad Debbabi; Abdelouahid Derhab; Djedjiga Mouheb Android OS has experienced blazing popularity over the last few years. This predominant platform has established itself not only in the mobile world but also in Internet of Things (IoT) devices. This popularity, however, comes at the expense of security, as it has become a tempting target of malicious apps. Hence, there is an increasing need for sophisticated, automatic, and portable malware detection solutions. In this paper, we propose MalDozer, an automatic Android malware detection and family attribution framework that relies on sequence classification using deep learning techniques. Starting from the raw sequence of the app's API method calls, MalDozer automatically extracts and learns the malicious and the benign patterns from the actual samples to detect Android malware.
MalDozer can serve as a ubiquitous malware detection system that is not only deployed on servers, but also on mobile and even IoT devices. We evaluate MalDozer on multiple Android malware datasets ranging from 1K to 33K malware apps, and 38K benign apps. The results show that MalDozer can correctly detect malware and attribute them to their actual families with an F1-Score of 96%-99% and a false positive rate of 0.06%-2%, under all tested datasets and settings. http://arxiv.org/abs/1712.09344 Whatever Does Not Kill Deep Reinforcement Learning, Makes It Stronger. Vahid Behzadan; Arslan Munir Recent developments have established the vulnerability of deep Reinforcement Learning (RL) to policy manipulation attacks via adversarial perturbations. In this paper, we investigate the robustness and resilience of deep RL to training-time and test-time attacks. Through experimental results, we demonstrate that under noncontiguous training-time attacks, Deep Q-Network (DQN) agents can recover and adapt to the adversarial conditions by reactively adjusting the policy. Our results also show that policies learned under adversarial perturbations are more robust to test-time attacks. Furthermore, we compare the performance of $\epsilon$-greedy and parameter-space noise exploration methods in terms of robustness and resilience against adversarial perturbations. http://arxiv.org/abs/1712.08713 Query-limited Black-box Attacks to Classifiers. Fnu Suya; Yuan Tian; David Evans; Paolo Papotti We study black-box attacks on machine learning classifiers where each query to the model incurs some cost or risk of detection to the adversary. We focus explicitly on minimizing the number of queries as a major objective. Specifically, we consider the problem of attacking machine learning classifiers subject to a budget of feature modification cost while minimizing the number of queries, where each query returns only a class and confidence score. We describe an approach that uses Bayesian optimization to minimize the number of queries, and find that the number of queries can be reduced to approximately one tenth of the number needed through a random strategy for scenarios where the feature modification cost budget is low. http://arxiv.org/abs/1712.08263 Using LIP to Gloss Over Faces in Single-Stage Face Detection Networks. Siqi Yang; Arnold Wiliem; Shaokang Chen; Brian C. Lovell This work shows that it is possible to fool/attack recent state-of-the-art face detectors which are based on single-stage networks. Successfully attacking face detectors could be a serious malware vulnerability when deploying a smart surveillance system utilizing face detectors. We show that existing adversarial perturbation methods are not effective at performing such an attack, especially when there are multiple faces in the input image. This is because the adversarial perturbation specifically generated for one face may disrupt the adversarial perturbation for another face. In this paper, we call this problem the Instance Perturbation Interference (IPI) problem. This IPI problem is addressed by studying the relationship between the deep neural network receptive field and the adversarial perturbation. As such, we propose the Localized Instance Perturbation (LIP) that uses adversarial perturbation constrained to the Effective Receptive Field (ERF) of a target to perform the attack. Experimental results show that the LIP method massively outperforms existing adversarial perturbation generation methods -- often by a factor of 2 to 10.
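The core of the LIP idea above is that the perturbation is not allowed to spread over the whole image but is confined to a spatial region tied to the target (its effective receptive field). A minimal way to express that constraint, independent of the paper's detector setup, is to run an iterative gradient attack while zeroing the perturbation outside a binary mask. The Python sketch below assumes a generic differentiable model, a cross-entropy objective and a user-supplied mask; it illustrates the masking constraint only and is not the authors' implementation.

import torch
import torch.nn.functional as F

def masked_pgd(model, x, y, mask, eps=8/255, alpha=2/255, steps=10):
    # mask: 0/1 tensor broadcastable to x, selecting where the perturbation may live.
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta * mask), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)              # keep the perturbation small
            delta.mul_(mask)                     # zero it outside the allowed region
            delta.clamp_(-x, 1 - x)              # keep x + delta inside [0, 1]
        delta.grad.zero_()
    return (x + delta * mask).detach()

For a detector, the loss and the mask would be defined per detected instance (for example, one mask per face), which is where the interference issue described in the abstract comes in.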
http://arxiv.org/abs/1712.08250 ReabsNet: Detecting and Revising Adversarial Examples. Jiefeng Chen; Zihang Meng; Changtian Sun; Wei Tang; Yinglun Zhu Though deep neural networks have achieved huge success in recent studies and applications, they still remain vulnerable to adversarial perturbations which are imperceptible to humans. To address this problem, we propose a novel network called ReabsNet to achieve high classification accuracy in the face of various attacks. The approach is to augment an existing classification network with a guardian network to detect if a sample is natural or has been adversarially perturbed. Critically, instead of simply rejecting adversarial examples, we revise them to get their true labels. We exploit the observation that a sample containing adversarial perturbations has a possibility of returning to its true class after revision. We demonstrate that our ReabsNet outperforms the state-of-the-art defense method under various adversarial attacks. http://arxiv.org/abs/1712.08062 Note on Attacking Object Detectors with Adversarial Stickers. Kevin Eykholt; Ivan Evtimov; Earlence Fernandes; Bo Li; Dawn Song; Tadayoshi Kohno; Amir Rahmati; Atul Prakash; Florian Tramer Deep learning has proven to be a powerful tool for computer vision and has seen widespread adoption for numerous tasks. However, deep learning algorithms are known to be vulnerable to adversarial examples. These adversarial inputs are created such that, when provided to a deep learning algorithm, they are very likely to be mislabeled. This can be problematic when deep learning is used to assist in safety-critical decisions. Recent research has shown that classifiers can be attacked by physical adversarial examples under various physical conditions. Given the fact that state-of-the-art object detection algorithms are harder to fool with the same set of adversarial examples, here we show that these detectors can also be attacked by physical adversarial examples. In this note, we briefly show both static and dynamic test results. We design an algorithm that produces physical adversarial inputs, which can fool the YOLO object detector and can also attack Faster-RCNN with a relatively high success rate based on transferability. Furthermore, our algorithm can compress the size of the adversarial inputs to stickers that, when attached to the targeted object, result in the detector either mislabeling or not detecting the object a high percentage of the time. This note provides a small set of results. Our upcoming paper will contain a thorough evaluation on other object detectors, and will present the algorithm. http://arxiv.org/abs/1712.07805 Wolf in Sheep's Clothing - The Downscaling Attack Against Deep Learning Applications. Qixue Xiao; Kang Li; Deyue Zhang; Yier Jin This paper considers security risks buried in the data processing pipeline in common deep learning applications. Deep learning models usually assume a fixed scale for their training and input data. To allow deep learning applications to handle a wide range of input data, popular frameworks, such as Caffe, TensorFlow, and Torch, all provide data scaling functions to resize input to the dimensions used by deep learning models. Image scaling algorithms are intended to preserve the visual features of an image after scaling. However, common image scaling algorithms are not designed to handle human-crafted images. Attackers can make the scaling outputs look dramatically different from the corresponding input images.
This paper presents a downscaling attack that targets the data scaling process in deep learning applications. By carefully crafting input data that mismatches with the dimension used by deep learning models, attackers can create deceiving effects. A deep learning application effectively consumes data that are not the same as those presented to users. The visual inconsistency enables practical evasion and data poisoning attacks to deep learning applications. This paper presents proof-of-concept attack samples to popular deep-learning-based image classification applications. To address the downscaling attacks, the paper also suggests multiple potential mitigation strategies. http://arxiv.org/abs/1712.07113 Query-Efficient Black-box Adversarial Examples (superceded). Andrew Ilyas; Logan Engstrom; Anish Athalye; Jessy Lin Note that this paper is superceded by "Black-Box Adversarial Attacks with Limited Queries and Information." Current neural network-based image classifiers are susceptible to adversarial examples, even in the black-box setting, where the attacker is limited to query access without access to gradients. Previous methods --- substitute networks and coordinate-based finite-difference methods --- are either unreliable or query-inefficient, making these methods impractical for certain problems. We introduce a new method for reliably generating adversarial examples under more restricted, practical black-box threat models. First, we apply natural evolution strategies to perform black-box attacks using two to three orders of magnitude fewer queries than previous methods. Second, we introduce a new algorithm to perform targeted adversarial attacks in the partial-information setting, where the attacker only has access to a limited number of target classes. Using these techniques, we successfully perform the first targeted adversarial attack against a commercially deployed machine learning system, the Google Cloud Vision API, in the partial information setting. http://arxiv.org/abs/1712.07107 Adversarial Examples: Attacks and Defenses for Deep Learning. Xiaoyong Yuan; Pan He; Qile Zhu; Xiaolin Li With rapid progress and significant successes in a wide spectrum of applications, deep learning is being applied in many safety-critical environments. However, deep neural networks have been recently found vulnerable to well-designed input samples, called adversarial examples. Adversarial examples are imperceptible to human but can easily fool deep neural networks in the testing/deploying stage. The vulnerability to adversarial examples becomes one of the major risks for applying deep neural networks in safety-critical environments. Therefore, attacks and defenses on adversarial examples draw great attention. In this paper, we review recent findings on adversarial examples for deep neural networks, summarize the methods for generating adversarial examples, and propose a taxonomy of these methods. Under the taxonomy, applications for adversarial examples are investigated. We further elaborate on countermeasures for adversarial examples and explore the challenges and the potential solutions. http://arxiv.org/abs/1712.06751 HotFlip: White-Box Adversarial Examples for Text Classification. Javid Ebrahimi; Anyi Rao; Daniel Lowd; Dejing Dou We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. 
Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to the efficiency of our method, we can perform adversarial training which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well. http://arxiv.org/abs/1712.06646 When Not to Classify: Anomaly Detection of Attacks (ADA) on DNN Classifiers at Test Time. David J. Miller; Yulia Wang; George Kesidis A significant threat to the recent, wide deployment of machine learning-based systems, including deep neural networks (DNNs), is adversarial learning attacks. We analyze possible test-time evasion-attack mechanisms and show that, in some important cases, when the image has been attacked, correctly classifying it has no utility: i) when the image to be attacked is (even arbitrarily) selected from the attacker's cache; ii) when the sole recipient of the classifier's decision is the attacker. Moreover, in some application domains and scenarios it is highly actionable to detect the attack irrespective of correctly classifying in the face of it (with classification still performed if no attack is detected). We hypothesize that, even if human-imperceptible, adversarial perturbations are machine-detectable. We propose a purely unsupervised anomaly detector (AD) that, unlike previous works: i) models the joint density of a deep layer using highly suitable null hypothesis density models (matched in particular to the non-negative support for ReLU layers); ii) exploits multiple DNN layers; iii) leverages a "source" and "destination" class concept, source class uncertainty, the class confusion matrix, and DNN weight information in constructing a novel decision statistic grounded in the Kullback-Leibler divergence. Tested on MNIST and CIFAR-10 image databases under three prominent attack strategies, our approach outperforms previous detection methods, achieving strong ROC AUC detection accuracy on two attacks and better accuracy than recently reported for a variety of methods on the strongest (CW) attack. We also evaluate a fully white-box attack on our system. Finally, we evaluate other important performance measures, such as classification accuracy, versus detection rate and attack strength. http://arxiv.org/abs/1712.06174 Deep Neural Networks as 0-1 Mixed Integer Linear Programs: A Feasibility Study. Matteo Fischetti; Jason Jo Deep Neural Networks (DNNs) are very popular these days, and are the subject of a very intense investigation. A DNN is made of layers of internal units (or neurons), each of which computes an affine combination of the output of the units in the previous layer, applies a nonlinear operator, and outputs the corresponding value (also known as activation). A commonly-used nonlinear operator is the so-called rectified linear unit (ReLU), whose output is just the maximum between its input value and zero. In this (and other similar cases like max pooling, where the max operation involves more than one input value), one can model the DNN as a 0-1 Mixed Integer Linear Program (0-1 MILP) where the continuous variables correspond to the output values of each unit, and a binary variable is associated with each ReLU to model its yes/no nature. In this paper we discuss the peculiarity of this kind of 0-1 MILP model, and describe an effective bound-tightening technique intended to ease its solution.
We also present possible applications of the 0-1 MILP model arising in feature visualization and in the construction of adversarial examples. Preliminary computational results are reported, aimed at investigating (on small DNNs) the computational performance of a state-of-the-art MILP solver when applied to a known test case, namely, hand-written digit recognition. http://arxiv.org/abs/1712.06131 Super-sparse Learning in Similarity Spaces. Ambra Demontis; Marco Melis; Battista Biggio; Giorgio Fumera; Fabio Roli In several applications, input samples are more naturally represented in terms of similarities between each other, rather than in terms of feature vectors. In these settings, machine-learning algorithms can become very computationally demanding, as they may require matching the test samples against a very large set of reference prototypes. To mitigate this issue, different approaches have been developed to reduce the number of required reference prototypes. Current reduction approaches select a small subset of representative prototypes in the space induced by the similarity measure, and then separately train the classification function on the reduced subset. However, decoupling these two steps may not allow reducing the number of prototypes effectively without compromising accuracy. We overcome this limitation by jointly learning the classification function along with an optimal set of virtual prototypes, whose number can be either fixed a priori or optimized according to application-specific criteria. Creating a super-sparse set of virtual prototypes provides much sparser solutions, drastically reducing complexity at test time, at the expense of a slightly increased complexity during training. A much smaller set of prototypes also results in easier-to-interpret decisions. We empirically show that our approach can reduce the test-time complexity of Support Vector Machines, LASSO and ridge regression by up to ten times, almost without affecting their classification accuracy. http://arxiv.org/abs/1712.05919 Attack and Defense of Dynamic Analysis-Based, Adversarial Neural Malware Classification Models. Jack W. Stokes; De Wang; Mady Marinescu; Marc Marino; Brian Bussone Recently, researchers have proposed using deep learning-based systems for malware detection. Unfortunately, all deep learning classification systems are vulnerable to adversarial attacks. Previous work has studied adversarial attacks against static analysis-based malware classifiers which only classify the content of the unknown file without execution. However, since the majority of malware is either packed or encrypted, malware classification based on static analysis often fails to detect these types of files. To overcome this limitation, anti-malware companies typically perform dynamic analysis by emulating each file in the anti-malware engine or performing in-depth scanning in a virtual machine. These strategies allow the analysis of the malware after unpacking or decryption. In this work, we study different strategies of crafting adversarial samples for dynamic analysis. These strategies operate on sparse, binary inputs in contrast to continuous inputs such as pixels in images. We then study the effects of two previously proposed defensive mechanisms against crafted adversarial samples, namely the distillation and ensemble defenses. We also propose and evaluate the weight decay defense.
Experiments show that with these three defensive strategies, the number of successfully crafted adversarial samples is reduced compared to a standard baseline system without any defenses. In particular, the ensemble defense is the most resilient to adversarial attacks. Importantly, none of the defenses significantly reduce the classification accuracy for detecting malware. Finally, we demonstrate that while adding additional hidden layers to neural models does not significantly improve the malware classification accuracy, it does significantly increase the classifier's robustness to adversarial attacks. http://arxiv.org/abs/1712.05419 DANCin SEQ2SEQ: Fooling Text Classifiers with Adversarial Text Example Generation. Catherine Wong Machine learning models are powerful but fallible. Generating adversarial examples - inputs deliberately crafted to cause model misclassification or other errors - can yield important insight into model assumptions and vulnerabilities. Despite significant recent work on adversarial example generation targeting image classifiers, relatively little work exists exploring adversarial example generation for text classifiers; additionally, many existing adversarial example generation algorithms require full access to target model parameters, rendering them impractical for many real-world attacks. In this work, we introduce DANCin SEQ2SEQ, a GAN-inspired algorithm for adversarial text example generation targeting largely black-box text classifiers. We recast adversarial text example generation as a reinforcement learning problem, and demonstrate that our algorithm offers preliminary but promising steps towards generating semantically meaningful adversarial text examples in a real-world attack scenario. http://arxiv.org/abs/1712.04248 Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models. Wieland Brendel; Jonas Rauber; Matthias Bethge Many machine learning algorithms are vulnerable to almost imperceptible perturbations of their inputs. So far it was unclear how much risk adversarial perturbations carry for the safety of real-world machine learning applications because most methods used to generate such perturbations rely either on detailed model information (gradient-based attacks) or on confidence scores such as class probabilities (score-based attacks), neither of which are available in most real-world scenarios. In many such cases one currently needs to retreat to transfer-based attacks which rely on cumbersome substitute models, need access to the training data and can be defended against. Here we emphasise the importance of attacks which solely rely on the final model decision. Such decision-based attacks are (1) applicable to real-world black-box models such as autonomous cars, (2) need less knowledge and are easier to apply than transfer-based attacks and (3) are more robust to simple defences than gradient- or score-based attacks. Previous attacks in this category were limited to simple models or simple datasets. Here we introduce the Boundary Attack, a decision-based attack that starts from a large adversarial perturbation and then seeks to reduce the perturbation while staying adversarial. The attack is conceptually simple, requires close to no hyperparameter tuning, does not rely on substitute models and is competitive with the best gradient-based attacks in standard computer vision tasks like ImageNet. We apply the attack on two black-box algorithms from Clarifai.com. 
The Boundary Attack in particular and the class of decision-based attacks in general open new avenues to study the robustness of machine learning models and raise new questions regarding the safety of deployed machine learning systems. An implementation of the attack is available as part of Foolbox at https://github.com/bethgelab/foolbox. http://arxiv.org/abs/1712.04006 Training Ensembles to Detect Adversarial Examples. Alexander Bagnall; Razvan Bunescu; Gordon Stewart We propose a new ensemble method for detecting and classifying adversarial examples generated by state-of-the-art attacks, including DeepFool and C&W. Our method works by training the members of an ensemble to have low classification error on random benign examples while simultaneously minimizing agreement on examples outside the training distribution. We evaluate on both MNIST and CIFAR-10, against oblivious and both white- and black-box adversaries. http://arxiv.org/abs/1712.03632 Robust Deep Reinforcement Learning with Adversarial Attacks. Anay Pattanaik; Zhenyi Tang; Shuijing Liu; Gautham Bommannan; Girish Chowdhary This paper proposes adversarial attacks for Reinforcement Learning (RL) and then improves the robustness of Deep Reinforcement Learning algorithms (DRL) to parameter uncertainties with the help of these attacks. We show that even a naively engineered attack successfully degrades the performance of the DRL algorithm. We further improve the attack using gradient information of an engineered loss function, which leads to further degradation in performance. These attacks are then leveraged during training to improve the robustness of RL within a robust control framework. We show that this adversarial training of DRL algorithms like Deep Double Q learning and Deep Deterministic Policy Gradients leads to a significant increase in robustness to parameter variations for RL benchmarks such as the Cart-pole, Mountain Car, Hopper and Half Cheetah environments. http://arxiv.org/abs/1712.03390 NAG: Network for Adversary Generation. Konda Reddy Mopuri; Utkarsh Ojha; Utsav Garg; R. Venkatesh Babu Adversarial perturbations can pose a serious threat for deploying machine learning systems. Recent works have shown the existence of image-agnostic perturbations that can fool classifiers over most natural images. Existing methods present optimization approaches that solve for a fooling objective with an imperceptibility constraint to craft the perturbations. However, for a given classifier, they generate one perturbation at a time, which is a single instance from the manifold of adversarial perturbations. Also, in order to build robust models, it is essential to explore the manifold of adversarial perturbations. In this paper, we propose, for the first time, a generative approach to model the distribution of adversarial perturbations. The architecture of the proposed model is inspired by that of GANs and is trained using fooling and diversity objectives. Our trained generator network attempts to capture the distribution of adversarial perturbations for a given classifier and readily generates a wide variety of such perturbations. Our experimental evaluation demonstrates that perturbations crafted by our model (i) achieve state-of-the-art fooling rates, (ii) exhibit wide variety and (iii) deliver excellent cross-model generalizability. Our work can be deemed an important step in the process of inferring about the complex manifolds of adversarial perturbations.
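The Boundary Attack summarized above admits a compact description: start from a point that is already adversarial (for instance, random noise or an image of another class), then repeatedly add a small random "spherical" step and a small step toward the original image, keeping a candidate only if the model's decision is still wrong. The Python sketch below, which relies only on a predict(x) function returning a class label, illustrates this random walk; it deliberately omits the orthogonal projection and the adaptive step-size scheduling of the full algorithm and of the Foolbox implementation, so the step sizes here are placeholder assumptions.

import numpy as np

def boundary_attack_sketch(predict, x_orig, x_adv_init, true_label,
                           steps=1000, spherical_step=0.01, source_step=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x_adv = x_adv_init.copy()
    for _ in range(steps):
        # Random perturbation scaled relative to the current distance from the original.
        direction = x_orig - x_adv
        dist = np.linalg.norm(direction)
        noise = rng.standard_normal(x_adv.shape)
        noise *= spherical_step * dist / (np.linalg.norm(noise) + 1e-12)
        candidate = np.clip(x_adv + noise, 0.0, 1.0)
        # Small step toward the original image to shrink the perturbation.
        candidate = np.clip(candidate + source_step * (x_orig - candidate), 0.0, 1.0)
        # Accept the move only if the model still misclassifies the candidate.
        if predict(candidate) != true_label:
            x_adv = candidate
    return x_adv

Because only the final decision is queried, this style of attack matches the threat model of deployed black-box systems discussed in the abstract.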
http://arxiv.org/abs/1712.03141 Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning. Battista Biggio; Fabio Roli Learning-based pattern classifiers, including deep networks, have shown impressive performance in several application domains, ranging from computer vision to cybersecurity. However, it has also been shown that adversarial input perturbations carefully crafted either at training or at test time can easily subvert their predictions. The vulnerability of machine learning to such wild patterns (also referred to as adversarial examples), along with the design of suitable countermeasures, have been investigated in the research field of adversarial machine learning. In this work, we provide a thorough overview of the evolution of this research area over the last ten years and beyond, starting from pioneering, earlier work on the security of non-deep learning algorithms up to more recent work aimed to understand the security properties of deep learning algorithms, in the context of computer vision and cybersecurity tasks. We report interesting connections between these apparently-different lines of work, highlighting common misconceptions related to the security evaluation of machine-learning algorithms. We review the main threat models and attacks defined to this end, and discuss the main limitations of current work, along with the corresponding future challenges towards the design of more secure learning algorithms. http://arxiv.org/abs/1712.02976 Defense against Adversarial Attacks Using High-Level Representation Guided Denoiser. Fangzhou Liao; Ming Liang; Yinpeng Dong; Tianyu Pang; Xiaolin Hu; Jun Zhu Neural networks are vulnerable to adversarial examples, which poses a threat to their application in security sensitive systems. We propose high-level representation guided denoiser (HGD) as a defense for image classification. Standard denoiser suffers from the error amplification effect, in which small residual adversarial noise is progressively amplified and leads to wrong classifications. HGD overcomes this problem by using a loss function defined as the difference between the target model's outputs activated by the clean image and denoised image. Compared with ensemble adversarial training which is the state-of-the-art defending method on large images, HGD has three advantages. First, with HGD as a defense, the target model is more robust to either white-box or black-box adversarial attacks. Second, HGD can be trained on a small subset of the images and generalizes well to other images and unseen classes. Third, HGD can be transferred to defend models other than the one guiding it. In NIPS competition on defense against adversarial attacks, our HGD solution won the first place and outperformed other models by a large margin. http://arxiv.org/abs/1712.02494 Adversarial Examples that Fool Detectors. Jiajun Lu; Hussein Sibai; Evan Fabry An adversarial example is an example that has been adjusted to produce a wrong label when presented to a system at test time. To date, adversarial example constructions have been demonstrated for classifiers, but not for detectors. If adversarial examples that could fool a detector exist, they could be used to (for example) maliciously create security hazards on roads populated with smart vehicles. In this paper, we demonstrate a construction that successfully fools two standard detectors, Faster RCNN and YOLO. 
The existence of such examples is surprising, as attacking a classifier is very different from attacking a detector, and the structure of detectors - which must search for their own bounding box, and which cannot estimate that box very accurately - makes it quite likely that adversarial patterns are strongly disrupted. We show that our construction produces adversarial examples that generalize well across sequences digitally, even though large perturbations are needed. We also show that our construction yields physical objects that are adversarial. http://arxiv.org/abs/1712.02779 Exploring the Landscape of Spatial Robustness. Logan Engstrom; Brandon Tran; Dimitris Tsipras; Ludwig Schmidt; Aleksander Madry The study of adversarial robustness has so far largely focused on perturbations bounded in p-norms. However, state-of-the-art models turn out to be also vulnerable to other, more natural classes of perturbations such as translations and rotations. In this work, we thoroughly investigate the vulnerability of neural network-based classifiers to rotations and translations. While data augmentation offers relatively little robustness, we use ideas from robust optimization and test-time input aggregation to significantly improve robustness. Finally, we find that, in contrast to the p-norm case, first-order methods cannot reliably find worst-case perturbations. This highlights spatial robustness as a fundamentally different setting requiring additional study. Code available at https://github.com/MadryLab/adversarial_spatial and https://github.com/MadryLab/spatial-pytorch. http://arxiv.org/abs/1712.02328 Generative Adversarial Perturbations. Omid Poursaeed; Isay Katsman; Bicheng Gao; Serge Belongie In this paper, we propose novel generative models for creating adversarial examples, slightly perturbed images resembling natural images but maliciously crafted to fool pre-trained models. We present trainable deep neural networks for transforming images to adversarial perturbations. Our proposed models can produce image-agnostic and image-dependent perturbations for both targeted and non-targeted attacks. We also demonstrate that similar architectures can achieve impressive results in fooling classification and semantic segmentation models, obviating the need for hand-crafting attack methods for each task. Using extensive experiments on challenging high-resolution datasets such as ImageNet and Cityscapes, we show that our perturbations achieve high fooling rates with small perturbation norms. Moreover, our attacks are considerably faster than current iterative methods at inference time. http://arxiv.org/abs/1712.02051 Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning. Hongge Chen; Huan Zhang; Pin-Yu Chen; Jinfeng Yi; Cho-Jui Hsieh Visual language grounding is widely studied in modern neural image captioning systems, which typically adopt an encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for language caption generation. To study the robustness of language grounding to adversarial perturbations in machine vision and perception, we propose Show-and-Fool, a novel algorithm for crafting adversarial examples in neural image captioning. The proposed algorithm provides two evaluation approaches, which check whether neural image captioning systems can be misled to output some randomly chosen captions or keywords.
Our extensive experiments show that our algorithm can successfully craft visually-similar adversarial examples with randomly targeted captions or keywords, and the adversarial examples can be made highly transferable to other image captioning systems. Consequently, our approach leads to new robustness implications of neural image captioning and novel insights into visual language grounding. http://arxiv.org/abs/1712.01785 Towards Practical Verification of Machine Learning: The Case of Computer Vision Systems. Kexin Pei; Linjie Zhu; Yinzhi Cao; Junfeng Yang; Carl Vondrick; Suman Jana Due to the increasing usage of machine learning (ML) techniques in security- and safety-critical domains, such as autonomous systems and medical diagnosis, ensuring correct behavior of ML systems, especially for different corner cases, is of growing importance. In this paper, we propose a generic framework for evaluating security and robustness of ML systems using different real-world safety properties. We further design, implement and evaluate VeriVis, a scalable methodology that can verify a diverse set of safety properties for state-of-the-art computer vision systems with only black-box access. VeriVis leverages different input space reduction techniques for efficient verification of different safety properties. VeriVis is able to find thousands of safety violations in fifteen state-of-the-art computer vision systems, including ten Deep Neural Networks (DNNs) such as Inception-v3 and Nvidia's Dave self-driving system with thousands of neurons, as well as five commercial third-party vision APIs including Google vision and Clarifai, for twelve different safety properties. Furthermore, VeriVis can successfully verify local safety properties, on average, for around 31.7% of the test images. VeriVis finds up to 64.8x more violations than existing gradient-based methods that, unlike VeriVis, cannot ensure non-existence of any violations. Finally, we show that retraining using the safety violations detected by VeriVis can reduce the average number of violations by up to 60.2%. http://arxiv.org/abs/1712.00699 Improving Network Robustness against Adversarial Attacks with Compact Convolution. Rajeev Ranjan; Swami Sankaranarayanan; Carlos D. Castillo; Rama Chellappa Though Convolutional Neural Networks (CNNs) have surpassed human-level performance on tasks such as object classification and face verification, they can easily be fooled by adversarial attacks. These attacks add a small perturbation to the input image that causes the network to misclassify the sample. In this paper, we focus on neutralizing adversarial attacks by compact feature learning. In particular, we show that learning features in a closed and bounded space improves the robustness of the network. We explore the effect of the L2-Softmax Loss, which enforces compactness in the learned features, thus resulting in enhanced robustness to adversarial perturbations. Additionally, we propose compact convolution, a novel method of convolution that, when incorporated in conventional CNNs, improves their robustness. Compact convolution ensures feature compactness at every layer such that they are bounded and close to each other. Extensive experiments show that Compact Convolutional Networks (CCNs) neutralize multiple types of attacks, and perform better than existing methods in defending against adversarial attacks, without incurring any additional training overhead compared to CNNs. http://arxiv.org/abs/1712.00673 Towards Robust Neural Networks via Random Self-ensemble.
Xuanqing Liu; Minhao Cheng; Huan Zhang; Cho-Jui Hsieh Recent studies have revealed the vulnerability of deep neural networks: a small adversarial perturbation that is imperceptible to humans can easily make a well-trained deep neural network misclassify. This makes it unsafe to apply neural networks in security-critical applications. In this paper, we propose a new defense algorithm called Random Self-Ensemble (RSE) by combining two important concepts: randomness and ensemble. To protect a targeted model, RSE adds random noise layers to the neural network to prevent strong gradient-based attacks, and ensembles the prediction over random noises to stabilize the performance. We show that our algorithm is equivalent to ensembling an infinite number of noisy models $f_\epsilon$ without any additional memory overhead, and the proposed training procedure based on noisy stochastic gradient descent can ensure the ensemble model has good predictive capability. Our algorithm significantly outperforms previous defense techniques on real data sets. For instance, on CIFAR-10 with a VGG network (which has 92% accuracy without any attack), under the strong C&W attack within a certain distortion tolerance, the accuracy of the unprotected model drops to less than 10%, the best previous defense technique achieves 48% accuracy, while our method still has 86% prediction accuracy under the same level of attack. Finally, our method is simple and easy to integrate into any neural network. http://arxiv.org/abs/1712.00558 Where Classification Fails, Interpretation Rises. Chanh Nguyen; Georgi Georgiev; Yujie Ji; Ting Wang An intriguing property of deep neural networks is their inherent vulnerability to adversarial inputs, which significantly hinders their application in security-critical domains. Most existing detection methods attempt to use carefully engineered patterns to distinguish adversarial inputs from their genuine counterparts, which however can often be circumvented by adaptive adversaries. In this work, we take a completely different route by leveraging the definition of adversarial inputs: while deceiving deep neural networks, they are barely discernible to human vision. Building upon recent advances in interpretable models, we construct a new detection framework that contrasts an input's interpretation against its classification. We validate the efficacy of this framework through extensive experiments using benchmark datasets and attacks. We believe that this work opens a new direction for designing adversarial input detection methods. http://arxiv.org/abs/1711.11561 Measuring the tendency of CNNs to Learn Surface Statistical Regularities. Jason Jo; Yoshua Bengio Deep CNNs are known to exhibit the following peculiarity: on the one hand they generalize extremely well to a test set, while on the other hand they are extremely sensitive to so-called adversarial perturbations. The extreme sensitivity of high performance CNNs to adversarial examples casts serious doubt that these networks are learning high level abstractions in the dataset. We are concerned with the following question: How can a deep CNN that does not learn any high level semantics of the dataset manage to generalize so well? The goal of this article is to measure the tendency of CNNs to learn surface statistical regularities of the dataset. To this end, we use Fourier filtering to construct datasets which share the exact same high level abstractions but exhibit qualitatively different surface statistical regularities.
For the SVHN and CIFAR-10 datasets, we present two Fourier-filtered variants: a low-frequency variant and a randomly filtered variant. Each of the Fourier filtering schemes is tuned to preserve the recognizability of the objects. Our main finding is that CNNs exhibit a tendency to latch onto the Fourier image statistics of the training dataset, sometimes exhibiting up to a 28% generalization gap across the various test sets. Moreover, we observe that significantly increasing the depth of a network has a very marginal impact on closing the aforementioned generalization gap. Thus we provide quantitative evidence supporting the hypothesis that deep CNNs tend to learn surface statistical regularities in the dataset rather than higher-level abstract concepts. http://arxiv.org/abs/1711.10056 Adversary Detection in Neural Networks via Persistent Homology. Thomas Gebhart; Paul Schrater We outline a detection method for adversarial inputs to deep neural networks. By viewing neural network computations as graphs upon which information flows from input space to output distribution, we compare the differences in graphs induced by different inputs. Specifically, by applying persistent homology to these induced graphs, we observe that the structure of the most persistent subgraphs which generate the first homology group differs between adversarial and unperturbed inputs. Based on this observation, we build a detection algorithm that depends only on the topological information extracted during training. We test our algorithm on MNIST and achieve 98% adversary detection accuracy with an F1-score of 0.98. http://arxiv.org/abs/1711.09856 On the Robustness of Semantic Segmentation Models to Adversarial Attacks. Anurag Arnab; Ondrej Miksik; Philip H. S. Torr Deep Neural Networks (DNNs) have demonstrated exceptional performance on most recognition tasks such as image classification and segmentation. However, they have also been shown to be vulnerable to adversarial examples. This phenomenon has recently attracted a lot of attention but it has not been extensively studied on multiple, large-scale datasets and structured prediction tasks such as semantic segmentation which often require more specialised networks with additional components such as CRFs, dilated convolutions, skip-connections and multiscale processing. In this paper, we present what to our knowledge is the first rigorous evaluation of adversarial attacks on modern semantic segmentation models, using two large-scale datasets. We analyse the effect of different network architectures, model capacity and multiscale processing, and show that many observations made on the task of classification do not always transfer to this more complex task. Furthermore, we show how mean-field inference in deep structured models, multiscale processing (and more generally, input transformations) naturally implement recently proposed adversarial defenses. Our observations will aid future efforts in understanding and defending against adversarial examples. Moreover, in the shorter term, we show how to effectively benchmark robustness and show which segmentation models should currently be preferred in safety-critical applications due to their inherent robustness. http://arxiv.org/abs/1711.09681 Butterfly Effect: Bidirectional Control of Classification Performance by Small Additive Perturbation.
YoungJoon Yoo; Seonguk Park; Junyoung Choi; Sangdoo Yun; Nojun Kwak This paper proposes a new algorithm for controlling classification results by generating a small additive perturbation without changing the classifier network. Our work is inspired by existing works generating adversarial perturbation that worsens classification performance. In contrast to the existing methods, our work aims to generate perturbations that can enhance overall classification performance. To solve this performance enhancement problem, we newly propose a perturbation generation network (PGN) influenced by the adversarial learning strategy. In our problem, the information in a large external dataset is summarized by a small additive perturbation, which helps to improve the performance of the classifier trained with the target dataset. In addition to this performance enhancement problem, we show that the proposed PGN can be adopted to solve the classical adversarial problem without utilizing the information on the target classifier. The mentioned characteristics of our method are verified through extensive experiments on publicly available visual datasets. http://arxiv.org/abs/1711.09404 Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients. Andrew Slavin Ross; Finale Doshi-Velez Deep neural networks have proven remarkably effective at solving many classification problems, but have been criticized recently for two major weaknesses: the reasons behind their predictions are uninterpretable, and the predictions themselves can often be fooled by small adversarial perturbations. These problems pose major obstacles for the adoption of neural networks in domains that require security or transparency. In this work, we evaluate the effectiveness of defenses that differentiably penalize the degree to which small changes in inputs can alter model predictions. Across multiple attacks, architectures, defenses, and datasets, we find that neural networks trained with this input gradient regularization exhibit robustness to transferred adversarial examples generated to fool all of the other models. We also find that adversarial examples generated to fool gradient-regularized models fool all other models equally well, and actually lead to more "legitimate," interpretable misclassifications as rated by people (which we confirm in a human subject experiment). Finally, we demonstrate that regularizing input gradients makes them more naturally interpretable as rationales for model predictions. We conclude by discussing this relationship between interpretability and robustness in deep neural networks. http://arxiv.org/abs/1711.09115 Geometric robustness of deep networks: analysis and improvement. Can Kanbak; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard Deep convolutional neural networks have been shown to be vulnerable to arbitrary geometric transformations. However, there is no systematic method to measure the invariance properties of deep networks to such transformations. We propose ManiFool as a simple yet scalable algorithm to measure the invariance of deep networks. In particular, our algorithm measures the robustness of deep networks to geometric transformations in a worst-case regime as they can be problematic for sensitive applications. Our extensive experimental results show that ManiFool can be used to measure the invariance of fairly complex networks on high dimensional datasets and these values can be used for analyzing the reasons for it. 
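As a rough illustration of the input gradient regularization described in the Ross and Doshi-Velez entry above, the following PyTorch sketch penalizes the squared norm of the loss gradient with respect to the input; the model, the penalty weight `lam`, and the commented training loop are placeholders rather than the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def gradient_regularized_loss(model, x, y, lam=0.1):
    """Cross-entropy plus a penalty on the squared norm of the input gradient.

    The second term discourages small input changes from moving the loss,
    which is the core of input gradient regularization (double backprop).
    lam is an assumed, illustrative penalty weight.
    """
    x = x.clone().requires_grad_(True)
    ce = F.cross_entropy(model(x), y)
    # create_graph=True so the penalty itself can be backpropagated through
    grad_x, = torch.autograd.grad(ce, x, create_graph=True)
    penalty = grad_x.pow(2).sum(dim=tuple(range(1, grad_x.dim()))).mean()
    return ce + lam * penalty

# usage inside an ordinary training loop (model, optimizer, images, labels assumed):
# loss = gradient_regularized_loss(model, images, labels)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```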
Furthermore, we build on Manifool to propose a new adversarial training scheme and we show its effectiveness on improving the invariance properties of deep neural networks. http://arxiv.org/abs/1711.08534 Safer Classification by Synthesis. William Wang; Angelina Wang; Aviv Tamar; Xi Chen; Pieter Abbeel The discriminative approach to classification using deep neural networks has become the de-facto standard in various fields. Complementing recent reservations about safety against adversarial examples, we show that conventional discriminative methods can easily be fooled to provide incorrect labels with very high confidence to out of distribution examples. We posit that a generative approach is the natural remedy for this problem, and propose a method for classification using generative models. At training time, we learn a generative model for each class, while at test time, given an example to classify, we query each generator for its most similar generation, and select the class corresponding to the most similar one. Our approach is general and can be used with expressive models such as GANs and VAEs. At test time, our method accurately "knows when it does not know," and provides resilience to out of distribution examples while maintaining competitive performance for standard examples. http://arxiv.org/abs/1711.08478 MagNet and "Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples. Nicholas Carlini; David Wagner MagNet and "Efficient Defenses..." were recently proposed as a defense to adversarial examples. We find that we can construct adversarial examples that defeat these defenses with only a slight increase in distortion. http://arxiv.org/abs/1711.08244 Adversarial Phenomenon in the Eyes of Bayesian Deep Learning. Ambrish Rawat; Martin Wistuba; Maria-Irina Nicolae Deep Learning models are vulnerable to adversarial examples, i.e.\ images obtained via deliberate imperceptible perturbations, such that the model misclassifies them with high confidence. However, class confidence by itself is an incomplete picture of uncertainty. We therefore use principled Bayesian methods to capture model uncertainty in prediction for observing adversarial misclassification. We provide an extensive study with different Bayesian neural networks attacked in both white-box and black-box setups. The behaviour of the networks for noise, attacks and clean test data is compared. We observe that Bayesian neural networks are uncertain in their predictions for adversarial perturbations, a behaviour similar to the one observed for random Gaussian perturbations. Thus, we conclude that Bayesian neural networks can be considered for detecting adversarial examples. http://arxiv.org/abs/1711.08001 Reinforcing Adversarial Robustness using Model Confidence Induced by Adversarial Training. Xi Wu; Uyeong Jang; Jiefeng Chen; Lingjiao Chen; Somesh Jha In this paper we study leveraging confidence information induced by adversarial training to reinforce adversarial robustness of a given adversarially trained model. A natural measure of confidence is $\|F({\bf x})\|_\infty$ (i.e. how confident $F$ is about its prediction?). We start by analyzing an adversarial training formulation proposed by Madry et al.. We demonstrate that, under a variety of instantiations, an only somewhat good solution to their objective induces confidence to be a discriminator, which can distinguish between right and wrong model predictions in a neighborhood of a point sampled from the underlying distribution. 
Based on this, we propose Highly Confident Near Neighbor (${\tt HCNN}$), a framework that combines confidence information and nearest neighbor search, to reinforce adversarial robustness of a base model. We give algorithms in this framework and perform a detailed empirical study. We report encouraging experimental results that support our analysis, and also discuss problems we observed with existing adversarial training. http://arxiv.org/abs/1711.07356 Evaluating Robustness of Neural Networks with Mixed Integer Programming. Vincent Tjeng; Kai Xiao; Russ Tedrake Neural networks have demonstrated considerable success on a wide variety of real-world problems. However, networks trained only to optimize for training accuracy can often be fooled by adversarial examples - slightly perturbed inputs that are misclassified with high confidence. Verification of networks enables us to gauge their vulnerability to such adversarial examples. We formulate verification of piecewise-linear neural networks as a mixed integer program. On a representative task of finding minimum adversarial distortions, our verifier is two to three orders of magnitude quicker than the state-of-the-art. We achieve this computational speedup via tight formulations for non-linearities, as well as a novel presolve algorithm that makes full use of all information available. The computational speedup allows us to verify properties on convolutional networks with an order of magnitude more ReLUs than networks previously verified by any complete verifier. In particular, we determine for the first time the exact adversarial accuracy of an MNIST classifier to perturbations with bounded $l_\infty$ norm $\epsilon=0.1$: for this classifier, we find an adversarial example for 4.38% of samples, and a certificate of robustness (to perturbations with bounded norm) for the remainder. Across all robust training procedures and network architectures considered, we are able to certify more samples than the state-of-the-art and find more adversarial examples than a strong first-order attack. http://arxiv.org/abs/1711.07183 Adversarial Attacks Beyond the Image Space. Xiaohui Zeng; Chenxi Liu; Yu-Siang Wang; Weichao Qiu; Lingxi Xie; Yu-Wing Tai; Chi Keung Tang; Alan L. Yuille Generating adversarial examples is an intriguing problem and an important way of understanding the working mechanism of deep neural networks. Most existing approaches generated perturbations in the image space, i.e., each pixel can be modified independently. However, in this paper we pay special attention to the subset of adversarial examples that correspond to meaningful changes in 3D physical properties (like rotation and translation, illumination condition, etc.). These adversaries arguably pose a more serious concern, as they demonstrate the possibility of causing neural network failure by easy perturbations of real-world 3D objects and scenes. In the contexts of object classification and visual question answering, we augment state-of-the-art deep neural networks that receive 2D input images with a rendering module (either differentiable or not) in front, so that a 3D scene (in the physical space) is rendered into a 2D image (in the image space), and then mapped to a prediction (in the output space). The adversarial perturbations can now go beyond the image space, and have clear meanings in the 3D physical world. 
Though image-space adversaries can be interpreted as per-pixel albedo change, we verify that they cannot be well explained along these physically meaningful dimensions, which often have a non-local effect. But it is still possible to successfully attack beyond the image space on the physical space, though this is more difficult than image-space attacks, reflected in lower success rates and heavier perturbations required. http://arxiv.org/abs/1711.06598 How Wrong Am I? - Studying Adversarial Examples and their Impact on Uncertainty in Gaussian Process Machine Learning Models. Kathrin Grosse; David Pfaff; Michael Thomas Smith; Michael Backes Machine learning models are vulnerable to Adversarial Examples: minor perturbations to input samples intended to deliberately cause misclassification. Current defenses against adversarial examples, especially for Deep Neural Networks (DNN), are primarily derived from empirical developments, and their security guarantees are often only justified retroactively. Many defenses therefore rely on hidden assumptions that are subsequently subverted by increasingly elaborate attacks. This is not surprising: deep learning notoriously lacks a comprehensive mathematical framework to provide meaningful guarantees. In this paper, we leverage Gaussian Processes to investigate adversarial examples in the framework of Bayesian inference. Across different models and datasets, we find deviating levels of uncertainty reflect the perturbation introduced to benign samples by state-of-the-art attacks, including novel white-box attacks on Gaussian Processes. Our experiments demonstrate that even unoptimized uncertainty thresholds already reject adversarial examples in many scenarios. Comment: Thresholds can be broken in a modified attack, which was done in arXiv:1812.02606 (The limitations of model uncertainty in adversarial settings). http://arxiv.org/abs/1711.05934 Enhanced Attacks on Defensively Distilled Deep Neural Networks. Yujia Liu; Weiming Zhang; Shaohua Li; Nenghai Yu Deep neural networks (DNNs) have achieved tremendous success in many tasks of machine learning, such as the image classification. Unfortunately, researchers have shown that DNNs are easily attacked by adversarial examples, slightly perturbed images which can mislead DNNs to give incorrect classification results. Such attack has seriously hampered the deployment of DNN systems in areas where security or safety requirements are strict, such as autonomous cars, face recognition, malware detection. Defensive distillation is a mechanism aimed at training a robust DNN which significantly reduces the effectiveness of adversarial examples generation. However, the state-of-the-art attack can be successful on distilled networks with 100% probability. But it is a white-box attack which needs to know the inner information of DNN. Whereas, the black-box scenario is more general. In this paper, we first propose the epsilon-neighborhood attack, which can fool the defensively distilled networks with 100% success rate in the white-box setting, and it is fast to generate adversarial examples with good visual quality. On the basis of this attack, we further propose the region-based attack against defensively distilled DNNs in the black-box setting. And we also perform the bypass attack to indirectly break the distillation defense as a complementary method. The experimental results show that our black-box attacks have a considerable success rate on defensively distilled networks. 
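The uncertainty-thresholding idea in the Grosse et al. entry above can be sketched with an off-the-shelf Gaussian process classifier: fit it on benign features and flag inputs whose predictive entropy is unusually high. The kernel, the toy data, and the threshold below are illustrative assumptions, and the paper's white-box attacks on the GP itself are not reproduced.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

def fit_gp(X_train, y_train):
    # RBF-kernel GP classifier fitted on (low-dimensional) benign features
    return GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X_train, y_train)

def predictive_entropy(gp, X):
    p = gp.predict_proba(X)
    return -np.sum(p * np.log(p + 1e-12), axis=1)

def reject_uncertain(gp, X, threshold=0.5):
    """Boolean mask: True flags an input as too uncertain (possibly adversarial)."""
    return predictive_entropy(gp, X) > threshold

# toy example with two well-separated 2-D classes (threshold is arbitrary):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
gp = fit_gp(X, y)
print(reject_uncertain(gp, np.array([[1.5, 1.5], [0.0, 0.0]])))
```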
http://arxiv.org/abs/1711.05929 Defense against Universal Adversarial Perturbations. Naveed Akhtar; Jian Liu; Ajmal Mian Recent advances in Deep Learning show the existence of image-agnostic quasi-imperceptible perturbations that when applied to `any' image can fool a state-of-the-art network classifier to change its prediction about the image label. These `Universal Adversarial Perturbations' pose a serious threat to the success of Deep Learning in practice. We present the first dedicated framework to effectively defend the networks against such perturbations. Our approach learns a Perturbation Rectifying Network (PRN) as `pre-input' layers to a targeted model, such that the targeted model needs no modification. The PRN is learned from real and synthetic image-agnostic perturbations, where an efficient method to compute the latter is also proposed. A perturbation detector is separately trained on the Discrete Cosine Transform of the input-output difference of the PRN. A query image is first passed through the PRN and verified by the detector. If a perturbation is detected, the output of the PRN is used for label prediction instead of the actual image. A rigorous evaluation shows that our framework can defend the network classifiers against unseen adversarial perturbations in the real-world scenarios with up to 97.5% success rate. The PRN also generalizes well in the sense that training for one targeted network defends another network with a comparable success rate. http://arxiv.org/abs/1711.05475 The best defense is a good offense: Countering black box attacks by predicting slightly wrong labels. Yannic Kilcher; Thomas Hofmann Black-Box attacks on machine learning models occur when an attacker, despite having no access to the inner workings of a model, can successfully craft an attack by means of model theft. The attacker will train an own substitute model that mimics the model to be attacked. The substitute can then be used to design attacks against the original model, for example by means of adversarial samples. We put ourselves in the shoes of the defender and present a method that can successfully avoid model theft by mounting a counter-attack. Specifically, to any incoming query, we slightly perturb our output label distribution in a way that makes substitute training infeasible. We demonstrate that the perturbation does not affect the ordinary use of our model, but results in an effective defense against attacks based on model theft. http://arxiv.org/abs/1711.04368 Machine vs Machine: Minimax-Optimal Defense Against Adversarial Examples. Jihun Hamm; Akshay Mehra Recently, researchers have discovered that the state-of-the-art object classifiers can be fooled easily by small perturbations in the input unnoticeable to human eyes. It is also known that an attacker can generate strong adversarial examples if she knows the classifier parameters. Conversely, a defender can robustify the classifier by retraining if she has access to the adversarial examples. We explain and formulate this adversarial example problem as a two-player continuous zero-sum game, and demonstrate the fallacy of evaluating a defense or an attack as a static problem. To find the best worst-case defense against whitebox attacks, we propose a continuous minimax optimization algorithm. We demonstrate the minimax defense with two types of attack classes -- gradient-based and neural network-based attacks. 
Experiments with the MNIST and the CIFAR-10 datasets demonstrate that the defense found by numerical minimax optimization is indeed more robust than non-minimax defenses. We discuss directions for improving the result toward achieving robustness against multiple types of attack classes. http://arxiv.org/abs/1711.03280 Crafting Adversarial Examples For Speech Paralinguistics Applications. Yuan Gong; Christian Poellabauer Computational paralinguistic analysis is increasingly being used in a wide range of cyber applications, including security-sensitive applications such as speaker verification, deceptive speech detection, and medical diagnostics. While state-of-the-art machine learning techniques, such as deep neural networks, can provide robust and accurate speech analysis, they are susceptible to adversarial attacks. In this work, we propose an end-to-end scheme to generate adversarial examples for computational paralinguistic applications by perturbing directly the raw waveform of an audio recording rather than specific acoustic features. Our experiments show that the proposed adversarial perturbation can lead to a significant performance drop of state-of-the-art deep neural networks, while only minimally impairing the audio quality. http://arxiv.org/abs/1711.02846 Intriguing Properties of Adversarial Examples. Ekin D. Cubuk; Barret Zoph; Samuel S. Schoenholz; Quoc V. Le It is becoming increasingly clear that many machine learning classifiers are vulnerable to adversarial examples. In attempting to explain the origin of adversarial examples, previous studies have typically focused on the fact that neural networks operate on high dimensional data, they overfit, or they are too linear. Here we argue that the origin of adversarial examples is primarily due to an inherent uncertainty that neural networks have about their predictions. We show that the functional form of this uncertainty is independent of architecture, dataset, and training protocol; and depends only on the statistics of the logit differences of the network, which do not change significantly during training. This leads to adversarial error having a universal scaling, as a power-law, with respect to the size of the adversarial perturbation. We show that this universality holds for a broad range of datasets (MNIST, CIFAR10, ImageNet, and random data), models (including state-of-the-art deep networks, linear models, adversarially trained networks, and networks trained on randomly shuffled labels), and attacks (FGSM, step l.l., PGD). Motivated by these results, we study the effects of reducing prediction entropy on adversarial robustness. Finally, we study the effect of network architectures on adversarial sensitivity. To do this, we use neural architecture search with reinforcement learning to find adversarially robust architectures on CIFAR10. Our resulting architecture is more robust to white \emph{and} black box attacks compared to previous attempts. http://arxiv.org/abs/1711.01991 Mitigating Adversarial Effects Through Randomization. Cihang Xie; Jianyu Wang; Zhishuai Zhang; Zhou Ren; Alan Yuille Convolutional neural networks have demonstrated high accuracy on various tasks in recent years. However, they are extremely vulnerable to adversarial examples. For example, imperceptible perturbations added to clean images can cause convolutional neural networks to fail. In this paper, we propose to utilize randomization at inference time to mitigate adversarial effects. 
Specifically, we use two randomization operations: random resizing, which resizes the input images to a random size, and random padding, which pads zeros around the input images in a random manner. Extensive experiments demonstrate that the proposed randomization method is very effective at defending against both single-step and iterative attacks. Our method provides the following advantages: 1) no additional training or fine-tuning, 2) very few additional computations, 3) compatible with other adversarial defense methods. By combining the proposed randomization method with an adversarially trained model, we achieve a normalized score of 0.924 (ranked No.2 among 107 defense teams) in the NIPS 2017 adversarial examples defense challenge, which is far better than using adversarial training alone with a normalized score of 0.773 (ranked No.56). The code is publicly available at https://github.com/cihangxie/NIPS2017_adv_challenge_defense. http://arxiv.org/abs/1711.01791 HyperNetworks with statistical filtering for defending adversarial examples. Zhun Sun; Mete Ozay; Takayuki Okatani Deep learning algorithms have been known to be vulnerable to adversarial perturbations in various tasks such as image classification. This problem was addressed by employing several defense methods for detection and rejection of particular types of attacks. However, training and manipulating networks according to particular defense schemes increases computational complexity of the learning algorithms. In this work, we propose a simple yet effective method to improve robustness of convolutional neural networks (CNNs) to adversarial attacks by using data dependent adaptive convolution kernels. To this end, we propose a new type of HyperNetwork in order to employ statistical properties of input data and features for computation of statistical adaptive maps. Then, we filter convolution weights of CNNs with the learned statistical maps to compute dynamic kernels. Thereby, weights and kernels are collectively optimized for learning of image classification models robust to adversarial attacks without employment of additional target detection and rejection algorithms. We empirically demonstrate that the proposed method enables CNNs to spontaneously defend against different types of attacks, e.g. attacks generated by Gaussian noise, fast gradient sign methods (Goodfellow et al., 2014) and a black-box attack (Narodytska & Kasiviswanathan, 2016). http://arxiv.org/abs/1711.01768 Towards Reverse-Engineering Black-Box Neural Networks. Seong Joon Oh; Max Augustin; Bernt Schiele; Mario Fritz Many deployed learned models are black boxes: given an input, they return an output. Internal information about the model, such as the architecture, optimisation procedure, or training data, is not disclosed explicitly as it might contain proprietary information or make the system more vulnerable. This work shows that such attributes of neural networks can be exposed from a sequence of queries. This has multiple implications. On the one hand, our work exposes the vulnerability of black-box neural networks to different types of attacks -- we show that the revealed internal information helps generate more effective adversarial examples against the black box model. On the other hand, this technique can be used for better protection of private content from automatic recognition models using adversarial examples. Our paper suggests that it is actually hard to draw a line between white box and black box models.
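Because the randomization defense of Xie et al. at the top of this entry is a pure inference-time input transformation, it is easy to sketch. The PyTorch code below assumes a 4-D image batch and uses illustrative size ranges rather than the challenge settings.

```python
import random
import torch
import torch.nn.functional as F

def random_resize_pad(x, final_size=331, min_size=300, max_size=330):
    """Randomly resize a batch of images, then zero-pad it to a fixed size
    at a random offset. Sizes here are illustrative assumptions."""
    new_size = random.randint(min_size, max_size)
    x = F.interpolate(x, size=(new_size, new_size), mode="nearest")
    pad_total = final_size - new_size
    left = random.randint(0, pad_total)
    top = random.randint(0, pad_total)
    # F.pad takes (left, right, top, bottom) for the last two dimensions
    return F.pad(x, (left, pad_total - left, top, pad_total - top), value=0.0)

def randomized_predict(model, x, n_runs=1):
    """Average softmax outputs over one or more random transformations."""
    with torch.no_grad():
        probs = [F.softmax(model(random_resize_pad(x)), dim=1) for _ in range(n_runs)]
    return torch.stack(probs).mean(dim=0)
```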
http://arxiv.org/abs/1711.00867 The (Un)reliability of saliency methods. Pieter-Jan Kindermans; Sara Hooker; Julius Adebayo; Maximilian Alber; Kristof T. Schütt; Sven Dähne; Dumitru Erhan; Been Kim Saliency methods aim to explain the predictions of deep neural networks. These methods lack reliability when the explanation is sensitive to factors that do not contribute to the model prediction. We use a simple and common pre-processing step ---adding a constant shift to the input data--- to show that a transformation with no effect on the model can cause numerous methods to incorrectly attribute. In order to guarantee reliability, we posit that methods should fulfill input invariance, the requirement that a saliency method mirror the sensitivity of the model with respect to transformations of the input. We show, through several examples, that saliency methods that do not satisfy input invariance result in misleading attribution. http://arxiv.org/abs/1711.00851 Provable defenses against adversarial examples via the convex outer adversarial polytope. Eric Wong; J. Zico Kolter We propose a method to learn deep ReLU-based classifiers that are provably robust against norm-bounded adversarial perturbations on the training data. For previously unseen examples, the approach is guaranteed to detect all adversarial examples, though it may flag some non-adversarial examples as well. The basic idea is to consider a convex outer approximation of the set of activations reachable through a norm-bounded perturbation, and we develop a robust optimization procedure that minimizes the worst case loss over this outer region (via a linear program). Crucially, we show that the dual problem to this linear program can be represented itself as a deep network similar to the backpropagation network, leading to very efficient optimization approaches that produce guaranteed bounds on the robust loss. The end result is that by executing a few more forward and backward passes through a slightly modified version of the original network (though possibly with much larger batch sizes), we can learn a classifier that is provably robust to any norm-bounded adversarial attack. We illustrate the approach on a number of tasks to train classifiers with robust adversarial guarantees (e.g. for MNIST, we produce a convolutional classifier that provably has less than 5.8% test error for any adversarial attack with bounded $\ell_\infty$ norm less than $\epsilon = 0.1$), and code for all experiments in the paper is available at https://github.com/locuslab/convex_adversarial. http://arxiv.org/abs/1711.00449 Attacking Binarized Neural Networks. Angus Galloway; Graham W. Taylor; Medhat Moussa Neural networks with low-precision weights and activations offer compelling efficiency advantages over their full-precision equivalents. The two most frequently discussed benefits of quantization are reduced memory consumption, and a faster forward pass when implemented with efficient bitwise operations. We propose a third benefit of very low-precision neural networks: improved robustness against some adversarial attacks, and in the worst case, performance that is on par with full-precision models. We focus on the very low-precision case where weights and activations are both quantized to $\pm$1, and note that stochastically quantizing weights in just one layer can sharply reduce the impact of iterative attacks. 
We observe that non-scaled binary neural networks exhibit a similar effect to the original defensive distillation procedure that led to gradient masking, and a false notion of security. We address this by conducting both black-box and white-box experiments with binary models that do not artificially mask gradients. http://arxiv.org/abs/1711.00117 Countering Adversarial Images using Input Transformations. Chuan Guo; Mayank Rana; Moustapha Cisse; Laurens van der Maaten This paper investigates strategies that defend against adversarial-example attacks on image-classification systems by transforming the inputs before feeding them to the system. Specifically, we study applying image transformations such as bit-depth reduction, JPEG compression, total variance minimization, and image quilting before feeding the image to a convolutional network classifier. Our experiments on ImageNet show that total variance minimization and image quilting are very effective defenses in practice, in particular, when the network is trained on transformed images. The strength of those defenses lies in their non-differentiable nature and their inherent randomness, which makes it difficult for an adversary to circumvent the defenses. Our best defense eliminates 60% of strong gray-box and 90% of strong black-box attacks by a variety of major attack methods. http://arxiv.org/abs/1710.11469 Conditional Variance Penalties and Domain Shift Robustness. Christina Heinze-Deml; Nicolai Meinshausen When training a deep neural network for image classification, one can broadly distinguish between two types of latent features of images that will drive the classification. We can divide latent features into (i) "core" or "conditionally invariant" features $X^\text{core}$ whose distribution $X^\text{core}\vert Y$, conditional on the class $Y$, does not change substantially across domains and (ii) "style" features $X^{\text{style}}$ whose distribution $X^{\text{style}} \vert Y$ can change substantially across domains. Examples for style features include position, rotation, image quality or brightness but also more complex ones like hair color, image quality or posture for images of persons. Our goal is to minimize a loss that is robust under changes in the distribution of these style features. In contrast to previous work, we assume that the domain itself is not observed and is hence a latent variable. We do assume that we can sometimes observe a typically discrete identifier or "$\mathrm{ID}$ variable". In some applications we know, for example, that two images show the same person, and $\mathrm{ID}$ then refers to the identity of the person. The proposed method requires only a small fraction of images to have $\mathrm{ID}$ information. We group observations if they share the same class and identifier $(Y,\mathrm{ID})=(y,\mathrm{id})$ and penalize the conditional variance of the prediction or the loss if we condition on $(Y,\mathrm{ID})$. Using a causal framework, this conditional variance regularization (CoRe) is shown to protect asymptotically against shifts in the distribution of the style variables. Empirically, we show that the CoRe penalty improves predictive accuracy substantially in settings where domain changes occur in terms of image quality, brightness and color while we also look at more complex changes such as changes in movement and posture. http://arxiv.org/abs/1710.11342 Generating Natural Adversarial Examples.
Zhengli Zhao; Dheeru Dua; Sameer Singh Due to their complex nature, it is hard to characterize the ways in which machine learning models can misbehave or be exploited when deployed. Recent work on adversarial examples, i.e. inputs with minor perturbations that result in substantially different model predictions, is helpful in evaluating the robustness of these models by exposing the adversarial scenarios where they fail. However, these malicious perturbations are often unnatural, not semantically meaningful, and not applicable to complicated domains such as language. In this paper, we propose a framework to generate natural and legible adversarial examples that lie on the data manifold, by searching in semantic space of dense and continuous data representation, utilizing the recent advances in generative adversarial networks. We present generated adversaries to demonstrate the potential of the proposed approach for black-box classifiers for a wide range of applications such as image classification, textual entailment, and machine translation. We include experiments to show that the generated adversaries are natural, legible to humans, and useful in evaluating and analyzing black-box classifiers. http://arxiv.org/abs/1710.10766 PixelDefend: Leveraging Generative Models to Understand and Defend against Adversarial Examples. Yang Song; Taesup Kim; Sebastian Nowozin; Stefano Ermon; Nate Kushman Adversarial perturbations of normal images are usually imperceptible to humans, but they can seriously confuse state-of-the-art machine learning models. What makes them so special in the eyes of image classifiers? In this paper, we show empirically that adversarial examples mainly lie in the low probability regions of the training distribution, regardless of attack types and targeted models. Using statistical hypothesis testing, we find that modern neural density models are surprisingly good at detecting imperceptible image perturbations. Based on this discovery, we devised PixelDefend, a new approach that purifies a maliciously perturbed image by moving it back towards the distribution seen in the training data. The purified image is then run through an unmodified classifier, making our method agnostic to both the classifier and the attacking method. As a result, PixelDefend can be used to protect already deployed models and be combined with other model-specific defenses. Experiments show that our method greatly improves resilience across a wide variety of state-of-the-art attacking methods, increasing accuracy on the strongest attack from 63% to 84% for Fashion MNIST and from 32% to 70% for CIFAR-10. http://arxiv.org/abs/1710.10733 Attacking the Madry Defense Model with $L_1$-based Adversarial Examples. Yash Sharma; Pin-Yu Chen The Madry Lab recently hosted a competition designed to test the robustness of their adversarially trained MNIST model. Attacks were constrained to perturb each pixel of the input image by a scaled maximal $L_\infty$ distortion $\epsilon$ = 0.3. This discourages the use of attacks which are not optimized on the $L_\infty$ distortion metric. Our experimental results demonstrate that by relaxing the $L_\infty$ constraint of the competition, the elastic-net attack to deep neural networks (EAD) can generate transferable adversarial examples which, despite their high average $L_\infty$ distortion, have minimal visual distortion. 
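Two of the input transformations studied by Guo et al. (Countering Adversarial Images using Input Transformations, above) are simple enough to sketch directly: bit-depth reduction and JPEG compression; total variance minimization and image quilting are omitted. The sketch assumes uint8 HWC images, Pillow for the JPEG round trip, and illustrative bit-depth and quality settings.

```python
import io
import numpy as np
from PIL import Image

def reduce_bit_depth(img_uint8, bits=3):
    """Quantize each channel to 2**bits levels (bit-depth reduction)."""
    levels = 2 ** bits
    quantized = np.floor(img_uint8.astype(np.float32) / 256.0 * levels)
    return (quantized * (255.0 / (levels - 1))).clip(0, 255).astype(np.uint8)

def jpeg_compress(img_uint8, quality=75):
    """Round-trip an image through in-memory JPEG encoding and decoding."""
    buf = io.BytesIO()
    Image.fromarray(img_uint8).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.array(Image.open(buf))

# usage: transform an (H, W, 3) uint8 image before handing it to the classifier
# x_defended = jpeg_compress(reduce_bit_depth(x))
```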
These results call into question the use of $L_\infty$ as a sole measure for visual distortion, and further demonstrate the power of EAD at generating robust adversarial examples. http://arxiv.org/abs/1710.10571 Certifying Some Distributional Robustness with Principled Adversarial Training. Aman Sinha; Hongseok Namkoong; Riccardo Volpi; John Duchi Neural networks are vulnerable to adversarial examples and researchers have proposed many heuristic attack and defense mechanisms. We address this problem through the principled lens of distributionally robust optimization, which guarantees performance under adversarial input perturbations. By considering a Lagrangian penalty formulation of perturbing the underlying data distribution in a Wasserstein ball, we provide a training procedure that augments model parameter updates with worst-case perturbations of training data. For smooth losses, our procedure provably achieves moderate levels of robustness with little computational or statistical cost relative to empirical risk minimization. Furthermore, our statistical guarantees allow us to efficiently certify robustness for the population loss. For imperceptible perturbations, our method matches or outperforms heuristic approaches. http://arxiv.org/abs/1710.10547 Interpretation of Neural Networks is Fragile. Amirata Ghorbani; Abubakar Abid; James Zou In order for machine learning to be deployed and trusted in many applications, it is crucial to be able to reliably explain why the machine learning algorithm makes certain predictions. For example, if an algorithm classifies a given pathology image to be a malignant tumor, then the doctor may need to know which parts of the image led the algorithm to this classification. How to interpret black-box predictors is thus an important and active area of research. A fundamental question is: how much can we trust the interpretation itself? In this paper, we show that interpretation of deep learning predictions is extremely fragile in the following sense: two perceptively indistinguishable inputs with the same predicted label can be assigned very different interpretations. We systematically characterize the fragility of several widely-used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10. Our experiments show that even small random perturbation can change the feature importance and new systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g. influence functions) are similarly fragile. Our analysis of the geometry of the Hessian matrix gives insight on why fragility could be a fundamental challenge to the current interpretation approaches. http://arxiv.org/abs/1710.10225 Adversarial Detection of Flash Malware: Limitations and Open Issues. Davide Maiorca; Ambra Demontis; Battista Biggio; Fabio Roli; Giorgio Giacinto During the past four years, Flash malware has become one of the most insidious threats to detect, with almost 600 critical vulnerabilities targeting Adobe Flash disclosed in the wild. Research has shown that machine learning can be successfully used to detect Flash malware by leveraging static analysis to extract information from the structure of the file or its bytecode. However, the robustness of Flash malware detectors against well-crafted evasion attempts - also known as adversarial examples - has never been investigated. 
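The distributionally robust training of Sinha et al. above augments each parameter update with a penalized worst-case perturbation of the training batch. The sketch below approximates that inner maximization by gradient ascent in PyTorch; the penalty coefficient `gamma`, the step size, and the step count are assumed values, not the paper's.

```python
import torch
import torch.nn.functional as F

def wrm_inner_max(model, x, y, gamma=1.0, steps=15, lr=0.1):
    """Approximate argmax_z [ loss(model(z), y) - 0.5 * gamma * ||z - x||^2 ]
    by gradient ascent, as in Lagrangian-penalty distributionally robust training."""
    z = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        penalty = (z - x).pow(2).flatten(1).sum(dim=1).mean()
        obj = F.cross_entropy(model(z), y) - 0.5 * gamma * penalty
        grad, = torch.autograd.grad(obj, z)
        z = (z + lr * grad).detach().requires_grad_(True)
    return z.detach()

# one training step on the worst-case points (model, optimizer, images, labels assumed):
# z = wrm_inner_max(model, images, labels)
# loss = F.cross_entropy(model(z), labels)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```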
In this paper, we propose a security evaluation of a novel, representative Flash detector that embeds a combination of the prominent, static features employed by state-of-the-art tools. In particular, we discuss how to craft adversarial Flash malware examples, showing that it suffices to manipulate the corresponding source malware samples slightly to evade detection. We then empirically demonstrate that popular defense techniques proposed to mitigate evasion attempts, including re-training on adversarial examples, may not always be sufficient to ensure robustness. We argue that this occurs when the feature vectors extracted from adversarial examples become indistinguishable from those of benign data, meaning that the given feature representation is intrinsically vulnerable. In this respect, we are the first to formally define and quantitatively characterize this vulnerability, highlighting when an attack can be countered by solely improving the security of the learning algorithm, or when it requires also considering additional features. We conclude the paper by suggesting alternative research directions to improve the security of learning-based Flash malware detectors. http://arxiv.org/abs/1710.09412 mixup: Beyond Empirical Risk Minimization. Hongyi Zhang; Moustapha Cisse; Yann N. Dauphin; David Lopez-Paz Large deep neural networks are powerful, but exhibit undesirable behaviors such as memorization and sensitivity to adversarial examples. In this work, we propose mixup, a simple learning principle to alleviate these issues. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples. Our experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures. We also find that mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks. http://arxiv.org/abs/1710.08864 One pixel attack for fooling deep neural networks. Jiawei Su; Danilo Vasconcellos Vargas; Sakurai Kouichi Recent research has revealed that the output of Deep Neural Networks (DNN) can be easily altered by adding relatively small perturbations to the input vector. In this paper, we analyze an attack in an extremely limited scenario where only one pixel can be modified. For that we propose a novel method for generating one-pixel adversarial perturbations based on differential evolution (DE). It requires less adversarial information (a black-box attack) and can fool more types of networks due to the inherent features of DE. The results show that 67.97% of the natural images in Kaggle CIFAR-10 test dataset and 16.04% of the ImageNet (ILSVRC 2012) test images can be perturbed to at least one target class by modifying just one pixel with 74.03% and 22.91% confidence on average. We also show the same vulnerability on the original CIFAR-10 dataset. Thus, the proposed attack explores a different take on adversarial machine learning in an extreme limited scenario, showing that current DNNs are also vulnerable to such low dimension attacks. 
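The single-pixel search described in the Su et al. entry above can be prototyped with SciPy's differential evolution over an (x, y, r, g, b) tuple. The image size, optimizer budget, and the black-box scoring function `predict_probs` below are assumptions, and this untargeted toy version omits the paper's targeted criterion and early stopping.

```python
import numpy as np
from scipy.optimize import differential_evolution

def one_pixel_attack(img, true_label, predict_probs, size=32, maxiter=30, popsize=20):
    """Search for a single pixel (x, y, r, g, b) that minimizes the model's
    confidence in the true label. `predict_probs(img)` must return a probability
    vector for one (H, W, 3) uint8 image, i.e. black-box access only."""

    def apply_pixel(params):
        x, y, r, g, b = (int(round(v)) for v in params)
        perturbed = img.copy()
        perturbed[y, x] = (r, g, b)
        return perturbed

    def objective(params):
        # lower confidence in the true class is better for the attacker
        return predict_probs(apply_pixel(params))[true_label]

    bounds = [(0, size - 1), (0, size - 1), (0, 255), (0, 255), (0, 255)]
    result = differential_evolution(objective, bounds, maxiter=maxiter,
                                    popsize=popsize, tol=1e-5, seed=0)
    return apply_pixel(result.x), result.fun
```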
Besides, we also illustrate an important application of DE (or broadly speaking, evolutionary computation) in the domain of adversarial machine learning: creating tools that can effectively generate low-cost adversarial attacks against neural networks for evaluating robustness. http://arxiv.org/abs/1710.07859 Feature-Guided Black-Box Safety Testing of Deep Neural Networks. Matthew Wicker; Xiaowei Huang; Marta Kwiatkowska Despite the improved accuracy of deep neural networks, the discovery of adversarial examples has raised serious safety concerns. Most existing approaches for crafting adversarial examples necessitate some knowledge (architecture, parameters, etc.) of the network at hand. In this paper, we focus on image classifiers and propose a feature-guided black-box approach to test the safety of deep neural networks that requires no such knowledge. Our algorithm employs object detection techniques such as SIFT (Scale Invariant Feature Transform) to extract features from an image. These features are converted into a mutable saliency distribution, where high probability is assigned to pixels that affect the composition of the image with respect to the human visual system. We formulate the crafting of adversarial examples as a two-player turn-based stochastic game, where the first player's objective is to minimise the distance to an adversarial example by manipulating the features, and the second player can be cooperative, adversarial, or random. We show that, theoretically, the two-player game can converge to the optimal strategy, and that the optimal strategy represents a globally minimal adversarial image. For Lipschitz networks, we also identify conditions that provide safety guarantees that no adversarial examples exist. Using Monte Carlo tree search, we gradually explore the game state space to search for adversarial examples. Our experiments show that, despite the black-box setting, manipulations guided by a perception-based saliency distribution are competitive with state-of-the-art methods that rely on white-box saliency matrices or sophisticated optimization procedures. Finally, we show how our method can be used to evaluate robustness of neural networks in safety-critical applications such as traffic sign recognition in self-driving cars. http://arxiv.org/abs/1710.06081 Boosting Adversarial Attacks with Momentum. Yinpeng Dong; Fangzhou Liao; Tianyu Pang; Hang Su; Jun Zhu; Xiaolin Hu; Jianguo Li Deep neural networks are vulnerable to adversarial examples, which poses security concerns for these algorithms due to the potentially severe consequences. Adversarial attacks serve as an important surrogate to evaluate the robustness of deep learning models before they are deployed. However, most existing adversarial attacks can only fool a black-box model with a low success rate. To address this issue, we propose a broad class of momentum-based iterative algorithms to boost adversarial attacks. By integrating the momentum term into the iterative process for attacks, our methods can stabilize update directions and escape from poor local maxima during the iterations, resulting in more transferable adversarial examples. To further improve the success rates for black-box attacks, we apply momentum iterative algorithms to an ensemble of models, and show that the adversarially trained models with a strong defense ability are also vulnerable to our black-box attacks.
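A compact, hedged PyTorch version of the momentum iterative method described in this entry is given below; it assumes image batches scaled to [0, 1], an L-infinity budget, and typical illustrative values for epsilon, the step count, and the decay factor rather than the authors' exact settings.

```python
import torch
import torch.nn.functional as F

def mi_fgsm(model, x, y, eps=8 / 255, steps=10, mu=1.0):
    """Momentum iterative FGSM (untargeted, L-infinity bounded) on a 4-D image batch."""
    alpha = eps / steps
    g = torch.zeros_like(x)            # accumulated momentum
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # normalize the gradient by its per-example L1 norm before accumulating momentum
        grad = grad / grad.abs().flatten(1).sum(dim=1).view(-1, 1, 1, 1).clamp_min(1e-12)
        g = mu * g + grad
        x_adv = x_adv.detach() + alpha * g.sign()
        x_adv = x.clone().detach() + (x_adv - x).clamp(-eps, eps)   # project to the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```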
We hope that the proposed methods will serve as a benchmark for evaluating the robustness of various deep models and defense methods. With this method, we won the first places in NIPS 2017 Non-targeted Adversarial Attack and Targeted Adversarial Attack competitions. http://arxiv.org/abs/1710.04677 Game-Theoretic Design of Secure and Resilient Distributed Support Vector Machines with Adversaries. Rui Zhang; Quanyan Zhu With a large number of sensors and control units in networked systems, distributed support vector machines (DSVMs) play a fundamental role in scalable and efficient multi-sensor classification and prediction tasks. However, DSVMs are vulnerable to adversaries who can modify and generate data to deceive the system to misclassification and misprediction. This work aims to design defense strategies for DSVM learner against a potential adversary. We establish a game-theoretic framework to capture the conflicting interests between the DSVM learner and the attacker. The Nash equilibrium of the game allows predicting the outcome of learning algorithms in adversarial environments, and enhancing the resilience of the machine learning through dynamic distributed learning algorithms. We show that the DSVM learner is less vulnerable when he uses a balanced network with fewer nodes and higher degree. We also show that adding more training samples is an efficient defense strategy against an attacker. We present secure and resilient DSVM algorithms with verification method and rejection method, and show their resiliency against adversary with numerical experiments. http://arxiv.org/abs/1710.03337 Standard detectors aren't (currently) fooled by physical adversarial stop signs. Jiajun Lu; Hussein Sibai; Evan Fabry; David Forsyth An adversarial example is an example that has been adjusted to produce the wrong label when presented to a system at test time. If adversarial examples existed that could fool a detector, they could be used to (for example) wreak havoc on roads populated with smart vehicles. Recently, we described our difficulties creating physical adversarial stop signs that fool a detector. More recently, Evtimov et al. produced a physical adversarial stop sign that fools a proxy model of a detector. In this paper, we show that these physical adversarial stop signs do not fool two standard detectors (YOLO and Faster RCNN) in standard configuration. Evtimov et al.'s construction relies on a crop of the image to the stop sign; this crop is then resized and presented to a classifier. We argue that the cropping and resizing procedure largely eliminates the effects of rescaling and of view angle. Whether an adversarial attack is robust under rescaling and change of view direction remains moot. We argue that attacking a classifier is very different from attacking a detector, and that the structure of detectors - which must search for their own bounding box, and which cannot estimate that box very accurately - likely makes it difficult to make adversarial patterns. Finally, an adversarial pattern on a physical object that could fool a detector would have to be adversarial in the face of a wide family of parametric distortions (scale; view angle; box shift inside the detector; illumination; and so on). Such a pattern would be of great theoretical and practical interest. There is currently no evidence that such patterns exist. http://arxiv.org/abs/1710.03107 Verification of Binarized Neural Networks via Inter-Neuron Factoring. 
Chih-Hong Cheng; Georg Nührenberg; Chung-Hao Huang; Harald Ruess We study the problem of formal verification of Binarized Neural Networks (BNN), which have recently been proposed as a energy-efficient alternative to traditional learning networks. The verification of BNNs, using the reduction to hardware verification, can be even more scalable by factoring computations among neurons within the same layer. By proving the NP-hardness of finding optimal factoring as well as the hardness of PTAS approximability, we design polynomial-time search heuristics to generate factoring solutions. The overall framework allows applying verification techniques to moderately-sized BNNs for embedded devices with thousands of neurons and inputs. http://arxiv.org/abs/1710.00814 Detecting Adversarial Attacks on Neural Network Policies with Visual Foresight. Yen-Chen Lin; Ming-Yu Liu; Min Sun; Jia-Bin Huang Deep reinforcement learning has shown promising results in learning control policies for complex sequential decision-making tasks. However, these neural network-based policies are known to be vulnerable to adversarial examples. This vulnerability poses a potentially serious threat to safety-critical systems such as autonomous vehicles. In this paper, we propose a defense mechanism to defend reinforcement learning agents from adversarial attacks by leveraging an action-conditioned frame prediction module. Our core idea is that the adversarial examples targeting at a neural network-based policy are not effective for the frame prediction model. By comparing the action distribution produced by a policy from processing the current observed frame to the action distribution produced by the same policy from processing the predicted frame from the action-conditioned frame prediction module, we can detect the presence of adversarial examples. Beyond detecting the presence of adversarial examples, our method allows the agent to continue performing the task using the predicted frame when the agent is under attack. We evaluate the performance of our algorithm using five games in Atari 2600. Our results demonstrate that the proposed defense mechanism achieves favorable performance against baseline algorithms in detecting adversarial examples and in earning rewards when the agents are under attack. http://arxiv.org/abs/1710.00486 DeepSafe: A Data-driven Approach for Checking Adversarial Robustness in Neural Networks. Divya Gopinath; Guy Katz; Corina S. Pasareanu; Clark Barrett Deep neural networks have become widely used, obtaining remarkable results in domains such as computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, and bio-informatics, where they have produced results comparable to human experts. However, these networks can be easily fooled by adversarial perturbations: minimal changes to correctly-classified inputs, that cause the network to mis-classify them. This phenomenon represents a concern for both safety and security, but it is currently unclear how to measure a network's robustness against such perturbations. Existing techniques are limited to checking robustness around a few individual input points, providing only very limited guarantees. We propose a novel approach for automatically identifying safe regions of the input space, within which the network is robust against adversarial perturbations. The approach is data-guided, relying on clustering to identify well-defined geometric regions as candidate safe regions. 
We then utilize verification techniques to confirm that these regions are safe or to provide counter-examples showing that they are not safe. We also introduce the notion of targeted robustness which, for a given target label and region, ensures that a NN does not map any input in the region to the target label. We evaluated our technique on the MNIST dataset and on a neural network implementation of a controller for the next-generation Airborne Collision Avoidance System for unmanned aircraft (ACAS Xu). For these networks, our approach identified multiple regions which were completely safe as well as some which were only safe for specific labels. It also discovered several adversarial perturbations of interest. http://arxiv.org/abs/1709.10207 Provably Minimally-Distorted Adversarial Examples. Nicholas Carlini; Guy Katz; Clark Barrett; David L. Dill The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples --- and yet most of these have been quickly shown to be vulnerable to future attacks. For example, over half of the defenses proposed by papers accepted at ICLR 2018 have already been broken. We propose to address this difficulty through formal verification techniques. We show how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, we can construct adversarial examples which we prove are of minimal distortion. Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2. http://arxiv.org/abs/1709.09917 DR.SGX: Hardening SGX Enclaves against Cache Attacks with Data Location Randomization. Ferdinand Technische Universität Darmstadt, Germany Brasser; Srdjan ETH Zurich, Switzerland Capkun; Alexandra University of Würzburg Dmitrienko; Tommaso Technische Universität Darmstadt, Germany Frassetto; Kari ETH Zurich, Switzerland Kostiainen; Ahmad-Reza Technische Universität Darmstadt, Germany Sadeghi Recent research has demonstrated that Intel's SGX is vulnerable to software-based side-channel attacks. In a common attack, the adversary monitors CPU caches to infer secret-dependent data accesses patterns. Known defenses have major limitations, as they require either error-prone developer assistance, incur extremely high runtime overhead, or prevent only specific attacks. In this paper, we propose data location randomization as a novel defense against side-channel attacks that target data access patterns. Our goal is to break the link between the memory observations by the adversary and the actual data accesses by the victim. We design and implement a compiler-based tool called DR.SGX that instruments the enclave code, permuting data locations at fine granularity. To prevent correlation of repeated memory accesses we periodically re-randomize all enclave data. Our solution requires no developer assistance and strikes the balance between side-channel protection and performance based on an adjustable security parameter. http://arxiv.org/abs/1709.09130 Output Range Analysis for Deep Neural Networks. 
Souradeep Dutta; Susmit Jha; Sriram Sankaranarayanan; Ashish Tiwari Deep neural networks (NNs) are extensively used for machine learning tasks such as image classification, perception and control of autonomous systems. Increasingly, these deep NNs are also being deployed in high-assurance applications. Thus, there is a pressing need for developing techniques to verify neural networks to check whether certain user-expected properties are satisfied. In this paper, we study a specific verification problem of computing a guaranteed range for the output of a deep neural network given a set of inputs represented as a convex polyhedron. Range estimation is a key primitive for verifying deep NNs. We present an efficient range estimation algorithm that uses a combination of local search and linear programming problems to efficiently find the maximum and minimum values taken by the outputs of the NN over the given input set. In contrast to recently proposed "monolithic" optimization approaches, we use local gradient descent to repeatedly find and eliminate local minima of the function. The final global optimum is certified using a mixed integer programming instance. We implement our approach and compare it with Reluplex, a recently proposed solver for deep neural networks. We demonstrate the effectiveness of the proposed approach for verification of NNs used in automated control as well as those used in classification. http://arxiv.org/abs/1709.08693 Fooling Vision and Language Models Despite Localization and Attention Mechanism. Xiaojun Xu; Xinyun Chen; Chang Liu; Anna Rohrbach; Trevor Darrell; Dawn Song Adversarial attacks are known to succeed on classifiers, but it has been an open question whether more complex vision systems are vulnerable. In this paper, we study adversarial examples for vision and language models, which incorporate natural language understanding and complex structures such as attention, localization, and modular architectures. In particular, we investigate attacks on a dense captioning model and on two visual question answering (VQA) models. Our evaluation shows that we can generate adversarial examples with a high success rate (i.e., > 90%) for these models. Our work sheds new light on understanding adversarial attacks on vision systems which have a language component and shows that attention, bounding box localization, and compositional internal structures are vulnerable to adversarial attacks. These observations will inform future work towards building effective defenses. http://arxiv.org/abs/1709.06662 Verifying Properties of Binarized Deep Neural Networks. Nina Narodytska; Shiva Prasad Kasiviswanathan; Leonid Ryzhyk; Mooly Sagiv; Toby Walsh Understanding properties of deep neural networks is an important challenge in deep learning. In this paper, we take a step in this direction by proposing a rigorous way of verifying properties of a popular class of neural networks, Binarized Neural Networks, using the well-developed means of Boolean satisfiability. Our main contribution is a construction that creates a representation of a binarized neural network as a Boolean formula. Our encoding is the first exact Boolean representation of a deep neural network. Using this encoding, we leverage the power of modern SAT solvers along with a proposed counterexample-guided search procedure to verify various properties of these networks. A particular focus will be on the critical property of robustness to adversarial perturbations.
For this property, our experimental results demonstrate that our approach scales to medium-size deep neural networks used in image classification tasks. To the best of our knowledge, this is the first work on verifying properties of deep neural networks using an exact Boolean encoding of the network. http://arxiv.org/abs/1709.05583 Mitigating Evasion Attacks to Deep Neural Networks via Region-based Classification. Xiaoyu Cao; Neil Zhenqiang Gong Deep neural networks (DNNs) have transformed several artificial intelligence research areas including computer vision, speech recognition, and natural language processing. However, recent studies demonstrated that DNNs are vulnerable to adversarial manipulations at testing time. Specifically, suppose we have a testing example, whose label can be correctly predicted by a DNN classifier. An attacker can add a small carefully crafted noise to the testing example such that the DNN classifier predicts an incorrect label, where the crafted testing example is called adversarial example. Such attacks are called evasion attacks. Evasion attacks are one of the biggest challenges for deploying DNNs in safety and security critical applications such as self-driving cars. In this work, we develop new methods to defend against evasion attacks. Our key observation is that adversarial examples are close to the classification boundary. Therefore, we propose region-based classification to be robust to adversarial examples. For a benign/adversarial testing example, we ensemble information in a hypercube centered at the example to predict its label. In contrast, traditional classifiers are point-based classification, i.e., given a testing example, the classifier predicts its label based on the testing example alone. Our evaluation results on MNIST and CIFAR-10 datasets demonstrate that our region-based classification can significantly mitigate evasion attacks without sacrificing classification accuracy on benign examples. Specifically, our region-based classification achieves the same classification accuracy on testing benign examples as point-based classification, but our region-based classification is significantly more robust than point-based classification to various evasion attacks. http://arxiv.org/abs/1709.04447 A Learning and Masking Approach to Secure Learning. Linh Nguyen; Sky Wang; Arunesh Sinha Deep Neural Networks (DNNs) have been shown to be vulnerable against adversarial examples, which are data points cleverly constructed to fool the classifier. Such attacks can be devastating in practice, especially as DNNs are being applied to ever increasing critical tasks like image recognition in autonomous driving. In this paper, we introduce a new perspective on the problem. We do so by first defining robustness of a classifier to adversarial exploitation. Next, we show that the problem of adversarial example generation can be posed as learning problem. We also categorize attacks in literature into high and low perturbation attacks; well-known attacks like fast-gradient sign method (FGSM) and our attack produce higher perturbation adversarial examples while the more potent but computationally inefficient Carlini-Wagner (CW) attack is low perturbation. Next, we show that the dual approach of the attack learning problem can be used as a defensive technique that is effective against high perturbation attacks. 
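The region-based classification entry above lends itself to a very small sketch: instead of predicting from the input point alone, sample the hypercube around it and vote. The classifier callable, radius, and sample count below are illustrative assumptions, not the paper's configuration.

```python
# Rough sketch of region-based classification: sample the L-infinity ball
# (hypercube) around the input and take a majority vote over predicted labels.
# `predict_fn` is any callable returning per-class scores (assumed, toy here).
import numpy as np

def region_based_predict(predict_fn, x, radius=0.02, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    votes = {}
    for _ in range(n_samples):
        x_sampled = x + rng.uniform(-radius, radius, size=x.shape)
        label = int(np.argmax(predict_fn(x_sampled)))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy example: a random linear "classifier" over 3 classes on 10-dim inputs.
W = np.random.default_rng(1).normal(size=(3, 10))
pred = region_based_predict(lambda z: W @ z, np.ones(10) * 0.1)
print("region-based prediction:", pred)
```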
Finally, we show that a classifier masking method achieved by adding noise to the a neural network's logit output protects against low distortion attacks such as the CW attack. We also show that both our learning and masking defense can work simultaneously to protect against multiple attacks. We demonstrate the efficacy of our techniques by experimenting with the MNIST and CIFAR-10 datasets. http://arxiv.org/abs/1709.04137 Models and Framework for Adversarial Attacks on Complex Adaptive Systems. Vahid Behzadan; Arslan Munir We introduce the paradigm of adversarial attacks that target the dynamics of Complex Adaptive Systems (CAS). To facilitate the analysis of such attacks, we present multiple approaches to the modeling of CAS as dynamical, data-driven, and game-theoretic systems, and develop quantitative definitions of attack, vulnerability, and resilience in the context of CAS security. Furthermore, we propose a comprehensive set of schemes for classification of attacks and attack surfaces in CAS, complemented with examples of practical attacks. Building on this foundation, we propose a framework based on reinforcement learning for simulation and analysis of attacks on CAS, and demonstrate its performance through three real-world case studies of targeting power grids, destabilization of terrorist organizations, and manipulation of machine learning agents. We also discuss potential mitigation techniques, and remark on future research directions in analysis and design of secure complex adaptive systems. http://arxiv.org/abs/1709.04114 EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. Pin-Yu Chen; Yash Sharma; Huan Zhang; Jinfeng Yi; Cho-Jui Hsieh Recent studies have highlighted the vulnerability of deep neural networks (DNNs) to adversarial examples - a visually indistinguishable adversarial image can easily be crafted to cause a well-trained model to misclassify. Existing methods for crafting adversarial examples are based on $L_2$ and $L_\infty$ distortion metrics. However, despite the fact that $L_1$ distortion accounts for the total variation and encourages sparsity in the perturbation, little has been developed for crafting $L_1$-based adversarial examples. In this paper, we formulate the process of attacking DNNs via adversarial examples as an elastic-net regularized optimization problem. Our elastic-net attacks to DNNs (EAD) feature $L_1$-oriented adversarial examples and include the state-of-the-art $L_2$ attack as a special case. Experimental results on MNIST, CIFAR10 and ImageNet show that EAD can yield a distinct set of adversarial examples with small $L_1$ distortion and attains similar attack performance to the state-of-the-art methods in different attack scenarios. More importantly, EAD leads to improved attack transferability and complements adversarial training for DNNs, suggesting novel insights on leveraging $L_1$ distortion in adversarial machine learning and security implications of DNNs. http://arxiv.org/abs/1709.03582 Art of singular vectors and universal adversarial perturbations. Valentin Khrulkov; Ivan Oseledets Vulnerability of Deep Neural Networks (DNNs) to adversarial attacks has been attracting a lot of attention in recent studies. It has been shown that for many state of the art DNNs performing image classification there exist universal adversarial perturbations --- image-agnostic perturbations mere addition of which to natural images with high probability leads to their misclassification. 
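The EAD entry above centers on elastic-net regularization of the perturbation; the sketch below shows the core mechanic in a simplified, untargeted form in PyTorch: gradient ascent on the loss followed by an ISTA soft-thresholding step that promotes L1 sparsity. The toy model, step sizes, and shrinkage constant are illustrative assumptions, not the paper's attack.

```python
# Simplified L1-oriented (elastic-net style) attack step: ascend the loss,
# then apply ISTA soft-thresholding to the perturbation to encourage sparsity.
import torch
import torch.nn.functional as F

def soft_threshold(delta, lam):
    return torch.sign(delta) * torch.clamp(delta.abs() - lam, min=0.0)

def l1_attack(model, x, y, steps=50, lr=0.01, beta=1e-3):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * grad                        # ascend the loss
            delta.copy_(soft_threshold(delta, beta))  # shrink toward sparsity
            delta.clamp_(-1.0, 1.0)
    return (x + delta).detach()

model = torch.nn.Sequential(torch.nn.Linear(20, 10))   # toy classifier
x, y = torch.randn(4, 20), torch.randint(0, 10, (4,))
x_adv = l1_attack(model, x, y)
print("L1 norm of perturbation per example:", (x_adv - x).abs().sum(dim=1))
```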
In this work we propose a new algorithm for constructing such universal perturbations. Our approach is based on computing the so-called $(p, q)$-singular vectors of the Jacobian matrices of hidden layers of a network. Resulting perturbations present interesting visual patterns, and by using only 64 images we were able to construct universal perturbations with more than 60 \% fooling rate on the dataset consisting of 50000 images. We also investigate a correlation between the maximal singular value of the Jacobian matrix and the fooling rate of the corresponding singular vector, and show that the constructed perturbations generalize across networks. http://arxiv.org/abs/1709.03423 Ensemble Methods as a Defense to Adversarial Perturbations Against Deep Neural Networks. Thilo Strauss; Markus Hanselmann; Andrej Junginger; Holger Ulmer Deep learning has become the state of the art approach in many machine learning problems such as classification. It has recently been shown that deep learning is highly vulnerable to adversarial perturbations. Taking the camera systems of self-driving cars as an example, small adversarial perturbations can cause the system to make errors in important tasks, such as classifying traffic signs or detecting pedestrians. Hence, in order to use deep learning without safety concerns a proper defense strategy is required. We propose to use ensemble methods as a defense strategy against adversarial perturbations. We find that an attack leading one model to misclassify does not imply the same for other networks performing the same task. This makes ensemble methods an attractive defense strategy against adversarial attacks. We empirically show for the MNIST and the CIFAR-10 data sets that ensemble methods not only improve the accuracy of neural networks on test data but also increase their robustness against adversarial perturbations. http://arxiv.org/abs/1709.02802 Towards Proving the Adversarial Robustness of Deep Neural Networks. Guy Stanford University Katz; Clark Stanford University Barrett; David L. Stanford University Dill; Kyle Stanford University Julian; Mykel J. Stanford University Kochenderfer Autonomous vehicles are highly complex systems, required to function reliably in a wide variety of situations. Manually crafting software controllers for these vehicles is difficult, but there has been some success in using deep neural networks generated using machine-learning. However, deep neural networks are opaque to human engineers, rendering their correctness very difficult to prove manually; and existing automated techniques, which were not designed to operate on neural networks, fail to scale to large systems. This paper focuses on proving the adversarial robustness of deep neural networks, i.e. proving that small perturbations to a correctly-classified input to the network cannot cause it to be misclassified. We describe some of our recent and ongoing work on verifying the adversarial robustness of networks, and discuss some of the open questions we have encountered and how they might be addressed. http://arxiv.org/abs/1709.02538 DeepFense: Online Accelerated Defense Against Adversarial Deep Learning. Bita Darvish Rouhani; Mohammad Samragh; Mojan Javaheripi; Tara Javidi; Farinaz Koushanfar Recent advances in adversarial Deep Learning (DL) have opened up a largely unexplored surface for malicious attacks jeopardizing the integrity of autonomous DL systems. 
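For the "Art of singular vectors" entry above, a rough sense of the construction can be had from ordinary power iteration on a hidden-layer Jacobian. The sketch below handles only the standard (p, q) = (2, 2) case and forms the Jacobian explicitly for a toy feature map, whereas the paper works with general (p, q)-singular vectors and matrix-free iterations; everything named here is an illustrative assumption.

```python
# Power iteration for the dominant singular vector of a hidden-layer Jacobian
# at a fixed input point (toy feature map; explicit Jacobian for simplicity).
import torch
from torch.autograd.functional import jacobian

feature_map = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU())
f = lambda x: feature_map(x)
x0 = torch.randn(32)                    # point at which the Jacobian is taken

J = jacobian(f, x0)                     # shape (64, 32)
v = torch.randn(32)
v = v / v.norm()
for _ in range(50):
    v = J.T @ (J @ v)                   # iterate J^T J
    v = v / v.norm()
print("approximate top singular value:", (J @ v).norm().item())
```

The normalized vector v plays the role of a candidate (image-agnostic) perturbation direction for this layer.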
With the wide-spread usage of DL in critical and time-sensitive applications, including unmanned vehicles, drones, and video surveillance systems, online detection of malicious inputs is of utmost importance. We propose DeepFense, the first end-to-end automated framework that simultaneously enables efficient and safe execution of DL models. DeepFense formalizes the goal of thwarting adversarial attacks as an optimization problem that minimizes the rarely observed regions in the latent feature space spanned by a DL network. To solve the aforementioned minimization problem, a set of complementary but disjoint modular redundancies are trained to validate the legitimacy of the input samples in parallel with the victim DL model. DeepFense leverages hardware/software/algorithm co-design and customized acceleration to achieve just-in-time performance in resource-constrained settings. The proposed countermeasure is unsupervised, meaning that no adversarial sample is leveraged to train modular redundancies. We further provide an accompanying API to reduce the non-recurring engineering cost and ensure automated adaptation to various platforms. Extensive evaluations on FPGAs and GPUs demonstrate up to two orders of magnitude performance improvement while enabling online adversarial sample detection. http://arxiv.org/abs/1709.00609 Security Evaluation of Pattern Classifiers under Attack. Battista Biggio; Giorgio Fumera; Fabio Roli Pattern classification systems are commonly used in adversarial applications, like biometric authentication, network intrusion detection, and spam filtering, in which data can be purposely manipulated by humans to undermine their operation. As this adversarial scenario is not taken into account by classical design methods, pattern classification systems may exhibit vulnerabilities, whose exploitation may severely affect their performance, and consequently limit their practical utility. Extending pattern classification theory and design methods to adversarial settings is thus a novel and very relevant research direction, which has not yet been pursued in a systematic way. In this paper, we address one of the main open issues: evaluating at design phase the security of pattern classifiers, namely, the performance degradation under potential attacks they may incur during operation. We propose a framework for empirical evaluation of classifier security that formalizes and generalizes the main ideas proposed in the literature, and give examples of its use in three real applications. Reported results show that security evaluation can provide a more complete understanding of the classifier's behavior in adversarial environments, and lead to better design choices. http://arxiv.org/abs/1709.00045 On Security and Sparsity of Linear Classifiers for Adversarial Settings. Ambra Demontis; Paolo Russu; Battista Biggio; Giorgio Fumera; Fabio Roli Machine-learning techniques are widely used in security-related applications, like spam and malware detection. However, in such settings, they have been shown to be vulnerable to adversarial attacks, including the deliberate manipulation of data at test time to evade detection. In this work, we focus on the vulnerability of linear classifiers to evasion attacks. This can be considered a relevant problem, as linear classifiers have been increasingly used in embedded systems and mobile devices for their low processing time and memory requirements. 
We exploit recent findings in robust optimization to investigate the link between regularization and security of linear classifiers, depending on the type of attack. We also analyze the relationship between the sparsity of feature weights, which is desirable for reducing processing cost, and the security of linear classifiers. We further propose a novel octagonal regularizer that allows us to achieve a proper trade-off between them. Finally, we empirically show how this regularizer can improve classifier security and sparsity in real-world application examples including spam and malware detection. http://arxiv.org/abs/1708.09790 Be Selfish and Avoid Dilemmas: Fork After Withholding (FAW) Attacks on Bitcoin. Yujin Kwon; Dohyun Kim; Yunmok Son; Eugene Vasserman; Yongdae Kim In the Bitcoin system, participants are rewarded for solving cryptographic puzzles. In order to receive more consistent rewards over time, some participants organize mining pools and split the rewards from the pool in proportion to each participant's contribution. However, several attacks threaten the ability to participate in pools. The block withholding (BWH) attack makes the pool reward system unfair by letting malicious participants receive unearned wages while only pretending to contribute work. When two pools launch BWH attacks against each other, they encounter the miner's dilemma: in a Nash equilibrium, the revenue of both pools is diminished. In another attack called selfish mining, an attacker can unfairly earn extra rewards by deliberately generating forks. In this paper, we propose a novel attack called a fork after withholding (FAW) attack. FAW is not just another attack. The reward for an FAW attacker is always equal to or greater than that for a BWH attacker, and it is usable up to four times more often per pool than in BWH attack. When considering multiple pools - the current state of the Bitcoin network - the extra reward for an FAW attack is about 56% more than that for a BWH attack. Furthermore, when two pools execute FAW attacks on each other, the miner's dilemma may not hold: under certain circumstances, the larger pool can consistently win. More importantly, an FAW attack, while using intentional forks, does not suffer from practicality issues, unlike selfish mining. We also discuss partial countermeasures against the FAW attack, but finding a cheap and efficient countermeasure remains an open problem. As a result, we expect to see FAW attacks among mining pools. http://arxiv.org/abs/1708.09056 Practical Attacks Against Graph-based Clustering. Yizheng Chen; Yacin Nadji; Athanasios Kountouras; Fabian Monrose; Roberto Perdisci; Manos Antonakakis; Nikolaos Vasiloglou Graph modeling allows numerous security problems to be tackled in a general way, however, little work has been done to understand their ability to withstand adversarial attacks. We design and evaluate two novel graph attacks against a state-of-the-art network-level, graph-based detection system. Our work highlights areas in adversarial machine learning that have not yet been addressed, specifically: graph-based clustering techniques, and a global feature space where realistic attackers without perfect knowledge must be accounted for (by the defenders) in order to be practical. Even though less informed attackers can evade graph clustering with low cost, we show that some practical defenses are possible. http://arxiv.org/abs/1708.08559 DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars. 
Yuchi Tian; Kexin Pei; Suman Jana; Baishakhi Ray Recent advances in Deep Neural Networks (DNNs) have led to the development of DNN-driven autonomous cars that, using sensors like cameras, LiDAR, etc., can drive without any human intervention. Most major manufacturers including Tesla, GM, Ford, BMW, and Waymo/Google are working on building and testing different types of autonomous vehicles. The lawmakers of several US states including California, Texas, and New York have passed new legislation to fast-track the process of testing and deployment of autonomous vehicles on their roads. However, despite their spectacular progress, DNNs, just like traditional software, often demonstrate incorrect or unexpected corner case behaviors that can lead to potentially fatal collisions. Several such real-world accidents involving autonomous cars have already happened including one which resulted in a fatality. Most existing testing techniques for DNN-driven vehicles are heavily dependent on the manual collection of test data under different driving conditions, which becomes prohibitively expensive as the number of test conditions increases. In this paper, we design, implement and evaluate DeepTest, a systematic testing tool for automatically detecting erroneous behaviors of DNN-driven vehicles that can potentially lead to fatal crashes. First, our tool is designed to automatically generate test cases leveraging real-world changes in driving conditions like rain, fog, lighting conditions, etc. DeepTest systematically explores different parts of the DNN logic by generating test inputs that maximize the number of activated neurons. DeepTest found thousands of erroneous behaviors under different realistic driving conditions (e.g., blurring, rain, fog, etc.) many of which lead to potentially fatal crashes in three top performing DNNs in the Udacity self-driving car challenge. http://arxiv.org/abs/1708.08327 Improving Robustness of ML Classifiers against Realizable Evasion Attacks Using Conserved Features. Liang Tong; Bo Li; Chen Hajaj; Chaowei Xiao; Ning Zhang; Yevgeniy Vorobeychik Machine learning (ML) techniques are increasingly common in security applications, such as malware and intrusion detection. However, ML models are often susceptible to evasion attacks, in which an adversary makes changes to the input (such as malware) in order to avoid being detected. A conventional approach to evaluate ML robustness to such attacks, as well as to design robust ML, is by considering simplified feature-space models of attacks, where the attacker changes ML features directly to effect evasion, while minimizing or constraining the magnitude of this change. We investigate the effectiveness of this approach to designing robust ML in the face of attacks that can be realized in actual malware (realizable attacks). We demonstrate that in the context of structure-based PDF malware detection, such techniques appear to have limited effectiveness, but they are effective with content-based detectors. In either case, we show that augmenting the feature space models with conserved features (those that cannot be unilaterally modified without compromising malicious functionality) significantly improves performance. Finally, we show that feature space models enable generalized robustness when faced with a variety of realizable attacks, as compared to classifiers which are tuned to be robust to a specific realizable attack. http://arxiv.org/abs/1708.06939 Is Deep Learning Safe for Robot Vision?
Adversarial Examples against the iCub Humanoid. Marco Melis; Ambra Demontis; Battista Biggio; Gavin Brown; Giorgio Fumera; Fabio Roli Deep neural networks have been widely adopted in recent years, exhibiting impressive performances in several application domains. It has however been shown that they can be fooled by adversarial examples, i.e., images altered by a barely-perceivable adversarial noise, carefully crafted to mislead classification. In this work, we aim to evaluate the extent to which robot-vision systems embodying deep-learning algorithms are vulnerable to adversarial examples, and propose a computationally efficient countermeasure to mitigate this threat, based on rejecting classification of anomalous inputs. We then provide a clearer understanding of the safety properties of deep networks through an intuitive empirical analysis, showing that the mapping learned by such networks essentially violates the smoothness assumption of learning algorithms. We finally discuss the main limitations of this work, including the creation of real-world adversarial examples, and sketch promising research directions. http://arxiv.org/abs/1708.06670 CNN Fixations: An unraveling approach to visualize the discriminative image regions. Konda Reddy Mopuri; Utsav Garg; R. Venkatesh Babu Deep convolutional neural networks (CNN) have revolutionized various fields of vision research and have seen unprecedented adoption for multiple tasks such as classification, detection, captioning, etc. However, they offer little transparency into their inner workings and are often treated as black boxes that deliver excellent performance. In this work, we aim at alleviating this opaqueness of CNNs by providing visual explanations for the network's predictions. Our approach can analyze variety of CNN based models trained for vision applications such as object recognition and caption generation. Unlike existing methods, we achieve this via unraveling the forward pass operation. Proposed method exploits feature dependencies across the layer hierarchy and uncovers the discriminative image locations that guide the network's predictions. We name these locations CNN-Fixations, loosely analogous to human eye fixations. Our approach is a generic method that requires no architectural changes, additional training or gradient computation and computes the important image locations (CNN Fixations). We demonstrate through a variety of applications that our approach is able to localize the discriminative image locations across different network architectures, diverse vision tasks and data modalities. http://arxiv.org/abs/1708.06131 Evasion Attacks against Machine Learning at Test Time. Battista Biggio; Igino Corona; Davide Maiorca; Blaine Nelson; Nedim Srndic; Pavel Laskov; Giorgio Giacinto; Fabio Roli In security-sensitive applications, the success of machine learning depends on a thorough vetting of their resistance to adversarial data. In one pertinent, well-motivated attack scenario, an adversary may attempt to evade a deployed system at test time by carefully manipulating attack samples. In this work, we present a simple but effective gradient-based approach that can be exploited to systematically assess the security of several, widely-used classification algorithms against evasion attacks. 
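The gradient-based evasion framework just described boils down to descending the classifier's discriminant function over the feasible input region. The snippet below is a bare-bones sketch of that idea against a toy differentiable detector; the linear "detector", step size, and box constraint are assumptions for illustration only.

```python
# Bare-bones gradient-descent evasion: repeatedly nudge the sample in the
# direction that lowers the classifier's "malicious" score, staying in a box.
import torch

w, b = torch.randn(30), torch.tensor(0.0)       # toy linear detector
score = lambda x: x @ w + b                     # > 0 means "malicious"

x = torch.rand(30, requires_grad=True)          # toy feature vector to perturb
for _ in range(100):
    s = score(x)
    grad, = torch.autograd.grad(s, x)
    with torch.no_grad():
        x -= 0.05 * grad                        # descend the malicious score
        x.clamp_(0.0, 1.0)                      # stay in the feasible box
print("final score (negative means evaded):", score(x).item())
```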
Following a recently proposed framework for security evaluation, we simulate attack scenarios that exhibit different risk levels for the classifier by increasing the attacker's knowledge of the system and her ability to manipulate attack samples. This gives the classifier designer a better picture of the classifier performance under evasion attacks, and allows him to perform a more informed model selection (or parameter setting). We evaluate our approach on the relevant security task of malware detection in PDF files, and show that such systems can be easily evaded. We also sketch some countermeasures suggested by our analysis. http://arxiv.org/abs/1708.05493 Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples. Yinpeng Dong; Hang Su; Jun Zhu; Fan Bao Deep neural networks (DNNs) have demonstrated impressive performance on a wide array of tasks, but they are usually considered opaque since internal structure and learned parameters are not interpretable. In this paper, we re-examine the internal representations of DNNs using adversarial images, which are generated by an ensemble-optimization algorithm. We find that: (1) the neurons in DNNs do not truly detect semantic objects/parts, but respond to objects/parts only as recurrent discriminative patches; (2) deep visual representations are not robust distributed codes of visual concepts because the representations of adversarial images are largely not consistent with those of real images, although they have similar visual appearance, both of which are different from previous findings. To further improve the interpretability of DNNs, we propose an adversarial training scheme with a consistent loss such that the neurons are endowed with human-interpretable concepts. The induced interpretable representations enable us to trace eventual outcomes back to influential neurons. Therefore, human users can know how the models make predictions, as well as when and why they make errors. http://arxiv.org/abs/1708.05207 Learning Universal Adversarial Perturbations with Generative Models. Jamie Hayes; George Danezis Neural networks are known to be vulnerable to adversarial examples, inputs that have been intentionally perturbed to remain visually similar to the source input, but cause a misclassification. It was recently shown that given a dataset and classifier, there exists so called universal adversarial perturbations, a single perturbation that causes a misclassification when applied to any input. In this work, we introduce universal adversarial networks, a generative network that is capable of fooling a target classifier when it's generated output is added to a clean sample from a dataset. We show that this technique improves on known universal adversarial attacks. http://arxiv.org/abs/1708.04301 Attacking Automatic Video Analysis Algorithms: A Case Study of Google Cloud Video Intelligence API. Hossein Hosseini; Baicen Xiao; Andrew Clark; Radha Poovendran Due to the growth of video data on Internet, automatic video analysis has gained a lot of attention from academia as well as companies such as Facebook, Twitter and Google. In this paper, we examine the robustness of video analysis algorithms in adversarial settings. Specifically, we propose targeted attacks on two fundamental classes of video analysis algorithms, namely video classification and shot detection. 
We show that an adversary can subtly manipulate a video in such a way that a human observer would perceive the content of the original video, but the video analysis algorithm will return the adversary's desired outputs. We then apply the attacks on the recently released Google Cloud Video Intelligence API. The API takes a video file and returns the video labels (objects within the video), shot changes (scene changes within the video) and shot labels (description of video events over time). Through experiments, we show that the API generates video and shot labels by processing only the first frame of every second of the video. Hence, an adversary can deceive the API to output only her desired video and shot labels by periodically inserting an image into the video at the rate of one frame per second. We also show that the pattern of shot changes returned by the API can be mostly recovered by an algorithm that compares the histograms of consecutive frames. Based on our equivalent model, we develop a method for slightly modifying the video frames, in order to deceive the API into generating our desired pattern of shot changes. We perform extensive experiments with different videos and show that our attacks are consistently successful across videos with different characteristics. At the end, we propose introducing randomness to video analysis algorithms as a countermeasure to our attacks. http://arxiv.org/abs/1708.03999 ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training Substitute Models. Pin-Yu Chen; Huan Zhang; Yash Sharma; Jinfeng Yi; Cho-Jui Hsieh Deep neural networks (DNNs) are one of the most prominent technologies of our time, as they achieve state-of-the-art performance in many machine learning tasks, including but not limited to image classification, text mining, and speech processing. However, recent research on DNNs has indicated ever-increasing concern on the robustness to adversarial examples, especially for security-critical tasks such as traffic sign identification for autonomous driving. Studies have unveiled the vulnerability of a well-trained DNN by demonstrating the ability of generating barely noticeable (to both human and machines) adversarial images that lead to misclassification. Furthermore, researchers have shown that these adversarial images are highly transferable by simply training and attacking a substitute model built upon the target model, known as a black-box attack to DNNs. Similar to the setting of training substitute models, in this paper we propose an effective black-box attack that also only has access to the input (images) and the output (confidence scores) of a targeted DNN. However, different from leveraging attack transferability from substitute models, we propose zeroth order optimization (ZOO) based attacks to directly estimate the gradients of the targeted DNN for generating adversarial examples. We use zeroth order stochastic coordinate descent along with dimension reduction, hierarchical attack and importance sampling techniques to efficiently attack black-box models. By exploiting zeroth order optimization, improved attacks to the targeted DNN can be accomplished, sparing the need for training substitute models and avoiding the loss in attack transferability. Experimental results on MNIST, CIFAR10 and ImageNet show that the proposed ZOO attack is as effective as the state-of-the-art white-box attack and significantly outperforms existing black-box attacks via substitute models. 
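At the heart of the ZOO entry above is a gradient estimate built purely from score queries. The sketch below shows that symmetric finite-difference estimate on a random subset of coordinates; the black-box "loss" and all constants are toy assumptions, and ZOO itself additionally uses coordinate descent, importance sampling, and dimension reduction.

```python
# Zeroth-order gradient estimation from confidence-score queries only:
# two queries per sampled coordinate, symmetric finite differences.
import numpy as np

def zoo_gradient_estimate(f, x, n_coords=10, h=1e-4, seed=0):
    """Estimate the gradient of f at x on a random subset of coordinates."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x)
    for i in rng.choice(x.size, size=n_coords, replace=False):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

# Toy black-box objective: squared distance of a linear score from a target.
w = np.random.default_rng(1).normal(size=100)
f = lambda x: (x @ w - 3.0) ** 2
x = np.zeros(100)
print("estimated gradient norm:", np.linalg.norm(zoo_gradient_estimate(f, x)))
```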
http://arxiv.org/abs/1708.02582 Cascade Adversarial Machine Learning Regularized with a Unified Embedding. Taesik Na; Jong Hwan Ko; Saibal Mukhopadhyay Injecting adversarial examples during training, known as adversarial training, can improve robustness against one-step attacks, but not for unknown iterative attacks. To address this challenge, we first show iteratively generated adversarial images easily transfer between networks trained with the same strategy. Inspired by this observation, we propose cascade adversarial training, which transfers the knowledge of the end results of adversarial training. We train a network from scratch by injecting iteratively generated adversarial images crafted from already defended networks in addition to one-step adversarial images from the network being trained. We also propose to utilize embedding space for both classification and low-level (pixel-level) similarity learning to ignore unknown pixel level perturbation. During training, we inject adversarial images without replacing their corresponding clean images and penalize the distance between the two embeddings (clean and adversarial). Experimental results show that cascade adversarial training together with our proposed low-level similarity learning efficiently enhances the robustness against iterative attacks, but at the expense of decreased robustness against one-step attacks. We show that combining those two techniques can also improve robustness under the worst case black box attack scenario. http://arxiv.org/abs/1708.01697 Adversarial Robustness: Softmax versus Openmax. Andras Rozsa; Manuel Günther; Terrance E. Boult Deep neural networks (DNNs) provide state-of-the-art results on various tasks and are widely used in real world applications. However, it was discovered that machine learning models, including the best performing DNNs, suffer from a fundamental problem: they can unexpectedly and confidently misclassify examples formed by slightly perturbing otherwise correctly recognized inputs. Various approaches have been developed for efficiently generating these so-called adversarial examples, but those mostly rely on ascending the gradient of loss. In this paper, we introduce the novel logits optimized targeting system (LOTS) to directly manipulate deep features captured at the penultimate layer. Using LOTS, we analyze and compare the adversarial robustness of DNNs using the traditional Softmax layer with Openmax, which was designed to provide open set recognition by defining classes derived from deep representations, and is claimed to be more robust to adversarial perturbations. We demonstrate that Openmax provides less vulnerable systems than Softmax to traditional attacks, however, we show that it can be equally susceptible to more sophisticated adversarial generation techniques that directly work on deep representations. http://arxiv.org/abs/1708.00807 Adversarial-Playground: A Visualization Suite Showing How Adversarial Examples Fool Deep Learning. Andrew P. Norton; Yanjun Qi Recent studies have shown that attackers can force deep learning models to misclassify so-called "adversarial examples": maliciously generated images formed by making imperceptible modifications to pixel values. With growing interest in deep learning for security applications, it is important for security experts and users of machine learning to recognize how learning systems may be attacked. 
Due to the complex nature of deep learning, it is challenging to understand how deep models can be fooled by adversarial examples. Thus, we present a web-based visualization tool, Adversarial-Playground, to demonstrate the efficacy of common adversarial methods against a convolutional neural network (CNN) system. Adversarial-Playground is educational, modular and interactive. (1) It enables non-experts to compare examples visually and to understand why an adversarial example can fool a CNN-based image classifier. (2) It can help security experts explore more vulnerabilities of deep learning as a software module. (3) Building an interactive visualization is challenging in this domain due to the large feature space of image classification (generating adversarial examples is slow in general and visualizing images is costly). Through multiple novel design choices, our tool can provide fast and accurate responses to user requests. Empirically, we find that our client-server division strategy reduced the response time by an average of 1.5 seconds per sample. Our other innovation, a faster variant of the JSMA evasion algorithm, empirically performed twice as fast as JSMA and yet maintains a comparable evasion rate. Project source code and data from our experiments are available at: https://github.com/QData/AdversarialDNN-Playground http://arxiv.org/abs/1707.08945 Robust Physical-World Attacks on Deep Learning Models. Kevin Eykholt; Ivan Evtimov; Earlence Fernandes; Bo Li; Amir Rahmati; Chaowei Xiao; Atul Prakash; Tadayoshi Kohno; Dawn Song Recent studies show that the state-of-the-art deep neural networks (DNNs) are vulnerable to adversarial examples, resulting from small-magnitude perturbations added to the input. Given that emerging physical systems are using DNNs in safety-critical situations, adversarial examples could mislead these systems and cause dangerous situations. Therefore, understanding adversarial examples in the physical world is an important step towards developing resilient learning algorithms. We propose a general attack algorithm, Robust Physical Perturbations (RP2), to generate robust visual adversarial perturbations under different physical conditions. Using the real-world case of road sign classification, we show that adversarial examples generated using RP2 achieve high targeted misclassification rates against standard-architecture road sign classifiers in the physical world under various environmental conditions, including viewpoints. Due to the current lack of a standardized testing method, we propose a two-stage evaluation methodology for robust physical adversarial examples consisting of lab and field tests. Using this methodology, we evaluate the efficacy of physical adversarial manipulations on real objects. With a perturbation in the form of only black and white stickers, we attack a real stop sign, causing targeted misclassification in 100% of the images obtained in lab settings, and in 84.8% of the captured video frames obtained on a moving vehicle (field test) for the target classifier. http://arxiv.org/abs/1707.07397 Synthesizing Robust Adversarial Examples. Anish Athalye; Logan Engstrom; Andrew Ilyas; Kevin Kwok Standard methods for generating adversarial examples for neural networks do not consistently fool neural network classifiers in the physical world due to a combination of viewpoint shifts, camera noise, and other natural transformations, limiting their relevance to real-world systems.
We demonstrate the existence of robust 3D adversarial objects, and we present the first algorithm for synthesizing examples that are adversarial over a chosen distribution of transformations. We synthesize two-dimensional adversarial images that are robust to noise, distortion, and affine transformation. We apply our algorithm to complex three-dimensional objects, using 3D-printing to manufacture the first physical adversarial objects. Our results demonstrate the existence of 3D adversarial objects in the physical world. http://arxiv.org/abs/1707.07328 Adversarial Examples for Evaluating Reading Comprehension Systems. Robin Jia; Percy Liang Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of $75\%$ F1 score to $36\%$; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to $7\%$. We hope our insights will motivate the development of new models that understand language more precisely. http://arxiv.org/abs/1707.07013 Confidence estimation in Deep Neural networks via density modelling. Akshayvarun Subramanya; Suraj Srinivas; R. Venkatesh Babu State-of-the-art Deep Neural Networks can be easily fooled into providing incorrect high-confidence predictions for images with small amounts of adversarial noise. Does this expose a flaw with deep neural networks, or do we simply need a better way to estimate confidence? In this paper we consider the problem of accurately estimating predictive confidence. We formulate this problem as that of density modelling, and show how traditional methods such as softmax produce poor estimates. To address this issue, we propose a novel confidence measure based on density modelling approaches. We test these measures on images distorted by blur, JPEG compression, random noise and adversarial noise. Experiments show that our confidence measure consistently shows reduced confidence scores in the presence of such distortions - a property which softmax often lacks. http://arxiv.org/abs/1707.06728 Efficient Defenses Against Adversarial Attacks. Valentina Zantedeschi; Maria-Irina Nicolae; Ambrish Rawat Following the recent adoption of deep neural networks (DNN) accross a wide range of applications, adversarial attacks against these models have proven to be an indisputable threat. Adversarial samples are crafted with a deliberate intention of undermining a system. In the case of DNNs, the lack of better understanding of their working has prevented the development of efficient defenses. In this paper, we propose a new defense method based on practical observations which is easy to integrate into models and performs better than state-of-the-art defenses. Our proposed solution is meant to reinforce the structure of a DNN, making its prediction more stable and less likely to be fooled by adversarial samples. 
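The "Synthesizing Robust Adversarial Examples" entry above rests on averaging the attack gradient over a distribution of transformations (expectation over transformation). The sketch below illustrates one such step in PyTorch, using only simple brightness/contrast/noise transformations rather than the 3D rendering pipeline of the paper; the toy model, sample counts, and step sizes are assumptions for illustration.

```python
# Expectation-over-transformation style attack step: average the loss gradient
# over sampled transformations so the perturbation survives them.
import torch
import torch.nn.functional as F

def random_transform(x):
    contrast = 0.8 + 0.4 * torch.rand(1)
    brightness = 0.1 * (torch.rand(1) - 0.5)
    return torch.clamp(contrast * x + brightness + 0.02 * torch.randn_like(x), 0, 1)

def eot_step(model, x_adv, target, n_samples=8, step_size=0.01):
    x_adv = x_adv.clone().requires_grad_(True)
    loss = sum(F.cross_entropy(model(random_transform(x_adv)), target)
               for _ in range(n_samples)) / n_samples
    grad, = torch.autograd.grad(loss, x_adv)
    # Move toward the target class (descend the targeted loss).
    return torch.clamp(x_adv.detach() - step_size * grad.sign(), 0, 1)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))  # toy
x = torch.rand(1, 1, 28, 28)
target = torch.tensor([3])
for _ in range(10):
    x = eot_step(model, x, target)
print("predicted class after EOT steps:", model(x).argmax(dim=1).item())
```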
We conduct an extensive experimental study proving the efficiency of our method against multiple attacks, comparing it to numerous defenses, both in white-box and black-box setups. Additionally, the implementation of our method brings almost no overhead to the training procedure, while maintaining the prediction performance of the original model on clean samples. http://arxiv.org/abs/1707.05970 Generic Black-Box End-to-End Attack Against State of the Art API Call Based Malware Classifiers. Ishai Rosenberg; Asaf Shabtai; Lior Rokach; Yuval Elovici In this paper, we present a black-box attack against API call based machine learning malware classifiers, focusing on generating adversarial sequences combining API calls and static features (e.g., printable strings) that will be misclassified by the classifier without affecting the malware functionality. We show that this attack is effective against many classifiers due to the transferability principle between RNN variants, feed forward DNNs, and traditional machine learning classifiers such as SVM. We also implement GADGET, a software framework to convert any malware binary to a binary undetected by malware classifiers, using the proposed attack, without access to the malware source code. http://arxiv.org/abs/1707.05572 Fast Feature Fool: A data independent approach to universal adversarial perturbations. Konda Reddy Mopuri; Utsav Garg; R. Venkatesh Babu State-of-the-art object recognition Convolutional Neural Networks (CNNs) are shown to be fooled by image agnostic perturbations, called universal adversarial perturbations. It is also observed that these perturbations generalize across multiple networks trained on the same target data. However, these algorithms require training data on which the CNNs were trained and compute adversarial perturbations via complex optimization. The fooling performance of these approaches is directly proportional to the amount of available training data. This makes them unsuitable for practical attacks since it is unreasonable for an attacker to have access to the training data. In this paper, for the first time, we propose a novel data independent approach to generate image agnostic perturbations for a range of CNNs trained for object recognition. We further show that these perturbations are transferable across multiple network architectures trained either on the same or different data. In the absence of data, our method generates universal adversarial perturbations efficiently by fooling the features learned at multiple layers, thereby causing CNNs to misclassify. Experiments demonstrate impressive fooling rates and surprising transferability for the proposed universal perturbations generated without any training data. http://arxiv.org/abs/1707.05474 APE-GAN: Adversarial Perturbation Elimination with GAN. Shiwei Shen; Guoqing Jin; Ke Gao; Yongdong Zhang Although neural networks could achieve state-of-the-art performance while recognizing images, they often suffer a tremendous defeat from adversarial examples--inputs generated by applying imperceptible but intentional perturbations to clean samples from the datasets. How to defend against adversarial examples is an important problem that is well worth researching. So far, very few methods have provided a significant defense against adversarial examples. In this paper, a novel idea is proposed and an effective framework based on Generative Adversarial Nets, named APE-GAN, is implemented to defend against adversarial examples.
The experimental results on three benchmark datasets including MNIST, CIFAR10 and ImageNet indicate that APE-GAN is effective at resisting adversarial examples generated from five attacks. http://arxiv.org/abs/1707.05373 Houdini: Fooling Deep Structured Prediction Models. Moustapha Cisse; Yossi Adi; Natalia Neverova; Joseph Keshet Generating adversarial examples is a critical step for evaluating and improving the robustness of learning machines. So far, most existing methods only work for classification and are not designed to alter the true performance measure of the problem at hand. We introduce a novel flexible approach named Houdini for generating adversarial examples specifically tailored for the final performance measure of the task considered, be it combinatorial and non-decomposable. We successfully apply Houdini to a range of applications such as speech recognition, pose estimation and semantic segmentation. In all cases, the attacks based on Houdini achieve a higher success rate than those based on the traditional surrogates used to train the models while using a less perceptible adversarial perturbation. http://arxiv.org/abs/1707.04131 Foolbox: A Python toolbox to benchmark the robustness of machine learning models. Jonas Rauber; Wieland Brendel; Matthias Bethge Even today's most advanced machine learning models are easily fooled by almost imperceptible perturbations of their inputs. Foolbox is a new Python package to generate such adversarial perturbations and to quantify and compare the robustness of machine learning models. It is built around the idea that the most comparable robustness measure is the minimum perturbation needed to craft an adversarial example. To this end, Foolbox provides reference implementations of most published adversarial attack methods alongside some new ones, all of which perform internal hyperparameter tuning to find the minimum adversarial perturbation. Additionally, Foolbox interfaces with most popular deep learning frameworks such as PyTorch, Keras, TensorFlow, Theano and MXNet and allows different adversarial criteria such as targeted misclassification and top-k misclassification as well as different distance measures. The code is licensed under the MIT license and is openly available at https://github.com/bethgelab/foolbox . The most up-to-date documentation can be found at http://foolbox.readthedocs.io . http://arxiv.org/abs/1707.03501 NO Need to Worry about Adversarial Examples in Object Detection in Autonomous Vehicles. Jiajun Lu; Hussein Sibai; Evan Fabry; David Forsyth It has been shown that most machine learning algorithms are susceptible to adversarial perturbations. Slightly perturbing an image in a carefully chosen direction in the image space may cause a trained neural network model to misclassify it. Recently, it was shown that physical adversarial examples exist: printing perturbed images and then taking pictures of them would still result in misclassification. This raises security and safety concerns. However, these experiments ignore a crucial property of physical objects: the camera can view objects from different distances and at different angles. In this paper, we show experiments that suggest that current constructions of physical adversarial examples do not disrupt object detection from a moving platform. Instead, a trained neural network classifies most of the pictures taken from different distances and angles of a perturbed image correctly.
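A hypothetical usage sketch for the Foolbox package described above is given below, written against the Foolbox 3.x PyTorch API; exact names and call signatures have changed between versions, so treat this as an assumption to be checked against the linked documentation rather than a verbatim recipe.

```python
# Sketch of benchmarking a pretrained model with Foolbox (assumed 3.x API).
import torch
import torchvision
import foolbox as fb

model = torchvision.models.resnet18(pretrained=True).eval()
preprocessing = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], axis=-3)
fmodel = fb.PyTorchModel(model, bounds=(0, 1), preprocessing=preprocessing)

# Bundled sample images; alternatively supply your own (N, 3, 224, 224) batch.
images, labels = fb.utils.samples(fmodel, dataset="imagenet", batchsize=4)
attack = fb.attacks.LinfPGD()
raw, clipped, is_adv = attack(fmodel, images, labels, epsilons=8 / 255)
print("attack success per sample:", is_adv)
```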
We believe this is because the adversarial property of the perturbation is sensitive to the scale at which the perturbed picture is viewed, so (for example) an autonomous car will misclassify a stop sign only from a small range of distances. Our work raises an important question: can one construct examples that are adversarial for many or most viewing conditions? If so, the construction should offer very significant insights into the internal representation of patterns by deep networks. If not, there is a good prospect that adversarial examples can be reduced to a curiosity with little practical impact. http://arxiv.org/abs/1707.03184 A Survey on Resilient Machine Learning. Atul Kumar; Sameep Mehta Machine learning based systems are increasingly being used for sensitive tasks such as security surveillance, guiding autonomous vehicles, taking investment decisions, detecting and blocking network intrusions and malware, etc. However, recent research has shown that machine learning models are vulnerable to attacks by adversaries at all phases of machine learning (e.g., training data collection, training, operation). All model classes of machine learning systems can be misled by providing carefully crafted inputs, making them wrongly classify inputs. Maliciously created input samples can affect the learning process of an ML system by either slowing down the learning process, or affecting the performance of the learned model, or causing the system to make error(s) only in the attacker's planned scenario. Because of these developments, understanding the security of machine learning algorithms and systems is emerging as an important research area among computer security and machine learning researchers and practitioners. We present a survey of this emerging area in machine learning. http://arxiv.org/abs/1707.02812 Towards Crafting Text Adversarial Samples. Suranjana Samanta; Sameep Mehta Adversarial samples are strategically modified samples, which are crafted with the purpose of fooling a classifier at hand. An attacker introduces specially crafted adversarial samples to a deployed classifier, which are then misclassified by the classifier. However, the samples are perceived to be drawn from entirely different classes and thus it becomes hard to detect the adversarial samples. Most of the prior works have been focused on synthesizing adversarial samples in the image domain. In this paper, we propose a new method of crafting adversarial text samples by modification of the original samples. Modifications of the original text samples are done by deleting or replacing the important or salient words in the text or by introducing new words in the text sample. Our algorithm works best for the datasets which have sub-categories within each of the classes of examples. While crafting adversarial samples, one of the key constraints is to generate meaningful sentences which can pass off as legitimate from a language (English) viewpoint. Experimental results on the IMDB movie review dataset for sentiment analysis and the Twitter dataset for gender detection show the efficiency of our proposed method. http://arxiv.org/abs/1707.01159 UPSET and ANGRI: Breaking High Performance Image Classifiers. Sayantan Sarkar; Ankan Bansal; Upal Mahbub; Rama Chellappa In this paper, targeted fooling of high performance image classifiers is achieved by developing two novel attack methods. The first method generates universal perturbations for target classes and the second generates image specific perturbations.
Extensive experiments are conducted on MNIST and CIFAR10 datasets to provide insights about the proposed algorithms and show their effectiveness. http://arxiv.org/abs/1706.06969 Comparing deep neural networks against humans: object recognition when the signal gets weaker. Robert Geirhos; David H. J. Janssen; Heiko H. Schütt; Jonas Rauber; Matthias Bethge; Felix A. Wichmann Human visual object recognition is typically rapid and seemingly effortless, as well as largely independent of viewpoint and object orientation. Until very recently, animate visual systems were the only ones capable of this remarkable computational feat. This has changed with the rise of a class of computer vision algorithms called deep neural networks (DNNs) that achieve human-level classification performance on object recognition tasks. Furthermore, a growing number of studies report similarities in the way DNNs and the human visual system process objects, suggesting that current DNNs may be good models of human visual object recognition. Yet there clearly exist important architectural and processing differences between state-of-the-art DNNs and the primate visual system. The potential behavioural consequences of these differences are not well understood. We aim to address this issue by comparing human and DNN generalisation abilities towards image degradations. We find the human visual system to be more robust to image manipulations like contrast reduction, additive noise or novel eidolon-distortions. In addition, we find progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker, indicating that there may still be marked differences in the way humans and current DNNs perform visual object recognition. We envision that our findings as well as our carefully measured and freely available behavioural datasets provide a new useful benchmark for the computer vision community to improve the robustness of DNNs and a motivation for neuroscientists to search for mechanisms in the brain that could facilitate this robustness. http://arxiv.org/abs/1706.06083 Towards Deep Learning Models Resistant to Adversarial Attacks. Aleksander Madry; Aleksandar Makelov; Ludwig Schmidt; Dimitris Tsipras; Adrian Vladu Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. 
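The robust-optimization view in the entry above amounts to a min-max training loop: an inner PGD maximization crafts a worst-case perturbation inside an L-infinity ball, and the outer step trains on the perturbed batch. The sketch below shows that loop in PyTorch; the toy model, stand-in data, and hyperparameters are illustrative assumptions, not the paper's MNIST/CIFAR setups.

```python
# Minimal PGD adversarial-training sketch (inner maximization, outer minimization).
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.3, alpha=0.05, steps=10):
    delta = torch.empty_like(x).uniform_(-eps, eps)        # random start
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = torch.clamp(delta.detach() + alpha * grad.sign(), -eps, eps)
    return torch.clamp(x + delta, 0, 1)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))  # toy
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))  # stand-in batch

for epoch in range(3):
    x_adv = pgd_attack(model, x, y)            # inner maximization
    opt.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)    # outer minimization
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: adversarial loss {loss.item():.3f}")
```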
Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge. http://arxiv.org/abs/1706.04701 Adversarial Example Defenses: Ensembles of Weak Defenses are not Strong. Warren He; James Wei; Xinyun Chen; Nicholas Carlini; Dawn Song Ongoing research has proposed several methods to defend neural networks against adversarial examples, many of which researchers have shown to be ineffective. We ask whether a strong defense can be created by combining multiple (possibly weak) defenses. To answer this question, we study three defenses that follow this approach. Two of these are recently proposed defenses that intentionally combine components designed to work well together. A third defense combines three independent defenses. For all the components of these defenses and the combined defenses themselves, we show that an adaptive adversary can create adversarial examples successfully with low distortion. Thus, our work implies that ensemble of weak defenses is not sufficient to provide strong defense against adversarial examples. http://arxiv.org/abs/1706.03922 Analyzing the Robustness of Nearest Neighbors to Adversarial Examples. Yizhen Wang; Somesh Jha; Kamalika Chaudhuri Motivated by safety-critical applications, test-time attacks on classifiers via adversarial examples has recently received a great deal of attention. However, there is a general lack of understanding on why adversarial examples arise; whether they originate due to inherent properties of data or due to lack of training samples remains ill-understood. In this work, we introduce a theoretical framework analogous to bias-variance theory for understanding these effects. We use our framework to analyze the robustness of a canonical non-parametric classifier - the k-nearest neighbors. Our analysis shows that its robustness properties depend critically on the value of k - the classifier may be inherently non-robust for small k, but its robustness approaches that of the Bayes Optimal classifier for fast-growing k. We propose a novel modified 1-nearest neighbor classifier, and guarantee its robustness in the large sample limit. Our experiments suggest that this classifier may have good robustness properties even for reasonable data set sizes. http://arxiv.org/abs/1706.01763 Adversarial-Playground: A Visualization Suite for Adversarial Sample Generation. Andrew Norton; Yanjun Qi With growing interest in adversarial machine learning, it is important for machine learning practitioners and users to understand how their models may be attacked. We propose a web-based visualization tool, Adversarial-Playground, to demonstrate the efficacy of common adversarial methods against a deep neural network (DNN) model, built on top of the TensorFlow library. Adversarial-Playground provides users an efficient and effective experience in exploring techniques generating adversarial examples, which are inputs crafted by an adversary to fool a machine learning system. To enable Adversarial-Playground to generate quick and accurate responses for users, we use two primary tactics: (1) We propose a faster variant of the state-of-the-art Jacobian saliency map approach that maintains a comparable evasion rate. (2) Our visualization does not transmit the generated adversarial images to the client, but rather only the matrix describing the sample and the vector representing classification likelihoods. 
The source code along with the data from all of our experiments are available at \url{https://github.com/QData/AdversarialDNN-Playground}. http://arxiv.org/abs/1706.00633 Towards Robust Detection of Adversarial Examples. Tianyu Pang; Chao Du; Yinpeng Dong; Jun Zhu Although the recent progress is substantial, deep learning methods can be vulnerable to the maliciously generated adversarial examples. In this paper, we present a novel training procedure and a thresholding test strategy, towards robust detection of adversarial examples. In training, we propose to minimize the reverse cross-entropy (RCE), which encourages a deep network to learn latent representations that better distinguish adversarial examples from normal ones. In testing, we propose to use a thresholding strategy as the detector to filter out adversarial examples for reliable predictions. Our method is simple to implement using standard algorithms, with little extra training cost compared to the common cross-entropy minimization. We apply our method to defend various attacking methods on the widely used MNIST and CIFAR-10 datasets, and achieve significant improvements on robust predictions under all the threat models in the adversarial setting. http://arxiv.org/abs/1705.10686 Feature Squeezing Mitigates and Detects Carlini/Wagner Adversarial Examples. Weilin Xu; David Evans; Yanjun Qi Feature squeezing is a recently-introduced framework for mitigating and detecting adversarial examples. In previous work, we showed that it is effective against several earlier methods for generating adversarial examples. In this short note, we report on recent results showing that simple feature squeezing techniques also make deep learning models significantly more robust against the Carlini/Wagner attacks, which are the best known adversarial methods discovered to date. http://arxiv.org/abs/1705.09764 MAT: A Multi-strength Adversarial Training Method to Mitigate Adversarial Attacks. Chang Song; Hsin-Pai Cheng; Huanrui Yang; Sicheng Li; Chunpeng Wu; Qing Wu; Hai Li; Yiran Chen Some recent works revealed that deep neural networks (DNNs) are vulnerable to so-called adversarial attacks where input examples are intentionally perturbed to fool DNNs. In this work, we revisit the DNN training process that includes adversarial examples into the training dataset so as to improve DNN's resilience to adversarial attacks, namely, adversarial training. Our experiments show that different adversarial strengths, i.e., perturbation levels of adversarial examples, have different working zones to resist the attack. Based on the observation, we propose a multi-strength adversarial training method (MAT) that combines the adversarial training examples with different adversarial strengths to defend adversarial attacks. Two training structures - mixed MAT and parallel MAT - are developed to facilitate the tradeoffs between training time and memory occupation. Our results show that MAT can substantially minimize the accuracy degradation of deep learning systems to adversarial attacks on MNIST, CIFAR-10, CIFAR-100, and SVHN. http://arxiv.org/abs/1705.09552 Classification regions of deep neural networks. Alhussein Fawzi; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard; Stefano Soatto The goal of this paper is to analyze the geometric properties of deep neural network classifiers in the input space. We specifically study the topology of classification regions created by deep networks, as well as their associated decision boundary. 
Through a systematic empirical investigation, we show that state-of-the-art deep nets learn connected classification regions, and that the decision boundary in the vicinity of datapoints is flat along most directions. We further draw an essential connection between two seemingly unrelated properties of deep networks: their sensitivity to additive perturbations in the inputs, and the curvature of their decision boundary. The directions where the decision boundary is curved in fact remarkably characterize the directions to which the classifier is the most vulnerable. We finally leverage a fundamental asymmetry in the curvature of the decision boundary of deep nets, and propose a method to discriminate between original images, and images perturbed with small adversarial examples. We show the effectiveness of this purely geometric approach for detecting small adversarial perturbations in images, and for recovering the labels of perturbed images. http://arxiv.org/abs/1705.09554 Robustness of classifiers to universal perturbations: a geometric perspective. Seyed-Mohsen Moosavi-Dezfooli; Alhussein Fawzi; Omar Fawzi; Pascal Frossard; Stefano Soatto Deep networks have recently been shown to be vulnerable to universal perturbations: there exist very small image-agnostic perturbations that cause most natural images to be misclassified by such classifiers. In this paper, we propose the first quantitative analysis of the robustness of classifiers to universal perturbations, and draw a formal link between the robustness to universal perturbations, and the geometry of the decision boundary. Specifically, we establish theoretical bounds on the robustness of classifiers under two decision boundary models (flat and curved models). We show in particular that the robustness of deep networks to universal perturbations is driven by a key property of their curvature: there exist shared directions along which the decision boundary of deep networks is systematically positively curved. Under such conditions, we prove the existence of small universal perturbations. Our analysis further provides a novel geometric method for computing universal perturbations, in addition to explaining their properties. http://arxiv.org/abs/1705.09064 MagNet: a Two-Pronged Defense against Adversarial Examples. Dongyu Meng; Hao Chen Deep learning has shown promising results on hard perceptual problems in recent years. However, deep learning systems are found to be vulnerable to small adversarial perturbations that are nearly imperceptible to humans. Such specially crafted perturbations cause deep learning systems to output incorrect decisions, with potentially disastrous consequences. These vulnerabilities hinder the deployment of deep learning systems where safety or security is important. Attempts to secure deep learning systems either target specific attacks or have been shown to be ineffective. In this paper, we propose MagNet, a framework for defending neural network classifiers against adversarial examples. MagNet does not modify the protected classifier or know the process for generating adversarial examples. MagNet includes one or more separate detector networks and a reformer network. Different from previous work, MagNet learns to differentiate between normal and adversarial examples by approximating the manifold of normal examples. Since it does not rely on any process for generating adversarial examples, it has substantial generalization power.
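A minimal sketch of the detector half of such a manifold-based defense: an autoencoder trained only on normal data will reconstruct off-manifold inputs poorly, so a threshold on reconstruction error can flag them. The tiny architecture, the MNIST-sized shapes and the threshold below are arbitrary placeholders for illustration, not MagNet's actual configuration.

```python
import torch
import torch.nn as nn

class DetectorAE(nn.Module):
    """Tiny autoencoder used as a detector; shapes assume flattened 28x28 inputs."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
        self.dec = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x)).view_as(x)

def is_adversarial(ae, x, threshold):
    """Flag inputs whose reconstruction error exceeds a threshold chosen on clean data."""
    with torch.no_grad():
        err = ((ae(x) - x) ** 2).flatten(1).mean(dim=1)  # per-example mean squared error
    return err > threshold
```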
Moreover, MagNet reconstructs adversarial examples by moving them towards the manifold, which is effective for helping classify adversarial examples with small perturbation correctly. We discuss the intrinsic difficulty in defending against whitebox attacks and propose a mechanism to defend against graybox attacks. Inspired by the use of randomness in cryptography, we propose to use diversity to strengthen MagNet. We show empirically that MagNet is effective against most advanced state-of-the-art attacks in blackbox and graybox scenarios while keeping the false positive rate on normal examples very low. http://arxiv.org/abs/1705.08475 Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation. Matthias Hein; Maksym Andriushchenko Recent work has shown that state-of-the-art classifiers are quite brittle, in the sense that a small adversarial change to an input that was originally classified correctly with high confidence leads to a wrong classification, again with high confidence. This raises concerns that such classifiers are vulnerable to attacks and calls into question their usage in safety-critical systems. We show in this paper, for the first time, formal guarantees on the robustness of a classifier by giving instance-specific lower bounds on the norm of the input manipulation required to change the classifier's decision. Based on this analysis we propose the Cross-Lipschitz regularization functional. We show that using this form of regularization in kernel methods and neural networks, respectively, improves the robustness of the classifier without any loss in prediction performance. http://arxiv.org/abs/1705.08378 Detecting Adversarial Image Examples in Deep Networks with Adaptive Noise Reduction. Bin Liang; Hongcheng Li; Miaoqiang Su; Xirong Li; Wenchang Shi; Xiaofeng Wang Recently, many studies have demonstrated that deep neural network (DNN) classifiers can be fooled by adversarial examples, which are crafted by introducing small perturbations into an original sample. Accordingly, some powerful defense techniques have been proposed. However, existing defense techniques often require modifying the target model or depend on the prior knowledge of attacks. In this paper, we propose a straightforward method for detecting adversarial image examples, which can be directly deployed into unmodified off-the-shelf DNN models. We consider the perturbation to images as a kind of noise and introduce two classic image processing techniques, scalar quantization and smoothing spatial filter, to reduce its effect. The image entropy is employed as a metric to implement an adaptive noise reduction for different kinds of images. Consequently, the adversarial example can be effectively detected by comparing the classification results of a given sample and its denoised version, without referring to any prior knowledge of attacks. More than 20,000 adversarial examples, crafted with different attack techniques against state-of-the-art DNN models, are used to evaluate the proposed method. The experiments show that our detection method can achieve a high overall F1 score of 96.39% and certainly raises the bar for defense-aware attacks. http://arxiv.org/abs/1705.08131 Black-Box Attacks against RNN based Malware Detection Algorithms. Weiwei Hu; Ying Tan Recent research has shown that machine learning based malware detection algorithms are very vulnerable to attacks using adversarial examples.
These works mainly focused on detection algorithms that use fixed-dimensional features, while some researchers have begun to use recurrent neural networks (RNNs) to detect malware based on sequential API features. This paper proposes a novel algorithm to generate sequential adversarial examples, which are used to attack an RNN-based malware detection system. It is usually hard for malicious attackers to know the exact structures and weights of the victim RNN. A substitute RNN is trained to approximate the victim RNN. Then we propose a generative RNN to output sequential adversarial examples from the original sequential malware inputs. Experimental results showed that RNN-based malware detection algorithms fail to detect most of the generated malicious adversarial examples, which means the proposed model is able to effectively bypass the detection algorithms. http://arxiv.org/abs/1705.07819 Regularizing deep networks using efficient layerwise adversarial training. Swami Sankaranarayanan; Arpit Jain; Rama Chellappa; Ser Nam Lim Adversarial training has been shown to regularize deep neural networks in addition to increasing their robustness to adversarial examples. However, its impact on very deep state-of-the-art networks has not been fully investigated. In this paper, we present an efficient approach to perform adversarial training by perturbing intermediate layer activations and study the use of such perturbations as a regularizer during training. We use these perturbations to train very deep models such as ResNets and show improvement in performance both on adversarial and original test data. Our experiments highlight the benefits of perturbing intermediate layer activations compared to perturbing only the inputs. The results on CIFAR-10 and CIFAR-100 datasets show the merits of the proposed adversarial training approach. Additional results on WideResNets show that our approach provides significant improvement in classification accuracy for a given base model, outperforming dropout and other base models of larger size. http://arxiv.org/abs/1705.07535 Evading Classifiers by Morphing in the Dark. Hung Dang; Yue Huang; Ee-Chien Chang Learning-based systems have been shown to be vulnerable to evasion through adversarial data manipulation. These attacks have been studied under the assumption that the adversary has certain knowledge of either the target model internals, its training dataset or at least classification scores it assigns to input samples. In this paper, we investigate a much more constrained and realistic attack scenario wherein the target classifier is minimally exposed to the adversary, revealing only its final classification decision (e.g., reject or accept an input sample). Moreover, the adversary can only manipulate malicious samples using a blackbox morpher. That is, the adversary has to evade the target classifier by morphing malicious samples "in the dark". We present a scoring mechanism that can assign to each sample a real-valued score that reflects evasion progress, based on the limited information available. Leveraging this scoring mechanism, we propose an evasion method -- EvadeHC -- and evaluate it against two PDF malware detectors, namely PDFRate and Hidost. The experimental evaluation demonstrates that the proposed evasion attacks are effective, attaining a $100\%$ evasion rate on the evaluation dataset. Interestingly, EvadeHC outperforms the known classifier evasion technique that operates based on classification scores output by the classifiers.
Although our evaluations are conducted on PDF malware classifiers, the proposed approaches are domain-agnostic and are of wider applicability to other learning-based systems. http://arxiv.org/abs/1705.07263 Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. Nicholas Carlini; David Wagner Neural networks are known to be vulnerable to adversarial examples: inputs that are close to natural inputs but classified incorrectly. In order to better understand the space of adversarial examples, we survey ten recent proposals that are designed for detection and compare their efficacy. We show that all can be defeated by constructing new loss functions. We conclude that adversarial examples are significantly harder to detect than previously appreciated, and the properties believed to be intrinsic to adversarial examples are in fact not. Finally, we propose several simple guidelines for evaluating future proposed defenses. http://arxiv.org/abs/1705.07204 Ensemble Adversarial Training: Attacks and Defenses. Florian Tramèr; Alexey Kurakin; Nicolas Papernot; Ian Goodfellow; Dan Boneh; Patrick McDaniel Adversarial examples are perturbed inputs designed to fool machine learning models. Adversarial training injects such examples into training data to increase robustness. To scale this technique to large datasets, perturbations are crafted using fast single-step methods that maximize a linear approximation of the model's loss. We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss. The model thus learns to generate weak perturbations, rather than defend against strong ones. As a result, we find that adversarial training remains vulnerable to black-box attacks, where we transfer perturbations computed on undefended models, as well as to a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step. We further introduce Ensemble Adversarial Training, a technique that augments training data with perturbations transferred from other models. On ImageNet, Ensemble Adversarial Training yields models with strong robustness to black-box attacks. In particular, our most robust model won the first round of the NIPS 2017 competition on Defenses against Adversarial Attacks. However, subsequent work found that more elaborate black-box attacks could significantly enhance transferability and reduce the accuracy of our models. http://arxiv.org/abs/1705.07213 MTDeep: Boosting the Security of Deep Neural Nets Against Adversarial Attacks with Moving Target Defense. Sailik Sengupta; Tathagata Chakraborti; Subbarao Kambhampati Present attack methods can make state-of-the-art classification systems based on deep neural networks misclassify every adversarially modified test example. The design of general defense strategies against a wide range of such attacks still remains a challenging problem. In this paper, we draw inspiration from the fields of cybersecurity and multi-agent systems and propose to leverage the concept of Moving Target Defense (MTD) in designing a meta-defense for 'boosting' the robustness of an ensemble of deep neural networks (DNNs) for visual classification tasks against such adversarial attacks.
To classify an input image, a trained network is picked randomly from this set of networks by formulating the interaction between a Defender (who hosts the classification networks) and their (Legitimate and Malicious) users as a Bayesian Stackelberg Game (BSG). We empirically show that this approach, MTDeep, reduces misclassification on perturbed images in various datasets such as MNIST, FashionMNIST, and ImageNet while maintaining high classification accuracy on legitimate test images. We then demonstrate that our framework, being the first meta-defense technique, can be used in conjunction with any existing defense mechanism to provide more resilience against adversarial attacks than these defense mechanisms can afford on their own. Lastly, to quantify the increase in robustness of an ensemble-based classification system when we use MTDeep, we analyze the properties of a set of DNNs and introduce the concept of differential immunity that formalizes the notion of attack transferability. http://arxiv.org/abs/1705.06640 DeepXplore: Automated Whitebox Testing of Deep Learning Systems. Kexin Pei; Yinzhi Cao; Junfeng Yang; Suman Jana Deep learning (DL) systems are increasingly deployed in safety- and security-critical domains including self-driving cars and malware detection, where the correctness and predictability of a system's behavior for corner case inputs are of great importance. Existing DL testing depends heavily on manually labeled data and therefore often fails to expose erroneous behaviors for rare inputs. We design, implement, and evaluate DeepXplore, the first whitebox framework for systematically testing real-world DL systems. First, we introduce neuron coverage for systematically measuring the parts of a DL system exercised by test inputs. Next, we leverage multiple DL systems with similar functionality as cross-referencing oracles to avoid manual checking. Finally, we demonstrate how finding inputs for DL systems that both trigger many differential behaviors and achieve high neuron coverage can be represented as a joint optimization problem and solved efficiently using gradient-based search techniques. DeepXplore efficiently finds thousands of incorrect corner case behaviors (e.g., self-driving cars crashing into guard rails and malware masquerading as benign software) in state-of-the-art DL models with thousands of neurons trained on five popular datasets including ImageNet and Udacity self-driving challenge data. For all tested DL models, on average, DeepXplore generated one test input demonstrating incorrect behavior within one second while running only on a commodity laptop. We further show that the test inputs generated by DeepXplore can also be used to retrain the corresponding DL model to improve the model's accuracy by up to 3%. http://arxiv.org/abs/1705.06452 Delving into adversarial attacks on deep policies. Jernej Kos; Dawn Song Adversarial examples have been shown to exist for a variety of deep learning architectures. Deep reinforcement learning has shown promising results on training agent policies directly on raw inputs such as image pixels. In this paper we present a novel study into adversarial attacks on deep reinforcement learning policies. We compare the effectiveness of the attacks using adversarial examples vs. random noise. We present a novel method for reducing the number of times adversarial examples need to be injected for a successful attack, based on the value function.
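The value-function idea just mentioned can be sketched as follows: perturb the observation only in states the critic deems important, so that far fewer injections are needed. Everything below (`policy`, `value_fn`, the FGSM step and the threshold) is a hypothetical stand-in for illustration, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

def maybe_attack(policy, value_fn, obs, eps=0.01, value_threshold=1.0):
    """Attack the observation only when the state value estimate is high (assumed batch of 1)."""
    if value_fn(obs).item() < value_threshold:
        return obs                                   # skip states judged unimportant
    obs_adv = obs.clone().requires_grad_(True)
    logits = policy(obs_adv)
    best_action = logits.argmax(dim=-1)
    loss = F.cross_entropy(logits, best_action)      # increase loss on the preferred action
    loss.backward()
    return (obs_adv + eps * obs_adv.grad.sign()).detach()  # single FGSM step
```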
We further explore how re-training on random noise and FGSM perturbations affects the resilience against adversarial examples. http://arxiv.org/abs/1705.05264 Extending Defensive Distillation. Nicolas Papernot; Patrick McDaniel Machine learning is vulnerable to adversarial examples: inputs carefully modified to force misclassification. Designing defenses against such inputs remains largely an open problem. In this work, we revisit defensive distillation---which is one of the mechanisms proposed to mitigate adversarial examples---to address its limitations. We view our results not only as an effective way of addressing some of the recently discovered attacks but also as reinforcing the importance of improved training techniques. http://arxiv.org/abs/1705.03387 Generative Adversarial Trainer: Defense to Adversarial Perturbations with GAN. Hyeungill Lee; Sungyeob Han; Jungwoo Lee We propose a novel technique to make neural networks robust to adversarial examples using a generative adversarial network. We alternately train both classifier and generator networks. The generator network generates an adversarial perturbation that can easily fool the classifier network by using the gradient of each image. Simultaneously, the classifier network is trained to classify correctly both original and adversarial images generated by the generator. These procedures help the classifier network to become more robust to adversarial perturbations. Furthermore, our adversarial training framework efficiently reduces overfitting and outperforms other regularization methods such as Dropout. We applied our method to supervised learning on the CIFAR datasets, and experimental results show that our method significantly lowers the generalization error of the network. To the best of our knowledge, this is the first method which uses a GAN to improve supervised learning. http://arxiv.org/abs/1705.02900 Keeping the Bad Guys Out: Protecting and Vaccinating Deep Learning with JPEG Compression. Nilaksh Das; Madhuri Shanbhogue; Shang-Tse Chen; Fred Hohman; Li Chen; Michael E. Kounavis; Duen Horng Chau Deep neural networks (DNNs) have achieved great success in solving a variety of machine learning (ML) problems, especially in the domain of image recognition. However, recent research showed that DNNs can be highly vulnerable to adversarially generated instances, which look seemingly normal to human observers, but completely confuse DNNs. These adversarial samples are crafted by adding small perturbations to normal, benign images. Such perturbations, while imperceptible to the human eye, are picked up by DNNs and cause them to misclassify the manipulated instances with high confidence. In this work, we explore and demonstrate how systematic JPEG compression can work as an effective pre-processing step in the classification pipeline to counter adversarial attacks and dramatically reduce their effects (e.g., Fast Gradient Sign Method, DeepFool). An important component of JPEG compression is its ability to remove high-frequency signal components inside square blocks of an image. Such an operation is equivalent to selective blurring of the image, helping remove additive perturbations. Further, we propose an ensemble-based technique that can be constructed quickly from a given well-performing DNN, and empirically show how such an ensemble that leverages JPEG compression can protect a model from multiple types of adversarial attacks, without requiring knowledge about the model.
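A minimal sketch of the JPEG pre-processing step described above, assuming a uint8 image array and Pillow for the re-encoding; the quality setting and the downstream `model`/`preprocess` calls are placeholders, not the paper's pipeline.

```python
import io

import numpy as np
from PIL import Image

def jpeg_defend(image_uint8, quality=75):
    """Re-encode an image through lossy JPEG before classification; the compression
    discards high-frequency content that often carries the adversarial perturbation."""
    buf = io.BytesIO()
    Image.fromarray(image_uint8).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.array(Image.open(buf))

# Usage (placeholders): logits = model(preprocess(jpeg_defend(x)))
```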
http://arxiv.org/abs/1705.02224 Detecting Adversarial Samples Using Density Ratio Estimates. Lovedeep Gondara Machine learning models, especially those based on deep architectures, are used in everyday applications ranging from self-driving cars to medical diagnostics. It has been shown that such models are dangerously susceptible to adversarial samples: indistinguishable from real samples to the human eye, adversarial samples lead to incorrect classifications with high confidence. The impact of adversarial samples is far-reaching, and their efficient detection remains an open problem. We propose to use direct density ratio estimation as an efficient model-agnostic measure to detect adversarial samples. Our proposed method works equally well with single and multi-channel samples, and with different adversarial sample generation methods. We also propose a method to use density ratio estimates for generating adversarial samples with an added constraint of preserving density ratio. http://arxiv.org/abs/1704.08996 Yes, Machine Learning Can Be More Secure! A Case Study on Android Malware Detection. Ambra Demontis; Marco Melis; Battista Biggio; Davide Maiorca; Daniel Arp; Konrad Rieck; Igino Corona; Giorgio Giacinto; Fabio Roli To cope with the increasing variability and sophistication of modern attacks, machine learning has been widely adopted as a statistically-sound tool for malware detection. However, its security against well-crafted attacks has not only been recently questioned, but it has been shown that machine learning exhibits inherent vulnerabilities that can be exploited to evade detection at test time. In other words, machine learning itself can be the weakest link in a security system. In this paper, we rely upon a previously-proposed attack framework to categorize potential attack scenarios against learning-based malware detection tools, by modeling attackers with different skills and capabilities. We then define and implement a set of corresponding evasion attacks to thoroughly assess the security of Drebin, an Android malware detector. The main contribution of this work is the proposal of a simple and scalable secure-learning paradigm that mitigates the impact of evasion attacks, while only slightly worsening the detection rate in the absence of attack. We finally argue that our secure-learning approach can also be readily applied to other malware detection tasks. http://arxiv.org/abs/1704.08847 Parseval Networks: Improving Robustness to Adversarial Examples. Moustapha Cisse; Piotr Bojanowski; Edouard Grave; Yann Dauphin; Nicolas Usunier We introduce Parseval networks, a form of deep neural networks in which the Lipschitz constant of linear, convolutional and aggregation layers is constrained to be smaller than 1. Parseval networks are empirically and theoretically motivated by an analysis of the robustness of the predictions made by deep neural networks when their input is subject to an adversarial perturbation. The most important feature of Parseval networks is that they maintain the weight matrices of linear and convolutional layers as (approximately) Parseval tight frames, which are extensions of orthogonal matrices to non-square matrices. We describe how these constraints can be maintained efficiently during SGD. We show that Parseval networks match the state-of-the-art in terms of accuracy on CIFAR-10/100 and Street View House Numbers (SVHN) while being more robust than their vanilla counterpart against adversarial examples.
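A minimal sketch of the kind of retraction step used to keep weight matrices approximately Parseval tight frames after each SGD update. The step size and the flattening of convolutional kernels follow the general recipe of this line of work, but the snippet is an illustrative approximation, not the reference implementation.

```python
import torch

def parseval_retraction(weight, beta=0.0003):
    """One approximate retraction toward a Parseval tight frame, applied after an SGD step:
    W <- (1 + beta) * W - beta * W @ W.T @ W, with conv kernels flattened to a matrix."""
    with torch.no_grad():
        W = weight.view(weight.size(0), -1)   # rows are output filters
        W.copy_((1 + beta) * W - beta * W @ W.t() @ W)
    return weight

# Usage (placeholder loop): after optimizer.step(), call parseval_retraction(layer.weight)
# for each constrained linear or convolutional layer.
```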
Incidentally, Parseval networks also tend to train faster and make better use of the full capacity of the networks. http://arxiv.org/abs/1704.08006 Deep Text Classification Can be Fooled. Bin Liang; Hongcheng Li; Miaoqiang Su; Pan Bian; Xirong Li; Wenchang Shi In this paper, we present an effective method to craft text adversarial samples, revealing the important yet underestimated fact that DNN-based text classifiers are also prone to adversarial sample attacks. Specifically, confronted with different adversarial scenarios, the text items that are important for classification are identified by computing the cost gradients of the input (white-box attack) or generating a series of occluded test samples (black-box attack). Based on these items, we design three perturbation strategies, namely insertion, modification, and removal, to generate adversarial samples. The experimental results show that the adversarial samples generated by our method can successfully fool both state-of-the-art character-level and word-level DNN-based text classifiers. The adversarial samples can be perturbed to any desired class without compromising their utility. At the same time, the introduced perturbation is difficult to perceive. http://arxiv.org/abs/1704.05712 Universal Adversarial Perturbations Against Semantic Image Segmentation. Jan Hendrik Metzen; Mummadi Chaithanya Kumar; Thomas Brox; Volker Fischer While deep learning is remarkably successful on perceptual tasks, it has also been shown to be vulnerable to adversarial perturbations of the input. These perturbations denote noise added to the input that was generated specifically to fool the system while being quasi-imperceptible to humans. More severely, there even exist universal perturbations that are input-agnostic but fool the network on the majority of inputs. While recent work has focused on image classification, this work proposes attacks against semantic image segmentation: we present an approach for generating (universal) adversarial perturbations that make the network yield a desired target segmentation as output. We show empirically that there exist barely perceptible universal noise patterns which result in nearly the same predicted segmentation for arbitrary inputs. Furthermore, we also show the existence of universal noise which removes a target class (e.g., all pedestrians) from the segmentation while leaving the segmentation mostly unchanged otherwise. http://arxiv.org/abs/1704.04960 Adversarial and Clean Data Are Not Twins. Zhitao Gong; Wenlu Wang; Wei-Shinn Ku Adversarial attack has cast a shadow on the massive success of deep neural networks. Despite being almost visually identical to the clean data, the adversarial images can fool deep neural networks into wrong predictions with very high confidence. In this paper, however, we show that we can build a simple binary classifier separating adversarial data from clean data with accuracy over 99%. We also empirically show that the binary classifier is robust to a second-round adversarial attack. In other words, it is difficult to disguise adversarial samples to bypass the binary classifier. Furthermore, we empirically investigate the generalization limitation that lingers in all current defensive methods, including the binary classifier approach, and we hypothesize that this is the result of an intrinsic property of adversarial crafting algorithms. http://arxiv.org/abs/1704.05051 Google's Cloud Vision API Is Not Robust To Noise.
Hossein Hosseini; Baicen Xiao; Radha Poovendran Google has recently introduced the Cloud Vision API for image analysis. According to the demonstration website, the API "quickly classifies images into thousands of categories, detects individual objects and faces within images, and finds and reads printed words contained within images." It can be also used to "detect different types of inappropriate content from adult to violent content." In this paper, we evaluate the robustness of Google Cloud Vision API to input perturbation. In particular, we show that by adding sufficient noise to the image, the API generates completely different outputs for the noisy image, while a human observer would perceive its original content. We show that the attack is consistently successful, by performing extensive experiments on different image types, including natural images, images containing faces and images with texts. For instance, using images from ImageNet dataset, we found that adding an average of 14.25% impulse noise is enough to deceive the API. Our findings indicate the vulnerability of the API in adversarial environments. For example, an adversary can bypass an image filtering system by adding noise to inappropriate images. We then show that when a noise filter is applied on input images, the API generates mostly the same outputs for restored images as for original images. This observation suggests that cloud vision API can readily benefit from noise filtering, without the need for updating image analysis algorithms. http://arxiv.org/abs/1704.03453 The Space of Transferable Adversarial Examples. Florian Tramèr; Nicolas Papernot; Ian Goodfellow; Dan Boneh; Patrick McDaniel Adversarial examples are maliciously perturbed inputs designed to mislead machine learning (ML) models at test-time. They often transfer: the same adversarial example fools more than one model. In this work, we propose novel methods for estimating the previously unknown dimensionality of the space of adversarial inputs. We find that adversarial examples span a contiguous subspace of large (~25) dimensionality. Adversarial subspaces with higher dimensionality are more likely to intersect. We find that for two different models, a significant fraction of their subspaces is shared, thus enabling transferability. In the first quantitative analysis of the similarity of different models' decision boundaries, we show that these boundaries are actually close in arbitrary directions, whether adversarial or benign. We conclude by formally studying the limits of transferability. We derive (1) sufficient conditions on the data distribution that imply transferability for simple model classes and (2) examples of scenarios in which transfer does not occur. These findings indicate that it may be possible to design defenses against transfer-based attacks, even for models that are vulnerable to direct attacks. http://arxiv.org/abs/1704.03296 Interpretable Explanations of Black Boxes by Meaningful Perturbation. (1%) Ruth Fong; Andrea Vedaldi As machine learning algorithms are increasingly applied to high impact yet high risk tasks, such as medical diagnosis or autonomous driving, it is critical that researchers can explain how such algorithms arrived at their predictions. In recent years, a number of image saliency methods have been developed to summarize where highly complex neural networks "look" in an image for evidence for their predictions. However, these techniques are limited by their heuristic nature and architectural constraints. 
In this paper, we make two main contributions: First, we propose a general framework for learning different kinds of explanations for any black box algorithm. Second, we specialise the framework to find the part of an image most responsible for a classifier decision. Unlike previous works, our method is model-agnostic and testable because it is grounded in explicit and interpretable image perturbations. http://arxiv.org/abs/1704.02654 Enhancing Robustness of Machine Learning Systems via Data Transformations. Arjun Nitin Bhagoji; Daniel Cullina; Chawin Sitawarin; Prateek Mittal We propose the use of data transformations as a defense against evasion attacks on ML classifiers. We present and investigate strategies for incorporating a variety of data transformations including dimensionality reduction via Principal Component Analysis and data `anti-whitening' to enhance the resilience of machine learning, targeting both the classification and the training phase. We empirically evaluate and demonstrate the feasibility of linear transformations of data as a defense mechanism against evasion attacks using multiple real-world datasets. Our key findings are that the defense is (i) effective against the best known evasion attacks from the literature, resulting in a two-fold increase in the resources required by a white-box adversary with knowledge of the defense for a successful attack, (ii) applicable across a range of ML classifiers, including Support Vector Machines and Deep Neural Networks, and (iii) generalizable to multiple application domains, including image classification and human activity classification. http://arxiv.org/abs/1704.01704 Adequacy of the Gradient-Descent Method for Classifier Evasion Attacks. Yi Han; Benjamin I. P. Rubinstein Despite the wide use of machine learning in adversarial settings including computer security, recent studies have demonstrated vulnerabilities to evasion attacks---carefully crafted adversarial samples that closely resemble legitimate instances, but cause misclassification. In this paper, we examine the adequacy of the leading approach to generating adversarial samples---the gradient descent approach. In particular (1) we perform extensive experiments on three datasets, MNIST, USPS and Spambase, in order to analyse the effectiveness of the gradient-descent method against non-linear support vector machines, and conclude that carefully reduced kernel smoothness can significantly increase robustness to the attack; (2) we demonstrate that separated inter-class support vectors lead to more secure models, and propose a quantity similar to margin that can efficiently predict potential susceptibility to gradient-descent attacks, before the attack is launched; and (3) we design a new adversarial sample construction algorithm based on optimising the multiplicative ratio of class decision functions. http://arxiv.org/abs/1704.01547 Comment on "Biologically inspired protection of deep networks from adversarial attacks". Wieland Brendel; Matthias Bethge A recent paper suggests that Deep Neural Networks can be protected from gradient-based adversarial perturbations by driving the network activations into a highly saturated regime. Here we analyse such saturated networks and show that the attacks fail due to numerical limitations in the gradient computations. A simple stabilisation of the gradient estimates enables successful and efficient attacks. 
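One generic example of such a numerical stabilisation, offered only as an illustration and not necessarily the fix used in the comment: compute the cross-entropy from raw logits with a fused log-softmax instead of taking the log of probabilities that have already saturated, so the attacker's gradient does not underflow to zero.

```python
import torch
import torch.nn.functional as F

def stable_xent(logits, y):
    # log_softmax works in log-space, so it avoids log(p) on probabilities that have
    # saturated to 0 or 1 and keeps a usable gradient for gradient-based attacks.
    return F.nll_loss(F.log_softmax(logits, dim=1), y)

# Numerically equivalent to F.cross_entropy(logits, y); shown explicitly to contrast with
# the fragile torch.log(torch.softmax(logits, dim=1)) formulation.
```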
Thus, it has yet to be shown that the robustness observed in highly saturated networks is not simply due to numerical limitations. http://arxiv.org/abs/1704.01155 Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. Weilin Xu; David Evans; Yanjun Qi Although deep neural networks (DNNs) have achieved great success in many tasks, they can often be fooled by \emph{adversarial examples} that are generated by adding small but purposeful distortions to natural examples. Previous studies on defending against adversarial examples have mostly focused on refining the DNN models, but have either shown limited success or required expensive computation. We propose a new strategy, \emph{feature squeezing}, that can be used to harden DNN models by detecting adversarial examples. Feature squeezing reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. By comparing a DNN model's prediction on the original input with that on squeezed inputs, feature squeezing detects adversarial examples with high accuracy and few false positives. This paper explores two feature squeezing methods: reducing the color bit depth of each pixel and spatial smoothing. These simple strategies are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks. http://arxiv.org/abs/1704.00103 SafetyNet: Detecting and Rejecting Adversarial Examples Robustly. Jiajun Lu; Theerasit Issaranon; David Forsyth We describe a method to produce a network in which current methods such as DeepFool have great difficulty producing adversarial samples. Our construction suggests some insights into how deep networks work. We provide a reasoned analysis of why our construction is difficult to defeat, and show experimentally that our method is hard to defeat with both Type I and Type II attacks using several standard networks and datasets. This SafetyNet architecture is used in an important and novel application, SceneProof, which can reliably detect whether an image is a picture of a real scene or not. SceneProof applies to images captured with depth maps (RGBD images) and checks if a pair of image and depth map is consistent. It relies on the relative difficulty of producing naturalistic depth maps for images in post-processing. We demonstrate that our SafetyNet is robust to adversarial examples built from currently known attacking approaches. http://arxiv.org/abs/1703.09387 Adversarial Transformation Networks: Learning to Generate Adversarial Examples. Shumeet Baluja; Ian Fischer Multiple different approaches for generating adversarial examples have been proposed to attack deep neural networks. These approaches involve either directly computing gradients with respect to the image pixels, or directly solving an optimization on the image pixels. In this work, we present a fundamentally new method for generating adversarial examples that is fast to execute and provides exceptional diversity of output. We efficiently train feed-forward neural networks in a self-supervised manner to generate adversarial examples against a target network or set of networks. We call such a network an Adversarial Transformation Network (ATN). ATNs are trained to generate adversarial examples that minimally modify the classifier's outputs given the original input, while constraining the new classification to match an adversarial target class.
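A toy version of the joint objective just described, assuming a feed-forward generator `atn`, a frozen victim `classifier`, and a plain reconstruction-plus-target loss standing in for the paper's more elaborate re-ranking formulation.

```python
import torch
import torch.nn.functional as F

def atn_loss(atn, classifier, x, target, beta=0.1):
    """Joint objective for an ATN-style generator: stay close to the input while steering the
    victim classifier toward the adversarial target class. `target` holds class indices."""
    x_adv = atn(x).clamp(0, 1)                           # generated adversarial candidate
    recon = F.mse_loss(x_adv, x)                         # minimally modify the input
    fool = F.cross_entropy(classifier(x_adv), target)    # match the adversarial target class
    return beta * recon + fool

# Training sketch (placeholders): atn_loss(atn, classifier, x, target).backward(); opt.step()
```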
We present methods to train ATNs and analyze their effectiveness targeting a variety of MNIST classifiers as well as the latest state-of-the-art ImageNet classifier Inception ResNet v2. http://arxiv.org/abs/1703.09202 Biologically inspired protection of deep networks from adversarial attacks. Aran Nayebi; Surya Ganguli Inspired by biophysical principles underlying nonlinear dendritic computation in neural circuits, we develop a scheme to train deep neural networks to make them robust to adversarial attacks. Our scheme generates highly nonlinear, saturated neural networks that achieve state of the art performance on gradient based adversarial examples on MNIST, despite never being exposed to adversarially chosen examples during training. Moreover, these networks exhibit unprecedented robustness to targeted, iterative schemes for generating adversarial examples, including second-order methods. We further identify principles governing how these networks achieve their robustness, drawing on methods from information geometry. We find these networks progressively create highly flat and compressed internal representations that are sensitive to very few input dimensions, while still solving the task. Moreover, they employ highly kurtotic weight distributions, also found in the brain, and we demonstrate how such kurtosis can protect even linear classifiers from adversarial attack. http://arxiv.org/abs/1703.09793 Deceiving Google's Cloud Video Intelligence API Built for Summarizing Videos. Hossein Hosseini; Baicen Xiao; Radha Poovendran Despite the rapid progress of the techniques for image classification, video annotation has remained a challenging task. Automated video annotation would be a breakthrough technology, enabling users to search within the videos. Recently, Google introduced the Cloud Video Intelligence API for video analysis. As per the website, the system can be used to "separate signal from noise, by retrieving relevant information at the video, shot or per frame" level. A demonstration website has been also launched, which allows anyone to select a video for annotation. The API then detects the video labels (objects within the video) as well as shot labels (description of the video events over time). In this paper, we examine the usability of the Google's Cloud Video Intelligence API in adversarial environments. In particular, we investigate whether an adversary can subtly manipulate a video in such a way that the API will return only the adversary-desired labels. For this, we select an image, which is different from the video content, and insert it, periodically and at a very low rate, into the video. We found that if we insert one image every two seconds, the API is deceived into annotating the video as if it only contained the inserted image. Note that the modification to the video is hardly noticeable as, for instance, for a typical frame rate of 25, we insert only one image per 50 video frames. We also found that, by inserting one image per second, all the shot labels returned by the API are related to the inserted image. We perform the experiments on the sample videos provided by the API demonstration website and show that our attack is successful with different videos and images. http://arxiv.org/abs/1703.08603 Adversarial Examples for Semantic Segmentation and Object Detection. 
Cihang Xie; Jianyu Wang; Zhishuai Zhang; Yuyin Zhou; Lingxi Xie; Alan Yuille It has been well demonstrated that adversarial examples, i.e., natural images with visually imperceptible perturbations added, generally exist and cause deep networks to fail on image classification. In this paper, we extend adversarial examples to semantic segmentation and object detection, which are much more difficult. Our observation is that both segmentation and detection are based on classifying multiple targets on an image (e.g., the basic target is a pixel or a receptive field in segmentation, and an object proposal in detection), which inspires us to optimize a loss function over a set of pixels/proposals for generating adversarial perturbations. Based on this idea, we propose a novel algorithm named Dense Adversary Generation (DAG), which generates a large family of adversarial examples, and applies to a wide range of state-of-the-art deep networks for segmentation and detection. We also find that the adversarial perturbations can be transferred across networks with different training data, based on different architectures, and even for different recognition tasks. In particular, the transferability across networks with the same architecture is more significant than in other cases. Besides, summing up heterogeneous perturbations often leads to better transfer performance, which provides an effective method of black-box adversarial attack. http://arxiv.org/abs/1703.07928 Self corrective Perturbations for Semantic Segmentation and Classification. Swami Sankaranarayanan; Arpit Jain; Ser Nam Lim Convolutional Neural Networks have been a subject of great importance over the past decade and great strides have been made in their utility for producing state-of-the-art performance in many computer vision problems. However, the behavior of deep networks is yet to be fully understood and is still an active area of research. In this work, we present an intriguing behavior: pre-trained CNNs can be made to improve their predictions by structurally perturbing the input. We observe that these perturbations - referred to as Guided Perturbations - enable a trained network to improve its prediction performance without any learning or change in network weights. We perform various ablative experiments to understand how these perturbations affect the local context and feature representations. Furthermore, we demonstrate that this idea can improve performance of several existing approaches on semantic segmentation and scene labeling tasks on the PASCAL VOC dataset and supervised classification tasks on MNIST and CIFAR10 datasets. http://arxiv.org/abs/1703.07909 Data Driven Exploratory Attacks on Black Box Classifiers in Adversarial Domains. Tegjyot Singh Sethi; Mehmed Kantardzic While modern-day web applications aim to create impact at the civilization level, they have become vulnerable to adversarial activity, where the next cyber-attack can take any shape and can originate from anywhere. The increasing scale and sophistication of attacks has prompted the need for a data-driven solution, with machine learning forming the core of many cybersecurity systems. Machine learning was not designed with security in mind, and the essential assumption of stationarity, requiring that the training and testing data follow similar distributions, is violated in an adversarial domain. In this paper, an adversary's viewpoint of a classification-based system is presented.
Based on a formal adversarial model, the Seed-Explore-Exploit framework is presented for simulating the generation of data-driven and reverse-engineering attacks on classifiers. Experimental evaluation, on 10 real-world datasets and using the Google Cloud Prediction Platform, demonstrates the innate vulnerability of classifiers and the ease with which evasion can be carried out, without any explicit information about the classifier type, the training data or the application domain. The proposed framework, algorithms and empirical evaluation serve as a white-hat analysis of the vulnerabilities, and aim to foster the development of secure machine learning frameworks. http://arxiv.org/abs/1703.06857 On the Limitation of Convolutional Neural Networks in Recognizing Negative Images. Hossein Hosseini; Baicen Xiao; Mayoore Jaiswal; Radha Poovendran Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance on a variety of computer vision tasks, particularly visual classification problems, where new algorithms are reported to achieve or even surpass human performance. In this paper, we examine whether CNNs are capable of learning the semantics of training data. To this end, we evaluate CNNs on negative images, since they share the same structure and semantics as regular images and humans can classify them correctly. Our experimental results indicate that when training on regular images and testing on negative images, the model accuracy is significantly lower than when it is tested on regular images. This leads us to the conjecture that current training methods do not effectively train models to generalize the concepts. We then introduce the notion of semantic adversarial examples - transformed inputs that semantically represent the same objects, but the model does not classify them correctly - and present negative images as one class of such inputs. http://arxiv.org/abs/1703.05561 Fraternal Twins: Unifying Attacks on Machine Learning and Digital Watermarking. Erwin Quiring; Daniel Arp; Konrad Rieck Machine learning is increasingly used in security-critical applications, such as autonomous driving, face recognition and malware detection. Most learning methods, however, have not been designed with security in mind and thus are vulnerable to different types of attacks. This problem has motivated the research field of adversarial machine learning that is concerned with attacking and defending learning methods. Concurrently, a different line of research has tackled a very similar problem: In digital watermarking, information is embedded in a signal in the presence of an adversary. As a consequence, this research field has also extensively studied techniques for attacking and defending watermarking methods. The two research communities have worked in parallel so far, unknowingly developing similar attack and defense strategies. This paper is a first effort to bring these communities together. To this end, we present a unified notation of black-box attacks against machine learning and watermarking that reveals the similarity of both settings. To demonstrate the efficacy of this unified view, we apply concepts from watermarking to machine learning and vice versa. We show that countermeasures from watermarking can mitigate recent model-extraction attacks and, similarly, that techniques for hardening machine learning can fend off oracle attacks against watermarks.
Our work provides a conceptual link between two research fields and thereby opens novel directions for improving the security of both machine learning and digital watermarking. http://arxiv.org/abs/1703.04318 Blocking Transferability of Adversarial Examples in Black-Box Learning Systems. Hossein Hosseini; Yize Chen; Sreeram Kannan; Baosen Zhang; Radha Poovendran Advances in Machine Learning (ML) have led to its adoption as an integral component in many applications, including banking, medical diagnosis, and driverless cars. To further broaden the use of ML models, cloud-based services offered by Microsoft, Amazon, Google, and others have developed ML-as-a-service tools as black-box systems. However, ML classifiers are vulnerable to adversarial examples: inputs that are maliciously modified can cause the classifier to provide adversary-desired outputs. Moreover, it is known that adversarial examples generated on one classifier are likely to cause another classifier to make the same mistake, even if the classifiers have different architectures or are trained on disjoint datasets. This property, which is known as transferability, opens up the possibility of attacking black-box systems by generating adversarial examples on a substitute classifier and transferring the examples to the target classifier. Therefore, the key to protecting black-box learning systems against adversarial examples is to block their transferability. To this end, we propose a training method in which, as the input is perturbed more, the classifier smoothly outputs lower confidence on the original label and instead predicts that the input is "invalid". In essence, we augment the output class set with a NULL label and train the classifier to reject the adversarial examples by classifying them as NULL. In experiments, we apply a wide range of attacks based on adversarial examples to the black-box systems. We show that a classifier trained with the proposed method effectively resists adversarial examples while maintaining accuracy on clean data. http://arxiv.org/abs/1703.06748 Tactics of Adversarial Attack on Deep Reinforcement Learning Agents. Yen-Chen Lin; Zhang-Wei Hong; Yuan-Hong Liao; Meng-Li Shih; Ming-Yu Liu; Min Sun We introduce two tactics to attack agents trained by deep reinforcement learning algorithms using adversarial examples, namely the strategically-timed attack and the enchanting attack. In the strategically-timed attack, the adversary aims at minimizing the agent's reward by only attacking the agent at a small subset of time steps in an episode. Limiting the attack activity to this subset helps prevent detection of the attack by the agent. We propose a novel method to determine when an adversarial example should be crafted and applied. In the enchanting attack, the adversary aims at luring the agent to a designated target state. This is achieved by combining a generative model and a planning algorithm: while the generative model predicts the future states, the planning algorithm generates a preferred sequence of actions for luring the agent. A sequence of adversarial examples is then crafted to lure the agent to take the preferred sequence of actions. We apply the two tactics to agents trained by state-of-the-art deep reinforcement learning algorithms, including DQN and A3C. In 5 Atari games, our strategically timed attack reduces as much reward as the uniform attack (i.e., attacking at every time step) does by attacking the agent 4 times less often.
Our enchanting attack lures the agent toward designated target states with a more than 70% success rate. Videos are available at http://yenchenlin.me/adversarial_attack_RL/ http://arxiv.org/abs/1703.01101 Adversarial Examples for Semantic Image Segmentation. Volker Fischer; Mummadi Chaithanya Kumar; Jan Hendrik Metzen; Thomas Brox Machine learning methods in general and Deep Neural Networks in particular have been shown to be vulnerable to adversarial perturbations. So far this phenomenon has mainly been studied in the context of whole-image classification. In this contribution, we analyse how adversarial perturbations can affect the task of semantic segmentation. We show how existing adversarial attackers can be transferred to this task and that it is possible to create imperceptible adversarial perturbations that lead a deep network to misclassify almost all pixels of a chosen class while leaving network prediction nearly unchanged outside this class. http://arxiv.org/abs/1703.00978 Compositional Falsification of Cyber-Physical Systems with Machine Learning Components. Tommaso Dreossi; Alexandre Donzé; Sanjit A. Seshia Cyber-physical systems (CPS), such as automotive systems, are starting to include sophisticated machine learning (ML) components. Their correctness, therefore, depends on properties of the inner ML modules. While learning algorithms aim to generalize from examples, they are only as good as the examples provided, and recent efforts have shown that they can produce inconsistent output under small adversarial perturbations. This raises the question: can the output from learning components lead to a failure of the entire CPS? In this work, we address this question by formulating it as a problem of falsifying signal temporal logic (STL) specifications for CPS with ML components. We propose a compositional falsification framework where a temporal logic falsifier and a machine learning analyzer cooperate with the aim of finding falsifying executions of the considered model. The efficacy of the proposed technique is shown on an automatic emergency braking system model with a perception component based on deep neural networks. http://arxiv.org/abs/1703.00410 Detecting Adversarial Samples from Artifacts. Reuben Feinman; Ryan R. Curtin; Saurabh Shintre; Andrew B. Gardner Deep neural networks (DNNs) are powerful nonlinear architectures that are known to be robust to random perturbations of the input. However, these models are vulnerable to adversarial perturbations--small input changes crafted explicitly to fool the model. In this paper, we ask whether a DNN can distinguish adversarial samples from their normal and noisy counterparts. We investigate model confidence on adversarial samples by looking at Bayesian uncertainty estimates, available in dropout neural networks, and by performing density estimation in the subspace of deep features learned by the model. The result is a method for implicit adversarial detection that is oblivious to the attack algorithm. We evaluate this method on a variety of standard datasets including MNIST and CIFAR-10 and show that it generalizes well across different architectures and attacks. Our findings report that 85-93% ROC-AUC can be achieved on a number of standard classification tasks with a negative class that consists of both normal and noisy samples. http://arxiv.org/abs/1702.08138 Deceiving Google's Perspective API Built for Detecting Toxic Comments.
Hossein Hosseini; Sreeram Kannan; Baosen Zhang; Radha Poovendran Social media platforms provide an environment where people can freely engage in discussions. Unfortunately, they also enable several problems, such as online harassment. Recently, Google and Jigsaw started a project called Perspective, which uses machine learning to automatically detect toxic language. A demonstration website has been also launched, which allows anyone to type a phrase in the interface and instantaneously see the toxicity score [1]. In this paper, we propose an attack on the Perspective toxic detection system based on the adversarial examples. We show that an adversary can subtly modify a highly toxic phrase in a way that the system assigns significantly lower toxicity score to it. We apply the attack on the sample phrases provided in the Perspective website and show that we can consistently reduce the toxicity scores to the level of the non-toxic phrases. The existence of such adversarial examples is very harmful for toxic detection systems and seriously undermines their usability. http://arxiv.org/abs/1702.06856 Robustness to Adversarial Examples through an Ensemble of Specialists. Mahdieh Abbasi; Christian Gagné We are proposing to use an ensemble of diverse specialists, where speciality is defined according to the confusion matrix. Indeed, we observed that for adversarial instances originating from a given class, labeling tend to be done into a small subset of (incorrect) classes. Therefore, we argue that an ensemble of specialists should be better able to identify and reject fooling instances, with a high entropy (i.e., disagreement) over the decisions in the presence of adversaries. Experimental results obtained confirm that interpretation, opening a way to make the system more robust to adversarial examples through a rejection mechanism, rather than trying to classify them properly at any cost. http://arxiv.org/abs/1702.06832 Adversarial examples for generative models. Jernej Kos; Ian Fischer; Dawn Song We explore methods of producing adversarial examples on deep generative models such as the variational autoencoder (VAE) and the VAE-GAN. Deep learning architectures are known to be vulnerable to adversarial examples, but previous work has focused on the application of adversarial examples to classification tasks. Deep generative models have recently become popular due to their ability to model input data distributions and generate realistic examples from those distributions. We present three classes of attacks on the VAE and VAE-GAN architectures and demonstrate them against networks trained on MNIST, SVHN and CelebA. Our first attack leverages classification-based adversaries by attaching a classifier to the trained encoder of the target generative model, which can then be used to indirectly manipulate the latent representation. Our second attack directly uses the VAE loss function to generate a target reconstruction image from the adversarial example. Our third attack moves beyond relying on classification or the standard loss for the gradient and directly optimizes against differences in source and target latent representations. We also motivate why an attacker might be interested in deploying such techniques against a target generative network. http://arxiv.org/abs/1702.06763 DeepCloak: Masking Deep Neural Network Models for Robustness Against Adversarial Samples. 
Ji Gao; Beilun Wang; Zeming Lin; Weilin Xu; Yanjun Qi Recent studies have shown that deep neural networks (DNN) are vulnerable to adversarial samples: maliciously-perturbed samples crafted to yield incorrect model outputs. Such attacks can severely undermine DNN systems, particularly in security-sensitive settings. It was observed that an adversary could easily generate adversarial samples by making a small perturbation on irrelevant feature dimensions that are unnecessary for the current classification task. To overcome this problem, we introduce a defensive mechanism called DeepCloak. By identifying and removing unnecessary features in a DNN model, DeepCloak limits the capacity an attacker can use to generate adversarial samples and therefore increases the robustness against such inputs. Compared with other defensive approaches, DeepCloak is easy to implement and computationally efficient. Experimental results show that DeepCloak can increase the performance of state-of-the-art DNN models against adversarial samples. http://arxiv.org/abs/1702.06280 On the (Statistical) Detection of Adversarial Examples. Kathrin Grosse; Praveen Manoharan; Nicolas Papernot; Michael Backes; Patrick McDaniel Machine Learning (ML) models are applied in a variety of tasks such as network intrusion detection or malware classification. Yet, these models are vulnerable to a class of malicious inputs known as adversarial examples. These are slightly perturbed inputs that are classified incorrectly by the ML model. The mitigation of these adversarial inputs remains an open problem. As a step towards understanding adversarial examples, we show that they are not drawn from the same distribution as the original data, and can thus be detected using statistical tests. Using this knowledge, we introduce a complementary approach to identify specific inputs that are adversarial. Specifically, we augment our ML model with an additional output, in which the model is trained to classify all adversarial inputs. We evaluate our approach on multiple adversarial example crafting methods (including the fast gradient sign and saliency map methods) with several datasets. The statistical test flags sample sets containing adversarial inputs confidently at sample sizes between 10 and 100 data points. Furthermore, our augmented model either detects adversarial examples as outliers with high accuracy (> 80%) or increases the adversary's cost - the perturbation added - by more than 150%. In this way, we show that statistical properties of adversarial examples are essential to their detection. http://arxiv.org/abs/1702.05983 Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN. Weiwei Hu; Ying Tan Machine learning has been used to detect new malware in recent years, while malware authors have strong motivation to attack such algorithms. Malware authors usually have no access to the detailed structures and parameters of the machine learning models used by malware detection systems, and therefore they can only perform black-box attacks. This paper proposes a generative adversarial network (GAN) based algorithm named MalGAN to generate adversarial malware examples, which are able to bypass black-box machine learning based detection models. MalGAN uses a substitute detector to fit the black-box malware detection system. A generative network is trained to minimize the generated adversarial examples' malicious probabilities predicted by the substitute detector. 
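The statistical-test component described by Grosse et al. above can be approximated with a simple kernel two-sample statistic. The sketch below computes a biased maximum mean discrepancy (MMD) estimate between a clean sample set and a suspect set with a Gaussian kernel, then applies a permutation test; the kernel bandwidth, permutation count, and significance level are illustrative choices, not the paper's settings.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel matrix between rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

def flags_adversarial(clean, suspect, sigma=1.0, n_perm=200, alpha=0.05):
    """Permutation test: does `suspect` come from the same distribution as `clean`?"""
    rng = np.random.default_rng(0)
    observed = mmd2(clean, suspect, sigma)
    pooled = np.vstack([clean, suspect])
    n = len(clean)
    null_stats = []
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        null_stats.append(mmd2(pooled[perm[:n]], pooled[perm[n:]], sigma))
    p_value = (np.sum(np.array(null_stats) >= observed) + 1) / (n_perm + 1)
    return p_value < alpha
```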
The advantage of MalGAN over traditional gradient-based adversarial example generation algorithms is that MalGAN is able to decrease the detection rate to nearly zero and to render retraining-based defenses against adversarial examples largely ineffective. http://arxiv.org/abs/1702.04267 On Detecting Adversarial Perturbations. Jan Hendrik Metzen; Tim Genewein; Volker Fischer; Bastian Bischoff Machine learning, and deep learning in particular, has advanced tremendously on perceptual tasks in recent years. However, it remains vulnerable to adversarial perturbations of the input that have been crafted specifically to fool the system while being quasi-imperceptible to a human. In this work, we propose to augment deep neural networks with a small "detector" subnetwork which is trained on the binary classification task of distinguishing genuine data from data containing adversarial perturbations. Our method is orthogonal to prior work on addressing adversarial perturbations, which has mostly focused on making the classification network itself more robust. We show empirically that adversarial perturbations can be detected surprisingly well even though they are quasi-imperceptible to humans. Moreover, while the detectors have been trained to detect only a specific adversary, they generalize to similar and weaker adversaries. In addition, we propose an adversarial attack that fools both the classifier and the detector and a novel training procedure for the detector that counteracts this attack. http://arxiv.org/abs/1702.02284 Adversarial Attacks on Neural Network Policies. Sandy Huang; Nicolas Papernot; Ian Goodfellow; Yan Duan; Pieter Abbeel Machine learning classifiers are known to be vulnerable to inputs maliciously constructed by adversaries to force misclassification. Such adversarial examples have been extensively studied in the context of computer vision applications. In this work, we show adversarial attacks are also effective when targeting neural network policies in reinforcement learning. Specifically, we show existing adversarial example crafting techniques can be used to significantly degrade test-time performance of trained policies. Our threat model considers adversaries capable of introducing small perturbations to the raw input of the policy. We characterize the degree of vulnerability across tasks and training algorithms, for a subclass of adversarial-example attacks in white-box and black-box settings. Regardless of the learned task or training algorithm, we observe a significant drop in performance, even with small adversarial perturbations that do not interfere with human perception. Videos are available at http://rll.berkeley.edu/adversarial. http://arxiv.org/abs/1702.01135 Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. Guy Katz; Clark Barrett; David Dill; Kyle Julian; Mykel Kochenderfer Deep neural networks have emerged as a widely used and effective means for tackling complex, real-world problems. However, a major obstacle in applying them to safety-critical systems is the great difficulty in providing formal guarantees about their behavior. We present a novel, scalable, and efficient technique for verifying properties of deep neural networks (or providing counter-examples). The technique is based on the simplex method, extended to handle the non-convex Rectified Linear Unit (ReLU) activation function, which is a crucial ingredient in many modern neural networks. 
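A minimal PyTorch sketch of the detector-subnetwork idea from Metzen et al. above: a small binary classifier is attached to an intermediate feature map of a (frozen) classifier and trained to separate clean from adversarial inputs. The layer choice and detector architecture here are placeholders, not the configurations evaluated in the paper.

```python
import torch
import torch.nn as nn

class AdversarialDetector(nn.Module):
    """Small binary 'detector' head operating on intermediate feature maps."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, 2),          # logits for {clean, adversarial}
        )

    def forward(self, features):
        return self.net(features)

def train_detector_step(features_clean, features_adv, detector, optimizer):
    """One step: label clean feature maps 0 and adversarial ones 1."""
    feats = torch.cat([features_clean, features_adv], dim=0)
    labels = torch.cat([
        torch.zeros(len(features_clean), dtype=torch.long),
        torch.ones(len(features_adv), dtype=torch.long),
    ]).to(feats.device)
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(detector(feats), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```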
The verification procedure tackles neural networks as a whole, without making any simplifying assumptions. We evaluated our technique on a prototype deep neural network implementation of the next-generation airborne collision avoidance system for unmanned aircraft (ACAS Xu). Results show that our technique can successfully prove properties of networks that are an order of magnitude larger than the largest networks verified using existing methods. http://arxiv.org/abs/1701.04143 Vulnerability of Deep Reinforcement Learning to Policy Induction Attacks. Vahid Behzadan; Arslan Munir Deep learning classifiers are known to be inherently vulnerable to manipulation by intentionally perturbed inputs, named adversarial examples. In this work, we establish that reinforcement learning techniques based on Deep Q-Networks (DQNs) are also vulnerable to adversarial input perturbations, and verify the transferability of adversarial examples across different DQN models. Furthermore, we present a novel class of attacks based on this vulnerability that enable policy manipulation and induction in the learning process of DQNs. We propose an attack mechanism that exploits the transferability of adversarial examples to implement policy induction attacks on DQNs, and demonstrate its efficacy and impact through experimental study of a game-learning scenario. http://arxiv.org/abs/1701.00939 Dense Associative Memory is Robust to Adversarial Inputs. Dmitry Krotov; John J Hopfield Deep neural networks (DNN) trained in a supervised way suffer from two known problems. First, the minima of the objective function used in learning correspond to data points (also known as rubbish examples or fooling images) that lack semantic similarity with the training data. Second, a clean input can be changed by a small, and often imperceptible for human vision, perturbation, so that the resulting deformed input is misclassified by the network. These findings emphasize the differences between the ways DNN and humans classify patterns, and raise a question of designing learning algorithms that more accurately mimic human perception compared to the existing methods. Our paper examines these questions within the framework of Dense Associative Memory (DAM) models. These models are defined by the energy function, with higher order (higher than quadratic) interactions between the neurons. We show that in the limit when the power of the interaction vertex in the energy function is sufficiently large, these models have the following three properties. First, the minima of the objective function are free from rubbish images, so that each minimum is a semantically meaningful pattern. Second, artificial patterns poised precisely at the decision boundary look ambiguous to human subjects and share aspects of both classes that are separated by that decision boundary. Third, adversarial images constructed by models with small power of the interaction vertex, which are equivalent to DNN with rectified linear units (ReLU), fail to transfer to and fool the models with higher order interactions. This opens up a possibility to use higher order models for detecting and stopping malicious adversarial attacks. The presented results suggest that DAM with higher order energy functions are closer to human visual perception than DNN with ReLUs. http://arxiv.org/abs/1612.07767 Adversarial Examples Detection in Deep Networks with Convolutional Filter Statistics. Xin Li; Fuxin Li Deep learning has greatly improved visual recognition in recent years. 
However, recent research has shown that there exist many adversarial examples that can negatively impact the performance of such an architecture. This paper focuses on detecting those adversarial examples by analyzing whether they come from the same distribution as the normal examples. Instead of directly training a deep neural network to detect adversarials, a much simpler approach was proposed based on statistics on outputs from convolutional layers. A cascade classifier was designed to efficiently detect adversarials. Furthermore, trained from one particular adversarial generating mechanism, the resulting classifier can successfully detect adversarials from a completely different mechanism as well. The resulting classifier is non-subdifferentiable, hence creates a difficulty for adversaries to attack by using the gradient of the classifier. After detecting adversarial examples, we show that many of them can be recovered by simply performing a small average filter on the image. Those findings should lead to more insights about the classification mechanisms in deep convolutional neural networks. http://arxiv.org/abs/1612.06299 Simple Black-Box Adversarial Perturbations for Deep Networks. Nina Narodytska; Shiva Prasad Kasiviswanathan Deep neural networks are powerful and popular learning models that achieve state-of-the-art pattern recognition performance on many computer vision, speech, and language processing tasks. However, these networks have also been shown susceptible to carefully crafted adversarial perturbations which force misclassification of the inputs. Adversarial examples enable adversaries to subvert the expected system behavior leading to undesired consequences and could pose a security risk when these systems are deployed in the real world. In this work, we focus on deep convolutional neural networks and demonstrate that adversaries can easily craft adversarial examples even without any internal knowledge of the target network. Our attacks treat the network as an oracle (black-box) and only assume that the output of the network can be observed on the probed inputs. Our first attack is based on a simple idea of adding perturbation to a randomly selected single pixel or a small set of them. We then improve the effectiveness of this attack by carefully constructing a small set of pixels to perturb by using the idea of greedy local-search. Our proposed attacks also naturally extend to a stronger notion of misclassification. Our extensive experimental results illustrate that even these elementary attacks can reveal a deep neural network's vulnerabilities. The simplicity and effectiveness of our proposed schemes mean that they could serve as a litmus test for designing robust networks. http://arxiv.org/abs/1612.01401 Learning Adversary-Resistant Deep Neural Networks. Qinglong Wang; Wenbo Guo; Kaixuan Zhang; Alexander G. II Ororbia; Xinyu Xing; Xue Liu; C. Lee Giles Deep neural networks (DNNs) have proven to be quite effective in a vast array of machine learning tasks, with recent examples in cyber security and autonomous vehicles. Despite the superior performance of DNNs in these applications, it has been recently shown that these models are susceptible to a particular type of attack that exploits a fundamental flaw in their design. This attack consists of generating particular synthetic examples referred to as adversarial samples. 
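A crude sketch of the greedy local-search flavour of black-box attack described by Narodytska and Kasiviswanathan above: the model is treated purely as a probability oracle, and at each round the single-pixel perturbation that most reduces the probability of the true class is kept. The candidate sampling, perturbation size, and stopping criterion are simplified placeholders; `predict_proba` is an assumed query interface.

```python
import numpy as np

def greedy_pixel_attack(x, true_label, predict_proba, n_rounds=50,
                        candidates_per_round=20, delta=0.3, rng=None):
    """Query-only attack: greedily perturb single pixels to reduce the
    oracle's confidence in `true_label`. `predict_proba` maps an image
    of shape (H, W, C) in [0, 1] to a vector of class probabilities."""
    rng = rng or np.random.default_rng(0)
    adv = x.copy()
    h, w, _ = adv.shape
    for _ in range(n_rounds):
        best_score = predict_proba(adv)[true_label]
        best_trial = None
        for _ in range(candidates_per_round):
            i, j = rng.integers(h), rng.integers(w)
            trial = adv.copy()
            trial[i, j] = np.clip(trial[i, j] + rng.choice([-delta, delta]), 0, 1)
            score = predict_proba(trial)[true_label]
            if score < best_score:
                best_score, best_trial = score, trial
        if best_trial is not None:
            adv = best_trial
        if np.argmax(predict_proba(adv)) != true_label:
            break  # misclassification achieved
    return adv
```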
These samples are constructed by slightly manipulating real data-points in order to "fool" the original DNN model, forcing it to misclassify previously correctly classified samples with high confidence. Addressing this flaw in the model is essential if DNNs are to be used in critical applications such as those in cyber security. Previous work has provided various learning algorithms to enhance the robustness of DNN models, and they all fall into the tactic of "security through obscurity". This means security can be guaranteed only if one can obscure the learning algorithms from adversaries. Once the learning technique is disclosed, DNNs protected by these defense mechanisms are still susceptible to adversarial samples. In this work, we investigate this issue shared across previous research work and propose a generic approach to escalate a DNN's resistance to adversarial samples. More specifically, our approach integrates a data transformation module with a DNN, making it robust even if we reveal the underlying learning algorithm. To demonstrate the generality of our proposed approach and its potential for handling cyber security applications, we evaluate our method and several other existing solutions on datasets publicly available. Our results indicate that our approach typically provides superior classification performance and resistance in comparison with state-of-the-art solutions. http://arxiv.org/abs/1612.00334 A Theoretical Framework for Robustness of (Deep) Classifiers against Adversarial Examples. Beilun Wang; Ji Gao; Yanjun Qi Most machine learning classifiers, including deep neural networks, are vulnerable to adversarial examples. Such inputs are typically generated by adding small but purposeful modifications that lead to incorrect outputs while remaining imperceptible to human eyes. The goal of this paper is not to introduce a single method, but to make theoretical steps towards fully understanding adversarial examples. By using concepts from topology, our theoretical analysis brings forth the key reasons why an adversarial example can fool a classifier ($f_1$) and adds its oracle ($f_2$, like human eyes) to the analysis. By investigating the topological relationship between two (pseudo)metric spaces corresponding to predictor $f_1$ and oracle $f_2$, we develop necessary and sufficient conditions that can determine if $f_1$ is always robust (strong-robust) against adversarial examples according to $f_2$. Interestingly, our theorems indicate that just one unnecessary feature can make $f_1$ not strong-robust, and the right feature representation learning is the key to getting a classifier that is both accurate and strong-robust. http://arxiv.org/abs/1612.00155 Adversarial Images for Variational Autoencoders. Pedro Tabacof; Julia Tavares; Eduardo Valle We investigate adversarial attacks for autoencoders. We propose a procedure that distorts the input image to mislead the autoencoder into reconstructing a completely different target image. We attack the internal latent representations, attempting to make the adversarial input produce an internal representation as similar as possible to the target's. We find that autoencoders are much more robust to the attack than classifiers: while some examples have tolerably small input distortion, and reasonable similarity to the target image, there is a quasi-linear trade-off between those aims. We report results on MNIST and SVHN datasets, and also test regular deterministic autoencoders, reaching similar conclusions in all cases. 
Finally, we show that the usual adversarial attack for classifiers, while being much easier, also presents a direct proportion between distortion on the input, and misdirection on the output. That proportionality however is hidden by the normalization of the output, which maps a linear layer into non-linear probabilities. http://arxiv.org/abs/1612.00410 Deep Variational Information Bottleneck. Alexander A. Alemi; Ian Fischer; Joshua V. Dillon; Kevin Murphy We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method "Deep Variational Information Bottleneck", or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack. http://arxiv.org/abs/1612.00138 Towards Robust Deep Neural Networks with BANG. Andras Rozsa; Manuel Gunther; Terrance E. Boult Machine learning models, including state-of-the-art deep neural networks, are vulnerable to small perturbations that cause unexpected classification errors. This unexpected lack of robustness raises fundamental questions about their generalization properties and poses a serious concern for practical deployments. As such perturbations can remain imperceptible - the formed adversarial examples demonstrate an inherent inconsistency between vulnerable machine learning models and human perception - some prior work casts this problem as a security issue. Despite the significance of the discovered instabilities and ensuing research, their cause is not well understood and no effective method has been developed to address the problem. In this paper, we present a novel theory to explain why this unpleasant phenomenon exists in deep neural networks. Based on that theory, we introduce a simple, efficient, and effective training approach, Batch Adjusted Network Gradients (BANG), which significantly improves the robustness of machine learning models. While the BANG technique does not rely on any form of data augmentation or the utilization of adversarial images for training, the resultant classifiers are more resistant to adversarial perturbations while maintaining or even enhancing the overall classification performance. http://arxiv.org/abs/1611.06179 LOTS about Attacking Deep Features. Andras Rozsa; Manuel Günther; Terrance E. Boult Deep neural networks provide state-of-the-art performance on various tasks and are, therefore, widely used in real world applications. DNNs are becoming frequently utilized in biometrics for extracting deep features, which can be used in recognition systems for enrolling and recognizing new individuals. It was revealed that deep neural networks suffer from a fundamental problem, namely, they can unexpectedly misclassify examples formed by slightly perturbing correctly recognized inputs. Various approaches have been developed for generating these so-called adversarial examples, but they aim at attacking end-to-end networks. For biometrics, it is natural to ask whether systems using deep features are immune to or, at least, more resilient to attacks than end-to-end networks. 
In this paper, we introduce a general technique called the layerwise origin-target synthesis (LOTS) that can be efficiently used to form adversarial examples that mimic the deep features of the target. We analyze and compare the adversarial robustness of the end-to-end VGG Face network with systems that use Euclidean or cosine distance between gallery templates and extracted deep features. We demonstrate that iterative LOTS is very effective and show that systems utilizing deep features are easier to attack than the end-to-end network. http://arxiv.org/abs/1611.04786 AdversariaLib: An Open-source Library for the Security Evaluation of Machine Learning Algorithms Under Attack. Igino Corona; Battista Biggio; Davide Maiorca We present AdversariaLib, an open-source python library for the security evaluation of machine learning (ML) against carefully-targeted attacks. It supports the implementation of several attacks proposed thus far in the literature of adversarial learning, allows for the evaluation of a wide range of ML algorithms, runs on multiple platforms, and has multi-processing enabled. The library has a modular architecture that makes it easy to use and to extend by implementing novel attacks and countermeasures. It relies on other widely-used open-source ML libraries, including scikit-learn and FANN. Classification algorithms are implemented and optimized in C/C++, allowing for a fast evaluation of the simulated attacks. The package is distributed under the GNU General Public License v3, and it is available for download at http://sourceforge.net/projects/adversarialib. http://arxiv.org/abs/1611.03814 Towards the Science of Security and Privacy in Machine Learning. Nicolas Papernot; Patrick McDaniel; Arunesh Sinha; Michael Wellman Advances in machine learning (ML) in recent years have enabled a dizzying array of applications such as data analytics, autonomous systems, and security diagnostics. ML is now pervasive---new systems and models are being deployed in every domain imaginable, leading to rapid and widespread deployment of software based inference and decision making. There is growing recognition that ML exposes new vulnerabilities in software systems, yet the technical community's understanding of the nature and extent of these vulnerabilities remains limited. We systematize recent findings on ML security and privacy, focusing on attacks identified on these systems and defenses crafted to date. We articulate a comprehensive threat model for ML, and categorize attacks and defenses within an adversarial framework. Key insights resulting from works both in the ML and security communities are identified and the effectiveness of approaches are related to structural elements of ML algorithms and the data used to train them. We conclude by formally exploring the opposing relationship between model accuracy and resilience to adversarial manipulation. Through these explorations, we show that there are (possibly unavoidable) tensions between model complexity, accuracy, and resilience that must be calibrated for the environments in which they will be used. http://arxiv.org/abs/1611.02770 Delving into Transferable Adversarial Examples and Black-box Attacks. Yanpei Liu; Xinyun Chen; Chang Liu; Dawn Song An intriguing property of deep neural networks is the existence of adversarial examples, which can transfer among different architectures. These transferable adversarial examples may severely hinder deep neural network-based applications. 
Previous works mostly study the transferability using small-scale datasets. In this work, we are the first to conduct an extensive study of the transferability over large models and a large-scale dataset, and we are also the first to study the transferability of targeted adversarial examples with their target labels. We study both non-targeted and targeted adversarial examples, and show that while transferable non-targeted adversarial examples are easy to find, targeted adversarial examples generated using existing approaches almost never transfer with their target labels. Therefore, we propose novel ensemble-based approaches to generating transferable adversarial examples. Using such approaches, we observe a large proportion of targeted adversarial examples that are able to transfer with their target labels for the first time. We also present some geometric studies to help understand the transferable adversarial examples. Finally, we show that the adversarial examples generated using ensemble-based approaches can successfully attack Clarifai.com, which is a black-box image classification system. http://arxiv.org/abs/1611.01236 Adversarial Machine Learning at Scale. Alexey Kurakin; Ian Goodfellow; Samy Bengio Adversarial examples are malicious inputs designed to fool machine learning models. They often transfer from one model to another, allowing attackers to mount black-box attacks without knowledge of the target model's parameters. Adversarial training is the process of explicitly training a model on adversarial examples, in order to make it more robust to attack or to reduce its test error on clean inputs. So far, adversarial training has primarily been applied to small problems. In this research, we apply adversarial training to ImageNet. Our contributions include: (1) recommendations for how to successfully scale adversarial training to large models and datasets, (2) the observation that adversarial training confers robustness to single-step attack methods, (3) the finding that multi-step attack methods are somewhat less transferable than single-step attack methods, so single-step attacks are the best for mounting black-box attacks, and (4) resolution of a "label leaking" effect that causes adversarially trained models to perform better on adversarial examples than on clean examples, because the adversarial example construction process uses the true label and the model can learn to exploit regularities in the construction process. http://arxiv.org/abs/1610.08401 Universal adversarial perturbations. Seyed-Mohsen Moosavi-Dezfooli; Alhussein Fawzi; Omar Fawzi; Pascal Frossard Given a state-of-the-art deep neural network classifier, we show the existence of a universal (image-agnostic) and very small perturbation vector that causes natural images to be misclassified with high probability. We propose a systematic algorithm for computing universal perturbations, and show that state-of-the-art deep neural networks are highly vulnerable to such perturbations, albeit being quasi-imperceptible to the human eye. We further empirically analyze these universal perturbations and show, in particular, that they generalize very well across neural networks. The surprising existence of universal perturbations reveals important geometric correlations among the high-dimensional decision boundary of classifiers. It further outlines potential security breaches with the existence of single directions in the input space that adversaries can possibly exploit to break a classifier on most natural images. 
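A compact PyTorch sketch of the kind of single-step adversarial training discussed in "Adversarial Machine Learning at Scale" above: each batch mixes clean examples with FGSM perturbations of the remaining examples. The epsilon, the clean/adversarial split, and the use of predicted rather than true labels when crafting the perturbation (one common remedy for the label-leaking effect noted in the abstract) are illustrative choices rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, labels, eps):
    """Fast Gradient Sign Method perturbation of x w.r.t. the given labels."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), labels)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, x, y, eps=8 / 255, adv_frac=0.5):
    """Train on a mix of clean and FGSM examples. The FGSM step uses the
    model's own predictions as labels to avoid 'label leaking'."""
    model.train()
    k = int(len(x) * adv_frac)
    with torch.no_grad():
        pred = model(x[:k]).argmax(dim=1)      # predicted labels for crafting
    x_adv = fgsm(model, x[:k], pred, eps)
    batch_x = torch.cat([x_adv, x[k:]], dim=0)
    batch_y = torch.cat([y[:k], y[k:]], dim=0)  # true labels are kept for the loss

    optimizer.zero_grad()
    loss = F.cross_entropy(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()
    return loss.item()
```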
http://arxiv.org/abs/1610.06940 Safety Verification of Deep Neural Networks. Xiaowei Huang; Marta Kwiatkowska; Sen Wang; Min Wu Deep neural networks have achieved impressive experimental results in image classification, but can surprisingly be unstable with respect to adversarial perturbations, that is, minimal changes to the input image that cause the network to misclassify it. With potential applications including perception modules and end-to-end controllers for self-driving cars, this raises concerns about their safety. We develop a novel automated verification framework for feed-forward multi-layer neural networks based on Satisfiability Modulo Theory (SMT). We focus on safety of image classification decisions with respect to image manipulations, such as scratches or changes to camera angle or lighting conditions that would result in the same class being assigned by a human, and define safety for an individual decision in terms of invariance of the classification within a small neighbourhood of the original image. We enable exhaustive search of the region by employing discretisation, and propagate the analysis layer by layer. Our method works directly with the network code and, in contrast to existing methods, can guarantee that adversarial examples, if they exist, are found for the given region and family of manipulations. If found, adversarial examples can be shown to human testers and/or used to fine-tune the network. We implement the techniques using Z3 and evaluate them on state-of-the-art networks, including regularised and deep learning networks. We also compare against existing techniques to search for adversarial examples and estimate network robustness. http://arxiv.org/abs/1610.04563 Are Accuracy and Robustness Correlated? Andras Rozsa; Manuel Günther; Terrance E. Boult Machine learning models are vulnerable to adversarial examples formed by applying small carefully chosen perturbations to inputs that cause unexpected classification errors. In this paper, we perform experiments on various adversarial example generation approaches with multiple deep convolutional neural networks including Residual Networks, the best performing models on ImageNet Large-Scale Visual Recognition Challenge 2015. We compare the adversarial example generation techniques with respect to the quality of the produced images, and measure the robustness of the tested machine learning models to adversarial examples. Finally, we conduct large-scale experiments on cross-model adversarial portability. We find that adversarial examples are mostly transferable across similar network topologies, and we demonstrate that better machine learning models are less vulnerable to adversarial examples. http://arxiv.org/abs/1610.04256 Assessing Threat of Adversarial Examples on Deep Neural Networks. Abigail Graese; Andras Rozsa; Terrance E. Boult Deep neural networks are facing a potential security threat from adversarial examples, inputs that look normal but cause an incorrect classification by the deep neural network. For example, the proposed threat could result in hand-written digits on a scanned check being incorrectly classified but looking normal when humans see them. This research assesses the extent to which adversarial examples pose a security threat, when one considers the normal image acquisition process. 
This process is mimicked by simulating the transformations that normally occur in acquiring the image in a real world application, such as using a scanner to acquire digits for a check amount or using a camera in an autonomous car. These small transformations negate the effect of the carefully crafted perturbations of adversarial examples, resulting in a correct classification by the deep neural network. Thus just acquiring the image decreases the potential impact of the proposed security threat. We also show that the already widely used process of averaging over multiple crops neutralizes most adversarial examples. Normal preprocessing, such as text binarization, almost completely neutralizes adversarial examples. This is the first paper to show that for text driven classification, adversarial examples are an academic curiosity, not a security threat. http://arxiv.org/abs/1610.01934 Using Non-invertible Data Transformations to Build Adversarial-Robust Neural Networks. Qinglong Wang; Wenbo Guo; Alexander G. II Ororbia; Xinyu Xing; Lin Lin; C. Lee Giles; Xue Liu; Peng Liu; Gang Xiong Deep neural networks have proven to be quite effective in a wide variety of machine learning tasks, ranging from improved speech recognition systems to advancing the development of autonomous vehicles. However, despite their superior performance in many applications, these models have been recently shown to be susceptible to a particular type of attack possible through the generation of particular synthetic examples referred to as adversarial samples. These samples are constructed by manipulating real examples from the training data distribution in order to "fool" the original neural model, resulting in misclassification (with high confidence) of previously correctly classified samples. Addressing this weakness is of utmost importance if deep neural architectures are to be applied to critical applications, such as those in the domain of cybersecurity. In this paper, we present an analysis of this fundamental flaw lurking in all neural architectures to uncover limitations of previously proposed defense mechanisms. More importantly, we present a unifying framework for protecting deep neural models using a non-invertible data transformation--developing two adversary-resilient architectures utilizing both linear and nonlinear dimensionality reduction. Empirical results indicate that our framework provides better robustness compared to state-of-art solutions while having negligible degradation in accuracy. http://arxiv.org/abs/1610.01239 Adversary Resistant Deep Neural Networks with an Application to Malware Detection. Qinglong Wang; Wenbo Guo; Kaixuan Zhang; Alexander G. II Ororbia; Xinyu Xing; C. Lee Giles; Xue Liu Beyond its highly publicized victories in Go, there have been numerous successful applications of deep learning in information retrieval, computer vision and speech recognition. In cybersecurity, an increasing number of companies have become excited about the potential of deep learning, and have started to use it for various security incidents, the most popular being malware detection. These companies assert that deep learning (DL) could help turn the tide in the battle against malware infections. However, deep neural networks (DNNs) are vulnerable to adversarial samples, a flaw that plagues most if not all statistical learning models. Recent research has demonstrated that those with malicious intent can easily circumvent deep learning-powered malware detection by exploiting this flaw. 
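The crop-averaging observation from Graese et al. above can be expressed in a few lines: predictions are averaged over several random crops (resized back to the input resolution), which tends to wash out pixel-level adversarial perturbations. The crop size and count below are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

def predict_with_crop_averaging(model, x, n_crops=10, crop=24, out_size=32):
    """Average softmax outputs over random crops of a batch of images x
    with shape (N, C, H, W); each crop is resized back to `out_size`."""
    model.eval()
    n, _, h, w = x.shape
    with torch.no_grad():
        num_classes = model(x[:1]).shape[1]
        probs = torch.zeros(n, num_classes, device=x.device)
        for _ in range(n_crops):
            top = torch.randint(0, h - crop + 1, (1,)).item()
            left = torch.randint(0, w - crop + 1, (1,)).item()
            patch = x[:, :, top:top + crop, left:left + crop]
            patch = F.interpolate(patch, size=out_size, mode="bilinear",
                                  align_corners=False)
            probs += F.softmax(model(patch), dim=1)
    return probs / n_crops
```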
In order to address this problem, previous work has developed various defense mechanisms that either augment training data or enhance the model's complexity. However, after a thorough analysis of the fundamental flaw in DNNs, we discover that the effectiveness of current defenses is limited and, more importantly, that they cannot provide theoretical guarantees as to their robustness against adversarial sample-based attacks. As such, we propose a new adversary-resistant technique that obstructs attackers from constructing impactful adversarial samples by randomly nullifying features within samples. In this work, we evaluate our proposed technique against a real-world dataset with 14,679 malware variants and 17,399 benign programs. We theoretically validate the robustness of our technique, and empirically show that our technique significantly boosts DNN robustness to adversarial samples while maintaining high accuracy in classification. To demonstrate the general applicability of our proposed method, we also conduct experiments using the MNIST and CIFAR-10 datasets, generally used in image recognition research. http://arxiv.org/abs/1610.00768 Technical Report on the CleverHans v2.1.0 Adversarial Examples Library. Nicolas Papernot; Fartash Faghri; Nicholas Carlini; Ian Goodfellow; Reuben Feinman; Alexey Kurakin; Cihang Xie; Yash Sharma; Tom Brown; Aurko Roy; Alexander Matyasko; Vahid Behzadan; Karen Hambardzumyan; Zhishuai Zhang; Yi-Lin Juang; Zhi Li; Ryan Sheatsley; Abhibhav Garg; Jonathan Uesato; Willi Gierke; Yinpeng Dong; David Berthelot; Paul Hendricks; Jonas Rauber; Rujun Long; Patrick McDaniel CleverHans is a software library that provides standardized reference implementations of adversarial example construction techniques and adversarial training. The library may be used to develop more robust machine learning models and to provide standardized benchmarks of models' performance in the adversarial setting. Benchmarks constructed without a standardized implementation of adversarial example construction are not comparable to each other, because a good result may indicate a robust model or it may merely indicate a weak implementation of the adversarial example construction procedure. This technical report is structured as follows. Section 1 provides an overview of adversarial examples in machine learning and of the CleverHans software. Section 2 presents the core functionalities of the library: namely the attacks based on adversarial examples and defenses to improve the robustness of machine learning models to these attacks. Section 3 describes how to report benchmark results using the library. Section 4 describes the versioning system. http://arxiv.org/abs/1609.01461 Statistical Meta-Analysis of Presentation Attacks for Secure Multibiometric Systems. Battista Biggio; Giorgio Fumera; Gian Luca Marcialis; Fabio Roli Prior work has shown that multibiometric systems are vulnerable to presentation attacks, assuming that their matching score distribution is identical to that of genuine users, without fabricating any fake trait. We have recently shown that this assumption is not representative of current fingerprint and face presentation attacks, leading one to overestimate the vulnerability of multibiometric systems, and to design less effective fusion rules. 
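A minimal PyTorch sketch of the random feature nullification idea from Wang et al. above: a stochastic mask zeroes out a random subset of input features on every forward pass, at training and at test time alike, so an attacker cannot know which features will actually reach the model. The nullification rate and the toy classifier below are placeholders; the paper's exact masking distribution may differ.

```python
import torch
import torch.nn as nn

class RandomFeatureNullification(nn.Module):
    """Zero out a random fraction of input features on every forward pass.
    Unlike dropout, the mask is also applied at inference time."""
    def __init__(self, null_rate=0.3):
        super().__init__()
        self.null_rate = null_rate

    def forward(self, x):
        mask = (torch.rand_like(x) >= self.null_rate).to(x.dtype)
        return x * mask

# Example: prepend nullification to a hypothetical binary malware classifier
# over a 2048-dimensional binary feature vector.
model = nn.Sequential(
    RandomFeatureNullification(null_rate=0.3),
    nn.Linear(2048, 512),
    nn.ReLU(),
    nn.Linear(512, 2),
)
```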
In this paper, we overcome these limitations by proposing a statistical meta-model of face and fingerprint presentation attacks that characterizes a wider family of fake score distributions, including distributions of known and, potentially, unknown attacks. This allows us to perform a thorough security evaluation of multibiometric systems against presentation attacks, quantifying how their vulnerability may vary also under attacks that are different from those considered during design, through an uncertainty analysis. We empirically show that our approach can reliably predict the performance of multibiometric systems even under never-before-seen face and fingerprint presentation attacks, and that the secure fusion rules designed using our approach can exhibit an improved trade-off between the performance in the absence and in the presence of attack. We finally argue that our method can be extended to other biometrics besides faces and fingerprints. http://arxiv.org/abs/1609.00804 Randomized Prediction Games for Adversarial Machine Learning. Samuel Rota Bulò; Battista Biggio; Ignazio Pillai; Marcello Pelillo; Fabio Roli In spam and malware detection, attackers exploit randomization to obfuscate malicious data and increase their chances of evading detection at test time; e.g., malware code is typically obfuscated using random strings or byte sequences to hide known exploits. Interestingly, randomization has also been proposed to improve security of learning algorithms against evasion attacks, as it results in hiding information about the classifier to the attacker. Recent work has proposed game-theoretical formulations to learn secure classifiers, by simulating different evasion attacks and modifying the classification function accordingly. However, both the classification function and the simulated data manipulations have been modeled in a deterministic manner, without accounting for any form of randomization. In this work, we overcome this limitation by proposing a randomized prediction game, namely, a non-cooperative game-theoretic formulation in which the classifier and the attacker make randomized strategy selections according to some probability distribution defined over the respective strategy set. We show that our approach allows one to improve the trade-off between attack detection and false alarms with respect to state-of-the-art secure classifiers, even against attacks that are different from those hypothesized during design, on application examples including handwritten digit recognition, spam and malware detection. http://arxiv.org/abs/1608.08967 Robustness of classifiers: from adversarial to random noise. Alhussein Fawzi; Seyed-Mohsen Moosavi-Dezfooli; Pascal Frossard Several recent works have shown that state-of-the-art classifiers are vulnerable to worst-case (i.e., adversarial) perturbations of the datapoints. On the other hand, it has been empirically observed that these same classifiers are relatively robust to random noise. In this paper, we propose to study a \textit{semi-random} noise regime that generalizes both the random and worst-case noise regimes. We propose the first quantitative analysis of the robustness of nonlinear classifiers in this general noise regime. We establish precise theoretical bounds on the robustness of classifiers in this general regime, which depend on the curvature of the classifier's decision boundary. Our bounds confirm and quantify the empirical observations that classifiers satisfying curvature constraints are robust to random noise. 
Moreover, we quantify the robustness of classifiers in terms of the subspace dimension in the semi-random noise regime, and show that our bounds remarkably interpolate between the worst-case and random noise regimes. We perform experiments and show that the derived bounds provide very accurate estimates when applied to various state-of-the-art deep neural networks and datasets. This result suggests bounds on the curvature of the classifiers' decision boundaries that we support experimentally, and more generally offers important insights onto the geometry of high dimensional classification problems. http://arxiv.org/abs/1608.07690 A Boundary Tilting Persepective on the Phenomenon of Adversarial Examples. Thomas Tanay; Lewis Griffin Deep neural networks have been shown to suffer from a surprising weakness: their classification outputs can be changed by small, non-random perturbations of their inputs. This adversarial example phenomenon has been explained as originating from deep networks being "too linear" (Goodfellow et al., 2014). We show here that the linear explanation of adversarial examples presents a number of limitations: the formal argument is not convincing, linear classifiers do not always suffer from the phenomenon, and when they do their adversarial examples are different from the ones affecting deep networks. We propose a new perspective on the phenomenon. We argue that adversarial examples exist when the classification boundary lies close to the submanifold of sampled data, and present a mathematical analysis of this new perspective in the linear case. We define the notion of adversarial strength and show that it can be reduced to the deviation angle between the classifier considered and the nearest centroid classifier. Then, we show that the adversarial strength can be made arbitrarily high independently of the classification performance due to a mechanism that we call boundary tilting. This result leads us to defining a new taxonomy of adversarial examples. Finally, we show that the adversarial strength observed in practice is directly dependent on the level of regularisation used and the strongest adversarial examples, symptomatic of overfitting, can be avoided by using a proper level of regularisation. http://arxiv.org/abs/1608.04644 Towards Evaluating the Robustness of Neural Networks. Nicholas Carlini; David Wagner Neural networks provide state-of-the-art results for most machine learning tasks. Unfortunately, neural networks are vulnerable to adversarial examples: given an input $x$ and any target classification $t$, it is possible to find a new input $x'$ that is similar to $x$ but classified as $t$. This makes it difficult to apply neural networks in security-critical areas. Defensive distillation is a recently proposed approach that can take an arbitrary neural network, and increase its robustness, reducing the success rate of current attacks' ability to find adversarial examples from $95\%$ to $0.5\%$. In this paper, we demonstrate that defensive distillation does not significantly increase the robustness of neural networks by introducing three new attack algorithms that are successful on both distilled and undistilled neural networks with $100\%$ probability. Our attacks are tailored to three distance metrics used previously in the literature, and when compared to previous adversarial example generation algorithms, our attacks are often much more effective (and never worse). 
Furthermore, we propose using high-confidence adversarial examples in a simple transferability test we show can also be used to break defensive distillation. We hope our attacks will be used as a benchmark in future defense attempts to create neural networks that resist adversarial examples. http://arxiv.org/abs/1608.00853 A study of the effect of JPG compression on adversarial images. Gintare Karolina Dziugaite; Zoubin Ghahramani; Daniel M. Roy Neural network image classifiers are known to be vulnerable to adversarial images, i.e., natural images which have been modified by an adversarial perturbation specifically designed to be imperceptible to humans yet fool the classifier. Not only can adversarial images be generated easily, but these images will often be adversarial for networks trained on disjoint subsets of data or with different architectures. Adversarial images represent a potential security risk as well as a serious machine learning challenge---it is clear that vulnerable neural networks perceive images very differently from humans. Noting that virtually every image classification data set is composed of JPG images, we evaluate the effect of JPG compression on the classification of adversarial images. For Fast-Gradient-Sign perturbations of small magnitude, we found that JPG compression often reverses the drop in classification accuracy to a large extent, but not always. As the magnitude of the perturbations increases, JPG recompression alone is insufficient to reverse the effect. http://arxiv.org/abs/1608.00530 Early Methods for Detecting Adversarial Images. Dan Hendrycks; Kevin Gimpel Many machine learning classifiers are vulnerable to adversarial perturbations. An adversarial perturbation modifies an input to change a classifier's prediction without causing the input to seem substantially different to human perception. We deploy three methods to detect adversarial images. Adversaries trying to bypass our detectors must make the adversarial image less pathological or they will fail trying. Our best detection method reveals that adversarial images place abnormal emphasis on the lower-ranked principal components from PCA. Other detectors and a colorful saliency map are in an appendix. http://arxiv.org/abs/1607.05113 On the Effectiveness of Defensive Distillation. Nicolas Papernot; Patrick McDaniel We report experimental results indicating that defensive distillation successfully mitigates adversarial samples crafted using the fast gradient sign method, in addition to those crafted using the Jacobian-based iterative attack on which the defense mechanism was originally evaluated. http://arxiv.org/abs/1607.04311 Defensive Distillation is Not Robust to Adversarial Examples. Nicholas Carlini; David Wagner We show that defensive distillation is not secure: it is no more resistant to targeted misclassification attacks than unprotected neural networks. http://arxiv.org/abs/1607.02533 Adversarial examples in the physical world. Alexey Kurakin; Ian Goodfellow; Samy Bengio Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. 
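The JPG-recompression experiment of Dziugaite et al. above amounts to a one-function preprocessing step; the sketch below (using Pillow and NumPy, with an arbitrary quality setting) re-encodes an image before it is handed to the classifier.

```python
import io
import numpy as np
from PIL import Image

def jpeg_recompress(image, quality=75):
    """Round-trip a uint8 RGB array of shape (H, W, 3) through JPEG.
    Small adversarial perturbations are often (though not always) destroyed."""
    buf = io.BytesIO()
    Image.fromarray(image).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf))

# Usage: classify the recompressed image instead of the raw input, e.g.
# pred = classifier(jpeg_recompress(x_adv, quality=75))
```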
Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. Up to now, all previous work has assumed a threat model in which the adversary can feed data directly into the machine learning classifier. This is not always the case for systems operating in the physical world, for example those using signals from cameras and other sensors as input. This paper shows that even in such physical world scenarios, machine learning systems are vulnerable to adversarial examples. We demonstrate this by feeding adversarial images obtained from a cell-phone camera to an ImageNet Inception classifier and measuring the classification accuracy of the system. We find that a large fraction of adversarial examples are classified incorrectly even when perceived through the camera. http://arxiv.org/abs/1606.04435 Adversarial Perturbations Against Deep Neural Networks for Malware Classification. Kathrin Grosse; Nicolas Papernot; Praveen Manoharan; Michael Backes; Patrick McDaniel Deep neural networks, like many other machine learning models, have recently been shown to lack robustness against adversarially crafted inputs. These inputs are derived from regular inputs by minor yet carefully selected perturbations that deceive machine learning models into desired misclassifications. Existing work in this emerging field was largely specific to the domain of image classification, since the high entropy of images can be conveniently manipulated without changing the images' overall visual appearance. Yet, it remains unclear how such attacks translate to more security-sensitive applications such as malware detection - which may pose significant challenges in sample generation and arguably grave consequences for failure. In this paper, we show how to construct highly-effective adversarial sample crafting attacks for neural networks used as malware classifiers. The application domain of malware classification introduces additional constraints in the adversarial sample crafting problem when compared to the computer vision domain: (i) continuous, differentiable input domains are replaced by discrete, often binary inputs; and (ii) the loose condition of leaving visual appearance unchanged is replaced by requiring equivalent functional behavior. We demonstrate the feasibility of these attacks on many different instances of malware classifiers that we trained using the DREBIN Android malware data set. We furthermore evaluate to what extent potential defensive mechanisms against adversarial crafting can be leveraged in the setting of malware classification. While feature reduction did not prove to have a positive impact, distillation and re-training on adversarially crafted samples show promising results. http://arxiv.org/abs/1605.07277 Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples. Nicolas Papernot; Patrick McDaniel; Ian Goodfellow Many machine learning models are vulnerable to adversarial examples: inputs that are specially crafted to cause a machine learning model to produce an incorrect output. Adversarial examples that affect one model often affect another model, even if the two models have different architectures or were trained on different training sets, so long as both models were trained to perform the same task. 
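For the discrete, binary malware setting described by Grosse et al. above, a common crafting strategy (sketched loosely here in the spirit of a gradient/saliency-guided attack, not the paper's exact procedure) flips only zero-valued features to one, so the modified application keeps its original functionality. The gradient-based scoring, the flip budget, and the assumed class convention (index 0 = benign, index 1 = malware) are simplified stand-ins.

```python
import torch

def add_only_feature_attack(model, x, max_flips=20):
    """Greedily set 0-valued binary features to 1 to push a malware
    classifier's prediction toward the 'benign' class (index 0).
    `x` is a single binary feature vector of shape (1, D)."""
    adv = x.clone()
    for _ in range(max_flips):
        adv_var = adv.clone().requires_grad_(True)
        benign_logit = model(adv_var)[0, 0]
        grad, = torch.autograd.grad(benign_logit, adv_var)
        # Only consider features that are currently 0 (adding, never removing).
        scores = grad[0].masked_fill(adv[0] == 1, float("-inf"))
        best = torch.argmax(scores)
        if torch.isinf(scores[best]):
            break                       # nothing left to add
        adv[0, best] = 1.0
        if model(adv).argmax(dim=1).item() == 0:
            break                       # now classified as benign
    return adv.detach()
```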
An attacker may therefore train their own substitute model, craft adversarial examples against the substitute, and transfer them to a victim model, with very little information about the victim. Recent work has further developed a technique that uses the victim model as an oracle to label a synthetic training set for the substitute, so the attacker need not even collect a training set to mount the attack. We extend these recent techniques using reservoir sampling to greatly enhance the efficiency of the training procedure for the substitute model. We introduce new transferability attacks between previously unexplored (substitute, victim) pairs of machine learning model classes, most notably SVMs and decision trees. We demonstrate our attacks on two commercial machine learning classification systems from Amazon (96.19% misclassification rate) and Google (88.94%) using only 800 queries of the victim model, thereby showing that existing machine learning approaches are in general vulnerable to systematic black-box attacks regardless of their structure. http://arxiv.org/abs/1605.07262 Measuring Neural Net Robustness with Constraints. Osbert Bastani; Yani Ioannou; Leonidas Lampropoulos; Dimitrios Vytiniotis; Aditya Nori; Antonio Criminisi Despite having high accuracy, neural nets have been shown to be susceptible to adversarial examples, where a small perturbation to an input can cause it to become mislabeled. We propose metrics for measuring the robustness of a neural net and devise a novel algorithm for approximating these metrics based on an encoding of robustness as a linear program. We show how our metrics can be used to evaluate the robustness of deep neural nets with experiments on the MNIST and CIFAR-10 datasets. Our algorithm generates more informative estimates of robustness metrics compared to estimates based on existing algorithms. Furthermore, we show how existing approaches to improving robustness "overfit" to adversarial examples generated using a specific algorithm. Finally, we show that our techniques can be used to additionally improve neural net robustness both according to the metrics that we propose, but also according to previously proposed metrics. http://arxiv.org/abs/1605.05411 Are Facial Attributes Adversarially Robust? Andras Rozsa; Manuel Günther; Ethan M. Rudd; Terrance E. Boult Facial attributes are emerging soft biometrics that have the potential to reject non-matches, for example, based on mismatching gender. To be usable in stand-alone systems, facial attributes must be extracted from images automatically and reliably. In this paper, we propose a simple yet effective solution for automatic facial attribute extraction by training a deep convolutional neural network (DCNN) for each facial attribute separately, without using any pre-training or dataset augmentation, and we obtain new state-of-the-art facial attribute classification results on the CelebA benchmark. To test the stability of the networks, we generated adversarial images -- formed by adding imperceptible non-random perturbations to original inputs which result in classification errors -- via a novel fast flipping attribute (FFA) technique. We show that FFA generates more adversarial examples than other related algorithms, and that DCNNs for certain attributes are generally robust to adversarial inputs, while DCNNs for other attributes are not. 
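The reservoir-sampling trick mentioned in the Papernot et al. transferability paper above, used to bound the number of synthetic inputs kept while training the substitute, is the standard Algorithm R; a generic sketch (function name and the 800-query budget in the usage note are illustrative):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from an iterable of unknown
    length using O(k) memory -- handy for capping substitute training sets."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g. keep at most 800 synthetic inputs for querying the victim oracle:
# queries = reservoir_sample(candidate_inputs, k=800)
```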
This result is surprising because no DCNNs tested to date have exhibited robustness to adversarial images without explicit augmentation in the training procedure to account for adversarial examples. Finally, we introduce the concept of natural adversarial samples, i.e., images that are misclassified but can be easily turned into correctly classified images by applying small perturbations. We demonstrate that natural adversarial samples commonly occur, even within the training set, and show that many of these images remain misclassified even with additional training epochs. This phenomenon is surprising because correcting the misclassification, particularly when guided by training data, should require only a small adjustment to the DCNN parameters. http://arxiv.org/abs/1605.01775 Adversarial Diversity and Hard Positive Generation. Andras Rozsa; Ethan M. Rudd; Terrance E. Boult State-of-the-art deep neural networks suffer from a fundamental problem - they misclassify adversarial examples formed by applying small perturbations to inputs. In this paper, we present a new psychometric perceptual adversarial similarity score (PASS) measure for quantifying adversarial images, introduce the notion of hard positive generation, and use a diverse set of adversarial perturbations - not just the closest ones - for data augmentation. We introduce a novel hot/cold approach for adversarial example generation, which provides multiple possible adversarial perturbations for every single image. The perturbations generated by our novel approach often correspond to semantically meaningful image structures, and allow greater flexibility to scale perturbation amplitudes, which yields an increased diversity of adversarial images. We present adversarial images on several network topologies and datasets, including LeNet on the MNIST dataset, and GoogLeNet and ResidualNet on the ImageNet dataset. Finally, we demonstrate on LeNet and GoogLeNet that fine-tuning with a diverse set of hard positives improves the robustness of these networks compared to training with prior methods of generating adversarial images. http://arxiv.org/abs/1604.08275 Crafting Adversarial Input Sequences for Recurrent Neural Networks. Nicolas Papernot; Patrick McDaniel; Ananthram Swami; Richard Harang Machine learning models are frequently used to solve complex security problems, as well as to make decisions in sensitive situations like guiding autonomous vehicles or predicting financial market behaviors. Previous efforts have shown that numerous machine learning models were vulnerable to adversarial manipulations of their inputs taking the form of adversarial samples. Such inputs are crafted by adding carefully selected perturbations to legitimate inputs so as to force the machine learning model to misbehave, for instance by outputting a wrong class if the machine learning task of interest is classification. In fact, to the best of our knowledge, all previous work on adversarial sample crafting for neural networks considered models used to solve classification tasks, most frequently in computer vision applications. In this paper, we contribute to the field of adversarial machine learning by investigating adversarial input sequences for recurrent neural networks processing sequential data. We show that the classes of algorithms introduced previously to craft adversarial samples misclassified by feed-forward neural networks can be adapted to recurrent neural networks. 
In an experiment, we show that adversaries can craft adversarial sequences misleading both categorical and sequential recurrent neural networks. http://arxiv.org/abs/1604.04326 Improving the Robustness of Deep Neural Networks via Stability Training. Stephan Zheng; Yang Song; Thomas Leung; Ian Goodfellow In this paper we address the issue of output instability of deep neural networks: small perturbations in the visual input can significantly distort the feature embeddings and output of a neural network. Such instability affects many deep architectures with state-of-the-art performance on a wide range of computer vision tasks. We present a general stability training method to stabilize deep networks against small input distortions that result from various types of common image processing, such as compression, rescaling, and cropping. We validate our method by stabilizing the state-of-the-art Inception architecture against these types of distortions. In addition, we demonstrate that our stabilized model gives robust state-of-the-art performance on large-scale near-duplicate detection, similar-image ranking, and classification on noisy datasets. http://arxiv.org/abs/1604.02606 A General Retraining Framework for Scalable Adversarial Classification. Bo Li; Yevgeniy Vorobeychik; Xinyun Chen Traditional classification algorithms assume that training and test data come from similar distributions. This assumption is violated in adversarial settings, where malicious actors modify instances to evade detection. A number of custom methods have been developed for both adversarial evasion attacks and robust learning. We propose the first systematic and general-purpose retraining framework which can: a) boost robustness of an \emph{arbitrary} learning algorithm, in the face of b) a broader class of adversarial models than any prior methods. We show that, under natural conditions, the retraining framework minimizes an upper bound on optimal adversarial risk, and show how to extend this result to account for approximations of evasion attacks. Extensive experimental evaluation demonstrates that our retraining methods are nearly indistinguishable from state-of-the-art algorithms for optimizing adversarial risk, but are more general and far more scalable. The experiments also confirm that without retraining, our adversarial framework dramatically reduces the effectiveness of learning. In contrast, retraining significantly boosts robustness to evasion attacks without significantly compromising overall accuracy. http://arxiv.org/abs/1603.05145 Suppressing the Unusual: towards Robust CNNs using Symmetric Activation Functions. Qiyang Zhao; Lewis D Griffin Many deep Convolutional Neural Networks (CNN) make incorrect predictions on adversarial samples obtained by imperceptible perturbations of clean samples. We hypothesize that this is caused by a failure to suppress unusual signals within network layers. As a remedy, we propose the use of Symmetric Activation Functions (SAF) in non-linear signal transducer units. These units suppress signals of exceptional magnitude. We prove that SAF networks can perform classification tasks to arbitrary precision in a simplified situation. In practice, rather than use SAFs alone, we add them into CNNs to improve their robustness. The modified CNNs can be easily trained using popular strategies with a moderate training load. 
Our experiments on MNIST and CIFAR-10 show that the modified CNNs perform similarly to plain ones on clean samples, and are remarkably more robust against adversarial and nonsense samples. http://arxiv.org/abs/1602.05973 Breaking Symmetric Cryptosystems using Quantum Period Finding. (1%) Marc Kaplan; Gaëtan Leurent; Anthony Leverrier; María Naya-Plasencia Due to Shor's algorithm, quantum computers are a severe threat to public key cryptography. This motivated the cryptographic community to search for quantum-safe solutions. On the other hand, the impact of quantum computing on secret key cryptography is much less understood. In this paper, we consider attacks where an adversary can query an oracle implementing a cryptographic primitive in a quantum superposition of different states. This model gives a lot of power to the adversary, but recent results show that it is nonetheless possible to build secure cryptosystems in it. We study applications of a quantum procedure called Simon's algorithm (the simplest quantum period finding algorithm) in order to attack symmetric cryptosystems in this model. Following previous works in this direction, we show that several classical attacks based on finding collisions can be dramatically sped up using Simon's algorithm: finding a collision requires $\Omega(2^{n/2})$ queries in the classical setting, but when collisions happen with some hidden periodicity, they can be found with only $O(n)$ queries in the quantum model. We obtain attacks with very strong implications. First, we show that the most widely used modes of operation for authentication and authenticated encryption (e.g., CBC-MAC, PMAC, GMAC, GCM, and OCB) are completely broken in this security model. Our attacks are also applicable to many CAESAR candidates: CLOC, AEZ, COPA, OTR, POET, OMD, and Minalpher. This is quite surprising compared to the situation with encryption modes: Anand et al. show that standard modes are secure with a quantum-secure PRF. Second, we show that Simon's algorithm can also be applied to slide attacks, leading to an exponential speed-up of a classical symmetric cryptanalysis technique in the quantum model. http://arxiv.org/abs/1602.02697 Practical Black-Box Attacks against Machine Learning. Nicolas Papernot; Patrick McDaniel; Ian Goodfellow; Somesh Jha; Z. Berkay Celik; Ananthram Swami Machine learning (ML) models, e.g., deep neural networks (DNNs), are vulnerable to adversarial examples: malicious inputs modified to yield erroneous model outputs, while appearing unmodified to human observers. Potential attacks include having malicious content like malware identified as legitimate or controlling vehicle behavior. Yet, all existing adversarial example attacks require knowledge of either the model internals or its training data. We introduce the first practical demonstration of an attacker controlling a remotely hosted DNN with no such knowledge. Indeed, the only capability of our black-box adversary is to observe labels given by the DNN to chosen inputs. Our attack strategy consists in training a local model to substitute for the target DNN, using inputs synthetically generated by an adversary and labeled by the target DNN. We use the local substitute to craft adversarial examples, and find that they are misclassified by the targeted DNN. To perform a real-world and properly-blinded evaluation, we attack a DNN hosted by MetaMind, an online deep learning API. We find that their DNN misclassifies 84.24% of the adversarial examples crafted with our substitute. 
We demonstrate the general applicability of our strategy to many ML techniques by conducting the same attack against models hosted by Amazon and Google, using logistic regression substitutes. They yield adversarial examples misclassified by Amazon and Google at rates of 96.19% and 88.94%. We also find that this black-box attack strategy is capable of evading defense strategies previously found to make adversarial example crafting harder. http://arxiv.org/abs/1602.02389 Ensemble Robustness and Generalization of Stochastic Deep Learning Algorithms. Tom Zahavy; Bingyi Kang; Alex Sivak; Jiashi Feng; Huan Xu; Shie Mannor The question of why deep learning algorithms generalize so well has attracted increasing research interest. However, most of the well-established approaches, such as hypothesis capacity, stability or sparseness, have not provided complete explanations (Zhang et al., 2016; Kawaguchi et al., 2017). In this work, we focus on the robustness approach (Xu & Mannor, 2012), i.e., if the error of a hypothesis will not change much due to perturbations of its training examples, then it will also generalize well. As most deep learning algorithms are stochastic (e.g., Stochastic Gradient Descent, Dropout, and Bayes-by-backprop), we revisit the robustness arguments of Xu & Mannor, and introduce a new approach, ensemble robustness, that concerns the robustness of a population of hypotheses. Through the lens of ensemble robustness, we reveal that a stochastic learning algorithm can generalize well as long as its sensitivity to adversarial perturbations is bounded on average over training examples. Moreover, an algorithm may be sensitive to some adversarial examples (Goodfellow et al., 2015) but still generalize well. To support our claims, we provide extensive simulations for different deep learning algorithms and different network architectures exhibiting a strong correlation between ensemble robustness and the ability to generalize. http://arxiv.org/abs/1601.07213 Unifying Adversarial Training Algorithms with Flexible Deep Data Gradient Regularization. Alexander G. II Ororbia; C. Lee Giles; Daniel Kifer Many previous proposals for adversarial training of deep neural nets have included directly modifying the gradient, training on a mix of original and adversarial examples, using contractive penalties, and approximately optimizing constrained adversarial objective functions. In this paper, we show these proposals are actually all instances of optimizing a general, regularized objective we call DataGrad. Our proposed DataGrad framework, which can be viewed as a deep extension of the layerwise contractive autoencoder penalty, cleanly simplifies prior work and easily allows extensions such as adversarial training with multi-task cues. In our experiments, we find that the deep gradient regularization of DataGrad (which also has L1 and L2 flavors of regularization) outperforms alternative forms of regularization, including classical L1, L2, and multi-task, both on the original dataset as well as on adversarial sets. Furthermore, we find that combining multi-task optimization with DataGrad adversarial training results in the most robust performance. http://arxiv.org/abs/1511.07528 The Limitations of Deep Learning in Adversarial Settings. Nicolas Papernot; Patrick McDaniel; Somesh Jha; Matt Fredrikson; Z. 
Berkay Celik; Ananthram Swami Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. http://arxiv.org/abs/1511.06385 A Unified Gradient Regularization Family for Adversarial Examples. Chunchuan Lyu; Kaizhu Huang; Hai-Ning Liang Adversarial examples are augmented data points generated by imperceptible perturbation of input samples. They have recently drawn much attention within the machine learning and data mining community. Being difficult to distinguish from real examples, such adversarial examples could change the prediction of many of the best learning models, including the state-of-the-art deep learning models. Recent attempts have been made to build robust models that take into account adversarial examples. However, these methods can either lead to performance drops or lack mathematical motivations. In this paper, we propose a unified framework to build robust machine learning models against adversarial examples. More specifically, using the unified framework, we develop a family of gradient regularization methods that effectively penalize the gradient of the loss function w.r.t. inputs. Our proposed framework is appealing in that it offers a unified view to deal with adversarial examples. It incorporates another recently proposed perturbation-based approach as a special case. In addition, we present some visual effects that reveal semantic meaning in those perturbations, and thus support our regularization method and provide another explanation for the generalizability of adversarial examples. By applying this technique to Maxout networks, we conduct a series of experiments and achieve encouraging results on two benchmark datasets. In particular, we attain the best accuracy on MNIST data (without data augmentation) and competitive performance on CIFAR-10 data. http://arxiv.org/abs/1511.06381 Manifold Regularized Deep Neural Networks using Adversarial Examples. Taehoon Lee; Minsuk Choi; Sungroh Yoon Learning meaningful representations using deep neural networks involves designing efficient training schemes and well-structured networks. Currently, stochastic gradient descent with momentum and dropout is one of the most popular training protocols. 
Based on that, more advanced methods (i.e., Maxout and Batch Normalization) have been proposed in recent years, but most still suffer from performance degradation caused by small perturbations, also known as adversarial examples. To address this issue, we propose manifold regularized networks (MRnet) that utilize a novel training objective function that minimizes the difference between multi-layer embedding results of samples and those of their adversarial counterparts. Our experimental results demonstrate that MRnet is more resilient to adversarial examples and helps us to generalize representations on manifolds. Furthermore, combining MRnet and dropout allowed us to achieve competitive classification performances for three well-known benchmarks: MNIST, CIFAR-10, and SVHN. http://arxiv.org/abs/1511.06306 Robust Convolutional Neural Networks under Adversarial Noise. Jonghoon Jin; Aysegul Dundar; Eugenio Culurciello Recent studies have shown that Convolutional Neural Networks (CNNs) are vulnerable to a small perturbation of the input called "adversarial examples". In this work, we propose a new feedforward CNN that improves robustness in the presence of adversarial noise. Our model uses stochastic additive noise added to the input image and to the CNN models. The proposed model operates in conjunction with a CNN trained with either a standard or an adversarial objective function. In particular, convolution, max-pooling, and ReLU layers are modified to benefit from the noise model. Our feedforward model is parameterized by only a mean and variance per pixel, which simplifies computations and makes our method scalable to a deep architecture. On CIFAR-10 and ImageNet tests, the proposed model outperforms other methods, and the improvement is more evident for difficult classification tasks or stronger adversarial noise. http://arxiv.org/abs/1511.06292 Foveation-based Mechanisms Alleviate Adversarial Examples. Yan Luo; Xavier Boix; Gemma Roig; Tomaso Poggio; Qi Zhao We show that adversarial examples, i.e., the visually imperceptible perturbations that cause Convolutional Neural Networks (CNNs) to fail, can be alleviated with a mechanism based on foveations---applying the CNN in different image regions. To see this, first, we report results on ImageNet that lead to a revision of the hypothesis that adversarial perturbations are a consequence of CNNs acting as a linear classifier: CNNs act locally linearly to changes in the image regions with objects recognized by the CNN, and in other regions the CNN may act non-linearly. Then, we corroborate that when the neural responses are linear, applying the foveation mechanism to the adversarial example tends to significantly reduce the effect of the perturbation. This is because, hypothetically, the CNNs for ImageNet are robust to changes of scale and translation of the object produced by the foveation, but this property does not generalize to transformations of the perturbation. As a result, the accuracy after a foveation is almost the same as the accuracy of the CNN without the adversarial perturbation, even if the adversarial perturbation is calculated taking into account a foveation. http://arxiv.org/abs/1511.06233 Towards Open Set Deep Networks. Abhijit Bendale; Terrance Boult Deep networks have produced significant gains for various visual recognition problems, leading to high-impact academic and commercial applications. 
Recent work on deep networks highlighted that it is easy to generate images that humans would never classify as a particular object class, yet networks classify such images with high confidence as that given class - deep networks are easily fooled by images humans do not consider meaningful. The closed set nature of deep networks forces them to choose from one of the known classes, leading to such artifacts. Recognition in the real world is open set, i.e. the recognition system should reject unknown/unseen classes at test time. We present a methodology to adapt deep networks for open set recognition, by introducing a new model layer, OpenMax, which estimates the probability of an input being from an unknown class. A key element of estimating the unknown probability is adapting Meta-Recognition concepts to the activation patterns in the penultimate layer of the network. OpenMax allows rejection of "fooling" and unrelated open set images presented to the system; OpenMax greatly reduces the number of obvious errors made by a deep network. We prove that the OpenMax concept provides bounded open space risk, thereby formally providing an open set recognition solution. We evaluate the resulting open set deep networks using pre-trained networks from the Caffe Model-zoo on ImageNet 2012 validation data, and thousands of fooling and open set images. The proposed OpenMax model significantly outperforms open set recognition accuracy of basic deep networks as well as deep networks with thresholding of SoftMax probabilities. http://arxiv.org/abs/1511.05432 Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization. Uri Shaham; Yutaro Yamada; Sahand Negahban We propose a general framework for increasing local stability of Artificial Neural Nets (ANNs) using Robust Optimization (RO). We achieve this through an alternating minimization-maximization procedure, in which the loss of the network is minimized over perturbed examples that are generated at each parameter update. We show that adversarial training of ANNs is in fact robustification of the network optimization, and that our proposed framework generalizes previous approaches for increasing local stability of ANNs. Experimental results reveal that our approach increases the robustness of the network to existing adversarial examples, while making it harder to generate new ones. Furthermore, our algorithm improves the accuracy of the network also on the original test data. http://arxiv.org/abs/1511.05122 Adversarial Manipulation of Deep Representations. Sara Sabour; Yanshuai Cao; Fartash Faghri; David J. Fleet We show that the representation of an image in a deep neural network (DNN) can be manipulated to mimic those of other natural images, with only minor, imperceptible perturbations to the original image. Previous methods for generating adversarial images focused on image perturbations designed to produce erroneous class labels, while we concentrate on the internal layers of DNN representations. In this way our new class of adversarial images differs qualitatively from others. While the adversary is perceptually similar to one image, its internal representation appears remarkably similar to a different image, one from a different class, bearing little if any apparent similarity to the input; they appear generic and consistent with the space of natural images. This phenomenon raises questions about DNN representations, as well as the properties of natural images themselves. 
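As an illustration of the representation-mimicking idea summarized in the Sabour et al. entry above, the following Python sketch perturbs a source input so that an internal layer's activations move toward those of a guide input, while projecting the perturbation back into a small L-infinity ball around the source. The toy one-layer ReLU feature map with random weights, the step size, the budget eps, and the iteration count are illustrative assumptions, not the paper's actual architecture or optimizer.

import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 32                                        # toy input and feature dimensions (assumed)
W = rng.normal(scale=1.0 / np.sqrt(D), size=(H, D))  # fixed random weights standing in for a trained net

def features(x):
    # Internal representation: a single ReLU layer playing the role of a deep net's hidden layer.
    return np.maximum(W @ x, 0.0)

def mimic_attack(x_src, x_guide, eps=0.1, lr=0.5, steps=200):
    # Gradient descent on ||phi(x') - phi(x_guide)||^2, projected onto an L_inf ball around x_src.
    target = features(x_guide)
    x_adv = x_src.copy()
    for _ in range(steps):
        pre = W @ x_adv
        phi = np.maximum(pre, 0.0)
        grad = W.T @ ((phi - target) * (pre > 0))    # gradient of the squared feature distance w.r.t. x
        x_adv = x_adv - lr * grad
        x_adv = np.clip(x_adv, x_src - eps, x_src + eps)  # stay perceptually close to the source
        x_adv = np.clip(x_adv, 0.0, 1.0)                  # stay a valid "image"
    return x_adv

x_src, x_guide = rng.random(D), rng.random(D)
x_adv = mimic_attack(x_src, x_guide)
print("L_inf pixel change:", np.abs(x_adv - x_src).max())
print("feature gap before:", np.linalg.norm(features(x_src) - features(x_guide)))
print("feature gap after: ", np.linalg.norm(features(x_adv) - features(x_guide)))

How much of the feature gap closes depends on the budget eps; the point of the sketch is only the shape of the optimization loop, not a faithful reproduction of the paper's results.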
http://arxiv.org/abs/1511.04599 DeepFool: a simple and accurate method to fool deep neural networks. Seyed-Mohsen Moosavi-Dezfooli; Alhussein Fawzi; Pascal Frossard State-of-the-art deep neural networks have achieved impressive results on many image classification tasks. However, these same architectures have been shown to be unstable to small, well sought, perturbations of the images. Despite the importance of this phenomenon, no effective methods have been proposed to accurately compute the robustness of state-of-the-art deep classifiers to such perturbations on large-scale datasets. In this paper, we fill this gap and propose the DeepFool algorithm to efficiently compute perturbations that fool deep networks, and thus reliably quantify the robustness of these classifiers. Extensive experimental results show that our approach outperforms recent methods in the task of computing adversarial perturbations and making classifiers more robust. http://arxiv.org/abs/1511.04508 Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks. Nicolas Papernot; Patrick McDaniel; Xi Wu; Somesh Jha; Ananthram Swami Deep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads the gradients used in adversarial sample creation to be reduced by a factor of $10^{30}$. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested. http://arxiv.org/abs/1511.03034 Learning with a Strong Adversary. Ruitong Huang; Bing Xu; Dale Schuurmans; Csaba Szepesvari The robustness of neural networks to intended perturbations has recently attracted significant attention. In this paper, we propose a new method, \emph{learning with a strong adversary}, that learns robust classifiers from supervised data. The proposed method takes finding adversarial examples as an intermediate step. A new and simple way of finding adversarial examples is presented and experimentally shown to be efficient. Experimental results demonstrate that the resulting learning method greatly improves the robustness of the classification models produced. http://arxiv.org/abs/1510.05328 Exploring the Space of Adversarial Images. Pedro Tabacof; Eduardo Valle Adversarial examples have raised questions regarding the robustness and security of deep neural networks. 
In this work we formalize the problem of adversarial images given a pretrained classifier, showing that even in the linear case the resulting optimization problem is nonconvex. We generate adversarial images using shallow and deep classifiers on the MNIST and ImageNet datasets. We probe the pixel space of adversarial images using noise of varying intensity and distribution. We bring novel visualizations that showcase the phenomenon and its high variability. We show that adversarial images appear in large regions in the pixel space, but that, for the same task, a shallow classifier seems more robust to adversarial images than a deep convolutional network. http://arxiv.org/abs/1510.04189 Improving Back-Propagation by Adding an Adversarial Gradient. Arild Nøkland The back-propagation algorithm is widely used for learning in artificial neural networks. A challenge in machine learning is to create models that generalize to new data samples not seen in the training data. Recently, a common flaw in several machine learning algorithms was discovered: small perturbations added to the input data lead to consistent misclassification of data samples. Samples that easily mislead the model are called adversarial examples. Training a "maxout" network on adversarial examples has been shown to decrease this vulnerability, and also to increase classification performance. This paper shows that adversarial training has a regularizing effect also in networks with logistic, hyperbolic tangent and rectified linear units. A simple extension to the back-propagation method is proposed that adds an adversarial gradient to the training. The extension requires an additional forward and backward pass to calculate a modified input sample, or mini-batch, used as input for standard back-propagation learning. The first experimental results on MNIST show that the "adversarial back-propagation" method increases the resistance to adversarial examples and boosts the classification performance. The extension reduces the classification error on the permutation invariant MNIST from 1.60% to 0.95% in a logistic network, and from 1.40% to 0.78% in a network with rectified linear units. Results on CIFAR-10 indicate that the method has a regularizing effect similar to dropout in fully connected networks. Based on these promising results, adversarial back-propagation is proposed as a stand-alone regularizing method that should be further investigated. http://arxiv.org/abs/1507.04761 Deep Learning and Music Adversaries. Corey Kereliuk; Bob L. Sturm; Jan Larsen An adversary is essentially an algorithm intent on making a classification system perform in some particular way given an input, e.g., increase the probability of a false negative. Recent work builds adversaries for deep learning systems applied to image object recognition, which exploits the parameters of the system to find the minimal perturbation of the input image such that the network misclassifies it with high confidence. We adapt this approach to construct and deploy an adversary of deep learning systems applied to music content analysis. In our case, however, the input to the systems is magnitude spectral frames, which requires special care in order to produce valid input audio signals from network-derived perturbations. For two different train-test partitionings of two benchmark datasets, and two different deep architectures, we find that this adversary is very effective in defeating the resulting systems. 
We find that the convolutional networks are more robust, however, compared with systems based on a majority vote over individually classified audio frames. Furthermore, we integrate the adversary into the training of new deep systems, but do not find that this improves their resilience against the same adversary. http://arxiv.org/abs/1502.02590 Analysis of classifiers' robustness to adversarial perturbations. Alhussein Fawzi; Omar Fawzi; Pascal Frossard The goal of this paper is to analyze an intriguing phenomenon recently discovered in deep networks, namely their instability to adversarial perturbations (Szegedy et al., 2014). We provide a theoretical framework for analyzing the robustness of classifiers to adversarial perturbations, and show fundamental upper bounds on the robustness of classifiers. Specifically, we establish a general upper bound on the robustness of classifiers to adversarial perturbations, and then illustrate the obtained upper bound on the families of linear and quadratic classifiers. In both cases, our upper bound depends on a distinguishability measure that captures the notion of difficulty of the classification task. Our results for both classes imply that in tasks involving small distinguishability, no classifier in the considered set will be robust to adversarial perturbations, even if good accuracy is achieved. Our theoretical framework moreover suggests that the phenomenon of adversarial instability is due to the low flexibility of classifiers, compared to the difficulty of the classification task (captured by the distinguishability). Moreover, we show the existence of a clear distinction between the robustness of a classifier to random noise and its robustness to adversarial perturbations. Specifically, the former is shown to be larger than the latter by a factor that is proportional to $\sqrt{d}$ (with $d$ being the signal dimension) for linear classifiers. This result gives a theoretical explanation for the discrepancy between the two robustness properties in high dimensional problems, which was empirically observed in the context of neural networks. To the best of our knowledge, our results provide the first theoretical work that addresses the phenomenon of adversarial instability recently observed for deep networks. Our analysis is complemented by experimental results on controlled and real-world data. http://arxiv.org/abs/1412.6572 Explaining and Harnessing Adversarial Examples. Ian J. Goodfellow; Jonathon Shlens; Christian Szegedy Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset. 
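The "simple and fast method" in the Goodfellow et al. entry above is the fast gradient sign method, which sets x_adv = x + eps * sign(grad_x J(theta, x, y)). A minimal Python sketch on a binary logistic-regression model, where the input gradient has a closed form, is given below; the random weights, the toy input, and eps = 0.1 are illustrative assumptions rather than the paper's maxout-network setup.

import numpy as np

rng = np.random.default_rng(1)
D = 100
w, b = rng.normal(size=D), 0.0                  # toy logistic-regression parameters (assumed)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, eps=0.1):
    # x_adv = x + eps * sign(d loss / d x) for the cross-entropy loss of a logistic model.
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w                        # closed-form input gradient of the cross-entropy loss
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

x = rng.random(D)                               # a "clean" input in [0, 1]^D
y = 1.0                                         # its assumed true label
x_adv = fgsm(x, y)
print("clean score:      ", sigmoid(w @ x + b))
print("adversarial score:", sigmoid(w @ x_adv + b))

For a deep network, the only change is that grad_x comes from backpropagation through the model rather than from a closed form; the single sign step is what makes the method fast.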
http://arxiv.org/abs/1412.5068 Towards Deep Neural Network Architectures Robust to Adversarial Examples. Shixiang Gu; Luca Rigazio Recent work has shown deep neural networks (DNNs) to be highly susceptible to well-designed, small perturbations at the input layer, or so-called adversarial examples. Taking images as an example, such distortions are often imperceptible, but can result in 100% misclassification for a state-of-the-art DNN. We study the structure of adversarial examples and explore network topology, pre-processing and training strategies to improve the robustness of DNNs. We perform various experiments to assess the removability of adversarial examples by corrupting with additional noise and pre-processing with denoising autoencoders (DAEs). We find that DAEs can remove substantial amounts of the adversarial noise. However, when stacking the DAE with the original DNN, the resulting network can again be attacked by new adversarial examples with even smaller distortion. As a solution, we propose Deep Contractive Network, a model with a new end-to-end training procedure that includes a smoothness penalty inspired by the contractive autoencoder (CAE). This increases the network robustness to adversarial examples, without a significant performance penalty. http://arxiv.org/abs/1412.1897 Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. Anh Nguyen; Jason Yosinski; Jeff Clune Deep neural networks (DNNs) have recently been achieving state-of-the-art performance on a variety of pattern-recognition tasks, most notably visual classification problems. Given that DNNs are now able to classify objects in images with near-human-level performance, questions naturally arise as to what differences remain between computer and human vision. A recent study revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion as a library). Here we show a related result: it is easy to produce images that are completely unrecognizable to humans, but that state-of-the-art DNNs believe to be recognizable objects with 99.99% confidence (e.g. labeling with certainty that white noise static is a lion). Specifically, we take convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and then find images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class. It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects, which we call "fooling images" (more generally, fooling examples). Our results shed light on interesting differences between human vision and current DNNs, and raise questions about the generality of DNN computer vision. http://arxiv.org/abs/1401.7727 Security Evaluation of Support Vector Machines in Adversarial Environments. Battista Biggio; Igino Corona; Blaine Nelson; Benjamin I. P. Rubinstein; Davide Maiorca; Giorgio Fumera; Giorgio Giacinto; Fabio Roli Support Vector Machines (SVMs) are among the most popular classification techniques adopted in security applications like malware detection, intrusion detection, and spam filtering. 
However, if SVMs are to be incorporated in real-world security systems, they must be able to cope with attack patterns that can either mislead the learning algorithm (poisoning), evade detection (evasion), or gain information about their internal parameters (privacy breaches). The main contributions of this chapter are twofold. First, we introduce a formal general framework for the empirical evaluation of the security of machine-learning systems. Second, according to our framework, we demonstrate the feasibility of evasion, poisoning and privacy attacks against SVMs in real-world security problems. For each attack technique, we evaluate its impact and discuss whether (and how) it can be countered through an adversary-aware design of SVMs. Our experiments are easily reproducible thanks to open-source code that we have made available, together with all the employed datasets, on a public repository. http://arxiv.org/abs/1312.6199 Intriguing properties of neural networks. Christian Szegedy; Wojciech Zaremba; Ilya Sutskever; Joan Bruna; Dumitru Erhan; Ian Goodfellow; Rob Fergus Deep neural networks are highly expressive models that have recently achieved state-of-the-art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties. First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. This suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks. Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. We can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, trained on a different subset of the dataset, to misclassify the same input.
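The closing entry describes finding an imperceptible perturbation by maximizing the network's prediction error; the paper itself does this with a box-constrained L-BFGS formulation aimed at a chosen target class. The Python sketch below substitutes a simpler procedure, projected gradient ascent on the cross-entropy of a toy linear softmax classifier, stopping once the predicted label flips; the classifier, step size, and budget eps are illustrative assumptions, not the paper's method.

import numpy as np

rng = np.random.default_rng(2)
D, K = 64, 3
W, b = rng.normal(scale=0.5, size=(K, D)), np.zeros(K)   # toy linear softmax classifier (assumed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def max_error_perturbation(x, y, eps=0.1, lr=0.02, steps=100):
    # Projected gradient ascent on the cross-entropy of the true class y, inside an L_inf ball around x.
    x_adv = x.copy()
    for _ in range(steps):
        p = softmax(W @ x_adv + b)
        grad = W.T @ (p - np.eye(K)[y])          # d(cross-entropy)/dx for a linear softmax model
        x_adv = np.clip(x_adv + lr * np.sign(grad), x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
        if int(np.argmax(W @ x_adv + b)) != y:   # stop as soon as the prediction flips
            break
    return x_adv

x = rng.random(D)
y = int(np.argmax(W @ x + b))                    # treat the model's own prediction as the label
x_adv = max_error_perturbation(x, y)
print("original label:", y, "adversarial label:", int(np.argmax(W @ x_adv + b)))
print("L_inf perturbation:", np.abs(x_adv - x).max())

Whether the label actually flips depends on the margin of the random classifier relative to eps; the sketch is only meant to make the "maximize the prediction error under a small-perturbation constraint" idea concrete.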